mirror of
https://github.com/k2-fsa/icefall.git
synced 2025-08-09 01:52:41 +00:00
deploy: e2fcb42f5f176d9e39eb38506ab99d0a3adaf202
parent
6dfba8b4a3
commit
a1d1f2e434
@ -4,7 +4,7 @@ Train an RNN language model
|
|||||||
======================================
|
======================================
|
||||||
|
|
||||||
If you have enough text data, you can train a neural network language model (NNLM) to improve
|
If you have enough text data, you can train a neural network language model (NNLM) to improve
|
||||||
the WER of your E2E ASR system. This tutorial shows you how to train an RNNLM from
|
the WER of your E2E ASR system. This tutorial shows you how to train an RNNLM from
|
||||||
scratch.
|
scratch.
|
||||||
|
|
||||||
.. HINT::
|
.. HINT::
|
||||||
@ -15,23 +15,23 @@ scratch.
|
|||||||
.. note::
|
.. note::
|
||||||
|
|
||||||
This tutorial is based on the LibriSpeech recipe. Please check it out for the necessary
|
This tutorial is based on the LibriSpeech recipe. Please check it out for the necessary
|
||||||
python scripts for this tutorial. We use the LibriSpeech LM-corpus as the LM training set
|
python scripts for this tutorial. We use the LibriSpeech LM-corpus as the LM training set
|
||||||
for illustration purposes. You can also collect your own data. The data format is quite simple:
|
for illustration purposes. You can also collect your own data. The data format is quite simple:
|
||||||
each line should contain a complete sentence, and words should be separated by space.
|
each line should contain a complete sentence, and words should be separated by space.
|
||||||
|
|
||||||
First, let's download the training data for the RNNLM. This can be done via the
|
First, let's download the training data for the RNNLM. This can be done via the
|
||||||
following command:
|
following command:
|
||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
$ wget https://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz
|
$ wget https://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz
|
||||||
$ gzip -d librispeech-lm-norm.txt.gz
|
$ gzip -d librispeech-lm-norm.txt.gz
|
||||||
|
|
||||||
As we are training a BPE-level RNNLM, we need to tokenize the training text, which requires a
|
As we are training a BPE-level RNNLM, we need to tokenize the training text, which requires a
|
||||||
BPE tokenizer. This can be achieved by executing the following command:
|
BPE tokenizer. This can be achieved by executing the following command:
|
||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
$ # if you don't have the BPE
|
$ # if you don't have the BPE
|
||||||
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15
|
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15
|
||||||
$ cd icefall-asr-librispeech-zipformer-2023-05-15/data/lang_bpe_500
|
$ cd icefall-asr-librispeech-zipformer-2023-05-15/data/lang_bpe_500
|
||||||
@ -56,11 +56,11 @@ sentence length.
|
|||||||
--out-statistics data/lang_bpe_500/lm_data_stats.txt
|
--out-statistics data/lang_bpe_500/lm_data_stats.txt
|
||||||
|
|
||||||
|
|
||||||
The aforementioned steps can be repeated to create a validation set for your RNNLM. Let's say
|
The aforementioned steps can be repeated to create a validation set for your RNNLM. Let's say
|
||||||
you have a validation set in ``valid.txt``, you can just set ``--lm-data valid.txt``
|
you have a validation set in ``valid.txt``, you can just set ``--lm-data valid.txt``
|
||||||
and ``--lm-archive data/lang_bpe_500/lm-data-valid.pt`` when calling ``./local/prepare_lm_training_data.py``.
|
and ``--lm-archive data/lang_bpe_500/lm-data-valid.pt`` when calling ``./local/prepare_lm_training_data.py``.
|
||||||
|
|
||||||
After completing the previous steps, the training and testing sets for training RNNLM are ready.
|
After completing the previous steps, the training and testing sets for training RNNLM are ready.
|
||||||
The next step is to train the RNNLM model. The training command is as follows:
|
The next step is to train the RNNLM model. The training command is as follows:
|
||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
@ -77,7 +77,7 @@ The next step is to train the RNNLM model. The training command is as follows:
|
|||||||
--use-fp16 0 \
|
--use-fp16 0 \
|
||||||
--tie-weights 1 \
|
--tie-weights 1 \
|
||||||
--embedding-dim 2048 \
|
--embedding-dim 2048 \
|
||||||
--hidden_dim 2048 \
|
--hidden-dim 2048 \
|
||||||
--num-layers 3 \
|
--num-layers 3 \
|
||||||
--batch-size 300 \
|
--batch-size 300 \
|
||||||
--lm-data rnn_lm/data/lang_bpe_500/sorted_lm_data.pt \
|
--lm-data rnn_lm/data/lang_bpe_500/sorted_lm_data.pt \
|
||||||
@ -93,12 +93,3 @@ The next step is to train the RNNLM model. The training command is as follows:
|
|||||||
.. note::
|
.. note::
|
||||||
|
|
||||||
The training of RNNLM can take a long time (usually a couple of days).
|
The training of RNNLM can take a long time (usually a couple of days).
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
@ -162,7 +162,7 @@ $<span class="w"> </span>./rnn_lm/train.py<span class="w"> </span><span class="s
|
|||||||
<span class="w"> </span>--use-fp16<span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="se">\</span>
|
<span class="w"> </span>--use-fp16<span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="se">\</span>
|
||||||
<span class="w"> </span>--tie-weights<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
|
<span class="w"> </span>--tie-weights<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
|
||||||
<span class="w"> </span>--embedding-dim<span class="w"> </span><span class="m">2048</span><span class="w"> </span><span class="se">\</span>
|
<span class="w"> </span>--embedding-dim<span class="w"> </span><span class="m">2048</span><span class="w"> </span><span class="se">\</span>
|
||||||
<span class="w"> </span>--hidden_dim<span class="w"> </span><span class="m">2048</span><span class="w"> </span><span class="se">\</span>
|
<span class="w"> </span>--hidden-dim<span class="w"> </span><span class="m">2048</span><span class="w"> </span><span class="se">\</span>
|
||||||
<span class="w"> </span>--num-layers<span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="se">\</span>
|
<span class="w"> </span>--num-layers<span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="se">\</span>
|
||||||
<span class="w"> </span>--batch-size<span class="w"> </span><span class="m">300</span><span class="w"> </span><span class="se">\</span>
|
<span class="w"> </span>--batch-size<span class="w"> </span><span class="m">300</span><span class="w"> </span><span class="se">\</span>
|
||||||
<span class="w"> </span>--lm-data<span class="w"> </span>rnn_lm/data/lang_bpe_500/sorted_lm_data.pt<span class="w"> </span><span class="se">\</span>
|
<span class="w"> </span>--lm-data<span class="w"> </span>rnn_lm/data/lang_bpe_500/sorted_lm_data.pt<span class="w"> </span><span class="se">\</span>
|
||||||
|
File diff suppressed because one or more lines are too long
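The only functional change in this commit is the flag spelling ``--hidden_dim`` to ``--hidden-dim`` in the training command. The sketch below illustrates why the spelling matters, assuming the training script defines its options with Python's ``argparse`` using hyphenated names. That is an assumption: the parser here is a hypothetical stand-in, not the actual ``./rnn_lm/train.py``. With such a parser, the underscore spelling is rejected as an unrecognized argument, while the hyphenated flag is stored under the attribute ``hidden_dim``.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the training script's parser; only the
    # flag names are taken from the command shown in the diff.
    parser = argparse.ArgumentParser(prog="train_sketch")
    parser.add_argument("--embedding-dim", type=int, default=2048)
    parser.add_argument("--hidden-dim", type=int, default=2048)
    parser.add_argument("--num-layers", type=int, default=3)
    return parser


def accepts(argv: list) -> bool:
    """Return True if the parser accepts argv, False if it exits with an error."""
    try:
        build_parser().parse_args(argv)
        return True
    except SystemExit:
        # argparse reports unrecognized arguments on stderr and exits.
        return False


if __name__ == "__main__":
    # Hyphenated flag: accepted, value available as args.hidden_dim.
    args = build_parser().parse_args(["--hidden-dim", "2048"])
    print("hidden_dim =", args.hidden_dim)

    # Underscore spelling: rejected, since no such option was defined.
    print("underscore spelling accepted:", accepts(["--hidden_dim", "2048"]))
```

Note that argparse converts hyphens to underscores only for the attribute name on the parsed namespace, not for the option string typed on the command line.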