deploy: e2fcb42f5f176d9e39eb38506ab99d0a3adaf202

This commit is contained in:
marcoyang1998 2024-01-09 07:44:23 +00:00
parent 6dfba8b4a3
commit a1d1f2e434
3 changed files with 11 additions and 20 deletions

View File

@ -4,7 +4,7 @@ Train an RNN language model
======================================
If you have enough text data, you can train a neural network language model (NNLM) to reduce
the WER of your E2E ASR system. This tutorial shows you how to train an RNNLM from
scratch.
.. HINT::
@ -15,23 +15,23 @@ scratch.
.. note::
This tutorial is based on the LibriSpeech recipe. Please check it out for the necessary
python scripts for this tutorial. We use the LibriSpeech LM-corpus as the LM training set
for illustration purposes. You can also collect your own data. The data format is quite simple:
each line should contain a complete sentence, and words should be separated by spaces.
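For instance, a toy training file could be created like this (the file name and the sentences are made up purely to illustrate the format):

.. code-block:: bash

    $ # hypothetical example: two normalized sentences, one per line
    $ cat <<EOF > toy-lm-train.txt
    I LOVE SPEECH RECOGNITION
    RNN LANGUAGE MODELS ARE TRAINED ON PLAIN TEXT
    EOF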
First, let's download the training data for the RNNLM. This can be done via the
following commands:
.. code-block:: bash
$ wget https://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz
$ gzip -d librispeech-lm-norm.txt.gz
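Optionally, you can sanity-check the uncompressed corpus before going further (these commands are only a suggestion, not part of the original recipe):

.. code-block:: bash

    $ # each line should be one normalized sentence
    $ wc -l librispeech-lm-norm.txt
    $ head -n 3 librispeech-lm-norm.txt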
As we are training a BPE-level RNNLM, we need to tokenize the training text, which requires a
BPE tokenizer. This can be achieved by executing the following commands:
.. code-block:: bash
$ # if you don't have the BPE model
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15
$ cd icefall-asr-librispeech-zipformer-2023-05-15/data/lang_bpe_500
@ -56,11 +56,11 @@ sentence length.
--out-statistics data/lang_bpe_500/lm_data_stats.txt
The aforementioned steps can be repeated to create a validation set for your RNNLM. Say
you have a validation set in ``valid.txt``; you can simply set ``--lm-data valid.txt``
and ``--lm-archive data/lang_bpe_500/lm-data-valid.pt`` when calling ``./local/prepare_lm_training_data.py``.
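For illustration, a sketch of that call might look like the following; the ``--bpe-model`` path is an assumption based on the repository cloned above, so adjust it (and the output paths) to your own setup:

.. code-block:: bash

    $ # sketch only: --lm-data and --lm-archive are the flags mentioned above;
    $ # the --bpe-model path is assumed rather than taken from this tutorial
    $ ./local/prepare_lm_training_data.py \
        --bpe-model icefall-asr-librispeech-zipformer-2023-05-15/data/lang_bpe_500/bpe.model \
        --lm-data valid.txt \
        --lm-archive data/lang_bpe_500/lm-data-valid.pt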
After completing the previous steps, the training and validation sets for the RNNLM are ready.
The next step is to train the RNNLM model. The training command is as follows:
.. code-block:: bash
@ -77,7 +77,7 @@ The next step is to train the RNNLM model. The training command is as follows:
--use-fp16 0 \
--tie-weights 1 \
--embedding-dim 2048 \
- --hidden_dim 2048 \
+ --hidden-dim 2048 \
--num-layers 3 \
--batch-size 300 \
--lm-data rnn_lm/data/lang_bpe_500/sorted_lm_data.pt \
@ -93,12 +93,3 @@ The next step is to train the RNNLM model. The training command is as follows:
.. note::
Training an RNNLM can take a long time (usually a couple of days).

View File

@ -162,7 +162,7 @@ $ ./rnn_lm/train.py \
--use-fp16 0 \
--tie-weights 1 \
--embedding-dim 2048 \
- --hidden_dim 2048 \
+ --hidden-dim 2048 \
--num-layers 3 \
--batch-size 300 \
--lm-data rnn_lm/data/lang_bpe_500/sorted_lm_data.pt \

File diff suppressed because one or more lines are too long