Mirror of https://github.com/k2-fsa/icefall.git (synced 2025-08-09 01:52:41 +00:00)

deploy: e2fcb42f5f176d9e39eb38506ab99d0a3adaf202

parent 6dfba8b4a3
commit a1d1f2e434
@@ -4,7 +4,7 @@ Train an RNN language model
 ======================================
 
 If you have enough text data, you can train a neural network language model (NNLM) to improve
-the WER of your E2E ASR system. This tutorial shows you how to train an RNNLM from
+the WER of your E2E ASR system. This tutorial shows you how to train an RNNLM from
 scratch.
 
 .. HINT::
@@ -15,23 +15,23 @@ scratch.
 .. note::
 
     This tutorial is based on the LibriSpeech recipe. Please check it out for the necessary
-    python scripts for this tutorial. We use the LibriSpeech LM-corpus as the LM training set
+    python scripts for this tutorial. We use the LibriSpeech LM-corpus as the LM training set
     for illustration purpose. You can also collect your own data. The data format is quite simple:
     each line should contain a complete sentence, and words should be separated by space.
 
-First, let's download the training data for the RNNLM. This can be done via the
+First, let's download the training data for the RNNLM. This can be done via the
 following command:
 
 .. code-block:: bash
 
-    $ wget https://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz
+    $ wget https://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz
     $ gzip -d librispeech-lm-norm.txt.gz
 
 As we are training a BPE-level RNNLM, we need to tokenize the training text, which requires a
 BPE tokenizer. This can be achieved by executing the following command:
 
 .. code-block:: bash
-
+
     $ # if you don't have the BPE
     $ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15
     $ cd icefall-asr-librispeech-zipformer-2023-05-15/data/lang_bpe_500
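The data format described in the note above (one complete sentence per line, words separated by a space) can be sketched in Python. This is only an illustration under assumptions: `normalize_lines` is a hypothetical helper, not part of the icefall recipe, and it does nothing beyond collapsing whitespace and dropping empty lines. Any casing or punctuation normalization your own corpus needs is left out.

```python
from typing import Iterable, Iterator


def normalize_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield one sentence per line with words separated by single spaces.

    Hypothetical helper for illustration; not part of the icefall recipe.
    """
    for line in lines:
        # str.split() with no argument splits on any run of whitespace
        # (spaces, tabs, newlines) and ignores leading/trailing whitespace.
        words = line.split()
        if words:  # skip lines that contain no words at all
            yield " ".join(words)


if __name__ == "__main__":
    raw = ["  THE  QUICK\tBROWN FOX \n", "\n", "JUMPS OVER\n"]
    for sentence in normalize_lines(raw):
        print(sentence)
```

The function accepts any iterable of strings, so an open text file can be passed directly and large corpora are processed line by line without loading them into memory.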
@@ -56,11 +56,11 @@ sentence length.
         --out-statistics data/lang_bpe_500/lm_data_stats.txt
 
 
-The aforementioned steps can be repeated to create a a validation set for you RNNLM. Let's say
-you have a validation set in ``valid.txt``, you can just set ``--lm-data valid.txt``
+The aforementioned steps can be repeated to create a a validation set for you RNNLM. Let's say
+you have a validation set in ``valid.txt``, you can just set ``--lm-data valid.txt``
 and ``--lm-archive data/lang_bpe_500/lm-data-valid.pt`` when calling ``./local/prepare_lm_training_data.py``.
 
-After completing the previous steps, the training and testing sets for training RNNLM are ready.
+After completing the previous steps, the training and testing sets for training RNNLM are ready.
 The next step is to train the RNNLM model. The training command is as follows:
 
 .. code-block:: bash
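If you bring your own text and do not yet have a `valid.txt`, one option is to hold out a small random subset of sentences. The following is a minimal sketch under assumptions: `split_train_valid` is a hypothetical helper, and the 1% fraction and the seed are arbitrary illustrative defaults, not values taken from the recipe.

```python
import random
from typing import List, Sequence, Tuple


def split_train_valid(
    sentences: Sequence[str], valid_fraction: float = 0.01, seed: int = 42
) -> Tuple[List[str], List[str]]:
    """Randomly hold out a fraction of sentences as a validation set.

    Hypothetical helper; the fraction and seed are illustrative defaults.
    """
    if not 0.0 < valid_fraction < 1.0:
        raise ValueError("valid_fraction must be in (0, 1)")
    indices = list(range(len(sentences)))
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(indices)
    num_valid = max(1, int(len(sentences) * valid_fraction))
    valid_idx = set(indices[:num_valid])
    # Keep the original sentence order within each subset.
    train = [s for i, s in enumerate(sentences) if i not in valid_idx]
    valid = [s for i, s in enumerate(sentences) if i in valid_idx]
    return train, valid


if __name__ == "__main__":
    corpus = [f"SENTENCE {i}" for i in range(1000)]
    train, valid = split_train_valid(corpus)
    print(len(train), len(valid))
```

After writing the held-out sentences to `valid.txt`, one per line, the file can be passed through the same preparation steps with `--lm-data valid.txt` and `--lm-archive data/lang_bpe_500/lm-data-valid.pt`, as described in the text above.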
@@ -77,7 +77,7 @@ The next step is to train the RNNLM model. The training command is as follows:
         --use-fp16 0 \
         --tie-weights 1 \
         --embedding-dim 2048 \
-        --hidden_dim 2048 \
+        --hidden-dim 2048 \
         --num-layers 3 \
         --batch-size 300 \
         --lm-data rnn_lm/data/lang_bpe_500/sorted_lm_data.pt \
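The change from `--hidden_dim` to `--hidden-dim` matters if the training script declares the option with a dash and parses it with Python's `argparse`. That is an assumption here: the parser below is a stand-alone stand-in with an arbitrary default value, not a copy of `rnn_lm/train.py`. With `argparse`, an option declared as `--hidden-dim` is stored as `args.hidden_dim`, but the underscore spelling is not accepted on the command line.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Stand-in parser; assumed to resemble how a flag like --hidden-dim is declared."""
    parser = argparse.ArgumentParser()
    # argparse derives the attribute name from the long option by replacing
    # dashes with underscores: --hidden-dim -> args.hidden_dim.
    # The default of 128 is an arbitrary value chosen for this sketch.
    parser.add_argument("--hidden-dim", type=int, default=128)
    return parser


if __name__ == "__main__":
    parser = build_parser()

    # The dashed spelling is recognized and parsed as an int.
    args, unknown = parser.parse_known_args(["--hidden-dim", "2048"])
    print(args.hidden_dim, unknown)

    # The underscore spelling is not recognized: the default is kept and the
    # tokens are returned as leftovers. parse_args() would exit with an
    # "unrecognized arguments" error instead.
    args, unknown = parser.parse_known_args(["--hidden_dim", "2048"])
    print(args.hidden_dim, unknown)
```

`parse_known_args` is used only so that the sketch can show the unrecognized tokens without terminating; a script that calls `parse_args` would stop at the misspelled flag.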
@@ -93,12 +93,3 @@ The next step is to train the RNNLM model. The training command is as follows:
 .. note::
 
     The training of RNNLM can take a long time (usually a couple of days).
-
-
-
-
-
-
-
-
-
@@ -162,7 +162,7 @@ $<span class="w"> </span>./rnn_lm/train.py<span class="w"> </span><span class="s
 <span class="w"> </span>--use-fp16<span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="se">\</span>
 <span class="w"> </span>--tie-weights<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
 <span class="w"> </span>--embedding-dim<span class="w"> </span><span class="m">2048</span><span class="w"> </span><span class="se">\</span>
-<span class="w"> </span>--hidden_dim<span class="w"> </span><span class="m">2048</span><span class="w"> </span><span class="se">\</span>
+<span class="w"> </span>--hidden-dim<span class="w"> </span><span class="m">2048</span><span class="w"> </span><span class="se">\</span>
 <span class="w"> </span>--num-layers<span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="se">\</span>
 <span class="w"> </span>--batch-size<span class="w"> </span><span class="m">300</span><span class="w"> </span><span class="se">\</span>
 <span class="w"> </span>--lm-data<span class="w"> </span>rnn_lm/data/lang_bpe_500/sorted_lm_data.pt<span class="w"> </span><span class="se">\</span>
File diff suppressed because one or more lines are too long