mirror of https://github.com/k2-fsa/icefall.git
synced 2025-08-08 17:42:21 +00:00
fix typo (#1455)
This commit is contained in:
parent
5445ea6df6
commit
e2fcb42f5f

@@ -4,7 +4,7 @@ Train an RNN language model
======================================

If you have enough text data, you can train a neural network language model (NNLM) to improve
the WER of your E2E ASR system. This tutorial shows you how to train an RNNLM from
scratch.

.. HINT::

@@ -15,23 +15,23 @@ scratch.
.. note::

   This tutorial is based on the LibriSpeech recipe. Please check it out for the necessary
   Python scripts used in this tutorial. We use the LibriSpeech LM corpus as the LM training
   set for illustration purposes. You can also collect your own data. The data format is quite
   simple: each line should contain a complete sentence, and words should be separated by
   spaces.
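
If you collect your own data, the expected layout is easy to check by eye. Here is a
made-up example (the file name and the sentences are purely illustrative):

.. code-block:: bash

    $ cat my-train-corpus.txt
    HELLO ICEFALL THIS IS ONE COMPLETE SENTENCE
    EACH LINE HOLDS EXACTLY ONE SENTENCE
    WORDS ARE SEPARATED BY SPACES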

First, let's download the training data for the RNNLM. This can be done via the
following command:

.. code-block:: bash

    $ wget https://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz
    $ gzip -d librispeech-lm-norm.txt.gz
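
Before moving on, you can sanity-check the downloaded corpus against the format described
in the note above; the commands below are just a quick optional peek, not a required step:

.. code-block:: bash

    $ head -n 3 librispeech-lm-norm.txt  # one normalized sentence per line
    $ wc -l librispeech-lm-norm.txt      # total number of training sentences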

As we are training a BPE-level RNNLM, we need to tokenize the training text, which requires a
BPE tokenizer. This can be achieved by executing the following command:

.. code-block:: bash

    $ # if you don't have the BPE model, download a pretrained one:
    $ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15
    $ cd icefall-asr-librispeech-zipformer-2023-05-15/data/lang_bpe_500
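
If you are curious what BPE tokenization does to a sentence, here is a minimal sketch using
the sentencepiece Python API (icefall already depends on sentencepiece); it assumes a
``bpe.model`` file is present in the ``lang_bpe_500`` directory you just entered:

.. code-block:: bash

    $ # print the BPE pieces for a sample sentence
    $ python3 -c "import sentencepiece as spm; sp = spm.SentencePieceProcessor(model_file='bpe.model'); print(sp.encode('HELLO WORLD', out_type=str))"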

@@ -56,11 +56,11 @@ sentence length.
        --out-statistics data/lang_bpe_500/lm_data_stats.txt

The aforementioned steps can be repeated to create a validation set for your RNNLM. Let's say
you have a validation set in ``valid.txt``; you can simply set ``--lm-data valid.txt``
and ``--lm-archive data/lang_bpe_500/lm-data-valid.pt`` when calling ``./local/prepare_lm_training_data.py``.
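
For example, a hedged sketch of the validation-set call (the ``--bpe-model`` option and its
path are assumptions on our part; point it at the tokenizer downloaded earlier):

.. code-block:: bash

    $ ./local/prepare_lm_training_data.py \
        --bpe-model data/lang_bpe_500/bpe.model \
        --lm-data valid.txt \
        --lm-archive data/lang_bpe_500/lm-data-valid.pt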

After completing the previous steps, the training and validation sets for the RNNLM are ready.
The next step is to train the RNNLM model. The training command is as follows:

.. code-block:: bash

@@ -77,7 +77,7 @@ The next step is to train the RNNLM model. The training command is as follows:
        --use-fp16 0 \
        --tie-weights 1 \
        --embedding-dim 2048 \
-       --hidden_dim 2048 \
+       --hidden-dim 2048 \
        --num-layers 3 \
        --batch-size 300 \
        --lm-data rnn_lm/data/lang_bpe_500/sorted_lm_data.pt \

@@ -93,12 +93,3 @@ The next step is to train the RNNLM model. The training command is as follows:
.. note::

   The training of the RNNLM can take a long time (usually a couple of days).
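
Because training runs for days, it is worth monitoring the loss curves as you go. A minimal
sketch, assuming the training script writes TensorBoard logs under your experiment
directory (the ``rnn_lm/exp`` path is an assumption; substitute your own ``--exp-dir``):

.. code-block:: bash

    $ tensorboard --logdir rnn_lm/exp/tensorboard --port 6006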