Add documentation for RNNLM training (#1267)

* add documentation for training an RNNLM
2023-09-25 10:48:50 +08:00
commit 97f9b9c33b
parent ef5da4824d
4 changed files with 115 additions and 2 deletions
--- a/docs/source/decoding-with-langugage-models/index.rst
+++ b/docs/source/decoding-with-langugage-models/index.rst
@@ -2,12 +2,13 @@ Decoding with language models
 =============================
 This section describes how to use external language models
-during decoding to improve the WER of transducer models.
+during decoding to improve the WER of transducer models. To train an external language model,
 please refer to this tutorial: :ref:`train_nnlm`.
 The following decoding methods with external language models are available:
-.. list-table:: LM-rescoring-based methods vs shallow-fusion-based methods (The numbers in each field is WER on test-clean, WER on test-other and decoding time on test-clean)
+.. list-table:: 
   :widths: 25 50
   :header-rows: 1
--- a/docs/source/recipes/RNN-LM/index.rst
+++ b/docs/source/recipes/RNN-LM/index.rst
@@ -0,0 +1,7 @@
 RNN-LM
 ======
 .. toctree::
   :maxdepth: 2
   librispeech/lm-training
--- a/docs/source/recipes/RNN-LM/librispeech/lm-training.rst
+++ b/docs/source/recipes/RNN-LM/librispeech/lm-training.rst
@@ -0,0 +1,104 @@
 .. _train_nnlm:
 Train an RNN language model
 ======================================
 If you have enough text data, you can train a neural network language model (NNLM) to improve
 the WER of your E2E ASR system. This tutorial shows you how to train an RNNLM from 
 scratch.
 .. HINT::
    For how to use an NNLM during decoding, please refer to the following tutorials:
    :ref:`shallow_fusion`, :ref:`LODR`, :ref:`rescoring`
 .. note::
    This tutorial is based on the LibriSpeech recipe. Please check it out for the
    Python scripts used in this tutorial. We use the LibriSpeech LM corpus as the LM training set
    for illustration purposes. You can also collect your own data. The data format is simple:
    each line should contain one complete sentence, with words separated by spaces.
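
If you collect your own text, it usually needs light normalization before it matches this format. The following is a minimal sketch and not part of the recipe: the helper names ``normalize_line`` and ``normalize_corpus`` are hypothetical, and the choice to uppercase the text and keep only letters and apostrophes is an assumption made to mimic the style of the LibriSpeech LM corpus. Note that digits are simply dropped here rather than spelled out, so adapt the rules to your own data.

```python
import re

# Hypothetical helpers, not part of the recipe.
# Anything that is not an uppercase letter, an apostrophe, or a space is removed.
_NOT_ALLOWED = re.compile(r"[^A-Z' ]+")
_SPACES = re.compile(r"\s+")


def normalize_line(raw: str) -> str:
    """Turn one raw sentence into space-separated uppercase words."""
    text = raw.upper()
    text = _NOT_ALLOWED.sub(" ", text)     # punctuation, digits, tabs, newlines -> space
    return _SPACES.sub(" ", text).strip()  # collapse runs of whitespace


def normalize_corpus(lines):
    """Yield one normalized sentence per input line, skipping empty results."""
    for raw in lines:
        sentence = normalize_line(raw)
        if sentence:
            yield sentence
```

Each yielded sentence can be written on its own line of a plain text file, which is then passed to ``--lm-data``.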
 First, let's download the training data for the RNNLM. This can be done via the 
 following command:
 .. code-block:: bash
    $ wget https://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz 
    $ gzip -d librispeech-lm-norm.txt.gz
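
Before decompressing, it can be useful to peek at the first few lines of the archive to confirm the one-sentence-per-line format. This is an optional sketch using only the Python standard library; the helper name ``head_gzip`` is hypothetical and not part of the recipe.

```python
import gzip
from itertools import islice


def head_gzip(path, n=5):
    """Return the first n lines of a gzip-compressed text file, without newlines."""
    # "rt" opens the archive in text mode, so lines are decoded on the fly
    # and the whole file is never loaded into memory.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]
```

For example, ``head_gzip("librispeech-lm-norm.txt.gz")`` would return the first five sentences of the downloaded corpus.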
 As we are training a BPE-level RNNLM, we need to tokenize the training text, which requires a
 BPE tokenizer. This can be achieved by executing the following command:
 .. code-block:: bash
    $ # if you don't have the BPE model yet
    $ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15
    $ cd icefall-asr-librispeech-zipformer-2023-05-15/data/lang_bpe_500
    $ git lfs pull --include bpe.model
    $ cd ../../..
    $ ./local/prepare_lm_training_data.py \
        --bpe-model icefall-asr-librispeech-zipformer-2023-05-15/data/lang_bpe_500/bpe.model \
        --lm-data librispeech-lm-norm.txt \
        --lm-archive data/lang_bpe_500/lm_data.pt
 Now, you should have a file named ``lm_data.pt`` stored under the directory ``data/lang_bpe_500``.
 This is the packed training data for the RNNLM. We then sort the training data according to its
 sentence length.
 .. code-block:: bash
    $ # This could take a while (~ 20 minutes), feel free to grab a cup of coffee :)
    $ ./local/sort_lm_training_data.py \
        --in-lm-data data/lang_bpe_500/lm_data.pt \
        --out-lm-data data/lang_bpe_500/sorted_lm_data.pt \
        --out-statistics data/lang_bpe_500/lm_data_stats.txt
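
The tutorial does not state why the data is sorted. A common motivation, assumed here, is that batches of similar-length sentences need less padding. The sketch below illustrates that effect with made-up sentence lengths; it does not read ``lm_data.pt``, and the function name ``padded_tokens`` and the batch size of 300 sentences are chosen purely for illustration.

```python
import random


def padded_tokens(lengths, batch_size):
    """Total tokens after padding every batch to its longest sentence."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += max(batch) * len(batch)
    return total


random.seed(0)
# Made-up sentence lengths (in tokens), not taken from the real corpus.
lengths = [random.randint(1, 100) for _ in range(10_000)]

real = sum(lengths)
unsorted_cost = padded_tokens(lengths, batch_size=300)
sorted_cost = padded_tokens(sorted(lengths), batch_size=300)

print(f"real tokens:       {real}")
print(f"padded (unsorted): {unsorted_cost}")
print(f"padded (sorted):   {sorted_cost}")
```

With sorted lengths, the padded total stays close to the real token count, whereas unsorted batches are padded to nearly the maximum length each time.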
 The steps above can be repeated to create a validation set for your RNNLM. For example,
 if your validation text is in ``valid.txt``, set ``--lm-data valid.txt``
 and ``--lm-archive data/lang_bpe_500/lm-data-valid.pt`` when calling ``./local/prepare_lm_training_data.py``.
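
If you only have a single text file, one way to obtain ``valid.txt`` is to hold out a small random fraction of its lines. The following is a minimal sketch under that assumption; the function ``split_corpus``, the 1% default ratio, and the file names ``all.txt`` and ``train.txt`` are hypothetical and not part of the recipe.

```python
import random


def split_corpus(in_path, train_path, valid_path, valid_ratio=0.01, seed=0):
    """Send each non-empty line of in_path to train_path or valid_path.

    Returns (num_train, num_valid). A fixed seed makes the split reproducible.
    """
    rng = random.Random(seed)
    num_train = 0
    num_valid = 0
    with open(in_path, encoding="utf-8") as fin, \
            open(train_path, "w", encoding="utf-8") as ftrain, \
            open(valid_path, "w", encoding="utf-8") as fvalid:
        for line in fin:
            sentence = line.strip()
            if not sentence:
                continue  # skip blank lines
            if rng.random() < valid_ratio:
                fvalid.write(sentence + "\n")
                num_valid += 1
            else:
                ftrain.write(sentence + "\n")
                num_train += 1
    return num_train, num_valid
```

For example, ``split_corpus("all.txt", "train.txt", "valid.txt")`` writes roughly 1% of the sentences to ``valid.txt`` and the rest to ``train.txt``.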
 After completing the previous steps, the training and validation sets for the RNNLM are ready.
 The next step is to train the RNNLM. The training command is as follows:
 .. code-block:: bash
    $ # assume you are in the icefall root directory
    $ cd rnn_lm
    $ ln -s ../../egs/librispeech/ASR/data .
    $ cd ..
    $ ./rnn_lm/train.py \
        --world-size 4 \
        --exp-dir ./rnn_lm/exp \
        --start-epoch 0 \
        --num-epochs 10 \
        --use-fp16 0 \
        --tie-weights 1 \
        --embedding-dim 2048 \
        --hidden-dim 2048 \
        --num-layers 3 \
        --batch-size 300 \
        --lm-data rnn_lm/data/lang_bpe_500/sorted_lm_data.pt \
        --lm-data-valid rnn_lm/data/lang_bpe_500/sorted_lm_data.pt
 .. note::
    You can adjust the RNNLM hyperparameters to control the size of the RNNLM,
    such as embedding dimension and hidden state dimension. For more details, please
    run ``./rnn_lm/train.py --help``.
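
To get a feel for how these hyperparameters affect model size, the sketch below estimates the parameter count of an embedding layer, a stack of LSTM layers, and an output projection. It is a rough estimate under stated assumptions, not a description of the actual model: it assumes PyTorch-style LSTM layers (four gates and two bias vectors per layer), and the vocabulary size of 500 is inferred from the ``lang_bpe_500`` directory name. The function name ``lstm_lm_num_params`` is hypothetical.

```python
def lstm_lm_num_params(vocab_size, embedding_dim, hidden_dim, num_layers, tie_weights):
    """Rough parameter count for embedding + stacked LSTM + output projection.

    Assumes PyTorch-style LSTM layers; the actual model may differ in details.
    """
    if tie_weights and embedding_dim != hidden_dim:
        raise ValueError("weight tying needs embedding_dim == hidden_dim")

    embedding = vocab_size * embedding_dim

    lstm = 0
    for layer in range(num_layers):
        input_dim = embedding_dim if layer == 0 else hidden_dim
        # Four gates, each with input weights, recurrent weights, and two biases.
        lstm += 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)

    # Output projection: the weight matrix is shared with the embedding when tied,
    # so only the bias is counted in that case.
    output = vocab_size
    if not tie_weights:
        output += vocab_size * hidden_dim

    return embedding + lstm + output


# Values from the training command; vocab_size=500 is an assumption (see above).
print(lstm_lm_num_params(500, 2048, 2048, 3, tie_weights=True))
```

Under these assumptions the configuration in the training command comes to roughly 100 million parameters, almost all of them in the LSTM layers, so the hidden dimension and the number of layers dominate the model size.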
 .. note::
    Training the RNNLM can take a long time (usually a couple of days).
--- a/docs/source/recipes/index.rst
+++ b/docs/source/recipes/index.rst
@@ -15,3 +15,4 @@ We may add recipes for other tasks as well in the future.
   Non-streaming-ASR/index
   Streaming-ASR/index
   RNN-LM/index