From 97f9b9c33b9e3d4a7152c45f28dec397202aabb6 Mon Sep 17 00:00:00 2001
From: marcoyang1998 <45973641+marcoyang1998@users.noreply.github.com>
Date: Mon, 25 Sep 2023 10:48:50 +0800
Subject: [PATCH 1/3] Add documentation for RNNLM training (#1267)

* add documentation for training an RNNLM
---
 .../decoding-with-langugage-models/index.rst |   5 +-
 docs/source/recipes/RNN-LM/index.rst         |   7 ++
 .../RNN-LM/librispeech/lm-training.rst       | 104 ++++++++++++++++++
 docs/source/recipes/index.rst                |   1 +
 4 files changed, 115 insertions(+), 2 deletions(-)
 create mode 100644 docs/source/recipes/RNN-LM/index.rst
 create mode 100644 docs/source/recipes/RNN-LM/librispeech/lm-training.rst

diff --git a/docs/source/decoding-with-langugage-models/index.rst b/docs/source/decoding-with-langugage-models/index.rst
index 6e5e3a4d9..c49da9a4e 100644
--- a/docs/source/decoding-with-langugage-models/index.rst
+++ b/docs/source/decoding-with-langugage-models/index.rst
@@ -2,12 +2,13 @@ Decoding with language models
 =============================

 This section describes how to use external language models
-during decoding to improve the WER of transducer models.
+during decoding to improve the WER of transducer models. To train an external language model,
+please refer to this tutorial: :ref:`train_nnlm`.

 The following decoding methods with external language models are available:

-.. list-table:: LM-rescoring-based methods vs shallow-fusion-based methods (The numbers in each field is WER on test-clean, WER on test-other and decoding time on test-clean)
+.. list-table::
    :widths: 25 50
    :header-rows: 1

diff --git a/docs/source/recipes/RNN-LM/index.rst b/docs/source/recipes/RNN-LM/index.rst
new file mode 100644
index 000000000..4b74e64c7
--- /dev/null
+++ b/docs/source/recipes/RNN-LM/index.rst
@@ -0,0 +1,7 @@
+RNN-LM
+======
+
+.. toctree::
+   :maxdepth: 2
+
+   librispeech/lm-training
\ No newline at end of file

diff --git a/docs/source/recipes/RNN-LM/librispeech/lm-training.rst b/docs/source/recipes/RNN-LM/librispeech/lm-training.rst
new file mode 100644
index 000000000..736120275
--- /dev/null
+++ b/docs/source/recipes/RNN-LM/librispeech/lm-training.rst
@@ -0,0 +1,104 @@
.. _train_nnlm:

Train an RNN language model
======================================

If you have enough text data, you can train a neural network language model (NNLM) to improve
the WER of your E2E ASR system. This tutorial shows you how to train an RNNLM from
scratch.

.. HINT::

   For how to use an NNLM during decoding, please refer to the following tutorials:
   :ref:`shallow_fusion`, :ref:`LODR`, :ref:`rescoring`

.. note::

   This tutorial is based on the LibriSpeech recipe. Please refer to it for the Python
   scripts needed by this tutorial. We use the LibriSpeech LM corpus as the LM training set
   for illustration purposes. You can also collect your own data. The data format is quite simple:
   each line should contain a complete sentence, and words should be separated by spaces.

First, let's download the training data for the RNNLM. This can be done via the
following command:

.. code-block:: bash

    $ wget https://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz
    $ gzip -d librispeech-lm-norm.txt.gz

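If you plug in your own corpus instead, a quick way to sanity-check that it matches the
expected format (one sentence per line, words separated by spaces) is shown below. This is an
optional sketch assuming standard tools such as ``head`` and ``wc`` are available; it is not
part of the recipe:

.. code-block:: bash

    $ # optional check: preview a few sentences and count how many lines (sentences) there are
    $ head -n 3 librispeech-lm-norm.txt
    $ wc -l librispeech-lm-norm.txt
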
As we are training a BPE-level RNNLM, we need to tokenize the training text, which requires a
BPE tokenizer. This can be achieved by executing the following command:

.. code-block:: bash

    $ # if you don't have the BPE model, download it from HuggingFace first
    $ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15
    $ cd icefall-asr-librispeech-zipformer-2023-05-15/data/lang_bpe_500
    $ git lfs pull --include bpe.model
    $ cd ../../..

    $ ./local/prepare_lm_training_data.py \
        --bpe-model icefall-asr-librispeech-zipformer-2023-05-15/data/lang_bpe_500/bpe.model \
        --lm-data librispeech-lm-norm.txt \
        --lm-archive data/lang_bpe_500/lm_data.pt

Now you should have a file named ``lm_data.pt`` stored under the directory ``data/lang_bpe_500``.
This is the packed training data for the RNNLM. We then sort the training data by
sentence length.

.. code-block:: bash

    $ # This could take a while (~ 20 minutes), feel free to grab a cup of coffee :)
    $ ./local/sort_lm_training_data.py \
        --in-lm-data data/lang_bpe_500/lm_data.pt \
        --out-lm-data data/lang_bpe_500/sorted_lm_data.pt \
        --out-statistics data/lang_bpe_500/lm_data_stats.txt


The steps above can be repeated to create a validation set for your RNNLM. Say your validation
text is stored in ``valid.txt``; simply set ``--lm-data valid.txt`` and
``--lm-archive data/lang_bpe_500/lm-data-valid.pt`` when calling ``./local/prepare_lm_training_data.py``,
and then sort the result in the same way (for example, into ``data/lang_bpe_500/sorted_lm_data-valid.pt``).

After completing the previous steps, the training and validation sets for the RNNLM are ready.
The next step is to train the RNNLM. The training command is as follows:

.. code-block:: bash

    $ # assume you are in the icefall root directory
    $ cd rnn_lm
    $ ln -s ../../egs/librispeech/ASR/data .
    $ cd ..
    $ ./rnn_lm/train.py \
        --world-size 4 \
        --exp-dir ./rnn_lm/exp \
        --start-epoch 0 \
        --num-epochs 10 \
        --use-fp16 0 \
        --tie-weights 1 \
        --embedding-dim 2048 \
        --hidden-dim 2048 \
        --num-layers 3 \
        --batch-size 300 \
        --lm-data rnn_lm/data/lang_bpe_500/sorted_lm_data.pt \
        --lm-data-valid rnn_lm/data/lang_bpe_500/sorted_lm_data-valid.pt


.. note::

   You can adjust the RNNLM hyperparameters to control the size of the RNNLM,
   such as the embedding dimension and the hidden state dimension. For more details, please
   run ``./rnn_lm/train.py --help``.

.. note::

   Training the RNNLM can take a long time (usually a couple of days).

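For illustration, the validation-set preparation described in the tutorial above can be sketched
as follows. This is a minimal sketch, not part of the patch; the file names ``valid.txt``,
``lm-data-valid.pt``, ``sorted_lm_data-valid.pt`` and ``lm_data_stats-valid.txt`` are
placeholders you can choose freely:

.. code-block:: bash

    $ # tokenize and pack the validation text with the same BPE model
    $ ./local/prepare_lm_training_data.py \
        --bpe-model icefall-asr-librispeech-zipformer-2023-05-15/data/lang_bpe_500/bpe.model \
        --lm-data valid.txt \
        --lm-archive data/lang_bpe_500/lm-data-valid.pt

    $ # sort it by sentence length, just like the training data
    $ ./local/sort_lm_training_data.py \
        --in-lm-data data/lang_bpe_500/lm-data-valid.pt \
        --out-lm-data data/lang_bpe_500/sorted_lm_data-valid.pt \
        --out-statistics data/lang_bpe_500/lm_data_stats-valid.txt

The sorted archive is then what ``--lm-data-valid`` points to in the training command.
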
diff --git a/docs/source/recipes/index.rst b/docs/source/recipes/index.rst
index 63793275c..7265e1cf6 100644
--- a/docs/source/recipes/index.rst
+++ b/docs/source/recipes/index.rst
@@ -15,3 +15,4 @@ We may add recipes for other tasks as well in the future.

    Non-streaming-ASR/index
    Streaming-ASR/index
+   RNN-LM/index

From e17f884ace2dba7561d4d4eaaac6726234cad20f Mon Sep 17 00:00:00 2001
From: marcoyang1998 <45973641+marcoyang1998@users.noreply.github.com>
Date: Mon, 25 Sep 2023 15:36:40 +0800
Subject: [PATCH 2/3] Fix docs for MVQ (#1272)

* typo fix
---
 .../librispeech/distillation.rst                | 16 ++++++++--------
 egs/librispeech/ASR/distillation_with_hubert.sh |  2 ++
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/docs/source/recipes/Non-streaming-ASR/librispeech/distillation.rst b/docs/source/recipes/Non-streaming-ASR/librispeech/distillation.rst
index 2e8d0893a..37edf7de9 100644
--- a/docs/source/recipes/Non-streaming-ASR/librispeech/distillation.rst
+++ b/docs/source/recipes/Non-streaming-ASR/librispeech/distillation.rst
@@ -47,7 +47,7 @@ The data preparation contains several stages, you can use the following two options:

 - ``--stage``
-- ``--stop-stage``
+- ``--stop_stage``

 to control which stage(s) should be run. By default, all stages are executed.

 For example,

 .. code-block:: bash

   $ cd egs/librispeech/ASR
-  $ ./prepare.sh --stage 0 --stop-stage 0 # run only stage 0
-  $ ./prepare.sh --stage 2 --stop-stage 5 # run from stage 2 to stage 5
+  $ ./prepare.sh --stage 0 --stop_stage 0 # run only stage 0
+  $ ./prepare.sh --stage 2 --stop_stage 5 # run from stage 2 to stage 5

 .. HINT::

@@ -108,15 +108,15 @@ As usual, you can control the stages you want to run by specifying the following
 two options:

 - ``--stage``
-- ``--stop-stage``
+- ``--stop_stage``

 For example,

 .. code-block:: bash

   $ cd egs/librispeech/ASR
-  $ ./distillation_with_hubert.sh --stage 0 --stop-stage 0 # run only stage 0
-  $ ./distillation_with_hubert.sh --stage 2 --stop-stage 4 # run from stage 2 to stage 5
+  $ ./distillation_with_hubert.sh --stage 0 --stop_stage 0 # run only stage 0
+  $ ./distillation_with_hubert.sh --stage 2 --stop_stage 4 # run from stage 2 to stage 4

 Here are a few options in `./distillation_with_hubert.sh `_
 you need to know before you proceed.

@@ -134,7 +134,7 @@ and prepares MVQ-augmented training manifests.

 .. code-block:: bash

-   $ ./distillation_with_hubert.sh --stage 2 --stop-stage 2 # run only stage 2
+   $ ./distillation_with_hubert.sh --stage 2 --stop_stage 2 # run only stage 2

 Please see the following screenshot for the output of an example execution.

@@ -172,7 +172,7 @@ To perform training, please run stage 3 by executing the following command.

 .. code-block:: bash

-   $ ./prepare.sh --stage 3 --stop-stage 3 # run MVQ training
+   $ ./prepare.sh --stage 3 --stop_stage 3 # run MVQ training

 Here is the code snippet for training:

diff --git a/egs/librispeech/ASR/distillation_with_hubert.sh b/egs/librispeech/ASR/distillation_with_hubert.sh
index 6aaa0333b..a5b0b85af 100755
--- a/egs/librispeech/ASR/distillation_with_hubert.sh
+++ b/egs/librispeech/ASR/distillation_with_hubert.sh
@@ -56,6 +56,8 @@ use_extracted_codebook=True
 # "hubert_xtralarge_ll60k" -> pretrained model without fine-tuning
 teacher_model_id=hubert_xtralarge_ll60k_finetune_ls960

+. shared/parse_options.sh || exit 1
+
 log() {
   # This function is from espnet
   local fname=${BASH_SOURCE[1]##*/}

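The line added above sources icefall's ``shared/parse_options.sh``, the helper that turns
command-line flags such as ``--stage`` and ``--stop_stage`` into overrides of shell variables
defined earlier in the script. A minimal sketch of the pattern is shown below; the script name
``toy_script.sh`` is hypothetical, and it assumes the script is run from a directory that
contains ``shared/parse_options.sh`` (e.g. ``egs/librispeech/ASR``):

.. code-block:: bash

    #!/usr/bin/env bash
    # toy_script.sh (hypothetical): define the defaults *before* sourcing
    # parse_options.sh so that --stage / --stop_stage given on the command
    # line can override them.
    stage=0
    stop_stage=100

    . shared/parse_options.sh || exit 1

    echo "Running stages ${stage} to ${stop_stage}"

Invoking it as ``./toy_script.sh --stage 2 --stop_stage 4`` would then print
``Running stages 2 to 4``.
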
From 1b565dd25198f700bcfe88e86a0f6a435e11a429 Mon Sep 17 00:00:00 2001
From: zr_jin
Date: Tue, 26 Sep 2023 15:41:39 +0800
Subject: [PATCH 3/3] added softlinks to local dir (#1273)

---
 egs/tedlium3/ASR/conformer_ctc2/local              | 1 +
 egs/tedlium3/ASR/pruned_transducer_stateless/local | 1 +
 egs/tedlium3/ASR/transducer_stateless/local        | 1 +
 egs/tedlium3/ASR/zipformer/local                   | 1 +
 4 files changed, 4 insertions(+)
 create mode 120000 egs/tedlium3/ASR/conformer_ctc2/local
 create mode 120000 egs/tedlium3/ASR/pruned_transducer_stateless/local
 create mode 120000 egs/tedlium3/ASR/transducer_stateless/local
 create mode 120000 egs/tedlium3/ASR/zipformer/local

diff --git a/egs/tedlium3/ASR/conformer_ctc2/local b/egs/tedlium3/ASR/conformer_ctc2/local
new file mode 120000
index 000000000..c820590c5
--- /dev/null
+++ b/egs/tedlium3/ASR/conformer_ctc2/local
@@ -0,0 +1 @@
+../local
\ No newline at end of file

diff --git a/egs/tedlium3/ASR/pruned_transducer_stateless/local b/egs/tedlium3/ASR/pruned_transducer_stateless/local
new file mode 120000
index 000000000..c820590c5
--- /dev/null
+++ b/egs/tedlium3/ASR/pruned_transducer_stateless/local
@@ -0,0 +1 @@
+../local
\ No newline at end of file

diff --git a/egs/tedlium3/ASR/transducer_stateless/local b/egs/tedlium3/ASR/transducer_stateless/local
new file mode 120000
index 000000000..c820590c5
--- /dev/null
+++ b/egs/tedlium3/ASR/transducer_stateless/local
@@ -0,0 +1 @@
+../local
\ No newline at end of file

diff --git a/egs/tedlium3/ASR/zipformer/local b/egs/tedlium3/ASR/zipformer/local
new file mode 120000
index 000000000..c820590c5
--- /dev/null
+++ b/egs/tedlium3/ASR/zipformer/local
@@ -0,0 +1 @@
+../local
\ No newline at end of file

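For reference, the symlinks added by this patch point each recipe directory at the shared
``../local`` scripts of the TED-LIUM 3 recipe. A minimal sketch of how such a link is typically
created and checked (run from ``egs/tedlium3/ASR``; purely illustrative, not part of the patch):

.. code-block:: bash

    $ cd egs/tedlium3/ASR
    $ # create (or refresh) the relative symlink for one recipe directory;
    $ # the same command works for the other three directories touched by this patch
    $ ln -sfn ../local zipformer/local
    $ ls -l zipformer/local   # expected output ends with: local -> ../local
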