deploy: 11523c5b894f42ded965dcb974fef9a8a8122518

This commit is contained in:
marcoyang1998 2023-07-06 11:11:44 +00:00
parent 9d2d2225d7
commit 48b954a308
61 changed files with 1921 additions and 36 deletions


@ -0,0 +1,184 @@
.. _LODR:
LODR for RNN Transducer
=======================
As a type of E2E model, neural transducers are usually considered to have an internal
language model (ILM), which learns language-level information from the training corpus.
In real-life scenarios, there is often a mismatch between the training corpus and the target domain.
This mismatch can be a problem when decoding neural transducer models with external language models,
as the internal LM can act "against" the external LM. In this tutorial, we show how to use
`Low-order Density Ratio <https://arxiv.org/abs/2203.16776>`_ (LODR) to alleviate this effect and further improve the performance
of language model integration.
.. note::
This tutorial is based on the recipe
`pruned_transducer_stateless7_streaming <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming>`_,
which is a streaming transducer model trained on `LibriSpeech`_.
However, you can easily apply LODR to other recipes.
If you encounter any problems, please open an issue in `icefall <https://github.com/k2-fsa/icefall/issues>`__.
.. note::
For simplicity, the training and testing corpora in this tutorial are the same (`LibriSpeech`_). However,
you can change the testing set to any other domain (e.g. `GigaSpeech`_) and prepare the language models
using that corpus.
First, let's have a look at some background information. As the predecessor of LODR, Density Ratio (DR) was first proposed `here <https://arxiv.org/abs/2002.11268>`_
to address the language information mismatch between the training
corpus (source domain) and the testing corpus (target domain). Assuming that the source domain and the test domain
are acoustically similar, DR derives the following formula for decoding via Bayes' theorem:
.. math::
\text{score}\left(y_u|\mathit{x},y\right) =
\log p\left(y_u|\mathit{x},y_{1:u-1}\right) +
\lambda_1 \log p_{\text{Target LM}}\left(y_u|\mathit{x},y_{1:u-1}\right) -
\lambda_2 \log p_{\text{Source LM}}\left(y_u|\mathit{x},y_{1:u-1}\right)
where :math:`\lambda_1` and :math:`\lambda_2` are the weights of the LM scores for the target domain and the source domain respectively.
Here, the source domain LM is trained on the training corpus. The only difference in the above formula compared to
shallow fusion is the subtraction of the source domain LM, as shown below.
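For comparison, shallow fusion (see :ref:`shallow_fusion`) keeps only the first two terms, i.e. with LM weight :math:`\lambda` the score is

.. math::

    \text{score}\left(y_u|\mathit{x},y\right) =
    \log p\left(y_u|\mathit{x},y_{1:u-1}\right) +
    \lambda \log p_{\text{Target LM}}\left(y_u|\mathit{x},y_{1:u-1}\right)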
Some works treat the predictor and the joiner of the neural transducer as its internal LM. However, this internal LM is
considered weak and can only capture low-level language information. Therefore, `LODR <https://arxiv.org/abs/2203.16776>`__ proposes to use
a low-order n-gram LM as an approximation of the ILM of the neural transducer, leading to the following formula
during decoding for transducer models:
.. math::
\text{score}\left(y_u|\mathit{x},y\right) =
\log p_{rnnt}\left(y_u|\mathit{x},y_{1:u-1}\right) +
\lambda_1 \log p_{\text{Target LM}}\left(y_u|\mathit{x},y_{1:u-1}\right) -
\lambda_2 \log p_{\text{bi-gram}}\left(y_u|\mathit{x},y_{1:u-1}\right)
In LODR, an additional bi-gram LM estimated on the source domain (e.g. the training corpus) is required. Compared to DR,
the only difference lies in the choice of the source domain LM. According to the original `paper <https://arxiv.org/abs/2203.16776>`_,
LODR achieves similar performance to DR in both intra-domain and cross-domain settings.
As a bi-gram LM is much cheaper to evaluate, LODR is usually faster.
Now, we will show you how to use LODR in ``icefall``.
For illustration purposes, we will use a pre-trained ASR model from this `link <https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29>`_.
If you want to train your model from scratch, please have a look at :ref:`non_streaming_librispeech_pruned_transducer_stateless`.
The testing scenario here is intra-domain (we decode the model trained on `LibriSpeech`_ on `LibriSpeech`_ testing sets).
As the initial step, let's download the pre-trained model.
.. code-block:: bash
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29
$ pushd icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$ git lfs pull --include "pretrained.pt"
$ ln -s pretrained.pt epoch-99.pt # create a symbolic link so that the checkpoint can be loaded
To test the model, let's have a look at the decoding results **without** using an LM. This can be done via the following command:
.. code-block:: bash
$ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/
$ ./pruned_transducer_stateless7_streaming/decode.py \
--epoch 99 \
--avg 1 \
--use-averaged-model False \
--exp-dir $exp_dir \
--bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
--max-duration 600 \
--decode-chunk-len 32 \
--decoding-method modified_beam_search
The following WERs are achieved on test-clean and test-other:
.. code-block:: text
$ For test-clean, WER of different settings are:
$ beam_size_4 3.11 best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4 7.93 best for test-other
Then, we download the external language model and the bi-gram LM that are necessary for LODR.
Note that the bi-gram is estimated on the transcripts of the 960-hour LibriSpeech training data.
.. code-block:: bash
$ # download the external LM
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm
$ # create a symbolic link so that the checkpoint can be loaded
$ pushd icefall-librispeech-rnn-lm/exp
$ git lfs pull --include "pretrained.pt"
$ ln -s pretrained.pt epoch-99.pt
$ popd
$
$ # download the bi-gram
$ git lfs install
$ git clone https://huggingface.co/marcoyang/librispeech_bigram
$ pushd data/lang_bpe_500
$ ln -s ../../librispeech_bigram/2gram.fst.txt .
$ popd
Then, we perform LODR decoding by setting ``--decoding-method`` to ``modified_beam_search_lm_LODR``:
.. code-block:: bash
$ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$ lm_dir=./icefall-librispeech-rnn-lm/exp
$ lm_scale=0.42
$ LODR_scale=-0.24
$ ./pruned_transducer_stateless7_streaming/decode.py \
--epoch 99 \
--avg 1 \
--use-averaged-model False \
--beam-size 4 \
--exp-dir $exp_dir \
--max-duration 600 \
--decode-chunk-len 32 \
--decoding-method modified_beam_search_lm_LODR \
--bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
--use-shallow-fusion 1 \
--lm-type rnn \
--lm-exp-dir $lm_dir \
--lm-epoch 99 \
--lm-scale $lm_scale \
--lm-avg 1 \
--rnn-lm-embedding-dim 2048 \
--rnn-lm-hidden-dim 2048 \
--rnn-lm-num-layers 3 \
--lm-vocab-size 500 \
--tokens-ngram 2 \
--ngram-lm-scale $LODR_scale
There are two extra arguments that need to be given when doing LODR. ``--tokens-ngram`` specifies the order of the n-gram; as we
are using a bi-gram, we set it to 2. ``--ngram-lm-scale`` is the scale of the bi-gram LM; it should be a negative number,
as we are subtracting the bi-gram's score during decoding. A minimal sketch of how the scores are combined is given below.
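The following sketch shows how the three scores are combined for each candidate token during LODR decoding. It is an illustration only, not icefall's actual implementation; the helper name is hypothetical, and the default scales match the flags used above.

.. code-block:: python

    def lodr_token_score(
        log_p_rnnt: float,     # transducer log-prob of the candidate token
        log_p_lm: float,       # external (target-domain) LM log-prob
        log_p_bigram: float,   # source-domain bi-gram log-prob, approximating the ILM
        lm_scale: float = 0.42,         # --lm-scale
        ngram_lm_scale: float = -0.24,  # --ngram-lm-scale, negative so the bi-gram is subtracted
    ) -> float:
        # score = log p_rnnt + lm_scale * log p_LM + ngram_lm_scale * log p_bigram
        return log_p_rnnt + lm_scale * log_p_lm + ngram_lm_scale * log_p_bigram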
The decoding results obtained with the above command are shown below:
.. code-block:: text
$ For test-clean, WER of different settings are:
$ beam_size_4 2.61 best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4 6.74 best for test-other
Recall that the lowest WER we obtained in :ref:`shallow_fusion` with a beam size of 4 is ``2.77/7.08``; LODR
indeed **further improves** the WER. We can do even better if we increase ``--beam-size``:
.. list-table:: WER of LODR with different beam sizes
:widths: 25 25 25
:header-rows: 1
* - Beam size
- test-clean
- test-other
* - 4
- 2.61
- 6.74
* - 8
- 2.45
- 6.38
* - 12
- 2.4
- 6.23


@ -0,0 +1,12 @@
Decoding with language models
=============================
This section describes how to use external language models
during decoding to improve the WER of transducer models.
.. toctree::
:maxdepth: 2
shallow-fusion
LODR
rescoring


@ -0,0 +1,252 @@
.. _rescoring:
LM rescoring for Transducer
=================================
LM rescoring is a commonly used approach to incorporate external LM information. Unlike shallow-fusion-based
methods (see :ref:`shallow_fusion`, :ref:`LODR`), rescoring is usually performed to re-rank the n-best hypotheses after beam search.
Rescoring is usually more efficient than shallow fusion since less computation is performed on the external LM.
In this tutorial, we will show you how to use an external LM to rescore the n-best hypotheses decoded from neural transducer models in
`icefall <https://github.com/k2-fsa/icefall>`__.
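Concretely, writing ``--lm-scale`` as :math:`\lambda`, each complete hypothesis :math:`y` in the n-best list is re-ranked by

.. math::

    \text{score}\left(y\right) =
    \log p_{rnnt}\left(y|\mathit{x}\right) +
    \lambda \log p_{\text{LM}}\left(y\right)

and the hypothesis with the highest combined score is returned.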
.. note::
This tutorial is based on the recipe
`pruned_transducer_stateless7_streaming <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming>`_,
which is a streaming transducer model trained on `LibriSpeech`_.
However, you can easily apply LM rescoring to other recipes.
If you encounter any problems, please open an issue `here <https://github.com/k2-fsa/icefall/issues>`_.
.. note::
For simplicity, the training and testing corpora in this tutorial are the same (`LibriSpeech`_). However, you can change the testing set
to any other domain (e.g. `GigaSpeech`_) and use an external LM trained on that domain.
.. HINT::
We recommend using a GPU for decoding.
For illustration purposes, we will use a pre-trained ASR model from this `link <https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29>`__.
If you want to train your model from scratch, please have a look at :ref:`non_streaming_librispeech_pruned_transducer_stateless`.
As the initial step, let's download the pre-trained model.
.. code-block:: bash
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29
$ pushd icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$ git lfs pull --include "pretrained.pt"
$ ln -s pretrained.pt epoch-99.pt # create a symbolic link so that the checkpoint can be loaded
As usual, we first test the model's performance without an external LM. This can be done via the following command:
.. code-block:: bash
$ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/
$ ./pruned_transducer_stateless7_streaming/decode.py \
--epoch 99 \
--avg 1 \
--use-averaged-model False \
--exp-dir $exp_dir \
--bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
--max-duration 600 \
--decode-chunk-len 32 \
--decoding-method modified_beam_search
The following WERs are achieved on test-clean and test-other:
.. code-block:: text
$ For test-clean, WER of different settings are:
$ beam_size_4 3.11 best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4 7.93 best for test-other
Now, we will try to improve the above WER numbers via external LM rescoring. We will download
a pre-trained LM from this `link <https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm>`__.
.. note::
This is an RNN LM trained on the LibriSpeech text corpus, so it might not be ideal for other corpora.
You may also train an RNN LM from scratch. Please refer to this `script <https://github.com/k2-fsa/icefall/blob/master/icefall/rnn_lm/train.py>`__
for training an RNN LM and this `script <https://github.com/k2-fsa/icefall/blob/master/icefall/transformer_lm/train.py>`__ to train a transformer LM.
.. code-block:: bash
$ # download the external LM
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm
$ # create a symbolic link so that the checkpoint can be loaded
$ pushd icefall-librispeech-rnn-lm/exp
$ git lfs pull --include "pretrained.pt"
$ ln -s pretrained.pt epoch-99.pt
$ popd
With the RNN LM available, we can rescore the n-best hypotheses generated by ``modified_beam_search``. Here,
``n`` is the number of beams, i.e. ``--beam-size``. The command for LM rescoring is
as follows; a conceptual sketch of the rescoring step is given after the command. Note that ``--decoding-method`` is set to ``modified_beam_search_lm_rescore`` and ``--use-shallow-fusion``
is set to ``0``.
.. code-block:: bash
$ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$ lm_dir=./icefall-librispeech-rnn-lm/exp
$ lm_scale=0.43
$ ./pruned_transducer_stateless7_streaming/decode.py \
--epoch 99 \
--avg 1 \
--use-averaged-model False \
--beam-size 4 \
--exp-dir $exp_dir \
--max-duration 600 \
--decode-chunk-len 32 \
--decoding-method modified_beam_search_lm_rescore \
--bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
--use-shallow-fusion 0 \
--lm-type rnn \
--lm-exp-dir $lm_dir \
--lm-epoch 99 \
--lm-scale $lm_scale \
--lm-avg 1 \
--rnn-lm-embedding-dim 2048 \
--rnn-lm-hidden-dim 2048 \
--rnn-lm-num-layers 3 \
--lm-vocab-size 500
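Conceptually, the rescoring step looks like the following sketch (an illustration with hypothetical helper names, not icefall's actual implementation):

.. code-block:: python

    from typing import Callable, List, Tuple

    def rescore_nbest(
        nbest: List[Tuple[str, float]],       # (hypothesis, transducer log-prob)
        lm_log_prob: Callable[[str], float],  # external LM log-prob of a full hypothesis
        lm_scale: float = 0.43,               # --lm-scale
    ) -> str:
        # The external LM scores each complete hypothesis exactly once,
        # which is why rescoring is cheaper than shallow fusion.
        rescored = [(hyp, am + lm_scale * lm_log_prob(hyp)) for hyp, am in nbest]
        return max(rescored, key=lambda pair: pair[1])[0]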
The following WERs are achieved on test-clean and test-other:
.. code-block:: text
$ For test-clean, WER of different settings are:
$ beam_size_4 2.93 best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4 7.6 best for test-other
Great! We made some improvements! Increasing the size of the n-best hypothesis list will further boost the performance,
as shown in the following table:
.. list-table:: WERs of LM rescoring with different beam sizes
:widths: 25 25 25
:header-rows: 1
* - Beam size
- test-clean
- test-other
* - 4
- 2.93
- 7.6
* - 8
- 2.67
- 7.11
* - 12
- 2.59
- 6.86
In fact, we can also apply LODR (see :ref:`LODR`) when doing LM rescoring. To do so, we need to
download the bi-gram required by LODR:
.. code-block:: bash
$ # download the bi-gram
$ git lfs install
$ git clone https://huggingface.co/marcoyang/librispeech_bigram
$ pushd data/lang_bpe_500
$ ln -s ../../librispeech_bigram/2gram.arpa .
$ popd
Then we can perform LM rescoring + LODR by changing the decoding method to ``modified_beam_search_lm_rescore_LODR``.
.. note::
This decoding method requires the `kenlm <https://github.com/kpu/kenlm>`_ package. You can install it
via this command: ``pip install https://github.com/kpu/kenlm/archive/master.zip``.
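After installing ``kenlm``, you can quickly check that it loads the downloaded ARPA file. This is an optional sanity check, not part of the recipe; note that this LM operates on BPE tokens, not words, and the token string below is only a placeholder.

.. code-block:: python

    import kenlm

    model = kenlm.Model("data/lang_bpe_500/2gram.arpa")
    # kenlm returns log10 probabilities; unknown tokens map to <unk>.
    tokens = "▁THE ▁QUICK"  # example only; use real pieces from bpe.model
    print(model.score(tokens, bos=True, eos=True))

With the dependency in place, the full decoding command is: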
.. code-block:: bash
$ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$ lm_dir=./icefall-librispeech-rnn-lm/exp
$ lm_scale=0.43
$ ./pruned_transducer_stateless7_streaming/decode.py \
--epoch 99 \
--avg 1 \
--use-averaged-model False \
--beam-size 4 \
--exp-dir $exp_dir \
--max-duration 600 \
--decode-chunk-len 32 \
--decoding-method modified_beam_search_lm_rescore_LODR \
--bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
--use-shallow-fusion 0 \
--lm-type rnn \
--lm-exp-dir $lm_dir \
--lm-epoch 99 \
--lm-scale $lm_scale \
--lm-avg 1 \
--rnn-lm-embedding-dim 2048 \
--rnn-lm-hidden-dim 2048 \
--rnn-lm-num-layers 3 \
--lm-vocab-size 500
You should see the following WERs after executing the commands above:
.. code-block:: text
$ For test-clean, WER of different settings are:
$ beam_size_4 2.9 best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4 7.57 best for test-other
This is slightly better than LM rescoring alone. If we further increase the beam size, we will see
further improvements from LM rescoring + LODR:
.. list-table:: WERs of LM rescoring + LODR with different beam sizes
:widths: 25 25 25
:header-rows: 1
* - Beam size
- test-clean
- test-other
* - 4
- 2.9
- 7.57
* - 8
- 2.63
- 7.04
* - 12
- 2.52
- 6.73
As mentioned earlier, LM rescoring is usually faster than shallow-fusion-based methods.
Here, we benchmark their WERs and decoding speed:
.. list-table:: LM-rescoring-based methods vs shallow-fusion-based methods (each field shows the WER on test-clean, the WER on test-other, and the decoding time on test-clean)
:widths: 25 25 25 25
:header-rows: 1
* - Decoding method
- beam=4
- beam=8
- beam=12
* - `modified_beam_search`
- 3.11/7.93; 132s
- 3.1/7.95; 177s
- 3.1/7.96; 210s
* - `modified_beam_search_lm_shallow_fusion`
- 2.77/7.08; 262s
- 2.62/6.65; 352s
- 2.58/6.65; 488s
* - LODR
- 2.61/6.74; 400s
- 2.45/6.38; 610s
- 2.4/6.23; 870s
* - `modified_beam_search_lm_rescore`
- 2.93/7.6; 156s
- 2.67/7.11; 203s
- 2.59/6.86; 255s
* - `modified_beam_search_lm_rescore_LODR`
- 2.9/7.57; 160s
- 2.63/7.04; 203s
- 2.52/6.73; 263s
.. note::
Decoding is performed on a single 32GB V100 with ``--max-duration`` set to 600.
The decoding time here is only for reference; it may vary.


@ -0,0 +1,176 @@
.. _shallow_fusion:
Shallow fusion for Transducer
=================================
External language models (LM) are commonly used to improve WERs for E2E ASR models.
This tutorial shows you how to perform shallow fusion with an external LM
to improve the word error rate of a transducer model.
.. note::
This tutorial is based on the recipe
`pruned_transducer_stateless7_streaming <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming>`_,
which is a streaming transducer model trained on `LibriSpeech`_.
However, you can easily apply shallow fusion to other recipes.
If you encounter any problems, please open an issue in `icefall <https://github.com/k2-fsa/icefall/issues>`_.
.. note::
For simplicity, the training and testing corpora in this tutorial are the same (`LibriSpeech`_). However, you can change the testing set
to any other domain (e.g. `GigaSpeech`_) and use an external LM trained on that domain.
.. HINT::
We recommend using a GPU for decoding.
For illustration purposes, we will use a pre-trained ASR model from this `link <https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29>`__.
If you want to train your model from scratch, please have a look at :ref:`non_streaming_librispeech_pruned_transducer_stateless`.
As the initial step, let's download the pre-trained model.
.. code-block:: bash
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29
$ pushd icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$ git lfs pull --include "pretrained.pt"
$ ln -s pretrained.pt epoch-99.pt # create a symbolic link so that the checkpoint can be loaded
To test the model, let's have a look at the decoding results without using an LM. This can be done via the following command:
.. code-block:: bash
$ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/
$ ./pruned_transducer_stateless7_streaming/decode.py \
--epoch 99 \
--avg 1 \
--use-averaged-model False \
--exp-dir $exp_dir \
--bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
--max-duration 600 \
--decode-chunk-len 32 \
--decoding-method modified_beam_search
The following WERs are achieved on test-clean and test-other:
.. code-block:: text
$ For test-clean, WER of different settings are:
$ beam_size_4 3.11 best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4 7.93 best for test-other
These are already good numbers! But we can further improve them by using shallow fusion with an external LM.
Since training a language model usually takes a long time, we can instead download a pre-trained LM from this `link <https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm>`__.
.. code-block:: bash
$ # download the external LM
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm
$ # create a symbolic link so that the checkpoint can be loaded
$ pushd icefall-librispeech-rnn-lm/exp
$ git lfs pull --include "pretrained.pt"
$ ln -s pretrained.pt epoch-99.pt
$ popd
.. note::
This is an RNN LM trained on the LibriSpeech text corpus, so it might not be ideal for other corpora.
You may also train an RNN LM from scratch. Please refer to this `script <https://github.com/k2-fsa/icefall/blob/master/icefall/rnn_lm/train.py>`__
for training an RNN LM and this `script <https://github.com/k2-fsa/icefall/blob/master/icefall/transformer_lm/train.py>`__ to train a transformer LM.
To use shallow fusion for decoding, we can execute the following command:
.. code-block:: bash
$ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$ lm_dir=./icefall-librispeech-rnn-lm/exp
$ lm_scale=0.29
$ ./pruned_transducer_stateless7_streaming/decode.py \
--epoch 99 \
--avg 1 \
--use-averaged-model False \
--beam-size 4 \
--exp-dir $exp_dir \
--max-duration 600 \
--decode-chunk-len 32 \
--decoding-method modified_beam_search_lm_shallow_fusion \
--bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
--use-shallow-fusion 1 \
--lm-type rnn \
--lm-exp-dir $lm_dir \
--lm-epoch 99 \
--lm-scale $lm_scale \
--lm-avg 1 \
--rnn-lm-embedding-dim 2048 \
--rnn-lm-hidden-dim 2048 \
--rnn-lm-num-layers 3 \
--lm-vocab-size 500
Note that we set ``--decoding-method modified_beam_search_lm_shallow_fusion`` and ``--use-shallow-fusion 1``
to use shallow fusion. ``--lm-type`` specifies the type of neural LM we are going to use; you can choose
between ``rnn`` and ``transformer``. The following three arguments are associated with the RNN LM (an illustrative sketch of the model shape is given after the list):
- ``--rnn-lm-embedding-dim``
The embedding dimension of the RNN LM
- ``--rnn-lm-hidden-dim``
The hidden dimension of the RNN LM
- ``--rnn-lm-num-layers``
The number of RNN layers in the RNN LM.
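To illustrate what these dimensions correspond to, here is a minimal stand-in with the same shape (a sketch with a hypothetical class name; the actual icefall RNN LM class may differ):

.. code-block:: python

    import torch
    import torch.nn as nn

    class SketchRnnLm(nn.Module):
        def __init__(self, vocab_size=500, embedding_dim=2048,
                     hidden_dim=2048, num_layers=3):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embedding_dim)  # --rnn-lm-embedding-dim
            self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers,
                               batch_first=True)  # --rnn-lm-hidden-dim, --rnn-lm-num-layers
            self.output = nn.Linear(hidden_dim, vocab_size)  # --lm-vocab-size

        def forward(self, tokens: torch.Tensor) -> torch.Tensor:
            # tokens: (batch, seq_len) BPE token ids; returns per-step logits.
            out, _ = self.rnn(self.embedding(tokens))
            return self.output(out)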
The decoding results obtained with the above command are shown below.
.. code-block:: text
$ For test-clean, WER of different settings are:
$ beam_size_4 2.77 best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4 7.08 best for test-other
The improvement from shallow fusion is very obvious! The relative WER reduction on test-other is around 10.7%.
A few parameters can be tuned to further boost the performance of shallow fusion:
- ``--lm-scale``
Controls the scale of the LM. If too small, the external language model may not be fully utilized; if too large,
the LM score may dominate during decoding, leading to a bad WER. A typical value is around 0.3 (see the formula after this list).
- ``--beam-size``
The number of active paths in the search beam. It controls the trade-off between decoding efficiency and accuracy.
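To make the role of ``--lm-scale`` (denoted :math:`\lambda` below) concrete, recall that shallow fusion interpolates the two log-probabilities at every decoding step:

.. math::

    \text{score}\left(y_u|\mathit{x},y\right) =
    \log p_{rnnt}\left(y_u|\mathit{x},y_{1:u-1}\right) +
    \lambda \log p_{\text{LM}}\left(y_u|\mathit{x},y_{1:u-1}\right)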
Here, we also show how ``--beam-size`` affects the WER and decoding time:
.. list-table:: WERs and decoding time (on test-clean) of shallow fusion with different beam sizes
:widths: 25 25 25 25
:header-rows: 1
* - Beam size
- test-clean
- test-other
- Decoding time on test-clean (s)
* - 4
- 2.77
- 7.08
- 262
* - 8
- 2.62
- 6.65
- 352
* - 12
- 2.58
- 6.65
- 488
As we can see, a larger beam size during shallow fusion improves the WER, but decoding also becomes slower.


@ -34,3 +34,8 @@ speech recognition recipes using `k2 <https://github.com/k2-fsa/k2>`_.
contributing/index
huggingface/index
.. toctree::
:maxdepth: 2
decoding-with-langugage-models/index


@ -1,7 +1,7 @@
Distillation with HuBERT
========================
This tutorial shows you how to perform knowledge distillation in `icefall`_
This tutorial shows you how to perform knowledge distillation in `icefall <https://github.com/k2-fsa/icefall>`_
with the `LibriSpeech`_ dataset. The distillation method
used here is called "Multi Vector Quantization Knowledge Distillation" (MVQ-KD).
Please have a look at our paper `Predicting Multi-Codebook Vector Quantization Indexes for Knowledge Distillation <https://arxiv.org/abs/2211.00508>`_
@ -13,7 +13,7 @@ for more details about MVQ-KD.
`pruned_transducer_stateless4 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless4>`_.
Currently, we only implement MVQ-KD in this recipe. However, MVQ-KD is theoretically applicable to all recipes
with only minor changes needed. Feel free to try out MVQ-KD in different recipes. If you
encounter any problems, please open an issue here `icefall <https://github.com/k2-fsa/icefall/issues>`_.
encounter any problems, please open an issue here `icefall <https://github.com/k2-fsa/icefall/issues>`__.
.. note::
@ -217,7 +217,7 @@ the following command.
--exp-dir $exp_dir \
--enable-distillation True
You should get similar results as `here <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS-100hours.md#distillation-with-hubert>`_.
You should get similar results as `here <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS-100hours.md#distillation-with-hubert>`__.
That's all! Feel free to experiment with your own setups and report your results.
If you encounter any problems during training, please open up an issue `here <https://github.com/k2-fsa/icefall/issues>`_.
If you encounter any problems during training, please open up an issue `here <https://github.com/k2-fsa/icefall/issues>`__.


@ -8,10 +8,10 @@ with the `LibriSpeech <https://www.openslr.org/12>`_ dataset.
.. Note::
The tutorial is suitable for `pruned_transducer_stateless <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless>`_,
`pruned_transducer_stateless2 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless2>`_,
`pruned_transducer_stateless4 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless4>`_,
`pruned_transducer_stateless5 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless5>`_,
The tutorial is suitable for `pruned_transducer_stateless <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless>`__,
`pruned_transducer_stateless2 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless2>`__,
`pruned_transducer_stateless4 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless4>`__,
`pruned_transducer_stateless5 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless5>`__,
We will take pruned_transducer_stateless4 as an example in this tutorial.
.. HINT::
@ -237,7 +237,7 @@ them, please modify ``./pruned_transducer_stateless4/train.py`` directly.
.. NOTE::
The options for `pruned_transducer_stateless5 <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless5/train.py>`_ are a little different from
The options for `pruned_transducer_stateless5 <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless5/train.py>`__ are a little different from
other recipes. It allows you to configure ``--num-encoder-layers``, ``--dim-feedforward``, ``--nhead``, ``--encoder-dim``, ``--decoder-dim``, ``--joiner-dim`` from the command line, so that you can train models of different sizes with pruned_transducer_stateless5.
@ -529,13 +529,13 @@ Download pretrained models
If you don't want to train from scratch, you can download the pretrained models
by visiting the following links:
- `pruned_transducer_stateless <https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless-2022-03-12>`_
- `pruned_transducer_stateless <https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless-2022-03-12>`__
- `pruned_transducer_stateless2 <https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless2-2022-04-29>`_
- `pruned_transducer_stateless2 <https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless2-2022-04-29>`__
- `pruned_transducer_stateless4 <https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless4-2022-06-03>`_
- `pruned_transducer_stateless4 <https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless4-2022-06-03>`__
- `pruned_transducer_stateless5 <https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless5-2022-07-07>`_
- `pruned_transducer_stateless5 <https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless5-2022-07-07>`__
See `<https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md>`_
for the details of the above pretrained models.


@ -45,9 +45,9 @@ the input features.
We have three variants of Emformer models in ``icefall``.
- ``pruned_stateless_emformer_rnnt2`` using Emformer from torchaudio, see `LibriSpeech recipe <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_stateless_emformer_rnnt2>`_.
- ``pruned_stateless_emformer_rnnt2`` using Emformer from torchaudio, see `LibriSpeech recipe <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_stateless_emformer_rnnt2>`__.
- ``conv_emformer_transducer_stateless`` using ConvEmformer implemented by ourselves. Different from the Emformer in torchaudio,
ConvEmformer has a convolution in each layer and uses the mechanisms in our reworked conformer model.
See `LibriSpeech recipe <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/conv_emformer_transducer_stateless>`_.
See `LibriSpeech recipe <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/conv_emformer_transducer_stateless>`__.
- ``conv_emformer_transducer_stateless2`` using ConvEmformer implemented by ourselves. The only difference from the above one is that
it uses a simplified memory bank. See `LibriSpeech recipe <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/conv_emformer_transducer_stateless2>`_.


@ -6,10 +6,10 @@ with the `LibriSpeech <https://www.openslr.org/12>`_ dataset.
.. Note::
The tutorial is suitable for `pruned_transducer_stateless <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless>`_,
`pruned_transducer_stateless2 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless2>`_,
`pruned_transducer_stateless4 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless4>`_,
`pruned_transducer_stateless5 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless5>`_,
The tutorial is suitable for `pruned_transducer_stateless <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless>`__,
`pruned_transducer_stateless2 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless2>`__,
`pruned_transducer_stateless4 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless4>`__,
`pruned_transducer_stateless5 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless5>`__,
We will take pruned_transducer_stateless4 as an example in this tutorial.
.. HINT::
@ -264,7 +264,7 @@ them, please modify ``./pruned_transducer_stateless4/train.py`` directly.
.. NOTE::
The options for `pruned_transducer_stateless5 <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless5/train.py>`_ are a little different from
The options for `pruned_transducer_stateless5 <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless5/train.py>`__ are a little different from
other recipes. It allows you to configure ``--num-encoder-layers``, ``--dim-feedforward``, ``--nhead``, ``--encoder-dim``, ``--decoder-dim``, ``--joiner-dim`` from the command line, so that you can train models of different sizes with pruned_transducer_stateless5.


@ -6,7 +6,7 @@ with the `LibriSpeech <https://www.openslr.org/12>`_ dataset.
.. Note::
The tutorial is suitable for `pruned_transducer_stateless7_streaming <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming>`_,
The tutorial is suitable for `pruned_transducer_stateless7_streaming <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming>`__,
.. HINT::
@ -642,7 +642,7 @@ Download pretrained models
If you don't want to train from scratch, you can download the pretrained models
by visiting the following links:
- `pruned_transducer_stateless7_streaming <https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29>`_
- `pruned_transducer_stateless7_streaming <https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29>`__
See `<https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md>`_
for the details of the above pretrained models.

View File

@ -59,6 +59,9 @@
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>
</div>

View File

@ -59,6 +59,9 @@
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>
</div>

View File

@ -65,6 +65,9 @@
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>
</div>

View File

@ -59,6 +59,9 @@
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>
</div>

View File

@ -0,0 +1,292 @@
<!DOCTYPE html>
<html class="writer-html5" lang="en" >
<head>
<meta charset="utf-8" /><meta name="generator" content="Docutils 0.18.1: http://docutils.sourceforge.net/" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>LODR for RNN Transducer &mdash; icefall 0.1 documentation</title>
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../_static/css/theme.css" type="text/css" />
<!--[if lt IE 9]>
<script src="../_static/js/html5shiv.min.js"></script>
<![endif]-->
<script src="../_static/jquery.js"></script>
<script src="../_static/_sphinx_javascript_frameworks_compat.js"></script>
<script data-url_root="../" id="documentation_options" src="../_static/documentation_options.js"></script>
<script src="../_static/doctools.js"></script>
<script src="../_static/sphinx_highlight.js"></script>
<script async="async" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
<script src="../_static/js/theme.js"></script>
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="next" title="LM rescoring for Transducer" href="rescoring.html" />
<link rel="prev" title="Shallow fusion for Transducer" href="shallow-fusion.html" />
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search" >
<a href="../index.html" class="icon icon-home">
icefall
</a>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
<p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../installation/index.html">Installation</a></li>
<li class="toctree-l1"><a class="reference internal" href="../faqs.html">Frequently Asked Questions (FAQs)</a></li>
<li class="toctree-l1"><a class="reference internal" href="../model-export/index.html">Model export</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../recipes/index.html">Recipes</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul class="current">
<li class="toctree-l1 current"><a class="reference internal" href="index.html">Decoding with language models</a><ul class="current">
<li class="toctree-l2"><a class="reference internal" href="shallow-fusion.html">Shallow fusion for Transducer</a></li>
<li class="toctree-l2 current"><a class="current reference internal" href="#">LODR for RNN Transducer</a></li>
<li class="toctree-l2"><a class="reference internal" href="rescoring.html">LM rescoring for Transducer</a></li>
</ul>
</li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../index.html">icefall</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="Page navigation">
<ul class="wy-breadcrumbs">
<li><a href="../index.html" class="icon icon-home" aria-label="Home"></a></li>
<li class="breadcrumb-item"><a href="index.html">Decoding with language models</a></li>
<li class="breadcrumb-item active">LODR for RNN Transducer</li>
<li class="wy-breadcrumbs-aside">
<a href="https://github.com/k2-fsa/icefall/blob/master/docs/source/decoding-with-langugage-models/LODR.rst" class="fa fa-github"> Edit on GitHub</a>
</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<section id="lodr-for-rnn-transducer">
<span id="lodr"></span><h1>LODR for RNN Transducer<a class="headerlink" href="#lodr-for-rnn-transducer" title="Permalink to this heading"></a></h1>
<p>As a type of E2E model, neural transducers are usually considered as having an internal
language model, which learns the language level information on the training corpus.
In real-life scenario, there is often a mismatch between the training corpus and the target corpus space.
This mismatch can be a problem when decoding for neural transducer models with language models as its internal
language can act “against” the external LM. In this tutorial, we show how to use
<a class="reference external" href="https://arxiv.org/abs/2203.16776">Low-order Density Ratio</a> to alleviate this effect to further improve the performance
of langugae model integration.</p>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>This tutorial is based on the recipe
<a class="reference external" href="https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming">pruned_transducer_stateless7_streaming</a>,
which is a streaming transducer model trained on <a class="reference external" href="https://www.openslr.org/12">LibriSpeech</a>.
However, you can easily apply LODR to other recipes.
If you encounter any problems, please open an issue here <a class="reference external" href="https://github.com/k2-fsa/icefall/issues">icefall</a>.</p>
</div>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>For simplicity, the training and testing corpus in this tutorial are the same (<a class="reference external" href="https://www.openslr.org/12">LibriSpeech</a>). However,
you can change the testing set to any other domains (e.g <a class="reference external" href="https://github.com/SpeechColab/GigaSpeech">GigaSpeech</a>) and prepare the language models
using that corpus.</p>
</div>
<p>First, lets have a look at some background information. As the predecessor of LODR, Density Ratio (DR) is first proposed <a class="reference external" href="https://arxiv.org/abs/2002.11268">here</a>
to address the language information mismatch between the training
corpus (source domain) and the testing corpus (target domain). Assuming that the source domain and the test domain
are acoustically similar, DR derives the following formular for decoding with Bayes theorem:</p>
<div class="math notranslate nohighlight">
\[\text{score}\left(y_u|\mathit{x},y\right) =
\log p\left(y_u|\mathit{x},y_{1:u-1}\right) +
\lambda_1 \log p_{\text{Target LM}}\left(y_u|\mathit{x},y_{1:u-1}\right) -
\lambda_2 \log p_{\text{Source LM}}\left(y_u|\mathit{x},y_{1:u-1}\right)\]</div>
<p>where <span class="math notranslate nohighlight">\(\lambda_1\)</span> and <span class="math notranslate nohighlight">\(\lambda_2\)</span> are the weights of LM scores for target domain and source domain respectively.
Here, the source domain LM is trained on the training corpus. The only difference in the above formular compared to
shallow fusion is the subtraction of the source domain LM.</p>
<p>Some works treat the predictor and the joiner of the neural transducer as its internal LM. However, the LM is
considered to be weak and can only capture low-level language information. Therefore, <a class="reference external" href="https://arxiv.org/abs/2203.16776">LODR</a> proposed to use
a low-order n-gram LM as an approximation of the ILM of the neural transducer. This leads to the following formula
during decoding for transducer model:</p>
<div class="math notranslate nohighlight">
\[\text{score}\left(y_u|\mathit{x},y\right) =
\log p_{rnnt}\left(y_u|\mathit{x},y_{1:u-1}\right) +
\lambda_1 \log p_{\text{Target LM}}\left(y_u|\mathit{x},y_{1:u-1}\right) -
\lambda_2 \log p_{\text{bi-gram}}\left(y_u|\mathit{x},y_{1:u-1}\right)\]</div>
<p>In LODR, an additional bi-gram LM estimated on the source domain (e.g training corpus) is required. Comared to DR,
the only difference lies in the choice of source domain LM. According to the original <a class="reference external" href="https://arxiv.org/abs/2203.16776">paper</a>,
LODR achieves similar performance compared DR in both intra-domain and cross-domain settings.
As a bi-gram is much faster to evaluate, LODR is usually much faster.</p>
<p>Now, we will show you how to use LODR in <code class="docutils literal notranslate"><span class="pre">icefall</span></code>.
For illustration purpose, we will use a pre-trained ASR model from this <a class="reference external" href="https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29">link</a>.
If you want to train your model from scratch, please have a look at <a class="reference internal" href="../recipes/Non-streaming-ASR/librispeech/pruned_transducer_stateless.html#non-streaming-librispeech-pruned-transducer-stateless"><span class="std std-ref">Pruned transducer statelessX</span></a>.
The testing scenario here is intra-domain (we decode the model trained on <a class="reference external" href="https://www.openslr.org/12">LibriSpeech</a> on <a class="reference external" href="https://www.openslr.org/12">LibriSpeech</a> testing sets).</p>
<p>As the initial step, lets download the pre-trained model.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="nv">GIT_LFS_SKIP_SMUDGE</span><span class="o">=</span><span class="m">1</span><span class="w"> </span>git<span class="w"> </span>clone<span class="w"> </span>https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29
$<span class="w"> </span><span class="nb">pushd</span><span class="w"> </span>icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$<span class="w"> </span>git<span class="w"> </span>lfs<span class="w"> </span>pull<span class="w"> </span>--include<span class="w"> </span><span class="s2">&quot;pretrained.pt&quot;</span>
$<span class="w"> </span>ln<span class="w"> </span>-s<span class="w"> </span>pretrained.pt<span class="w"> </span>epoch-99.pt<span class="w"> </span><span class="c1"># create a symbolic link so that the checkpoint can be loaded</span>
</pre></div>
</div>
<p>To test the model, lets have a look at the decoding results <strong>without</strong> using LM. This can be done via the following command:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="nv">exp_dir</span><span class="o">=</span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/
$<span class="w"> </span>./pruned_transducer_stateless7_streaming/decode.py<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--epoch<span class="w"> </span><span class="m">99</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--avg<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--use-averaged-model<span class="w"> </span>False<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--exp-dir<span class="w"> </span><span class="nv">$exp_dir</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--bpe-model<span class="w"> </span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model
<span class="w"> </span>--max-duration<span class="w"> </span><span class="m">600</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--decode-chunk-len<span class="w"> </span><span class="m">32</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--decoding-method<span class="w"> </span>modified_beam_search
</pre></div>
</div>
<p>The following WERs are achieved on test-clean and test-other:</p>
<div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ For test-clean, WER of different settings are:
$ beam_size_4 3.11 best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4 7.93 best for test-other
</pre></div>
</div>
<p>Then, we download the external language model and bi-gram LM that are necessary for LODR.
Note that the bi-gram is estimated on the LibriSpeech 960 hours text.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="c1"># download the external LM</span>
$<span class="w"> </span><span class="nv">GIT_LFS_SKIP_SMUDGE</span><span class="o">=</span><span class="m">1</span><span class="w"> </span>git<span class="w"> </span>clone<span class="w"> </span>https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm
$<span class="w"> </span><span class="c1"># create a symbolic link so that the checkpoint can be loaded</span>
$<span class="w"> </span><span class="nb">pushd</span><span class="w"> </span>icefall-librispeech-rnn-lm/exp
$<span class="w"> </span>git<span class="w"> </span>lfs<span class="w"> </span>pull<span class="w"> </span>--include<span class="w"> </span><span class="s2">&quot;pretrained.pt&quot;</span>
$<span class="w"> </span>ln<span class="w"> </span>-s<span class="w"> </span>pretrained.pt<span class="w"> </span>epoch-99.pt
$<span class="w"> </span><span class="nb">popd</span>
$
$<span class="w"> </span><span class="c1"># download the bi-gram</span>
$<span class="w"> </span>git<span class="w"> </span>lfs<span class="w"> </span>install
$<span class="w"> </span>git<span class="w"> </span>clone<span class="w"> </span>https://huggingface.co/marcoyang/librispeech_bigram
$<span class="w"> </span><span class="nb">pushd</span><span class="w"> </span>data/lang_bpe_500
$<span class="w"> </span>ln<span class="w"> </span>-s<span class="w"> </span>../../librispeech_bigram/2gram.fst.txt<span class="w"> </span>.
$<span class="w"> </span><span class="nb">popd</span>
</pre></div>
</div>
<p>Then, we perform LODR decoding by setting <code class="docutils literal notranslate"><span class="pre">--decoding-method</span></code> to <code class="docutils literal notranslate"><span class="pre">modified_beam_search_lm_LODR</span></code>:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="nv">exp_dir</span><span class="o">=</span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$<span class="w"> </span><span class="nv">lm_dir</span><span class="o">=</span>./icefall-librispeech-rnn-lm/exp
$<span class="w"> </span><span class="nv">lm_scale</span><span class="o">=</span><span class="m">0</span>.42
$<span class="w"> </span><span class="nv">LODR_scale</span><span class="o">=</span>-0.24
$<span class="w"> </span>./pruned_transducer_stateless7_streaming/decode.py<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--epoch<span class="w"> </span><span class="m">99</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--avg<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--use-averaged-model<span class="w"> </span>False<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--beam-size<span class="w"> </span><span class="m">4</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--exp-dir<span class="w"> </span><span class="nv">$exp_dir</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--max-duration<span class="w"> </span><span class="m">600</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--decode-chunk-len<span class="w"> </span><span class="m">32</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--decoding-method<span class="w"> </span>modified_beam_search_lm_LODR<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--bpe-model<span class="w"> </span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model
<span class="w"> </span>--use-shallow-fusion<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-type<span class="w"> </span>rnn<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-exp-dir<span class="w"> </span><span class="nv">$lm_dir</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-epoch<span class="w"> </span><span class="m">99</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-scale<span class="w"> </span><span class="nv">$lm_scale</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-avg<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--rnn-lm-embedding-dim<span class="w"> </span><span class="m">2048</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--rnn-lm-hidden-dim<span class="w"> </span><span class="m">2048</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--rnn-lm-num-layers<span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-vocab-size<span class="w"> </span><span class="m">500</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--tokens-ngram<span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--ngram-lm-scale<span class="w"> </span><span class="nv">$LODR_scale</span>
</pre></div>
</div>
<p>There are two extra arguments that need to be given when doing LODR. <code class="docutils literal notranslate"><span class="pre">--tokens-ngram</span></code> specifies the order of n-gram. As we
are using a bi-gram, we set it to 2. <code class="docutils literal notranslate"><span class="pre">--ngram-lm-scale</span></code> is the scale of the bi-gram, it should be a negative number
as we are subtracting the bi-grams score during decoding.</p>
<p>The decoding results obtained with the above command are shown below:</p>
<div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ For test-clean, WER of different settings are:
$ beam_size_4 2.61 best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4 6.74 best for test-other
</pre></div>
</div>
<p>Recall that the lowest WER we obtained in <a class="reference internal" href="shallow-fusion.html#shallow-fusion"><span class="std std-ref">Shallow fusion for Transducer</span></a> with beam size of 4 is <code class="docutils literal notranslate"><span class="pre">2.77/7.08</span></code>, LODR
indeed <strong>further improves</strong> the WER. We can do even better if we increase <code class="docutils literal notranslate"><span class="pre">--beam-size</span></code>:</p>
<table class="docutils align-default" id="id1">
<caption><span class="caption-number">Table 2 </span><span class="caption-text">WER of LODR with different beam sizes</span><a class="headerlink" href="#id1" title="Permalink to this table"></a></caption>
<colgroup>
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 50%" />
</colgroup>
<thead>
<tr class="row-odd"><th class="head"><p>Beam size</p></th>
<th class="head"><p>test-clean</p></th>
<th class="head"><p>test-other</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p>4</p></td>
<td><p>2.61</p></td>
<td><p>6.74</p></td>
</tr>
<tr class="row-odd"><td><p>8</p></td>
<td><p>2.45</p></td>
<td><p>6.38</p></td>
</tr>
<tr class="row-even"><td><p>12</p></td>
<td><p>2.4</p></td>
<td><p>6.23</p></td>
</tr>
</tbody>
</table>
</section>
</div>
</div>
<footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
<a href="shallow-fusion.html" class="btn btn-neutral float-left" title="Shallow fusion for Transducer" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
<a href="rescoring.html" class="btn btn-neutral float-right" title="LM rescoring for Transducer" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
</div>
<hr/>
<div role="contentinfo">
<p>&#169; Copyright 2021, icefall development team.</p>
</div>
Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
<a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<script>
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</body>
</html>

View File

@ -0,0 +1,135 @@
<!DOCTYPE html>
<html class="writer-html5" lang="en" >
<head>
<meta charset="utf-8" /><meta name="generator" content="Docutils 0.18.1: http://docutils.sourceforge.net/" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Decoding with language models &mdash; icefall 0.1 documentation</title>
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../_static/css/theme.css" type="text/css" />
<!--[if lt IE 9]>
<script src="../_static/js/html5shiv.min.js"></script>
<![endif]-->
<script src="../_static/jquery.js"></script>
<script src="../_static/_sphinx_javascript_frameworks_compat.js"></script>
<script data-url_root="../" id="documentation_options" src="../_static/documentation_options.js"></script>
<script src="../_static/doctools.js"></script>
<script src="../_static/sphinx_highlight.js"></script>
<script src="../_static/js/theme.js"></script>
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="next" title="Shallow fusion for Transducer" href="shallow-fusion.html" />
<link rel="prev" title="Huggingface spaces" href="../huggingface/spaces.html" />
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search" >
<a href="../index.html" class="icon icon-home">
icefall
</a>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
<p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../installation/index.html">Installation</a></li>
<li class="toctree-l1"><a class="reference internal" href="../faqs.html">Frequently Asked Questions (FAQs)</a></li>
<li class="toctree-l1"><a class="reference internal" href="../model-export/index.html">Model export</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../recipes/index.html">Recipes</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul class="current">
<li class="toctree-l1 current"><a class="current reference internal" href="#">Decoding with language models</a><ul>
<li class="toctree-l2"><a class="reference internal" href="shallow-fusion.html">Shallow fusion for Transducer</a></li>
<li class="toctree-l2"><a class="reference internal" href="LODR.html">LODR for RNN Transducer</a></li>
<li class="toctree-l2"><a class="reference internal" href="rescoring.html">LM rescoring for Transducer</a></li>
</ul>
</li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../index.html">icefall</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="Page navigation">
<ul class="wy-breadcrumbs">
<li><a href="../index.html" class="icon icon-home" aria-label="Home"></a></li>
<li class="breadcrumb-item active">Decoding with language models</li>
<li class="wy-breadcrumbs-aside">
<a href="https://github.com/k2-fsa/icefall/blob/master/docs/source/decoding-with-langugage-models/index.rst" class="fa fa-github"> Edit on GitHub</a>
</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<section id="decoding-with-language-models">
<h1>Decoding with language models<a class="headerlink" href="#decoding-with-language-models" title="Permalink to this heading"></a></h1>
<p>This section describes how to use external language models
during decoding to improve the WER of transducer models.</p>
<div class="toctree-wrapper compound">
<ul>
<li class="toctree-l1"><a class="reference internal" href="shallow-fusion.html">Shallow fusion for Transducer</a></li>
<li class="toctree-l1"><a class="reference internal" href="LODR.html">LODR for RNN Transducer</a></li>
<li class="toctree-l1"><a class="reference internal" href="rescoring.html">LM rescoring for Transducer</a></li>
</ul>
</div>
</section>
</div>
</div>
<footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
<a href="../huggingface/spaces.html" class="btn btn-neutral float-left" title="Huggingface spaces" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
<a href="shallow-fusion.html" class="btn btn-neutral float-right" title="Shallow fusion for Transducer" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
</div>
<hr/>
<div role="contentinfo">
<p>&#169; Copyright 2021, icefall development team.</p>
</div>
Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
<a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<script>
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</body>
</html>

View File

@ -0,0 +1,386 @@
<!DOCTYPE html>
<html class="writer-html5" lang="en" >
<head>
<meta charset="utf-8" /><meta name="generator" content="Docutils 0.18.1: http://docutils.sourceforge.net/" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>LM rescoring for Transducer &mdash; icefall 0.1 documentation</title>
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../_static/css/theme.css" type="text/css" />
<!--[if lt IE 9]>
<script src="../_static/js/html5shiv.min.js"></script>
<![endif]-->
<script src="../_static/jquery.js"></script>
<script src="../_static/_sphinx_javascript_frameworks_compat.js"></script>
<script data-url_root="../" id="documentation_options" src="../_static/documentation_options.js"></script>
<script src="../_static/doctools.js"></script>
<script src="../_static/sphinx_highlight.js"></script>
<script src="../_static/js/theme.js"></script>
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="prev" title="LODR for RNN Transducer" href="LODR.html" />
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search" >
<a href="../index.html" class="icon icon-home">
icefall
</a>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
<p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../installation/index.html">Installation</a></li>
<li class="toctree-l1"><a class="reference internal" href="../faqs.html">Frequently Asked Questions (FAQs)</a></li>
<li class="toctree-l1"><a class="reference internal" href="../model-export/index.html">Model export</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../recipes/index.html">Recipes</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul class="current">
<li class="toctree-l1 current"><a class="reference internal" href="index.html">Decoding with language models</a><ul class="current">
<li class="toctree-l2"><a class="reference internal" href="shallow-fusion.html">Shallow fusion for Transducer</a></li>
<li class="toctree-l2"><a class="reference internal" href="LODR.html">LODR for RNN Transducer</a></li>
<li class="toctree-l2 current"><a class="current reference internal" href="#">LM rescoring for Transducer</a></li>
</ul>
</li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../index.html">icefall</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="Page navigation">
<ul class="wy-breadcrumbs">
<li><a href="../index.html" class="icon icon-home" aria-label="Home"></a></li>
<li class="breadcrumb-item"><a href="index.html">Decoding with language models</a></li>
<li class="breadcrumb-item active">LM rescoring for Transducer</li>
<li class="wy-breadcrumbs-aside">
<a href="https://github.com/k2-fsa/icefall/blob/master/docs/source/decoding-with-langugage-models/rescoring.rst" class="fa fa-github"> Edit on GitHub</a>
</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<section id="lm-rescoring-for-transducer">
<span id="rescoring"></span><h1>LM rescoring for Transducer<a class="headerlink" href="#lm-rescoring-for-transducer" title="Permalink to this heading"></a></h1>
<p>LM rescoring is a commonly used approach to incorporate external LM information. Unlike shallow-fusion-based
methods (see <a class="reference internal" href="shallow-fusion.html#shallow-fusion"><span class="std std-ref">Shallow fusion for Transducer</span></a>, <a class="reference internal" href="LODR.html#lodr"><span class="std std-ref">LODR for RNN Transducer</span></a>), rescoring is performed to re-rank the n-best hypotheses after beam search.
Rescoring is usually more efficient than shallow fusion, since the external LM only scores the n-best hypotheses instead of every partial hypothesis explored during the search.
In this tutorial, we will show you how to use an external LM to rescore the n-best hypotheses decoded from neural transducer models in
<a class="reference external" href="https://github.com/k2-fsa/icefall">icefall</a>.</p>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>This tutorial is based on the recipe
<a class="reference external" href="https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming">pruned_transducer_stateless7_streaming</a>,
which is a streaming transducer model trained on <a class="reference external" href="https://www.openslr.org/12">LibriSpeech</a>.
However, you can easily apply LM rescoring to other recipes.
If you encounter any problems, please open an issue <a class="reference external" href="https://github.com/k2-fsa/icefall/issues">here</a>.</p>
</div>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>For simplicity, the training and testing corpora in this tutorial are the same (<a class="reference external" href="https://www.openslr.org/12">LibriSpeech</a>). However, you can change the testing set
to any other domain (e.g. <a class="reference external" href="https://github.com/SpeechColab/GigaSpeech">GigaSpeech</a>) and use an external LM trained on that domain.</p>
</div>
<div class="admonition hint">
<p class="admonition-title">Hint</p>
<p>We recommend using a GPU for decoding.</p>
</div>
<p>For illustration purposes, we will use a pre-trained ASR model from this <a class="reference external" href="https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29">link</a>.
If you want to train your model from scratch, please have a look at <a class="reference internal" href="../recipes/Non-streaming-ASR/librispeech/pruned_transducer_stateless.html#non-streaming-librispeech-pruned-transducer-stateless"><span class="std std-ref">Pruned transducer statelessX</span></a>.</p>
<p>As the initial step, let's download the pre-trained model.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="nv">GIT_LFS_SKIP_SMUDGE</span><span class="o">=</span><span class="m">1</span><span class="w"> </span>git<span class="w"> </span>clone<span class="w"> </span>https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29
$<span class="w"> </span><span class="nb">pushd</span><span class="w"> </span>icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$<span class="w"> </span>git<span class="w"> </span>lfs<span class="w"> </span>pull<span class="w"> </span>--include<span class="w"> </span><span class="s2">&quot;pretrained.pt&quot;</span>
$<span class="w"> </span>ln<span class="w"> </span>-s<span class="w"> </span>pretrained.pt<span class="w"> </span>epoch-99.pt<span class="w"> </span><span class="c1"># create a symbolic link so that the checkpoint can be loaded</span>
</pre></div>
</div>
<p>As usual, we first test the model's performance without an external LM. This can be done via the following command:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="nv">exp_dir</span><span class="o">=</span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/
$<span class="w"> </span>./pruned_transducer_stateless7_streaming/decode.py<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--epoch<span class="w"> </span><span class="m">99</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--avg<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--use-averaged-model<span class="w"> </span>False<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--exp-dir<span class="w"> </span><span class="nv">$exp_dir</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--bpe-model<span class="w"> </span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model
<span class="w"> </span>--max-duration<span class="w"> </span><span class="m">600</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--decode-chunk-len<span class="w"> </span><span class="m">32</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--decoding-method<span class="w"> </span>modified_beam_search
</pre></div>
</div>
<p>The following WERs are achieved on test-clean and test-other:</p>
<div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ For test-clean, WER of different settings are:
$ beam_size_4 3.11 best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4 7.93 best for test-other
</pre></div>
</div>
<p>Now, we will try to improve the above WER numbers via external LM rescoring. We will download
a pre-trained LM from this <a class="reference external" href="https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm">link</a>.</p>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>This is an RNN LM trained on the LibriSpeech text corpus, so it might not be ideal for other corpora.
You may also train an RNN LM from scratch. Please refer to this <a class="reference external" href="https://github.com/k2-fsa/icefall/blob/master/icefall/rnn_lm/train.py">script</a>
for training an RNN LM and this <a class="reference external" href="https://github.com/k2-fsa/icefall/blob/master/icefall/transformer_lm/train.py">script</a> to train a transformer LM.</p>
</div>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="c1"># download the external LM</span>
$<span class="w"> </span><span class="nv">GIT_LFS_SKIP_SMUDGE</span><span class="o">=</span><span class="m">1</span><span class="w"> </span>git<span class="w"> </span>clone<span class="w"> </span>https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm
$<span class="w"> </span><span class="c1"># create a symbolic link so that the checkpoint can be loaded</span>
$<span class="w"> </span><span class="nb">pushd</span><span class="w"> </span>icefall-librispeech-rnn-lm/exp
$<span class="w"> </span>git<span class="w"> </span>lfs<span class="w"> </span>pull<span class="w"> </span>--include<span class="w"> </span><span class="s2">&quot;pretrained.pt&quot;</span>
$<span class="w"> </span>ln<span class="w"> </span>-s<span class="w"> </span>pretrained.pt<span class="w"> </span>epoch-99.pt
$<span class="w"> </span><span class="nb">popd</span>
</pre></div>
</div>
<p>With the RNN LM available, we can rescore the n-best hypotheses generated from <cite>modified_beam_search</cite>. Here,
<cite>n</cite> should be the number of beams, i.e. <code class="docutils literal notranslate"><span class="pre">--beam-size</span></code>. The command for LM rescoring is
as follows. Note that <code class="docutils literal notranslate"><span class="pre">--decoding-method</span></code> is set to <cite>modified_beam_search_lm_rescore</cite> and <code class="docutils literal notranslate"><span class="pre">--use-shallow-fusion</span></code>
is set to <cite>False</cite>.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="nv">exp_dir</span><span class="o">=</span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$<span class="w"> </span><span class="nv">lm_dir</span><span class="o">=</span>./icefall-librispeech-rnn-lm/exp
$<span class="w"> </span><span class="nv">lm_scale</span><span class="o">=</span><span class="m">0</span>.43
$<span class="w"> </span>./pruned_transducer_stateless7_streaming/decode.py<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--epoch<span class="w"> </span><span class="m">99</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--avg<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--use-averaged-model<span class="w"> </span>False<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--beam-size<span class="w"> </span><span class="m">4</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--exp-dir<span class="w"> </span><span class="nv">$exp_dir</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--max-duration<span class="w"> </span><span class="m">600</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--decode-chunk-len<span class="w"> </span><span class="m">32</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--decoding-method<span class="w"> </span>modified_beam_search_lm_rescore<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--bpe-model<span class="w"> </span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model
<span class="w"> </span>--use-shallow-fusion<span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-type<span class="w"> </span>rnn<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-exp-dir<span class="w"> </span><span class="nv">$lm_dir</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-epoch<span class="w"> </span><span class="m">99</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-scale<span class="w"> </span><span class="nv">$lm_scale</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-avg<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--rnn-lm-embedding-dim<span class="w"> </span><span class="m">2048</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--rnn-lm-hidden-dim<span class="w"> </span><span class="m">2048</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--rnn-lm-num-layers<span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-vocab-size<span class="w"> </span><span class="m">500</span>
</pre></div>
</div>
<div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ For test-clean, WER of different settings are:
$ beam_size_4 2.93 best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4 7.6 best for test-other
</pre></div>
</div>
<p>Great! We made some improvements! Increasing the size of the n-best list will further boost the performance,
as shown in the following table:</p>
<table class="docutils align-default" id="id1">
<caption><span class="caption-number">Table 3 </span><span class="caption-text">WERs of LM rescoring with different beam sizes</span><a class="headerlink" href="#id1" title="Permalink to this table"></a></caption>
<colgroup>
<col style="width: 33%" />
<col style="width: 33%" />
<col style="width: 33%" />
</colgroup>
<thead>
<tr class="row-odd"><th class="head"><p>Beam size</p></th>
<th class="head"><p>test-clean</p></th>
<th class="head"><p>test-other</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p>4</p></td>
<td><p>2.93</p></td>
<td><p>7.60</p></td>
</tr>
<tr class="row-odd"><td><p>8</p></td>
<td><p>2.67</p></td>
<td><p>7.11</p></td>
</tr>
<tr class="row-even"><td><p>12</p></td>
<td><p>2.59</p></td>
<td><p>6.86</p></td>
</tr>
</tbody>
</table>
<p>In fact, we can also apply LODR (see <a class="reference internal" href="LODR.html#lodr"><span class="std std-ref">LODR for RNN Transducer</span></a>) when doing LM rescoring. To do so, we need to
download the bi-gram required by LODR:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="c1"># download the bi-gram</span>
$<span class="w"> </span>git<span class="w"> </span>lfs<span class="w"> </span>install
$<span class="w"> </span>git<span class="w"> </span>clone<span class="w"> </span>https://huggingface.co/marcoyang/librispeech_bigram
$<span class="w"> </span><span class="nb">pushd</span><span class="w"> </span>data/lang_bpe_500
$<span class="w"> </span>ln<span class="w"> </span>-s<span class="w"> </span>../../librispeech_bigram/2gram.arpa<span class="w"> </span>.
$<span class="w"> </span><span class="nb">popd</span>
</pre></div>
</div>
<p>Then we can perform LM rescoring + LODR by changing the decoding method to <cite>modified_beam_search_lm_rescore_LODR</cite>.</p>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>This decoding method requires <a class="reference external" href="https://github.com/kpu/kenlm">kenlm</a> as a dependency. You can install it
via this command: <cite>pip install https://github.com/kpu/kenlm/archive/master.zip</cite>.</p>
</div>
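<p>For reference, the combined score in this method follows the LODR formula: the scaled bi-gram log-probability,
which approximates the transducer's internal LM, is subtracted from the sum of the transducer score and the scaled
neural LM score. A minimal sketch under the same illustrative assumptions as above (the scale values are
placeholders to be tuned):</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span># Sketch of LM rescoring + LODR (illustrative names, not the icefall API).
def rescore_nbest_lodr(hypotheses, rnn_lm, bigram, lm_scale=0.43, lodr_scale=0.16):
    def combined(hyp):
        token_ids, beam_score = hyp
        # Add the neural LM score and subtract the bi-gram score, which
        # serves as an approximation of the transducer's internal LM.
        return (beam_score
                + lm_scale * rnn_lm.log_prob(token_ids)
                - lodr_scale * bigram.log_prob(token_ids))
    return max(hypotheses, key=combined)[0]
</pre></div>
</div>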
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="nv">exp_dir</span><span class="o">=</span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$<span class="w"> </span><span class="nv">lm_dir</span><span class="o">=</span>./icefall-librispeech-rnn-lm/exp
$<span class="w"> </span><span class="nv">lm_scale</span><span class="o">=</span><span class="m">0</span>.43
$<span class="w"> </span>./pruned_transducer_stateless7_streaming/decode.py<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--epoch<span class="w"> </span><span class="m">99</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--avg<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--use-averaged-model<span class="w"> </span>False<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--beam-size<span class="w"> </span><span class="m">4</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--exp-dir<span class="w"> </span><span class="nv">$exp_dir</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--max-duration<span class="w"> </span><span class="m">600</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--decode-chunk-len<span class="w"> </span><span class="m">32</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--decoding-method<span class="w"> </span>modified_beam_search_lm_rescore_LODR<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--bpe-model<span class="w"> </span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model
<span class="w"> </span>--use-shallow-fusion<span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-type<span class="w"> </span>rnn<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-exp-dir<span class="w"> </span><span class="nv">$lm_dir</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-epoch<span class="w"> </span><span class="m">99</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-scale<span class="w"> </span><span class="nv">$lm_scale</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-avg<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--rnn-lm-embedding-dim<span class="w"> </span><span class="m">2048</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--rnn-lm-hidden-dim<span class="w"> </span><span class="m">2048</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--rnn-lm-num-layers<span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-vocab-size<span class="w"> </span><span class="m">500</span>
</pre></div>
</div>
<p>You should see the following WERs after executing the commands above:</p>
<div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ For test-clean, WER of different settings are:
$ beam_size_4 2.9 best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4 7.57 best for test-other
</pre></div>
</div>
<p>It's slightly better than LM rescoring alone. If we further increase the beam size, we will see
further improvements from LM rescoring + LODR:</p>
<table class="docutils align-default" id="id2">
<caption><span class="caption-number">Table 4 </span><span class="caption-text">WERs of LM rescoring + LODR with different beam sizes</span><a class="headerlink" href="#id2" title="Permalink to this table"></a></caption>
<colgroup>
<col style="width: 33%" />
<col style="width: 33%" />
<col style="width: 33%" />
</colgroup>
<thead>
<tr class="row-odd"><th class="head"><p>Beam size</p></th>
<th class="head"><p>test-clean</p></th>
<th class="head"><p>test-other</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p>4</p></td>
<td><p>2.90</p></td>
<td><p>7.57</p></td>
</tr>
<tr class="row-odd"><td><p>8</p></td>
<td><p>2.63</p></td>
<td><p>7.04</p></td>
</tr>
<tr class="row-even"><td><p>12</p></td>
<td><p>2.52</p></td>
<td><p>6.73</p></td>
</tr>
</tbody>
</table>
<p>As mentioned earlier, LM rescoring is usually faster than shallow-fusion-based methods.
Here, we benchmark their WERs and decoding speed:</p>
<table class="docutils align-default" id="id3">
<caption><span class="caption-number">Table 5 </span><span class="caption-text">LM-rescoring-based methods vs shallow-fusion-based methods (The numbers in each field is WER on test-clean, WER on test-other and decoding time on test-clean)</span><a class="headerlink" href="#id3" title="Permalink to this table"></a></caption>
<colgroup>
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
</colgroup>
<thead>
<tr class="row-odd"><th class="head"><p>Decoding method</p></th>
<th class="head"><p>beam=4</p></th>
<th class="head"><p>beam=8</p></th>
<th class="head"><p>beam=12</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p><cite>modified_beam_search</cite></p></td>
<td><p>3.11/7.93; 132s</p></td>
<td><p>3.10/7.95; 177s</p></td>
<td><p>3.10/7.96; 210s</p></td>
</tr>
<tr class="row-odd"><td><p><cite>modified_beam_search_lm_shallow_fusion</cite></p></td>
<td><p>2.77/7.08; 262s</p></td>
<td><p>2.62/6.65; 352s</p></td>
<td><p>2.58/6.65; 488s</p></td>
</tr>
<tr class="row-even"><td><p>LODR</p></td>
<td><p>2.61/6.74; 400s</p></td>
<td><p>2.45/6.38; 610s</p></td>
<td><p>2.40/6.23; 870s</p></td>
</tr>
<tr class="row-odd"><td><p><cite>modified_beam_search_lm_rescore</cite></p></td>
<td><p>2.93/7.60; 156s</p></td>
<td><p>2.67/7.11; 203s</p></td>
<td><p>2.59/6.86; 255s</p></td>
</tr>
<tr class="row-even"><td><p><cite>modified_beam_search_lm_rescore_LODR</cite></p></td>
<td><p>2.90/7.57; 160s</p></td>
<td><p>2.63/7.04; 203s</p></td>
<td><p>2.52/6.73; 263s</p></td>
</tr>
</tbody>
</table>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>Decoding is performed with a single 32G V100; we set <code class="docutils literal notranslate"><span class="pre">--max-duration</span></code> to 600.
Decoding times here are only for reference and may vary.</p>
</div>
</section>
</div>
</div>
<footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
<a href="LODR.html" class="btn btn-neutral float-left" title="LODR for RNN Transducer" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
</div>
<hr/>
<div role="contentinfo">
<p>&#169; Copyright 2021, icefall development team.</p>
</div>
Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
<a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<script>
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</body>
</html>

View File

@ -0,0 +1,296 @@
<!DOCTYPE html>
<html class="writer-html5" lang="en" >
<head>
<meta charset="utf-8" /><meta name="generator" content="Docutils 0.18.1: http://docutils.sourceforge.net/" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Shallow fusion for Transducer &mdash; icefall 0.1 documentation</title>
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../_static/css/theme.css" type="text/css" />
<!--[if lt IE 9]>
<script src="../_static/js/html5shiv.min.js"></script>
<![endif]-->
<script src="../_static/jquery.js"></script>
<script src="../_static/_sphinx_javascript_frameworks_compat.js"></script>
<script data-url_root="../" id="documentation_options" src="../_static/documentation_options.js"></script>
<script src="../_static/doctools.js"></script>
<script src="../_static/sphinx_highlight.js"></script>
<script src="../_static/js/theme.js"></script>
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="next" title="LODR for RNN Transducer" href="LODR.html" />
<link rel="prev" title="Decoding with language models" href="index.html" />
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search" >
<a href="../index.html" class="icon icon-home">
icefall
</a>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
<p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../installation/index.html">Installation</a></li>
<li class="toctree-l1"><a class="reference internal" href="../faqs.html">Frequently Asked Questions (FAQs)</a></li>
<li class="toctree-l1"><a class="reference internal" href="../model-export/index.html">Model export</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../recipes/index.html">Recipes</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul class="current">
<li class="toctree-l1 current"><a class="reference internal" href="index.html">Decoding with language models</a><ul class="current">
<li class="toctree-l2 current"><a class="current reference internal" href="#">Shallow fusion for Transducer</a></li>
<li class="toctree-l2"><a class="reference internal" href="LODR.html">LODR for RNN Transducer</a></li>
<li class="toctree-l2"><a class="reference internal" href="rescoring.html">LM rescoring for Transducer</a></li>
</ul>
</li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../index.html">icefall</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="Page navigation">
<ul class="wy-breadcrumbs">
<li><a href="../index.html" class="icon icon-home" aria-label="Home"></a></li>
<li class="breadcrumb-item"><a href="index.html">Decoding with language models</a></li>
<li class="breadcrumb-item active">Shallow fusion for Transducer</li>
<li class="wy-breadcrumbs-aside">
<a href="https://github.com/k2-fsa/icefall/blob/master/docs/source/decoding-with-langugage-models/shallow-fusion.rst" class="fa fa-github"> Edit on GitHub</a>
</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<section id="shallow-fusion-for-transducer">
<span id="shallow-fusion"></span><h1>Shallow fusion for Transducer<a class="headerlink" href="#shallow-fusion-for-transducer" title="Permalink to this heading"></a></h1>
<p>External language models (LMs) are commonly used to improve WERs for E2E ASR models.
This tutorial shows you how to perform <code class="docutils literal notranslate"><span class="pre">shallow</span> <span class="pre">fusion</span></code> with an external LM
to improve the word error rate of a transducer model.</p>
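<p>In shallow fusion, the external LM is queried at every step of beam search: its log-probability is scaled and
added to the ASR log-probability before the beam is pruned, so the LM influences which partial hypotheses survive.
The following minimal Python sketch shows the score combination for one decoding step; the names are illustrative
assumptions, not the actual icefall API:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span># One step of shallow fusion during beam search (illustrative, not the icefall API).
def fused_step_scores(asr_log_probs, lm_log_probs, lm_scale=0.29):
    # Both inputs are per-token log-probabilities for the current hypothesis;
    # the fused scores are used to rank and prune the beam at this step.
    return [asr + lm_scale * lm for asr, lm in zip(asr_log_probs, lm_log_probs)]
</pre></div>
</div>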
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>This tutorial is based on the recipe
<a class="reference external" href="https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming">pruned_transducer_stateless7_streaming</a>,
which is a streaming transducer model trained on <a class="reference external" href="https://www.openslr.org/12">LibriSpeech</a>.
However, you can easily apply shallow fusion to other recipes.
If you encounter any problems, please open an issue <a class="reference external" href="https://github.com/k2-fsa/icefall/issues">here</a>.</p>
</div>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>For simplicity, the training and testing corpora in this tutorial are the same (<a class="reference external" href="https://www.openslr.org/12">LibriSpeech</a>). However, you can change the testing set
to any other domain (e.g. <a class="reference external" href="https://github.com/SpeechColab/GigaSpeech">GigaSpeech</a>) and use an external LM trained on that domain.</p>
</div>
<div class="admonition hint">
<p class="admonition-title">Hint</p>
<p>We recommend using a GPU for decoding.</p>
</div>
<p>For illustration purposes, we will use a pre-trained ASR model from this <a class="reference external" href="https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29">link</a>.
If you want to train your model from scratch, please have a look at <a class="reference internal" href="../recipes/Non-streaming-ASR/librispeech/pruned_transducer_stateless.html#non-streaming-librispeech-pruned-transducer-stateless"><span class="std std-ref">Pruned transducer statelessX</span></a>.</p>
<p>As the initial step, let's download the pre-trained model.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="nv">GIT_LFS_SKIP_SMUDGE</span><span class="o">=</span><span class="m">1</span><span class="w"> </span>git<span class="w"> </span>clone<span class="w"> </span>https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29
$<span class="w"> </span><span class="nb">pushd</span><span class="w"> </span>icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$<span class="w"> </span>git<span class="w"> </span>lfs<span class="w"> </span>pull<span class="w"> </span>--include<span class="w"> </span><span class="s2">&quot;pretrained.pt&quot;</span>
$<span class="w"> </span>ln<span class="w"> </span>-s<span class="w"> </span>pretrained.pt<span class="w"> </span>epoch-99.pt<span class="w"> </span><span class="c1"># create a symbolic link so that the checkpoint can be loaded</span>
</pre></div>
</div>
<p>To test the model, let's have a look at the decoding results without using an LM. This can be done via the following command:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="nv">exp_dir</span><span class="o">=</span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/
$<span class="w"> </span>./pruned_transducer_stateless7_streaming/decode.py<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--epoch<span class="w"> </span><span class="m">99</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--avg<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--use-averaged-model<span class="w"> </span>False<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--exp-dir<span class="w"> </span><span class="nv">$exp_dir</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--bpe-model<span class="w"> </span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model
<span class="w"> </span>--max-duration<span class="w"> </span><span class="m">600</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--decode-chunk-len<span class="w"> </span><span class="m">32</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--decoding-method<span class="w"> </span>modified_beam_search
</pre></div>
</div>
<p>The following WERs are achieved on test-clean and test-other:</p>
<div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ For test-clean, WER of different settings are:
$ beam_size_4 3.11 best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4 7.93 best for test-other
</pre></div>
</div>
<p>These are already good numbers! But we can further improve them by using shallow fusion with an external LM.
Since training a language model usually takes a long time, we will download a pre-trained LM from this <a class="reference external" href="https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm">link</a>.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="c1"># download the external LM</span>
$<span class="w"> </span><span class="nv">GIT_LFS_SKIP_SMUDGE</span><span class="o">=</span><span class="m">1</span><span class="w"> </span>git<span class="w"> </span>clone<span class="w"> </span>https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm
$<span class="w"> </span><span class="c1"># create a symbolic link so that the checkpoint can be loaded</span>
$<span class="w"> </span><span class="nb">pushd</span><span class="w"> </span>icefall-librispeech-rnn-lm/exp
$<span class="w"> </span>git<span class="w"> </span>lfs<span class="w"> </span>pull<span class="w"> </span>--include<span class="w"> </span><span class="s2">&quot;pretrained.pt&quot;</span>
$<span class="w"> </span>ln<span class="w"> </span>-s<span class="w"> </span>pretrained.pt<span class="w"> </span>epoch-99.pt
$<span class="w"> </span><span class="nb">popd</span>
</pre></div>
</div>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>This is an RNN LM trained on the LibriSpeech text corpus, so it might not be ideal for other corpora.
You may also train an RNN LM from scratch. Please refer to this <a class="reference external" href="https://github.com/k2-fsa/icefall/blob/master/icefall/rnn_lm/train.py">script</a>
for training an RNN LM and this <a class="reference external" href="https://github.com/k2-fsa/icefall/blob/master/icefall/transformer_lm/train.py">script</a> to train a transformer LM.</p>
</div>
<p>To use shallow fusion for decoding, we can execute the following command:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="nv">exp_dir</span><span class="o">=</span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$<span class="w"> </span><span class="nv">lm_dir</span><span class="o">=</span>./icefall-librispeech-rnn-lm/exp
$<span class="w"> </span><span class="nv">lm_scale</span><span class="o">=</span><span class="m">0</span>.29
$<span class="w"> </span>./pruned_transducer_stateless7_streaming/decode.py<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--epoch<span class="w"> </span><span class="m">99</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--avg<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--use-averaged-model<span class="w"> </span>False<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--beam-size<span class="w"> </span><span class="m">4</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--exp-dir<span class="w"> </span><span class="nv">$exp_dir</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--max-duration<span class="w"> </span><span class="m">600</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--decode-chunk-len<span class="w"> </span><span class="m">32</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--decoding-method<span class="w"> </span>modified_beam_search_lm_shallow_fusion<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--bpe-model<span class="w"> </span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model
<span class="w"> </span>--use-shallow-fusion<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-type<span class="w"> </span>rnn<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-exp-dir<span class="w"> </span><span class="nv">$lm_dir</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-epoch<span class="w"> </span><span class="m">99</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-scale<span class="w"> </span><span class="nv">$lm_scale</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-avg<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--rnn-lm-embedding-dim<span class="w"> </span><span class="m">2048</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--rnn-lm-hidden-dim<span class="w"> </span><span class="m">2048</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--rnn-lm-num-layers<span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-vocab-size<span class="w"> </span><span class="m">500</span>
</pre></div>
</div>
<p>Note that we set <code class="docutils literal notranslate"><span class="pre">--decoding-method</span> <span class="pre">modified_beam_search_lm_shallow_fusion</span></code> and <code class="docutils literal notranslate"><span class="pre">--use-shallow-fusion</span> <span class="pre">True</span></code>
to use shallow fusion. <code class="docutils literal notranslate"><span class="pre">--lm-type</span></code> specifies the type of neural LM we are going to use; you can choose
between <code class="docutils literal notranslate"><span class="pre">rnn</span></code> and <code class="docutils literal notranslate"><span class="pre">transformer</span></code>. The following three arguments configure the RNN LM (a sketch of such a model follows the list):</p>
<ul class="simple">
<li><dl class="simple">
<dt><code class="docutils literal notranslate"><span class="pre">--rnn-lm-embedding-dim</span></code></dt><dd><p>The embedding dimension of the RNN LM</p>
</dd>
</dl>
</li>
<li><dl class="simple">
<dt><code class="docutils literal notranslate"><span class="pre">--rnn-lm-hidden-dim</span></code></dt><dd><p>The hidden dimension of the RNN LM</p>
</dd>
</dl>
</li>
<li><dl class="simple">
<dt><code class="docutils literal notranslate"><span class="pre">--rnn-lm-num-layers</span></code></dt><dd><p>The number of RNN layers in the RNN LM.</p>
</dd>
</dl>
</li>
</ul>
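<p>Together with <code class="docutils literal notranslate"><span class="pre">--lm-vocab-size</span></code>, these dimensions fully determine the shape of a simple
embedding-LSTM-projection language model. The following PyTorch sketch shows such an architecture; it is an
illustrative assumption, and the actual icefall RNN LM may differ in details:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span>import torch.nn as nn

# Sketch of an RNN LM with the dimensions used above (an illustrative
# assumption; the actual icefall RNN LM may differ in details).
class SimpleRnnLm(nn.Module):
    def __init__(self, vocab_size=500, embedding_dim=2048,
                 hidden_dim=2048, num_layers=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim,
                           num_layers=num_layers, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len) token ids; returns per-token logits
        embedded = self.embedding(tokens)
        hidden, _ = self.rnn(embedded)
        return self.output(hidden)
</pre></div>
</div>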
<p>The decoding results obtained with the above command are shown below.</p>
<div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ For test-clean, WER of different settings are:
$ beam_size_4 2.77 best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4 7.08 best for test-other
</pre></div>
</div>
<p>The improvement from shallow fusion is very obvious! The relative WER reduction on test-other is around 10.7%.
A few parameters can be tuned to further boost the performance of shallow fusion:</p>
<ul>
<li><p><code class="docutils literal notranslate"><span class="pre">--lm-scale</span></code></p>
<blockquote>
<div><p>Controls the scale of the LM. If too small, the external language model may not be fully utilized; if too large,
the LM score may dominate during decoding, leading to a worse WER. A typical value is around 0.3.</p>
</div></blockquote>
</li>
<li><p><code class="docutils literal notranslate"><span class="pre">--beam-size</span></code></p>
<blockquote>
<div><p>The number of active paths in the search beam. It controls the trade-off between decoding efficiency and accuracy.</p>
</div></blockquote>
</li>
</ul>
<p>Here, we also show how <cite>--beam-size</cite> affects the WER and decoding time:</p>
<table class="docutils align-default" id="id2">
<caption><span class="caption-number">Table 1 </span><span class="caption-text">WERs and decoding time (on test-clean) of shallow fusion with different beam sizes</span><a class="headerlink" href="#id2" title="Permalink to this table"></a></caption>
<colgroup>
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
</colgroup>
<thead>
<tr class="row-odd"><th class="head"><p>Beam size</p></th>
<th class="head"><p>test-clean</p></th>
<th class="head"><p>test-other</p></th>
<th class="head"><p>Decoding time on test-clean (s)</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p>4</p></td>
<td><p>2.77</p></td>
<td><p>7.08</p></td>
<td><p>262</p></td>
</tr>
<tr class="row-odd"><td><p>8</p></td>
<td><p>2.62</p></td>
<td><p>6.65</p></td>
<td><p>352</p></td>
</tr>
<tr class="row-even"><td><p>12</p></td>
<td><p>2.58</p></td>
<td><p>6.65</p></td>
<td><p>488</p></td>
</tr>
</tbody>
</table>
<p>As we see, a larger beam size during shallow fusion improves the WER, but also increases the decoding time.</p>
</section>
</div>
</div>
<footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
<a href="index.html" class="btn btn-neutral float-left" title="Decoding with language models" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
<a href="LODR.html" class="btn btn-neutral float-right" title="LODR for RNN Transducer" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
</div>
<hr/>
<div role="contentinfo">
<p>&#169; Copyright 2021, icefall development team.</p>
</div>
Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
<a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<script>
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</body>
</html>

View File

@ -59,6 +59,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>
</div>

View File

@ -51,6 +51,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>
</div>

View File

@ -58,6 +58,9 @@
<li class="toctree-l2"><a class="reference internal" href="spaces.html">Huggingface spaces</a></li>
</ul>
</li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>
</div>

View File

@ -58,6 +58,9 @@
<li class="toctree-l2"><a class="reference internal" href="spaces.html">Huggingface spaces</a></li>
</ul>
</li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>
</div>

View File

@ -19,6 +19,7 @@
<script src="../_static/js/theme.js"></script>
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="next" title="Decoding with language models" href="../decoding-with-langugage-models/index.html" />
<link rel="prev" title="Pre-trained models" href="pretrained-models.html" />
</head>
@ -60,6 +61,9 @@
</li>
</ul>
</li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>
</div>
@ -144,6 +148,7 @@ the following YouTube channel by <a class="reference external" href="https://www
</div>
<footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
<a href="pretrained-models.html" class="btn btn-neutral float-left" title="Pre-trained models" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
<a href="../decoding-with-langugage-models/index.html" class="btn btn-neutral float-right" title="Decoding with language models" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
</div>
<hr/>

View File

@ -53,6 +53,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>
</div>
@ -148,6 +151,16 @@ speech recognition recipes using <a class="reference external" href="https://git
</li>
</ul>
</div>
<div class="toctree-wrapper compound">
<ul>
<li class="toctree-l1"><a class="reference internal" href="decoding-with-langugage-models/index.html">Decoding with language models</a><ul>
<li class="toctree-l2"><a class="reference internal" href="decoding-with-langugage-models/shallow-fusion.html">Shallow fusion for Transducer</a></li>
<li class="toctree-l2"><a class="reference internal" href="decoding-with-langugage-models/LODR.html">LODR for RNN Transducer</a></li>
<li class="toctree-l2"><a class="reference internal" href="decoding-with-langugage-models/rescoring.html">LM rescoring for Transducer</a></li>
</ul>
</li>
</ul>
</div>
</section>

View File

@ -76,6 +76,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>
</div>

View File

@ -67,6 +67,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>
</div>

View File

@ -75,6 +75,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>
</div>

View File

@ -75,6 +75,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>
</div>

View File

@ -74,6 +74,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>
</div>

View File

@ -66,6 +66,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>
</div>

View File

@ -68,6 +68,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>
</div>

View File

@ -66,6 +66,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>
</div>

View File

@ -66,6 +66,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>
</div>

View File

@ -61,6 +61,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>
</div>

Binary file not shown.

View File

@ -72,6 +72,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>
</div>
@ -103,7 +106,7 @@
<section id="distillation-with-hubert">
<h1>Distillation with HuBERT<a class="headerlink" href="#distillation-with-hubert" title="Permalink to this heading"></a></h1>
<p>This tutorial shows you how to perform knowledge distillation in <a href="#id7"><span class="problematic" id="id8">`icefall`_</span></a>
<p>This tutorial shows you how to perform knowledge distillation in <a class="reference external" href="https://github.com/k2-fsa/icefall">icefall</a>
with the <a class="reference external" href="https://www.openslr.org/12">LibriSpeech</a> dataset. The distillation method
used here is called “Multi Vector Quantization Knowledge Distillation” (MVQ-KD).
Please have a look at our paper <a class="reference external" href="https://arxiv.org/abs/2211.00508">Predicting Multi-Codebook Vector Quantization Indexes for Knowledge Distillation</a>
@ -119,7 +122,7 @@ encounter any problems, please open an issue here <a class="reference external"
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>We assume you have read the page <a class="reference internal" href="../../../installation/index.html#install-icefall"><span class="std std-ref">Installation</span></a> and have set up
the environment for <a href="#id9"><span class="problematic" id="id10">`icefall`_</span></a>.</p>
the environment for <a class="reference external" href="https://github.com/k2-fsa/icefall">icefall</a>.</p>
</div>
<div class="admonition hint">
<p class="admonition-title">Hint</p>
@ -190,7 +193,7 @@ run <a class="reference external" href="https://github.com/k2-fsa/icefall/blob/m
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>There are 5 stages in total; the first and second stages will be skipped automatically
when choosing to downloaded codebook indexes prepared by <a href="#id11"><span class="problematic" id="id12">`icefall`_</span></a>.
when choosing to download codebook indexes prepared by <a class="reference external" href="https://github.com/k2-fsa/icefall">icefall</a>.
Of course, you can extract and compute the codebook indexes yourself. This
requires downloading a HuBERT-XL model, and the extraction of codebook
indexes can take a while.</p>
@ -226,10 +229,10 @@ and prepares MVQ-augmented training manifests.</p>
</div>
<p>Please see the
following screenshot for the output of an example execution.</p>
<figure class="align-center" id="id5">
<figure class="align-center" id="id4">
<a class="reference internal image-reference" href="../../../_images/distillation_codebook.png"><img alt="Downloading codebook indexes and preparing training manifest." src="../../../_images/distillation_codebook.png" style="width: 800px;" /></a>
<figcaption>
<p><span class="caption-number">Fig. 6 </span><span class="caption-text">Downloading codebook indexes and preparing training manifest.</span><a class="headerlink" href="#id5" title="Permalink to this image"></a></p>
<p><span class="caption-number">Fig. 6 </span><span class="caption-text">Downloading codebook indexes and preparing training manifest.</span><a class="headerlink" href="#id4" title="Permalink to this image"></a></p>
</figcaption>
</figure>
<div class="admonition hint">
@ -241,10 +244,10 @@ set <code class="docutils literal notranslate"><span class="pre">use_extracted_c
<code class="docutils literal notranslate"><span class="pre">num_codebooks</span></code> by yourself.</p>
</div>
<p>Now, you should see the following files under the directory <code class="docutils literal notranslate"><span class="pre">./data/vq_fbank_layer36_cb8</span></code>.</p>
<figure class="align-center" id="id6">
<figure class="align-center" id="id5">
<a class="reference internal image-reference" href="../../../_images/distillation_directory.png"><img alt="MVQ-augmented training manifests" src="../../../_images/distillation_directory.png" style="width: 800px;" /></a>
<figcaption>
<p><span class="caption-number">Fig. 7 </span><span class="caption-text">MVQ-augmented training manifests.</span><a class="headerlink" href="#id6" title="Permalink to this image"></a></p>
<p><span class="caption-number">Fig. 7 </span><span class="caption-text">MVQ-augmented training manifests.</span><a class="headerlink" href="#id5" title="Permalink to this image"></a></p>
</figcaption>
</figure>
<p>Voilà! You are now ready to perform knowledge distillation training!</p>
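As a quick orientation for readers of this diff, here is a minimal invocation sketch of the distillation recipe. The script name is the one this tutorial references; the ``--stage``/``--stop-stage`` flags and the stage numbers are assumptions based on the usual icefall recipe convention, not taken from this commit.

.. code-block:: bash

   # A sketch, assuming the common icefall parse_options.sh convention;
   # the stage numbers below are illustrative, not verbatim from the recipe.
   cd egs/librispeech/ASR
   # Stages 1-2 (codebook index extraction) are skipped automatically when
   # the codebook indexes prepared by icefall are downloaded instead.
   ./distillation_with_hubert.sh --stage 3 --stop-stage 5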

View File

@ -72,6 +72,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>
</div>
@ -381,10 +384,10 @@ $<span class="w"> </span>tensorboard<span class="w"> </span>dev<span class="w">
<p>Note there is a URL in the above output. Click it and you will see
the following screenshot:</p>
<blockquote>
<div><figure class="align-center" id="id9">
<div><figure class="align-center" id="id4">
<a class="reference external image-reference" href="https://tensorboard.dev/experiment/QOGSPBgsR8KzcRMmie9JGw/"><img alt="TensorBoard screenshot" src="../../../_images/librispeech-pruned-transducer-tensorboard-log.jpg" style="width: 600px;" /></a>
<figcaption>
<p><span class="caption-number">Fig. 5 </span><span class="caption-text">TensorBoard screenshot.</span><a class="headerlink" href="#id9" title="Permalink to this image"></a></p>
<p><span class="caption-number">Fig. 5 </span><span class="caption-text">TensorBoard screenshot.</span><a class="headerlink" href="#id4" title="Permalink to this image"></a></p>
</figcaption>
</figure>
</div></blockquote>
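The URL in the training output comes from uploading the TensorBoard logs with ``tensorboard dev upload``; a hedged example follows, where the log directory is a hypothetical path for illustration rather than one taken from this commit.

.. code-block:: bash

   # Hypothetical experiment directory; point --logdir at your own exp/tensorboard.
   tensorboard dev upload \
     --logdir pruned_transducer_stateless/exp/tensorboard \
     --description "icefall pruned transducer training logs"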

View File

@ -67,6 +67,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>
</div>
@ -380,10 +383,10 @@ $<span class="w"> </span>tensorboard<span class="w"> </span>dev<span class="w">
<p>Note there is a URL in the above output. Click it and you will see
the following screenshot:</p>
<blockquote>
<div><figure class="align-center" id="id4">
<div><figure class="align-center" id="id5">
<a class="reference external image-reference" href="https://tensorboard.dev/experiment/lzGnETjwRxC3yghNMd4kPw/"><img alt="TensorBoard screenshot" src="../../../_images/librispeech-lstm-transducer-tensorboard-log.png" style="width: 600px;" /></a>
<figcaption>
<p><span class="caption-number">Fig. 10 </span><span class="caption-text">TensorBoard screenshot.</span><a class="headerlink" href="#id4" title="Permalink to this image"></a></p>
<p><span class="caption-number">Fig. 10 </span><span class="caption-text">TensorBoard screenshot.</span><a class="headerlink" href="#id5" title="Permalink to this image"></a></p>
</figcaption>
</figure>
</div></blockquote>

View File

@ -67,6 +67,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>
</div>
@ -400,10 +403,10 @@ $<span class="w"> </span>tensorboard<span class="w"> </span>dev<span class="w">
<p>Note there is a URL in the above output. Click it and you will see
the following screenshot:</p>
<blockquote>
<div><figure class="align-center" id="id10">
<div><figure class="align-center" id="id5">
<a class="reference external image-reference" href="https://tensorboard.dev/experiment/97VKXf80Ru61CnP2ALWZZg/"><img alt="TensorBoard screenshot" src="../../../_images/streaming-librispeech-pruned-transducer-tensorboard-log.jpg" style="width: 600px;" /></a>
<figcaption>
<p><span class="caption-number">Fig. 9 </span><span class="caption-text">TensorBoard screenshot.</span><a class="headerlink" href="#id10" title="Permalink to this image"></a></p>
<p><span class="caption-number">Fig. 9 </span><span class="caption-text">TensorBoard screenshot.</span><a class="headerlink" href="#id5" title="Permalink to this image"></a></p>
</figcaption>
</figure>
</div></blockquote>


File diff suppressed because one or more lines are too long