mirror of https://github.com/k2-fsa/icefall.git, synced 2025-12-11 06:55:27 +00:00
deploy: 11523c5b894f42ded965dcb974fef9a8a8122518
parent 9d2d2225d7, commit 48b954a308

184  _sources/decoding-with-langugage-models/LODR.rst.txt  (new file)
@@ -0,0 +1,184 @@

.. _LODR:

LODR for RNN Transducer
=======================

As a type of E2E model, neural transducers are usually considered to have an internal
language model, which learns language-level information from the training corpus.
In real-life scenarios, there is often a mismatch between the training corpus and the target corpus.
This mismatch can be a problem when decoding a neural transducer model with an external language model, as the internal
language model can act "against" the external LM. In this tutorial, we show how to use
`Low-order Density Ratio <https://arxiv.org/abs/2203.16776>`_ (LODR) to alleviate this effect and further improve the performance
of language model integration.

.. note::

   This tutorial is based on the recipe
   `pruned_transducer_stateless7_streaming <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming>`_,
   which is a streaming transducer model trained on `LibriSpeech`_.
   However, you can easily apply LODR to other recipes.
   If you encounter any problems, please open an issue at `icefall <https://github.com/k2-fsa/icefall/issues>`__.

.. note::

   For simplicity, the training and testing corpora in this tutorial are the same (`LibriSpeech`_). However,
   you can change the testing set to any other domain (e.g. `GigaSpeech`_) and prepare the language models
   using that corpus.

First, let's have a look at some background information. As the predecessor of LODR, Density Ratio (DR) was first proposed `here <https://arxiv.org/abs/2002.11268>`_
to address the language information mismatch between the training
corpus (source domain) and the testing corpus (target domain). Assuming that the source domain and the test domain
are acoustically similar, DR derives the following formula for decoding with Bayes' theorem:

.. math::

   \text{score}\left(y_u|\mathit{x},y\right) =
   \log p\left(y_u|\mathit{x},y_{1:u-1}\right) +
   \lambda_1 \log p_{\text{Target LM}}\left(y_u|\mathit{x},y_{1:u-1}\right) -
   \lambda_2 \log p_{\text{Source LM}}\left(y_u|\mathit{x},y_{1:u-1}\right)

where :math:`\lambda_1` and :math:`\lambda_2` are the weights of the LM scores for the target domain and the source domain, respectively.
Here, the source domain LM is trained on the training corpus. Compared to
shallow fusion, the only difference in the above formula is the subtraction of the source domain LM.

Some works treat the predictor and the joiner of the neural transducer as its internal LM. However, this LM is
considered to be weak and can only capture low-level language information. Therefore, `LODR <https://arxiv.org/abs/2203.16776>`__ proposed to use
a low-order n-gram LM as an approximation of the ILM of the neural transducer. This leads to the following formula
during decoding for transducer models:

.. math::

   \text{score}\left(y_u|\mathit{x},y\right) =
   \log p_{rnnt}\left(y_u|\mathit{x},y_{1:u-1}\right) +
   \lambda_1 \log p_{\text{Target LM}}\left(y_u|\mathit{x},y_{1:u-1}\right) -
   \lambda_2 \log p_{\text{bi-gram}}\left(y_u|\mathit{x},y_{1:u-1}\right)

In LODR, an additional bi-gram LM estimated on the source domain (e.g. the training corpus) is required. Compared to DR,
the only difference lies in the choice of the source domain LM. According to the original `paper <https://arxiv.org/abs/2203.16776>`_,
LODR achieves performance similar to DR in both intra-domain and cross-domain settings.
As a bi-gram is much cheaper to evaluate, LODR is usually much faster.
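
To make the formulas above concrete, here is a minimal Python sketch of how the per-token scores could be combined inside beam search. The function name, scales, and toy distributions are illustrative only, not icefall's actual implementation; DR uses the same combination with a full source-domain LM in place of the bi-gram.

.. code-block:: python

   import numpy as np

   def lodr_score(logp_rnnt, logp_target_lm, logp_bigram,
                  lambda1=0.42, lambda2=0.24):
       """Per-token LODR score: transducer + scaled target LM - scaled bi-gram.

       Each argument is an array of shape (vocab_size,) holding
       log p(y_u | x, y_{1:u-1}) under the respective model.  Note that the
       tutorial below passes --ngram-lm-scale=-0.24, i.e. -lambda2.
       """
       return logp_rnnt + lambda1 * logp_target_lm - lambda2 * logp_bigram

   # toy uniform distributions over a 500-token vocabulary
   uniform = np.full(500, np.log(1.0 / 500))
   scores = lodr_score(uniform, uniform, uniform)
   best_token = int(np.argmax(scores))  # token chosen to extend this hypothesis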

Now, we will show you how to use LODR in ``icefall``.
For illustration purposes, we will use a pre-trained ASR model from this `link <https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29>`_.
If you want to train your model from scratch, please have a look at :ref:`non_streaming_librispeech_pruned_transducer_stateless`.
The testing scenario here is intra-domain (we decode a model trained on `LibriSpeech`_ on the `LibriSpeech`_ test sets).

As the initial step, let's download the pre-trained model.

.. code-block:: bash

   $ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29
   $ pushd icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
   $ git lfs pull --include "pretrained.pt"
   $ ln -s pretrained.pt epoch-99.pt  # create a symbolic link so that the checkpoint can be loaded
   $ popd  # the decoding commands below are run from the original directory

To test the model, let's have a look at the decoding results **without** using the LM. This can be done via the following command:

.. code-block:: bash

   $ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/
   $ ./pruned_transducer_stateless7_streaming/decode.py \
       --epoch 99 \
       --avg 1 \
       --use-averaged-model False \
       --exp-dir $exp_dir \
       --bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
       --max-duration 600 \
       --decode-chunk-len 32 \
       --decoding-method modified_beam_search

The following WERs are achieved on test-clean and test-other:

.. code-block:: text

   For test-clean, WER of different settings are:
   beam_size_4 3.11 best for test-clean
   For test-other, WER of different settings are:
   beam_size_4 7.93 best for test-other

Then, we download the external language model and the bi-gram LM that are necessary for LODR.
Note that the bi-gram is estimated on the LibriSpeech 960-hour text.

.. code-block:: bash

   $ # download the external LM
   $ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm
   $ # create a symbolic link so that the checkpoint can be loaded
   $ pushd icefall-librispeech-rnn-lm/exp
   $ git lfs pull --include "pretrained.pt"
   $ ln -s pretrained.pt epoch-99.pt
   $ popd
   $
   $ # download the bi-gram
   $ git lfs install
   $ git clone https://huggingface.co/marcoyang/librispeech_bigram
   $ pushd data/lang_bpe_500
   $ ln -s ../../librispeech_bigram/2gram.fst.txt .
   $ popd

Then, we perform LODR decoding by setting ``--decoding-method`` to ``modified_beam_search_lm_LODR``:

.. code-block:: bash

   $ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
   $ lm_dir=./icefall-librispeech-rnn-lm/exp
   $ lm_scale=0.42
   $ LODR_scale=-0.24
   $ ./pruned_transducer_stateless7_streaming/decode.py \
       --epoch 99 \
       --avg 1 \
       --use-averaged-model False \
       --beam-size 4 \
       --exp-dir $exp_dir \
       --max-duration 600 \
       --decode-chunk-len 32 \
       --decoding-method modified_beam_search_lm_LODR \
       --bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
       --use-shallow-fusion 1 \
       --lm-type rnn \
       --lm-exp-dir $lm_dir \
       --lm-epoch 99 \
       --lm-scale $lm_scale \
       --lm-avg 1 \
       --rnn-lm-embedding-dim 2048 \
       --rnn-lm-hidden-dim 2048 \
       --rnn-lm-num-layers 3 \
       --lm-vocab-size 500 \
       --tokens-ngram 2 \
       --ngram-lm-scale $LODR_scale

There are two extra arguments that need to be given when doing LODR. ``--tokens-ngram`` specifies the order of the n-gram; as we
are using a bi-gram, we set it to 2. ``--ngram-lm-scale`` is the scale of the bi-gram; it should be a negative number,
as we are subtracting the bi-gram's score during decoding.

The decoding results obtained with the above command are shown below:

.. code-block:: text

   For test-clean, WER of different settings are:
   beam_size_4 2.61 best for test-clean
   For test-other, WER of different settings are:
   beam_size_4 6.74 best for test-other

Recall that the lowest WER we obtained in :ref:`shallow_fusion` with a beam size of 4 is ``2.77/7.08``; LODR
indeed **further improves** the WER. We can do even better if we increase ``--beam-size``:

.. list-table:: WER of LODR with different beam sizes
   :widths: 25 25 25
   :header-rows: 1

   * - Beam size
     - test-clean
     - test-other
   * - 4
     - 2.61
     - 6.74
   * - 8
     - 2.45
     - 6.38
   * - 12
     - 2.4
     - 6.23

12  _sources/decoding-with-langugage-models/index.rst.txt  (new file)
@@ -0,0 +1,12 @@

Decoding with language models
=============================

This section describes how to use external language models
during decoding to improve the WER of transducer models.

.. toctree::
   :maxdepth: 2

   shallow-fusion
   LODR
   rescoring

252  _sources/decoding-with-langugage-models/rescoring.rst.txt  (new file)
@@ -0,0 +1,252 @@

.. _rescoring:

LM rescoring for Transducer
=================================

LM rescoring is a commonly used approach to incorporate external LM information. Unlike shallow-fusion-based
methods (see :ref:`shallow_fusion`, :ref:`LODR`), rescoring is usually performed to re-rank the n-best hypotheses after beam search.
Rescoring is usually more efficient than shallow fusion, since less computation is performed on the external LM.
In this tutorial, we will show you how to use an external LM to rescore the n-best hypotheses decoded from neural transducer models in
`icefall <https://github.com/k2-fsa/icefall>`__.
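
The idea can be sketched in a few lines of Python. Assume beam search has already produced the ``n`` best complete hypotheses, each with a total transducer log-score; the external LM then scores each full hypothesis once, and the list is re-ranked. The names and the scale below are illustrative only, not icefall's actual implementation.

.. code-block:: python

   from dataclasses import dataclass
   from typing import Callable, List

   @dataclass
   class Hypothesis:
       tokens: List[int]   # decoded token sequence
       am_score: float     # total transducer (beam-search) log-score

   def rescore_nbest(hyps: List[Hypothesis],
                     lm_logprob: Callable[[List[int]], float],
                     lm_scale: float = 0.43) -> List[Hypothesis]:
       """Re-rank n-best hypotheses with an external LM.

       ``lm_logprob`` returns the total LM log-probability of a token
       sequence; the best hypothesis maximizes
       am_score + lm_scale * lm_logprob(tokens).
       """
       return sorted(
           hyps,
           key=lambda h: h.am_score + lm_scale * lm_logprob(h.tokens),
           reverse=True,
       )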

.. note::

   This tutorial is based on the recipe
   `pruned_transducer_stateless7_streaming <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming>`_,
   which is a streaming transducer model trained on `LibriSpeech`_.
   However, you can easily apply LM rescoring to other recipes.
   If you encounter any problems, please open an issue `here <https://github.com/k2-fsa/icefall/issues>`_.

.. note::

   For simplicity, the training and testing corpora in this tutorial are the same (`LibriSpeech`_). However, you can change the testing set
   to any other domain (e.g. `GigaSpeech`_) and use an external LM trained on that domain.

.. HINT::

   We recommend using a GPU for decoding.

For illustration purposes, we will use a pre-trained ASR model from this `link <https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29>`__.
If you want to train your model from scratch, please have a look at :ref:`non_streaming_librispeech_pruned_transducer_stateless`.

As the initial step, let's download the pre-trained model.

.. code-block:: bash

   $ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29
   $ pushd icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
   $ git lfs pull --include "pretrained.pt"
   $ ln -s pretrained.pt epoch-99.pt  # create a symbolic link so that the checkpoint can be loaded
   $ popd  # the decoding commands below are run from the original directory

As usual, we first test the model's performance without the external LM. This can be done via the following command:

.. code-block:: bash

   $ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/
   $ ./pruned_transducer_stateless7_streaming/decode.py \
       --epoch 99 \
       --avg 1 \
       --use-averaged-model False \
       --exp-dir $exp_dir \
       --bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
       --max-duration 600 \
       --decode-chunk-len 32 \
       --decoding-method modified_beam_search

The following WERs are achieved on test-clean and test-other:

.. code-block:: text

   For test-clean, WER of different settings are:
   beam_size_4 3.11 best for test-clean
   For test-other, WER of different settings are:
   beam_size_4 7.93 best for test-other

Now, we will try to improve the above WER numbers via external LM rescoring. We will download
a pre-trained LM from this `link <https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm>`__.

.. note::

   This is an RNN LM trained on the LibriSpeech text corpus, so it might not be ideal for other corpora.
   You may also train an RNN LM from scratch. Please refer to this `script <https://github.com/k2-fsa/icefall/blob/master/icefall/rnn_lm/train.py>`__
   for training an RNN LM and this `script <https://github.com/k2-fsa/icefall/blob/master/icefall/transformer_lm/train.py>`__ to train a transformer LM.

.. code-block:: bash

   $ # download the external LM
   $ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm
   $ # create a symbolic link so that the checkpoint can be loaded
   $ pushd icefall-librispeech-rnn-lm/exp
   $ git lfs pull --include "pretrained.pt"
   $ ln -s pretrained.pt epoch-99.pt
   $ popd

With the RNNLM available, we can rescore the n-best hypotheses generated by ``modified_beam_search``. Here,
``n`` is the number of beams, i.e. ``--beam-size``. The command for LM rescoring is
as follows. Note that ``--decoding-method`` is set to ``modified_beam_search_lm_rescore`` and ``--use-shallow-fusion``
is set to ``False``.

.. code-block:: bash

   $ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
   $ lm_dir=./icefall-librispeech-rnn-lm/exp
   $ lm_scale=0.43
   $ ./pruned_transducer_stateless7_streaming/decode.py \
       --epoch 99 \
       --avg 1 \
       --use-averaged-model False \
       --beam-size 4 \
       --exp-dir $exp_dir \
       --max-duration 600 \
       --decode-chunk-len 32 \
       --decoding-method modified_beam_search_lm_rescore \
       --bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
       --use-shallow-fusion 0 \
       --lm-type rnn \
       --lm-exp-dir $lm_dir \
       --lm-epoch 99 \
       --lm-scale $lm_scale \
       --lm-avg 1 \
       --rnn-lm-embedding-dim 2048 \
       --rnn-lm-hidden-dim 2048 \
       --rnn-lm-num-layers 3 \
       --lm-vocab-size 500

The following WERs are achieved on test-clean and test-other:

.. code-block:: text

   For test-clean, WER of different settings are:
   beam_size_4 2.93 best for test-clean
   For test-other, WER of different settings are:
   beam_size_4 7.6 best for test-other

Great! We made some improvements. Increasing the size of the n-best hypotheses will further boost the performance,
see the following table:

.. list-table:: WERs of LM rescoring with different beam sizes
   :widths: 25 25 25
   :header-rows: 1

   * - Beam size
     - test-clean
     - test-other
   * - 4
     - 2.93
     - 7.6
   * - 8
     - 2.67
     - 7.11
   * - 12
     - 2.59
     - 6.86

In fact, we can also apply LODR (see :ref:`LODR`) when doing LM rescoring. To do so, we need to
download the bi-gram required by LODR:

.. code-block:: bash

   $ # download the bi-gram
   $ git lfs install
   $ git clone https://huggingface.co/marcoyang/librispeech_bigram
   $ pushd data/lang_bpe_500
   $ ln -s ../../librispeech_bigram/2gram.arpa .
   $ popd

Then we can perform LM rescoring + LODR by changing the decoding method to ``modified_beam_search_lm_rescore_LODR``.

.. note::

   This decoding method requires `kenlm <https://github.com/kpu/kenlm>`_ as a dependency. You can install it
   via this command: ``pip install https://github.com/kpu/kenlm/archive/master.zip``.
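
After installing it, you can sanity-check that the bi-gram loads correctly. Below is a minimal sketch of the ``kenlm`` Python API; note that icefall applies the n-gram to BPE token sequences internally, so the example sentence is only there to illustrate the calls.

.. code-block:: python

   import kenlm

   # load the 2-gram ARPA model downloaded above
   model = kenlm.Model("data/lang_bpe_500/2gram.arpa")
   print(model.order)  # -> 2

   # total log10-probability of a space-separated token sequence,
   # including begin-of-sentence and end-of-sentence context
   print(model.score("HELLO WORLD", bos=True, eos=True))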

.. code-block:: bash

   $ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
   $ lm_dir=./icefall-librispeech-rnn-lm/exp
   $ lm_scale=0.43
   $ ./pruned_transducer_stateless7_streaming/decode.py \
       --epoch 99 \
       --avg 1 \
       --use-averaged-model False \
       --beam-size 4 \
       --exp-dir $exp_dir \
       --max-duration 600 \
       --decode-chunk-len 32 \
       --decoding-method modified_beam_search_lm_rescore_LODR \
       --bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
       --use-shallow-fusion 0 \
       --lm-type rnn \
       --lm-exp-dir $lm_dir \
       --lm-epoch 99 \
       --lm-scale $lm_scale \
       --lm-avg 1 \
       --rnn-lm-embedding-dim 2048 \
       --rnn-lm-hidden-dim 2048 \
       --rnn-lm-num-layers 3 \
       --lm-vocab-size 500

You should see the following WERs after executing the commands above:

.. code-block:: text

   For test-clean, WER of different settings are:
   beam_size_4 2.9 best for test-clean
   For test-other, WER of different settings are:
   beam_size_4 7.57 best for test-other

It's slightly better than LM rescoring. If we further increase the beam size, we will see
further improvements from LM rescoring + LODR:

.. list-table:: WERs of LM rescoring + LODR with different beam sizes
   :widths: 25 25 25
   :header-rows: 1

   * - Beam size
     - test-clean
     - test-other
   * - 4
     - 2.9
     - 7.57
   * - 8
     - 2.63
     - 7.04
   * - 12
     - 2.52
     - 6.73

As mentioned earlier, LM rescoring is usually faster than shallow-fusion-based methods.
Here, we benchmark their WERs and decoding speed:

.. list-table:: LM-rescoring-based methods vs. shallow-fusion-based methods (each field shows WER on test-clean / WER on test-other; decoding time on test-clean)
   :widths: 25 25 25 25
   :header-rows: 1

   * - Decoding method
     - beam=4
     - beam=8
     - beam=12
   * - ``modified_beam_search``
     - 3.11/7.93; 132s
     - 3.1/7.95; 177s
     - 3.1/7.96; 210s
   * - ``modified_beam_search_lm_shallow_fusion``
     - 2.77/7.08; 262s
     - 2.62/6.65; 352s
     - 2.58/6.65; 488s
   * - LODR
     - 2.61/6.74; 400s
     - 2.45/6.38; 610s
     - 2.4/6.23; 870s
   * - ``modified_beam_search_lm_rescore``
     - 2.93/7.6; 156s
     - 2.67/7.11; 203s
     - 2.59/6.86; 255s
   * - ``modified_beam_search_lm_rescore_LODR``
     - 2.9/7.57; 160s
     - 2.63/7.04; 203s
     - 2.52/6.73; 263s

.. note::

   Decoding was performed on a single 32G V100 with ``--max-duration`` set to 600.
   The decoding times here are only for reference; they may vary.

176  _sources/decoding-with-langugage-models/shallow-fusion.rst.txt  (new file)
@@ -0,0 +1,176 @@

.. _shallow_fusion:

Shallow fusion for Transducer
=================================

External language models (LMs) are commonly used to improve WERs for E2E ASR models.
This tutorial shows you how to perform shallow fusion with an external LM
to improve the word error rate of a transducer model.
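
During beam search, shallow fusion simply interpolates the transducer's per-token log-probability with the external LM's log-probability. In the notation used in :ref:`LODR`, the combined score is:

.. math::

   \text{score}\left(y_u|\mathit{x},y\right) =
   \log p\left(y_u|\mathit{x},y_{1:u-1}\right) +
   \lambda \log p_{\text{LM}}\left(y_u|\mathit{x},y_{1:u-1}\right)

where :math:`\lambda` is the LM scale (the ``--lm-scale`` argument used below).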

.. note::

   This tutorial is based on the recipe
   `pruned_transducer_stateless7_streaming <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming>`_,
   which is a streaming transducer model trained on `LibriSpeech`_.
   However, you can easily apply shallow fusion to other recipes.
   If you encounter any problems, please open an issue at `icefall <https://github.com/k2-fsa/icefall/issues>`_.

.. note::

   For simplicity, the training and testing corpora in this tutorial are the same (`LibriSpeech`_). However, you can change the testing set
   to any other domain (e.g. `GigaSpeech`_) and use an external LM trained on that domain.

.. HINT::

   We recommend using a GPU for decoding.

For illustration purposes, we will use a pre-trained ASR model from this `link <https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29>`__.
If you want to train your model from scratch, please have a look at :ref:`non_streaming_librispeech_pruned_transducer_stateless`.

As the initial step, let's download the pre-trained model.

.. code-block:: bash

   $ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29
   $ pushd icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
   $ git lfs pull --include "pretrained.pt"
   $ ln -s pretrained.pt epoch-99.pt  # create a symbolic link so that the checkpoint can be loaded
   $ popd  # the decoding commands below are run from the original directory

To test the model, let's have a look at the decoding results without using the LM. This can be done via the following command:

.. code-block:: bash

   $ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/
   $ ./pruned_transducer_stateless7_streaming/decode.py \
       --epoch 99 \
       --avg 1 \
       --use-averaged-model False \
       --exp-dir $exp_dir \
       --bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
       --max-duration 600 \
       --decode-chunk-len 32 \
       --decoding-method modified_beam_search

The following WERs are achieved on test-clean and test-other:

.. code-block:: text

   For test-clean, WER of different settings are:
   beam_size_4 3.11 best for test-clean
   For test-other, WER of different settings are:
   beam_size_4 7.93 best for test-other

These are already good numbers! But we can further improve them by using shallow fusion with an external LM.
Training a language model usually takes a long time, so we can download a pre-trained LM from this `link <https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm>`__.

.. code-block:: bash

   $ # download the external LM
   $ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm
   $ # create a symbolic link so that the checkpoint can be loaded
   $ pushd icefall-librispeech-rnn-lm/exp
   $ git lfs pull --include "pretrained.pt"
   $ ln -s pretrained.pt epoch-99.pt
   $ popd

.. note::

   This is an RNN LM trained on the LibriSpeech text corpus, so it might not be ideal for other corpora.
   You may also train an RNN LM from scratch. Please refer to this `script <https://github.com/k2-fsa/icefall/blob/master/icefall/rnn_lm/train.py>`__
   for training an RNN LM and this `script <https://github.com/k2-fsa/icefall/blob/master/icefall/transformer_lm/train.py>`__ to train a transformer LM.

To use shallow fusion for decoding, we can execute the following command:

.. code-block:: bash

   $ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
   $ lm_dir=./icefall-librispeech-rnn-lm/exp
   $ lm_scale=0.29
   $ ./pruned_transducer_stateless7_streaming/decode.py \
       --epoch 99 \
       --avg 1 \
       --use-averaged-model False \
       --beam-size 4 \
       --exp-dir $exp_dir \
       --max-duration 600 \
       --decode-chunk-len 32 \
       --decoding-method modified_beam_search_lm_shallow_fusion \
       --bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
       --use-shallow-fusion 1 \
       --lm-type rnn \
       --lm-exp-dir $lm_dir \
       --lm-epoch 99 \
       --lm-scale $lm_scale \
       --lm-avg 1 \
       --rnn-lm-embedding-dim 2048 \
       --rnn-lm-hidden-dim 2048 \
       --rnn-lm-num-layers 3 \
       --lm-vocab-size 500

Note that we set ``--decoding-method modified_beam_search_lm_shallow_fusion`` and ``--use-shallow-fusion True``
to use shallow fusion. ``--lm-type`` specifies the type of neural LM we are going to use; you can choose
between ``rnn`` and ``transformer``. The following three arguments are associated with the RNN LM (see the sketch after this list):

- ``--rnn-lm-embedding-dim``

  The embedding dimension of the RNN LM.

- ``--rnn-lm-hidden-dim``

  The hidden dimension of the RNN LM.

- ``--rnn-lm-num-layers``

  The number of RNN layers in the RNN LM.
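
To see what these three flags configure, here is a minimal PyTorch sketch of an RNN LM with the dimensions used above. The class name is hypothetical and this is a generic illustration, not icefall's actual RNN LM implementation.

.. code-block:: python

   import torch
   import torch.nn as nn

   class SketchRnnLm(nn.Module):
       """Toy LSTM LM mirroring the three --rnn-lm-* flags."""

       def __init__(self, vocab_size=500, embedding_dim=2048,
                    hidden_dim=2048, num_layers=3):
           super().__init__()
           self.embed = nn.Embedding(vocab_size, embedding_dim)
           self.rnn = nn.LSTM(embedding_dim, hidden_dim,
                              num_layers=num_layers, batch_first=True)
           self.out = nn.Linear(hidden_dim, vocab_size)

       def forward(self, tokens: torch.Tensor) -> torch.Tensor:
           # tokens: (batch, seq) -> per-token logits over the vocabulary
           h, _ = self.rnn(self.embed(tokens))
           return self.out(h)

   logits = SketchRnnLm()(torch.zeros(1, 8, dtype=torch.long))  # (1, 8, 500)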

The decoding results obtained with the above command are shown below.

.. code-block:: text

   For test-clean, WER of different settings are:
   beam_size_4 2.77 best for test-clean
   For test-other, WER of different settings are:
   beam_size_4 7.08 best for test-other

The improvement from shallow fusion is very obvious! The relative WER reduction on test-other is around 10.5%.
A few parameters can be tuned to further boost the performance of shallow fusion:

- ``--lm-scale``

  Controls the scale of the LM. If too small, the external language model may not be fully utilized; if too large,
  the LM score may dominate during decoding, leading to bad WER. A typical value is around 0.3.

- ``--beam-size``

  The number of active paths in the search beam. It controls the trade-off between decoding efficiency and accuracy.

Here, we also show how ``--beam-size`` affects the WER and decoding time:

.. list-table:: WERs and decoding time (on test-clean) of shallow fusion with different beam sizes
   :widths: 25 25 25 25
   :header-rows: 1

   * - Beam size
     - test-clean
     - test-other
     - Decoding time on test-clean (s)
   * - 4
     - 2.77
     - 7.08
     - 262
   * - 8
     - 2.62
     - 6.65
     - 352
   * - 12
     - 2.58
     - 6.65
     - 488

As we can see, a larger beam size during shallow fusion improves the WER but is also slower.

@@ -34,3 +34,8 @@ speech recognition recipes using `k2 <https://github.com/k2-fsa/k2>`_.

    contributing/index
    huggingface/index
+
+.. toctree::
+   :maxdepth: 2
+
+   decoding-with-langugage-models/index

@@ -1,7 +1,7 @@
 Distillation with HuBERT
 ========================

-This tutorial shows you how to perform knowledge distillation in `icefall`_
+This tutorial shows you how to perform knowledge distillation in `icefall <https://github.com/k2-fsa/icefall>`_
 with the `LibriSpeech`_ dataset. The distillation method
 used here is called "Multi Vector Quantization Knowledge Distillation" (MVQ-KD).
 Please have a look at our paper `Predicting Multi-Codebook Vector Quantization Indexes for Knowledge Distillation <https://arxiv.org/abs/2211.00508>`_
@@ -13,7 +13,7 @@ for more details about MVQ-KD.
 `pruned_transducer_stateless4 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless4>`_.
 Currently, we only implement MVQ-KD in this recipe. However, MVQ-KD is theoretically applicable to all recipes
 with only minor changes needed. Feel free to try out MVQ-KD in different recipes. If you
-encounter any problems, please open an issue here `icefall <https://github.com/k2-fsa/icefall/issues>`_.
+encounter any problems, please open an issue here `icefall <https://github.com/k2-fsa/icefall/issues>`__.

 .. note::

@@ -217,7 +217,7 @@ the following command.
     --exp-dir $exp_dir \
     --enable-distillation True

-You should get similar results as `here <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS-100hours.md#distillation-with-hubert>`_.
+You should get similar results as `here <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS-100hours.md#distillation-with-hubert>`__.

 That's all! Feel free to experiment with your own setups and report your results.
-If you encounter any problems during training, please open up an issue `here <https://github.com/k2-fsa/icefall/issues>`_.
+If you encounter any problems during training, please open up an issue `here <https://github.com/k2-fsa/icefall/issues>`__.

@@ -8,10 +8,10 @@ with the `LibriSpeech <https://www.openslr.org/12>`_ dataset.

 .. Note::

-   The tutorial is suitable for `pruned_transducer_stateless <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless>`_,
-   `pruned_transducer_stateless2 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless2>`_,
-   `pruned_transducer_stateless4 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless4>`_,
-   `pruned_transducer_stateless5 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless5>`_,
+   The tutorial is suitable for `pruned_transducer_stateless <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless>`__,
+   `pruned_transducer_stateless2 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless2>`__,
+   `pruned_transducer_stateless4 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless4>`__,
+   `pruned_transducer_stateless5 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless5>`__,
    We will take pruned_transducer_stateless4 as an example in this tutorial.

 .. HINT::

@@ -237,7 +237,7 @@ them, please modify ``./pruned_transducer_stateless4/train.py`` directly.

 .. NOTE::

-  The options for `pruned_transducer_stateless5 <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless5/train.py>`_ are a little different from
+  The options for `pruned_transducer_stateless5 <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless5/train.py>`__ are a little different from
   other recipes. It allows you to configure ``--num-encoder-layers``, ``--dim-feedforward``, ``--nhead``, ``--encoder-dim``, ``--decoder-dim``, ``--joiner-dim`` from commandline, so that you can train models with different size with pruned_transducer_stateless5.

@@ -529,13 +529,13 @@ Download pretrained models
 If you don't want to train from scratch, you can download the pretrained models
 by visiting the following links:

-  - `pruned_transducer_stateless <https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless-2022-03-12>`_
+  - `pruned_transducer_stateless <https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless-2022-03-12>`__

-  - `pruned_transducer_stateless2 <https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless2-2022-04-29>`_
+  - `pruned_transducer_stateless2 <https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless2-2022-04-29>`__

-  - `pruned_transducer_stateless4 <https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless4-2022-06-03>`_
+  - `pruned_transducer_stateless4 <https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless4-2022-06-03>`__

-  - `pruned_transducer_stateless5 <https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless5-2022-07-07>`_
+  - `pruned_transducer_stateless5 <https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless5-2022-07-07>`__

 See `<https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md>`_
 for the details of the above pretrained models

@@ -45,9 +45,9 @@ the input features.

 We have three variants of Emformer models in ``icefall``.

-- ``pruned_stateless_emformer_rnnt2`` using Emformer from torchaudio, see `LibriSpeech recipe <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_stateless_emformer_rnnt2>`_.
+- ``pruned_stateless_emformer_rnnt2`` using Emformer from torchaudio, see `LibriSpeech recipe <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_stateless_emformer_rnnt2>`__.
 - ``conv_emformer_transducer_stateless`` using ConvEmformer implemented by ourself. Different from the Emformer in torchaudio,
   ConvEmformer has a convolution in each layer and uses the mechanisms in our reworked conformer model.
-  See `LibriSpeech recipe <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/conv_emformer_transducer_stateless>`_.
+  See `LibriSpeech recipe <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/conv_emformer_transducer_stateless>`__.
 - ``conv_emformer_transducer_stateless2`` using ConvEmformer implemented by ourself. The only difference from the above one is that
   it uses a simplified memory bank. See `LibriSpeech recipe <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/conv_emformer_transducer_stateless2>`_.

@@ -6,10 +6,10 @@ with the `LibriSpeech <https://www.openslr.org/12>`_ dataset.

 .. Note::

-   The tutorial is suitable for `pruned_transducer_stateless <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless>`_,
-   `pruned_transducer_stateless2 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless2>`_,
-   `pruned_transducer_stateless4 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless4>`_,
-   `pruned_transducer_stateless5 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless5>`_,
+   The tutorial is suitable for `pruned_transducer_stateless <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless>`__,
+   `pruned_transducer_stateless2 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless2>`__,
+   `pruned_transducer_stateless4 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless4>`__,
+   `pruned_transducer_stateless5 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless5>`__,
    We will take pruned_transducer_stateless4 as an example in this tutorial.

 .. HINT::

@@ -264,7 +264,7 @@ them, please modify ``./pruned_transducer_stateless4/train.py`` directly.

 .. NOTE::

-  The options for `pruned_transducer_stateless5 <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless5/train.py>`_ are a little different from
+  The options for `pruned_transducer_stateless5 <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless5/train.py>`__ are a little different from
   other recipes. It allows you to configure ``--num-encoder-layers``, ``--dim-feedforward``, ``--nhead``, ``--encoder-dim``, ``--decoder-dim``, ``--joiner-dim`` from commandline, so that you can train models with different size with pruned_transducer_stateless5.

@@ -6,7 +6,7 @@ with the `LibriSpeech <https://www.openslr.org/12>`_ dataset.

 .. Note::

-   The tutorial is suitable for `pruned_transducer_stateless7_streaming <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming>`_,
+   The tutorial is suitable for `pruned_transducer_stateless7_streaming <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming>`__,

 .. HINT::

@@ -642,7 +642,7 @@ Download pretrained models
 If you don't want to train from scratch, you can download the pretrained models
 by visiting the following links:

-- `pruned_transducer_stateless7_streaming <https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29>`_
+- `pruned_transducer_stateless7_streaming <https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29>`__

 See `<https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md>`_
 for the details of the above pretrained models
@ -59,6 +59,9 @@
|
||||
</ul>
|
||||
</li>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
|
||||
</ul>
|
||||
<ul>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
|
||||
</ul>
|
||||
|
||||
</div>
|
||||
|
||||
@ -59,6 +59,9 @@
|
||||
</ul>
|
||||
</li>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
|
||||
</ul>
|
||||
<ul>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
|
||||
</ul>
|
||||
|
||||
</div>
|
||||
|
||||
@ -65,6 +65,9 @@
|
||||
</ul>
|
||||
</li>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
|
||||
</ul>
|
||||
<ul>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
|
||||
</ul>
|
||||
|
||||
</div>
|
||||
|
||||
@ -59,6 +59,9 @@
|
||||
</ul>
|
||||
</li>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
|
||||
</ul>
|
||||
<ul>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
|
||||
</ul>
|
||||
|
||||
</div>
|
||||
|
||||
292
decoding-with-langugage-models/LODR.html
Normal file
292
decoding-with-langugage-models/LODR.html
Normal file
@ -0,0 +1,292 @@
|
||||
<!DOCTYPE html>
|
||||
<html class="writer-html5" lang="en" >
|
||||
<head>
|
||||
<meta charset="utf-8" /><meta name="generator" content="Docutils 0.18.1: http://docutils.sourceforge.net/" />
|
||||
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||||
<title>LODR for RNN Transducer — icefall 0.1 documentation</title>
|
||||
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
|
||||
<link rel="stylesheet" href="../_static/css/theme.css" type="text/css" />
|
||||
<!--[if lt IE 9]>
|
||||
<script src="../_static/js/html5shiv.min.js"></script>
|
||||
<![endif]-->
|
||||
|
||||
<script src="../_static/jquery.js"></script>
|
||||
<script src="../_static/_sphinx_javascript_frameworks_compat.js"></script>
|
||||
<script data-url_root="../" id="documentation_options" src="../_static/documentation_options.js"></script>
|
||||
<script src="../_static/doctools.js"></script>
|
||||
<script src="../_static/sphinx_highlight.js"></script>
|
||||
<script async="async" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
|
||||
<script src="../_static/js/theme.js"></script>
|
||||
<link rel="index" title="Index" href="../genindex.html" />
|
||||
<link rel="search" title="Search" href="../search.html" />
|
||||
<link rel="next" title="LM rescoring for Transducer" href="rescoring.html" />
|
||||
<link rel="prev" title="Shallow fusion for Transducer" href="shallow-fusion.html" />
|
||||
</head>
|
||||
|
||||
<body class="wy-body-for-nav">
|
||||
<div class="wy-grid-for-nav">
|
||||
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
|
||||
<div class="wy-side-scroll">
|
||||
<div class="wy-side-nav-search" >
|
||||
|
||||
|
||||
|
||||
<a href="../index.html" class="icon icon-home">
|
||||
icefall
|
||||
</a>
|
||||
<div role="search">
|
||||
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
|
||||
<input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
|
||||
<input type="hidden" name="check_keywords" value="yes" />
|
||||
<input type="hidden" name="area" value="default" />
|
||||
</form>
|
||||
</div>
|
||||
</div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
|
||||
<p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
|
||||
<ul>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../installation/index.html">Installation</a></li>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../faqs.html">Frequently Asked Questions (FAQs)</a></li>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../model-export/index.html">Model export</a></li>
|
||||
</ul>
|
||||
<ul>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../recipes/index.html">Recipes</a></li>
|
||||
</ul>
|
||||
<ul>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
|
||||
</ul>
|
||||
<ul class="current">
|
||||
<li class="toctree-l1 current"><a class="reference internal" href="index.html">Decoding with language models</a><ul class="current">
|
||||
<li class="toctree-l2"><a class="reference internal" href="shallow-fusion.html">Shallow fusion for Transducer</a></li>
|
||||
<li class="toctree-l2 current"><a class="current reference internal" href="#">LODR for RNN Transducer</a></li>
|
||||
<li class="toctree-l2"><a class="reference internal" href="rescoring.html">LM rescoring for Transducer</a></li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
</div>
|
||||
</div>
|
||||
</nav>
|
||||
|
||||
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
|
||||
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
|
||||
<a href="../index.html">icefall</a>
|
||||
</nav>
|
||||
|
||||
<div class="wy-nav-content">
|
||||
<div class="rst-content">
|
||||
<div role="navigation" aria-label="Page navigation">
|
||||
<ul class="wy-breadcrumbs">
|
||||
<li><a href="../index.html" class="icon icon-home" aria-label="Home"></a></li>
|
||||
<li class="breadcrumb-item"><a href="index.html">Decoding with language models</a></li>
|
||||
<li class="breadcrumb-item active">LODR for RNN Transducer</li>
|
||||
<li class="wy-breadcrumbs-aside">
|
||||
<a href="https://github.com/k2-fsa/icefall/blob/master/docs/source/decoding-with-langugage-models/LODR.rst" class="fa fa-github"> Edit on GitHub</a>
|
||||
</li>
|
||||
</ul>
|
||||
<hr/>
|
||||
</div>
|
||||
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
|
||||
<div itemprop="articleBody">
|
||||
|
||||
<section id="lodr-for-rnn-transducer">
|
||||
<span id="lodr"></span><h1>LODR for RNN Transducer<a class="headerlink" href="#lodr-for-rnn-transducer" title="Permalink to this heading"></a></h1>
|
||||
<p>As a type of E2E model, neural transducers are usually considered as having an internal
|
||||
language model, which learns the language level information on the training corpus.
|
||||
In real-life scenario, there is often a mismatch between the training corpus and the target corpus space.
|
||||
This mismatch can be a problem when decoding for neural transducer models with language models as its internal
|
||||
language can act “against” the external LM. In this tutorial, we show how to use
|
||||
<a class="reference external" href="https://arxiv.org/abs/2203.16776">Low-order Density Ratio</a> to alleviate this effect to further improve the performance
|
||||
of langugae model integration.</p>
|
||||
<div class="admonition note">
|
||||
<p class="admonition-title">Note</p>
|
||||
<p>This tutorial is based on the recipe
|
||||
<a class="reference external" href="https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming">pruned_transducer_stateless7_streaming</a>,
|
||||
which is a streaming transducer model trained on <a class="reference external" href="https://www.openslr.org/12">LibriSpeech</a>.
|
||||
However, you can easily apply LODR to other recipes.
|
||||
If you encounter any problems, please open an issue here <a class="reference external" href="https://github.com/k2-fsa/icefall/issues">icefall</a>.</p>
|
||||
</div>
|
||||
<div class="admonition note">
|
||||
<p class="admonition-title">Note</p>
|
||||
<p>For simplicity, the training and testing corpus in this tutorial are the same (<a class="reference external" href="https://www.openslr.org/12">LibriSpeech</a>). However,
|
||||
you can change the testing set to any other domains (e.g <a class="reference external" href="https://github.com/SpeechColab/GigaSpeech">GigaSpeech</a>) and prepare the language models
|
||||
using that corpus.</p>
|
||||
</div>
|
||||
<p>First, let’s have a look at some background information. As the predecessor of LODR, Density Ratio (DR) is first proposed <a class="reference external" href="https://arxiv.org/abs/2002.11268">here</a>
|
||||
to address the language information mismatch between the training
|
||||
corpus (source domain) and the testing corpus (target domain). Assuming that the source domain and the test domain
|
||||
are acoustically similar, DR derives the following formular for decoding with Bayes’ theorem:</p>
|
||||
<div class="math notranslate nohighlight">
|
||||
\[\text{score}\left(y_u|\mathit{x},y\right) =
|
||||
\log p\left(y_u|\mathit{x},y_{1:u-1}\right) +
|
||||
\lambda_1 \log p_{\text{Target LM}}\left(y_u|\mathit{x},y_{1:u-1}\right) -
|
||||
\lambda_2 \log p_{\text{Source LM}}\left(y_u|\mathit{x},y_{1:u-1}\right)\]</div>
|
||||
<p>where <span class="math notranslate nohighlight">\(\lambda_1\)</span> and <span class="math notranslate nohighlight">\(\lambda_2\)</span> are the weights of LM scores for target domain and source domain respectively.
|
||||
Here, the source domain LM is trained on the training corpus. The only difference in the above formular compared to
|
||||
shallow fusion is the subtraction of the source domain LM.</p>
|
||||
<p>Some works treat the predictor and the joiner of the neural transducer as its internal LM. However, the LM is
|
||||
considered to be weak and can only capture low-level language information. Therefore, <a class="reference external" href="https://arxiv.org/abs/2203.16776">LODR</a> proposed to use
|
||||
a low-order n-gram LM as an approximation of the ILM of the neural transducer. This leads to the following formula
|
||||
during decoding for transducer model:</p>
|
||||
<div class="math notranslate nohighlight">
|
||||
\[\text{score}\left(y_u|\mathit{x},y\right) =
|
||||
\log p_{rnnt}\left(y_u|\mathit{x},y_{1:u-1}\right) +
|
||||
\lambda_1 \log p_{\text{Target LM}}\left(y_u|\mathit{x},y_{1:u-1}\right) -
|
||||
\lambda_2 \log p_{\text{bi-gram}}\left(y_u|\mathit{x},y_{1:u-1}\right)\]</div>
|
||||
<p>In LODR, an additional bi-gram LM estimated on the source domain (e.g training corpus) is required. Comared to DR,
|
||||
the only difference lies in the choice of source domain LM. According to the original <a class="reference external" href="https://arxiv.org/abs/2203.16776">paper</a>,
|
||||
LODR achieves similar performance compared DR in both intra-domain and cross-domain settings.
|
||||
As a bi-gram is much faster to evaluate, LODR is usually much faster.</p>
|
||||
<p>Now, we will show you how to use LODR in <code class="docutils literal notranslate"><span class="pre">icefall</span></code>.
|
||||
For illustration purpose, we will use a pre-trained ASR model from this <a class="reference external" href="https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29">link</a>.
|
||||
If you want to train your model from scratch, please have a look at <a class="reference internal" href="../recipes/Non-streaming-ASR/librispeech/pruned_transducer_stateless.html#non-streaming-librispeech-pruned-transducer-stateless"><span class="std std-ref">Pruned transducer statelessX</span></a>.
|
||||
The testing scenario here is intra-domain (we decode the model trained on <a class="reference external" href="https://www.openslr.org/12">LibriSpeech</a> on <a class="reference external" href="https://www.openslr.org/12">LibriSpeech</a> testing sets).</p>
|
||||
<p>As the initial step, let’s download the pre-trained model.</p>
|
||||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="nv">GIT_LFS_SKIP_SMUDGE</span><span class="o">=</span><span class="m">1</span><span class="w"> </span>git<span class="w"> </span>clone<span class="w"> </span>https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29
|
||||
$<span class="w"> </span><span class="nb">pushd</span><span class="w"> </span>icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
|
||||
$<span class="w"> </span>git<span class="w"> </span>lfs<span class="w"> </span>pull<span class="w"> </span>--include<span class="w"> </span><span class="s2">"pretrained.pt"</span>
|
||||
$<span class="w"> </span>ln<span class="w"> </span>-s<span class="w"> </span>pretrained.pt<span class="w"> </span>epoch-99.pt<span class="w"> </span><span class="c1"># create a symbolic link so that the checkpoint can be loaded</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>To test the model, let’s have a look at the decoding results <strong>without</strong> using LM. This can be done via the following command:</p>
|
||||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="nv">exp_dir</span><span class="o">=</span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/
|
||||
$<span class="w"> </span>./pruned_transducer_stateless7_streaming/decode.py<span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--epoch<span class="w"> </span><span class="m">99</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--avg<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--use-averaged-model<span class="w"> </span>False<span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--exp-dir<span class="w"> </span><span class="nv">$exp_dir</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--bpe-model<span class="w"> </span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model
|
||||
<span class="w"> </span>--max-duration<span class="w"> </span><span class="m">600</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--decode-chunk-len<span class="w"> </span><span class="m">32</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--decoding-method<span class="w"> </span>modified_beam_search
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>The following WERs are achieved on test-clean and test-other:</p>
|
||||
<div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ For test-clean, WER of different settings are:
|
||||
$ beam_size_4 3.11 best for test-clean
|
||||
$ For test-other, WER of different settings are:
|
||||
$ beam_size_4 7.93 best for test-other
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>Then, we download the external language model and bi-gram LM that are necessary for LODR.
|
||||
Note that the bi-gram is estimated on the LibriSpeech 960 hours’ text.</p>
|
||||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="c1"># download the external LM</span>
|
||||
$<span class="w"> </span><span class="nv">GIT_LFS_SKIP_SMUDGE</span><span class="o">=</span><span class="m">1</span><span class="w"> </span>git<span class="w"> </span>clone<span class="w"> </span>https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm
|
||||
$<span class="w"> </span><span class="c1"># create a symbolic link so that the checkpoint can be loaded</span>
|
||||
$<span class="w"> </span><span class="nb">pushd</span><span class="w"> </span>icefall-librispeech-rnn-lm/exp
|
||||
$<span class="w"> </span>git<span class="w"> </span>lfs<span class="w"> </span>pull<span class="w"> </span>--include<span class="w"> </span><span class="s2">"pretrained.pt"</span>
|
||||
$<span class="w"> </span>ln<span class="w"> </span>-s<span class="w"> </span>pretrained.pt<span class="w"> </span>epoch-99.pt
|
||||
$<span class="w"> </span><span class="nb">popd</span>
|
||||
$
|
||||
$<span class="w"> </span><span class="c1"># download the bi-gram</span>
|
||||
$<span class="w"> </span>git<span class="w"> </span>lfs<span class="w"> </span>install
|
||||
$<span class="w"> </span>git<span class="w"> </span>clone<span class="w"> </span>https://huggingface.co/marcoyang/librispeech_bigram
|
||||
$<span class="w"> </span><span class="nb">pushd</span><span class="w"> </span>data/lang_bpe_500
|
||||
$<span class="w"> </span>ln<span class="w"> </span>-s<span class="w"> </span>../../librispeech_bigram/2gram.fst.txt<span class="w"> </span>.
|
||||
$<span class="w"> </span><span class="nb">popd</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>Then, we perform LODR decoding by setting <code class="docutils literal notranslate"><span class="pre">--decoding-method</span></code> to <code class="docutils literal notranslate"><span class="pre">modified_beam_search_lm_LODR</span></code>:</p>
|
||||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="nv">exp_dir</span><span class="o">=</span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
|
||||
$<span class="w"> </span><span class="nv">lm_dir</span><span class="o">=</span>./icefall-librispeech-rnn-lm/exp
|
||||
$<span class="w"> </span><span class="nv">lm_scale</span><span class="o">=</span><span class="m">0</span>.42
|
||||
$<span class="w"> </span><span class="nv">LODR_scale</span><span class="o">=</span>-0.24
|
||||
$<span class="w"> </span>./pruned_transducer_stateless7_streaming/decode.py<span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--epoch<span class="w"> </span><span class="m">99</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--avg<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--use-averaged-model<span class="w"> </span>False<span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--beam-size<span class="w"> </span><span class="m">4</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--exp-dir<span class="w"> </span><span class="nv">$exp_dir</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--max-duration<span class="w"> </span><span class="m">600</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--decode-chunk-len<span class="w"> </span><span class="m">32</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--decoding-method<span class="w"> </span>modified_beam_search_lm_LODR<span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--bpe-model<span class="w"> </span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model
|
||||
<span class="w"> </span>--use-shallow-fusion<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--lm-type<span class="w"> </span>rnn<span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--lm-exp-dir<span class="w"> </span><span class="nv">$lm_dir</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--lm-epoch<span class="w"> </span><span class="m">99</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--lm-scale<span class="w"> </span><span class="nv">$lm_scale</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--lm-avg<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--rnn-lm-embedding-dim<span class="w"> </span><span class="m">2048</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--rnn-lm-hidden-dim<span class="w"> </span><span class="m">2048</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--rnn-lm-num-layers<span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--lm-vocab-size<span class="w"> </span><span class="m">500</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--tokens-ngram<span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--ngram-lm-scale<span class="w"> </span><span class="nv">$LODR_scale</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>There are two extra arguments that need to be given when doing LODR. <code class="docutils literal notranslate"><span class="pre">--tokens-ngram</span></code> specifies the order of n-gram. As we
|
||||
are using a bi-gram, we set it to 2. <code class="docutils literal notranslate"><span class="pre">--ngram-lm-scale</span></code> is the scale of the bi-gram; it should be a negative number
|
||||
as we are subtracting the bi-gram’s score during decoding.</p>
|
||||
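<p>To make the roles of these two arguments concrete, below is a minimal, hypothetical sketch of how a single token's score is combined during LODR decoding. The function name and toy values are illustrative only; the actual implementation lives in icefall's <code class="docutils literal notranslate"><span class="pre">modified_beam_search_lm_LODR</span></code> decoding method:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span>import math

def lodr_token_score(
    log_p_rnnt: float,    # transducer log-prob of the candidate token
    log_p_rnnlm: float,   # external RNN LM log-prob of the same token
    log_p_bigram: float,  # bi-gram (ILM approximation) log-prob
    lm_scale: float = 0.42,         # corresponds to --lm-scale
    ngram_lm_scale: float = -0.24,  # corresponds to --ngram-lm-scale (negative)
) -> float:
    # LODR adds the external LM score and subtracts the bi-gram score;
    # the subtraction is realized by the negative ngram_lm_scale.
    return log_p_rnnt + lm_scale * log_p_rnnlm + ngram_lm_scale * log_p_bigram

# Toy example: the external LM likes this token more than the bi-gram does,
# so its combined score is boosted relative to plain shallow fusion.
print(lodr_token_score(math.log(0.6), math.log(0.5), math.log(0.4)))
</pre></div>
</div>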
<p>The decoding results obtained with the above command are shown below:</p>
|
||||
<div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ For test-clean, WER of different settings are:
|
||||
$ beam_size_4 2.61 best for test-clean
|
||||
$ For test-other, WER of different settings are:
|
||||
$ beam_size_4 6.74 best for test-other
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>Recall that the lowest WER we obtained in <a class="reference internal" href="shallow-fusion.html#shallow-fusion"><span class="std std-ref">Shallow fusion for Transducer</span></a> with beam size of 4 is <code class="docutils literal notranslate"><span class="pre">2.77/7.08</span></code>, LODR
|
||||
indeed <strong>further improves</strong> the WER. We can do even better if we increase <code class="docutils literal notranslate"><span class="pre">--beam-size</span></code>:</p>
|
||||
<table class="docutils align-default" id="id1">
|
||||
<caption><span class="caption-number">Table 2 </span><span class="caption-text">WER of LODR with different beam sizes</span><a class="headerlink" href="#id1" title="Permalink to this table"></a></caption>
|
||||
<colgroup>
|
||||
<col style="width: 25%" />
|
||||
<col style="width: 25%" />
|
||||
<col style="width: 50%" />
|
||||
</colgroup>
|
||||
<thead>
|
||||
<tr class="row-odd"><th class="head"><p>Beam size</p></th>
|
||||
<th class="head"><p>test-clean</p></th>
|
||||
<th class="head"><p>test-other</p></th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr class="row-even"><td><p>4</p></td>
|
||||
<td><p>2.61</p></td>
|
||||
<td><p>6.74</p></td>
|
||||
</tr>
|
||||
<tr class="row-odd"><td><p>8</p></td>
|
||||
<td><p>2.45</p></td>
|
||||
<td><p>6.38</p></td>
|
||||
</tr>
|
||||
<tr class="row-even"><td><p>12</p></td>
|
||||
<td><p>2.4</p></td>
|
||||
<td><p>6.23</p></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
</section>
|
||||
|
||||
|
||||
</div>
|
||||
</div>
|
||||
<footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
|
||||
<a href="shallow-fusion.html" class="btn btn-neutral float-left" title="Shallow fusion for Transducer" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
|
||||
<a href="rescoring.html" class="btn btn-neutral float-right" title="LM rescoring for Transducer" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
|
||||
</div>
|
||||
|
||||
<hr/>
|
||||
|
||||
<div role="contentinfo">
|
||||
<p>© Copyright 2021, icefall development team.</p>
|
||||
</div>
|
||||
|
||||
Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
|
||||
<a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
|
||||
provided by <a href="https://readthedocs.org">Read the Docs</a>.
|
||||
|
||||
|
||||
</footer>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
</div>
|
||||
<script>
|
||||
jQuery(function () {
|
||||
SphinxRtdTheme.Navigation.enable(true);
|
||||
});
|
||||
</script>
|
||||
|
||||
</body>
|
||||
</html>
|
||||
135
decoding-with-langugage-models/index.html
Normal file
135
decoding-with-langugage-models/index.html
Normal file
@ -0,0 +1,135 @@
|
||||
<!DOCTYPE html>
|
||||
<html class="writer-html5" lang="en" >
|
||||
<head>
|
||||
<meta charset="utf-8" /><meta name="generator" content="Docutils 0.18.1: http://docutils.sourceforge.net/" />
|
||||
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||||
<title>Decoding with language models — icefall 0.1 documentation</title>
|
||||
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
|
||||
<link rel="stylesheet" href="../_static/css/theme.css" type="text/css" />
|
||||
<!--[if lt IE 9]>
|
||||
<script src="../_static/js/html5shiv.min.js"></script>
|
||||
<![endif]-->
|
||||
|
||||
<script src="../_static/jquery.js"></script>
|
||||
<script src="../_static/_sphinx_javascript_frameworks_compat.js"></script>
|
||||
<script data-url_root="../" id="documentation_options" src="../_static/documentation_options.js"></script>
|
||||
<script src="../_static/doctools.js"></script>
|
||||
<script src="../_static/sphinx_highlight.js"></script>
|
||||
<script src="../_static/js/theme.js"></script>
|
||||
<link rel="index" title="Index" href="../genindex.html" />
|
||||
<link rel="search" title="Search" href="../search.html" />
|
||||
<link rel="next" title="Shallow fusion for Transducer" href="shallow-fusion.html" />
|
||||
<link rel="prev" title="Huggingface spaces" href="../huggingface/spaces.html" />
|
||||
</head>
|
||||
|
||||
<body class="wy-body-for-nav">
|
||||
<div class="wy-grid-for-nav">
|
||||
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
|
||||
<div class="wy-side-scroll">
|
||||
<div class="wy-side-nav-search" >
|
||||
|
||||
|
||||
|
||||
<a href="../index.html" class="icon icon-home">
|
||||
icefall
|
||||
</a>
|
||||
<div role="search">
|
||||
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
|
||||
<input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
|
||||
<input type="hidden" name="check_keywords" value="yes" />
|
||||
<input type="hidden" name="area" value="default" />
|
||||
</form>
|
||||
</div>
|
||||
</div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
|
||||
<p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
|
||||
<ul>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../installation/index.html">Installation</a></li>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../faqs.html">Frequently Asked Questions (FAQs)</a></li>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../model-export/index.html">Model export</a></li>
|
||||
</ul>
|
||||
<ul>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../recipes/index.html">Recipes</a></li>
|
||||
</ul>
|
||||
<ul>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
|
||||
</ul>
|
||||
<ul class="current">
|
||||
<li class="toctree-l1 current"><a class="current reference internal" href="#">Decoding with language models</a><ul>
|
||||
<li class="toctree-l2"><a class="reference internal" href="shallow-fusion.html">Shallow fusion for Transducer</a></li>
|
||||
<li class="toctree-l2"><a class="reference internal" href="LODR.html">LODR for RNN Transducer</a></li>
|
||||
<li class="toctree-l2"><a class="reference internal" href="rescoring.html">LM rescoring for Transducer</a></li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
</div>
|
||||
</div>
|
||||
</nav>
|
||||
|
||||
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
|
||||
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
|
||||
<a href="../index.html">icefall</a>
|
||||
</nav>
|
||||
|
||||
<div class="wy-nav-content">
|
||||
<div class="rst-content">
|
||||
<div role="navigation" aria-label="Page navigation">
|
||||
<ul class="wy-breadcrumbs">
|
||||
<li><a href="../index.html" class="icon icon-home" aria-label="Home"></a></li>
|
||||
<li class="breadcrumb-item active">Decoding with language models</li>
|
||||
<li class="wy-breadcrumbs-aside">
|
||||
<a href="https://github.com/k2-fsa/icefall/blob/master/docs/source/decoding-with-langugage-models/index.rst" class="fa fa-github"> Edit on GitHub</a>
|
||||
</li>
|
||||
</ul>
|
||||
<hr/>
|
||||
</div>
|
||||
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
|
||||
<div itemprop="articleBody">
|
||||
|
||||
<section id="decoding-with-language-models">
|
||||
<h1>Decoding with language models<a class="headerlink" href="#decoding-with-language-models" title="Permalink to this heading"></a></h1>
|
||||
<p>This section describes how to use external langugage models
|
||||
during decoding to improve the WER of transducer models.</p>
|
||||
<div class="toctree-wrapper compound">
|
||||
<ul>
|
||||
<li class="toctree-l1"><a class="reference internal" href="shallow-fusion.html">Shallow fusion for Transducer</a></li>
|
||||
<li class="toctree-l1"><a class="reference internal" href="LODR.html">LODR for RNN Transducer</a></li>
|
||||
<li class="toctree-l1"><a class="reference internal" href="rescoring.html">LM rescoring for Transducer</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
|
||||
</div>
|
||||
</div>
|
||||
<footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
|
||||
<a href="../huggingface/spaces.html" class="btn btn-neutral float-left" title="Huggingface spaces" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
|
||||
<a href="shallow-fusion.html" class="btn btn-neutral float-right" title="Shallow fusion for Transducer" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
|
||||
</div>
|
||||
|
||||
<hr/>
|
||||
|
||||
<div role="contentinfo">
|
||||
<p>© Copyright 2021, icefall development team.</p>
|
||||
</div>
|
||||
|
||||
Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
|
||||
<a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
|
||||
provided by <a href="https://readthedocs.org">Read the Docs</a>.
|
||||
|
||||
|
||||
</footer>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
</div>
|
||||
<script>
|
||||
jQuery(function () {
|
||||
SphinxRtdTheme.Navigation.enable(true);
|
||||
});
|
||||
</script>
|
||||
|
||||
</body>
|
||||
</html>
|
||||
386
decoding-with-langugage-models/rescoring.html
Normal file
386
decoding-with-langugage-models/rescoring.html
Normal file
@ -0,0 +1,386 @@
|
||||
<!DOCTYPE html>
|
||||
<html class="writer-html5" lang="en" >
|
||||
<head>
|
||||
<meta charset="utf-8" /><meta name="generator" content="Docutils 0.18.1: http://docutils.sourceforge.net/" />
|
||||
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||||
<title>LM rescoring for Transducer — icefall 0.1 documentation</title>
|
||||
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
|
||||
<link rel="stylesheet" href="../_static/css/theme.css" type="text/css" />
|
||||
<!--[if lt IE 9]>
|
||||
<script src="../_static/js/html5shiv.min.js"></script>
|
||||
<![endif]-->
|
||||
|
||||
<script src="../_static/jquery.js"></script>
|
||||
<script src="../_static/_sphinx_javascript_frameworks_compat.js"></script>
|
||||
<script data-url_root="../" id="documentation_options" src="../_static/documentation_options.js"></script>
|
||||
<script src="../_static/doctools.js"></script>
|
||||
<script src="../_static/sphinx_highlight.js"></script>
|
||||
<script src="../_static/js/theme.js"></script>
|
||||
<link rel="index" title="Index" href="../genindex.html" />
|
||||
<link rel="search" title="Search" href="../search.html" />
|
||||
<link rel="prev" title="LODR for RNN Transducer" href="LODR.html" />
|
||||
</head>
|
||||
|
||||
<body class="wy-body-for-nav">
|
||||
<div class="wy-grid-for-nav">
|
||||
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
|
||||
<div class="wy-side-scroll">
|
||||
<div class="wy-side-nav-search" >
|
||||
|
||||
|
||||
|
||||
<a href="../index.html" class="icon icon-home">
|
||||
icefall
|
||||
</a>
|
||||
<div role="search">
|
||||
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
|
||||
<input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
|
||||
<input type="hidden" name="check_keywords" value="yes" />
|
||||
<input type="hidden" name="area" value="default" />
|
||||
</form>
|
||||
</div>
|
||||
</div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
|
||||
<p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
|
||||
<ul>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../installation/index.html">Installation</a></li>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../faqs.html">Frequently Asked Questions (FAQs)</a></li>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../model-export/index.html">Model export</a></li>
|
||||
</ul>
|
||||
<ul>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../recipes/index.html">Recipes</a></li>
|
||||
</ul>
|
||||
<ul>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
|
||||
</ul>
|
||||
<ul class="current">
|
||||
<li class="toctree-l1 current"><a class="reference internal" href="index.html">Decoding with language models</a><ul class="current">
|
||||
<li class="toctree-l2"><a class="reference internal" href="shallow-fusion.html">Shallow fusion for Transducer</a></li>
|
||||
<li class="toctree-l2"><a class="reference internal" href="LODR.html">LODR for RNN Transducer</a></li>
|
||||
<li class="toctree-l2 current"><a class="current reference internal" href="#">LM rescoring for Transducer</a></li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
</div>
|
||||
</div>
|
||||
</nav>
|
||||
|
||||
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
|
||||
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
|
||||
<a href="../index.html">icefall</a>
|
||||
</nav>
|
||||
|
||||
<div class="wy-nav-content">
|
||||
<div class="rst-content">
|
||||
<div role="navigation" aria-label="Page navigation">
|
||||
<ul class="wy-breadcrumbs">
|
||||
<li><a href="../index.html" class="icon icon-home" aria-label="Home"></a></li>
|
||||
<li class="breadcrumb-item"><a href="index.html">Decoding with language models</a></li>
|
||||
<li class="breadcrumb-item active">LM rescoring for Transducer</li>
|
||||
<li class="wy-breadcrumbs-aside">
|
||||
<a href="https://github.com/k2-fsa/icefall/blob/master/docs/source/decoding-with-langugage-models/rescoring.rst" class="fa fa-github"> Edit on GitHub</a>
|
||||
</li>
|
||||
</ul>
|
||||
<hr/>
|
||||
</div>
|
||||
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
|
||||
<div itemprop="articleBody">
|
||||
|
||||
<section id="lm-rescoring-for-transducer">
|
||||
<span id="rescoring"></span><h1>LM rescoring for Transducer<a class="headerlink" href="#lm-rescoring-for-transducer" title="Permalink to this heading"></a></h1>
|
||||
<p>LM rescoring is a commonly used approach to incorporate external LM information. Unlike shallow-fusion-based
|
||||
methods (see <span class="xref std std-ref">shallow-fusion</span>, <a class="reference internal" href="LODR.html#lodr"><span class="std std-ref">LODR for RNN Transducer</span></a>), rescoring is usually performed to re-rank the n-best hypotheses after beam search.
|
||||
Rescoring is usually more efficient than shallow fusion since less computation is performed on the external LM.
|
||||
In this tutorial, we will show you how to use an external LM to rescore the n-best hypotheses decoded from neural transducer models in
|
||||
<a class="reference external" href="https://github.com/k2-fsa/icefall">icefall</a>.</p>
|
||||
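<p>Before diving into the commands, here is a minimal sketch of the two-stage procedure: collect the n-best hypotheses (with their scores) from beam search, then add a scaled external LM score to each and re-rank. All names below are illustrative, not icefall's API:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span>from typing import Callable, List, Tuple

def rescore_nbest(
    nbest: List[Tuple[str, float]],      # (hypothesis, beam-search log score)
    lm_logprob: Callable[[str], float],  # external LM sentence log-prob
    lm_scale: float = 0.43,              # corresponds to --lm-scale
) -> List[Tuple[str, float]]:
    # The external LM is evaluated once per complete hypothesis, which is
    # why rescoring is cheaper than shallow fusion (no per-step LM calls).
    rescored = [(hyp, score + lm_scale * lm_logprob(hyp)) for hyp, score in nbest]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Toy usage with a stand-in LM that simply prefers shorter sentences:
print(rescore_nbest([("a b c", -1.2), ("a b see", -1.1)],
                    lm_logprob=lambda s: -float(len(s.split()))))
</pre></div>
</div>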
<div class="admonition note">
|
||||
<p class="admonition-title">Note</p>
|
||||
<p>This tutorial is based on the recipe
|
||||
<a class="reference external" href="https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming">pruned_transducer_stateless7_streaming</a>,
|
||||
which is a streaming transducer model trained on <a class="reference external" href="https://www.openslr.org/12">LibriSpeech</a>.
|
||||
However, you can easily apply LM rescoring to other recipes.
|
||||
If you encounter any problems, please open an issue <a class="reference external" href="https://github.com/k2-fsa/icefall/issues">here</a>.</p>
|
||||
</div>
|
||||
<div class="admonition note">
|
||||
<p class="admonition-title">Note</p>
|
||||
<p>For simplicity, the training and testing corpus in this tutorial is the same (<a class="reference external" href="https://www.openslr.org/12">LibriSpeech</a>). However, you can change the testing set
|
||||
to any other domain (e.g., <a class="reference external" href="https://github.com/SpeechColab/GigaSpeech">GigaSpeech</a>) and use an external LM trained on that domain.</p>
|
||||
</div>
|
||||
<div class="admonition hint">
|
||||
<p class="admonition-title">Hint</p>
|
||||
<p>We recommend you to use a GPU for decoding.</p>
|
||||
</div>
|
||||
<p>For illustration purpose, we will use a pre-trained ASR model from this <a class="reference external" href="https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29">link</a>.
|
||||
If you want to train your model from scratch, please have a look at <a class="reference internal" href="../recipes/Non-streaming-ASR/librispeech/pruned_transducer_stateless.html#non-streaming-librispeech-pruned-transducer-stateless"><span class="std std-ref">Pruned transducer statelessX</span></a>.</p>
|
||||
<p>As the initial step, let’s download the pre-trained model.</p>
|
||||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="nv">GIT_LFS_SKIP_SMUDGE</span><span class="o">=</span><span class="m">1</span><span class="w"> </span>git<span class="w"> </span>clone<span class="w"> </span>https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29
|
||||
$<span class="w"> </span><span class="nb">pushd</span><span class="w"> </span>icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
|
||||
$<span class="w"> </span>git<span class="w"> </span>lfs<span class="w"> </span>pull<span class="w"> </span>--include<span class="w"> </span><span class="s2">"pretrained.pt"</span>
|
||||
$<span class="w"> </span>ln<span class="w"> </span>-s<span class="w"> </span>pretrained.pt<span class="w"> </span>epoch-99.pt<span class="w"> </span><span class="c1"># create a symbolic link so that the checkpoint can be loaded</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>As usual, we first test the model’s performance without external LM. This can be done via the following command:</p>
|
||||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="nv">exp_dir</span><span class="o">=</span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/
|
||||
$<span class="w"> </span>./pruned_transducer_stateless7_streaming/decode.py<span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--epoch<span class="w"> </span><span class="m">99</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--avg<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--use-averaged-model<span class="w"> </span>False<span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--exp-dir<span class="w"> </span><span class="nv">$exp_dir</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--bpe-model<span class="w"> </span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model
|
||||
<span class="w"> </span>--max-duration<span class="w"> </span><span class="m">600</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--decode-chunk-len<span class="w"> </span><span class="m">32</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--decoding-method<span class="w"> </span>modified_beam_search
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>The following WERs are achieved on test-clean and test-other:</p>
|
||||
<div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ For test-clean, WER of different settings are:
|
||||
$ beam_size_4 3.11 best for test-clean
|
||||
$ For test-other, WER of different settings are:
|
||||
$ beam_size_4 7.93 best for test-other
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>Now, we will try to improve the above WER numbers via external LM rescoring. We will download
|
||||
a pre-trained LM from this <a class="reference external" href="https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm">link</a>.</p>
|
||||
<div class="admonition note">
|
||||
<p class="admonition-title">Note</p>
|
||||
<p>This is an RNN LM trained on the LibriSpeech text corpus. So it might not be ideal for other corpus.
|
||||
You may also train an RNN LM from scratch. Please refer to this <a class="reference external" href="https://github.com/k2-fsa/icefall/blob/master/icefall/rnn_lm/train.py">script</a>
|
||||
for training an RNN LM and this <a class="reference external" href="https://github.com/k2-fsa/icefall/blob/master/icefall/transformer_lm/train.py">script</a> for training a transformer LM.</p>
|
||||
</div>
|
||||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="c1"># download the external LM</span>
|
||||
$<span class="w"> </span><span class="nv">GIT_LFS_SKIP_SMUDGE</span><span class="o">=</span><span class="m">1</span><span class="w"> </span>git<span class="w"> </span>clone<span class="w"> </span>https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm
|
||||
$<span class="w"> </span><span class="c1"># create a symbolic link so that the checkpoint can be loaded</span>
|
||||
$<span class="w"> </span><span class="nb">pushd</span><span class="w"> </span>icefall-librispeech-rnn-lm/exp
|
||||
$<span class="w"> </span>git<span class="w"> </span>lfs<span class="w"> </span>pull<span class="w"> </span>--include<span class="w"> </span><span class="s2">"pretrained.pt"</span>
|
||||
$<span class="w"> </span>ln<span class="w"> </span>-s<span class="w"> </span>pretrained.pt<span class="w"> </span>epoch-99.pt
|
||||
$<span class="w"> </span><span class="nb">popd</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>With the RNNLM available, we can rescore the n-best hypotheses generated from <cite>modified_beam_search</cite>. Here,
|
||||
<cite>n</cite> should be the number of beams, i.e <code class="docutils literal notranslate"><span class="pre">--beam-size</span></code>. The command for LM rescoring is
|
||||
as follows. Note that the <code class="docutils literal notranslate"><span class="pre">--decoding-method</span></code> is set to <cite>modified_beam_search_lm_rescore</cite> and <code class="docutils literal notranslate"><span class="pre">--use-shallow-fusion</span></code>
|
||||
is set to <cite>False</cite>.</p>
|
||||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="nv">exp_dir</span><span class="o">=</span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
|
||||
$<span class="w"> </span><span class="nv">lm_dir</span><span class="o">=</span>./icefall-librispeech-rnn-lm/exp
|
||||
$<span class="w"> </span><span class="nv">lm_scale</span><span class="o">=</span><span class="m">0</span>.43
|
||||
$<span class="w"> </span>./pruned_transducer_stateless7_streaming/decode.py<span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--epoch<span class="w"> </span><span class="m">99</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--avg<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--use-averaged-model<span class="w"> </span>False<span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--beam-size<span class="w"> </span><span class="m">4</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--exp-dir<span class="w"> </span><span class="nv">$exp_dir</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--max-duration<span class="w"> </span><span class="m">600</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--decode-chunk-len<span class="w"> </span><span class="m">32</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--decoding-method<span class="w"> </span>modified_beam_search_lm_rescore<span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--bpe-model<span class="w"> </span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model
|
||||
<span class="w"> </span>--use-shallow-fusion<span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--lm-type<span class="w"> </span>rnn<span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--lm-exp-dir<span class="w"> </span><span class="nv">$lm_dir</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--lm-epoch<span class="w"> </span><span class="m">99</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--lm-scale<span class="w"> </span><span class="nv">$lm_scale</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--lm-avg<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--rnn-lm-embedding-dim<span class="w"> </span><span class="m">2048</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--rnn-lm-hidden-dim<span class="w"> </span><span class="m">2048</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--rnn-lm-num-layers<span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--lm-vocab-size<span class="w"> </span><span class="m">500</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
<div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ For test-clean, WER of different settings are:
|
||||
$ beam_size_4 2.93 best for test-clean
|
||||
$ For test-other, WER of different settings are:
|
||||
$ beam_size_4 7.6 best for test-other
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>Great! We made some improvements! Increasing the size of the n-best hypotheses will further boost the performance,
|
||||
see the following table:</p>
|
||||
<table class="docutils align-default" id="id1">
|
||||
<caption><span class="caption-number">Table 3 </span><span class="caption-text">WERs of LM rescoring with different beam sizes</span><a class="headerlink" href="#id1" title="Permalink to this table"></a></caption>
|
||||
<colgroup>
|
||||
<col style="width: 33%" />
|
||||
<col style="width: 33%" />
|
||||
<col style="width: 33%" />
|
||||
</colgroup>
|
||||
<thead>
|
||||
<tr class="row-odd"><th class="head"><p>Beam size</p></th>
|
||||
<th class="head"><p>test-clean</p></th>
|
||||
<th class="head"><p>test-other</p></th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr class="row-even"><td><p>4</p></td>
|
||||
<td><p>2.93</p></td>
|
||||
<td><p>7.6</p></td>
|
||||
</tr>
|
||||
<tr class="row-odd"><td><p>8</p></td>
|
||||
<td><p>2.67</p></td>
|
||||
<td><p>7.11</p></td>
|
||||
</tr>
|
||||
<tr class="row-even"><td><p>12</p></td>
|
||||
<td><p>2.59</p></td>
|
||||
<td><p>6.86</p></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p>In fact, we can also apply LODR (see <a class="reference internal" href="LODR.html#lodr"><span class="std std-ref">LODR for RNN Transducer</span></a>) when doing LM rescoring. To do so, we need to
|
||||
download the bi-gram required by LODR:</p>
|
||||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="c1"># download the bi-gram</span>
|
||||
$<span class="w"> </span>git<span class="w"> </span>lfs<span class="w"> </span>install
|
||||
$<span class="w"> </span>git<span class="w"> </span>clone<span class="w"> </span>https://huggingface.co/marcoyang/librispeech_bigram
|
||||
$<span class="w"> </span><span class="nb">pushd</span><span class="w"> </span>data/lang_bpe_500
|
||||
$<span class="w"> </span>ln<span class="w"> </span>-s<span class="w"> </span>../../librispeech_bigram/2gram.arpa<span class="w"> </span>.
|
||||
$<span class="w"> </span><span class="nb">popd</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>Then we can performn LM rescoring + LODR by changing the decoding method to <cite>modified_beam_search_lm_rescore_LODR</cite>.</p>
|
||||
<div class="admonition note">
|
||||
<p class="admonition-title">Note</p>
|
||||
<p>This decoding method requires the dependency of <a class="reference external" href="https://github.com/kpu/kenlm">kenlm</a>. You can install it
|
||||
via this command: <cite>pip install https://github.com/kpu/kenlm/archive/master.zip</cite>.</p>
|
||||
</div>
|
||||
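<p>Conceptually, this method extends the rescoring formula above with a subtracted bi-gram term, mirroring LODR. A hedged sketch using kenlm's Python API (the ARPA path comes from the download step above; the scales and helper name are illustrative):</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span>import kenlm  # pip install https://github.com/kpu/kenlm/archive/master.zip

bigram = kenlm.Model("data/lang_bpe_500/2gram.arpa")

def rescore_lodr(hyp: str, beam_score: float, rnnlm_logprob: float,
                 lm_scale: float = 0.43, bigram_scale: float = 0.24) -> float:
    # kenlm.Model.score returns a log10 probability; a real implementation
    # must keep all terms in the same log base before mixing them.
    return (beam_score
            + lm_scale * rnnlm_logprob
            - bigram_scale * bigram.score(hyp, bos=True, eos=True))
</pre></div>
</div>
<p>The full decoding command is as follows:</p>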
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="nv">exp_dir</span><span class="o">=</span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
|
||||
$<span class="w"> </span><span class="nv">lm_dir</span><span class="o">=</span>./icefall-librispeech-rnn-lm/exp
|
||||
$<span class="w"> </span><span class="nv">lm_scale</span><span class="o">=</span><span class="m">0</span>.43
|
||||
$<span class="w"> </span>./pruned_transducer_stateless7_streaming/decode.py<span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--epoch<span class="w"> </span><span class="m">99</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--avg<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--use-averaged-model<span class="w"> </span>False<span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--beam-size<span class="w"> </span><span class="m">4</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--exp-dir<span class="w"> </span><span class="nv">$exp_dir</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--max-duration<span class="w"> </span><span class="m">600</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--decode-chunk-len<span class="w"> </span><span class="m">32</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--decoding-method<span class="w"> </span>modified_beam_search_lm_rescore_LODR<span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--bpe-model<span class="w"> </span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model
|
||||
<span class="w"> </span>--use-shallow-fusion<span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--lm-type<span class="w"> </span>rnn<span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--lm-exp-dir<span class="w"> </span><span class="nv">$lm_dir</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--lm-epoch<span class="w"> </span><span class="m">99</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--lm-scale<span class="w"> </span><span class="nv">$lm_scale</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--lm-avg<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--rnn-lm-embedding-dim<span class="w"> </span><span class="m">2048</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--rnn-lm-hidden-dim<span class="w"> </span><span class="m">2048</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--rnn-lm-num-layers<span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--lm-vocab-size<span class="w"> </span><span class="m">500</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>You should see the following WERs after executing the commands above:</p>
|
||||
<div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ For test-clean, WER of different settings are:
|
||||
$ beam_size_4 2.9 best for test-clean
|
||||
$ For test-other, WER of different settings are:
|
||||
$ beam_size_4 7.57 best for test-other
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>It’s slightly better than LM rescoring. If we further increase the beam size, we will see
|
||||
further improvements from LM rescoring + LODR:</p>
|
||||
<table class="docutils align-default" id="id2">
|
||||
<caption><span class="caption-number">Table 4 </span><span class="caption-text">WERs of LM rescoring + LODR with different beam sizes</span><a class="headerlink" href="#id2" title="Permalink to this table"></a></caption>
|
||||
<colgroup>
|
||||
<col style="width: 33%" />
|
||||
<col style="width: 33%" />
|
||||
<col style="width: 33%" />
|
||||
</colgroup>
|
||||
<thead>
|
||||
<tr class="row-odd"><th class="head"><p>Beam size</p></th>
|
||||
<th class="head"><p>test-clean</p></th>
|
||||
<th class="head"><p>test-other</p></th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr class="row-even"><td><p>4</p></td>
|
||||
<td><p>2.9</p></td>
|
||||
<td><p>7.57</p></td>
|
||||
</tr>
|
||||
<tr class="row-odd"><td><p>8</p></td>
|
||||
<td><p>2.63</p></td>
|
||||
<td><p>7.04</p></td>
|
||||
</tr>
|
||||
<tr class="row-even"><td><p>12</p></td>
|
||||
<td><p>2.52</p></td>
|
||||
<td><p>6.73</p></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p>As mentioned earlier, LM rescoring is usually faster than shallow-fusion based methods.
|
||||
Here, we benchmark their WERs and decoding speed:</p>
|
||||
<table class="docutils align-default" id="id3">
|
||||
<caption><span class="caption-number">Table 5 </span><span class="caption-text">LM-rescoring-based methods vs shallow-fusion-based methods (The numbers in each field is WER on test-clean, WER on test-other and decoding time on test-clean)</span><a class="headerlink" href="#id3" title="Permalink to this table"></a></caption>
|
||||
<colgroup>
|
||||
<col style="width: 25%" />
|
||||
<col style="width: 25%" />
|
||||
<col style="width: 25%" />
|
||||
<col style="width: 25%" />
|
||||
</colgroup>
|
||||
<thead>
|
||||
<tr class="row-odd"><th class="head"><p>Decoding method</p></th>
|
||||
<th class="head"><p>beam=4</p></th>
|
||||
<th class="head"><p>beam=8</p></th>
|
||||
<th class="head"><p>beam=12</p></th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr class="row-even"><td><p><cite>modified_beam_search</cite></p></td>
|
||||
<td><p>3.11/7.93; 132s</p></td>
|
||||
<td><p>3.1/7.95; 177s</p></td>
|
||||
<td><p>3.1/7.96; 210s</p></td>
|
||||
</tr>
|
||||
<tr class="row-odd"><td><p><cite>modified_beam_search_lm_shallow_fusion</cite></p></td>
|
||||
<td><p>2.77/7.08; 262s</p></td>
|
||||
<td><p>2.62/6.65; 352s</p></td>
|
||||
<td><p>2.58/6.65; 488s</p></td>
|
||||
</tr>
|
||||
<tr class="row-even"><td><p>LODR</p></td>
|
||||
<td><p>2.61/6.74; 400s</p></td>
|
||||
<td><p>2.45/6.38; 610s</p></td>
|
||||
<td><p>2.4/6.23; 870s</p></td>
|
||||
</tr>
|
||||
<tr class="row-odd"><td><p><cite>modified_beam_search_lm_rescore</cite></p></td>
|
||||
<td><p>2.93/7.6; 156s</p></td>
|
||||
<td><p>2.67/7.11; 203s</p></td>
|
||||
<td><p>2.59/6.86; 255s</p></td>
|
||||
</tr>
|
||||
<tr class="row-even"><td><p><cite>modified_beam_search_lm_rescore_LODR</cite></p></td>
|
||||
<td><p>2.9/7.57; 160s</p></td>
|
||||
<td><p>2.63/7.04; 203s</p></td>
|
||||
<td><p>2.52/6.73; 263s</p></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<div class="admonition note">
|
||||
<p class="admonition-title">Note</p>
|
||||
<p>Decoding is performed with a single 32G V100, we set <code class="docutils literal notranslate"><span class="pre">--max-duration</span></code> to 600.
|
||||
The decoding times here are only for reference and may vary.</p>
|
||||
</div>
|
||||
</section>
|
||||
|
||||
|
||||
</div>
|
||||
</div>
|
||||
<footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
|
||||
<a href="LODR.html" class="btn btn-neutral float-left" title="LODR for RNN Transducer" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
|
||||
</div>
|
||||
|
||||
<hr/>
|
||||
|
||||
<div role="contentinfo">
|
||||
<p>© Copyright 2021, icefall development team.</p>
|
||||
</div>
|
||||
|
||||
Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
|
||||
<a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
|
||||
provided by <a href="https://readthedocs.org">Read the Docs</a>.
|
||||
|
||||
|
||||
</footer>
|
||||
</div>
|
||||
</div>
|
||||
</section>
|
||||
</div>
|
||||
<script>
|
||||
jQuery(function () {
|
||||
SphinxRtdTheme.Navigation.enable(true);
|
||||
});
|
||||
</script>
|
||||
|
||||
</body>
|
||||
</html>
|
||||
296
decoding-with-langugage-models/shallow-fusion.html
Normal file
296
decoding-with-langugage-models/shallow-fusion.html
Normal file
@ -0,0 +1,296 @@
|
||||
<!DOCTYPE html>
|
||||
<html class="writer-html5" lang="en" >
|
||||
<head>
|
||||
<meta charset="utf-8" /><meta name="generator" content="Docutils 0.18.1: http://docutils.sourceforge.net/" />
|
||||
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||||
<title>Shallow fusion for Transducer — icefall 0.1 documentation</title>
|
||||
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
|
||||
<link rel="stylesheet" href="../_static/css/theme.css" type="text/css" />
|
||||
<!--[if lt IE 9]>
|
||||
<script src="../_static/js/html5shiv.min.js"></script>
|
||||
<![endif]-->
|
||||
|
||||
<script src="../_static/jquery.js"></script>
|
||||
<script src="../_static/_sphinx_javascript_frameworks_compat.js"></script>
|
||||
<script data-url_root="../" id="documentation_options" src="../_static/documentation_options.js"></script>
|
||||
<script src="../_static/doctools.js"></script>
|
||||
<script src="../_static/sphinx_highlight.js"></script>
|
||||
<script src="../_static/js/theme.js"></script>
|
||||
<link rel="index" title="Index" href="../genindex.html" />
|
||||
<link rel="search" title="Search" href="../search.html" />
|
||||
<link rel="next" title="LODR for RNN Transducer" href="LODR.html" />
|
||||
<link rel="prev" title="Decoding with language models" href="index.html" />
|
||||
</head>
|
||||
|
||||
<body class="wy-body-for-nav">
|
||||
<div class="wy-grid-for-nav">
|
||||
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
|
||||
<div class="wy-side-scroll">
|
||||
<div class="wy-side-nav-search" >
|
||||
|
||||
|
||||
|
||||
<a href="../index.html" class="icon icon-home">
|
||||
icefall
|
||||
</a>
|
||||
<div role="search">
|
||||
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
|
||||
<input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
|
||||
<input type="hidden" name="check_keywords" value="yes" />
|
||||
<input type="hidden" name="area" value="default" />
|
||||
</form>
|
||||
</div>
|
||||
</div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
|
||||
<p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
|
||||
<ul>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../installation/index.html">Installation</a></li>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../faqs.html">Frequently Asked Questions (FAQs)</a></li>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../model-export/index.html">Model export</a></li>
|
||||
</ul>
|
||||
<ul>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../recipes/index.html">Recipes</a></li>
|
||||
</ul>
|
||||
<ul>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
|
||||
</ul>
|
||||
<ul class="current">
|
||||
<li class="toctree-l1 current"><a class="reference internal" href="index.html">Decoding with language models</a><ul class="current">
|
||||
<li class="toctree-l2 current"><a class="current reference internal" href="#">Shallow fusion for Transducer</a></li>
|
||||
<li class="toctree-l2"><a class="reference internal" href="LODR.html">LODR for RNN Transducer</a></li>
|
||||
<li class="toctree-l2"><a class="reference internal" href="rescoring.html">LM rescoring for Transducer</a></li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
</div>
|
||||
</div>
|
||||
</nav>
|
||||
|
||||
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
|
||||
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
|
||||
<a href="../index.html">icefall</a>
|
||||
</nav>
|
||||
|
||||
<div class="wy-nav-content">
|
||||
<div class="rst-content">
|
||||
<div role="navigation" aria-label="Page navigation">
|
||||
<ul class="wy-breadcrumbs">
|
||||
<li><a href="../index.html" class="icon icon-home" aria-label="Home"></a></li>
|
||||
<li class="breadcrumb-item"><a href="index.html">Decoding with language models</a></li>
|
||||
<li class="breadcrumb-item active">Shallow fusion for Transducer</li>
|
||||
<li class="wy-breadcrumbs-aside">
|
||||
<a href="https://github.com/k2-fsa/icefall/blob/master/docs/source/decoding-with-langugage-models/shallow-fusion.rst" class="fa fa-github"> Edit on GitHub</a>
|
||||
</li>
|
||||
</ul>
|
||||
<hr/>
|
||||
</div>
|
||||
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
|
||||
<div itemprop="articleBody">
|
||||
|
||||
<section id="shallow-fusion-for-transducer">
|
||||
<span id="shallow-fusion"></span><h1>Shallow fusion for Transducer<a class="headerlink" href="#shallow-fusion-for-transducer" title="Permalink to this heading"></a></h1>
|
||||
<p>External language models (LM) are commonly used to improve WERs for E2E ASR models.
|
||||
This tutorial shows you how to perform <code class="docutils literal notranslate"><span class="pre">shallow</span> <span class="pre">fusion</span></code> with an external LM
|
||||
to improve the word error rate (WER) of a transducer model.</p>
|
||||
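<p>The idea is simple: at every step of beam search, the external LM's log-probability for a candidate token is interpolated with the ASR model's score. A minimal, hypothetical sketch (the names are illustrative, not icefall's internal API):</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span>def shallow_fusion_score(log_p_asr: float, log_p_lm: float,
                         lm_scale: float = 0.29) -> float:
    # Per-token interpolation applied when expanding each hypothesis;
    # lm_scale corresponds to the --lm-scale argument used below.
    return log_p_asr + lm_scale * log_p_lm

# Tokens the LM considers likely are penalized less, re-ranking the beam:
print(shallow_fusion_score(-1.0, -0.2))
</pre></div>
</div>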
<div class="admonition note">
|
||||
<p class="admonition-title">Note</p>
|
||||
<p>This tutorial is based on the recipe
|
||||
<a class="reference external" href="https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming">pruned_transducer_stateless7_streaming</a>,
|
||||
which is a streaming transducer model trained on <a class="reference external" href="https://www.openslr.org/12">LibriSpeech</a>.
|
||||
However, you can easily apply shallow fusion to other recipes.
|
||||
If you encounter any problems, please open an issue in <a class="reference external" href="https://github.com/k2-fsa/icefall/issues">icefall</a>.</p>
|
||||
</div>
|
||||
<div class="admonition note">
|
||||
<p class="admonition-title">Note</p>
|
||||
<p>For simplicity, the training and testing corpus in this tutorial is the same (<a class="reference external" href="https://www.openslr.org/12">LibriSpeech</a>). However, you can change the testing set
|
||||
to any other domain (e.g., <a class="reference external" href="https://github.com/SpeechColab/GigaSpeech">GigaSpeech</a>) and use an external LM trained on that domain.</p>
|
||||
</div>
|
||||
<div class="admonition hint">
|
||||
<p class="admonition-title">Hint</p>
|
||||
<p>We recommend you to use a GPU for decoding.</p>
|
||||
</div>
|
||||
<p>For illustration purpose, we will use a pre-trained ASR model from this <a class="reference external" href="https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29">link</a>.
|
||||
If you want to train your model from scratch, please have a look at <a class="reference internal" href="../recipes/Non-streaming-ASR/librispeech/pruned_transducer_stateless.html#non-streaming-librispeech-pruned-transducer-stateless"><span class="std std-ref">Pruned transducer statelessX</span></a>.</p>
|
||||
<p>As the initial step, let’s download the pre-trained model.</p>
|
||||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="nv">GIT_LFS_SKIP_SMUDGE</span><span class="o">=</span><span class="m">1</span><span class="w"> </span>git<span class="w"> </span>clone<span class="w"> </span>https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29
|
||||
$<span class="w"> </span><span class="nb">pushd</span><span class="w"> </span>icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
|
||||
$<span class="w"> </span>git<span class="w"> </span>lfs<span class="w"> </span>pull<span class="w"> </span>--include<span class="w"> </span><span class="s2">"pretrained.pt"</span>
|
||||
$<span class="w"> </span>ln<span class="w"> </span>-s<span class="w"> </span>pretrained.pt<span class="w"> </span>epoch-99.pt<span class="w"> </span><span class="c1"># create a symbolic link so that the checkpoint can be loaded</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>To test the model, let’s have a look at the decoding results without using LM. This can be done via the following command:</p>
|
||||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="nv">exp_dir</span><span class="o">=</span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/
|
||||
$<span class="w"> </span>./pruned_transducer_stateless7_streaming/decode.py<span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--epoch<span class="w"> </span><span class="m">99</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--avg<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--use-averaged-model<span class="w"> </span>False<span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--exp-dir<span class="w"> </span><span class="nv">$exp_dir</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--bpe-model<span class="w"> </span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model
|
||||
<span class="w"> </span>--max-duration<span class="w"> </span><span class="m">600</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--decode-chunk-len<span class="w"> </span><span class="m">32</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--decoding-method<span class="w"> </span>modified_beam_search
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>The following WERs are achieved on test-clean and test-other:</p>
|
||||
<div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ For test-clean, WER of different settings are:
|
||||
$ beam_size_4 3.11 best for test-clean
|
||||
$ For test-other, WER of different settings are:
|
||||
$ beam_size_4 7.93 best for test-other
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>These are already good numbers! But we can further improve it by using shallow fusion with external LM.
|
||||
Training a language model usually takes a long time, so we can download a pre-trained LM from this <a class="reference external" href="https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm">link</a>.</p>
|
||||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="c1"># download the external LM</span>
|
||||
$<span class="w"> </span><span class="nv">GIT_LFS_SKIP_SMUDGE</span><span class="o">=</span><span class="m">1</span><span class="w"> </span>git<span class="w"> </span>clone<span class="w"> </span>https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm
|
||||
$<span class="w"> </span><span class="c1"># create a symbolic link so that the checkpoint can be loaded</span>
|
||||
$<span class="w"> </span><span class="nb">pushd</span><span class="w"> </span>icefall-librispeech-rnn-lm/exp
|
||||
$<span class="w"> </span>git<span class="w"> </span>lfs<span class="w"> </span>pull<span class="w"> </span>--include<span class="w"> </span><span class="s2">"pretrained.pt"</span>
|
||||
$<span class="w"> </span>ln<span class="w"> </span>-s<span class="w"> </span>pretrained.pt<span class="w"> </span>epoch-99.pt
|
||||
$<span class="w"> </span><span class="nb">popd</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
<div class="admonition note">
|
||||
<p class="admonition-title">Note</p>
|
||||
<p>This is an RNN LM trained on the LibriSpeech text corpus. So it might not be ideal for other corpus.
|
||||
You may also train an RNN LM from scratch. Please refer to this <a class="reference external" href="https://github.com/k2-fsa/icefall/blob/master/icefall/rnn_lm/train.py">script</a>
|
||||
for training an RNN LM and this <a class="reference external" href="https://github.com/k2-fsa/icefall/blob/master/icefall/transformer_lm/train.py">script</a> for training a transformer LM.</p>
|
||||
</div>
|
||||
<p>To use shallow fusion for decoding, we can execute the following command:</p>
|
||||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="nv">exp_dir</span><span class="o">=</span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
|
||||
$<span class="w"> </span><span class="nv">lm_dir</span><span class="o">=</span>./icefall-librispeech-rnn-lm/exp
|
||||
$<span class="w"> </span><span class="nv">lm_scale</span><span class="o">=</span><span class="m">0</span>.29
|
||||
$<span class="w"> </span>./pruned_transducer_stateless7_streaming/decode.py<span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--epoch<span class="w"> </span><span class="m">99</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--avg<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--use-averaged-model<span class="w"> </span>False<span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--beam-size<span class="w"> </span><span class="m">4</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--exp-dir<span class="w"> </span><span class="nv">$exp_dir</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--max-duration<span class="w"> </span><span class="m">600</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--decode-chunk-len<span class="w"> </span><span class="m">32</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--decoding-method<span class="w"> </span>modified_beam_search_lm_shallow_fusion<span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--bpe-model<span class="w"> </span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model
|
||||
<span class="w"> </span>--use-shallow-fusion<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--lm-type<span class="w"> </span>rnn<span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--lm-exp-dir<span class="w"> </span><span class="nv">$lm_dir</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--lm-epoch<span class="w"> </span><span class="m">99</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--lm-scale<span class="w"> </span><span class="nv">$lm_scale</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--lm-avg<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--rnn-lm-embedding-dim<span class="w"> </span><span class="m">2048</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--rnn-lm-hidden-dim<span class="w"> </span><span class="m">2048</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--rnn-lm-num-layers<span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="se">\</span>
|
||||
<span class="w"> </span>--lm-vocab-size<span class="w"> </span><span class="m">500</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>Note that we set <code class="docutils literal notranslate"><span class="pre">--decoding-method</span> <span class="pre">modified_beam_search_lm_shallow_fusion</span></code> and <code class="docutils literal notranslate"><span class="pre">--use-shallow-fusion</span> <span class="pre">True</span></code>
to use shallow fusion. <code class="docutils literal notranslate"><span class="pre">--lm-type</span></code> specifies the type of neural LM to use; you can choose
between <code class="docutils literal notranslate"><span class="pre">rnn</span></code> and <code class="docutils literal notranslate"><span class="pre">transformer</span></code>. The following three arguments are specific to the RNN LM (the combined score used during search is shown after this list):</p>
<ul class="simple">
|
||||
<li><dl class="simple">
|
||||
<dt><code class="docutils literal notranslate"><span class="pre">--rnn-lm-embedding-dim</span></code></dt><dd><p>The embedding dimension of the RNN LM</p>
|
||||
</dd>
|
||||
</dl>
|
||||
</li>
|
||||
<li><dl class="simple">
|
||||
<dt><code class="docutils literal notranslate"><span class="pre">--rnn-lm-hidden-dim</span></code></dt><dd><p>The hidden dimension of the RNN LM</p>
|
||||
</dd>
|
||||
</dl>
|
||||
</li>
|
||||
<li><dl class="simple">
|
||||
<dt><code class="docutils literal notranslate"><span class="pre">--rnn-lm-num-layers</span></code></dt><dd><p>The number of RNN layers in the RNN LM.</p>
|
||||
</dd>
|
||||
</dl>
|
||||
</li>
|
||||
</ul>
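<p>Concretely, shallow fusion adds the external LM's log-probability, scaled by <code class="docutils literal notranslate"><span class="pre">--lm-scale</span></code>, to the transducer's log-probability at each step of the beam search. In the notation used by the Density Ratio and LODR formulas, the combined score is (a sketch of the standard shallow-fusion interpolation, not a quote of the decoding code):</p>
<div class="math notranslate nohighlight">
\[\text{score}\left(y_u|\mathit{x},y\right) =
\log p_{rnnt}\left(y_u|\mathit{x},y_{1:u-1}\right) +
\lambda \log p_{\text{LM}}\left(y_u|\mathit{x},y_{1:u-1}\right)\]
</div>
<p>where <span class="math notranslate nohighlight">\(\lambda\)</span> corresponds to <code class="docutils literal notranslate"><span class="pre">--lm-scale</span></code>.</p>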
<p>The decoding results obtained with the above command are shown below.</p>
<div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ For test-clean, WER of different settings are:
|
||||
$ beam_size_4 2.77 best for test-clean
|
||||
$ For test-other, WER of different settings are:
|
||||
$ beam_size_4 7.08 best for test-other
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>The improvement brought by shallow fusion is obvious! The relative WER reduction on test-other is around 10.5%
(i.e., down from a no-LM baseline of roughly 7.9 to 7.08).
A few parameters can be tuned to further boost the performance of shallow fusion:</p>
<ul>
<li><p><code class="docutils literal notranslate"><span class="pre">--lm-scale</span></code></p>
<blockquote>
<div><p>Controls the scale of the LM. If it is too small, the external language model may not be fully utilized; if it is too large,
the LM score may dominate during decoding, leading to a worse WER. A typical value is around 0.3 (see the sweep sketched after this list).</p>
</div></blockquote>
</li>
<li><p><code class="docutils literal notranslate"><span class="pre">--beam-size</span></code></p>
<blockquote>
<div><p>The number of active paths kept during beam search. It controls the trade-off between decoding efficiency and accuracy.</p>
</div></blockquote>
</li>
</ul>
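<p>Since the best value of <code class="docutils literal notranslate"><span class="pre">--lm-scale</span></code> is corpus-dependent, a simple way to tune it is to sweep a few values on a held-out set and keep the one with the lowest WER. Below is a hedged sketch of such a sweep; every flag other than <code class="docutils literal notranslate"><span class="pre">--lm-scale</span></code> is assumed to match the decoding command shown above:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span># Sketch: sweep the LM scale, then compare the resulting WERs.
for lm_scale in 0.1 0.2 0.29 0.4 0.5; do
    ./pruned_transducer_stateless7_streaming/decode.py \
        --epoch 99 --avg 1 --use-averaged-model False \
        --beam-size 4 --exp-dir $exp_dir \
        --max-duration 600 --decode-chunk-len 32 \
        --decoding-method modified_beam_search_lm_shallow_fusion \
        --bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
        --use-shallow-fusion 1 --lm-type rnn \
        --lm-exp-dir $lm_dir --lm-epoch 99 --lm-avg 1 \
        --lm-scale $lm_scale \
        --rnn-lm-embedding-dim 2048 --rnn-lm-hidden-dim 2048 \
        --rnn-lm-num-layers 3 --lm-vocab-size 500
done
</pre></div>
</div>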
<p>Here, we also show how <code class="docutils literal notranslate"><span class="pre">--beam-size</span></code> affects the WER and decoding time:</p>
<table class="docutils align-default" id="id2">
|
||||
<caption><span class="caption-number">Table 1 </span><span class="caption-text">WERs and decoding time (on test-clean) of shallow fusion with different beam sizes</span><a class="headerlink" href="#id2" title="Permalink to this table"></a></caption>
|
||||
<colgroup>
|
||||
<col style="width: 25%" />
|
||||
<col style="width: 25%" />
|
||||
<col style="width: 25%" />
|
||||
<col style="width: 25%" />
|
||||
</colgroup>
|
||||
<thead>
|
||||
<tr class="row-odd"><th class="head"><p>Beam size</p></th>
|
||||
<th class="head"><p>test-clean</p></th>
|
||||
<th class="head"><p>test-other</p></th>
|
||||
<th class="head"><p>Decoding time on test-clean (s)</p></th>
|
||||
</tr>
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr class="row-even"><td><p>4</p></td>
|
||||
<td><p>2.77</p></td>
|
||||
<td><p>7.08</p></td>
|
||||
<td><p>262</p></td>
|
||||
</tr>
|
||||
<tr class="row-odd"><td><p>8</p></td>
|
||||
<td><p>2.62</p></td>
|
||||
<td><p>6.65</p></td>
|
||||
<td><p>352</p></td>
|
||||
</tr>
|
||||
<tr class="row-even"><td><p>12</p></td>
|
||||
<td><p>2.58</p></td>
|
||||
<td><p>6.65</p></td>
|
||||
<td><p>488</p></td>
|
||||
</tr>
|
||||
</tbody>
|
||||
</table>
|
||||
<p>As we can see, a larger beam size during shallow fusion improves the WER, but also slows down decoding.</p>
</section>


</div>
</div>
<footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
<a href="index.html" class="btn btn-neutral float-left" title="Decoding with language models" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
<a href="LODR.html" class="btn btn-neutral float-right" title="LODR for RNN Transducer" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
</div>

<hr/>

<div role="contentinfo">
<p>© Copyright 2021, icefall development team.</p>
</div>

Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
<a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
provided by <a href="https://readthedocs.org">Read the Docs</a>.


</footer>
</div>
</div>
</section>
</div>
<script>
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>

</body>
</html>
@@ -59,6 +59,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -51,6 +51,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -58,6 +58,9 @@
<li class="toctree-l2"><a class="reference internal" href="spaces.html">Huggingface spaces</a></li>
</ul>
</li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -58,6 +58,9 @@
<li class="toctree-l2"><a class="reference internal" href="spaces.html">Huggingface spaces</a></li>
</ul>
</li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -19,6 +19,7 @@
<script src="../_static/js/theme.js"></script>
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="next" title="Decoding with language models" href="../decoding-with-langugage-models/index.html" />
<link rel="prev" title="Pre-trained models" href="pretrained-models.html" />
</head>

@@ -60,6 +61,9 @@
</li>
</ul>
</li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>
@@ -144,6 +148,7 @@ the following YouTube channel by <a class="reference external" href="https://www
</div>
<footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
<a href="pretrained-models.html" class="btn btn-neutral float-left" title="Pre-trained models" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
<a href="../decoding-with-langugage-models/index.html" class="btn btn-neutral float-right" title="Decoding with language models" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
</div>

<hr/>

13
index.html
@@ -53,6 +53,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>
@@ -148,6 +151,16 @@ speech recognition recipes using <a class="reference external" href="https://git
</li>
</ul>
</div>
<div class="toctree-wrapper compound">
<ul>
<li class="toctree-l1"><a class="reference internal" href="decoding-with-langugage-models/index.html">Decoding with language models</a><ul>
<li class="toctree-l2"><a class="reference internal" href="decoding-with-langugage-models/shallow-fusion.html">Shallow fusion for Transducer</a></li>
<li class="toctree-l2"><a class="reference internal" href="decoding-with-langugage-models/LODR.html">LODR for RNN Transducer</a></li>
<li class="toctree-l2"><a class="reference internal" href="decoding-with-langugage-models/rescoring.html">LM rescoring for Transducer</a></li>
</ul>
</li>
</ul>
</div>
</section>


@@ -76,6 +76,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -67,6 +67,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -75,6 +75,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -75,6 +75,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -74,6 +74,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -66,6 +66,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -68,6 +68,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -66,6 +66,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -66,6 +66,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -61,6 +61,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

BIN
objects.inv
Binary file not shown.
@@ -69,6 +69,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -69,6 +69,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -69,6 +69,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -69,6 +69,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -64,6 +64,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -72,6 +72,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -72,6 +72,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>
@@ -103,7 +106,7 @@
<section id="distillation-with-hubert">
<h1>Distillation with HuBERT<a class="headerlink" href="#distillation-with-hubert" title="Permalink to this heading"></a></h1>
<p>This tutorial shows you how to perform knowledge distillation in <a href="#id7"><span class="problematic" id="id8">`icefall`_</span></a>
<p>This tutorial shows you how to perform knowledge distillation in <a class="reference external" href="https://github.com/k2-fsa/icefall">icefall</a>
with the <a class="reference external" href="https://www.openslr.org/12">LibriSpeech</a> dataset. The distillation method
used here is called “Multi Vector Quantization Knowledge Distillation” (MVQ-KD).
Please have a look at our paper <a class="reference external" href="https://arxiv.org/abs/2211.00508">Predicting Multi-Codebook Vector Quantization Indexes for Knowledge Distillation</a>
@@ -119,7 +122,7 @@ encounter any problems, please open an issue here <a class="reference external"
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>We assume you have read the page <a class="reference internal" href="../../../installation/index.html#install-icefall"><span class="std std-ref">Installation</span></a> and have setup
the environment for <a href="#id9"><span class="problematic" id="id10">`icefall`_</span></a>.</p>
the environment for <a class="reference external" href="https://github.com/k2-fsa/icefall">icefall</a>.</p>
</div>
<div class="admonition hint">
<p class="admonition-title">Hint</p>
@@ -190,7 +193,7 @@ run <a class="reference external" href="https://github.com/k2-fsa/icefall/blob/m
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>There are 5 stages in total, the first and second stage will be automatically skipped
when choosing to downloaded codebook indexes prepared by <a href="#id11"><span class="problematic" id="id12">`icefall`_</span></a>.
when choosing to downloaded codebook indexes prepared by <a class="reference external" href="https://github.com/k2-fsa/icefall">icefall</a>.
Of course, you can extract and compute the codebook indexes by yourself. This
will require you downloading a HuBERT-XL model and it can take a while for
the extraction of codebook indexes.</p>
@@ -226,10 +229,10 @@ and prepares MVQ-augmented training manifests.</p>
</div>
<p>Please see the
following screenshot for the output of an example execution.</p>
<figure class="align-center" id="id5">
<figure class="align-center" id="id4">
<a class="reference internal image-reference" href="../../../_images/distillation_codebook.png"><img alt="Downloading codebook indexes and preparing training manifest." src="../../../_images/distillation_codebook.png" style="width: 800px;" /></a>
<figcaption>
<p><span class="caption-number">Fig. 6 </span><span class="caption-text">Downloading codebook indexes and preparing training manifest.</span><a class="headerlink" href="#id5" title="Permalink to this image"></a></p>
<p><span class="caption-number">Fig. 6 </span><span class="caption-text">Downloading codebook indexes and preparing training manifest.</span><a class="headerlink" href="#id4" title="Permalink to this image"></a></p>
</figcaption>
</figure>
<div class="admonition hint">
@@ -241,10 +244,10 @@ set <code class="docutils literal notranslate"><span class="pre">use_extracted_c
<code class="docutils literal notranslate"><span class="pre">num_codebooks</span></code> by yourself.</p>
</div>
<p>Now, you should see the following files under the directory <code class="docutils literal notranslate"><span class="pre">./data/vq_fbank_layer36_cb8</span></code>.</p>
<figure class="align-center" id="id6">
<figure class="align-center" id="id5">
<a class="reference internal image-reference" href="../../../_images/distillation_directory.png"><img alt="MVQ-augmented training manifests" src="../../../_images/distillation_directory.png" style="width: 800px;" /></a>
<figcaption>
<p><span class="caption-number">Fig. 7 </span><span class="caption-text">MVQ-augmented training manifests.</span><a class="headerlink" href="#id6" title="Permalink to this image"></a></p>
<p><span class="caption-number">Fig. 7 </span><span class="caption-text">MVQ-augmented training manifests.</span><a class="headerlink" href="#id5" title="Permalink to this image"></a></p>
</figcaption>
</figure>
<p>Whola! You are ready to perform knowledge distillation training now!</p>

@@ -72,6 +72,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -72,6 +72,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>
@@ -381,10 +384,10 @@ $<span class="w"> </span>tensorboard<span class="w"> </span>dev<span class="w">
<p>Note there is a URL in the above output. Click it and you will see
the following screenshot:</p>
<blockquote>
<div><figure class="align-center" id="id9">
<div><figure class="align-center" id="id4">
<a class="reference external image-reference" href="https://tensorboard.dev/experiment/QOGSPBgsR8KzcRMmie9JGw/"><img alt="TensorBoard screenshot" src="../../../_images/librispeech-pruned-transducer-tensorboard-log.jpg" style="width: 600px;" /></a>
<figcaption>
<p><span class="caption-number">Fig. 5 </span><span class="caption-text">TensorBoard screenshot.</span><a class="headerlink" href="#id9" title="Permalink to this image"></a></p>
<p><span class="caption-number">Fig. 5 </span><span class="caption-text">TensorBoard screenshot.</span><a class="headerlink" href="#id4" title="Permalink to this image"></a></p>
</figcaption>
</figure>
</div></blockquote>

@@ -72,6 +72,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -72,6 +72,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -72,6 +72,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -68,6 +68,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -68,6 +68,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -68,6 +68,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -67,6 +67,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -67,6 +67,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -62,6 +62,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -66,6 +66,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -67,6 +67,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -67,6 +67,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>
@@ -380,10 +383,10 @@ $<span class="w"> </span>tensorboard<span class="w"> </span>dev<span class="w">
<p>Note there is a URL in the above output. Click it and you will see
the following screenshot:</p>
<blockquote>
<div><figure class="align-center" id="id4">
<div><figure class="align-center" id="id5">
<a class="reference external image-reference" href="https://tensorboard.dev/experiment/lzGnETjwRxC3yghNMd4kPw/"><img alt="TensorBoard screenshot" src="../../../_images/librispeech-lstm-transducer-tensorboard-log.png" style="width: 600px;" /></a>
<figcaption>
<p><span class="caption-number">Fig. 10 </span><span class="caption-text">TensorBoard screenshot.</span><a class="headerlink" href="#id4" title="Permalink to this image"></a></p>
<p><span class="caption-number">Fig. 10 </span><span class="caption-text">TensorBoard screenshot.</span><a class="headerlink" href="#id5" title="Permalink to this image"></a></p>
</figcaption>
</figure>
</div></blockquote>

@@ -67,6 +67,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>
@@ -400,10 +403,10 @@ $<span class="w"> </span>tensorboard<span class="w"> </span>dev<span class="w">
<p>Note there is a URL in the above output. Click it and you will see
the following screenshot:</p>
<blockquote>
<div><figure class="align-center" id="id10">
<div><figure class="align-center" id="id5">
<a class="reference external image-reference" href="https://tensorboard.dev/experiment/97VKXf80Ru61CnP2ALWZZg/"><img alt="TensorBoard screenshot" src="../../../_images/streaming-librispeech-pruned-transducer-tensorboard-log.jpg" style="width: 600px;" /></a>
<figcaption>
<p><span class="caption-number">Fig. 9 </span><span class="caption-text">TensorBoard screenshot.</span><a class="headerlink" href="#id10" title="Permalink to this image"></a></p>
<p><span class="caption-number">Fig. 9 </span><span class="caption-text">TensorBoard screenshot.</span><a class="headerlink" href="#id5" title="Permalink to this image"></a></p>
</figcaption>
</figure>
</div></blockquote>

@@ -67,6 +67,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -58,6 +58,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>

@@ -54,6 +54,9 @@
<ul>
<li class="toctree-l1"><a class="reference internal" href="contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>

</div>
File diff suppressed because one or more lines are too long