mirror of
https://github.com/k2-fsa/icefall.git
synced 2025-09-19 05:54:20 +00:00
add shallow fusion documentation
This commit is contained in:
parent
ad24b4ad9e
commit
645e2a5ed8
4
docs/source/decoding-with-langugage-models/LODR.rst
Normal file
4
docs/source/decoding-with-langugage-models/LODR.rst
Normal file
@ -0,0 +1,4 @@
|
||||
.. _LODR:
|
||||
|
||||
LODR for RNN Transducer
|
||||
=======================
|
11
docs/source/decoding-with-langugage-models/index.rst
Normal file
11
docs/source/decoding-with-langugage-models/index.rst
Normal file
@ -0,0 +1,11 @@
|
||||
Decoding with language models
|
||||
=============================
|
||||
|
||||
This section describes how to use external langugage models
|
||||
during decoding to improve the WER of transducer models.
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
shallow-fusion
|
||||
LODR
|
152
docs/source/decoding-with-langugage-models/shallow-fusion.rst
Normal file
152
docs/source/decoding-with-langugage-models/shallow-fusion.rst
Normal file
@ -0,0 +1,152 @@
|
||||
.. _shallow_fusion:
|
||||
|
||||
Shallow fusion for RNN Transducer
|
||||
=================================
|
||||
|
||||
In real-life scenario, there is often a mismatch between the training corpus and the target corpus space.
|
||||
Therefore, we often use an external language model (LM) to improve the accuracy of the ASR model
|
||||
on the target space. Even if the training and testing domain are similar, using external langugage model
|
||||
can still help the ASR model if the training corpus is not that large.
|
||||
This tutorial shows you how to perform ``shallow fusion`` with an external LM
|
||||
to improve the word-error-rate of a RNN Transducer model.
|
||||
|
||||
.. note::
|
||||
|
||||
This tutorial is based on the recipe
|
||||
`pruned_transducer_stateless7_streaming <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming>`_,
|
||||
which is a streaming transducer model trained on `LibriSpeech`_.
|
||||
However, you can easily apply shallow fusion to other recipes.
|
||||
If you encounter any problems, please open an issue here `icefall <https://github.com/k2-fsa/icefall/issues>`_.
|
||||
|
||||
.. note::
|
||||
|
||||
For simplicity, the training and testing corpus in this tutorial is the same (`LibriSpeech`_). However, you can change the testing set
|
||||
to any other domains (e.g GigaSpeech) and use an external LM trained on that domain.
|
||||
|
||||
.. HINT::
|
||||
|
||||
We recommend you to use a GPU for decoding.
|
||||
|
||||
For illustration purpose, we will use a pre-trained ASR model from this `link <https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29>`_.
|
||||
If you want to train your model from scratch, please have a look at :ref:`non_streaming_librispeech_pruned_transducer_stateless`.
|
||||
|
||||
As the initial step, let's download the pre-trained model.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
$ git lfs install
|
||||
$ git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29
|
||||
$ pushd icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
|
||||
$ ln -s pretrained.pt epoch-99.pt # create a symbolic link so that the checkpoint can be loaded
|
||||
|
||||
To test the model, let's have a look at the decoding results without using LM. This can be done via the following command:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
$ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/
|
||||
$ ./pruned_transducer_stateless7_streaming/decode.py \
|
||||
--epoch 30 \
|
||||
--avg 9 \
|
||||
--exp-dir $exp_dir \
|
||||
--max-duration 600 \
|
||||
--decode-chunk-len 32 \
|
||||
--decoding-method modified_beam_search
|
||||
|
||||
The following WERs are achieved on test-clean and test-other:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
$ For test-clean, WER of different settings are:
|
||||
$ beam_size_4 3.11 best for test-clean
|
||||
$ For test-other, WER of different settings are:
|
||||
$ beam_size_4 7.93 best for test-other
|
||||
|
||||
These are already good numbers! But we can further improve it by using shallow fusion with external LM.
|
||||
Training a language model usually takes a long time, we can download a pre-trained LM from this `link <https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm>`_.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
$ git lfs install
|
||||
$ git clone https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm
|
||||
$ pushd icefall-librispeech-rnn-lm/exp
|
||||
$ ln -s pretrained.pt epoch-99.pt # create a symbolic link so that the checkpoint can be loaded
|
||||
$ popd
|
||||
|
||||
.. note::
|
||||
|
||||
This is an RNN LM trained on the LibriSpeech text corpus. So it might not be ideal for other corpus.
|
||||
You may also train a RNN LM from scratch. Please refer to this `script <https://github.com/k2-fsa/icefall/blob/master/icefall/rnn_lm/train.py>`_
|
||||
for training a RNN LM and this `script <https://github.com/k2-fsa/icefall/blob/master/icefall/transformer_lm/train.py>`_ to train a transformer LM.
|
||||
|
||||
To use shallow fusion for decoding, we can execute the following command:
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
$ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
|
||||
$ lm_dir=./icefall-librispeech-rnn-lm/exp
|
||||
$ ./pruned_transducer_stateless7_streaming/decode.py \
|
||||
--epoch 99 \
|
||||
--avg 1 \
|
||||
--use-averaged-model False \
|
||||
--beam-size 4 \
|
||||
--exp-dir $exp_dir \
|
||||
--max-duration 600 \
|
||||
--decode-chunk-len 32 \
|
||||
--decoding-method modified_beam_search_lm_shallow_fusion \
|
||||
--bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model
|
||||
--use-shallow-fusion 1 \
|
||||
--lm-type rnn \
|
||||
--lm-exp-dir $lm_dir \
|
||||
--lm-epoch 99 \
|
||||
--lm-scale 0.29 \
|
||||
--lm-avg 1 \
|
||||
--rnn-lm-embedding-dim 2048 \
|
||||
--rnn-lm-hidden-dim 2048 \
|
||||
--rnn-lm-num-layers 3 \
|
||||
--lm-vocab-size 500
|
||||
|
||||
Note that we set ``--decoding-method modified_beam_search_lm_shallow_fusion`` and ``--use-shallow-fusion True``
|
||||
to use shallow fusion. ``--lm-type`` specifies the type of neural LM we are going to use, you can either choose
|
||||
between ``rnn`` or ``transformer``. The following three arguments are associated with the rnn:
|
||||
|
||||
- ``--rnn-lm-embedding-dim``
|
||||
The embedding dimension of the RNN LM
|
||||
|
||||
- ``--rnn-lm-hidden-dim``
|
||||
The hidden dimension of the RNN LM
|
||||
|
||||
- ``--rnn-lm-num-layers``
|
||||
The number of RNN layers in the RNN LM.
|
||||
|
||||
|
||||
The decoding result obtained with the above command are shown below.
|
||||
|
||||
.. code-block:: bash
|
||||
|
||||
$ For test-clean, WER of different settings are:
|
||||
$ beam_size_4 2.77 best for test-clean
|
||||
$ For test-other, WER of different settings are:
|
||||
$ beam_size_4 7.08 best for test-other
|
||||
|
||||
The improvement of shallow fusion is very obvious! The relative WER reduction on test-other is around 10.5%.
|
||||
A few parameters can be tuned to further boost the performance of shallow fusion:
|
||||
|
||||
- ``--lm-scale``
|
||||
|
||||
Controls the scale of the LM. If too small, the external language model may not be fully utilized; if too large,
|
||||
the LM score may dominant during decoding, leading to bad WER. A typical value of this is around 0.3.
|
||||
|
||||
- ``--beam-size``
|
||||
|
||||
The number of active paths in the search beam. It controls the trade-off between decoding efficiency and accuracy.
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
@ -34,3 +34,8 @@ speech recognition recipes using `k2 <https://github.com/k2-fsa/k2>`_.
|
||||
|
||||
contributing/index
|
||||
huggingface/index
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
decoding-with-langugage-models/index
|
Loading…
x
Reference in New Issue
Block a user