.. _shallow_fusion:

Shallow fusion for Transducer
=================================

External language models (LMs) are commonly used to improve the WER of end-to-end (E2E) ASR models.
This tutorial shows you how to perform ``shallow fusion`` with an external LM
to improve the word error rate (WER) of a transducer model.

.. note::

    This tutorial is based on the recipe
    `pruned_transducer_stateless7_streaming `_,
    which is a streaming transducer model trained on `LibriSpeech`_.
    However, you can easily apply shallow fusion to other recipes.
    If you encounter any problems, please open an issue here: `icefall `_.

.. note::

    For simplicity, the training and testing corpora in this tutorial are the same (`LibriSpeech`_).
    However, you can change the test set to any other domain (e.g., `GigaSpeech`_)
    and use an external LM trained on that domain.

.. HINT::

    We recommend using a GPU for decoding.

For illustration purposes, we will use a pre-trained ASR model from this `link `__.
If you want to train your model from scratch, please have a look at :ref:`non_streaming_librispeech_pruned_transducer_stateless`.

As the initial step, let's download the pre-trained model.

.. code-block:: bash

    $ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29
    $ pushd icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
    $ git lfs pull --include "pretrained.pt"
    $ ln -s pretrained.pt epoch-99.pt # create a symbolic link so that the checkpoint can be loaded
    $ popd # return to the original directory so that the relative paths below resolve

To test the model, let's have a look at the decoding results without using an LM. This can be done via the following command:

.. code-block:: bash

    $ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/
    $ ./pruned_transducer_stateless7_streaming/decode.py \
        --epoch 99 \
        --avg 1 \
        --use-averaged-model False \
        --exp-dir $exp_dir \
        --bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
        --max-duration 600 \
        --decode-chunk-len 32 \
        --decoding-method modified_beam_search

The following WERs are achieved on test-clean and test-other:

.. code-block:: text

    $ For test-clean, WER of different settings are:
    $ beam_size_4	3.11	best for test-clean
    $ For test-other, WER of different settings are:
    $ beam_size_4	7.93	best for test-other

These are already good numbers, but we can improve them further by using shallow fusion with an external LM.
Because training a language model usually takes a long time, we download a pre-trained LM from this `link `__.

.. code-block:: bash

    $ # download the external LM
    $ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm
    $ # create a symbolic link so that the checkpoint can be loaded
    $ pushd icefall-librispeech-rnn-lm/exp
    $ git lfs pull --include "pretrained.pt"
    $ ln -s pretrained.pt epoch-99.pt
    $ popd

.. note::

    This is an RNN LM trained on the LibriSpeech text corpus, so it might not be ideal for other corpora.
    You may also train an RNN LM from scratch. Please refer to this `script `__
    for training an RNN LM and this `script `__ for training a transformer LM.

To use shallow fusion for decoding, we can execute the following command:

.. code-block:: bash

    $ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
    $ lm_dir=./icefall-librispeech-rnn-lm/exp
    $ lm_scale=0.29
    $ ./pruned_transducer_stateless7_streaming/decode.py \
        --epoch 99 \
        --avg 1 \
        --use-averaged-model False \
        --beam-size 4 \
        --exp-dir $exp_dir \
        --max-duration 600 \
        --decode-chunk-len 32 \
        --decoding-method modified_beam_search_lm_shallow_fusion \
        --bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
        --use-shallow-fusion 1 \
        --lm-type rnn \
        --lm-exp-dir $lm_dir \
        --lm-epoch 99 \
        --lm-scale $lm_scale \
        --lm-avg 1 \
        --rnn-lm-embedding-dim 2048 \
        --rnn-lm-hidden-dim 2048 \
        --rnn-lm-num-layers 3 \
        --lm-vocab-size 500

Note that we set ``--decoding-method modified_beam_search_lm_shallow_fusion`` and ``--use-shallow-fusion 1``
to use shallow fusion. ``--lm-type`` specifies the type of neural LM we are going to use;
you can choose either ``rnn`` or ``transformer``. The following three arguments are specific to the RNN LM:

- ``--rnn-lm-embedding-dim``
    The embedding dimension of the RNN LM.

- ``--rnn-lm-hidden-dim``
    The hidden dimension of the RNN LM.

- ``--rnn-lm-num-layers``
    The number of RNN layers in the RNN LM.

The decoding results obtained with the above command are shown below.

.. code-block:: text

    $ For test-clean, WER of different settings are:
    $ beam_size_4	2.77	best for test-clean
    $ For test-other, WER of different settings are:
    $ beam_size_4	7.08	best for test-other

Shallow fusion brings a clear improvement: the relative WER reduction is about 10.9% on test-clean (3.11 to 2.77)
and about 10.7% on test-other (7.93 to 7.08). A few parameters can be tuned to further boost the performance of shallow fusion:

- ``--lm-scale``
    Controls the weight of the LM score. If it is too small, the external language model may not be fully utilized;
    if it is too large, the LM score may dominate during decoding, leading to a worse WER. A typical value is around 0.3.

- ``--beam-size``
    The number of active paths in the search beam. It controls the trade-off between decoding efficiency and accuracy.

Here, we also show how ``--beam-size`` affects the WER and the decoding time:

.. list-table:: WERs and decoding time (on test-clean) of shallow fusion with different beam sizes
   :widths: 25 25 25 25
   :header-rows: 1

   * - Beam size
     - WER on test-clean (%)
     - WER on test-other (%)
     - Decoding time on test-clean (s)
   * - 4
     - 2.77
     - 7.08
     - 262
   * - 8
     - 2.62
     - 6.65
     - 352
   * - 12
     - 2.58
     - 6.65
     - 488

As we can see, a larger beam size during shallow fusion generally improves the WER but also slows down decoding:
going from beam size 4 to 12 nearly doubles the decoding time on test-clean (262 s to 488 s),
while the WER on test-other does not improve beyond beam size 8.
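
To make the roles of ``--lm-scale`` and ``--beam-size`` more concrete, the toy sketch below scores
the candidate next tokens of a single hypothesis. It assumes the usual shallow-fusion formulation,
in which the ASR log-probability and the LM log-probability scaled by the LM weight are added together.
It is not the code behind ``modified_beam_search_lm_shallow_fusion``: the function name
``shallow_fusion_step``, the tokens, and all probabilities are made up for illustration.

.. code-block:: python

    import math


    def shallow_fusion_step(asr_log_probs, lm_log_probs, lm_scale, beam_size):
        """Rank candidate tokens by ASR score plus scaled LM score.

        Toy illustration only: both inputs map token -> log-probability
        for the next token of one hypothesis.
        """
        fused = {
            token: asr_lp + lm_scale * lm_log_probs[token]
            for token, asr_lp in asr_log_probs.items()
        }
        # Keep only the `beam_size` highest-scoring candidates.
        ranked = sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:beam_size]


    # Hypothetical next-token distributions (made-up numbers).
    asr = {"their": math.log(0.48), "there": math.log(0.47), "the": math.log(0.05)}
    lm = {"their": math.log(0.10), "there": math.log(0.60), "the": math.log(0.30)}

    for scale in (0.0, 0.29, 5.0):
        best = shallow_fusion_step(asr, lm, lm_scale=scale, beam_size=2)
        print(scale, [token for token, _ in best])

    # Prints:
    # 0.0 ['their', 'there']
    # 0.29 ['there', 'their']
    # 5.0 ['there', 'the']

With a scale of 0 the LM is ignored and the acoustically preferred token wins. A moderate scale lets the LM
break the near-tie between the two acoustically similar tokens. A very large scale lets the LM push an
acoustically unlikely token into the beam, which is the failure mode described for ``--lm-scale`` above.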