Mirror of https://github.com/k2-fsa/icefall.git (synced 2025-08-08 17:42:21 +00:00)
deploy: 027302c902ce9ab44754d42a56cf1eba9a075be9
parent: c128646ff4
commit: e5fed5060b

@@ -30,7 +30,7 @@ of language model integration.
First, let's have a look at some background information. As the predecessor of LODR, Density Ratio (DR) was first proposed `here <https://arxiv.org/abs/2002.11268>`_
to address the language information mismatch between the training
corpus (source domain) and the testing corpus (target domain). Assuming that the source domain and the test domain
are acoustically similar, DR derives the following formula for decoding with Bayes' theorem:

.. math::

    \text{score}\left(y_u|\mathit{x},y\right) =
    \log p\left(y_u|\mathit{x},y_{1:u-1}\right) +
    \lambda_1 \log p_{\text{Target LM}}\left(y_u|\mathit{x},y_{1:u-1}\right) -
    \lambda_2 \log p_{\text{Source LM}}\left(y_u|\mathit{x},y_{1:u-1}\right)

where :math:`\lambda_1` and :math:`\lambda_2` are the weights of the LM scores for the target domain and the source domain, respectively.
Here, the source domain LM is trained on the training corpus. The only difference in the above formula compared to
shallow fusion is the subtraction of the source domain LM.
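
Below is a minimal sketch of how such a per-token DR score could be combined in Python. The function name and the
example weights are purely illustrative and are not taken from the ``icefall`` code base:

.. code-block:: python

    def dr_score(logp_asr: float, logp_target_lm: float, logp_source_lm: float,
                 lambda1: float = 0.3, lambda2: float = 0.3) -> float:
        """Density Ratio scoring for a single token.

        Adds the scaled target-domain LM log-probability and subtracts the
        scaled source-domain LM log-probability from the transducer score.
        """
        return logp_asr + lambda1 * logp_target_lm - lambda2 * logp_source_lm

    # Example with made-up log-probabilities for one candidate token.
    score = dr_score(logp_asr=-0.5, logp_target_lm=-1.2, logp_source_lm=-2.0)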

Some works treat the predictor and the joiner of the neural transducer as its internal LM. However, the LM is

@@ -58,7 +58,7 @@ during decoding for transducer model:

.. math::

    \text{score}\left(y_u|\mathit{x},y\right) =
    \log p\left(y_u|\mathit{x},y_{1:u-1}\right) +
    \lambda_1 \log p_{\text{Target LM}}\left(y_u|\mathit{x},y_{1:u-1}\right) -
    \lambda_2 \log p_{\text{bi-gram}}\left(y_u|\mathit{x},y_{1:u-1}\right)

In LODR, an additional bi-gram LM estimated on the source domain (e.g., the training corpus) is required. Compared to DR,
the only difference lies in the choice of source domain LM. According to the original `paper <https://arxiv.org/abs/2203.16776>`_,
LODR achieves similar performance to DR in both intra-domain and cross-domain settings.
As a bi-gram is much cheaper to evaluate, LODR is usually much faster.
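
To see why the bi-gram is cheap to evaluate, note that the source-domain term reduces to a table lookup per token.
The snippet below is a toy sketch with hypothetical data structures, not the actual ``icefall`` implementation; it
assumes the bi-gram log-probabilities have already been estimated on the training corpus:

.. code-block:: python

    import math

    # Toy bi-gram table: log P(token | previous token), with a unigram back-off.
    BIGRAM_LOGP = {("HELLO", "WORLD"): math.log(0.4)}
    UNIGRAM_LOGP = {"WORLD": math.log(0.01)}

    def bigram_logp(prev_token: str, token: str) -> float:
        """Return log P(token | prev_token), backing off to a unigram estimate."""
        return BIGRAM_LOGP.get((prev_token, token), UNIGRAM_LOGP.get(token, math.log(1e-6)))

    def lodr_score(logp_asr: float, logp_target_lm: float, prev_token: str, token: str,
                   lambda1: float = 0.4, lambda2: float = 0.2) -> float:
        """LODR-style per-token score: the bi-gram estimate replaces the source-domain LM."""
        return logp_asr + lambda1 * logp_target_lm - lambda2 * bigram_logp(prev_token, token)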

Now, we will show you how to use LODR in ``icefall``.

@@ -9,9 +9,9 @@ to improve the word-error-rate of a transducer model.

.. note::

    This tutorial is based on the recipe
    `pruned_transducer_stateless7_streaming <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming>`_,
    which is a streaming transducer model trained on `LibriSpeech`_.
    However, you can easily apply shallow fusion to other recipes.
    If you encounter any problems, please open an issue at `icefall <https://github.com/k2-fsa/icefall/issues>`_.

@@ -69,11 +69,11 @@ Training a language model usually takes a long time, we can download a pre-train

.. code-block:: bash

    $ # download the external LM
    $ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm
    $ # create a symbolic link so that the checkpoint can be loaded
    $ pushd icefall-librispeech-rnn-lm/exp
    $ git lfs pull --include "pretrained.pt"
    $ ln -s pretrained.pt epoch-99.pt
    $ popd

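If you want to verify the download before decoding, a quick sanity check could look like the following. This is not
part of the recipe; it only assumes the file is a standard PyTorch checkpoint:

.. code-block:: python

    import torch

    # Make sure the symlinked checkpoint can be deserialized.
    ckpt = torch.load("icefall-librispeech-rnn-lm/exp/epoch-99.pt", map_location="cpu")
    print(type(ckpt))
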
.. note::

@@ -85,7 +85,7 @@ Training a language model usually takes a long time, we can download a pre-train

To use shallow fusion for decoding, we can execute the following command:

.. code-block:: bash

    $ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
    $ lm_dir=./icefall-librispeech-rnn-lm/exp
    $ lm_scale=0.29

@@ -133,16 +133,16 @@ The decoding results obtained with the above command are shown below.

    $ For test-other, WER of different settings are:
    $ beam_size_4 7.08 best for test-other

The improvement from shallow fusion is clear: the relative WER reduction on test-other is around 10.5%.
A few parameters can be tuned to further boost the performance of shallow fusion:

- ``--lm-scale``

  Controls the scale of the LM. If too small, the external language model may not be fully utilized; if too large,
  the LM score might dominate during decoding, leading to bad WER. A typical value is around 0.3 (see the sketch
  after this list).

- ``--beam-size``

  The number of active paths in the search beam. It controls the trade-off between decoding efficiency and accuracy.
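
For intuition about ``--lm-scale``, here is a tiny, self-contained example with made-up numbers (not taken from the
recipe) showing how the scale shifts the ranking of two competing hypotheses under shallow fusion:

.. code-block:: python

    # Shallow fusion combines the transducer score with a scaled LM score:
    #   total = logp_asr + lm_scale * logp_lm
    hyp_a = {"text": "acoustically better", "logp_asr": -1.0, "logp_lm": -3.0}
    hyp_b = {"text": "preferred by the LM", "logp_asr": -2.0, "logp_lm": -1.0}

    for lm_scale in (0.1, 0.3, 0.9):
        scores = {h["text"]: h["logp_asr"] + lm_scale * h["logp_lm"] for h in (hyp_a, hyp_b)}
        best = max(scores, key=scores.get)
        print(f"lm_scale={lm_scale}: best hypothesis -> {best}")

With a very large scale, the LM term dominates and the acoustically better hypothesis is no longer selected, which is
why values far above 0.3 typically hurt WER.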

Here, we also show how ``--beam-size`` affects the WER and decoding time:

@@ -176,4 +176,4 @@ As we see, a larger beam size during shallow fusion improves the WER, but is als