<!DOCTYPE html>
<html class="writer-html5" lang="en" >
<head>
<meta charset="utf-8" /><meta name="generator" content="Docutils 0.18.1: http://docutils.sourceforge.net/" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Shallow fusion for Transducer &mdash; icefall 0.1 documentation</title>
<link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="../_static/css/theme.css" type="text/css" />
<!--[if lt IE 9]>
<script src="../_static/js/html5shiv.min.js"></script>
<![endif]-->
<script src="../_static/jquery.js?v=5d32c60e"></script>
<script src="../_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
<script data-url_root="../" id="documentation_options" src="../_static/documentation_options.js?v=e031e9a9"></script>
<script src="../_static/doctools.js?v=888ff710"></script>
<script src="../_static/sphinx_highlight.js?v=4825356b"></script>
<script src="../_static/js/theme.js"></script>
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<link rel="next" title="LODR for RNN Transducer" href="LODR.html" />
<link rel="prev" title="Decoding with language models" href="index.html" />
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search" >
<a href="../index.html" class="icon icon-home">
icefall
</a>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
<p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../for-dummies/index.html">Icefall for dummies tutorial</a></li>
<li class="toctree-l1"><a class="reference internal" href="../installation/index.html">Installation</a></li>
<li class="toctree-l1"><a class="reference internal" href="../docker/index.html">Docker</a></li>
<li class="toctree-l1"><a class="reference internal" href="../faqs.html">Frequently Asked Questions (FAQs)</a></li>
<li class="toctree-l1"><a class="reference internal" href="../model-export/index.html">Model export</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../recipes/index.html">Recipes</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../huggingface/index.html">Huggingface</a></li>
</ul>
<ul class="current">
<li class="toctree-l1 current"><a class="reference internal" href="index.html">Decoding with language models</a><ul class="current">
<li class="toctree-l2 current"><a class="current reference internal" href="#">Shallow fusion for Transducer</a></li>
<li class="toctree-l2"><a class="reference internal" href="LODR.html">LODR for RNN Transducer</a></li>
<li class="toctree-l2"><a class="reference internal" href="rescoring.html">LM rescoring for Transducer</a></li>
</ul>
</li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../index.html">icefall</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="Page navigation">
<ul class="wy-breadcrumbs">
<li><a href="../index.html" class="icon icon-home" aria-label="Home"></a></li>
<li class="breadcrumb-item"><a href="index.html">Decoding with language models</a></li>
<li class="breadcrumb-item active">Shallow fusion for Transducer</li>
<li class="wy-breadcrumbs-aside">
<a href="https://github.com/k2-fsa/icefall/blob/master/docs/source/decoding-with-langugage-models/shallow-fusion.rst" class="fa fa-github"> Edit on GitHub</a>
</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<section id="shallow-fusion-for-transducer">
<span id="shallow-fusion"></span><h1>Shallow fusion for Transducer<a class="headerlink" href="#shallow-fusion-for-transducer" title="Permalink to this heading"></a></h1>
<p>External language models (LMs) are commonly used to improve the word error rate (WER) of end-to-end (E2E) ASR models.
This tutorial shows you how to perform <code class="docutils literal notranslate"><span class="pre">shallow</span> <span class="pre">fusion</span></code> with an external LM
to improve the WER of a transducer model.</p>
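<p>Conceptually, shallow fusion combines scores in the log domain: during beam search, each candidate token&#8217;s ASR log-probability is augmented with a scaled log-probability from the external LM. A minimal sketch (the function name and numbers below are illustrative, not part of the recipe):</p>

```python
def shallow_fusion_score(asr_logprob, lm_logprob, lm_scale=0.29):
    """Fused score of one candidate token: the ASR log-probability
    plus a scaled external-LM log-probability."""
    return asr_logprob + lm_scale * lm_logprob

# A token the external LM strongly prefers can overtake a token the
# ASR model slightly prefers:
score_a = shallow_fusion_score(-1.0, -5.0)  # ASR-favoured token
score_b = shallow_fusion_score(-1.2, -0.5)  # LM-favoured token
```

<p>Here the LM-favoured token ends up with the higher fused score, which is how the LM steers beam search away from acoustically plausible but linguistically unlikely hypotheses.</p>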
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>This tutorial is based on the recipe
<a class="reference external" href="https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming">pruned_transducer_stateless7_streaming</a>,
which is a streaming transducer model trained on <a class="reference external" href="https://www.openslr.org/12">LibriSpeech</a>.
However, you can easily apply shallow fusion to other recipes.
If you encounter any problems, please open an issue in the <a class="reference external" href="https://github.com/k2-fsa/icefall/issues">icefall</a> repository.</p>
</div>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>For simplicity, the training and testing corpora in this tutorial are the same (<a class="reference external" href="https://www.openslr.org/12">LibriSpeech</a>). However, you can change the testing set
to any other domain (e.g. <a class="reference external" href="https://github.com/SpeechColab/GigaSpeech">GigaSpeech</a>) and use an external LM trained on that domain.</p>
</div>
<div class="admonition hint">
<p class="admonition-title">Hint</p>
<p>We recommend using a GPU for decoding.</p>
</div>
<p>For illustration purposes, we will use a pre-trained ASR model from this <a class="reference external" href="https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29">link</a>.
If you want to train your model from scratch, please have a look at <a class="reference internal" href="../recipes/Non-streaming-ASR/librispeech/pruned_transducer_stateless.html#non-streaming-librispeech-pruned-transducer-stateless"><span class="std std-ref">Pruned transducer statelessX</span></a>.</p>
<p>As the first step, let&#8217;s download the pre-trained model:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="nv">GIT_LFS_SKIP_SMUDGE</span><span class="o">=</span><span class="m">1</span><span class="w"> </span>git<span class="w"> </span>clone<span class="w"> </span>https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29
$<span class="w"> </span><span class="nb">pushd</span><span class="w"> </span>icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$<span class="w"> </span>git<span class="w"> </span>lfs<span class="w"> </span>pull<span class="w"> </span>--include<span class="w"> </span><span class="s2">&quot;pretrained.pt&quot;</span>
$<span class="w"> </span>ln<span class="w"> </span>-s<span class="w"> </span>pretrained.pt<span class="w"> </span>epoch-99.pt<span class="w"> </span><span class="c1"># create a symbolic link so that the checkpoint can be loaded</span>
$<span class="w"> </span><span class="nb">popd</span><span class="w"> </span><span class="c1"># return to the original directory</span>
</pre></div>
</div>
<p>To test the model, let&#8217;s have a look at the decoding results without an external LM. This can be done via the following command:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="nv">exp_dir</span><span class="o">=</span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/
$<span class="w"> </span>./pruned_transducer_stateless7_streaming/decode.py<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--epoch<span class="w"> </span><span class="m">99</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--avg<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--use-averaged-model<span class="w"> </span>False<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--exp-dir<span class="w"> </span><span class="nv">$exp_dir</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--bpe-model<span class="w"> </span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--max-duration<span class="w"> </span><span class="m">600</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--decode-chunk-len<span class="w"> </span><span class="m">32</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--decoding-method<span class="w"> </span>modified_beam_search
</pre></div>
</div>
<p>The following WERs are achieved on test-clean and test-other:</p>
<div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ For test-clean, WER of different settings are:
$ beam_size_4 3.11 best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4 7.93 best for test-other
</pre></div>
</div>
<p>These are already good numbers! But we can improve them further by using shallow fusion with an external LM.
Since training a language model usually takes a long time, we can download a pre-trained LM from this <a class="reference external" href="https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm">link</a>.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="c1"># download the external LM</span>
$<span class="w"> </span><span class="nv">GIT_LFS_SKIP_SMUDGE</span><span class="o">=</span><span class="m">1</span><span class="w"> </span>git<span class="w"> </span>clone<span class="w"> </span>https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm
$<span class="w"> </span><span class="c1"># create a symbolic link so that the checkpoint can be loaded</span>
$<span class="w"> </span><span class="nb">pushd</span><span class="w"> </span>icefall-librispeech-rnn-lm/exp
$<span class="w"> </span>git<span class="w"> </span>lfs<span class="w"> </span>pull<span class="w"> </span>--include<span class="w"> </span><span class="s2">&quot;pretrained.pt&quot;</span>
$<span class="w"> </span>ln<span class="w"> </span>-s<span class="w"> </span>pretrained.pt<span class="w"> </span>epoch-99.pt
$<span class="w"> </span><span class="nb">popd</span>
</pre></div>
</div>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>This is an RNN LM trained on the LibriSpeech text corpus, so it might not be ideal for other corpora.
You may also train an RNN LM from scratch. Please refer to this <a class="reference external" href="https://github.com/k2-fsa/icefall/blob/master/icefall/rnn_lm/train.py">script</a>
for training an RNN LM and this <a class="reference external" href="https://github.com/k2-fsa/icefall/blob/master/icefall/transformer_lm/train.py">script</a> to train a transformer LM.</p>
</div>
<p>To use shallow fusion for decoding, we can execute the following command:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span><span class="nv">exp_dir</span><span class="o">=</span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$<span class="w"> </span><span class="nv">lm_dir</span><span class="o">=</span>./icefall-librispeech-rnn-lm/exp
$<span class="w"> </span><span class="nv">lm_scale</span><span class="o">=</span><span class="m">0</span>.29
$<span class="w"> </span>./pruned_transducer_stateless7_streaming/decode.py<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--epoch<span class="w"> </span><span class="m">99</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--avg<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--use-averaged-model<span class="w"> </span>False<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--beam-size<span class="w"> </span><span class="m">4</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--exp-dir<span class="w"> </span><span class="nv">$exp_dir</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--max-duration<span class="w"> </span><span class="m">600</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--decode-chunk-len<span class="w"> </span><span class="m">32</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--decoding-method<span class="w"> </span>modified_beam_search_lm_shallow_fusion<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--bpe-model<span class="w"> </span>./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--use-shallow-fusion<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-type<span class="w"> </span>rnn<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-exp-dir<span class="w"> </span><span class="nv">$lm_dir</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-epoch<span class="w"> </span><span class="m">99</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-scale<span class="w"> </span><span class="nv">$lm_scale</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-avg<span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--rnn-lm-embedding-dim<span class="w"> </span><span class="m">2048</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--rnn-lm-hidden-dim<span class="w"> </span><span class="m">2048</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--rnn-lm-num-layers<span class="w"> </span><span class="m">3</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--lm-vocab-size<span class="w"> </span><span class="m">500</span>
</pre></div>
</div>
<p>Note that we set <code class="docutils literal notranslate"><span class="pre">--decoding-method</span> <span class="pre">modified_beam_search_lm_shallow_fusion</span></code> and <code class="docutils literal notranslate"><span class="pre">--use-shallow-fusion</span> <span class="pre">True</span></code>
to enable shallow fusion. <code class="docutils literal notranslate"><span class="pre">--lm-type</span></code> specifies the type of neural LM to use; you can choose
between <code class="docutils literal notranslate"><span class="pre">rnn</span></code> and <code class="docutils literal notranslate"><span class="pre">transformer</span></code>. The following three arguments are specific to the RNN LM:</p>
<ul class="simple">
<li><dl class="simple">
<dt><code class="docutils literal notranslate"><span class="pre">--rnn-lm-embedding-dim</span></code></dt><dd><p>The embedding dimension of the RNN LM</p>
</dd>
</dl>
</li>
<li><dl class="simple">
<dt><code class="docutils literal notranslate"><span class="pre">--rnn-lm-hidden-dim</span></code></dt><dd><p>The hidden dimension of the RNN LM</p>
</dd>
</dl>
</li>
<li><dl class="simple">
<dt><code class="docutils literal notranslate"><span class="pre">--rnn-lm-num-layers</span></code></dt><dd><p>The number of RNN layers in the RNN LM.</p>
</dd>
</dl>
</li>
</ul>
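<p>To make these pieces concrete, here is a toy, pure-Python sketch of one step of beam search with shallow fusion. The names (<cite>Hyp</cite>, <cite>fuse_step</cite>) are hypothetical; the actual implementation is the <cite>modified_beam_search_lm_shallow_fusion</cite> decoding method in icefall:</p>

```python
from dataclasses import dataclass

@dataclass
class Hyp:
    tokens: list       # decoded token IDs so far
    log_prob: float    # accumulated fused score

def fuse_step(hyps, asr_logprobs, lm_logprobs, lm_scale, beam):
    """Extend each hypothesis with every candidate token, adding the
    ASR log-prob and the scaled LM log-prob, then keep the `beam`
    highest-scoring hypotheses."""
    new_hyps = []
    for hyp, asr_lp, lm_lp in zip(hyps, asr_logprobs, lm_logprobs):
        for tok, a in asr_lp.items():
            fused = hyp.log_prob + a + lm_scale * lm_lp[tok]
            new_hyps.append(Hyp(hyp.tokens + [tok], fused))
    new_hyps.sort(key=lambda h: h.log_prob, reverse=True)
    return new_hyps[:beam]
```

<p>A larger beam simply keeps more entries of <cite>new_hyps</cite> alive at every step, which is why increasing <code class="docutils literal notranslate"><span class="pre">--beam-size</span></code> improves WER at the cost of decoding time.</p>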
<p>The decoding results obtained with the above command are shown below.</p>
<div class="highlight-text notranslate"><div class="highlight"><pre><span></span>$ For test-clean, WER of different settings are:
$ beam_size_4 2.77 best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4 7.08 best for test-other
</pre></div>
</div>
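<p>As a quick sanity check, the relative WER reduction can be computed directly from the two result listings above (plain Python, illustrative only):</p>

```python
def relative_werr(baseline, fused):
    """Relative WER reduction in percent, from baseline decoding
    to decoding with shallow fusion."""
    return 100.0 * (baseline - fused) / baseline

clean = relative_werr(3.11, 2.77)  # test-clean
other = relative_werr(7.93, 7.08)  # test-other
```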
<p>The improvement from shallow fusion is clear: the relative WER reduction on test-other is around 10.7%.
A few parameters can be tuned to further boost the performance of shallow fusion:</p>
<ul>
<li><p><code class="docutils literal notranslate"><span class="pre">--lm-scale</span></code></p>
<blockquote>
<div><p>Controls the scale of the LM. If it is too small, the external language model may not be fully utilized; if it is too large,
the LM score may dominate during decoding, degrading the WER. A typical value is around 0.3.</p>
</div></blockquote>
</li>
<li><p><code class="docutils literal notranslate"><span class="pre">--beam-size</span></code></p>
<blockquote>
<div><p>The number of active paths in the search beam. It controls the trade-off between decoding efficiency and accuracy.</p>
</div></blockquote>
</li>
</ul>
<p>Here, we also show how <cite>beam-size</cite> affects the WER and decoding time:</p>
<table class="docutils align-default" id="id2">
<caption><span class="caption-number">Table 2 </span><span class="caption-text">WERs and decoding time (on test-clean) of shallow fusion with different beam sizes</span><a class="headerlink" href="#id2" title="Permalink to this table"></a></caption>
<colgroup>
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
</colgroup>
<thead>
<tr class="row-odd"><th class="head"><p>Beam size</p></th>
<th class="head"><p>test-clean</p></th>
<th class="head"><p>test-other</p></th>
<th class="head"><p>Decoding time on test-clean (s)</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p>4</p></td>
<td><p>2.77</p></td>
<td><p>7.08</p></td>
<td><p>262</p></td>
</tr>
<tr class="row-odd"><td><p>8</p></td>
<td><p>2.62</p></td>
<td><p>6.65</p></td>
<td><p>352</p></td>
</tr>
<tr class="row-even"><td><p>12</p></td>
<td><p>2.58</p></td>
<td><p>6.65</p></td>
<td><p>488</p></td>
</tr>
</tbody>
</table>
<p>As we can see, a larger beam size during shallow fusion improves the WER but also increases decoding time.</p>
</section>
</div>
</div>
<footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
<a href="index.html" class="btn btn-neutral float-left" title="Decoding with language models" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
<a href="LODR.html" class="btn btn-neutral float-right" title="LODR for RNN Transducer" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
</div>
<hr/>
<div role="contentinfo">
<p>&#169; Copyright 2021, icefall development team.</p>
</div>
Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
<a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<script>
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</body>
</html>