<!DOCTYPE html>
<html class="writer-html5" lang="en">
<head>
<meta charset="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Introduction &mdash; icefall 0.1 documentation</title>
<link rel="stylesheet" type="text/css" href="../../_static/pygments.css?v=fa44fd50" />
<link rel="stylesheet" type="text/css" href="../../_static/css/theme.css?v=7ab3649f" />
<script src="../../_static/jquery.js?v=5d32c60e"></script>
<script src="../../_static/_sphinx_javascript_frameworks_compat.js?v=2cd50e6c"></script>
<script data-url_root="../../" id="documentation_options" src="../../_static/documentation_options.js?v=e031e9a9"></script>
<script src="../../_static/doctools.js?v=888ff710"></script>
<script src="../../_static/sphinx_highlight.js?v=4825356b"></script>
<script src="../../_static/js/theme.js"></script>
<link rel="index" title="Index" href="../../genindex.html" />
<link rel="search" title="Search" href="../../search.html" />
<link rel="next" title="LibriSpeech" href="librispeech/index.html" />
<link rel="prev" title="Streaming ASR" href="index.html" />
</head>
<body class="wy-body-for-nav">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search" >
<a href="../../index.html" class="icon icon-home">
icefall
</a>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" aria-label="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
<p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../for-dummies/index.html">Icefall for dummies tutorial</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../installation/index.html">Installation</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../docker/index.html">Docker</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../faqs.html">Frequently Asked Questions (FAQs)</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../model-export/index.html">Model export</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../fst-based-forced-alignment/index.html">FST-based forced alignment</a></li>
</ul>
<ul class="current">
<li class="toctree-l1 current"><a class="reference internal" href="../index.html">Recipes</a><ul class="current">
<li class="toctree-l2"><a class="reference internal" href="../Non-streaming-ASR/index.html">Non Streaming ASR</a></li>
<li class="toctree-l2 current"><a class="reference internal" href="index.html">Streaming ASR</a><ul class="current">
<li class="toctree-l3 current"><a class="current reference internal" href="#">Introduction</a><ul>
<li class="toctree-l4"><a class="reference internal" href="#streaming-conformer">Streaming Conformer</a></li>
<li class="toctree-l4"><a class="reference internal" href="#streaming-emformer">Streaming Emformer</a></li>
</ul>
</li>
<li class="toctree-l3"><a class="reference internal" href="librispeech/index.html">LibriSpeech</a></li>
</ul>
</li>
<li class="toctree-l2"><a class="reference internal" href="../RNN-LM/index.html">RNN-LM</a></li>
<li class="toctree-l2"><a class="reference internal" href="../TTS/index.html">TTS</a></li>
<li class="toctree-l2"><a class="reference internal" href="../Finetune/index.html">Fine-tune a pre-trained model</a></li>
</ul>
</li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../contributing/index.html">Contributing</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../huggingface/index.html">Huggingface</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../../decoding-with-langugage-models/index.html">Decoding with language models</a></li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../../index.html">icefall</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="Page navigation">
<ul class="wy-breadcrumbs">
<li><a href="../../index.html" class="icon icon-home" aria-label="Home"></a></li>
<li class="breadcrumb-item"><a href="../index.html">Recipes</a></li>
<li class="breadcrumb-item"><a href="index.html">Streaming ASR</a></li>
<li class="breadcrumb-item active">Introduction</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<section id="introduction">
<h1>Introduction<a class="headerlink" href="#introduction" title="Permalink to this heading"></a></h1>
<p>This page shows you how we implement streaming <strong>X-former transducer</strong> models for ASR.</p>
<div class="admonition hint">
<p class="admonition-title">Hint</p>
<p>Here, X-former transducer means a transducer model whose encoder uses multi-head attention,
such as <a class="reference external" href="https://arxiv.org/pdf/2005.08100.pdf">Conformer</a> or <a class="reference external" href="https://arxiv.org/pdf/2010.10759.pdf">Emformer</a>.</p>
</div>
<p>Currently, we have implemented two types of streaming models: one uses Conformer as the encoder, and the other uses Emformer.</p>
<section id="streaming-conformer">
<h2>Streaming Conformer<a class="headerlink" href="#streaming-conformer" title="Permalink to this heading"></a></h2>
<p>The main idea of training a streaming model is to let the model see only limited context
at training time. We achieve this by applying a mask to the self-attention, so that each frame attends only to the frames it is allowed to see.
In icefall, we implement the streaming conformer in the same way as <a class="reference external" href="https://arxiv.org/pdf/2012.05481.pdf">WeNet</a>.</p>
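<p>The masking idea can be illustrated with a minimal sketch in plain Python. It is not the code used in icefall or WeNet: the function name <code class="docutils literal notranslate"><span class="pre">chunk_attention_mask</span></code> and its parameters are hypothetical, and the sketch assumes a chunk-based scheme in which a frame may attend to its own chunk and to a limited number of chunks on its left.</p>

```python
def chunk_attention_mask(num_frames, chunk_size, num_left_chunks=-1):
    """Return a num_frames x num_frames matrix of booleans.

    mask[i][j] is True when frame i is allowed to attend to frame j.
    All names here are hypothetical and for illustration only.
    """
    mask = []
    for i in range(num_frames):
        chunk_index = i // chunk_size
        # One past the last visible frame: the end of the chunk containing i.
        end = min((chunk_index + 1) * chunk_size, num_frames)
        # First visible frame: the start of the utterance when
        # num_left_chunks is -1, otherwise num_left_chunks chunks back.
        if num_left_chunks == -1:
            start = 0
        else:
            start = max((chunk_index - num_left_chunks) * chunk_size, 0)
        mask.append([j in range(start, end) for j in range(num_frames)])
    return mask


def show(mask):
    """Render the mask as text: 'x' means visible, '.' means masked."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


mask = chunk_attention_mask(num_frames=6, chunk_size=2, num_left_chunks=1)
print(show(mask))
```

<p>With these settings, the frames of the last chunk see only their own chunk and the one before it, and no frame sees anything to the right of its own chunk. In a real model, a mask of this kind would typically be used to exclude the masked positions from the attention computation.</p>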
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>The conformer-transducer recipes for the LibriSpeech dataset, such as <a class="reference external" href="https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless">pruned_transducer_stateless</a>,
<a class="reference external" href="https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless2">pruned_transducer_stateless2</a>,
<a class="reference external" href="https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless3">pruned_transducer_stateless3</a>,
<a class="reference external" href="https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless4">pruned_transducer_stateless4</a>, and
<a class="reference external" href="https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless5">pruned_transducer_stateless5</a>,
all support streaming.</p>
</div>
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>Training a streaming conformer model in <code class="docutils literal notranslate"><span class="pre">icefall</span></code> is almost the same as training a
non-streaming model; all you need to do is pass several extra arguments.
See <a class="reference internal" href="librispeech/pruned_transducer_stateless.html"><span class="doc">Pruned transducer statelessX</span></a> for more details.</p>
</div>
<div class="admonition hint">
<p class="admonition-title">Hint</p>
<p>If you want to modify a non-streaming conformer recipe to support both streaming and non-streaming, please refer
to <a class="reference external" href="https://github.com/k2-fsa/icefall/pull/454">this pull request</a>. After adding the code needed for streaming training,
you have to re-train the model with the extra arguments mentioned in the docs above to get a streaming model.</p>
</div>
</section>
<section id="streaming-emformer">
<h2>Streaming Emformer<a class="headerlink" href="#streaming-emformer" title="Permalink to this heading"></a></h2>
<p>The Emformer model proposed <a class="reference external" href="https://arxiv.org/pdf/2010.10759.pdf">here</a> uses more
complicated techniques. It has a memory bank component that memorizes history information.
What is more, it introduces right context at training time by hard-copying part of
the input features.</p>
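<p>The hard-copying of right context can be pictured with the following simplified sketch in plain Python. It is an illustration only, not the Emformer implementation in torchaudio or icefall: the function name <code class="docutils literal notranslate"><span class="pre">hard_copy_right_context</span></code> and its parameters are hypothetical, and real implementations may treat the end of the utterance differently.</p>

```python
def hard_copy_right_context(frames, chunk_size, right_context_size):
    """Split frames into chunks and pair each chunk with its look-ahead.

    Returns a list of (chunk, right_context) pairs. The right context is a
    copy of the frames that immediately follow the chunk; it becomes shorter
    (possibly empty) near the end of the utterance.
    All names here are hypothetical and for illustration only.
    """
    pairs = []
    for start in range(0, len(frames), chunk_size):
        end = min(start + chunk_size, len(frames))
        chunk = list(frames[start:end])
        # Copy the look-ahead frames instead of sharing them with the
        # next chunk, so every chunk can be processed on its own.
        right_context = list(frames[end:end + right_context_size])
        pairs.append((chunk, right_context))
    return pairs


frames = list(range(10))  # stand-in for 10 feature frames
pairs = hard_copy_right_context(frames, chunk_size=4, right_context_size=2)
for chunk, right_context in pairs:
    print(chunk, right_context)
```

<p>In this sketch, frames 4 and 5 appear twice: once as the copied right context of the first chunk and once as regular content of the second chunk. That duplication is what "hard-copying" refers to here.</p>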
<p>We have three variants of Emformer models in <code class="docutils literal notranslate"><span class="pre">icefall</span></code>.</p>
<blockquote>
<div><ul class="simple">
<li><p><code class="docutils literal notranslate"><span class="pre">pruned_stateless_emformer_rnnt2</span></code> using Emformer from torchaudio, see <a class="reference external" href="https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_stateless_emformer_rnnt2">LibriSpeech recipe</a>.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">conv_emformer_transducer_stateless</span></code> using ConvEmformer, which we implemented ourselves. Unlike the Emformer in torchaudio,
ConvEmformer has a convolution module in each layer and uses the mechanisms from our reworked conformer model.
See <a class="reference external" href="https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/conv_emformer_transducer_stateless">LibriSpeech recipe</a>.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">conv_emformer_transducer_stateless2</span></code> using ConvEmformer, which we implemented ourselves. The only difference from the previous variant is that
it uses a simplified memory bank. See <a class="reference external" href="https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/conv_emformer_transducer_stateless2">LibriSpeech recipe</a>.</p></li>
</ul>
</div></blockquote>
</section>
</section>
</div>
</div>
<footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
<a href="index.html" class="btn btn-neutral float-left" title="Streaming ASR" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
<a href="librispeech/index.html" class="btn btn-neutral float-right" title="LibriSpeech" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
</div>
<hr/>
<div role="contentinfo">
<p>&#169; Copyright 2021, icefall development team.</p>
</div>
Built with <a href="https://www.sphinx-doc.org/">Sphinx</a> using a
<a href="https://github.com/readthedocs/sphinx_rtd_theme">theme</a>
provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<script>
jQuery(function () {
SphinxRtdTheme.Navigation.enable(true);
});
</script>
</body>
</html>