mirror of https://github.com/k2-fsa/icefall.git
WIP: Add stateless transducer tutorial.
This commit is contained in: commit 334f8bb906 (parent 1ff6196c44)
docs/source/conf.py

@@ -33,6 +33,7 @@ release = "0.1"
 # ones.
 extensions = [
     "sphinx_rtd_theme",
+    "sphinx.ext.todo",
 ]
 
 # Add any paths that contain templates here, relative to this directory.
@@ -74,3 +75,5 @@ html_context = {
     "github_version": "master",
     "conf_py_path": "/icefall/docs/source/",
 }
+
+todo_include_todos = True
docs/source/recipes/aishell.rst (deleted, -10 lines)

Aishell
=======

We provide the following models for the Aishell dataset:

.. toctree::
   :maxdepth: 2

   aishell/conformer_ctc
   aishell/tdnn_lstm_ctc
docs/source/recipes/aishell/index.rst (new file, +22 lines)

aishell
=======

Aishell is an open-source Chinese Mandarin speech corpus published by Beijing
Shell Shell Technology Co., Ltd.

400 people from different accent areas in China were invited to participate in
the recording, which was conducted in a quiet indoor environment using
high-fidelity microphones; the audio is downsampled to 16 kHz. Through
professional speech annotation and strict quality inspection, the manual
transcription accuracy is above 95%. The data is free for academic use. We
hope to provide a moderate amount of data for new researchers in the field of
speech recognition.

It can be downloaded from `<https://www.openslr.org/33/>`_

.. toctree::
   :maxdepth: 1

   tdnn_lstm_ctc
   conformer_ctc
   stateless_transducer
docs/source/recipes/aishell/stateless_transducer.rst (new file, +221 lines)

Stateless Transducer
====================

This tutorial shows you how to do transducer training in ``icefall``.

.. HINT::

  We say transducer here, not RNN-T or RNN transducer, because,
  as you will see, there are no RNNs in the model.
The Model
---------

The transducer model consists of 3 parts:

- **Encoder**: a conformer encoder with the following parameters:

  - Number of heads: 8
  - Attention dim: 512
  - Number of layers: 12
  - Feedforward dim: 2048

- **Decoder**: a stateless model consisting of:

  - An embedding layer with embedding dim 512
  - A Conv1d layer with a default kernel size of 2

- **Joiner**: a ``nn.tanh()`` followed by a ``nn.Linear()``.

.. Caution::

  The decoder is stateless and very simple. It is borrowed from
  `<https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9054419>`_
  (Rnn-Transducer with Stateless Prediction Network).

  We make one modification to it: we place a Conv1d layer right after
  the embedding layer.

When Chinese characters are used as the modelling unit, with a vocabulary
size of 4335 for this specific dataset, the model has ``87939824``
parameters, i.e., about ``88 M``.
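To make the decoder and joiner concrete, here is a minimal PyTorch sketch
that follows the description above (embedding dim 512, Conv1d with kernel
size 2, ``tanh`` + ``Linear`` joiner). The class and argument names are
illustrative; they are not necessarily the ones used in icefall's actual
implementation.

.. code-block:: python

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class StatelessDecoder(nn.Module):
      """Embedding + Conv1d; no recurrence, hence no hidden state."""

      def __init__(self, vocab_size: int = 4335, embedding_dim: int = 512,
                   context_size: int = 2):
          super().__init__()
          self.embedding = nn.Embedding(vocab_size, embedding_dim)
          self.context_size = context_size
          self.conv = nn.Conv1d(embedding_dim, embedding_dim,
                                kernel_size=context_size)

      def forward(self, y: torch.Tensor) -> torch.Tensor:
          # y: (N, U) token IDs of the symbols decoded so far
          emb = self.embedding(y).permute(0, 2, 1)  # (N, embedding_dim, U)
          # Left-pad so position u sees only the last context_size symbols;
          # with the default kernel size 2 this gives a tri-gram-like context.
          emb = F.pad(emb, pad=(self.context_size - 1, 0))
          return self.conv(emb).permute(0, 2, 1)    # (N, U, embedding_dim)

  class Joiner(nn.Module):
      """A tanh followed by a Linear projection to the vocabulary."""

      def __init__(self, input_dim: int = 512, vocab_size: int = 4335):
          super().__init__()
          self.output_linear = nn.Linear(input_dim, vocab_size)

      def forward(self, enc: torch.Tensor, dec: torch.Tensor) -> torch.Tensor:
          # enc: (N, T, C), dec: (N, U, C) -> logits: (N, T, U, vocab_size)
          return self.output_linear(torch.tanh(enc.unsqueeze(2) + dec.unsqueeze(1)))

Because the decoder sees only a fixed, short history instead of carrying a
recurrent state, it behaves like an n-gram LM; this is what stateless means
here.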
The Loss
--------

We use `<https://github.com/csukuangfj/optimized_transducer>`_
to compute the transducer loss. It removes extra paddings
in the loss computation to save memory.

.. Hint::

  ``optimized_transducer`` implements the techniques proposed
  in `Improving RNN Transducer Modeling for End-to-End Speech Recognition <https://arxiv.org/abs/1909.12415>`_ to save memory.

  Furthermore, it supports ``modified transducer``, which limits the maximum
  number of symbols that can be emitted per frame to 1. This simplifies
  the decoding process significantly, and experimental results
  show that it does not degrade the performance.

  See `<https://github.com/csukuangfj/optimized_transducer#modified-transducer>`_
  for what exactly the modified transducer is.

  `<https://github.com/csukuangfj/transducer-loss-benchmarking>`_ shows that
  in the unpruned case ``optimized_transducer`` has an advantage in minimizing
  memory usage.
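The following sketch shows why emitting at most one symbol per frame
simplifies decoding: greedy search becomes a single frame-synchronous loop,
with no inner loop over symbols emitted at the same frame. It reuses the
illustrative ``StatelessDecoder`` and ``Joiner`` classes sketched earlier;
icefall's actual decoding code is organized differently.

.. code-block:: python

  import torch

  @torch.no_grad()
  def greedy_search_modified(encoder_out: torch.Tensor, decoder, joiner,
                             blank_id: int = 0, context_size: int = 2):
      """Greedy search for the modified transducer (<= 1 symbol per frame)."""
      hyp = [blank_id] * context_size          # dummy history primes the decoder
      for t in range(encoder_out.size(1)):
          context = torch.tensor([hyp[-context_size:]], dtype=torch.long)
          dec_out = decoder(context)[:, -1:]   # (1, 1, C), the newest position
          logits = joiner(encoder_out[:, t:t + 1], dec_out)  # (1, 1, 1, V)
          y = logits.argmax(dim=-1).item()
          if y != blank_id:
              hyp.append(y)                    # emit at most one symbol,
      return hyp[context_size:]                # then move to the next frame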
.. todo::

  Add a tutorial about ``pruned_transducer_stateless``, which uses the
  pruned transducer loss from k2.

.. hint::

  You can use::

    pip install optimized_transducer

  to install ``optimized_transducer``. Refer to
  `<https://github.com/csukuangfj/optimized_transducer>`_ for other
  installation alternatives.
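As a sketch of how the loss might be invoked, the following is adapted from
the usage described in the ``optimized_transducer`` README. Treat the
argument names and the exact logits layout as assumptions to be checked
against the README of the version you install.

.. code-block:: python

  import torch
  import optimized_transducer

  # Two utterances with different frame counts (T) and target lengths (U).
  logit_lengths = torch.tensor([50, 30], dtype=torch.int32)
  target_lengths = torch.tensor([10, 8], dtype=torch.int32)
  targets = torch.randint(low=1, high=4336, size=(2, 10), dtype=torch.int32)

  # The memory saving comes from the logits layout: instead of one padded
  # (N, max(T), max(U) + 1, V) tensor, each utterance contributes a
  # (T_i, U_i + 1, V) block, and the blocks are concatenated into 2-D.
  num_rows = sum(t * (u + 1) for t, u in zip(logit_lengths.tolist(),
                                             target_lengths.tolist()))
  logits = torch.randn(num_rows, 4336)

  loss = optimized_transducer.transducer_loss(
      logits=logits,
      targets=targets,
      logit_lengths=logit_lengths,
      target_lengths=target_lengths,
      blank=0,
      reduction="mean",
      one_sym_per_frame=False,  # True selects the modified transducer
  )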
Data Preparation
----------------

To prepare the data for training, please use the following commands:

.. code-block:: bash

  cd egs/aishell/ASR
  ./prepare.sh --stop-stage 4
  ./prepare.sh --stage 6 --stop-stage 6

.. note::

  You can also run ``./prepare.sh`` without arguments, though it will then
  generate FSTs that are not used in transducer training.

When the script finishes, you will get the following two folders:

- ``data/fbank``: It saves the pre-computed features.
- ``data/lang_char``: It contains tokens that will be used in the training.
Training
--------

.. code-block:: bash

  cd egs/aishell/ASR
  ./transducer_stateless_modified/train.py --help

shows you the training options that can be passed from the commandline.
The following options are used quite often:
- ``--exp-dir``

  The experiment folder in which to save logs and model checkpoints.
  It defaults to ``./transducer_stateless_modified/exp``.

- ``--num-epochs``

  The number of epochs to train. For instance,
  ``./transducer_stateless_modified/train.py --num-epochs 30`` trains for 30
  epochs and generates ``epoch-0.pt``, ``epoch-1.pt``, ..., ``epoch-29.pt``
  in the folder set by ``--exp-dir``.

- ``--start-epoch``

  It is used to resume training.
  ``./transducer_stateless_modified/train.py --start-epoch 10`` loads the
  checkpoint ``exp_dir/epoch-9.pt`` and starts
  training from epoch 10, based on the state from epoch 9.
- ``--world-size``

  It is used for multi-GPU single-machine DDP training.

  - (a) If it is 1, then no DDP training is used.
  - (b) If it is 2, then GPU 0 and GPU 1 are used for DDP training.

  The following shows some use cases.

  **Use case 1**: You have 4 GPUs, but you want to use only GPU 0 and
  GPU 2 for training:

  .. code-block:: bash

    $ cd egs/aishell/ASR
    $ export CUDA_VISIBLE_DEVICES="0,2"
    $ ./transducer_stateless_modified/train.py --world-size 2

  **Use case 2**: You have 4 GPUs and you want to use all of them
  for training:

  .. code-block:: bash

    $ cd egs/aishell/ASR
    $ ./transducer_stateless_modified/train.py --world-size 4

  **Use case 3**: You have 4 GPUs but you want to use only GPU 3
  for training:

  .. code-block:: bash

    $ cd egs/aishell/ASR
    $ export CUDA_VISIBLE_DEVICES="3"
    $ ./transducer_stateless_modified/train.py --world-size 1

  .. CAUTION::

    Only multi-GPU single-machine DDP training is implemented at present.
    There is an ongoing PR `<https://github.com/k2-fsa/icefall/pull/63>`_
    that adds support for multi-GPU multi-machine DDP training.
- ``--max-duration``

  It specifies the total number of seconds over all utterances in a
  batch, **before padding**.
  If you encounter CUDA OOM, please reduce it. For instance, if
  you are using a V100 NVIDIA GPU with 32 GB RAM, we recommend
  setting it to ``300``.

  .. HINT::

    Due to padding, the total number of seconds of all utterances in a
    batch will usually be larger than ``--max-duration``.

    A larger value for ``--max-duration`` may cause OOM during training,
    while a smaller value may increase the training time. You have to
    tune it.
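  To illustrate what ``--max-duration`` controls, here is a toy version of
  duration-based batching. It is only an illustration; the real batching is
  done by lhotse's samplers, which also bucket utterances by duration.

  .. code-block:: python

    def batch_by_duration(durations, max_duration=300.0):
        """Toy example: group utterance indices so that the total
        (unpadded) duration of each batch stays below max_duration."""
        batches, batch, total = [], [], 0.0
        for idx, dur in enumerate(durations):
            if batch and total + dur > max_duration:
                batches.append(batch)
                batch, total = [], 0.0
            batch.append(idx)
            total += dur
        if batch:
            batches.append(batch)
        return batches

    # 8 utterances between 3 s and 15 s -> batches of varying size
    print(batch_by_duration([3.2, 14.7, 9.1, 5.5, 12.0, 4.4, 8.8, 10.3],
                            max_duration=30.0))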
- ``--lr-factor``

  It controls the learning rate. If you use a single GPU for training, you
  may want to use a small value for it. If you use multiple GPUs for
  training, you may increase it.
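  icefall's transformer-based recipes typically use a Noam-style learning
  rate schedule, in which ``--lr-factor`` scales the entire curve. Here is a
  sketch under that assumption; the constants are illustrative, not the
  recipe's actual defaults.

  .. code-block:: python

    def noam_lr(step: int, lr_factor: float, model_size: int = 512,
                warmup: int = 80000) -> float:
        """Noam schedule: linear warmup, then inverse-square-root decay.
        lr_factor scales the whole curve, which is why training with more
        GPUs (a larger effective batch) can afford a larger value."""
        step = max(step, 1)
        return (lr_factor * model_size ** -0.5
                * min(step ** -0.5, step * warmup ** -1.5))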
- ``--context-size``

  It specifies the kernel size of the Conv1d layer in the decoder. The
  default value 2 means the decoder works like a tri-gram LM.

- ``--modified-transducer-prob``

  It specifies the probability of using the modified transducer loss.
  If it is 0, the modified transducer is never used; if it is 1, it is
  used for all batches; if it is ``p``, it is applied with probability
  ``p``.
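  A sketch of how such a per-batch choice can be made (illustrative, not
  icefall's exact code):

  .. code-block:: python

    import random

    def use_modified_loss(modified_transducer_prob: float) -> bool:
        """Decide per batch whether to apply the modified transducer loss."""
        # prob = 0 -> never; prob = 1 -> always; otherwise Bernoulli(p)
        return random.random() < modified_transducer_prob

  The returned flag would then be forwarded to the loss computation, e.g.,
  as the ``one_sym_per_frame`` argument shown earlier.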
There are some training options, e.g., the number of warmup steps,
that are not passed from the commandline.
They are pre-configured by the function ``get_params()`` in
`transducer_stateless_modified/train.py <https://github.com/k2-fsa/icefall/blob/master/egs/aishell/ASR/transducer_stateless_modified/train.py#L162>`_.

If you need to change them, please modify ``./transducer_stateless_modified/train.py`` directly.

.. CAUTION::

  The training set is speed-perturbed with two factors, 0.9 and 1.1, so
  each epoch actually processes ``3x150 == 450`` hours of data.
docs/source/recipes/index.rst

@@ -10,12 +10,10 @@ We may add recipes for other tasks as well in the future.
 .. Other recipes are listed in a alphabetical order.
 
 .. toctree::
-   :maxdepth: 3
+   :maxdepth: 2
+   :caption: Table of Contents
 
-   yesno
-   librispeech
-   aishell
-   timit
+   aishell/index
+   librispeech/index
+   timit/index
+   yesno/index
docs/source/recipes/librispeech.rst (deleted, -10 lines)

LibriSpeech
===========

We provide the following models for the LibriSpeech dataset:

.. toctree::
   :maxdepth: 2

   librispeech/tdnn_lstm_ctc
   librispeech/conformer_ctc
docs/source/recipes/librispeech/index.rst (new file, +8 lines)

LibriSpeech
===========

.. toctree::
   :maxdepth: 1

   tdnn_lstm_ctc
   conformer_ctc
docs/source/recipes/timit.rst (deleted, -10 lines)

TIMIT
===========

We provide the following models for the TIMIT dataset:

.. toctree::
   :maxdepth: 2

   timit/tdnn_lstm_ctc
   timit/tdnn_ligru_ctc
docs/source/recipes/timit/index.rst (new file, +9 lines)

TIMIT
=====

.. toctree::
   :maxdepth: 1

   tdnn_ligru_ctc
   tdnn_lstm_ctc
docs/source/recipes/timit/tdnn_ligru_ctc.rst

@@ -1,5 +1,5 @@
 TDNN-LiGRU-CTC
-=============
+==============
 
 This tutorial shows you how to run a TDNN-LiGRU-CTC model with the `TIMIT <https://data.deepai.org/timit.zip>`_ dataset.
(image renamed; 121 KiB, content unchanged)
docs/source/recipes/yesno/index.rst (new file, +7 lines)

YesNo
=====

.. toctree::
   :maxdepth: 1

   tdnn
docs/source/recipes/yesno/tdnn.rst

@@ -1,5 +1,5 @@
-yesno
-=====
+TDNN-CTC
+========
 
 This page shows you how to run the `yesno <https://www.openslr.org/1>`_ recipe. It contains:
@@ -145,7 +145,7 @@ In ``tdnn/exp``, you will find the following files:
 Note there is a URL in the above output, click it and you will see
 the following screenshot:
 
-.. figure:: images/yesno-tdnn-tensorboard-log.png
+.. figure:: images/tdnn-tensorboard-log.png
    :width: 600
    :alt: TensorBoard screenshot
    :align: center