LSTM Transducer
===============

.. hint::

   Please scroll down to the bottom of this page to find download links
   for pretrained models if you don't want to train a model from scratch.

This tutorial shows you how to train an LSTM transducer model
with the `LibriSpeech `_ dataset.

We use pruned RNN-T to compute the loss.

.. note::

   You can find the paper about pruned RNN-T at the following address:

   ``_

The transducer model consists of 3 parts:

- Encoder, a.k.a., the transcription network. We use an LSTM model.
- Decoder, a.k.a., the prediction network. We use a stateless model consisting
  of ``nn.Embedding`` and ``nn.Conv1d``.
- Joiner, a.k.a., the joint network.

.. caution::

   Contrary to conventional RNN-T models, we use a stateless decoder.
   That is, it has no recurrent connections.

.. hint::

   Since the encoder model is an LSTM, not a Transformer/Conformer, the
   resulting model is suitable for streaming/online ASR.

Which model to use
------------------

Currently, there are two folders for LSTM stateless transducer training:

- ``(1)`` ``_

  This recipe uses only LibriSpeech during training.

- ``(2)`` ``_

  This recipe uses GigaSpeech + LibriSpeech during training.

``(1)`` and ``(2)`` use the same model architecture. The only difference is
that ``(2)`` supports multi-dataset training. Since ``(2)`` uses more data,
it has a lower WER than ``(1)``, but it needs more training time.

We use ``lstm_transducer_stateless2`` as an example below.

.. note::

   You need to download the `GigaSpeech `_ dataset to run ``(2)``.
   If you have only the ``LibriSpeech`` dataset available, feel free to use ``(1)``.

Data preparation
----------------

.. code-block:: bash

   $ cd egs/librispeech/ASR
   $ ./prepare.sh

   # If you use (1), you can **skip** the following command
   $ ./prepare_giga_speech.sh

The script ``./prepare.sh`` handles the data preparation for you,
**automagically**. All you need to do is to run it.

.. note::

   We encourage you to read ``./prepare.sh``.

The data preparation contains several stages. You can use the following two
options:

- ``--stage``
- ``--stop-stage``

to control which stage(s) should be run. By default, all stages are executed.

For example,

.. code-block:: bash

   $ cd egs/librispeech/ASR
   $ ./prepare.sh --stage 0 --stop-stage 0

means to run only stage 0.

To run stage 2 to stage 5, use:

.. code-block:: bash

   $ ./prepare.sh --stage 2 --stop-stage 5

.. hint::

   If you have pre-downloaded the `LibriSpeech `_ dataset and the
   `musan `_ dataset, say, they are saved in ``/tmp/LibriSpeech`` and
   ``/tmp/musan``, you can modify the ``dl_dir`` variable in ``./prepare.sh``
   to point to ``/tmp`` so that ``./prepare.sh`` won't re-download them.

.. note::

   All files generated by ``./prepare.sh``, e.g., features, lexicon, etc.,
   are saved in the ``./data`` directory.
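For instance, if your pre-downloaded copies of the datasets live in ``/tmp``
as in the hint above, the following is a minimal sketch of that workflow. It
assumes that ``dl_dir`` is assigned on a single line near the top of
``./prepare.sh``; you can of course simply edit that line by hand instead of
using ``sed``.

.. code-block:: bash

   $ cd egs/librispeech/ASR

   # Point dl_dir at the directory that already contains LibriSpeech/ and musan/
   # (assumption: the script contains a line of the form "dl_dir=...")
   $ sed -i 's|^dl_dir=.*|dl_dir=/tmp|' prepare.sh

   # Run the pipeline; data that is already present won't be downloaded again
   $ ./prepare.sh

   # Everything generated by the script ends up under ./data
   $ ls data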
We provide the following YouTube video showing how to run ``./prepare.sh``.

.. note::

   To get the latest news of `next-gen Kaldi `_, please subscribe to the
   following YouTube channel by `Nadira Povey `_:

   ``_

.. youtube:: ofEIoJL-mGM

Training
--------

Configurable options
~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   $ cd egs/librispeech/ASR
   $ ./lstm_transducer_stateless2/train.py --help

shows you the training options that can be passed from the commandline.
The following options are used quite often:

- ``--full-libri``

  If it's True, the training part uses all the training data, i.e.,
  960 hours. Otherwise, the training part uses only the subset
  ``train-clean-100``, which has 100 hours of training data.

  .. CAUTION::

     The training set is perturbed by speed with two factors: 0.9 and 1.1.
     If ``--full-libri`` is True, each epoch actually processes
     ``3x960 == 2880`` hours of data.

- ``--num-epochs``

  It is the number of epochs to train. For instance,
  ``./lstm_transducer_stateless2/train.py --num-epochs 30`` trains for 30
  epochs and generates ``epoch-1.pt``, ``epoch-2.pt``, ..., ``epoch-30.pt``
  in the folder ``./lstm_transducer_stateless2/exp``.

- ``--start-epoch``

  It's used to resume training.
  ``./lstm_transducer_stateless2/train.py --start-epoch 10`` loads the
  checkpoint ``./lstm_transducer_stateless2/exp/epoch-9.pt`` and starts
  training from epoch 10, based on the state from epoch 9.

- ``--world-size``

  It is used for multi-GPU single-machine DDP training.

  - (a) If it is 1, then no DDP training is used.
  - (b) If it is 2, then GPU 0 and GPU 1 are used for DDP training.

  The following shows some use cases with it.

  **Use case 1**: You have 4 GPUs, but you only want to use GPU 0 and
  GPU 2 for training. You can do the following:

  .. code-block:: bash

     $ cd egs/librispeech/ASR
     $ export CUDA_VISIBLE_DEVICES="0,2"
     $ ./lstm_transducer_stateless2/train.py --world-size 2

  **Use case 2**: You have 4 GPUs and you want to use all of them
  for training. You can do the following:

  .. code-block:: bash

     $ cd egs/librispeech/ASR
     $ ./lstm_transducer_stateless2/train.py --world-size 4

  **Use case 3**: You have 4 GPUs but you only want to use GPU 3
  for training. You can do the following:

  .. code-block:: bash

     $ cd egs/librispeech/ASR
     $ export CUDA_VISIBLE_DEVICES="3"
     $ ./lstm_transducer_stateless2/train.py --world-size 1

  .. caution::

     Only multi-GPU single-machine DDP training is implemented at present.
     Multi-GPU multi-machine DDP training will be added later.

- ``--max-duration``

  It specifies the number of seconds over all utterances in a batch,
  before **padding**. If you encounter CUDA OOM, please reduce it.

  .. HINT::

     Due to padding, the number of seconds of all utterances in a batch
     will usually be larger than ``--max-duration``.

     A larger value for ``--max-duration`` may cause OOM during training,
     while a smaller value may increase the training time. You have to
     tune it.

- ``--giga-prob``

  The probability to select a batch from the ``GigaSpeech`` dataset.
  Note: It is available only for ``(2)``.
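Putting these options together, the following is a hedged example of a
small-scale training run on ``train-clean-100`` with two GPUs. The values are
only illustrative, not tuned recommendations:

.. code-block:: bash

   $ cd egs/librispeech/ASR
   $ export CUDA_VISIBLE_DEVICES="0,1"

   # Train on the 100-hour subset with 2 GPUs (illustrative values only)
   $ ./lstm_transducer_stateless2/train.py \
       --world-size 2 \
       --num-epochs 30 \
       --full-libri 0 \
       --exp-dir lstm_transducer_stateless2/exp \
       --max-duration 300

See the usage example below for the full 8-GPU training command.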
Pre-configured options
~~~~~~~~~~~~~~~~~~~~~~

There are some training options, e.g., weight decay, number of warmup steps,
results dir, etc., that are not passed from the commandline. They are
pre-configured by the function ``get_params()`` in
`lstm_transducer_stateless2/train.py `_

You don't need to change these pre-configured parameters. If you really need
to change them, please modify ``./lstm_transducer_stateless2/train.py``
directly.

Training logs
~~~~~~~~~~~~~

Training logs and checkpoints are saved in ``lstm_transducer_stateless2/exp``.
You will find the following files in that directory:

- ``epoch-1.pt``, ``epoch-2.pt``, ...

  These are checkpoint files saved at the end of each epoch, containing model
  ``state_dict`` and optimizer ``state_dict``.
  To resume training from some checkpoint, say ``epoch-10.pt``, you can use:

  .. code-block:: bash

     $ ./lstm_transducer_stateless2/train.py --start-epoch 11

- ``checkpoint-436000.pt``, ``checkpoint-438000.pt``, ...

  These are checkpoint files saved every ``--save-every-n`` batches,
  containing model ``state_dict`` and optimizer ``state_dict``.
  To resume training from some checkpoint, say ``checkpoint-436000``, you can use:

  .. code-block:: bash

     $ ./lstm_transducer_stateless2/train.py --start-batch 436000

- ``tensorboard/``

  This folder contains TensorBoard logs. Training loss, validation loss,
  learning rate, etc., are recorded in these logs. You can visualize them by:

  .. code-block:: bash

     $ cd lstm_transducer_stateless2/exp/tensorboard
     $ tensorboard dev upload --logdir . --description "LSTM transducer training for LibriSpeech with icefall"

  It will print something like below:

  .. code-block::

     TensorFlow installation not found - running with reduced feature set.
     Upload started and will continue reading any new data as it's added to the logdir.

     To stop uploading, press Ctrl-C.

     New experiment created. View your TensorBoard at: https://tensorboard.dev/experiment/cj2vtPiwQHKN9Q1tx6PTpg/

     [2022-09-20T15:50:50] Started scanning logdir.
     Uploading 4468 scalars...
     [2022-09-20T15:53:02] Total uploaded: 210171 scalars, 0 tensors, 0 binary objects
     Listening for new data in logdir...

  Note there is a URL in the above output. Click it and you will see
  the following screenshot:

  .. figure:: images/librispeech-lstm-transducer-tensorboard-log.png
     :width: 600
     :alt: TensorBoard screenshot
     :align: center
     :target: https://tensorboard.dev/experiment/lzGnETjwRxC3yghNMd4kPw/

     TensorBoard screenshot.

  .. hint::

     If you don't have access to Google, you can use the following command
     to view the TensorBoard log locally:

     .. code-block:: bash

        cd lstm_transducer_stateless2/exp/tensorboard
        tensorboard --logdir . --port 6008

     It will print the following message:

     .. code-block::

        Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
        TensorBoard 2.8.0 at http://localhost:6008/ (Press CTRL+C to quit)

     Now start your browser and go to ``_ to view the TensorBoard logs.

- ``log/log-train-xxxx``

  It is the detailed training log in text format, same as the one you saw
  printed to the console during training.

Usage example
~~~~~~~~~~~~~

You can use the following command to start the training using 8 GPUs:

.. code-block:: bash

   export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"

   ./lstm_transducer_stateless2/train.py \
     --world-size 8 \
     --num-epochs 35 \
     --start-epoch 1 \
     --full-libri 1 \
     --exp-dir lstm_transducer_stateless2/exp \
     --max-duration 500 \
     --use-fp16 0 \
     --lr-epochs 10 \
     --num-workers 2 \
     --giga-prob 0.9
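While the command above is running, you can follow its progress either with
TensorBoard, as described in the previous section, or by watching the text
logs and checkpoints it writes. A minimal sketch:

.. code-block:: bash

   # Follow the text logs written by train.py (see "Training logs" above)
   $ tail -f lstm_transducer_stateless2/exp/log/log-train-*

   # Checkpoints appear in the experiment directory as training progresses
   $ ls -lh lstm_transducer_stateless2/exp/*.pt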
Decoding
--------

The decoding part uses checkpoints saved by the training part, so you have
to run the training part first.

.. hint::

   There are two kinds of checkpoints:

   - (1) ``epoch-1.pt``, ``epoch-2.pt``, ..., which are saved at the end
     of each epoch. You can pass ``--epoch`` to
     ``lstm_transducer_stateless2/decode.py`` to use them.

   - (2) ``checkpoint-436000.pt``, ``checkpoint-438000.pt``, ..., which are
     saved every ``--save-every-n`` batches. You can pass ``--iter`` to
     ``lstm_transducer_stateless2/decode.py`` to use them.

   We suggest that you try both types of checkpoints and choose the one
   that produces the lowest WERs.

.. code-block:: bash

   $ cd egs/librispeech/ASR
   $ ./lstm_transducer_stateless2/decode.py --help

shows the options for decoding.

The following shows two examples:

.. code-block:: bash

   for m in greedy_search fast_beam_search modified_beam_search; do
     for epoch in 17; do
       for avg in 1 2; do
         ./lstm_transducer_stateless2/decode.py \
           --epoch $epoch \
           --avg $avg \
           --exp-dir lstm_transducer_stateless2/exp \
           --max-duration 600 \
           --num-encoder-layers 12 \
           --rnn-hidden-size 1024 \
           --decoding-method $m \
           --use-averaged-model True \
           --beam 4 \
           --max-contexts 4 \
           --max-states 8 \
           --beam-size 4
       done
     done
   done

.. code-block:: bash

   for m in greedy_search fast_beam_search modified_beam_search; do
     for iter in 474000; do
       for avg in 8 10 12 14 16 18; do
         ./lstm_transducer_stateless2/decode.py \
           --iter $iter \
           --avg $avg \
           --exp-dir lstm_transducer_stateless2/exp \
           --max-duration 600 \
           --num-encoder-layers 12 \
           --rnn-hidden-size 1024 \
           --decoding-method $m \
           --use-averaged-model True \
           --beam 4 \
           --max-contexts 4 \
           --max-states 8 \
           --beam-size 4
       done
     done
   done

Export models
-------------

`lstm_transducer_stateless2/export.py `_ supports exporting checkpoints
from ``lstm_transducer_stateless2/exp`` in the following ways.

Export ``model.state_dict()``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Checkpoints saved by ``lstm_transducer_stateless2/train.py`` also include
``optimizer.state_dict()``. It is useful for resuming training. But after
training, we are interested only in ``model.state_dict()``. You can use the
following command to extract ``model.state_dict()``.

.. code-block:: bash

   # Assume that --iter 468000 --avg 16 produces the smallest WER
   # (You can get such information after running ./lstm_transducer_stateless2/decode.py)

   iter=468000
   avg=16

   ./lstm_transducer_stateless2/export.py \
     --exp-dir ./lstm_transducer_stateless2/exp \
     --bpe-model data/lang_bpe_500/bpe.model \
     --iter $iter \
     --avg $avg

It will generate a file ``./lstm_transducer_stateless2/exp/pretrained.pt``.

.. hint::

   To use the generated ``pretrained.pt`` for
   ``lstm_transducer_stateless2/decode.py``, you can run:

   .. code-block:: bash

      cd lstm_transducer_stateless2/exp
      ln -s pretrained.pt epoch-9999.pt

   And then pass ``--epoch 9999 --avg 1 --use-averaged-model 0`` to
   ``./lstm_transducer_stateless2/decode.py``.

To use the exported model with ``./lstm_transducer_stateless2/pretrained.py``,
you can run:

.. code-block:: bash

   ./lstm_transducer_stateless2/pretrained.py \
     --checkpoint ./lstm_transducer_stateless2/exp/pretrained.pt \
     --bpe-model ./data/lang_bpe_500/bpe.model \
     --method greedy_search \
     /path/to/foo.wav \
     /path/to/bar.wav

Export model using ``torch.jit.trace()``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   iter=468000
   avg=16

   ./lstm_transducer_stateless2/export.py \
     --exp-dir ./lstm_transducer_stateless2/exp \
     --bpe-model data/lang_bpe_500/bpe.model \
     --iter $iter \
     --avg $avg \
     --jit-trace 1

It will generate 3 files:

- ``./lstm_transducer_stateless2/exp/encoder_jit_trace.pt``
- ``./lstm_transducer_stateless2/exp/decoder_jit_trace.pt``
- ``./lstm_transducer_stateless2/exp/joiner_jit_trace.pt``

To use the generated files with ``./lstm_transducer_stateless2/jit_pretrained.py``:

.. code-block:: bash

   ./lstm_transducer_stateless2/jit_pretrained.py \
     --bpe-model ./data/lang_bpe_500/bpe.model \
     --encoder-model-filename ./lstm_transducer_stateless2/exp/encoder_jit_trace.pt \
     --decoder-model-filename ./lstm_transducer_stateless2/exp/decoder_jit_trace.pt \
     --joiner-model-filename ./lstm_transducer_stateless2/exp/joiner_jit_trace.pt \
     /path/to/foo.wav \
     /path/to/bar.wav

.. hint::

   Please see ``_ for how to use the exported models in ``sherpa``.

.. _export-model-for-ncnn:

Export model for ncnn
~~~~~~~~~~~~~~~~~~~~~

We support exporting pretrained LSTM transducer models to
`ncnn `_ using `pnnx `_.

First, let us install a modified version of ``ncnn``:

.. code-block:: bash

   git clone https://github.com/csukuangfj/ncnn
   cd ncnn
   git submodule update --recursive --init
   python3 setup.py bdist_wheel
   ls -lh dist/
   pip install ./dist/*.whl

   # now build pnnx
   cd tools/pnnx
   mkdir build
   cd build
   cmake ..   # configure the build; without this step, make has nothing to build
   make -j4
   export PATH=$PWD/src:$PATH

   ./src/pnnx

.. note::

   We assume that you have added the path to the binary ``pnnx`` to the
   environment variable ``PATH``.
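To double-check that the wheel was installed and that ``pnnx`` is visible
through ``PATH``, here is a quick sketch:

.. code-block:: bash

   # The wheel installed above provides the ncnn Python module
   python3 -c "import ncnn"

   # pnnx should be found through the PATH exported above
   which pnnx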
Second, let us export the model using ``torch.jit.trace()`` in a way that is
suitable for ``pnnx``:

.. code-block:: bash

   iter=468000
   avg=16

   ./lstm_transducer_stateless2/export.py \
     --exp-dir ./lstm_transducer_stateless2/exp \
     --bpe-model data/lang_bpe_500/bpe.model \
     --iter $iter \
     --avg $avg \
     --pnnx 1

It will generate 3 files:

- ``./lstm_transducer_stateless2/exp/encoder_jit_trace-pnnx.pt``
- ``./lstm_transducer_stateless2/exp/decoder_jit_trace-pnnx.pt``
- ``./lstm_transducer_stateless2/exp/joiner_jit_trace-pnnx.pt``

Third, convert the torchscript models to ``ncnn`` format:

.. code-block:: bash

   pnnx ./lstm_transducer_stateless2/exp/encoder_jit_trace-pnnx.pt
   pnnx ./lstm_transducer_stateless2/exp/decoder_jit_trace-pnnx.pt
   pnnx ./lstm_transducer_stateless2/exp/joiner_jit_trace-pnnx.pt

It will generate the following files:

- ``./lstm_transducer_stateless2/exp/encoder_jit_trace-pnnx.ncnn.param``
- ``./lstm_transducer_stateless2/exp/encoder_jit_trace-pnnx.ncnn.bin``
- ``./lstm_transducer_stateless2/exp/decoder_jit_trace-pnnx.ncnn.param``
- ``./lstm_transducer_stateless2/exp/decoder_jit_trace-pnnx.ncnn.bin``
- ``./lstm_transducer_stateless2/exp/joiner_jit_trace-pnnx.ncnn.param``
- ``./lstm_transducer_stateless2/exp/joiner_jit_trace-pnnx.ncnn.bin``

To use the above generated files, run:

.. code-block:: bash

   ./lstm_transducer_stateless2/ncnn-decode.py \
     --bpe-model-filename ./data/lang_bpe_500/bpe.model \
     --encoder-param-filename ./lstm_transducer_stateless2/exp/encoder_jit_trace-pnnx.ncnn.param \
     --encoder-bin-filename ./lstm_transducer_stateless2/exp/encoder_jit_trace-pnnx.ncnn.bin \
     --decoder-param-filename ./lstm_transducer_stateless2/exp/decoder_jit_trace-pnnx.ncnn.param \
     --decoder-bin-filename ./lstm_transducer_stateless2/exp/decoder_jit_trace-pnnx.ncnn.bin \
     --joiner-param-filename ./lstm_transducer_stateless2/exp/joiner_jit_trace-pnnx.ncnn.param \
     --joiner-bin-filename ./lstm_transducer_stateless2/exp/joiner_jit_trace-pnnx.ncnn.bin \
     /path/to/foo.wav

To run the streaming variant:

.. code-block:: bash

   ./lstm_transducer_stateless2/streaming-ncnn-decode.py \
     --bpe-model-filename ./data/lang_bpe_500/bpe.model \
     --encoder-param-filename ./lstm_transducer_stateless2/exp/encoder_jit_trace-pnnx.ncnn.param \
     --encoder-bin-filename ./lstm_transducer_stateless2/exp/encoder_jit_trace-pnnx.ncnn.bin \
     --decoder-param-filename ./lstm_transducer_stateless2/exp/decoder_jit_trace-pnnx.ncnn.param \
     --decoder-bin-filename ./lstm_transducer_stateless2/exp/decoder_jit_trace-pnnx.ncnn.bin \
     --joiner-param-filename ./lstm_transducer_stateless2/exp/joiner_jit_trace-pnnx.ncnn.param \
     --joiner-bin-filename ./lstm_transducer_stateless2/exp/joiner_jit_trace-pnnx.ncnn.bin \
     /path/to/foo.wav

To use the above generated files in C++, please see ``_

It is able to generate a statically linked executable that can be run on
Linux, Windows, macOS, Raspberry Pi, etc., without external dependencies.

Download pretrained models
--------------------------

If you don't want to train from scratch, you can download the pretrained
models by visiting the following links:

- ``_
- ``_

See ``_ for the details of the above pretrained models.

You can find more usage examples of the pretrained models in ``_
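The downloaded models can be used in the same way as the ones you export
yourself. For example, the following sketch assumes you clone one of the
repositories linked above into a placeholder directory
``icefall-lstm-transducer-pretrained``; the file layout inside the repository
is also an assumption, so adjust the paths to what you actually find after
cloning:

.. code-block:: bash

   $ cd egs/librispeech/ASR

   # Make sure git-lfs is installed so that the large model files are fetched
   $ git lfs install

   # <URL> is a placeholder for one of the pretrained-model links above
   $ git clone <URL> icefall-lstm-transducer-pretrained

   # Decode your own sound files with the downloaded checkpoint
   # (exp/pretrained.pt and data/lang_bpe_500/bpe.model are assumed paths)
   $ ./lstm_transducer_stateless2/pretrained.py \
       --checkpoint ./icefall-lstm-transducer-pretrained/exp/pretrained.pt \
       --bpe-model ./icefall-lstm-transducer-pretrained/data/lang_bpe_500/bpe.model \
       --method greedy_search \
       /path/to/foo.wav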