mirror of https://github.com/k2-fsa/icefall.git

commit 67a922737c (parent 1b93399486)
deploy: 67ae5fdf2bf2b09d2ce9e5acb7dab12b2d2fc441
@ -7,3 +7,5 @@ LibriSpeech

   pruned_transducer_stateless
   lstm_pruned_stateless_transducer
   zipformer_transducer

@ -0,0 +1,654 @@
Zipformer Transducer
====================

This tutorial shows you how to run a **streaming** zipformer transducer model
with the `LibriSpeech <https://www.openslr.org/12>`_ dataset.

.. Note::

   The tutorial is suitable for `pruned_transducer_stateless7_streaming <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming>`_.

.. HINT::

   We assume you have read the page :ref:`install icefall` and have set up
   the environment for ``icefall``.

.. HINT::

   We recommend using a GPU or several GPUs to run this recipe.

.. hint::

   Please scroll down to the bottom of this page to find download links
   for pretrained models if you don't want to train a model from scratch.


We use pruned RNN-T to compute the loss.

.. note::

   You can find the paper about pruned RNN-T at the following address:

   `<https://arxiv.org/abs/2206.13236>`_

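To recall what is being optimized (a standard-notation sketch, not the
paper's exact formulation): the transducer loss is the negative
log-probability of the reference, summed over all blank-augmented
alignments,

.. math::

   \mathcal{L} = -\log P(y \mid x)
               = -\log \sum_{a \in \mathcal{B}^{-1}(y)} P(a \mid x),

where :math:`\mathcal{B}` removes blanks. The *pruned* variant evaluates the
full joiner only on a narrow band of the :math:`(t, u)` lattice around a
cheaply estimated alignment, which makes the loss much cheaper to compute.
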
The transducer model consists of 3 parts:

- Encoder, a.k.a. the transcription network. We use a Zipformer model (proposed by Daniel Povey)
- Decoder, a.k.a. the prediction network. We use a stateless model consisting of
  ``nn.Embedding`` and ``nn.Conv1d``
- Joiner, a.k.a. the joint network.

.. caution::

   Contrary to conventional RNN-T models, we use a stateless decoder.
   That is, it has no recurrent connections.

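Concretely, here is a minimal PyTorch sketch of this stateless decoder (the
sizes and ``context_size`` are illustrative, not the recipe's exact
configuration):

.. code-block:: python

   import torch
   import torch.nn as nn

   class StatelessDecoder(nn.Module):
       """Prediction network without recurrence: embed only the last few
       emitted symbols and mix them with a 1-D convolution."""

       def __init__(self, vocab_size: int, embed_dim: int = 512, context_size: int = 2):
           super().__init__()
           self.embedding = nn.Embedding(vocab_size, embed_dim)
           # Looks at only the last `context_size` symbols; no hidden state.
           self.conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=context_size,
                                 groups=embed_dim, bias=False)

       def forward(self, y: torch.Tensor) -> torch.Tensor:
           # y: (batch, context_size) previously emitted symbol IDs
           emb = self.embedding(y).permute(0, 2, 1)  # (batch, embed_dim, context_size)
           return self.conv(emb).permute(0, 2, 1)    # (batch, 1, embed_dim)

   decoder = StatelessDecoder(vocab_size=500)
   print(decoder(torch.tensor([[3, 17]])).shape)  # torch.Size([1, 1, 512])
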
Data preparation
----------------

.. hint::

   The data preparation is the same as for other LibriSpeech recipes.
   If you have finished this step, you can skip to ``Training`` directly.

.. code-block:: bash

  $ cd egs/librispeech/ASR
  $ ./prepare.sh

The script ``./prepare.sh`` handles the data preparation for you, **automagically**.
All you need to do is to run it.

The data preparation contains several stages. You can use the following two
options:

- ``--stage``
- ``--stop-stage``

to control which stage(s) should be run. By default, all stages are executed.

For example,

.. code-block:: bash

  $ cd egs/librispeech/ASR
  $ ./prepare.sh --stage 0 --stop-stage 0

means to run only stage 0.

To run stage 2 to stage 5, use:

.. code-block:: bash

  $ ./prepare.sh --stage 2 --stop-stage 5

.. HINT::

   If you have pre-downloaded the `LibriSpeech <https://www.openslr.org/12>`_
   dataset and the `musan <http://www.openslr.org/17/>`_ dataset, say,
   they are saved in ``/tmp/LibriSpeech`` and ``/tmp/musan``, you can modify
   the ``dl_dir`` variable in ``./prepare.sh`` to point to ``/tmp`` so that
   ``./prepare.sh`` won't re-download them.

.. NOTE::

   All files generated by ``./prepare.sh``, e.g., features, lexicon, etc.,
   are saved in the ``./data`` directory.

We provide the following YouTube video showing how to run ``./prepare.sh``.

.. note::

   To get the latest news about `next-gen Kaldi <https://github.com/k2-fsa>`_, please subscribe
   to the following YouTube channel by `Nadira Povey <https://www.youtube.com/channel/UC_VaumpkmINz1pNkFXAN9mw>`_:

   `<https://www.youtube.com/channel/UC_VaumpkmINz1pNkFXAN9mw>`_

.. youtube:: ofEIoJL-mGM

Training
--------

Configurable options
~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

  $ cd egs/librispeech/ASR
  $ ./pruned_transducer_stateless7_streaming/train.py --help

shows you the training options that can be passed from the commandline.
The following options are used quite often:

- ``--exp-dir``

  The directory to save checkpoints, training logs and tensorboard.

- ``--full-libri``

  If it's True, the training part uses all the training data, i.e.,
  960 hours. Otherwise, the training part uses only the subset
  ``train-clean-100``, which has 100 hours of training data.

  .. CAUTION::

    The training set is perturbed by speed with two factors: 0.9 and 1.1.
    If ``--full-libri`` is True, each epoch actually processes
    ``3x960 == 2880`` hours of data.

- ``--num-epochs``

  It is the number of epochs to train. For instance,
  ``./pruned_transducer_stateless7_streaming/train.py --num-epochs 30`` trains for 30 epochs
  and generates ``epoch-1.pt``, ``epoch-2.pt``, ..., ``epoch-30.pt``
  in the folder ``./pruned_transducer_stateless7_streaming/exp``.

- ``--start-epoch``

  It's used to resume training.
  ``./pruned_transducer_stateless7_streaming/train.py --start-epoch 10`` loads the
  checkpoint ``./pruned_transducer_stateless7_streaming/exp/epoch-9.pt`` and starts
  training from epoch 10, based on the state from epoch 9.

- ``--world-size``

  It is used for multi-GPU single-machine DDP training.

  - (a) If it is 1, then no DDP training is used.
  - (b) If it is 2, then GPU 0 and GPU 1 are used for DDP training.

  The following shows some use cases with it.

  **Use case 1**: You have 4 GPUs, but you only want to use GPU 0 and
  GPU 2 for training. You can do the following:

  .. code-block:: bash

    $ cd egs/librispeech/ASR
    $ export CUDA_VISIBLE_DEVICES="0,2"
    $ ./pruned_transducer_stateless7_streaming/train.py --world-size 2

  **Use case 2**: You have 4 GPUs and you want to use all of them
  for training. You can do the following:

  .. code-block:: bash

    $ cd egs/librispeech/ASR
    $ ./pruned_transducer_stateless7_streaming/train.py --world-size 4

  **Use case 3**: You have 4 GPUs but you only want to use GPU 3
  for training. You can do the following:

  .. code-block:: bash

    $ cd egs/librispeech/ASR
    $ export CUDA_VISIBLE_DEVICES="3"
    $ ./pruned_transducer_stateless7_streaming/train.py --world-size 1

  .. caution::

    Only multi-GPU single-machine DDP training is implemented at present.
    Multi-GPU multi-machine DDP training will be added later.

- ``--max-duration``

  It specifies the total number of seconds over all utterances in a
  batch, before **padding**.
  If you encounter CUDA OOM, please reduce it.

  .. HINT::

    Due to padding, the number of seconds of all utterances in a
    batch will usually be larger than ``--max-duration``.

    A larger value for ``--max-duration`` may cause OOM during training,
    while a smaller value may increase the training time. You have to
    tune it.

- ``--use-fp16``

  If it is True, the model is trained with half precision. From our
  experiments, half precision lets you train with a ``--max-duration``
  about twice as large, which gives an almost 2x speedup.

  We recommend using ``--use-fp16 True``.

- ``--short-chunk-size``

  When training a streaming attention model with chunk masking, the chunk size
  is either the maximum sequence length of the current batch or uniformly sampled
  from (1, short_chunk_size). The default value is 50; you rarely need to change it.
  (A sketch of the resulting attention mask follows this list.)

- ``--num-left-chunks``

  It indicates how much left context (in chunks) can be seen when computing attention.
  The default value is 4; you rarely need to change it.
  (See the sketch after this list.)

- ``--decode-chunk-len``

  The chunk size for decoding (in frames before subsampling). It is used for validation.
  The default value is 32 (i.e., 320ms).

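To make ``--short-chunk-size`` and ``--num-left-chunks`` concrete, here is a
small sketch of the attention mask that chunk masking induces (illustrative
only; the actual mask construction lives in the recipe's model code):

.. code-block:: python

   import torch

   def chunk_attention_mask(seq_len: int, chunk_size: int,
                            num_left_chunks: int) -> torch.Tensor:
       """Frame i may attend to frames in its own chunk and in up to
       `num_left_chunks` chunks to its left. True marks allowed pairs."""
       chunk_id = torch.arange(seq_len) // chunk_size
       q = chunk_id.unsqueeze(1)  # chunk index of the query frame
       k = chunk_id.unsqueeze(0)  # chunk index of the key frame
       return (k <= q) & (k >= q - num_left_chunks)

   # 8 frames, chunks of 2 frames, one left chunk visible:
   print(chunk_attention_mask(8, 2, 1).int())
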
Pre-configured options
~~~~~~~~~~~~~~~~~~~~~~

There are some training options, e.g., number of encoder layers,
encoder dimension, decoder dimension, number of warmup steps, etc.,
that are not passed from the commandline.
They are pre-configured by the function ``get_params()`` in
`pruned_transducer_stateless7_streaming/train.py <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/train.py>`_

You don't need to change these pre-configured parameters. If you really need to change
them, please modify ``./pruned_transducer_stateless7_streaming/train.py`` directly.


Training logs
~~~~~~~~~~~~~

Training logs and checkpoints are saved in ``--exp-dir`` (e.g., ``pruned_transducer_stateless7_streaming/exp``).
You will find the following files in that directory:

- ``epoch-1.pt``, ``epoch-2.pt``, ...

  These are checkpoint files saved at the end of each epoch, containing model
  ``state_dict`` and optimizer ``state_dict``.
  To resume training from some checkpoint, say ``epoch-10.pt``, you can use:

  .. code-block:: bash

    $ ./pruned_transducer_stateless7_streaming/train.py --start-epoch 11

- ``checkpoint-436000.pt``, ``checkpoint-438000.pt``, ...

  These are checkpoint files saved every ``--save-every-n`` batches,
  containing model ``state_dict`` and optimizer ``state_dict``.
  To resume training from some checkpoint, say ``checkpoint-436000``, you can use:

  .. code-block:: bash

    $ ./pruned_transducer_stateless7_streaming/train.py --start-batch 436000

- ``tensorboard/``

  This folder contains TensorBoard logs. Training loss, validation loss, learning
  rate, etc., are recorded in these logs. You can visualize them by:

  .. code-block:: bash

    $ cd pruned_transducer_stateless7_streaming/exp/tensorboard
    $ tensorboard dev upload --logdir . --description "pruned transducer training for LibriSpeech with icefall"

  .. hint::

    If you don't have access to Google, you can use the following command
    to view the tensorboard log locally:

    .. code-block:: bash

      cd pruned_transducer_stateless7_streaming/exp/tensorboard
      tensorboard --logdir . --port 6008

    It will print the following message:

    .. code-block::

      Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
      TensorBoard 2.8.0 at http://localhost:6008/ (Press CTRL+C to quit)

    Now start your browser and go to `<http://localhost:6008>`_ to view the tensorboard
    logs.

- ``log/log-train-xxxx``

  It is the detailed training log in text format, the same as the one
  you saw printed to the console during training.

Usage example
~~~~~~~~~~~~~

You can use the following command to start the training using 4 GPUs:

.. code-block:: bash

  export CUDA_VISIBLE_DEVICES="0,1,2,3"
  ./pruned_transducer_stateless7_streaming/train.py \
    --world-size 4 \
    --num-epochs 30 \
    --start-epoch 1 \
    --use-fp16 1 \
    --exp-dir pruned_transducer_stateless7_streaming/exp \
    --full-libri 1 \
    --max-duration 550

Decoding
--------

The decoding part uses checkpoints saved by the training part, so you have
to run the training part first.

.. hint::

   There are two kinds of checkpoints:

   - (1) ``epoch-1.pt``, ``epoch-2.pt``, ..., which are saved at the end
     of each epoch. You can pass ``--epoch`` to
     ``pruned_transducer_stateless7_streaming/decode.py`` to use them.

   - (2) ``checkpoint-436000.pt``, ``checkpoint-438000.pt``, ..., which are saved
     every ``--save-every-n`` batches. You can pass ``--iter`` to
     ``pruned_transducer_stateless7_streaming/decode.py`` to use them.

   We suggest that you try both types of checkpoints and choose the one
   that produces the lowest WERs.

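The ``--avg`` option used in the decoding and export commands below averages
the parameters of several consecutive checkpoints before decoding. A rough
sketch of the plain (non ``--use-averaged-model``) case, assuming each
checkpoint stores its weights under the key ``"model"``:

.. code-block:: python

   import torch

   def average_checkpoints(paths):
       """Element-wise average of the model state_dicts in `paths`."""
       avg = None
       for p in paths:
           sd = torch.load(p, map_location="cpu")["model"]
           if avg is None:
               avg = {k: v.clone().float() for k, v in sd.items()}
           else:
               for k in avg:
                   avg[k] += sd[k].float()
       return {k: v / len(paths) for k, v in avg.items()}

   # e.g., --epoch 30 --avg 9 roughly corresponds to:
   # average_checkpoints([f"exp/epoch-{i}.pt" for i in range(22, 31)])
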
.. tip::

   To decode a streaming model, you can use either ``simulate streaming decoding`` in ``decode.py`` or
   ``real chunk-wise streaming decoding`` in ``streaming_decode.py``. The difference between ``decode.py`` and
   ``streaming_decode.py`` is that ``decode.py`` processes all acoustic frames at once with masking (i.e., the same as training),
   while ``streaming_decode.py`` processes the acoustic frames chunk by chunk.

.. NOTE::

   ``simulate streaming decoding`` in ``decode.py`` and ``real chunk-wise streaming decoding`` in ``streaming_decode.py`` should
   produce almost the same results given the same ``--decode-chunk-len``.


Simulate streaming decoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

  $ cd egs/librispeech/ASR
  $ ./pruned_transducer_stateless7_streaming/decode.py --help

shows the options for decoding.
The following options are important for streaming models:

``--decode-chunk-len``

  It is the same as in ``train.py``, which specifies the chunk size for decoding (in frames before subsampling).
  The default value is 32 (i.e., 320ms).

The following shows two examples (for the two types of checkpoints):

.. code-block:: bash

  for m in greedy_search fast_beam_search modified_beam_search; do
    for epoch in 30; do
      for avg in 12 11 10 9 8; do
        ./pruned_transducer_stateless7_streaming/decode.py \
          --epoch $epoch \
          --avg $avg \
          --decode-chunk-len 32 \
          --exp-dir pruned_transducer_stateless7_streaming/exp \
          --max-duration 600 \
          --decoding-method $m
      done
    done
  done

.. code-block:: bash

  for m in greedy_search fast_beam_search modified_beam_search; do
    for iter in 474000; do
      for avg in 8 10 12 14 16 18; do
        ./pruned_transducer_stateless7_streaming/decode.py \
          --iter $iter \
          --avg $avg \
          --decode-chunk-len 32 \
          --exp-dir pruned_transducer_stateless7_streaming/exp \
          --max-duration 600 \
          --decoding-method $m
      done
    done
  done


Real streaming decoding
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

  $ cd egs/librispeech/ASR
  $ ./pruned_transducer_stateless7_streaming/streaming_decode.py --help

shows the options for decoding.
The following options are important for streaming models:

``--decode-chunk-len``

  It is the same as in ``train.py``, which specifies the chunk size for decoding (in frames before subsampling).
  The default value is 32 (i.e., 320ms).
  For ``real streaming decoding``, we process ``decode-chunk-len`` acoustic frames at a time.

``--num-decode-streams``

  The number of decoding streams that can be run in parallel (very similar to the ``batch size``).
  For ``real streaming decoding``, the batches are packed dynamically. For example, if
  ``num-decode-streams`` equals 10, sequences 1 to 10 are decoded first; after a while,
  when, say, sequences 1 and 2 are done, sequences 3 to 12 are processed in parallel in a batch.

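A minimal sketch of this dynamic packing; ``decode_one_chunk`` is a
hypothetical stand-in for running the model on one chunk of every active
stream and returning the streams that reached the end (the real logic lives
in ``streaming_decode.py``):

.. code-block:: python

   from collections import deque

   def decode_all(streams, num_decode_streams, decode_one_chunk):
       """Keep up to `num_decode_streams` utterances in flight; when one
       finishes, the next waiting utterance takes its slot."""
       waiting = deque(streams)
       active = []
       results = []
       while waiting or active:
           # Top up the batch with waiting streams.
           while waiting and len(active) < num_decode_streams:
               active.append(waiting.popleft())
           # Advance every active stream by one chunk; collect finished ones.
           finished = decode_one_chunk(active)
           results.extend(finished)
           active = [s for s in active if s not in finished]
       return results
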
The following shows two examples (for the two types of checkpoints):

.. code-block:: bash

  for m in greedy_search fast_beam_search modified_beam_search; do
    for epoch in 30; do
      for avg in 12 11 10 9 8; do
        ./pruned_transducer_stateless7_streaming/streaming_decode.py \
          --epoch $epoch \
          --avg $avg \
          --decode-chunk-len 32 \
          --num-decode-streams 100 \
          --exp-dir pruned_transducer_stateless7_streaming/exp \
          --decoding-method $m
      done
    done
  done

.. code-block:: bash

  for m in greedy_search fast_beam_search modified_beam_search; do
    for iter in 474000; do
      for avg in 8 10 12 14 16 18; do
        ./pruned_transducer_stateless7_streaming/streaming_decode.py \
          --iter $iter \
          --avg $avg \
          --decode-chunk-len 16 \
          --num-decode-streams 100 \
          --exp-dir pruned_transducer_stateless7_streaming/exp \
          --decoding-method $m
      done
    done
  done


.. tip::

   The supported decoding methods are as follows:

   - ``greedy_search`` : It takes the symbol with the largest posterior probability
     at each frame as the decoding result. (A minimal sketch follows this tip.)

   - ``beam_search`` : It implements Algorithm 1 in https://arxiv.org/pdf/1211.3711.pdf and
     `espnet/nets/beam_search_transducer.py <https://github.com/espnet/espnet/blob/master/espnet/nets/beam_search_transducer.py#L247>`_
     is used as a reference. Basically, it keeps the top-k states for each frame and expands
     the kept states with their own contexts to the next frame.

   - ``modified_beam_search`` : It implements the same algorithm as ``beam_search`` above, but it
     runs in batch mode with ``--max-sym-per-frame=1`` being hardcoded.

   - ``fast_beam_search`` : It implements graph composition between the output ``log_probs`` and
     the given ``FSAs``. The details are hard to describe in a few lines of text; you can read
     our paper at https://arxiv.org/pdf/2211.00484.pdf or our `rnnt decode code in k2 <https://github.com/k2-fsa/k2/blob/master/k2/csrc/rnnt_decode.h>`_. ``fast_beam_search`` can decode with ``FSAs`` on GPU efficiently.

   - ``fast_beam_search_LG`` : The same as ``fast_beam_search`` above, except that ``fast_beam_search`` uses
     a trivial graph that has only one state, while ``fast_beam_search_LG`` uses an LG graph
     (with an N-gram LM).

   - ``fast_beam_search_nbest`` : It produces the decoding results as follows:

     - (1) Use ``fast_beam_search`` to get a lattice
     - (2) Select ``num_paths`` paths from the lattice using ``k2.random_paths()``
     - (3) Unique the selected paths
     - (4) Intersect the selected paths with the lattice and compute the
       shortest path from the intersection result
     - (5) The path with the largest score is used as the decoding output.

   - ``fast_beam_search_nbest_LG`` : It implements the same logic as ``fast_beam_search_nbest``; the
     only difference is that it uses ``fast_beam_search_LG`` to generate the lattice.

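Here is a heavily simplified single-utterance sketch of transducer greedy
search (the batched implementation in ``decode.py`` differs, and the joiner
call signature here is illustrative):

.. code-block:: python

   import torch

   def greedy_search(encoder_out, decoder, joiner, blank_id=0, context_size=2):
       """encoder_out: (T, encoder_dim). For simplicity, emits at most one
       symbol per frame."""
       hyp = [blank_id] * context_size  # initial decoder context
       for t in range(encoder_out.size(0)):
           dec_out = decoder(torch.tensor([hyp[-context_size:]]))
           logits = joiner(encoder_out[t].view(1, 1, -1), dec_out)
           y = logits.argmax(dim=-1).item()
           if y != blank_id:  # blank means "move on to the next frame"
               hyp.append(y)
       return hyp[context_size:]
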
.. NOTE::

   ``streaming_decode.py`` may support fewer decoding methods than ``decode.py``. If needed,
   you can implement them yourself or file an issue in `icefall <https://github.com/k2-fsa/icefall/issues>`_ .


Export Model
------------

Currently it supports exporting checkpoints from ``pruned_transducer_stateless7_streaming/exp`` in the following ways.

Export ``model.state_dict()``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Checkpoints saved by ``pruned_transducer_stateless7_streaming/train.py`` also include
``optimizer.state_dict()``. It is useful for resuming training. But after training,
we are interested only in ``model.state_dict()``. You can use the following
command to extract ``model.state_dict()``.

.. code-block:: bash

  # Assume that --epoch 30 --avg 9 produces the smallest WER
  # (You can get such information after running ./pruned_transducer_stateless7_streaming/decode.py)

  epoch=30
  avg=9

  ./pruned_transducer_stateless7_streaming/export.py \
    --exp-dir ./pruned_transducer_stateless7_streaming/exp \
    --bpe-model data/lang_bpe_500/bpe.model \
    --epoch $epoch \
    --avg $avg \
    --use-averaged-model=True \
    --decode-chunk-len 32

It will generate a file ``./pruned_transducer_stateless7_streaming/exp/pretrained.pt``.

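If you want to inspect the exported file yourself, a minimal sketch
(assuming, as in other icefall recipes, that the weights are stored under
the key ``"model"``):

.. code-block:: python

   import torch

   ckpt = torch.load(
       "pruned_transducer_stateless7_streaming/exp/pretrained.pt",
       map_location="cpu",
   )
   print(ckpt.keys())          # the weights are expected under the key "model"
   state_dict = ckpt["model"]  # pass this to model.load_state_dict(...)
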
.. hint::

   To use the generated ``pretrained.pt`` for ``pruned_transducer_stateless7_streaming/decode.py``,
   you can run:

   .. code-block:: bash

     cd pruned_transducer_stateless7_streaming/exp
     ln -s pretrained.pt epoch-999.pt

   And then pass ``--epoch 999 --avg 1 --use-averaged-model 0`` to
   ``./pruned_transducer_stateless7_streaming/decode.py``.

To use the exported model with ``./pruned_transducer_stateless7_streaming/pretrained.py``, you
can run:

.. code-block:: bash

  ./pruned_transducer_stateless7_streaming/pretrained.py \
    --checkpoint ./pruned_transducer_stateless7_streaming/exp/pretrained.pt \
    --bpe-model ./data/lang_bpe_500/bpe.model \
    --method greedy_search \
    --decode-chunk-len 32 \
    /path/to/foo.wav \
    /path/to/bar.wav


Export model using ``torch.jit.script()``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

  ./pruned_transducer_stateless7_streaming/export.py \
    --exp-dir ./pruned_transducer_stateless7_streaming/exp \
    --bpe-model data/lang_bpe_500/bpe.model \
    --epoch 30 \
    --avg 9 \
    --decode-chunk-len 32 \
    --jit 1

.. caution::

   ``--decode-chunk-len`` is required to export a ScriptModule.

It will generate a file ``cpu_jit.pt`` in the given ``exp_dir``. You can later
load it by ``torch.jit.load("cpu_jit.pt")``.

Note that ``cpu`` in the name ``cpu_jit.pt`` means the parameters, when loaded into Python,
are on CPU. You can use ``to("cuda")`` to move them to a CUDA device.

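For example, a minimal sketch of loading it:

.. code-block:: python

   import torch

   model = torch.jit.load("pruned_transducer_stateless7_streaming/exp/cpu_jit.pt")
   model.eval()
   model.to("cuda")  # optional: move the parameters to a CUDA device
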
Export model using ``torch.jit.trace()``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

  epoch=30
  avg=9

  ./pruned_transducer_stateless7_streaming/jit_trace_export.py \
    --bpe-model data/lang_bpe_500/bpe.model \
    --use-averaged-model=True \
    --decode-chunk-len 32 \
    --exp-dir ./pruned_transducer_stateless7_streaming/exp \
    --epoch $epoch \
    --avg $avg

.. caution::

   ``--decode-chunk-len`` is required to export a ScriptModule.

It will generate 3 files:

- ``./pruned_transducer_stateless7_streaming/exp/encoder_jit_trace.pt``
- ``./pruned_transducer_stateless7_streaming/exp/decoder_jit_trace.pt``
- ``./pruned_transducer_stateless7_streaming/exp/joiner_jit_trace.pt``

To use the generated files with ``./pruned_transducer_stateless7_streaming/jit_trace_pretrained.py``:

.. code-block:: bash

  ./pruned_transducer_stateless7_streaming/jit_trace_pretrained.py \
    --encoder-model-filename ./pruned_transducer_stateless7_streaming/exp/encoder_jit_trace.pt \
    --decoder-model-filename ./pruned_transducer_stateless7_streaming/exp/decoder_jit_trace.pt \
    --joiner-model-filename ./pruned_transducer_stateless7_streaming/exp/joiner_jit_trace.pt \
    --bpe-model ./data/lang_bpe_500/bpe.model \
    --decode-chunk-len 32 \
    /path/to/foo.wav


Download pretrained models
--------------------------

If you don't want to train from scratch, you can download the pretrained models
by visiting the following links:

- `pruned_transducer_stateless7_streaming <https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29>`_

See `<https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md>`_
for the details of the above pretrained models.

Deploy with Sherpa
------------------

Please see `<https://k2-fsa.github.io/sherpa/python/streaming_asr/conformer/index.html#>`_
for how to deploy the models in ``sherpa``.
|
@ -21,7 +21,7 @@
|
||||
<link rel="index" title="Index" href="../genindex.html" />
|
||||
<link rel="search" title="Search" href="../search.html" />
|
||||
<link rel="next" title="Contributing to Documentation" href="doc.html" />
|
||||
<link rel="prev" title="LSTM Transducer" href="../recipes/Streaming-ASR/librispeech/lstm_pruned_stateless_transducer.html" />
|
||||
<link rel="prev" title="Zipformer Transducer" href="../recipes/Streaming-ASR/librispeech/zipformer_transducer.html" />
|
||||
</head>
|
||||
|
||||
<body class="wy-body-for-nav">
|
||||
@ -124,7 +124,7 @@ and code to <code class="docutils literal notranslate"><span class="pre">icefall
|
||||
</div>
|
||||
</div>
|
||||
<footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
|
||||
<a href="../recipes/Streaming-ASR/librispeech/lstm_pruned_stateless_transducer.html" class="btn btn-neutral float-left" title="LSTM Transducer" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
|
||||
<a href="../recipes/Streaming-ASR/librispeech/zipformer_transducer.html" class="btn btn-neutral float-left" title="Zipformer Transducer" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
|
||||
<a href="doc.html" class="btn btn-neutral float-right" title="Contributing to Documentation" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
|
||||
</div>
|
||||
|
||||
|
BIN
objects.inv
BIN
objects.inv
Binary file not shown.
@ -97,6 +97,7 @@
|
||||
<li class="toctree-l1"><a class="reference internal" href="librispeech/index.html">LibriSpeech</a><ul>
|
||||
<li class="toctree-l2"><a class="reference internal" href="librispeech/pruned_transducer_stateless.html">Pruned transducer statelessX</a></li>
|
||||
<li class="toctree-l2"><a class="reference internal" href="librispeech/lstm_pruned_stateless_transducer.html">LSTM Transducer</a></li>
|
||||
<li class="toctree-l2"><a class="reference internal" href="librispeech/zipformer_transducer.html">Zipformer Transducer</a></li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
|
@ -52,6 +52,7 @@
|
||||
<li class="toctree-l3 current"><a class="current reference internal" href="#">LibriSpeech</a><ul>
|
||||
<li class="toctree-l4"><a class="reference internal" href="pruned_transducer_stateless.html">Pruned transducer statelessX</a></li>
|
||||
<li class="toctree-l4"><a class="reference internal" href="lstm_pruned_stateless_transducer.html">LSTM Transducer</a></li>
|
||||
<li class="toctree-l4"><a class="reference internal" href="zipformer_transducer.html">Zipformer Transducer</a></li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
@ -96,6 +97,7 @@
|
||||
<ul>
|
||||
<li class="toctree-l1"><a class="reference internal" href="pruned_transducer_stateless.html">Pruned transducer statelessX</a></li>
|
||||
<li class="toctree-l1"><a class="reference internal" href="lstm_pruned_stateless_transducer.html">LSTM Transducer</a></li>
|
||||
<li class="toctree-l1"><a class="reference internal" href="zipformer_transducer.html">Zipformer Transducer</a></li>
|
||||
</ul>
|
||||
</div>
|
||||
</section>
|
||||
|
@ -20,7 +20,7 @@
|
||||
<script src="../../../_static/js/theme.js"></script>
|
||||
<link rel="index" title="Index" href="../../../genindex.html" />
|
||||
<link rel="search" title="Search" href="../../../search.html" />
|
||||
<link rel="next" title="Contributing" href="../../../contributing/index.html" />
|
||||
<link rel="next" title="Zipformer Transducer" href="zipformer_transducer.html" />
|
||||
<link rel="prev" title="Pruned transducer statelessX" href="pruned_transducer_stateless.html" />
|
||||
</head>
|
||||
|
||||
@ -52,6 +52,7 @@
|
||||
<li class="toctree-l3 current"><a class="reference internal" href="index.html">LibriSpeech</a><ul class="current">
|
||||
<li class="toctree-l4"><a class="reference internal" href="pruned_transducer_stateless.html">Pruned transducer statelessX</a></li>
|
||||
<li class="toctree-l4 current"><a class="current reference internal" href="#">LSTM Transducer</a></li>
|
||||
<li class="toctree-l4"><a class="reference internal" href="zipformer_transducer.html">Zipformer Transducer</a></li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
@ -699,7 +700,7 @@ for the details of the above pretrained models</p>
|
||||
</div>
|
||||
<footer><div class="rst-footer-buttons" role="navigation" aria-label="Footer">
|
||||
<a href="pruned_transducer_stateless.html" class="btn btn-neutral float-left" title="Pruned transducer statelessX" accesskey="p" rel="prev"><span class="fa fa-arrow-circle-left" aria-hidden="true"></span> Previous</a>
|
||||
<a href="../../../contributing/index.html" class="btn btn-neutral float-right" title="Contributing" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
|
||||
<a href="zipformer_transducer.html" class="btn btn-neutral float-right" title="Zipformer Transducer" accesskey="n" rel="next">Next <span class="fa fa-arrow-circle-right" aria-hidden="true"></span></a>
|
||||
</div>
|
||||
|
||||
<hr/>
|
||||
|
@ -52,6 +52,7 @@
|
||||
<li class="toctree-l3 current"><a class="reference internal" href="index.html">LibriSpeech</a><ul class="current">
|
||||
<li class="toctree-l4 current"><a class="current reference internal" href="#">Pruned transducer statelessX</a></li>
|
||||
<li class="toctree-l4"><a class="reference internal" href="lstm_pruned_stateless_transducer.html">LSTM Transducer</a></li>
|
||||
<li class="toctree-l4"><a class="reference internal" href="zipformer_transducer.html">Zipformer Transducer</a></li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
|
752
recipes/Streaming-ASR/librispeech/zipformer_transducer.html
Normal file
752
recipes/Streaming-ASR/librispeech/zipformer_transducer.html
Normal file
@ -0,0 +1,752 @@
|
||||
<!DOCTYPE html>
|
||||
<html class="writer-html5" lang="en" >
|
||||
<head>
|
||||
<meta charset="utf-8" /><meta name="generator" content="Docutils 0.17.1: http://docutils.sourceforge.net/" />
|
||||
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||||
<title>Zipformer Transducer — icefall 0.1 documentation</title>
|
||||
<link rel="stylesheet" href="../../../_static/pygments.css" type="text/css" />
|
||||
<link rel="stylesheet" href="../../../_static/css/theme.css" type="text/css" />
|
||||
<!--[if lt IE 9]>
|
||||
<script src="../../../_static/js/html5shiv.min.js"></script>
|
||||
<![endif]-->
|
||||
|
||||
<script data-url_root="../../../" id="documentation_options" src="../../../_static/documentation_options.js"></script>
|
||||
<script src="../../../_static/jquery.js"></script>
|
||||
<script src="../../../_static/underscore.js"></script>
|
||||
<script src="../../../_static/_sphinx_javascript_frameworks_compat.js"></script>
|
||||
<script src="../../../_static/doctools.js"></script>
|
||||
<script src="../../../_static/sphinx_highlight.js"></script>
|
||||
<script src="../../../_static/js/theme.js"></script>
|
||||
<link rel="index" title="Index" href="../../../genindex.html" />
|
||||
<link rel="search" title="Search" href="../../../search.html" />
|
||||
<link rel="next" title="Contributing" href="../../../contributing/index.html" />
|
||||
<link rel="prev" title="LSTM Transducer" href="lstm_pruned_stateless_transducer.html" />
|
||||
</head>
|
||||
|
||||
<body class="wy-body-for-nav">
|
||||
<div class="wy-grid-for-nav">
|
||||
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
|
||||
<div class="wy-side-scroll">
|
||||
<div class="wy-side-nav-search" >
|
||||
<a href="../../../index.html" class="icon icon-home"> icefall
|
||||
</a>
|
||||
<div role="search">
|
||||
<form id="rtd-search-form" class="wy-form" action="../../../search.html" method="get">
|
||||
<input type="text" name="q" placeholder="Search docs" />
|
||||
<input type="hidden" name="check_keywords" value="yes" />
|
||||
<input type="hidden" name="area" value="default" />
|
||||
</form>
|
||||
</div>
|
||||
</div><div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="Navigation menu">
|
||||
<p class="caption" role="heading"><span class="caption-text">Contents:</span></p>
|
||||
<ul>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../../../installation/index.html">Installation</a></li>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../../../model-export/index.html">Model export</a></li>
|
||||
</ul>
|
||||
<ul class="current">
|
||||
<li class="toctree-l1 current"><a class="reference internal" href="../../index.html">Recipes</a><ul class="current">
|
||||
<li class="toctree-l2"><a class="reference internal" href="../../Non-streaming-ASR/index.html">Non Streaming ASR</a></li>
|
||||
<li class="toctree-l2 current"><a class="reference internal" href="../index.html">Streaming ASR</a><ul class="current">
|
||||
<li class="toctree-l3"><a class="reference internal" href="../introduction.html">Introduction</a></li>
|
||||
<li class="toctree-l3 current"><a class="reference internal" href="index.html">LibriSpeech</a><ul class="current">
|
||||
<li class="toctree-l4"><a class="reference internal" href="pruned_transducer_stateless.html">Pruned transducer statelessX</a></li>
|
||||
<li class="toctree-l4"><a class="reference internal" href="lstm_pruned_stateless_transducer.html">LSTM Transducer</a></li>
|
||||
<li class="toctree-l4 current"><a class="current reference internal" href="#">Zipformer Transducer</a></li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
<ul>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../../../contributing/index.html">Contributing</a></li>
|
||||
<li class="toctree-l1"><a class="reference internal" href="../../../huggingface/index.html">Huggingface</a></li>
|
||||
</ul>
|
||||
|
||||
</div>
|
||||
</div>
|
||||
</nav>
|
||||
|
||||
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"><nav class="wy-nav-top" aria-label="Mobile navigation menu" >
|
||||
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
|
||||
<a href="../../../index.html">icefall</a>
|
||||
</nav>
|
||||
|
||||
<div class="wy-nav-content">
|
||||
<div class="rst-content">
|
||||
<div role="navigation" aria-label="Page navigation">
|
||||
<ul class="wy-breadcrumbs">
|
||||
<li><a href="../../../index.html" class="icon icon-home"></a></li>
|
||||
<li class="breadcrumb-item"><a href="../../index.html">Recipes</a></li>
|
||||
<li class="breadcrumb-item"><a href="../index.html">Streaming ASR</a></li>
|
||||
<li class="breadcrumb-item"><a href="index.html">LibriSpeech</a></li>
|
||||
<li class="breadcrumb-item active">Zipformer Transducer</li>
|
||||
<li class="wy-breadcrumbs-aside">
|
||||
<a href="https://github.com/k2-fsa/icefall/blob/master/docs/source/recipes/Streaming-ASR/librispeech/zipformer_transducer.rst" class="fa fa-github"> Edit on GitHub</a>
|
||||
</li>
|
||||
</ul>
|
||||
<hr/>
|
||||
</div>
|
||||
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
|
||||
<div itemprop="articleBody">
|
||||
|
||||
<section id="zipformer-transducer">
|
||||
<h1>Zipformer Transducer<a class="headerlink" href="#zipformer-transducer" title="Permalink to this heading"></a></h1>
|
||||
<p>This tutorial shows you how to run a <strong>streaming</strong> zipformer transducer model
|
||||
with the <a class="reference external" href="https://www.openslr.org/12">LibriSpeech</a> dataset.</p>
|
||||
<div class="admonition note">
|
||||
<p class="admonition-title">Note</p>
|
||||
<p>The tutorial is suitable for <a class="reference external" href="https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming">pruned_transducer_stateless7_streaming</a>,</p>
|
||||
</div>
|
||||
<div class="admonition hint">
|
||||
<p class="admonition-title">Hint</p>
|
||||
<p>We assume you have read the page <a class="reference internal" href="../../../installation/index.html#install-icefall"><span class="std std-ref">Installation</span></a> and have setup
|
||||
the environment for <code class="docutils literal notranslate"><span class="pre">icefall</span></code>.</p>
|
||||
</div>
|
||||
<div class="admonition hint">
|
||||
<p class="admonition-title">Hint</p>
|
||||
<p>We recommend you to use a GPU or several GPUs to run this recipe.</p>
|
||||
</div>
|
||||
<div class="admonition hint">
|
||||
<p class="admonition-title">Hint</p>
|
||||
<p>Please scroll down to the bottom of this page to find download links
|
||||
for pretrained models if you don’t want to train a model from scratch.</p>
|
||||
</div>
|
||||
<p>We use pruned RNN-T to compute the loss.</p>
|
||||
<div class="admonition note">
|
||||
<p class="admonition-title">Note</p>
|
||||
<p>You can find the paper about pruned RNN-T at the following address:</p>
|
||||
<p><a class="reference external" href="https://arxiv.org/abs/2206.13236">https://arxiv.org/abs/2206.13236</a></p>
|
||||
</div>
|
||||
<p>The transducer model consists of 3 parts:</p>
|
||||
<blockquote>
|
||||
<div><ul class="simple">
|
||||
<li><p>Encoder, a.k.a, the transcription network. We use a Zipformer model (proposed by Daniel Povey)</p></li>
|
||||
<li><p>Decoder, a.k.a, the prediction network. We use a stateless model consisting of
|
||||
<code class="docutils literal notranslate"><span class="pre">nn.Embedding</span></code> and <code class="docutils literal notranslate"><span class="pre">nn.Conv1d</span></code></p></li>
|
||||
<li><p>Joiner, a.k.a, the joint network.</p></li>
|
||||
</ul>
|
||||
</div></blockquote>
|
||||
<div class="admonition caution">
|
||||
<p class="admonition-title">Caution</p>
|
||||
<p>Contrary to the conventional RNN-T models, we use a stateless decoder.
|
||||
That is, it has no recurrent connections.</p>
|
||||
</div>
|
||||
<section id="data-preparation">
|
||||
<h2>Data preparation<a class="headerlink" href="#data-preparation" title="Permalink to this heading"></a></h2>
|
||||
<div class="admonition hint">
|
||||
<p class="admonition-title">Hint</p>
|
||||
<p>The data preparation is the same as other recipes on LibriSpeech dataset,
|
||||
if you have finished this step, you can skip to <code class="docutils literal notranslate"><span class="pre">Training</span></code> directly.</p>
|
||||
</div>
|
||||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$ <span class="nb">cd</span> egs/librispeech/ASR
|
||||
$ ./prepare.sh
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>The script <code class="docutils literal notranslate"><span class="pre">./prepare.sh</span></code> handles the data preparation for you, <strong>automagically</strong>.
|
||||
All you need to do is to run it.</p>
|
||||
<p>The data preparation contains several stages, you can use the following two
|
||||
options:</p>
|
||||
<blockquote>
|
||||
<div><ul class="simple">
|
||||
<li><p><code class="docutils literal notranslate"><span class="pre">--stage</span></code></p></li>
|
||||
<li><p><code class="docutils literal notranslate"><span class="pre">--stop-stage</span></code></p></li>
|
||||
</ul>
|
||||
</div></blockquote>
|
||||
<p>to control which stage(s) should be run. By default, all stages are executed.</p>
|
||||
<p>For example,</p>
|
||||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$ <span class="nb">cd</span> egs/librispeech/ASR
|
||||
$ ./prepare.sh --stage <span class="m">0</span> --stop-stage <span class="m">0</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>means to run only stage 0.</p>
|
||||
<p>To run stage 2 to stage 5, use:</p>
|
||||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$ ./prepare.sh --stage <span class="m">2</span> --stop-stage <span class="m">5</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
<div class="admonition hint">
|
||||
<p class="admonition-title">Hint</p>
|
||||
<p>If you have pre-downloaded the <a class="reference external" href="https://www.openslr.org/12">LibriSpeech</a>
|
||||
dataset and the <a class="reference external" href="http://www.openslr.org/17/">musan</a> dataset, say,
|
||||
they are saved in <code class="docutils literal notranslate"><span class="pre">/tmp/LibriSpeech</span></code> and <code class="docutils literal notranslate"><span class="pre">/tmp/musan</span></code>, you can modify
|
||||
the <code class="docutils literal notranslate"><span class="pre">dl_dir</span></code> variable in <code class="docutils literal notranslate"><span class="pre">./prepare.sh</span></code> to point to <code class="docutils literal notranslate"><span class="pre">/tmp</span></code> so that
|
||||
<code class="docutils literal notranslate"><span class="pre">./prepare.sh</span></code> won’t re-download them.</p>
|
||||
</div>
|
||||
<div class="admonition note">
|
||||
<p class="admonition-title">Note</p>
|
||||
<p>All generated files by <code class="docutils literal notranslate"><span class="pre">./prepare.sh</span></code>, e.g., features, lexicon, etc,
|
||||
are saved in <code class="docutils literal notranslate"><span class="pre">./data</span></code> directory.</p>
|
||||
</div>
|
||||
<p>We provide the following YouTube video showing how to run <code class="docutils literal notranslate"><span class="pre">./prepare.sh</span></code>.</p>
|
||||
<div class="admonition note">
|
||||
<p class="admonition-title">Note</p>
|
||||
<p>To get the latest news of <a class="reference external" href="https://github.com/k2-fsa">next-gen Kaldi</a>, please subscribe
|
||||
the following YouTube channel by <a class="reference external" href="https://www.youtube.com/channel/UC_VaumpkmINz1pNkFXAN9mw">Nadira Povey</a>:</p>
|
||||
<blockquote>
|
||||
<div><p><a class="reference external" href="https://www.youtube.com/channel/UC_VaumpkmINz1pNkFXAN9mw">https://www.youtube.com/channel/UC_VaumpkmINz1pNkFXAN9mw</a></p>
|
||||
</div></blockquote>
|
||||
</div>
|
||||
<div class="video_wrapper" style="">
|
||||
<iframe allowfullscreen="true" src="https://www.youtube.com/embed/ofEIoJL-mGM" style="border: 0; height: 345px; width: 560px">
|
||||
</iframe></div></section>
|
||||
<section id="training">
|
||||
<h2>Training<a class="headerlink" href="#training" title="Permalink to this heading"></a></h2>
|
||||
<section id="configurable-options">
|
||||
<h3>Configurable options<a class="headerlink" href="#configurable-options" title="Permalink to this heading"></a></h3>
|
||||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$ <span class="nb">cd</span> egs/librispeech/ASR
|
||||
$ ./pruned_transducer_stateless7_streaming/train.py --help
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>shows you the training options that can be passed from the commandline.
|
||||
The following options are used quite often:</p>
|
||||
<blockquote>
|
||||
<div><ul>
|
||||
<li><p><code class="docutils literal notranslate"><span class="pre">--exp-dir</span></code></p>
|
||||
<p>The directory to save checkpoints, training logs and tensorboard.</p>
|
||||
</li>
|
||||
<li><p><code class="docutils literal notranslate"><span class="pre">--full-libri</span></code></p>
|
||||
<p>If it’s True, the training part uses all the training data, i.e.,
|
||||
960 hours. Otherwise, the training part uses only the subset
|
||||
<code class="docutils literal notranslate"><span class="pre">train-clean-100</span></code>, which has 100 hours of training data.</p>
|
||||
<div class="admonition caution">
|
||||
<p class="admonition-title">Caution</p>
|
||||
<p>The training set is perturbed by speed with two factors: 0.9 and 1.1.
|
||||
If <code class="docutils literal notranslate"><span class="pre">--full-libri</span></code> is True, each epoch actually processes
|
||||
<code class="docutils literal notranslate"><span class="pre">3x960</span> <span class="pre">==</span> <span class="pre">2880</span></code> hours of data.</p>
|
||||
</div>
|
||||
</li>
|
||||
<li><p><code class="docutils literal notranslate"><span class="pre">--num-epochs</span></code></p>
|
||||
<p>It is the number of epochs to train. For instance,
|
||||
<code class="docutils literal notranslate"><span class="pre">./pruned_transducer_stateless7_streaming/train.py</span> <span class="pre">--num-epochs</span> <span class="pre">30</span></code> trains for 30 epochs
|
||||
and generates <code class="docutils literal notranslate"><span class="pre">epoch-1.pt</span></code>, <code class="docutils literal notranslate"><span class="pre">epoch-2.pt</span></code>, …, <code class="docutils literal notranslate"><span class="pre">epoch-30.pt</span></code>
|
||||
in the folder <code class="docutils literal notranslate"><span class="pre">./pruned_transducer_stateless7_streaming/exp</span></code>.</p>
|
||||
</li>
|
||||
<li><p><code class="docutils literal notranslate"><span class="pre">--start-epoch</span></code></p>
|
||||
<p>It’s used to resume training.
|
||||
<code class="docutils literal notranslate"><span class="pre">./pruned_transducer_stateless7_streaming/train.py</span> <span class="pre">--start-epoch</span> <span class="pre">10</span></code> loads the
|
||||
checkpoint <code class="docutils literal notranslate"><span class="pre">./pruned_transducer_stateless7_streaming/exp/epoch-9.pt</span></code> and starts
|
||||
training from epoch 10, based on the state from epoch 9.</p>
|
||||
</li>
|
||||
<li><p><code class="docutils literal notranslate"><span class="pre">--world-size</span></code></p>
|
||||
<p>It is used for multi-GPU single-machine DDP training.</p>
|
||||
<blockquote>
|
||||
<div><ul class="simple">
|
||||
<li><ol class="loweralpha simple">
|
||||
<li><p>If it is 1, then no DDP training is used.</p></li>
|
||||
</ol>
|
||||
</li>
|
||||
<li><ol class="loweralpha simple" start="2">
|
||||
<li><p>If it is 2, then GPU 0 and GPU 1 are used for DDP training.</p></li>
|
||||
</ol>
|
||||
</li>
|
||||
</ul>
|
||||
</div></blockquote>
|
||||
<p>The following shows some use cases with it.</p>
|
||||
<blockquote>
|
||||
<div><p><strong>Use case 1</strong>: You have 4 GPUs, but you only want to use GPU 0 and
|
||||
GPU 2 for training. You can do the following:</p>
|
||||
<blockquote>
|
||||
<div><div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$ <span class="nb">cd</span> egs/librispeech/ASR
|
||||
$ <span class="nb">export</span> <span class="nv">CUDA_VISIBLE_DEVICES</span><span class="o">=</span><span class="s2">"0,2"</span>
|
||||
$ ./pruned_transducer_stateless7_streaming/train.py --world-size <span class="m">2</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
</div></blockquote>
|
||||
<p><strong>Use case 2</strong>: You have 4 GPUs and you want to use all of them
|
||||
for training. You can do the following:</p>
|
||||
<blockquote>
|
||||
<div><div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$ <span class="nb">cd</span> egs/librispeech/ASR
|
||||
$ ./pruned_transducer_stateless7_streaming/train.py --world-size <span class="m">4</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
</div></blockquote>
|
||||
<p><strong>Use case 3</strong>: You have 4 GPUs but you only want to use GPU 3
|
||||
for training. You can do the following:</p>
|
||||
<blockquote>
|
||||
<div><div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$ <span class="nb">cd</span> egs/librispeech/ASR
|
||||
$ <span class="nb">export</span> <span class="nv">CUDA_VISIBLE_DEVICES</span><span class="o">=</span><span class="s2">"3"</span>
|
||||
$ ./pruned_transducer_stateless7_streaming/train.py --world-size <span class="m">1</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
</div></blockquote>
|
||||
</div></blockquote>
|
||||
<div class="admonition caution">
|
||||
<p class="admonition-title">Caution</p>
|
||||
<p>Only multi-GPU single-machine DDP training is implemented at present.
|
||||
Multi-GPU multi-machine DDP training will be added later.</p>
|
||||
</div>
|
||||
</li>
|
||||
<li><p><code class="docutils literal notranslate"><span class="pre">--max-duration</span></code></p>
|
||||
<p>It specifies the number of seconds over all utterances in a
|
||||
batch, before <strong>padding</strong>.
|
||||
If you encounter CUDA OOM, please reduce it.</p>
|
||||
<div class="admonition hint">
|
||||
<p class="admonition-title">Hint</p>
|
||||
<p>Due to padding, the number of seconds of all utterances in a
|
||||
batch will usually be larger than <code class="docutils literal notranslate"><span class="pre">--max-duration</span></code>.</p>
|
||||
<p>A larger value for <code class="docutils literal notranslate"><span class="pre">--max-duration</span></code> may cause OOM during training,
|
||||
while a smaller value may increase the training time. You have to
|
||||
tune it.</p>
|
||||
</div>
|
||||
</li>
|
||||
<li><p><code class="docutils literal notranslate"><span class="pre">--use-fp16</span></code></p>
|
||||
<p>If it is True, the model will train with half precision, from our experiment
|
||||
results, by using half precision you can train with two times larger <code class="docutils literal notranslate"><span class="pre">--max-duration</span></code>
|
||||
so as to get almost 2X speed up.</p>
|
||||
<p>We recommend using <code class="docutils literal notranslate"><span class="pre">--use-fp16</span> <span class="pre">True</span></code>.</p>
|
||||
</li>
|
||||
<li><p><code class="docutils literal notranslate"><span class="pre">--short-chunk-size</span></code></p>
|
||||
<p>When training a streaming attention model with chunk masking, the chunk size
|
||||
would be either max sequence length of current batch or uniformly sampled from
|
||||
(1, short_chunk_size). The default value is 50, you don’t have to change it most of the time.</p>
|
||||
</li>
|
||||
<li><p><code class="docutils literal notranslate"><span class="pre">--num-left-chunks</span></code></p>
|
||||
<p>It indicates how many left context (in chunks) that can be seen when calculating attention.
|
||||
The default value is 4, you don’t have to change it most of the time.</p>
|
||||
</li>
|
||||
<li><p><code class="docutils literal notranslate"><span class="pre">--decode-chunk-len</span></code></p>
|
||||
<p>The chunk size for decoding (in frames before subsampling). It is used for validation.
|
||||
The default value is 32 (i.e., 320ms).</p>
|
||||
</li>
|
||||
</ul>
|
||||
</div></blockquote>
|
||||
</section>
|
||||
<section id="pre-configured-options">
|
||||
<h3>Pre-configured options<a class="headerlink" href="#pre-configured-options" title="Permalink to this heading"></a></h3>
|
||||
<p>There are some training options, e.g., number of encoder layers,
|
||||
encoder dimension, decoder dimension, number of warmup steps etc,
|
||||
that are not passed from the commandline.
|
||||
They are pre-configured by the function <code class="docutils literal notranslate"><span class="pre">get_params()</span></code> in
|
||||
<a class="reference external" href="https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/train.py">pruned_transducer_stateless7_streaming/train.py</a></p>
|
||||
<p>You don’t need to change these pre-configured parameters. If you really need to change
|
||||
them, please modify <code class="docutils literal notranslate"><span class="pre">./pruned_transducer_stateless7_streaming/train.py</span></code> directly.</p>
|
||||
</section>
|
||||
<section id="training-logs">
|
||||
<h3>Training logs<a class="headerlink" href="#training-logs" title="Permalink to this heading"></a></h3>
|
||||
<p>Training logs and checkpoints are saved in <code class="docutils literal notranslate"><span class="pre">--exp-dir</span></code> (e.g. <code class="docutils literal notranslate"><span class="pre">pruned_transducer_stateless7_streaming/exp</span></code>.
|
||||
You will find the following files in that directory:</p>
|
||||
<blockquote>
|
||||
<div><ul>
|
||||
<li><p><code class="docutils literal notranslate"><span class="pre">epoch-1.pt</span></code>, <code class="docutils literal notranslate"><span class="pre">epoch-2.pt</span></code>, …</p>
|
||||
<p>These are checkpoint files saved at the end of each epoch, containing model
|
||||
<code class="docutils literal notranslate"><span class="pre">state_dict</span></code> and optimizer <code class="docutils literal notranslate"><span class="pre">state_dict</span></code>.
|
||||
To resume training from some checkpoint, say <code class="docutils literal notranslate"><span class="pre">epoch-10.pt</span></code>, you can use:</p>
|
||||
<blockquote>
|
||||
<div><div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$ ./pruned_transducer_stateless7_streaming/train.py --start-epoch <span class="m">11</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
</div></blockquote>
|
||||
</li>
|
||||
<li><p><code class="docutils literal notranslate"><span class="pre">checkpoint-436000.pt</span></code>, <code class="docutils literal notranslate"><span class="pre">checkpoint-438000.pt</span></code>, …</p>
|
||||
<p>These are checkpoint files saved every <code class="docutils literal notranslate"><span class="pre">--save-every-n</span></code> batches,
|
||||
containing model <code class="docutils literal notranslate"><span class="pre">state_dict</span></code> and optimizer <code class="docutils literal notranslate"><span class="pre">state_dict</span></code>.
|
||||
To resume training from some checkpoint, say <code class="docutils literal notranslate"><span class="pre">checkpoint-436000</span></code>, you can use:</p>
|
||||
<blockquote>
|
||||
<div><div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$ ./pruned_transducer_stateless7_streaming/train.py --start-batch <span class="m">436000</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
</div></blockquote>
|
||||
</li>
|
||||
<li><p><code class="docutils literal notranslate"><span class="pre">tensorboard/</span></code></p>
|
||||
<p>This folder contains tensorBoard logs. Training loss, validation loss, learning
|
||||
rate, etc, are recorded in these logs. You can visualize them by:</p>
|
||||
<blockquote>
|
||||
<div><div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$ <span class="nb">cd</span> pruned_transducer_stateless7_streaming/exp/tensorboard
|
||||
$ tensorboard dev upload --logdir . --description <span class="s2">"pruned transducer training for LibriSpeech with icefall"</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
</div></blockquote>
|
||||
</li>
|
||||
</ul>
|
||||
<div class="admonition hint">
|
||||
<p class="admonition-title">Hint</p>
|
||||
<p>If you don’t have access to google, you can use the following command
|
||||
to view the tensorboard log locally:</p>
|
||||
<blockquote>
|
||||
<div><div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nb">cd</span> pruned_transducer_stateless7_streaming/exp/tensorboard
|
||||
tensorboard --logdir . --port <span class="m">6008</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
</div></blockquote>
|
||||
<p>It will print the following message:</p>
<blockquote>
<div><div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">Serving</span> <span class="n">TensorBoard</span> <span class="n">on</span> <span class="n">localhost</span><span class="p">;</span> <span class="n">to</span> <span class="n">expose</span> <span class="n">to</span> <span class="n">the</span> <span class="n">network</span><span class="p">,</span> <span class="n">use</span> <span class="n">a</span> <span class="n">proxy</span> <span class="ow">or</span> <span class="k">pass</span> <span class="o">--</span><span class="n">bind_all</span>
<span class="n">TensorBoard</span> <span class="mf">2.8.0</span> <span class="n">at</span> <span class="n">http</span><span class="p">:</span><span class="o">//</span><span class="n">localhost</span><span class="p">:</span><span class="mi">6008</span><span class="o">/</span> <span class="p">(</span><span class="n">Press</span> <span class="n">CTRL</span><span class="o">+</span><span class="n">C</span> <span class="n">to</span> <span class="n">quit</span><span class="p">)</span>
</pre></div>
</div>
</div></blockquote>
<p>Now start your browser and go to <a class="reference external" href="http://localhost:6008">http://localhost:6008</a> to view the TensorBoard
logs.</p>
</div>
<ul>
<li><p><code class="docutils literal notranslate"><span class="pre">log/log-train-xxxx</span></code></p>
<p>This is the detailed training log in text format, the same as the one
printed to the console during training. You can inspect it as shown below.</p>
</li>
</ul>
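<p>A quick way to check the most recent entries (a sketch; the exact log file
names depend on when training started):</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span># print the last 20 lines of each training log in the experiment directory
tail -n 20 pruned_transducer_stateless7_streaming/exp/log/log-train-*
</pre></div>
</div>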
</div></blockquote>
</section>
<section id="usage-example">
<h3>Usage example<a class="headerlink" href="#usage-example" title="Permalink to this heading"></a></h3>
<p>You can use the following command to start the training using 4 GPUs:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nb">export</span> <span class="nv">CUDA_VISIBLE_DEVICES</span><span class="o">=</span><span class="s2">"0,1,2,3"</span>
./pruned_transducer_stateless7_streaming/train.py <span class="se">\</span>
--world-size <span class="m">4</span> <span class="se">\</span>
--num-epochs <span class="m">30</span> <span class="se">\</span>
--start-epoch <span class="m">1</span> <span class="se">\</span>
--use-fp16 <span class="m">1</span> <span class="se">\</span>
--exp-dir pruned_transducer_stateless7_streaming/exp <span class="se">\</span>
--full-libri <span class="m">1</span> <span class="se">\</span>
--max-duration <span class="m">550</span>
</pre></div>
</div>
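<p>If training is interrupted, you can resume from a saved epoch checkpoint by
re-running the same command with <code class="docutils literal notranslate"><span class="pre">--start-epoch</span></code>. The sketch below assumes
<code class="docutils literal notranslate"><span class="pre">epoch-10.pt</span></code> exists in the experiment directory; all other options should match the original run:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nb">export</span> CUDA_VISIBLE_DEVICES="0,1,2,3"
# resume from epoch-10.pt (saved at the end of epoch 10) and continue to epoch 30
./pruned_transducer_stateless7_streaming/train.py \
--world-size 4 \
--num-epochs 30 \
--start-epoch 11 \
--use-fp16 1 \
--exp-dir pruned_transducer_stateless7_streaming/exp \
--full-libri 1 \
--max-duration 550
</pre></div>
</div>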
</section>
</section>
<section id="decoding">
<h2>Decoding<a class="headerlink" href="#decoding" title="Permalink to this heading"></a></h2>
<p>The decoding part uses checkpoints saved by the training part, so you have
to run the training part first.</p>
<div class="admonition hint">
<p class="admonition-title">Hint</p>
<p>There are two kinds of checkpoints:</p>
<blockquote>
<div><ul class="simple">
<li><p>(1) <code class="docutils literal notranslate"><span class="pre">epoch-1.pt</span></code>, <code class="docutils literal notranslate"><span class="pre">epoch-2.pt</span></code>, …, which are saved at the end
of each epoch. You can pass <code class="docutils literal notranslate"><span class="pre">--epoch</span></code> to
<code class="docutils literal notranslate"><span class="pre">pruned_transducer_stateless7_streaming/decode.py</span></code> to use them.</p></li>
<li><p>(2) <code class="docutils literal notranslate"><span class="pre">checkpoint-436000.pt</span></code>, <code class="docutils literal notranslate"><span class="pre">checkpoint-438000.pt</span></code>, …, which are saved
every <code class="docutils literal notranslate"><span class="pre">--save-every-n</span></code> batches. You can pass <code class="docutils literal notranslate"><span class="pre">--iter</span></code> to
<code class="docutils literal notranslate"><span class="pre">pruned_transducer_stateless7_streaming/decode.py</span></code> to use them.</p></li>
</ul>
<p>We suggest that you try both types of checkpoints and choose the one
that produces the lowest WERs; a minimal contrast of the two invocations
is shown right after this hint.</p>
</div></blockquote>
</div>
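<p>The two checkpoint types differ only in how the checkpoint is selected. The
sketch below uses placeholder values for <code class="docutils literal notranslate"><span class="pre">--epoch</span></code>, <code class="docutils literal notranslate"><span class="pre">--iter</span></code>, and <code class="docutils literal notranslate"><span class="pre">--avg</span></code>;
full examples follow later in this section:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span># select an end-of-epoch checkpoint; --avg controls model averaging
./pruned_transducer_stateless7_streaming/decode.py --epoch 30 --avg 9 \
--exp-dir pruned_transducer_stateless7_streaming/exp

# select a batch-level checkpoint by its iteration number
./pruned_transducer_stateless7_streaming/decode.py --iter 436000 --avg 9 \
--exp-dir pruned_transducer_stateless7_streaming/exp
</pre></div>
</div>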
<div class="admonition tip">
<p class="admonition-title">Tip</p>
<p>To decode a streaming model, you can use either <code class="docutils literal notranslate"><span class="pre">simulate</span> <span class="pre">streaming</span> <span class="pre">decoding</span></code> in <code class="docutils literal notranslate"><span class="pre">decode.py</span></code> or
<code class="docutils literal notranslate"><span class="pre">real</span> <span class="pre">chunk-wise</span> <span class="pre">streaming</span> <span class="pre">decoding</span></code> in <code class="docutils literal notranslate"><span class="pre">streaming_decode.py</span></code>. The difference between the two is that
<code class="docutils literal notranslate"><span class="pre">decode.py</span></code> processes all acoustic frames at once with masking (i.e., the same as in training),
while <code class="docutils literal notranslate"><span class="pre">streaming_decode.py</span></code> processes the acoustic frames chunk by chunk.</p>
</div>
<div class="admonition note">
|
||||
<p class="admonition-title">Note</p>
|
||||
<p><code class="docutils literal notranslate"><span class="pre">simulate</span> <span class="pre">streaming</span> <span class="pre">decoding</span></code> in <code class="docutils literal notranslate"><span class="pre">decode.py</span></code> and <code class="docutils literal notranslate"><span class="pre">real</span> <span class="pre">chunk-size</span> <span class="pre">streaming</span> <span class="pre">decoding</span></code> in <code class="docutils literal notranslate"><span class="pre">streaming_decode.py</span></code> should
|
||||
produce almost the same results given the same <code class="docutils literal notranslate"><span class="pre">--decode-chunk-len</span></code>.</p>
|
||||
</div>
|
||||
<section id="simulate-streaming-decoding">
|
||||
<h3>Simulate streaming decoding<a class="headerlink" href="#simulate-streaming-decoding" title="Permalink to this heading"></a></h3>
|
||||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$ <span class="nb">cd</span> egs/librispeech/ASR
|
||||
$ ./pruned_transducer_stateless7_streaming/decode.py --help
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>shows the options for decoding.
|
||||
The following options are important for streaming models:</p>
|
||||
<blockquote>
<div><p><code class="docutils literal notranslate"><span class="pre">--decode-chunk-len</span></code></p>
<blockquote>
<div><p>It is the same as in <code class="docutils literal notranslate"><span class="pre">train.py</span></code>, which specifies the chunk size for decoding (in frames before subsampling).
The default value is 32; since each frame corresponds to 10 ms of audio, this is 320 ms.</p>
</div></blockquote>
</div></blockquote>
<p>The following shows two examples (for the two types of checkpoints):</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="k">for</span> m <span class="k">in</span> greedy_search fast_beam_search modified_beam_search<span class="p">;</span> <span class="k">do</span>
<span class="k">for</span> epoch <span class="k">in</span> <span class="m">30</span><span class="p">;</span> <span class="k">do</span>
<span class="k">for</span> avg <span class="k">in</span> <span class="m">12</span> <span class="m">11</span> <span class="m">10</span> <span class="m">9</span> <span class="m">8</span><span class="p">;</span> <span class="k">do</span>
./pruned_transducer_stateless7_streaming/decode.py <span class="se">\</span>
--epoch <span class="nv">$epoch</span> <span class="se">\</span>
--avg <span class="nv">$avg</span> <span class="se">\</span>
--decode-chunk-len <span class="m">32</span> <span class="se">\</span>
--exp-dir pruned_transducer_stateless7_streaming/exp <span class="se">\</span>
--max-duration <span class="m">600</span> <span class="se">\</span>
--decoding-method <span class="nv">$m</span>
<span class="k">done</span>
<span class="k">done</span>
<span class="k">done</span>
</pre></div>
</div>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="k">for</span> m <span class="k">in</span> greedy_search fast_beam_search modified_beam_search<span class="p">;</span> <span class="k">do</span>
|
||||
<span class="k">for</span> iter <span class="k">in</span> <span class="m">474000</span><span class="p">;</span> <span class="k">do</span>
|
||||
<span class="k">for</span> avg <span class="k">in</span> <span class="m">8</span> <span class="m">10</span> <span class="m">12</span> <span class="m">14</span> <span class="m">16</span> <span class="m">18</span><span class="p">;</span> <span class="k">do</span>
|
||||
./pruned_transducer_stateless7_streaming/decode.py <span class="se">\</span>
|
||||
--iter <span class="nv">$iter</span> <span class="se">\</span>
|
||||
--avg <span class="nv">$avg</span> <span class="se">\</span>
|
||||
--decode-chunk-len <span class="m">32</span> <span class="se">\</span>
|
||||
--exp-dir pruned_transducer_stateless7_streaming/exp <span class="se">\</span>
|
||||
--max-duration <span class="m">600</span> <span class="se">\</span>
|
||||
--decoding-method <span class="nv">$m</span>
|
||||
<span class="k">done</span>
|
||||
<span class="k">done</span>
|
||||
<span class="k">done</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
</section>
|
||||
<section id="real-streaming-decoding">
|
||||
<h3>Real streaming decoding<a class="headerlink" href="#real-streaming-decoding" title="Permalink to this heading"></a></h3>
|
||||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$ <span class="nb">cd</span> egs/librispeech/ASR
|
||||
$ ./pruned_transducer_stateless7_streaming/streaming_decode.py --help
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>shows the options for decoding.
|
||||
The following options are important for streaming models:</p>
|
||||
<blockquote>
|
||||
<div><p><code class="docutils literal notranslate"><span class="pre">--decode-chunk-len</span></code></p>
|
||||
<blockquote>
|
||||
<div><p>It is same as in <code class="docutils literal notranslate"><span class="pre">train.py</span></code>, which specifies the chunk size for decoding (in frames before subsampling).
|
||||
The default value is 32 (i.e., 320ms).
|
||||
For <code class="docutils literal notranslate"><span class="pre">real</span> <span class="pre">streaming</span> <span class="pre">decoding</span></code>, we will process <code class="docutils literal notranslate"><span class="pre">decode-chunk-len</span></code> acoustic frames at each time.</p>
|
||||
</div></blockquote>
|
||||
<p><code class="docutils literal notranslate"><span class="pre">--num-decode-streams</span></code></p>
|
||||
<blockquote>
|
||||
<div><p>The number of decoding streams that can be run in parallel (very similar to the <code class="docutils literal notranslate"><span class="pre">bath</span> <span class="pre">size</span></code>).
|
||||
For <code class="docutils literal notranslate"><span class="pre">real</span> <span class="pre">streaming</span> <span class="pre">decoding</span></code>, the batches will be packed dynamically, for example, if the
|
||||
<code class="docutils literal notranslate"><span class="pre">num-decode-streams</span></code> equals to 10, then, sequence 1 to 10 will be decoded at first, after a while,
|
||||
suppose sequence 1 and 2 are done, so, sequence 3 to 12 will be processed parallelly in a batch.</p>
|
||||
</div></blockquote>
|
||||
</div></blockquote>
|
||||
<p>The following shows two examples (for the two types of checkpoints):</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="k">for</span> m <span class="k">in</span> greedy_search fast_beam_search modified_beam_search<span class="p">;</span> <span class="k">do</span>
<span class="k">for</span> epoch <span class="k">in</span> <span class="m">30</span><span class="p">;</span> <span class="k">do</span>
<span class="k">for</span> avg <span class="k">in</span> <span class="m">12</span> <span class="m">11</span> <span class="m">10</span> <span class="m">9</span> <span class="m">8</span><span class="p">;</span> <span class="k">do</span>
./pruned_transducer_stateless7_streaming/streaming_decode.py <span class="se">\</span>
--epoch <span class="nv">$epoch</span> <span class="se">\</span>
--avg <span class="nv">$avg</span> <span class="se">\</span>
--decode-chunk-len <span class="m">32</span> <span class="se">\</span>
--num-decode-streams <span class="m">100</span> <span class="se">\</span>
--exp-dir pruned_transducer_stateless7_streaming/exp <span class="se">\</span>
--decoding-method <span class="nv">$m</span>
<span class="k">done</span>
<span class="k">done</span>
<span class="k">done</span>
</pre></div>
</div>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="k">for</span> m <span class="k">in</span> greedy_search fast_beam_search modified_beam_search<span class="p">;</span> <span class="k">do</span>
|
||||
<span class="k">for</span> iter <span class="k">in</span> <span class="m">474000</span><span class="p">;</span> <span class="k">do</span>
|
||||
<span class="k">for</span> avg <span class="k">in</span> <span class="m">8</span> <span class="m">10</span> <span class="m">12</span> <span class="m">14</span> <span class="m">16</span> <span class="m">18</span><span class="p">;</span> <span class="k">do</span>
|
||||
./pruned_transducer_stateless7_streaming/decode.py <span class="se">\</span>
|
||||
--iter <span class="nv">$iter</span> <span class="se">\</span>
|
||||
--avg <span class="nv">$avg</span> <span class="se">\</span>
|
||||
--decode-chunk-len <span class="m">16</span> <span class="se">\</span>
|
||||
--num-decode-streams <span class="m">100</span> <span class="se">\</span>
|
||||
--exp-dir pruned_transducer_stateless7_streaming/exp <span class="se">\</span>
|
||||
--decoding-method <span class="nv">$m</span>
|
||||
<span class="k">done</span>
|
||||
<span class="k">done</span>
|
||||
<span class="k">done</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
<div class="admonition tip">
|
||||
<p class="admonition-title">Tip</p>
|
||||
<p>Supporting decoding methods are as follows:</p>
|
||||
<blockquote>
|
||||
<div><ul class="simple">
|
||||
<li><p><code class="docutils literal notranslate"><span class="pre">greedy_search</span></code> : It takes the symbol with the largest posterior probability
at each frame as the decoding result.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">beam_search</span></code> : It implements Algorithm 1 in <a class="reference external" href="https://arxiv.org/pdf/1211.3711.pdf">https://arxiv.org/pdf/1211.3711.pdf</a>, with
<a class="reference external" href="https://github.com/espnet/espnet/blob/master/espnet/nets/beam_search_transducer.py#L247">espnet/nets/beam_search_transducer.py</a>
used as a reference. Basically, it keeps the top-k states for each frame, and expands each kept state with its own context to the
next frame.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">modified_beam_search</span></code> : It implements the same algorithm as <code class="docutils literal notranslate"><span class="pre">beam_search</span></code> above, but it
runs in batch mode with <code class="docutils literal notranslate"><span class="pre">--max-sym-per-frame=1</span></code> being hardcoded.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">fast_beam_search</span></code> : It implements graph composition between the output <code class="docutils literal notranslate"><span class="pre">log_probs</span></code> and
given <code class="docutils literal notranslate"><span class="pre">FSAs</span></code>. The details are hard to describe in a few lines of text; you can read
our paper at <a class="reference external" href="https://arxiv.org/pdf/2211.00484.pdf">https://arxiv.org/pdf/2211.00484.pdf</a> or our <a class="reference external" href="https://github.com/k2-fsa/k2/blob/master/k2/csrc/rnnt_decode.h">RNN-T decoding code in k2</a>. <code class="docutils literal notranslate"><span class="pre">fast_beam_search</span></code> can decode with <code class="docutils literal notranslate"><span class="pre">FSAs</span></code> on GPU efficiently.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">fast_beam_search_LG</span></code> : The same as <code class="docutils literal notranslate"><span class="pre">fast_beam_search</span></code> above, except that <code class="docutils literal notranslate"><span class="pre">fast_beam_search</span></code> uses
a trivial graph that has only one state, while <code class="docutils literal notranslate"><span class="pre">fast_beam_search_LG</span></code> uses an LG graph
(with an n-gram LM).</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">fast_beam_search_nbest</span></code> : It produces the decoding results as follows:</p>
<ol class="arabic simple">
<li><p>Use <code class="docutils literal notranslate"><span class="pre">fast_beam_search</span></code> to get a lattice</p></li>
<li><p>Select <code class="docutils literal notranslate"><span class="pre">num_paths</span></code> paths from the lattice using <code class="docutils literal notranslate"><span class="pre">k2.random_paths()</span></code></p></li>
<li><p>Remove duplicates from the selected paths</p></li>
<li><p>Intersect the selected paths with the lattice and compute the
shortest path from the intersection result</p></li>
<li><p>The path with the largest score is used as the decoding output.</p></li>
</ol>
</li>
<li><p><code class="docutils literal notranslate"><span class="pre">fast_beam_search_nbest_LG</span></code> : It implements the same logic as <code class="docutils literal notranslate"><span class="pre">fast_beam_search_nbest</span></code>; the
only difference is that it uses <code class="docutils literal notranslate"><span class="pre">fast_beam_search_LG</span></code> to generate the lattice.</p></li>
</ul>
</div></blockquote>
</div>
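<p>As a concrete illustration, the sketch below runs <code class="docutils literal notranslate"><span class="pre">fast_beam_search</span></code> with the
FSA-related options commonly exposed by icefall decoding scripts
(<code class="docutils literal notranslate"><span class="pre">--beam</span></code>, <code class="docutils literal notranslate"><span class="pre">--max-contexts</span></code>, <code class="docutils literal notranslate"><span class="pre">--max-states</span></code>). The values are illustrative; check
<code class="docutils literal notranslate"><span class="pre">./pruned_transducer_stateless7_streaming/decode.py</span> <span class="pre">--help</span></code> for the exact set in your version:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span># fast_beam_search with illustrative FSA search options
./pruned_transducer_stateless7_streaming/decode.py \
--epoch 30 \
--avg 9 \
--decode-chunk-len 32 \
--exp-dir pruned_transducer_stateless7_streaming/exp \
--decoding-method fast_beam_search \
--beam 4 \
--max-contexts 4 \
--max-states 8
</pre></div>
</div>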
<div class="admonition note">
<p class="admonition-title">Note</p>
<p>The decoding methods supported in <code class="docutils literal notranslate"><span class="pre">streaming_decode.py</span></code> may be fewer than those in <code class="docutils literal notranslate"><span class="pre">decode.py</span></code>; if needed,
you can implement them yourself or file an issue in <a class="reference external" href="https://github.com/k2-fsa/icefall/issues">icefall</a>.</p>
</div>
</section>
</section>
<section id="export-model">
<h2>Export Model<a class="headerlink" href="#export-model" title="Permalink to this heading"></a></h2>
<p>Currently it supports exporting checkpoints from <code class="docutils literal notranslate"><span class="pre">pruned_transducer_stateless7_streaming/exp</span></code> in the following ways.</p>
<section id="export-model-state-dict">
<h3>Export <code class="docutils literal notranslate"><span class="pre">model.state_dict()</span></code><a class="headerlink" href="#export-model-state-dict" title="Permalink to this heading"></a></h3>
<p>Checkpoints saved by <code class="docutils literal notranslate"><span class="pre">pruned_transducer_stateless7_streaming/train.py</span></code> also include
<code class="docutils literal notranslate"><span class="pre">optimizer.state_dict()</span></code>. It is useful for resuming training. But after training,
we are interested only in <code class="docutils literal notranslate"><span class="pre">model.state_dict()</span></code>. You can use the following
command to extract <code class="docutils literal notranslate"><span class="pre">model.state_dict()</span></code>.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="c1"># Assume that --epoch 30 --avg 9 produces the smallest WER</span>
|
||||
<span class="c1"># (You can get such information after running ./pruned_transducer_stateless7_streaming/decode.py)</span>
|
||||
|
||||
<span class="nv">epoch</span><span class="o">=</span><span class="m">30</span>
|
||||
<span class="nv">avg</span><span class="o">=</span><span class="m">9</span>
|
||||
|
||||
./pruned_transducer_stateless7_streaming/export.py <span class="se">\</span>
|
||||
--exp-dir ./pruned_transducer_stateless7_streaming/exp <span class="se">\</span>
|
||||
--bpe-model data/lang_bpe_500/bpe.model <span class="se">\</span>
|
||||
--epoch <span class="nv">$epoch</span> <span class="se">\</span>
|
||||
--avg <span class="nv">$avg</span> <span class="se">\</span>
|
||||
--use-averaged-model<span class="o">=</span>True <span class="se">\</span>
|
||||
--decode-chunk-len <span class="m">32</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
<p>It will generate a file <code class="docutils literal notranslate"><span class="pre">./pruned_transducer_stateless7_streaming/exp/pretrained.pt</span></code>.</p>
<div class="admonition hint">
<p class="admonition-title">Hint</p>
<p>To use the generated <code class="docutils literal notranslate"><span class="pre">pretrained.pt</span></code> for <code class="docutils literal notranslate"><span class="pre">pruned_transducer_stateless7_streaming/decode.py</span></code>,
you can run:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nb">cd</span> pruned_transducer_stateless7_streaming/exp
ln -s pretrained.pt epoch-999.pt
</pre></div>
</div>
<p>And then pass <code class="docutils literal notranslate"><span class="pre">--epoch</span> <span class="pre">999</span> <span class="pre">--avg</span> <span class="pre">1</span> <span class="pre">--use-averaged-model</span> <span class="pre">0</span></code> to
<code class="docutils literal notranslate"><span class="pre">./pruned_transducer_stateless7_streaming/decode.py</span></code>.</p>
</div>
<p>To use the exported model with <code class="docutils literal notranslate"><span class="pre">./pruned_transducer_stateless7_streaming/pretrained.py</span></code>, you
can run:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>./pruned_transducer_stateless7_streaming/pretrained.py <span class="se">\</span>
|
||||
--checkpoint ./pruned_transducer_stateless7_streaming/exp/pretrained.pt <span class="se">\</span>
|
||||
--bpe-model ./data/lang_bpe_500/bpe.model <span class="se">\</span>
|
||||
--method greedy_search <span class="se">\</span>
|
||||
--decode-chunk-len <span class="m">32</span> <span class="se">\</span>
|
||||
/path/to/foo.wav <span class="se">\</span>
|
||||
/path/to/bar.wav
|
||||
</pre></div>
|
||||
</div>
|
||||
</section>
|
||||
<section id="export-model-using-torch-jit-script">
|
||||
<h3>Export model using <code class="docutils literal notranslate"><span class="pre">torch.jit.script()</span></code><a class="headerlink" href="#export-model-using-torch-jit-script" title="Permalink to this heading"></a></h3>
|
||||
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>./pruned_transducer_stateless7_streaming/export.py <span class="se">\</span>
|
||||
--exp-dir ./pruned_transducer_stateless7_streaming/exp <span class="se">\</span>
|
||||
--bpe-model data/lang_bpe_500/bpe.model <span class="se">\</span>
|
||||
--epoch <span class="m">30</span> <span class="se">\</span>
|
||||
--avg <span class="m">9</span> <span class="se">\</span>
|
||||
--decode-chunk-len <span class="m">32</span> <span class="se">\</span>
|
||||
--jit <span class="m">1</span>
|
||||
</pre></div>
|
||||
</div>
|
||||
<div class="admonition caution">
|
||||
<p class="admonition-title">Caution</p>
|
||||
<p><code class="docutils literal notranslate"><span class="pre">--decode-chunk-len</span></code> is required to export a ScriptModule.</p>
|
||||
</div>
|
||||
<p>It will generate a file <code class="docutils literal notranslate"><span class="pre">cpu_jit.pt</span></code> in the given <code class="docutils literal notranslate"><span class="pre">exp_dir</span></code>. You can later
|
||||
load it by <code class="docutils literal notranslate"><span class="pre">torch.jit.load("cpu_jit.pt")</span></code>.</p>
|
||||
<p>Note <code class="docutils literal notranslate"><span class="pre">cpu</span></code> in the name <code class="docutils literal notranslate"><span class="pre">cpu_jit.pt</span></code> means the parameters when loaded into Python
|
||||
are on CPU. You can use <code class="docutils literal notranslate"><span class="pre">to("cuda")</span></code> to move them to a CUDA device.</p>
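<p>For example, a minimal sketch of loading the exported ScriptModule and moving it
to a GPU (the path is assumed to match the export command above):</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span># load the exported ScriptModule; parameters start on CPU, then move them to CUDA
python3 -c "
import torch
model = torch.jit.load('pruned_transducer_stateless7_streaming/exp/cpu_jit.pt')
model = model.to('cuda')
print(next(model.parameters()).device)
"
</pre></div>
</div>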
</section>
<section id="export-model-using-torch-jit-trace">
<h3>Export model using <code class="docutils literal notranslate"><span class="pre">torch.jit.trace()</span></code><a class="headerlink" href="#export-model-using-torch-jit-trace" title="Permalink to this heading"></a></h3>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span><span class="nv">epoch</span><span class="o">=</span><span class="m">30</span>
<span class="nv">avg</span><span class="o">=</span><span class="m">9</span>

./pruned_transducer_stateless7_streaming/jit_trace_export.py <span class="se">\</span>
--bpe-model data/lang_bpe_500/bpe.model <span class="se">\</span>
--use-averaged-model<span class="o">=</span>True <span class="se">\</span>
--decode-chunk-len <span class="m">32</span> <span class="se">\</span>
--exp-dir ./pruned_transducer_stateless7_streaming/exp <span class="se">\</span>
--epoch <span class="nv">$epoch</span> <span class="se">\</span>
--avg <span class="nv">$avg</span>
</pre></div>
</div>
<div class="admonition caution">
|
||||
<p class="admonition-title">Caution</p>
|
||||
<p><code class="docutils literal notranslate"><span class="pre">--decode-chunk-len</span></code> is required to export a ScriptModule.</p>
|
||||
</div>
|
||||
<p>It will generate 3 files:</p>
<blockquote>
<div><ul class="simple">
<li><p><code class="docutils literal notranslate"><span class="pre">./pruned_transducer_stateless7_streaming/exp/encoder_jit_trace.pt</span></code></p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">./pruned_transducer_stateless7_streaming/exp/decoder_jit_trace.pt</span></code></p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">./pruned_transducer_stateless7_streaming/exp/joiner_jit_trace.pt</span></code></p></li>
</ul>
</div></blockquote>
<p>To use the generated files with <code class="docutils literal notranslate"><span class="pre">./pruned_transducer_stateless7_streaming/jit_trace_pretrained.py</span></code>:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>./pruned_transducer_stateless7_streaming/jit_trace_pretrained.py <span class="se">\</span>
--encoder-model-filename ./pruned_transducer_stateless7_streaming/exp/encoder_jit_trace.pt <span class="se">\</span>
--decoder-model-filename ./pruned_transducer_stateless7_streaming/exp/decoder_jit_trace.pt <span class="se">\</span>
--joiner-model-filename ./pruned_transducer_stateless7_streaming/exp/joiner_jit_trace.pt <span class="se">\</span>
--bpe-model ./data/lang_bpe_500/bpe.model <span class="se">\</span>
--decode-chunk-len <span class="m">32</span> <span class="se">\</span>
/path/to/foo.wav
</pre></div>
</div>
</section>
</section>
<section id="download-pretrained-models">
<h2>Download pretrained models<a class="headerlink" href="#download-pretrained-models" title="Permalink to this heading"></a></h2>
<p>If you don’t want to train from scratch, you can download the pretrained models
by visiting the following links:</p>
<blockquote>
<div><ul class="simple">
<li><p><a class="reference external" href="https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29">pruned_transducer_stateless7_streaming</a></p></li>
</ul>
<p>See <a class="reference external" href="https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md">https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md</a>
for the details of the above pretrained models.</p>
</div></blockquote>
</section>
<section id="deploy-with-sherpa">
<h2>Deploy with Sherpa<a class="headerlink" href="#deploy-with-sherpa" title="Permalink to this heading"></a></h2>
<p>Please see <a class="reference external" href="https://k2-fsa.github.io/sherpa/python/streaming_asr/conformer/index.html#">https://k2-fsa.github.io/sherpa/python/streaming_asr/conformer/index.html#</a>
for how to deploy the models in <code class="docutils literal notranslate"><span class="pre">sherpa</span></code>.</p>
</section>
</section>