mirror of https://github.com/k2-fsa/icefall.git
synced 2025-12-11 06:55:27 +00:00

WIP: Add stateless transducer tutorial.

This commit is contained in:
parent 1ff6196c44
commit 334f8bb906
@@ -33,6 +33,7 @@ release = "0.1"
 # ones.
 extensions = [
     "sphinx_rtd_theme",
+    "sphinx.ext.todo",
 ]

 # Add any paths that contain templates here, relative to this directory.
@@ -74,3 +75,5 @@ html_context = {
     "github_version": "master",
     "conf_py_path": "/icefall/docs/source/",
 }
+
+todo_include_todos = True
@@ -1,10 +0,0 @@
Aishell
=======

We provide the following models for the Aishell dataset:

.. toctree::
   :maxdepth: 2

   aishell/conformer_ctc
   aishell/tdnn_lstm_ctc
22 docs/source/recipes/aishell/index.rst Normal file
@@ -0,0 +1,22 @@
aishell
=======

Aishell is an open-source Chinese Mandarin speech corpus published by Beijing
Shell Shell Technology Co., Ltd.

400 people from different accent areas in China were invited to participate
in the recording, which was conducted in a quiet indoor environment using a
high-fidelity microphone, and downsampled to 16 kHz. Through professional
speech annotation and strict quality inspection, the manual transcription
accuracy is above 95%. The data is free for academic use. We hope to provide
a moderate amount of data for new researchers in the field of speech
recognition.

It can be downloaded from `<https://www.openslr.org/33/>`_

.. toctree::
   :maxdepth: 1

   tdnn_lstm_ctc
   conformer_ctc
   stateless_transducer
221 docs/source/recipes/aishell/stateless_transducer.rst Normal file
@@ -0,0 +1,221 @@
Stateless Transducer
====================

This tutorial shows you how to do transducer training in ``icefall``.

.. HINT::

  We say ``transducer`` instead of RNN-T or RNN transducer on purpose:
  as you will see, there are no RNNs in the model.

The Model
---------

The transducer model consists of 3 parts:

- **Encoder**: a conformer encoder with the following parameters:

  - Number of heads: 8
  - Attention dim: 512
  - Number of layers: 12
  - Feedforward dim: 2048

- **Decoder**: a stateless network consisting of:

  - an embedding layer with embedding dim 512
  - a Conv1d layer with a default kernel size of 2

- **Joiner**: ``nn.Tanh()`` followed by ``nn.Linear()``.

.. Caution::

  The decoder is stateless and very simple. It is borrowed from
  `Rnn-Transducer with Stateless Prediction Network
  <https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9054419>`_,
  with one modification: we place a Conv1d layer right after the
  embedding layer.

When Chinese characters are used as modeling units, with a vocabulary size
of 4335 for this specific dataset, the model has ``87939824`` parameters,
i.e., about ``88 M``.
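
Below is a minimal sketch of the decoder and joiner described above. It is
for illustration only and is not the exact ``icefall`` implementation; the
class names, tensor shapes, and defaults here are our assumptions.

.. code-block:: python

  import torch
  import torch.nn as nn
  import torch.nn.functional as F


  class StatelessDecoder(nn.Module):
      # An embedding layer followed by a Conv1d that sees only the last
      # `context_size` output symbols, so there is no recurrent state.
      def __init__(self, vocab_size: int, embedding_dim: int = 512,
                   context_size: int = 2):
          super().__init__()
          self.embedding = nn.Embedding(vocab_size, embedding_dim)
          self.conv = nn.Conv1d(embedding_dim, embedding_dim,
                                kernel_size=context_size)
          self.context_size = context_size

      def forward(self, y: torch.Tensor) -> torch.Tensor:
          # y: (N, U) token IDs -> output: (N, U, embedding_dim)
          emb = self.embedding(y).permute(0, 2, 1)  # (N, C, U)
          # Left-pad so each position sees only previous symbols.
          emb = F.pad(emb, (self.context_size - 1, 0))
          return self.conv(emb).permute(0, 2, 1)


  class Joiner(nn.Module):
      # tanh on the combined encoder/decoder outputs, then a Linear
      # layer projecting to the vocabulary size.
      def __init__(self, input_dim: int, vocab_size: int):
          super().__init__()
          self.output_linear = nn.Linear(input_dim, vocab_size)

      def forward(self, encoder_out: torch.Tensor,
                  decoder_out: torch.Tensor) -> torch.Tensor:
          # encoder_out: (N, T, 1, C), decoder_out: (N, 1, U, C);
          # broadcasting yields logits of shape (N, T, U, vocab_size).
          return self.output_linear(torch.tanh(encoder_out + decoder_out))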
The Loss
--------

We use `optimized_transducer <https://github.com/csukuangfj/optimized_transducer>`_
to compute the transducer loss; it removes extra paddings in the loss
computation to save memory.

.. Hint::

  ``optimized_transducer`` implements the techniques proposed in
  `Improving RNN Transducer Modeling for End-to-End Speech Recognition
  <https://arxiv.org/abs/1909.12415>`_ to save memory.

  Furthermore, it supports ``modified transducer``, which limits the
  maximum number of symbols that can be emitted per frame to 1. This
  simplifies the decoding process significantly, and experimental results
  show that it does not degrade the performance.

  See `<https://github.com/csukuangfj/optimized_transducer#modified-transducer>`_
  for what exactly modified transducer is.

  `<https://github.com/csukuangfj/transducer-loss-benchmarking>`_ shows that,
  in the unpruned case, ``optimized_transducer`` has an advantage in
  minimizing memory usage.
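
The following sketch shows how such a loss call looks. The argument names
follow the ``optimized_transducer`` README as we understand it; treat the
shapes and flags as assumptions and consult that repository for the
authoritative interface.

.. code-block:: python

  import torch
  import optimized_transducer

  # Illustrative batch of 2 utterances. Per the project README, `logits`
  # is the *unpadded* concatenation of the per-utterance (T_i, U_i + 1)
  # grids -- this is where the memory saving comes from.
  logit_lengths = torch.tensor([3, 2], dtype=torch.int32)   # frames T_i
  target_lengths = torch.tensor([2, 1], dtype=torch.int32)  # symbols U_i
  targets = torch.tensor([[1, 2], [3, 0]], dtype=torch.int32)

  vocab_size = 5
  num_entries = int((logit_lengths * (target_lengths + 1)).sum())
  logits = torch.randn(num_entries, vocab_size, requires_grad=True)

  loss = optimized_transducer.transducer_loss(
      logits=logits,
      targets=targets,
      logit_lengths=logit_lengths,
      target_lengths=target_lengths,
      blank=0,
      reduction="mean",
      one_sym_per_frame=False,  # True selects the modified transducer
  )
  loss.backward()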
.. todo::

  Add a tutorial about ``pruned_transducer_stateless``, which uses the
  pruned transducer loss from k2.

.. hint::

  You can use::

    pip install optimized_transducer

  to install ``optimized_transducer``. Refer to
  `<https://github.com/csukuangfj/optimized_transducer>`_ for other
  alternatives.
Data Preparation
----------------

To prepare the data for training, please use the following commands:

.. code-block:: bash

  cd egs/aishell/ASR
  ./prepare.sh --stop-stage 4
  ./prepare.sh --stage 6 --stop-stage 6

.. note::

  You can also simply run ``./prepare.sh``, though it will generate FSTs
  that are not used in transducer training.

When you finish running the script, you will get the following two folders:

- ``data/fbank``: It saves the pre-computed features.
- ``data/lang_char``: It contains tokens that will be used in the training.
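
You can quickly verify that both folders exist; the exact files inside
them depend on the script version, so the listing below is only
illustrative:

.. code-block:: bash

  cd egs/aishell/ASR
  ls data/fbank data/lang_char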
Training
--------

.. code-block:: bash

  cd egs/aishell/ASR
  ./transducer_stateless_modified/train.py --help

shows you the training options that can be passed from the command line.
The following options are used quite often:

- ``--exp-dir``

  The experiment folder to save logs and model checkpoints. It defaults
  to ``./transducer_stateless_modified/exp``.

- ``--num-epochs``

  The number of epochs to train. For instance,
  ``./transducer_stateless_modified/train.py --num-epochs 30`` trains for
  30 epochs and generates ``epoch-0.pt``, ``epoch-1.pt``, ...,
  ``epoch-29.pt`` in the folder set by ``--exp-dir``.

- ``--start-epoch``

  It is used to resume training.
  ``./transducer_stateless_modified/train.py --start-epoch 10`` loads the
  checkpoint from ``exp_dir/epoch-9.pt`` and starts training from epoch 10,
  based on the state from epoch 9.
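
  For example, to resume from epoch 10 and keep training until epoch 30
  (an illustrative combination of the two options above):

  .. code-block:: bash

    ./transducer_stateless_modified/train.py --start-epoch 10 --num-epochs 30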
- ``--world-size``

  It is used for multi-GPU single-machine DDP training.

  - (a) If it is 1, then no DDP training is used.
  - (b) If it is 2, then GPU 0 and GPU 1 are used for DDP training.

  The following shows some use cases with it.
  **Use case 1**: You have 4 GPUs, but you only want to use GPU 0 and
  GPU 2 for training. You can do the following:

  .. code-block:: bash

    $ cd egs/aishell/ASR
    $ export CUDA_VISIBLE_DEVICES="0,2"
    $ ./transducer_stateless_modified/train.py --world-size 2

  **Use case 2**: You have 4 GPUs and you want to use all of them
  for training. You can do the following:

  .. code-block:: bash

    $ cd egs/aishell/ASR
    $ ./transducer_stateless_modified/train.py --world-size 4

  **Use case 3**: You have 4 GPUs but you only want to use GPU 3
  for training. You can do the following:

  .. code-block:: bash

    $ cd egs/aishell/ASR
    $ export CUDA_VISIBLE_DEVICES="3"
    $ ./transducer_stateless_modified/train.py --world-size 1

  .. CAUTION::

    Only multi-GPU single-machine DDP training is implemented at present.
    There is an ongoing PR `<https://github.com/k2-fsa/icefall/pull/63>`_
    that adds support for multi-GPU multi-machine DDP training.
- ``--max-duration``

  It specifies the total number of seconds over all utterances in a
  batch, before **padding**.
  If you encounter CUDA OOM, please reduce its value. For instance, if
  you are using a V100 NVIDIA GPU with 32 GB RAM, we recommend you set
  it to ``300``.

  .. HINT::

    Due to padding, the total number of seconds of all utterances in a
    batch will usually be larger than ``--max-duration``.

    A larger value for ``--max-duration`` may cause OOM during training,
    while a smaller value may increase the training time. You have to
    tune it.
- ``--lr-factor``

  It controls the learning rate. If you use a single GPU for training,
  you may want to use a small value for it. If you use multiple GPUs
  for training, you may increase it.

- ``--context-size``

  It specifies the kernel size of the Conv1d layer in the decoder. The
  default value 2 means the decoder functions like a tri-gram LM.
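
  In LM terms, the prediction for the symbol at position :math:`u` is
  then conditioned only on the two previous non-blank symbols:

  .. math::

    p(y_u \mid y_1, \dots, y_{u-1}) \approx p(y_u \mid y_{u-2}, y_{u-1})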
- ``--modified-transducer-prob``

  It specifies the probability of using the modified transducer loss.
  If it is 0, no modified transducer is used; if it is 1, the modified
  transducer loss is used for all batches. If it is ``p``, the modified
  transducer is applied with probability ``p``.
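
  One way to picture this option (an illustrative sketch, not the actual
  training-loop code in ``train.py``):

  .. code-block:: python

    import random

    # Decide per batch whether to use the modified transducer loss,
    # i.e., whether to emit at most one symbol per frame.
    modified_transducer_prob = 0.25  # illustrative value
    one_sym_per_frame = random.random() < modified_transducer_prob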
There are some training options, e.g., the number of warmup steps,
that are not passed from the command line.
They are pre-configured by the function ``get_params()`` in
`transducer_stateless_modified/train.py <https://github.com/k2-fsa/icefall/blob/master/egs/aishell/ASR/transducer_stateless_modified/train.py#L162>`_.

If you need to change them, please modify ``./transducer_stateless_modified/train.py`` directly.

.. CAUTION::

  The training set is speed-perturbed with two factors: 0.9 and 1.1, so
  each epoch actually processes ``3x150 == 450`` hours of data.
@@ -10,12 +10,10 @@ We may add recipes for other tasks as well in the future.
 .. Other recipes are listed in alphabetical order.

 .. toctree::
-   :maxdepth: 3
+   :maxdepth: 2
    :caption: Table of Contents

-   yesno
-   librispeech
-   aishell
-   timit
+   aishell/index
+   librispeech/index
+   timit/index
+   yesno/index
@@ -1,10 +0,0 @@
LibriSpeech
===========

We provide the following models for the LibriSpeech dataset:

.. toctree::
   :maxdepth: 2

   librispeech/tdnn_lstm_ctc
   librispeech/conformer_ctc
8 docs/source/recipes/librispeech/index.rst Normal file
@@ -0,0 +1,8 @@
LibriSpeech
===========

.. toctree::
   :maxdepth: 1

   tdnn_lstm_ctc
   conformer_ctc
@@ -1,10 +0,0 @@
TIMIT
===========

We provide the following models for the TIMIT dataset:

.. toctree::
   :maxdepth: 2

   timit/tdnn_lstm_ctc
   timit/tdnn_ligru_ctc
9 docs/source/recipes/timit/index.rst Normal file
@@ -0,0 +1,9 @@
TIMIT
=====

.. toctree::
   :maxdepth: 1

   tdnn_ligru_ctc
   tdnn_lstm_ctc
@@ -1,5 +1,5 @@
 TDNN-LiGRU-CTC
-=============
+==============

 This tutorial shows you how to run a TDNN-LiGRU-CTC model with the `TIMIT <https://data.deepai.org/timit.zip>`_ dataset.
(binary image file moved; 121 KiB before and after)
7 docs/source/recipes/yesno/index.rst Normal file
@@ -0,0 +1,7 @@
YesNo
=====

.. toctree::
   :maxdepth: 1

   tdnn
@@ -1,5 +1,5 @@
-yesno
-=====
+TDNN-CTC
+========

 This page shows you how to run the `yesno <https://www.openslr.org/1>`_ recipe. It contains:
@@ -145,7 +145,7 @@ In ``tdnn/exp``, you will find the following files:
 Note there is a URL in the above output, click it and you will see
 the following screenshot:

-.. figure:: images/yesno-tdnn-tensorboard-log.png
+.. figure:: images/tdnn-tensorboard-log.png
    :width: 600
    :alt: TensorBoard screenshot
    :align: center