mirror of https://github.com/k2-fsa/icefall.git
WIP: Add stateless transducer tutorial.
This commit is contained in: commit 334f8bb906 (parent 1ff6196c44)
docs/source/conf.py

@@ -33,6 +33,7 @@ release = "0.1"
 # ones.
 extensions = [
     "sphinx_rtd_theme",
+    "sphinx.ext.todo",
 ]
 
 # Add any paths that contain templates here, relative to this directory.
@@ -74,3 +75,5 @@ html_context = {
     "github_version": "master",
     "conf_py_path": "/icefall/docs/source/",
 }
+
+todo_include_todos = True
docs/source/recipes/aishell.rst (deleted, -10 lines)

Aishell
=======

We provide the following models for the Aishell dataset:

.. toctree::
   :maxdepth: 2

   aishell/conformer_ctc
   aishell/tdnn_lstm_ctc
docs/source/recipes/aishell/index.rst (new file, +22 lines)

aishell
=======

Aishell is an open-source Chinese Mandarin speech corpus published by Beijing
Shell Shell Technology Co., Ltd.

400 people from different accent areas in China were invited to participate in
the recording, which was conducted in a quiet indoor environment using
high-fidelity microphones; the audio is downsampled to 16 kHz. Through
professional speech annotation and strict quality inspection, the manual
transcription accuracy is above 95%. The data is free for academic use. We
hope to provide a moderate amount of data for new researchers in the field of
speech recognition.

It can be downloaded from `<https://www.openslr.org/33/>`_

.. toctree::
   :maxdepth: 1

   tdnn_lstm_ctc
   conformer_ctc
   stateless_transducer
docs/source/recipes/aishell/stateless_transducer.rst (new file, +221 lines)

Stateless Transducer
====================

This tutorial shows you how to do transducer training in ``icefall``.

.. HINT::

  We say transducer here, not RNN-T or RNN transducer, because,
  as you will see, there are no RNNs in the model.
The Model
---------

The transducer model consists of 3 parts:

- **Encoder**: a conformer encoder with the following parameters:

  - Number of heads: 8
  - Attention dim: 512
  - Number of layers: 12
  - Feedforward dim: 2048

- **Decoder**: a stateless model consisting of:

  - An embedding layer with embedding dim 512
  - A Conv1d layer with a default kernel size of 2

- **Joiner**: a ``nn.tanh()`` followed by a ``nn.Linear()``.

.. Caution::

  The decoder is stateless and very simple. It is borrowed from
  `<https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9054419>`_
  (Rnn-Transducer with Stateless Prediction Network).

  We make one modification to it: we place a Conv1d layer right after
  the embedding layer.

When Chinese characters are used as the modelling unit, with a vocabulary
size of 4335 for this specific dataset, the model has ``87939824``
parameters, i.e., about ``88 M``.
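To make the decoder and joiner concrete, here is a minimal PyTorch sketch
that follows the description above (embedding dim 512, Conv1d with kernel
size 2, ``tanh`` + ``Linear`` joiner). The class and argument names are
illustrative; they are not necessarily the ones used in icefall's actual
implementation.

.. code-block:: python

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class StatelessDecoder(nn.Module):
      """Embedding + Conv1d; no recurrence, hence no hidden state."""

      def __init__(self, vocab_size: int = 4335, embedding_dim: int = 512,
                   context_size: int = 2):
          super().__init__()
          self.embedding = nn.Embedding(vocab_size, embedding_dim)
          self.context_size = context_size
          self.conv = nn.Conv1d(embedding_dim, embedding_dim,
                                kernel_size=context_size)

      def forward(self, y: torch.Tensor) -> torch.Tensor:
          # y: (N, U) token IDs of the symbols decoded so far
          emb = self.embedding(y).permute(0, 2, 1)  # (N, embedding_dim, U)
          # Left-pad so position u sees only the last context_size symbols;
          # with the default kernel size 2 this gives a tri-gram-like context.
          emb = F.pad(emb, pad=(self.context_size - 1, 0))
          return self.conv(emb).permute(0, 2, 1)    # (N, U, embedding_dim)

  class Joiner(nn.Module):
      """A tanh followed by a Linear projection to the vocabulary."""

      def __init__(self, input_dim: int = 512, vocab_size: int = 4335):
          super().__init__()
          self.output_linear = nn.Linear(input_dim, vocab_size)

      def forward(self, enc: torch.Tensor, dec: torch.Tensor) -> torch.Tensor:
          # enc: (N, T, C), dec: (N, U, C) -> logits: (N, T, U, vocab_size)
          return self.output_linear(torch.tanh(enc.unsqueeze(2) + dec.unsqueeze(1)))

Because the decoder sees only a fixed, short history instead of carrying a
recurrent state, it behaves like an n-gram LM; this is what stateless means
here.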
The Loss
--------

We use `<https://github.com/csukuangfj/optimized_transducer>`_
to compute the transducer loss. It removes extra paddings
in the loss computation to save memory.

.. Hint::

  ``optimized_transducer`` implements the techniques proposed
  in `Improving RNN Transducer Modeling for End-to-End Speech Recognition <https://arxiv.org/abs/1909.12415>`_ to save memory.

  Furthermore, it supports ``modified transducer``, which limits the maximum
  number of symbols that can be emitted per frame to 1. This simplifies
  the decoding process significantly, and experimental results
  show that it does not degrade the performance.

  See `<https://github.com/csukuangfj/optimized_transducer#modified-transducer>`_
  for what exactly the modified transducer is.

  `<https://github.com/csukuangfj/transducer-loss-benchmarking>`_ shows that
  in the unpruned case ``optimized_transducer`` has an advantage in minimizing
  memory usage.
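The following sketch shows why emitting at most one symbol per frame
simplifies decoding: greedy search becomes a single frame-synchronous loop,
with no inner loop over symbols emitted at the same frame. It reuses the
illustrative ``StatelessDecoder`` and ``Joiner`` classes sketched earlier;
icefall's actual decoding code is organized differently.

.. code-block:: python

  import torch

  @torch.no_grad()
  def greedy_search_modified(encoder_out: torch.Tensor, decoder, joiner,
                             blank_id: int = 0, context_size: int = 2):
      """Greedy search for the modified transducer (<= 1 symbol per frame)."""
      hyp = [blank_id] * context_size          # dummy history primes the decoder
      for t in range(encoder_out.size(1)):
          context = torch.tensor([hyp[-context_size:]], dtype=torch.long)
          dec_out = decoder(context)[:, -1:]   # (1, 1, C), the newest position
          logits = joiner(encoder_out[:, t:t + 1], dec_out)  # (1, 1, 1, V)
          y = logits.argmax(dim=-1).item()
          if y != blank_id:
              hyp.append(y)                    # emit at most one symbol,
      return hyp[context_size:]                # then move to the next frame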
.. todo::

  Add a tutorial about ``pruned_transducer_stateless``, which uses the
  pruned transducer loss from k2.

.. hint::

  You can use::

    pip install optimized_transducer

  to install ``optimized_transducer``. Refer to
  `<https://github.com/csukuangfj/optimized_transducer>`_ for other
  installation alternatives.
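As a sketch of how the loss might be invoked, the following is adapted from
the usage described in the ``optimized_transducer`` README. Treat the
argument names and the exact logits layout as assumptions to be checked
against the README of the version you install.

.. code-block:: python

  import torch
  import optimized_transducer

  # Two utterances with different frame counts (T) and target lengths (U).
  logit_lengths = torch.tensor([50, 30], dtype=torch.int32)
  target_lengths = torch.tensor([10, 8], dtype=torch.int32)
  targets = torch.randint(low=1, high=4336, size=(2, 10), dtype=torch.int32)

  # The memory saving comes from the logits layout: instead of one padded
  # (N, max(T), max(U) + 1, V) tensor, each utterance contributes a
  # (T_i, U_i + 1, V) block, and the blocks are concatenated into 2-D.
  num_rows = sum(t * (u + 1) for t, u in zip(logit_lengths.tolist(),
                                             target_lengths.tolist()))
  logits = torch.randn(num_rows, 4336)

  loss = optimized_transducer.transducer_loss(
      logits=logits,
      targets=targets,
      logit_lengths=logit_lengths,
      target_lengths=target_lengths,
      blank=0,
      reduction="mean",
      one_sym_per_frame=False,  # True selects the modified transducer
  )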
Data Preparation
----------------

To prepare the data for training, please use the following commands:

.. code-block:: bash

  cd egs/aishell/ASR
  ./prepare.sh --stop-stage 4
  ./prepare.sh --stage 6 --stop-stage 6

.. note::

  You can also run ``./prepare.sh`` without arguments, though it will then
  generate FSTs that are not used in transducer training.

When the script finishes, you will get the following two folders:

- ``data/fbank``: It saves the pre-computed features.
- ``data/lang_char``: It contains tokens that will be used in the training.
Training
--------

.. code-block:: bash

  cd egs/aishell/ASR
  ./transducer_stateless_modified/train.py --help

shows you the training options that can be passed from the commandline.
The following options are used quite often:
- ``--exp-dir``

  The experiment folder in which to save logs and model checkpoints.
  It defaults to ``./transducer_stateless_modified/exp``.

- ``--num-epochs``

  The number of epochs to train. For instance,
  ``./transducer_stateless_modified/train.py --num-epochs 30`` trains for 30
  epochs and generates ``epoch-0.pt``, ``epoch-1.pt``, ..., ``epoch-29.pt``
  in the folder set by ``--exp-dir``.

- ``--start-epoch``

  It is used to resume training.
  ``./transducer_stateless_modified/train.py --start-epoch 10`` loads the
  checkpoint ``exp_dir/epoch-9.pt`` and starts
  training from epoch 10, based on the state from epoch 9.
- ``--world-size``

  It is used for multi-GPU single-machine DDP training.

  - (a) If it is 1, then no DDP training is used.
  - (b) If it is 2, then GPU 0 and GPU 1 are used for DDP training.

  The following shows some use cases.

  **Use case 1**: You have 4 GPUs, but you want to use only GPU 0 and
  GPU 2 for training:

  .. code-block:: bash

    $ cd egs/aishell/ASR
    $ export CUDA_VISIBLE_DEVICES="0,2"
    $ ./transducer_stateless_modified/train.py --world-size 2

  **Use case 2**: You have 4 GPUs and you want to use all of them
  for training:

  .. code-block:: bash

    $ cd egs/aishell/ASR
    $ ./transducer_stateless_modified/train.py --world-size 4

  **Use case 3**: You have 4 GPUs but you want to use only GPU 3
  for training:

  .. code-block:: bash

    $ cd egs/aishell/ASR
    $ export CUDA_VISIBLE_DEVICES="3"
    $ ./transducer_stateless_modified/train.py --world-size 1

  .. CAUTION::

    Only multi-GPU single-machine DDP training is implemented at present.
    There is an ongoing PR `<https://github.com/k2-fsa/icefall/pull/63>`_
    that adds support for multi-GPU multi-machine DDP training.
- ``--max-duration``

  It specifies the total number of seconds over all utterances in a
  batch, **before padding**.
  If you encounter CUDA OOM, please reduce it. For instance, if
  you are using a V100 NVIDIA GPU with 32 GB RAM, we recommend
  setting it to ``300``.

  .. HINT::

    Due to padding, the total number of seconds of all utterances in a
    batch will usually be larger than ``--max-duration``.

    A larger value for ``--max-duration`` may cause OOM during training,
    while a smaller value may increase the training time. You have to
    tune it.
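  To illustrate what ``--max-duration`` controls, here is a toy version of
  duration-based batching. It is only an illustration; the real batching is
  done by lhotse's samplers, which also bucket utterances by duration.

  .. code-block:: python

    def batch_by_duration(durations, max_duration=300.0):
        """Toy example: group utterance indices so that the total
        (unpadded) duration of each batch stays below max_duration."""
        batches, batch, total = [], [], 0.0
        for idx, dur in enumerate(durations):
            if batch and total + dur > max_duration:
                batches.append(batch)
                batch, total = [], 0.0
            batch.append(idx)
            total += dur
        if batch:
            batches.append(batch)
        return batches

    # 8 utterances between 3 s and 15 s -> batches of varying size
    print(batch_by_duration([3.2, 14.7, 9.1, 5.5, 12.0, 4.4, 8.8, 10.3],
                            max_duration=30.0))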
- ``--lr-factor``

  It controls the learning rate. If you use a single GPU for training, you
  may want to use a small value for it. If you use multiple GPUs for
  training, you may increase it.
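  icefall's transformer-based recipes typically use a Noam-style learning
  rate schedule, in which ``--lr-factor`` scales the entire curve. Here is a
  sketch under that assumption; the constants are illustrative, not the
  recipe's actual defaults.

  .. code-block:: python

    def noam_lr(step: int, lr_factor: float, model_size: int = 512,
                warmup: int = 80000) -> float:
        """Noam schedule: linear warmup, then inverse-square-root decay.
        lr_factor scales the whole curve, which is why training with more
        GPUs (a larger effective batch) can afford a larger value."""
        step = max(step, 1)
        return (lr_factor * model_size ** -0.5
                * min(step ** -0.5, step * warmup ** -1.5))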
- ``--context-size``

  It specifies the kernel size of the Conv1d layer in the decoder. The
  default value 2 means the decoder works like a tri-gram LM.

- ``--modified-transducer-prob``

  It specifies the probability of using the modified transducer loss.
  If it is 0, the modified transducer is never used; if it is 1, it is
  used for all batches; if it is ``p``, it is applied with probability
  ``p``.
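  A sketch of how such a per-batch choice can be made (illustrative, not
  icefall's exact code):

  .. code-block:: python

    import random

    def use_modified_loss(modified_transducer_prob: float) -> bool:
        """Decide per batch whether to apply the modified transducer loss."""
        # prob = 0 -> never; prob = 1 -> always; otherwise Bernoulli(p)
        return random.random() < modified_transducer_prob

  The returned flag would then be forwarded to the loss computation, e.g.,
  as the ``one_sym_per_frame`` argument shown earlier.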
There are some training options, e.g., the number of warmup steps,
that are not passed from the commandline.
They are pre-configured by the function ``get_params()`` in
`transducer_stateless_modified/train.py <https://github.com/k2-fsa/icefall/blob/master/egs/aishell/ASR/transducer_stateless_modified/train.py#L162>`_.

If you need to change them, please modify ``./transducer_stateless_modified/train.py`` directly.

.. CAUTION::

  The training set is speed-perturbed with two factors, 0.9 and 1.1, so
  each epoch actually processes ``3x150 == 450`` hours of data.
docs/source/recipes/index.rst

@@ -10,12 +10,10 @@ We may add recipes for other tasks as well in the future.
 .. Other recipes are listed in a alphabetical order.
 
 .. toctree::
-   :maxdepth: 3
+   :maxdepth: 2
+   :caption: Table of Contents
 
-   yesno
-   librispeech
-   aishell
-   timit
+   aishell/index
+   librispeech/index
+   timit/index
+   yesno/index
docs/source/recipes/librispeech.rst (deleted, -10 lines)

LibriSpeech
===========

We provide the following models for the LibriSpeech dataset:

.. toctree::
   :maxdepth: 2

   librispeech/tdnn_lstm_ctc
   librispeech/conformer_ctc
docs/source/recipes/librispeech/index.rst (new file, +8 lines)

LibriSpeech
===========

.. toctree::
   :maxdepth: 1

   tdnn_lstm_ctc
   conformer_ctc
docs/source/recipes/timit.rst (deleted, -10 lines)

TIMIT
===========

We provide the following models for the TIMIT dataset:

.. toctree::
   :maxdepth: 2

   timit/tdnn_lstm_ctc
   timit/tdnn_ligru_ctc
docs/source/recipes/timit/index.rst (new file, +9 lines)

TIMIT
=====

.. toctree::
   :maxdepth: 1

   tdnn_ligru_ctc
   tdnn_lstm_ctc
docs/source/recipes/timit/tdnn_ligru_ctc.rst

@@ -1,5 +1,5 @@
 TDNN-LiGRU-CTC
-=============
+==============
 
 This tutorial shows you how to run a TDNN-LiGRU-CTC model with the `TIMIT <https://data.deepai.org/timit.zip>`_ dataset.
(image renamed; 121 KiB, content unchanged)
docs/source/recipes/yesno/index.rst (new file, +7 lines)

YesNo
=====

.. toctree::
   :maxdepth: 1

   tdnn
docs/source/recipes/yesno/tdnn.rst

@@ -1,5 +1,5 @@
-yesno
-=====
+TDNN-CTC
+========
 
 This page shows you how to run the `yesno <https://www.openslr.org/1>`_ recipe. It contains:
@@ -145,7 +145,7 @@ In ``tdnn/exp``, you will find the following files:
 Note there is a URL in the above output, click it and you will see
 the following screenshot:
 
-.. figure:: images/yesno-tdnn-tensorboard-log.png
+.. figure:: images/tdnn-tensorboard-log.png
    :width: 600
    :alt: TensorBoard screenshot
    :align: center