yesno
=====

This page shows you how to run the `yesno <https://www.openslr.org/1>`_ recipe. It contains:

- (1) Prepare data for training

- (2) Train a TDNN model

  - (a) View text format logs and visualize TensorBoard logs
  - (b) Select the device type, i.e., CPU or GPU, for training
  - (c) Change training options
  - (d) Resume training from a checkpoint

- (3) Decode with a trained model

  - (a) Select a checkpoint for decoding
  - (b) Model averaging

- (4) Colab notebook

  - (a) It shows you step by step how to set up the environment, how to do
    training, and how to do decoding
  - (b) How to use a pre-trained model

- (5) Inference with a pre-trained model

  - (a) Download a pre-trained model provided by us
  - (b) Decode a single sound file with a pre-trained model
  - (c) Decode multiple sound files at the same time

It does **NOT** show you:

- (1) How to train with multiple GPUs

  The ``yesno`` dataset is so small that a CPU is more than enough
  for training as well as for decoding.

- (2) How to use LM rescoring for decoding

  The dataset does not have an LM for rescoring.

.. HINT::

   We assume you have read the page :ref:`install icefall` and have set up
   the environment for ``icefall``.

.. HINT::

   You **don't** need a **GPU** to run this recipe. It can be run on a **CPU**.
   The training part takes less than 30 **seconds** on a CPU and you will get
   the following WER at the end::

      [test_set] %WER 0.42% [1 / 240, 0 ins, 1 del, 0 sub ]

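The reported WER is simply the number of word errors divided by the number of
reference words: with 0 insertions, 1 deletion, and 0 substitutions out of 240
words, that is 1 / 240, or about 0.42%. A quick check of the arithmetic (plain
Python, just for illustration):

.. code-block:: python

   # WER = (insertions + deletions + substitutions) / number of reference words
   ins, dels, subs, ref_words = 0, 1, 0, 240
   wer = (ins + dels + subs) / ref_words
   print(f"{wer:.2%}")  # 0.42%
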
Data preparation
----------------

.. code-block:: bash

   $ cd egs/yesno/ASR
   $ ./prepare.sh

The script ``./prepare.sh`` handles the data preparation for you, **automagically**.
All you need to do is to run it.

The data preparation contains several stages. You can use the following two
options:

- ``--stage``
- ``--stop-stage``

to control which stage(s) should be run. By default, all stages are executed.

For example,

.. code-block:: bash

   $ cd egs/yesno/ASR
   $ ./prepare.sh --stage 0 --stop-stage 0

means to run only stage 0.

To run stage 2 to stage 5, use:

.. code-block:: bash

   $ ./prepare.sh --stage 2 --stop-stage 5

Training
--------

We provide only a TDNN model, contained in
the `tdnn <https://github.com/k2-fsa/icefall/tree/master/egs/yesno/ASR/tdnn>`_
folder, for ``yesno``.

The command to run the training part is:

.. code-block:: bash

   $ cd egs/yesno/ASR
   $ export CUDA_VISIBLE_DEVICES=""
   $ ./tdnn/train.py

By default, it will run ``15`` epochs. Training logs and checkpoints are saved
in ``tdnn/exp``.

In ``tdnn/exp``, you will find the following files:

- ``epoch-0.pt``, ``epoch-1.pt``, ...

  These are checkpoint files containing the model ``state_dict`` and the
  optimizer ``state_dict``.
  To resume training from some checkpoint, say ``epoch-10.pt``, you can use:

  .. code-block:: bash

     $ ./tdnn/train.py --start-epoch 11

  A sketch showing how to inspect such a checkpoint is given right after this list.

- ``tensorboard/``

  This folder contains TensorBoard logs. Training loss, validation loss, learning
  rate, etc., are recorded in these logs. You can visualize them by running:

  .. code-block:: bash

     $ cd tdnn/exp/tensorboard
     $ tensorboard dev upload --logdir . --description "TDNN training for yesno with icefall"

  It will print something like the following:

  .. code-block::

     TensorFlow installation not found - running with reduced feature set.
     Upload started and will continue reading any new data as it's added to the logdir.

     To stop uploading, press Ctrl-C.

     New experiment created. View your TensorBoard at: https://tensorboard.dev/experiment/yKUbhb5wRmOSXYkId1z9eg/

     [2021-08-23T23:49:41] Started scanning logdir.
     [2021-08-23T23:49:42] Total uploaded: 135 scalars, 0 tensors, 0 binary objects
     Listening for new data in logdir...

  Note that there is a URL in the above output. Click it and you will see
  the following screenshot:

  .. figure:: images/yesno-tdnn-tensorboard-log.png
     :width: 600
     :alt: TensorBoard screenshot
     :align: center
     :target: https://tensorboard.dev/experiment/yKUbhb5wRmOSXYkId1z9eg/

     TensorBoard screenshot.

- ``log/log-train-xxxx``

  It is the detailed training log in text format, the same as the one
  you saw printed to the console during training.

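If you are curious what the ``epoch-*.pt`` checkpoints mentioned above actually
contain, you can open one with PyTorch. The following is a minimal sketch; the
exact set of keys is an implementation detail of ``icefall`` and may change, so
print them rather than relying on particular names:

.. code-block:: python

   import torch

   # A checkpoint is a plain dict produced by torch.save().
   ckpt = torch.load("tdnn/exp/epoch-10.pt", map_location="cpu")

   # Expect the model state_dict and the optimizer state_dict among the
   # entries (key names may vary between icefall versions).
   print(list(ckpt.keys()))
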
.. NOTE::

   By default, ``./tdnn/train.py`` uses GPU 0 for training if GPUs are available.
   If you have two GPUs, say, GPU 0 and GPU 1, and you want to use GPU 1 for
   training, you can run:

   .. code-block:: bash

      $ export CUDA_VISIBLE_DEVICES="1"
      $ ./tdnn/train.py

   Since the ``yesno`` dataset is very small, containing only 30 sound files
   for training, and the model in use is also very small, we use:

   .. code-block:: bash

      $ export CUDA_VISIBLE_DEVICES=""

   so that ``./tdnn/train.py`` uses the CPU during training.

   If you don't have GPUs, then you don't need to
   run ``export CUDA_VISIBLE_DEVICES=""``.

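Why does an empty ``CUDA_VISIBLE_DEVICES`` force CPU training? With the variable
set to an empty string, PyTorch sees no GPUs at all. The snippet below is only an
illustration of that effect, not part of the recipe:

.. code-block:: python

   import os

   # Hide all GPUs from this process *before* importing torch.
   os.environ["CUDA_VISIBLE_DEVICES"] = ""

   import torch

   # With no visible GPU, CUDA is reported as unavailable, so training
   # scripts that check for a GPU fall back to the CPU.
   print(torch.cuda.is_available())  # False
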
To see available training options, you can use:

.. code-block:: bash

   $ ./tdnn/train.py --help

Other training options, e.g., the learning rate and the results directory, are
pre-configured in the function ``get_params()``
in `tdnn/train.py <https://github.com/k2-fsa/icefall/blob/master/egs/yesno/ASR/tdnn/train.py>`_.
Normally, you don't need to change them; if you do want to, change them by
modifying the code.

Decoding
--------

The decoding part uses checkpoints saved by the training part, so you have
to run the training part first.

The command for decoding is:

.. code-block:: bash

   $ export CUDA_VISIBLE_DEVICES=""
   $ ./tdnn/decode.py

You will see the WER in the output log.

Decoded results are saved in ``tdnn/exp``.

.. code-block:: bash

   $ ./tdnn/decode.py --help

shows you the available decoding options.

Some commonly used options are:

- ``--epoch``

  It selects which checkpoint to use for decoding.
  For instance, ``./tdnn/decode.py --epoch 10`` means to use
  ``./tdnn/exp/epoch-10.pt`` for decoding.

- ``--avg``

  It is related to model averaging: it specifies the number of checkpoints
  to average, and the averaged model is used for decoding.
  For example, the following command:

  .. code-block:: bash

     $ ./tdnn/decode.py --epoch 10 --avg 3

  uses the average of ``epoch-8.pt``, ``epoch-9.pt`` and ``epoch-10.pt``
  for decoding. A sketch of what the averaging does is given right after this list.

- ``--export``

  If it is ``True``, i.e., ``./tdnn/decode.py --export 1``, the code
  will save the averaged model to ``tdnn/exp/pretrained.pt``.
  See :ref:`yesno use a pre-trained model` for how to use it.

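For the curious: checkpoint averaging computes an element-wise average of the
model weights across the selected epochs. The snippet below is a simplified
sketch, not icefall's actual implementation (see ``icefall/checkpoint.py`` for
that); it assumes the model weights are stored under a ``model`` key in each
checkpoint:

.. code-block:: python

   import torch


   def average_checkpoints(filenames):
       """Element-wise average of the model weights in several checkpoints."""
       avg = torch.load(filenames[0], map_location="cpu")["model"]
       for f in filenames[1:]:
           state = torch.load(f, map_location="cpu")["model"]
           for k in avg:
               avg[k] = avg[k] + state[k]
       for k in avg:
           # Note: integer buffers (if any) become floats here; a real
           # implementation handles them separately.
           avg[k] = avg[k] / len(filenames)
       return avg


   # Roughly what ``--epoch 10 --avg 3`` does before decoding:
   averaged = average_checkpoints([f"tdnn/exp/epoch-{i}.pt" for i in (8, 9, 10)])
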
.. _yesno use a pre-trained model:

Pre-trained Model
-----------------

We have uploaded the pre-trained model to
`<https://huggingface.co/csukuangfj/icefall_asr_yesno_tdnn>`_.

The following shows you how to use the pre-trained model.

Download the pre-trained model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   $ cd egs/yesno/ASR
   $ mkdir tmp
   $ cd tmp
   $ git lfs install
   $ git clone https://huggingface.co/csukuangfj/icefall_asr_yesno_tdnn

.. CAUTION::

   You have to use ``git lfs`` to download the pre-trained model.

After downloading, you will have the following files:

.. code-block:: bash

   $ cd egs/yesno/ASR
   $ tree tmp

.. code-block:: bash

   tmp/
   `-- icefall_asr_yesno_tdnn
       |-- README.md
       |-- lang_phone
       |   |-- HLG.pt
       |   |-- L.pt
       |   |-- L_disambig.pt
       |   |-- Linv.pt
       |   |-- lexicon.txt
       |   |-- lexicon_disambig.txt
       |   |-- tokens.txt
       |   `-- words.txt
       |-- lm
       |   |-- G.arpa
       |   `-- G.fst.txt
       |-- pretrained.pt
       `-- test_waves
           |-- 0_0_0_1_0_0_0_1.wav
           |-- 0_0_1_0_0_0_1_0.wav
           |-- 0_0_1_0_0_1_1_1.wav
           |-- 0_0_1_0_1_0_0_1.wav
           |-- 0_0_1_1_0_0_0_1.wav
           |-- 0_0_1_1_0_1_1_0.wav
           |-- 0_0_1_1_1_0_0_0.wav
           |-- 0_0_1_1_1_1_0_0.wav
           |-- 0_1_0_0_0_1_0_0.wav
           |-- 0_1_0_0_1_0_1_0.wav
           |-- 0_1_0_1_0_0_0_0.wav
           |-- 0_1_0_1_1_1_0_0.wav
           |-- 0_1_1_0_0_1_1_1.wav
           |-- 0_1_1_1_0_0_1_0.wav
           |-- 0_1_1_1_1_0_1_0.wav
           |-- 1_0_0_0_0_0_0_0.wav
           |-- 1_0_0_0_0_0_1_1.wav
           |-- 1_0_0_1_0_1_1_1.wav
           |-- 1_0_1_1_0_1_1_1.wav
           |-- 1_0_1_1_1_1_0_1.wav
           |-- 1_1_0_0_0_1_1_1.wav
           |-- 1_1_0_0_1_0_1_1.wav
           |-- 1_1_0_1_0_1_0_0.wav
           |-- 1_1_0_1_1_0_0_1.wav
           |-- 1_1_0_1_1_1_1_0.wav
           |-- 1_1_1_0_0_1_0_1.wav
           |-- 1_1_1_0_1_0_1_0.wav
           |-- 1_1_1_1_0_0_1_0.wav
           |-- 1_1_1_1_1_0_0_0.wav
           `-- 1_1_1_1_1_1_1_1.wav

   4 directories, 42 files

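All of the ``*.pt`` files can be loaded with ``torch.load()``. For instance,
``lang_phone/HLG.pt`` stores the decoding graph as a serialized ``k2`` FSA.
A minimal sketch for inspecting it (assuming ``k2`` is installed, which
``icefall`` requires anyway):

.. code-block:: python

   import torch
   import k2

   # The graph is saved via its dict representation, so load the dict first
   # and then rebuild the FSA from it.
   d = torch.load(
       "tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt", map_location="cpu"
   )
   HLG = k2.Fsa.from_dict(d)

   print(HLG.shape)     # basic sanity check
   print(HLG.num_arcs)
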
.. code-block:: bash

   $ soxi tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav

   Input File     : 'tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav'
   Channels       : 1
   Sample Rate    : 8000
   Precision      : 16-bit
   Duration       : 00:00:06.76 = 54080 samples ~ 507 CDDA sectors
   File Size      : 108k
   Bit Rate       : 128k
   Sample Encoding: 16-bit Signed Integer PCM

- ``0_0_1_0_1_0_0_1.wav``

  ``0`` means No; ``1`` means Yes. No and Yes are spoken not in English
  but in `Hebrew <https://en.wikipedia.org/wiki/Hebrew_language>`_.
  So this file contains ``NO NO YES NO YES NO NO YES``; the file name itself
  encodes the transcript, as the sketch below shows.

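Since the file names encode the reference transcripts, you can recover the
expected recognition result directly from a name. A tiny illustration:

.. code-block:: python

   # 0 -> NO, 1 -> YES; the digits in the file name give the reference transcript.
   name = "0_0_1_0_1_0_0_1"
   transcript = " ".join("YES" if d == "1" else "NO" for d in name.split("_"))
   print(transcript)  # NO NO YES NO YES NO NO YES
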
Download kaldifeat
~~~~~~~~~~~~~~~~~~

`kaldifeat <https://github.com/csukuangfj/kaldifeat>`_ is used for extracting
features from a single sound file or from multiple sound files. Please refer to
`<https://github.com/csukuangfj/kaldifeat>`_ to install ``kaldifeat`` first.

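Once ``kaldifeat`` is installed, extracting fbank features for a ``yesno`` wave
looks roughly like the sketch below. The option values (8 kHz sampling rate,
23 mel bins) mirror the ``sample_rate`` and ``feature_dim`` that appear in the
decoding logs later on this page; see ``tdnn/pretrained.py`` for the exact
options the recipe uses:

.. code-block:: python

   import torch
   import torchaudio
   import kaldifeat

   # Configure an fbank extractor matching the yesno model:
   # 8 kHz audio and 23-dimensional features.
   opts = kaldifeat.FbankOptions()
   opts.device = torch.device("cpu")
   opts.frame_opts.samp_freq = 8000
   opts.mel_opts.num_bins = 23
   fbank = kaldifeat.Fbank(opts)

   wave, sample_rate = torchaudio.load(
       "tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav"
   )
   assert sample_rate == 8000

   # kaldifeat accepts a 1-D waveform tensor (or a list of them).
   features = fbank(wave[0])
   print(features.shape)  # (num_frames, 23)
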
Inference with a pre-trained model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   $ cd egs/yesno/ASR
   $ ./tdnn/pretrained.py --help

shows the usage information of ``./tdnn/pretrained.py``.

To decode a single file, we can use:

.. code-block:: bash

   ./tdnn/pretrained.py \
     --checkpoint ./tmp/icefall_asr_yesno_tdnn/pretrained.pt \
     --words-file ./tmp/icefall_asr_yesno_tdnn/lang_phone/words.txt \
     --HLG ./tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt \
     ./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav

The output is:

.. code-block::

   2021-08-24 12:22:51,621 INFO [pretrained.py:119] {'feature_dim': 23, 'num_classes': 4, 'sample_rate': 8000, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'checkpoint': './tmp/icefall_asr_yesno_tdnn/pretrained.pt', 'words_file': './tmp/icefall_asr_yesno_tdnn/lang_phone/words.txt', 'HLG': './tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt', 'sound_files': ['./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav']}
   2021-08-24 12:22:51,645 INFO [pretrained.py:125] device: cpu
   2021-08-24 12:22:51,645 INFO [pretrained.py:127] Creating model
   2021-08-24 12:22:51,650 INFO [pretrained.py:139] Loading HLG from ./tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt
   2021-08-24 12:22:51,651 INFO [pretrained.py:143] Constructing Fbank computer
   2021-08-24 12:22:51,652 INFO [pretrained.py:153] Reading sound files: ['./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav']
   2021-08-24 12:22:51,684 INFO [pretrained.py:159] Decoding started
   2021-08-24 12:22:51,708 INFO [pretrained.py:198]
   ./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav:
   NO NO YES NO YES NO NO YES

   2021-08-24 12:22:51,708 INFO [pretrained.py:200] Decoding Done

You can see that for the sound file ``0_0_1_0_1_0_0_1.wav``, the decoding result is
``NO NO YES NO YES NO NO YES``.

To decode **multiple** files at the same time, you can use:

.. code-block:: bash

   ./tdnn/pretrained.py \
     --checkpoint ./tmp/icefall_asr_yesno_tdnn/pretrained.pt \
     --words-file ./tmp/icefall_asr_yesno_tdnn/lang_phone/words.txt \
     --HLG ./tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt \
     ./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav \
     ./tmp/icefall_asr_yesno_tdnn/test_waves/1_0_1_1_0_1_1_1.wav

The decoding output is:

.. code-block::

   2021-08-24 12:25:20,159 INFO [pretrained.py:119] {'feature_dim': 23, 'num_classes': 4, 'sample_rate': 8000, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'checkpoint': './tmp/icefall_asr_yesno_tdnn/pretrained.pt', 'words_file': './tmp/icefall_asr_yesno_tdnn/lang_phone/words.txt', 'HLG': './tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt', 'sound_files': ['./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav', './tmp/icefall_asr_yesno_tdnn/test_waves/1_0_1_1_0_1_1_1.wav']}
   2021-08-24 12:25:20,181 INFO [pretrained.py:125] device: cpu
   2021-08-24 12:25:20,181 INFO [pretrained.py:127] Creating model
   2021-08-24 12:25:20,185 INFO [pretrained.py:139] Loading HLG from ./tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt
   2021-08-24 12:25:20,186 INFO [pretrained.py:143] Constructing Fbank computer
   2021-08-24 12:25:20,187 INFO [pretrained.py:153] Reading sound files: ['./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav',
   './tmp/icefall_asr_yesno_tdnn/test_waves/1_0_1_1_0_1_1_1.wav']
   2021-08-24 12:25:20,213 INFO [pretrained.py:159] Decoding started
   2021-08-24 12:25:20,287 INFO [pretrained.py:198]
   ./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav:
   NO NO YES NO YES NO NO YES

   ./tmp/icefall_asr_yesno_tdnn/test_waves/1_0_1_1_0_1_1_1.wav:
   YES NO YES YES NO YES YES YES

   2021-08-24 12:25:20,287 INFO [pretrained.py:200] Decoding Done

You can see again that it decodes correctly.

Colab notebook
--------------

We do provide a Colab notebook for this recipe.

|yesno colab notebook|

.. |yesno colab notebook| image:: https://colab.research.google.com/assets/colab-badge.svg
   :target: https://colab.research.google.com/drive/1tIjjzaJc3IvGyKiMCDWO-TSnBgkcuN3B?usp=sharing

**Congratulations!** You have finished the simplest speech recognition recipe in ``icefall``.