yesno
=====

This page shows you how to run the ``yesno`` recipe. It contains:

- (1) Prepare data for training
- (2) Train a TDNN model

  - (a) View text format logs and visualize TensorBoard logs
  - (b) Select device type, i.e., CPU or GPU, for training
  - (c) Change training options
  - (d) Resume training from a checkpoint

- (3) Decode with a trained model

  - (a) Select a checkpoint for decoding
  - (b) Model averaging

- (4) Colab notebook

  - (a) It shows you step by step how to set up the environment, how to do training,
    and how to do decoding
  - (b) How to use a pre-trained model

- (5) Inference with a pre-trained model

  - (a) Download a pre-trained model, provided by us
  - (b) Decode a single sound file with a pre-trained model
  - (c) Decode multiple sound files at the same time

It does **NOT** show you:

- (1) How to train with multiple GPUs

  The ``yesno`` dataset is so small that CPU is more than enough
  for training as well as for decoding.

- (2) How to use LM rescoring for decoding

  The dataset does not have an LM for rescoring.

.. HINT::

  We assume you have read the page :ref:`install icefall` and have set up
  the environment for ``icefall``.

.. HINT::

  You **don't** need a **GPU** to run this recipe. It can be run on a **CPU**.
  The training part takes less than 30 **seconds** on a CPU and you will get
  the following WER at the end::

    [test_set] %WER 0.42% [1 / 240, 0 ins, 1 del, 0 sub ]

Data preparation
----------------

.. code-block:: bash

  $ cd egs/yesno/ASR
  $ ./prepare.sh

The script ``./prepare.sh`` handles the data preparation for you, **automagically**.
All you need to do is to run it.

The data preparation contains several stages. You can use the following two
options:

- ``--stage``
- ``--stop-stage``

to control which stage(s) should be run. By default, all stages are executed.

For example,

.. code-block:: bash

  $ cd egs/yesno/ASR
  $ ./prepare.sh --stage 0 --stop-stage 0

means to run only stage 0.

To run stage 2 to stage 5, use:

.. code-block:: bash

  $ ./prepare.sh --stage 2 --stop-stage 5

Training
--------

We provide only a TDNN model, contained in the ``tdnn`` folder, for ``yesno``.

The command to run the training part is:

.. code-block:: bash

  $ cd egs/yesno/ASR
  $ export CUDA_VISIBLE_DEVICES=""
  $ ./tdnn/train.py

By default, it will run ``15`` epochs. Training logs and checkpoints are saved
in ``tdnn/exp``.

In ``tdnn/exp``, you will find the following files:

- ``epoch-0.pt``, ``epoch-1.pt``, ...

  These are checkpoint files, containing model ``state_dict`` and optimizer ``state_dict``.
  To resume training from some checkpoint, say ``epoch-10.pt``, you can use:

  .. code-block:: bash

    $ ./tdnn/train.py --start-epoch 11

- ``tensorboard/``

  This folder contains TensorBoard logs. Training loss, validation loss, learning
  rate, etc., are recorded in these logs. You can visualize them by:

  .. code-block:: bash

    $ cd tdnn/exp/tensorboard
    $ tensorboard dev upload --logdir . --description "TDNN training for yesno with icefall"

  It will print something like below:

  .. code-block::

    TensorFlow installation not found - running with reduced feature set.
    Upload started and will continue reading any new data as it's added to the logdir.

    To stop uploading, press Ctrl-C.

    New experiment created. View your TensorBoard at: https://tensorboard.dev/experiment/yKUbhb5wRmOSXYkId1z9eg/

    [2021-08-23T23:49:41] Started scanning logdir.
    [2021-08-23T23:49:42] Total uploaded: 135 scalars, 0 tensors, 0 binary objects
    Listening for new data in logdir...

  Note that there is a URL in the above output. Click it and you will see
  the following screenshot:

  .. figure:: images/yesno-tdnn-tensorboard-log.png
    :width: 600
    :alt: TensorBoard screenshot
    :align: center
    :target: https://tensorboard.dev/experiment/yKUbhb5wRmOSXYkId1z9eg/

    TensorBoard screenshot.

- ``log/log-train-xxxx``

  It is the detailed training log in text format, the same as the one
  you saw printed to the console during training.

.. NOTE::

  By default, ``./tdnn/train.py`` uses GPU 0 for training if GPUs are available.
  If you have two GPUs, say, GPU 0 and GPU 1, and you want to use GPU 1 for
  training, you can run:

  .. code-block:: bash

    $ export CUDA_VISIBLE_DEVICES="1"
    $ ./tdnn/train.py

  Since the ``yesno`` dataset is very small, containing only 30 sound files
  for training, and the model in use is also very small, we use:

  .. code-block:: bash

    $ export CUDA_VISIBLE_DEVICES=""

  so that ``./tdnn/train.py`` uses CPU during training.

  If you don't have GPUs, then you don't need to
  run ``export CUDA_VISIBLE_DEVICES=""``.

To see available training options, you can use:

.. code-block:: bash

  $ ./tdnn/train.py --help

Other training options, e.g., learning rate, results dir, etc., are
pre-configured in the function ``get_params()``
in ``tdnn/train.py``.
Normally, you don't need to change them. If you do want to change them,
modify the code.

Decoding
--------

The decoding part uses checkpoints saved by the training part, so you have
to run the training part first.

The command for decoding is:

.. code-block:: bash

  $ export CUDA_VISIBLE_DEVICES=""
  $ ./tdnn/decode.py

You will see the WER in the output log.

Decoded results are saved in ``tdnn/exp``.

.. code-block:: bash

  $ ./tdnn/decode.py --help

shows you the available decoding options.

Some commonly used options are:

- ``--epoch``

  It selects which checkpoint is used for decoding.
  For instance, ``./tdnn/decode.py --epoch 10`` means to use
  ``./tdnn/exp/epoch-10.pt`` for decoding.

- ``--avg``

  It is related to model averaging. It specifies the number of checkpoints
  to be averaged. The averaged model is used for decoding.
  For example, the following command:

  .. code-block:: bash

    $ ./tdnn/decode.py --epoch 10 --avg 3

  uses the average of ``epoch-8.pt``, ``epoch-9.pt`` and ``epoch-10.pt``
  for decoding.

- ``--export``

  If it is ``True``, i.e., ``./tdnn/decode.py --export 1``, the code
  will save the averaged model to ``tdnn/exp/pretrained.pt``.
  See :ref:`yesno use a pre-trained model` for how to use it.

.. _yesno use a pre-trained model:

Pre-trained Model
-----------------

We have uploaded the pre-trained model to
`<https://huggingface.co/csukuangfj/icefall_asr_yesno_tdnn>`_.

The following shows you how to use the pre-trained model.

Download the pre-trained model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

  $ cd egs/yesno/ASR
  $ mkdir tmp
  $ cd tmp
  $ git lfs install
  $ git clone https://huggingface.co/csukuangfj/icefall_asr_yesno_tdnn

.. CAUTION::

  You have to use ``git lfs`` to download the pre-trained model.

After downloading, you will have the following files:

.. code-block:: bash

  $ cd egs/yesno/ASR
  $ tree tmp

.. code-block:: bash

  tmp/
  `-- icefall_asr_yesno_tdnn
      |-- README.md
      |-- lang_phone
      |   |-- HLG.pt
      |   |-- L.pt
      |   |-- L_disambig.pt
      |   |-- Linv.pt
      |   |-- lexicon.txt
      |   |-- lexicon_disambig.txt
      |   |-- tokens.txt
      |   `-- words.txt
      |-- lm
      |   |-- G.arpa
      |   `-- G.fst.txt
      |-- pretrained.pt
      `-- test_waves
          |-- 0_0_0_1_0_0_0_1.wav
          |-- 0_0_1_0_0_0_1_0.wav
          |-- 0_0_1_0_0_1_1_1.wav
          |-- 0_0_1_0_1_0_0_1.wav
          |-- 0_0_1_1_0_0_0_1.wav
          |-- 0_0_1_1_0_1_1_0.wav
          |-- 0_0_1_1_1_0_0_0.wav
          |-- 0_0_1_1_1_1_0_0.wav
          |-- 0_1_0_0_0_1_0_0.wav
          |-- 0_1_0_0_1_0_1_0.wav
          |-- 0_1_0_1_0_0_0_0.wav
          |-- 0_1_0_1_1_1_0_0.wav
          |-- 0_1_1_0_0_1_1_1.wav
          |-- 0_1_1_1_0_0_1_0.wav
          |-- 0_1_1_1_1_0_1_0.wav
          |-- 1_0_0_0_0_0_0_0.wav
          |-- 1_0_0_0_0_0_1_1.wav
          |-- 1_0_0_1_0_1_1_1.wav
          |-- 1_0_1_1_0_1_1_1.wav
          |-- 1_0_1_1_1_1_0_1.wav
          |-- 1_1_0_0_0_1_1_1.wav
          |-- 1_1_0_0_1_0_1_1.wav
          |-- 1_1_0_1_0_1_0_0.wav
          |-- 1_1_0_1_1_0_0_1.wav
          |-- 1_1_0_1_1_1_1_0.wav
          |-- 1_1_1_0_0_1_0_1.wav
          |-- 1_1_1_0_1_0_1_0.wav
          |-- 1_1_1_1_0_0_1_0.wav
          |-- 1_1_1_1_1_0_0_0.wav
          `-- 1_1_1_1_1_1_1_1.wav

  4 directories, 42 files

.. code-block:: bash

  $ soxi tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav

  Input File     : 'tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav'
  Channels       : 1
  Sample Rate    : 8000
  Precision      : 16-bit
  Duration       : 00:00:06.76 = 54080 samples ~ 507 CDDA sectors
  File Size      : 108k
  Bit Rate       : 128k
  Sample Encoding: 16-bit Signed Integer PCM

- ``0_0_1_0_1_0_0_1.wav``

  0 means No; 1 means Yes. The words are spoken in Hebrew, not in English.
  So this file contains ``NO NO YES NO YES NO NO YES``.

Download kaldifeat
~~~~~~~~~~~~~~~~~~

``kaldifeat`` is used for extracting features from one or more sound files.
Please install ``kaldifeat`` first by following its installation instructions.

Inference with a pre-trained model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

  $ cd egs/yesno/ASR
  $ ./tdnn/pretrained.py --help

shows the usage information of ``./tdnn/pretrained.py``.

To decode a single file, we can use:

.. code-block:: bash

  ./tdnn/pretrained.py \
    --checkpoint ./tmp/icefall_asr_yesno_tdnn/pretrained.pt \
    --words-file ./tmp/icefall_asr_yesno_tdnn/lang_phone/words.txt \
    --HLG ./tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt \
    ./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav

The output is:

.. code-block::

  2021-08-24 12:22:51,621 INFO [pretrained.py:119] {'feature_dim': 23, 'num_classes': 4, 'sample_rate': 8000, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'checkpoint': './tmp/icefall_asr_yesno_tdnn/pretrained.pt', 'words_file': './tmp/icefall_asr_yesno_tdnn/lang_phone/words.txt', 'HLG': './tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt', 'sound_files': ['./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav']}
  2021-08-24 12:22:51,645 INFO [pretrained.py:125] device: cpu
  2021-08-24 12:22:51,645 INFO [pretrained.py:127] Creating model
  2021-08-24 12:22:51,650 INFO [pretrained.py:139] Loading HLG from ./tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt
  2021-08-24 12:22:51,651 INFO [pretrained.py:143] Constructing Fbank computer
  2021-08-24 12:22:51,652 INFO [pretrained.py:153] Reading sound files: ['./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav']
  2021-08-24 12:22:51,684 INFO [pretrained.py:159] Decoding started
  2021-08-24 12:22:51,708 INFO [pretrained.py:198] ./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav: NO NO YES NO YES NO NO YES
  2021-08-24 12:22:51,708 INFO [pretrained.py:200] Decoding Done

You can see that for the sound file ``0_0_1_0_1_0_0_1.wav``, the decoding result is
``NO NO YES NO YES NO NO YES``.

To decode **multiple** files at the same time, you can use:

.. code-block:: bash

  ./tdnn/pretrained.py \
    --checkpoint ./tmp/icefall_asr_yesno_tdnn/pretrained.pt \
    --words-file ./tmp/icefall_asr_yesno_tdnn/lang_phone/words.txt \
    --HLG ./tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt \
    ./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav \
    ./tmp/icefall_asr_yesno_tdnn/test_waves/1_0_1_1_0_1_1_1.wav

The decoding output is:

.. code-block::

  2021-08-24 12:25:20,159 INFO [pretrained.py:119] {'feature_dim': 23, 'num_classes': 4, 'sample_rate': 8000, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'checkpoint': './tmp/icefall_asr_yesno_tdnn/pretrained.pt', 'words_file': './tmp/icefall_asr_yesno_tdnn/lang_phone/words.txt', 'HLG': './tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt', 'sound_files': ['./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav', './tmp/icefall_asr_yesno_tdnn/test_waves/1_0_1_1_0_1_1_1.wav']}
  2021-08-24 12:25:20,181 INFO [pretrained.py:125] device: cpu
  2021-08-24 12:25:20,181 INFO [pretrained.py:127] Creating model
  2021-08-24 12:25:20,185 INFO [pretrained.py:139] Loading HLG from ./tmp/icefall_asr_yesno_tdnn/lang_phone/HLG.pt
  2021-08-24 12:25:20,186 INFO [pretrained.py:143] Constructing Fbank computer
  2021-08-24 12:25:20,187 INFO [pretrained.py:153] Reading sound files: ['./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav', './tmp/icefall_asr_yesno_tdnn/test_waves/1_0_1_1_0_1_1_1.wav']
  2021-08-24 12:25:20,213 INFO [pretrained.py:159] Decoding started
  2021-08-24 12:25:20,287 INFO [pretrained.py:198] ./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav: NO NO YES NO YES NO NO YES
  ./tmp/icefall_asr_yesno_tdnn/test_waves/1_0_1_1_0_1_1_1.wav: YES NO YES YES NO YES YES YES
  2021-08-24 12:25:20,287 INFO [pretrained.py:200] Decoding Done

You can see again that it decodes correctly.

Colab notebook
--------------

We provide a Colab notebook for this recipe.

|yesno colab notebook|

.. |yesno colab notebook| image:: https://colab.research.google.com/assets/colab-badge.svg
   :target: https://colab.research.google.com/drive/1tIjjzaJc3IvGyKiMCDWO-TSnBgkcuN3B?usp=sharing

**Congratulations!** You have finished the simplest speech recognition recipe in ``icefall``.
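
As an optional sanity check on the results above, the following minimal sketch
derives the reference transcript from a test file name (0 means ``NO``, 1 means
``YES``) and lists the checkpoint files that an ``--epoch``/``--avg`` pair
refers to. The helper names ``expected_transcript`` and
``checkpoints_to_average`` are hypothetical and are not part of ``icefall``.
The checkpoint rule is an assumption generalized from the single
``--epoch 10 --avg 3`` example on this page.

.. code-block:: python

  from pathlib import Path
  from typing import List

  # Label convention from this page: 0 means No, 1 means Yes.
  LABELS = {"0": "NO", "1": "YES"}


  def expected_transcript(wav_path: str) -> str:
      """Return the reference words encoded in a yesno file name.

      Hypothetical helper: "0_0_1_0_1_0_0_1.wav" -> "NO NO YES NO YES NO NO YES".
      """
      digits = Path(wav_path).stem.split("_")
      return " ".join(LABELS[d] for d in digits)


  def checkpoints_to_average(epoch: int, avg: int) -> List[str]:
      """List the checkpoint files covered by --epoch and --avg.

      Assumption: the last `avg` consecutive checkpoints ending at `epoch`
      are averaged, as in the `--epoch 10 --avg 3` example.
      """
      start = epoch - avg + 1
      if avg < 1 or start < 0:
          raise ValueError("avg must be >= 1 and must not exceed epoch + 1")
      return [f"epoch-{i}.pt" for i in range(start, epoch + 1)]


  if __name__ == "__main__":
      wav = "./tmp/icefall_asr_yesno_tdnn/test_waves/0_0_1_0_1_0_0_1.wav"
      print(expected_transcript(wav))
      print(checkpoints_to_average(epoch=10, avg=3))

Comparing the output of ``expected_transcript`` with the text printed by
``./tdnn/pretrained.py`` gives a quick way to confirm that a file was decoded
correctly.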