diff --git a/docs/source/recipes/Non-streaming-ASR/librispeech/distillation.rst b/docs/source/recipes/Non-streaming-ASR/librispeech/distillation.rst
new file mode 100644
index 000000000..aa379c3f8
--- /dev/null
+++ b/docs/source/recipes/Non-streaming-ASR/librispeech/distillation.rst
@@ -0,0 +1,220 @@
+Distillation with HuBERT
+========================
+
+This tutorial shows you how to perform knowledge distillation in ``icefall``
+with the `LibriSpeech `_ dataset. The distillation method
+used here is called "Multi Vector Quantization Knowledge Distillation" (MVQ-KD).
+Please have a look at our paper `Predicting Multi-Codebook Vector Quantization Indexes for Knowledge Distillation `_
+for more details about MVQ-KD.
+
+.. note::
+
+   This tutorial is based on the recipe
+   `pruned_transducer_stateless4 `_.
+   Currently, we only implement MVQ-KD in this recipe. However, MVQ-KD is
+   theoretically applicable to all recipes with only minor changes needed.
+   Feel free to try out MVQ-KD in different recipes. If you encounter any
+   problems, please open an issue at `icefall `_.
+
+.. note::
+
+   We assume you have read the page :ref:`install icefall` and have set up
+   the environment for ``icefall``.
+
+.. HINT::
+
+  We recommend using one or more GPUs to run this recipe.
+
+Data preparation
+----------------
+
+We first prepare the necessary training data for ``LibriSpeech``.
+This is the same as in `Pruned_transducer_statelessX <./pruned_transducer_stateless.rst>`_.
+
+.. hint::
+
+   The data preparation is the same as for other LibriSpeech recipes.
+   If you have already finished this step, you can skip directly to
+   ``Codebook index preparation``.
+
+.. code-block:: bash
+
+  $ cd egs/librispeech/ASR
+  $ ./prepare.sh
+
+The script ``./prepare.sh`` handles the data preparation for you, **automagically**.
+All you need to do is run it.
+
+The data preparation contains several stages. You can use the following two
+options:
+
+  - ``--stage``
+  - ``--stop-stage``
+
+to control which stage(s) should be run. By default, all stages are executed.
+
+For example,
+
+.. code-block:: bash
+
+  $ cd egs/librispeech/ASR
+  $ ./prepare.sh --stage 0 --stop-stage 0 # run only stage 0
+  $ ./prepare.sh --stage 2 --stop-stage 5 # run from stage 2 to stage 5
+
+.. HINT::
+
+  If you have pre-downloaded the `LibriSpeech `_
+  dataset and the `musan `_ dataset, say,
+  they are saved in ``/tmp/LibriSpeech`` and ``/tmp/musan``, you can modify
+  the ``dl_dir`` variable in ``./prepare.sh`` to point to ``/tmp`` so that
+  ``./prepare.sh`` won't re-download them.
+
+.. NOTE::
+
+  All files generated by ``./prepare.sh``, e.g., features, lexicon, etc.,
+  are saved in the ``./data`` directory.
+
+We provide the following YouTube video showing how to run ``./prepare.sh``.
+
+.. note::
+
+   To get the latest news about `next-gen Kaldi `_, please subscribe to
+   the following YouTube channel by `Nadira Povey `_:
+
+      ``_
+
+.. youtube:: ofEIoJL-mGM
+
+
+Codebook index preparation
+--------------------------
+
+Here, we prepare the necessary data for MVQ-KD. This requires the generation
+of codebook indexes (please read our `paper `_
+if you are interested in the details). In this tutorial, we use the pre-computed
+codebook indexes for convenience. The only thing you need to do is to
+run ``./distillation_with_hubert.sh``.
+
+.. note::
+
+   There are 5 stages in total. The first and second stages will be skipped
+   automatically if you choose to download the codebook indexes prepared by
+   `icefall`_. Of course, you can also extract and compute the codebook
+   indexes yourself. This requires you to download a HuBERT-XL model, and
+   extracting the codebook indexes can take a while.
+
+As usual, you can control the stages you want to run by specifying the following
+two options:
+
+  - ``--stage``
+  - ``--stop-stage``
+
+For example,
+
+.. code-block:: bash
+
+  $ cd egs/librispeech/ASR
+  $ ./distillation_with_hubert.sh --stage 0 --stop-stage 0 # run only stage 0
+  $ ./distillation_with_hubert.sh --stage 2 --stop-stage 4 # run from stage 2 to stage 4
+
+Here are a few options in ``./distillation_with_hubert.sh`` that you should
+know about before you proceed.
+
+- ``--full_libri`` If True, use the full 960h of training data. Otherwise, only
+  ``train-clean-100`` will be used.
+- ``--use_extracted_codebook`` If True, the first two stages will be skipped and the codebook
+  indexes uploaded by us will be downloaded.
+
+Since we are using the pre-computed codebook indexes, we set
+``use_extracted_codebook=True``. If you want to do full `LibriSpeech`_
+experiments, please set ``full_libri=True``.
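+
+For example, a full-data run that reuses the released codebook indexes could be
+launched as sketched below. This is only a sketch: it assumes that
+``./distillation_with_hubert.sh`` accepts ``--full_libri`` and
+``--use_extracted_codebook`` on the command line, in the same way as the
+``--stage`` and ``--stop-stage`` options shown above.
+
+.. code-block:: bash
+
+  $ cd egs/librispeech/ASR
+  # Hypothetical invocation: use all 960h of data and the pre-computed
+  # codebook indexes, requesting all 5 stages (the first two are skipped
+  # automatically because the extracted codebooks are reused).
+  $ ./distillation_with_hubert.sh \
+      --full_libri True \
+      --use_extracted_codebook True \
+      --stage 0 --stop-stage 4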
+
+The following command downloads the pre-computed codebook indexes
+and prepares MVQ-augmented training manifests.
+
+.. code-block:: bash
+
+  $ ./distillation_with_hubert.sh --stage 2 --stop-stage 2 # run only stage 2
+
+Please see the following screenshot for the output of an example execution.
+
+.. figure:: ./images/distillation_codebook.png
+   :width: 800
+   :alt: Downloading codebook indexes and preparing training manifest.
+   :align: center
+
+   Downloading codebook indexes and preparing training manifest.
+
+.. hint::
+
+   The codebook indexes we prepared for you in this tutorial
+   are extracted from the 36-th layer of a fine-tuned HuBERT-XL model
+   with 8 codebooks. If you want to try other configurations, please
+   set ``use_extracted_codebook=False`` and set ``embedding_layer`` and
+   ``num_codebooks`` by yourself.
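+
+As a concrete illustration, a hypothetical custom configuration could look like
+the sketch below. The variable names (``use_extracted_codebook``,
+``embedding_layer``, ``num_codebooks``) are the ones that appear in the stage-2
+block of ``./distillation_with_hubert.sh``; the values shown here (layer 24,
+16 codebooks) are made up for illustration and have not been validated.
+
+.. code-block:: bash
+
+  # Inside ./distillation_with_hubert.sh (stage 2) -- hypothetical values:
+  use_extracted_codebook=False  # extract the codebook indexes yourself
+  embedding_layer=24            # use the 24-th layer of the teacher model
+  num_codebooks=16              # train the quantizer with 16 codebooks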
+
+Now, you should see the following files under the directory ``./data/vq_fbank_layer36_cb8``.
+
+.. figure:: ./images/distillation_directory.png
+   :width: 800
+   :alt: MVQ-augmented training manifests
+   :align: center
+
+   MVQ-augmented training manifests.
+
+Voila! You are now ready to perform knowledge distillation training!
+
+Training
+--------
+
+To perform training, please run stage 3 by executing the following command.
+
+.. code-block:: bash
+
+  $ ./distillation_with_hubert.sh --stage 3 --stop-stage 3 # run MVQ training
+
+Here is the code snippet for training:
+
+.. code-block:: bash
+
+  WORLD_SIZE=$(echo ${CUDA_VISIBLE_DEVICES} | awk '{n=split($1, _, ","); print n}')
+
+  ./pruned_transducer_stateless6/train.py \
+    --manifest-dir ./data/vq_fbank_layer36_cb8 \
+    --master-port 12359 \
+    --full-libri $full_libri \
+    --spec-aug-time-warp-factor -1 \
+    --max-duration 300 \
+    --world-size ${WORLD_SIZE} \
+    --num-epochs 30 \
+    --exp-dir $exp_dir \
+    --enable-distillation True \
+    --codebook-loss-scale 0.01
+
+There are a few arguments in the above training command that deserve your
+attention (``$full_libri`` and ``$exp_dir`` are shell variables defined in
+``./distillation_with_hubert.sh``).
+
+  - ``--enable-distillation`` If True, knowledge distillation training is enabled.
+  - ``--codebook-loss-scale`` The scale of the knowledge distillation loss.
+  - ``--manifest-dir`` The path to the MVQ-augmented manifest.
+
+
+Decoding
+--------
+
+After training finishes, you can test the performance using
+the following command.
+
+.. code-block:: bash
+
+  export CUDA_VISIBLE_DEVICES=0
+  ./pruned_transducer_stateless6/decode.py \
+    --decoding-method "modified_beam_search" \
+    --epoch 30 \
+    --avg 10 \
+    --max-duration 200 \
+    --exp-dir $exp_dir \
+    --enable-distillation True
+
+You should get results similar to `here `_.
+
+That's all! Feel free to experiment with your own setups and report your results.
+If you encounter any problems during training, please open an issue `here `_.
+
diff --git a/docs/source/recipes/Non-streaming-ASR/librispeech/images/distillation_codebook.png b/docs/source/recipes/Non-streaming-ASR/librispeech/images/distillation_codebook.png
new file mode 100644
index 000000000..1a40d6c6e
Binary files /dev/null and b/docs/source/recipes/Non-streaming-ASR/librispeech/images/distillation_codebook.png differ
diff --git a/docs/source/recipes/Non-streaming-ASR/librispeech/images/distillation_directory.png b/docs/source/recipes/Non-streaming-ASR/librispeech/images/distillation_directory.png
new file mode 100644
index 000000000..30763046f
Binary files /dev/null and b/docs/source/recipes/Non-streaming-ASR/librispeech/images/distillation_directory.png differ
diff --git a/docs/source/recipes/Non-streaming-ASR/librispeech/index.rst b/docs/source/recipes/Non-streaming-ASR/librispeech/index.rst
index 3ebb36b25..bf439861a 100644
--- a/docs/source/recipes/Non-streaming-ASR/librispeech/index.rst
+++ b/docs/source/recipes/Non-streaming-ASR/librispeech/index.rst
@@ -9,3 +9,4 @@ LibriSpeech
    pruned_transducer_stateless
    zipformer_mmi
    zipformer_ctc_blankskip
+   distillation
diff --git a/egs/librispeech/ASR/distillation_with_hubert.sh b/egs/librispeech/ASR/distillation_with_hubert.sh
index a38cf590c..6aaa0333b 100755
--- a/egs/librispeech/ASR/distillation_with_hubert.sh
+++ b/egs/librispeech/ASR/distillation_with_hubert.sh
@@ -150,7 +150,7 @@ if [ $stage -le 2 ] && [ $stop_stage -ge 2 ]; then
   num_codebooks=8
 
   mkdir -p $exp_dir/vq
-  codebook_dir=$exp_dir/vq/${teacher_model_id}_layer${embedding_layer}_cb${num_codebooks}
+  codebook_dir=$exp_dir/vq/${teacher_model_id}
   mkdir -p codebook_dir
   codebook_download_dir=$exp_dir/download_codebook
   if [ -d $codebook_download_dir ]; then
@@ -180,9 +180,9 @@ if [ $stage -le 2 ] && [ $stop_stage -ge 2 ]; then
   ./pruned_transducer_stateless6/extract_codebook_index.py \
     --full-libri $full_libri \
     --exp-dir $exp_dir \
-    --embedding-layer 36 \
+    --embedding-layer $embedding_layer \
     --num-utts 1000 \
-    --num-codebooks 8 \
+    --num-codebooks $num_codebooks \
     --max-duration 100 \
     --teacher-model-id $teacher_model_id \
     --use-extracted-codebook $use_extracted_codebook