
Finetune from a pre-trained Zipformer model with adapters


This tutorial shows you how to fine-tune a pre-trained Zipformer transducer model on a new dataset with adapters. Adapters are compact and efficient modules that can be integrated into a pre-trained model to improve its performance on a new domain. They are injected between modules of the well-trained neural network, and during training only the parameters in the adapters are updated. This achieves competitive performance while requiring much less GPU memory than full fine-tuning. For more details about adapters, please refer to the original paper.
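To make the idea concrete, below is a minimal sketch of a bottleneck adapter in PyTorch. It only illustrates the general technique; it is not the exact module used in the Zipformer recipe, and the layer sizes and the choice of activation are assumptions.

# An illustrative bottleneck adapter (not icefall's actual implementation).
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Down-project, apply a non-linearity, up-project, and add a residual connection."""

    def __init__(self, embed_dim: int, adapter_dim: int = 8):
        super().__init__()
        self.down = nn.Linear(embed_dim, adapter_dim)  # bottleneck down-projection
        self.activation = nn.ReLU()
        self.up = nn.Linear(adapter_dim, embed_dim)    # project back to the model dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual connection means the adapter only adds a small correction to x;
        # if the adapter branch is disabled, the input passes through unchanged.
        return x + self.up(self.activation(self.down(x)))


# Example: an adapter for a 384-dimensional encoder stream.
adapter = Adapter(embed_dim=384, adapter_dim=8)
y = adapter(torch.randn(10, 384))  # output has the same shape as the input

Because only these small projection layers are trained, the number of updated parameters is a tiny fraction of the full model.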


Hint


We assume you have read the page Installation and have set up the environment for icefall.


Hint


We recommend that you use one or more GPUs to run this recipe.


For illustration purposes, we fine-tune the Zipformer transducer model pre-trained on LibriSpeech on the small subset of GigaSpeech. You can also use your own data for fine-tuning if you create a manifest for your new dataset; a sketch of this is given in the Data preparation section below.


Data preparation


Please follow the instructions in the GigaSpeech recipe to prepare the fine-tuning data used in this tutorial. We only require the small subset of GigaSpeech for this tutorial.
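If you would rather fine-tune on your own data, the training pipeline expects lhotse manifests (CutSets). Below is a rough sketch of how such a manifest could be assembled for a single recording; the audio path, transcript, and output path are placeholders, and depending on the recipe you may also need to pre-compute fbank features for the cuts.

# A minimal sketch of building a lhotse manifest for your own data.
# "audio/utt1.wav", the transcript, and the output path are placeholders.
from lhotse import CutSet, Recording, RecordingSet, SupervisionSegment, SupervisionSet

rec = Recording.from_file("audio/utt1.wav")  # reads duration, sampling rate, etc.
sup = SupervisionSegment(
    id="utt1",
    recording_id=rec.id,
    start=0.0,
    duration=rec.duration,
    text="YOUR TRANSCRIPT",
    language="English",
)

recordings = RecordingSet.from_recordings([rec])
supervisions = SupervisionSet.from_segments([sup])
cuts = CutSet.from_manifests(recordings=recordings, supervisions=supervisions)

# Save the manifest so that it can be loaded by the data module during training.
cuts.to_file("data/my_dataset_cuts.jsonl.gz")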


Model preparation


We use the Zipformer model trained on the full LibriSpeech dataset (960 hours) as the initialization. The checkpoint of the model can be downloaded via the following commands:

$ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15
$ cd icefall-asr-librispeech-zipformer-2023-05-15/exp
$ git lfs pull --include "pretrained.pt"
$ ln -s pretrained.pt epoch-99.pt
$ cd ../data/lang_bpe_500
$ git lfs pull --include bpe.model
$ cd ../../..
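As an optional sanity check, you can load the downloaded checkpoint in Python and count its parameters. The snippet below assumes the common icefall convention of storing the weights under a "model" key; if the checkpoint is organized differently, inspect its keys first.

# Optional: sanity-check the downloaded checkpoint.
import torch

ckpt = torch.load(
    "icefall-asr-librispeech-zipformer-2023-05-15/exp/pretrained.pt",
    map_location="cpu",
)
# icefall checkpoints usually keep the weights under the "model" key;
# otherwise treat the file as a plain state_dict.
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
num_params = sum(p.numel() for p in state_dict.values() if isinstance(p, torch.Tensor))
print(f"Number of parameters in the checkpoint: {num_params}")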

Before fine-tuning, let’s test the model’s WER on the new domain. The following command performs decoding on the GigaSpeech test sets:

$ ./zipformer/decode_gigaspeech.py \
    --epoch 99 \
    --avg 1 \
    --exp-dir icefall-asr-librispeech-zipformer-2023-05-15/exp \
    --use-averaged-model 0 \
    --max-duration 1000 \
    --decoding-method greedy_search

You should see the following numbers:

For dev, WER of different settings are:
greedy_search       20.06   best for dev

For test, WER of different settings are:
greedy_search       19.27   best for test

Fine-tune with adapters


We insert 4 adapters with residual connections in each Zipformer2EncoderLayer. The original model parameters remain untouched during training; only the parameters of the adapters are updated. The following command starts a fine-tuning experiment with adapters:

$ do_finetune=1
$ use_adapters=1
$ adapter_dim=8

$ ./zipformer_adapter/train.py \
    --world-size 2 \
    --num-epochs 20 \
    --start-epoch 1 \
    --exp-dir zipformer_adapter/exp_giga_finetune_adapters${use_adapters}_adapter_dim${adapter_dim} \
    --use-fp16 1 \
    --base-lr 0.045 \
    --use-adapters $use_adapters --adapter-dim $adapter_dim \
    --bpe-model data/lang_bpe_500/bpe.model \
    --do-finetune $do_finetune \
    --master-port 13022 \
    --finetune-ckpt icefall-asr-librispeech-zipformer-2023-05-15/exp/pretrained.pt \
    --max-duration 1000

The following arguments are related to fine-tuning:

  • --do-finetune

    If True, do fine-tuning by initializing the model from a pre-trained checkpoint.
    Note that if you want to resume your fine-tuning experiment from certain epochs, you
    need to set this to False.

  • --use-adapters

    Whether adapters are used during fine-tuning.

  • --adapter-dim

    The bottleneck dimension of the adapter modules. Typically a small number.

You should notice that the total number of trainable parameters is shown in the training log:

2024-02-22 21:22:03,808 INFO [train.py:1277] A total of 761344 trainable parameters (1.148% of the whole model)

The trainable parameters make up only about 1.15% of the entire model, so training is much faster and requires less memory than full fine-tuning.
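The selection of trainable parameters happens inside the training script, but the idea can be sketched in a few lines of plain PyTorch. The name-based filter below (keeping a parameter trainable only if its name contains "adapter") is an assumption made for illustration; the actual script may use a different criterion.

# Sketch: freeze everything except adapter parameters and report the ratio.
def freeze_all_but_adapters(model) -> None:
    for name, param in model.named_parameters():
        # Assumed naming convention: adapter parameters contain "adapter" in their name.
        param.requires_grad = "adapter" in name

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(
        f"A total of {trainable} trainable parameters "
        f"({100.0 * trainable / total:.3f}% of the whole model)"
    )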


Decoding


After training, let’s test the WERs on the GigaSpeech test sets. You can execute the following command:

$ epoch=20
$ avg=10
$ use_adapters=1
$ adapter_dim=8

$ ./zipformer_adapter/decode_gigaspeech.py \
    --epoch $epoch \
    --avg $avg \
    --use-averaged-model 1 \
    --exp-dir zipformer_adapter/exp_giga_finetune_adapters${use_adapters}_adapter_dim${adapter_dim} \
    --max-duration 600 \
    --use-adapters $use_adapters \
    --adapter-dim $adapter_dim \
    --decoding-method greedy_search

You should see the following numbers:

For dev, WER of different settings are:
greedy_search       15.44   best for dev

For test, WER of different settings are:
greedy_search       15.42   best for test

The WER on the test set improves from 19.27 to 15.42 (about a 20% relative reduction), demonstrating the effectiveness of the adapters.


The same model can be used to perform decoding on the LibriSpeech test sets. You can deactivate the adapters to keep the same performance as the original model:

$ epoch=20
$ avg=1
$ use_adapters=0
$ adapter_dim=8

$ ./zipformer_adapter/decode.py \
    --epoch $epoch \
    --avg $avg \
    --use-averaged-model 1 \
    --exp-dir zipformer_adapter/exp_giga_finetune_adapters1_adapter_dim${adapter_dim} \
    --max-duration 600 \
    --use-adapters $use_adapters \
    --adapter-dim $adapter_dim \
    --decoding-method greedy_search
For test-clean, WER of different settings are:
greedy_search       2.23    best for test-clean

For test-other, WER of different settings are:
greedy_search       4.96    best for test-other

The numbers are the same as reported in icefall, so adapter-based fine-tuning is also very flexible: the same model can be used for decoding on both the original and the target domain.


Export the model


After training, the model can easily be exported to ONNX format using the following command:

$ use_adapters=1
$ adapter_dim=8

$ ./zipformer_adapter/export-onnx.py \
    --tokens icefall-asr-librispeech-zipformer-2023-05-15/data/lang_bpe_500/tokens.txt \
    --use-averaged-model 1 \
    --epoch 20 \
    --avg 10 \
    --exp-dir zipformer_adapter/exp_giga_finetune_adapters${use_adapters}_adapter_dim${adapter_dim} \
    --use-adapters $use_adapters \
    --adapter-dim $adapter_dim \
    --num-encoder-layers "2,2,3,4,3,2" \
    --downsampling-factor "1,2,4,8,4,2" \
    --feedforward-dim "512,768,1024,1536,1024,768" \
    --num-heads "4,4,4,8,4,4" \
    --encoder-dim "192,256,384,512,384,256" \
    --query-head-dim 32 \
    --value-head-dim 12 \
    --pos-head-dim 4 \
    --pos-dim 48 \
    --encoder-unmasked-dim "192,192,256,256,256,192" \
    --cnn-module-kernel "31,31,15,15,15,31" \
    --decoder-dim 512 \
    --joiner-dim 512 \
    --causal False \
    --chunk-size "16,32,64,-1" \
    --left-context-frames "64,128,256,-1"
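The export script writes .onnx files into the experiment directory. As an optional check, you can open one of them with onnxruntime and list its inputs and outputs. The file name below is only a placeholder; use the actual file names produced by export-onnx.py.

# Optional: inspect an exported ONNX model. The file name is a placeholder.
import onnxruntime as ort

session = ort.InferenceSession(
    "zipformer_adapter/exp_giga_finetune_adapters1_adapter_dim8/encoder.onnx"
)
print("inputs :", [(i.name, i.shape) for i in session.get_inputs()])
print("outputs:", [(o.name, o.shape) for o in session.get_outputs()])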