## Introduction
This is a pseudo-labeling based semi-supervised ASR recipe for the LibriSpeech dataset. The ASR model is a Zipformer transducer. The labeled data is LibriSpeech train-clean-100. The unlabeled data is either LibriSpeech "train-clean-360 + train-other-500" for conventional semi-supervised learning, or the TedLium3 training set for unsupervised domain adaptation.
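At a high level, pseudo-labeling alternates between transcribing the unlabeled audio with a teacher model and training the student on those transcripts together with the labeled data. A minimal sketch of one such training step (`transcribe` and `loss` are hypothetical stand-ins, not the actual API of this recipe):

```python
import torch

def pseudo_label_step(student, teacher, labeled_batch, unlabeled_batch, optimizer):
    """One illustrative pseudo-labeling step; `transcribe` and `loss` are
    hypothetical stand-ins, not the actual API of train_pl.py."""
    # The teacher transcribes the unlabeled audio; its hypotheses are then
    # treated as ground-truth transcripts for the student.
    with torch.no_grad():
        pseudo_texts = teacher.transcribe(unlabeled_batch["audio"])

    # Train the student on real labels and pseudo labels together.
    loss = (
        student.loss(labeled_batch["audio"], labeled_batch["text"])
        + student.loss(unlabeled_batch["audio"], pseudo_texts)
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```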
## Description of the recipe

### Preparation of data
The data required by this recipe is the same as for the LibriSpeech and TedLium3 ASR recipes, and the LibriSpeech tokenizer is used to build the model. Therefore, we can reuse the `prepare.sh` scripts from those recipes.
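After `prepare.sh` finishes, the features and supervisions are stored as Lhotse manifests. As a quick sanity check you can inspect them (the manifest path below is an assumption based on the standard icefall layout; adjust it to your setup):

```python
from lhotse import CutSet

# Manifest name assumes the standard icefall/prepare.sh layout; adjust as needed.
cuts = CutSet.from_file("data/fbank/librispeech_cuts_train-clean-100.jsonl.gz")
cuts.describe()  # prints count and duration statistics for a quick sanity check
```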
### Supervised training for the seed ASR model
First, we perform supervised training on the LibriSpeech train-clean-100 subset to produce the seed model for the subsequent pseudo-labeling based semi-supervised training.
```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"

./zipformer/train_seed.py \
  --world-size 4 \
  --num-epochs 70 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp_seed \
  --max-duration 1000
```
For better performance of the seed model, we average the checkpoints as follows:
```bash
./zipformer/generate_averaged_model.py \
  --epoch 70 \
  --avg 30 \
  --exp-dir ./zipformer/exp_seed
```
The above command generates the final seed model `./zipformer/exp_seed/epoch-70-avg-30.pt`, i.e., an average over the last 30 checkpoints.
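Checkpoint averaging takes the element-wise mean of the model parameters across the last `--avg` checkpoints, which tends to smooth out optimization noise and improve WER. The sketch below shows the core computation only; it is not `generate_averaged_model.py` itself, and it assumes the weights are stored under a "model" key as in typical icefall checkpoints:

```python
import torch

def average_checkpoints(paths):
    """Element-wise mean of model weights over several checkpoints.

    Simplified sketch of what --epoch 70 --avg 30 achieves; it assumes the
    weights live under the "model" key and are floating-point tensors.
    """
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.clone().to(torch.float64) for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].to(torch.float64)
    return {k: (v / len(paths)).to(torch.float32) for k, v in avg.items()}
```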
### Semi-supervised training for the final ASR model
Then, we perform semi-supervised training with the seed model as the initialization; a simplified sketch of how the seed checkpoint is loaded follows the two commands below.
- Conventional semi-supervised learning setting, where the unlabeled data is "train-clean-360 + train-other-500":

  ```bash
  ./zipformer/train_pl.py \
    --world-size 4 \
    --num-epochs 20 \
    --start-epoch 1 \
    --use-fp16 1 \
    --exp-dir zipformer/exp_pl_librispeech \
    --max-duration 1000 \
    --seed-model-path "zipformer/exp_seed/epoch-70-avg-30.pt" \
    --unlabeled-dataset "librispeech"
  ```
- Unsupervised domain adaptation setting, where the unlabeled data is the TedLium3 training set:

  ```bash
  ./zipformer/train_pl.py \
    --world-size 4 \
    --num-epochs 20 \
    --start-epoch 1 \
    --use-fp16 1 \
    --exp-dir zipformer/exp_pl_tedlium \
    --max-duration 1000 \
    --seed-model-path "zipformer/exp_seed/epoch-70-avg-30.pt" \
    --unlabeled-dataset "tedlium"
  ```
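Under the hood, `--seed-model-path` points `train_pl.py` at the averaged checkpoint that initializes the model before pseudo-labeling begins. Conceptually it amounts to something like the following (a simplified sketch; `init_from_seed` is illustrative, and the "model" checkpoint key is an assumption based on typical icefall checkpoints):

```python
import torch

def init_from_seed(model: torch.nn.Module, seed_path: str) -> None:
    """Copy the averaged seed weights into the model before pseudo-labeling
    starts (simplified; the "model" key matches typical icefall checkpoints)."""
    ckpt = torch.load(seed_path, map_location="cpu")
    model.load_state_dict(ckpt["model"])

# e.g. init_from_seed(model, "zipformer/exp_seed/epoch-70-avg-30.pt")
```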
### Decoding
Finally, we decode with the trained models to evaluate their word error rate (WER); a small self-contained scoring sketch follows the two commands below.
- Evaluate on the LibriSpeech dataset:

  ```bash
  ./zipformer/decode.py \
    --epoch 20 \
    --avg 10 \
    --exp-dir ./zipformer/exp_pl_librispeech \
    --max-duration 600 \
    --decoding-method modified_beam_search \
    --beam-size 4 \
    --dataset "librispeech"
  ```
- Evaluate on the TedLium3 dataset:

  ```bash
  ./zipformer/decode.py \
    --epoch 20 \
    --avg 10 \
    --exp-dir ./zipformer/exp_pl_tedlium \
    --max-duration 600 \
    --decoding-method modified_beam_search \
    --beam-size 4 \
    --dataset "tedlium"
  ```
## Results
- Conventional semi-supervised learning (labeled: LibriSpeech 100h, unlabeled: LibriSpeech 860h), WER (%):

  | Model | test-clean | test-other | Comment |
  |---|---|---|---|
  | supervised seed model | 5.45 | 13.7 | --epoch 70 --avg 30 |
  | pseudo-labeling model | 4.33 | 9.61 | --epoch 20 --avg 10 |
- Unsupervised domain adaptation (labeled: LibriSpeech 100h, unlabeled: TedLium3), WER (%):

  | Model | tedlium3 dev | tedlium3 test | Comment |
  |---|---|---|---|
  | supervised seed model | 18.29 | 18.16 | --epoch 70 --avg 30 |
  | pseudo-labeling model | 14.97 | 14.65 | --epoch 20 --avg 10 |
## Pre-trained models and logs
You can find the pre-trained models, training logs, tensorboard logs, decoding logs, and decoding results at <https://huggingface.co/zhu-han/icefall-pl-librispeech-zipformer-medium-2023-08-06>.
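If you want to pull everything down locally, `huggingface_hub` can mirror the repository (a sketch; the repo id comes from the link above, but the exact file layout inside the repo may differ):

```python
from huggingface_hub import snapshot_download

# Downloads checkpoints, logs, and decoding results to a local cache directory.
local_dir = snapshot_download(
    repo_id="zhu-han/icefall-pl-librispeech-zipformer-medium-2023-08-06"
)
print(local_dir)
```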