# Introduction

This is a pseudo-labeling based semi-supervised ASR recipe for the LibriSpeech dataset. The ASR model is a Zipformer Transducer. The labeled data is LibriSpeech train-clean-100. The unlabeled data is either LibriSpeech "train-clean-360 + train-other-500" for conventional semi-supervised learning, or the TedLium3 training set for unsupervised domain adaptation.

## Description of the recipe

### Preparation of data

This recipe requires the same data as the LibriSpeech and TedLium3 ASR recipes, and the LibriSpeech tokenizer is used to build the model. Therefore, we can reuse the `prepare.sh` scripts from those recipes.

### Supervised training for the seed ASR model

First, we perform supervised training on the LibriSpeech train-clean-100 subset to generate the seed model for the subsequent pseudo-labeling based semi-supervised training.

```
export CUDA_VISIBLE_DEVICES="0,1,2,3"
./zipformer/train_seed.py \
  --world-size 4 \
  --num-epochs 70 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp_seed \
  --max-duration 1000
```

For better performance of the seed model, we average the checkpoints as follows (a sketch of what the averaging computes is given in the appendix at the end of this README):

```
./zipformer/generate_averaged_model.py \
  --epoch 70 \
  --avg 30 \
  --exp-dir ./zipformer/exp_seed
```

The above command generates the final seed model `./zipformer/exp_seed/epoch-70-avg-30.pt`.

### Semi-supervised training for the final ASR model

Then, we perform semi-supervised training with the seed model as the initialization: the unlabeled audio is transcribed by the model and the resulting pseudo labels are used as training targets (a conceptual sketch of one such step is given in the appendix).

- Conventional semi-supervised learning setting, where the unlabeled data is "train-clean-360 + train-other-500":

```
./zipformer/train_pl.py \
  --world-size 4 \
  --num-epochs 20 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp_pl_librispeech \
  --max-duration 1000 \
  --seed-model-path "zipformer/exp_seed/epoch-70-avg-30.pt" \
  --unlabeled-dataset "librispeech"
```

- Unsupervised domain adaptation setting, where the unlabeled data is the TedLium3 training set:

```
./zipformer/train_pl.py \
  --world-size 4 \
  --num-epochs 20 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp_pl_tedlium \
  --max-duration 1000 \
  --seed-model-path "zipformer/exp_seed/epoch-70-avg-30.pt" \
  --unlabeled-dataset "tedlium"
```

### Decode

Finally, we decode with the trained model to evaluate its performance.

- Evaluate on the LibriSpeech dataset:

```
./zipformer/decode.py \
  --epoch 20 \
  --avg 10 \
  --exp-dir ./zipformer/exp_pl_librispeech \
  --max-duration 600 \
  --decoding-method modified_beam_search \
  --beam-size 4 \
  --dataset "librispeech"
```

- Evaluate on the TedLium3 dataset:

```
./zipformer/decode.py \
  --epoch 20 \
  --avg 10 \
  --exp-dir ./zipformer/exp_pl_tedlium \
  --max-duration 600 \
  --decoding-method modified_beam_search \
  --beam-size 4 \
  --dataset "tedlium"
```

## Results

- Conventional semi-supervised learning (labeled: LibriSpeech 100h, unlabeled: LibriSpeech 860h), WER (%):

| Model                 | test-clean | test-other | comment             |
|-----------------------|------------|------------|---------------------|
| supervised seed model | 5.45       | 13.7       | --epoch 70 --avg 30 |
| pseudo-labeling model | 4.33       | 9.61       | --epoch 20 --avg 10 |

- Unsupervised domain adaptation (labeled: LibriSpeech 100h, unlabeled: TedLium3), WER (%):

| Model                 | tedlium3 dev | tedlium3 test | comment             |
|-----------------------|--------------|---------------|---------------------|
| supervised seed model | 18.29        | 18.16         | --epoch 70 --avg 30 |
| pseudo-labeling model | 14.97        | 14.65         | --epoch 20 --avg 10 |

## Pre-trained models and logs

You can find the pre-trained models, training logs, tensorboard logs, decoding logs and decoding results at
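## Appendix: illustrative sketches

The two Python sketches below are for illustration only; they are not the code used in this recipe.

### Checkpoint averaging

Averaging takes the element-wise mean of the model parameters across the selected epoch checkpoints. Below is a minimal sketch, assuming each `epoch-*.pt` file stores a `"model"` state dict and that `--epoch 70 --avg 30` means averaging the last 30 epoch checkpoints; both are assumptions about what `generate_averaged_model.py` actually does:

```
# Minimal sketch of checkpoint averaging; NOT generate_averaged_model.py.
# Assumes each checkpoint file holds a dict with a "model" state dict.
import torch


def average_checkpoints(filenames):
    """Element-wise mean of the "model" state dicts in `filenames`."""
    n = len(filenames)
    first = torch.load(filenames[0], map_location="cpu")["model"]
    # Remember original dtypes; accumulate in float64 to limit
    # rounding error and avoid integer truncation for int buffers.
    dtypes = {k: v.dtype for k, v in first.items()}
    avg = {k: v.to(torch.float64) for k, v in first.items()}
    for f in filenames[1:]:
        state = torch.load(f, map_location="cpu")["model"]
        for k in avg:
            avg[k] += state[k].to(torch.float64)
    return {k: (v / n).to(dtypes[k]) for k, v in avg.items()}


# Under the assumption above, --epoch 70 --avg 30 averages epochs 41..70.
files = [f"zipformer/exp_seed/epoch-{i}.pt" for i in range(41, 71)]
# torch.save({"model": average_checkpoints(files)},
#            "zipformer/exp_seed/epoch-70-avg-30.pt")
```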
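### One pseudo-labeling step

Conceptually, each pseudo-labeling update decodes a batch of unlabeled audio with the current model and then treats the hypotheses as transcripts for an ordinary supervised update. In this sketch, `model.transcribe`, `model.transducer_loss`, and the `pl_weight` knob are hypothetical stand-ins, not the actual interface of `train_pl.py`:

```
# Illustrative pseudo-labeling step; NOT the code in train_pl.py.
# `model.transcribe` and `model.transducer_loss` are hypothetical
# stand-ins for the recipe's decoding and loss functions.
import torch


def pseudo_labeling_step(model, optimizer, labeled_batch, unlabeled_batch,
                         pl_weight=1.0):
    # 1) Decode the unlabeled audio to obtain pseudo transcripts.
    #    No gradients are needed for this teacher pass.
    model.eval()
    with torch.no_grad():
        pseudo_texts = model.transcribe(unlabeled_batch["audio"])

    # 2) Train on labeled data with ground-truth transcripts and on
    #    unlabeled data with the pseudo transcripts as targets.
    model.train()
    loss_sup = model.transducer_loss(labeled_batch["audio"],
                                     labeled_batch["text"])
    loss_pl = model.transducer_loss(unlabeled_batch["audio"], pseudo_texts)
    loss = loss_sup + pl_weight * loss_pl

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```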