# Introduction

This is a pseudo-labeling based semi-supervised ASR recipe for the LibriSpeech dataset. The ASR model is a Zipformer transducer. The labeled data is LibriSpeech train-clean-100. The unlabeled data is either LibriSpeech "train-clean-360 + train-other-500" for conventional semi-supervised learning, or the TedLium3 training set for unsupervised domain adaptation.

## Description of the recipe

### Preparation of data

The data required by this recipe is the same as for the LibriSpeech and TedLium3 ASR recipes, and the LibriSpeech tokenizer is used to build the model. Therefore, we can reuse the `prepare.sh` scripts from those recipes, as sketched below.
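
For example, data preparation can be run from the recipe directories (a sketch, assuming the standard icefall layout where this recipe sits under `egs/librispeech/ASR` and the TedLium3 recipe under `egs/tedlium3/ASR`; see each `prepare.sh` for the exact stages and required downloads):

```
# Prepare LibriSpeech: the labeled train-clean-100 subset plus the
# unlabeled train-clean-360 / train-other-500 subsets.
cd egs/librispeech/ASR
./prepare.sh

# Prepare TedLium3 (only needed for the domain adaptation setting).
cd ../../tedlium3/ASR
./prepare.sh
```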

### Supervised training for the seed ASR model

First, we perform supervised training on the LibriSpeech train-clean-100 subset to obtain the seed model for the subsequent pseudo-labeling based semi-supervised training:
```
export CUDA_VISIBLE_DEVICES="0,1,2,3"
./zipformer/train_seed.py \
  --world-size 4 \
  --num-epochs 70 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp_seed \
  --max-duration 1000
```

For better performance of the seed model, we average the checkpoints as follows:

```
./zipformer/generate_averaged_model.py \
  --epoch 70 \
  --avg 30 \
  --exp-dir ./zipformer/exp_seed
```
The above command generates the final seed model `./zipformer/exp_seed/epoch-70-avg-30.pt`.
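
To sanity-check the seed model before the semi-supervised stage, it can be decoded in the same way as the final models (a sketch using the `decode.py` command from the Decode section below; the seed-model results in the Results section were obtained with `--epoch 70 --avg 30`):

```
./zipformer/decode.py \
  --epoch 70 \
  --avg 30 \
  --exp-dir ./zipformer/exp_seed \
  --max-duration 600 \
  --decoding-method modified_beam_search \
  --beam-size 4 \
  --dataset "librispeech"
```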
### Semi-supervised training for the final ASR model

Then, we perform semi-supervised training with the seed model as the initialization.
- Conventional semi-supervised learning setting, where the unlabeled data is "train-clean-360 + train-other-500":

```
./zipformer/train_pl.py \
  --world-size 4 \
  --num-epochs 20 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp_pl_librispeech \
  --max-duration 1000 \
  --seed-model-path "zipformer/exp_seed/epoch-70-avg-30.pt" \
  --unlabeled-dataset "librispeech"
```

- Unsupervised domain adaptation setting, where the unlabeled data is the TedLium3 training set:

```
./zipformer/train_pl.py \
  --world-size 4 \
  --num-epochs 20 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp_pl_tedlium \
  --max-duration 1000 \
  --seed-model-path "zipformer/exp_seed/epoch-70-avg-30.pt" \
  --unlabeled-dataset "tedlium"
```

### Decode

Finally, we decode with the ASR model to evaluate its performance.

- Evaluate on the LibriSpeech dataset:

```
./zipformer/decode.py \
  --epoch 20 \
  --avg 10 \
  --exp-dir ./zipformer/exp_pl_librispeech \
  --max-duration 600 \
  --decoding-method modified_beam_search \
  --beam-size 4 \
  --dataset "librispeech"
```

- Evaluate on the TedLium3 dataset:

```
./zipformer/decode.py \
  --epoch 20 \
  --avg 10 \
  --exp-dir ./zipformer/exp_pl_tedlium \
  --max-duration 600 \
  --decoding-method modified_beam_search \
  --beam-size 4 \
  --dataset "tedlium"
```
## Results
- Conventional semi-supervised learning (labeled: LibriSpeech 100h, unlabeled: LibriSpeech 860h), WER (%):

| Model                 | test-clean | test-other | comment             |
|-----------------------|------------|------------|---------------------|
| supervised seed model | 5.45       | 13.7       | --epoch 70 --avg 30 |
| pseudo-labeling model | 4.33       | 9.61       | --epoch 20 --avg 10 |

- Unsupervised domain adaptation (labeled: LibriSpeech 100h, unlabeled: TedLium3), WER (%):

| Model                 | tedlium3 dev | tedlium3 test | comment             |
|-----------------------|--------------|---------------|---------------------|
| supervised seed model | 18.29        | 18.16         | --epoch 70 --avg 30 |
| pseudo-labeling model | 14.97        | 14.65         | --epoch 20 --avg 10 |
## Pre-trained models and logs

You can find the pre-trained models, training logs, tensorboard logs, decoding logs, and decoding results at <https://huggingface.co/zhu-han/icefall-pl-librispeech-zipformer-medium-2023-08-06>.
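
For example, the repository can be downloaded with Git LFS (a sketch, assuming `git` and `git-lfs` are installed):

```
# Download the pre-trained models and logs from Hugging Face.
git lfs install
git clone https://huggingface.co/zhu-han/icefall-pl-librispeech-zipformer-medium-2023-08-06
```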