## Results

### Zipformer

|                                    | dev        | test       | comment                                  |
|------------------------------------|------------|------------|------------------------------------------|
| modified beam search               | 21.87      | 29.04      | --epoch 25, --avg 5, --max-duration 500  |

The training command:

```
export CUDA_VISIBLE_DEVICES="0,1,2,3"

./zipformer/train.py \
  --world-size 4 \
  --num-epochs 25 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp-asr-seame \
  --causal 0 \
  --num-encoder-layers 2,2,2,2,2,2 \
  --feedforward-dim 512,768,1024,1024,1024,768 \
  --encoder-dim 192,256,256,256,256,256 \
  --encoder-unmasked-dim 192,192,192,192,192,192 \
  --prune-range 10 \
  --max-duration 500
```

The decoding command:

```
./zipformer/decode.py \
  --epoch 25 \
  --avg 5 \
  --beam-size 10 \
  --exp-dir ./zipformer/exp-asr-seame \
  --max-duration 800 \
  --num-encoder-layers 2,2,2,2,2,2 \
  --feedforward-dim 512,768,1024,1024,1024,768 \
  --encoder-dim 192,256,256,256,256,256 \
  --encoder-unmasked-dim 192,192,192,192,192,192 \
  --decoding-method modified_beam_search
```

The pretrained model is available at:

### Zipformer-HAT

|                                    | dev        | test       | comment                                  |
|------------------------------------|------------|------------|------------------------------------------|
| modified beam search               | 22.00      | 29.92      | --epoch 20, --avg 5, --max-duration 500  |

The training command for reproducing is given below:

```
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5"

./zipformer_hat/train.py \
  --world-size 4 \
  --num-epochs 25 \
  --start-epoch 1 \
  --base-lr 0.045 \
  --lr-epochs 6 \
  --use-fp16 1 \
  --exp-dir zipformer_hat/exp \
  --causal 0 \
  --num-encoder-layers 2,2,2,2,2,2 \
  --feedforward-dim 512,768,1024,1024,1024,768 \
  --encoder-dim 192,256,256,256,256,256 \
  --encoder-unmasked-dim 192,192,192,192,192,192 \
  --prune-range 10 \
  --max-duration 500
```

The decoding command is:

```
## modified beam search
./zipformer_hat/decode.py \
  --epoch 25 --avg 5 --use-averaged-model True \
  --beam-size 10 \
  --causal 0 \
  --exp-dir zipformer_hat/exp \
  --bpe-model data_seame/lang_bpe_4000/bpe.model \
  --max-duration 1000 \
  --num-encoder-layers 2,2,2,2,2,2 \
  --feedforward-dim 512,768,1024,1024,1024,768 \
  --encoder-dim 192,256,256,256,256,256 \
  --encoder-unmasked-dim 192,192,192,192,192,192 \
  --decoding-method modified_beam_search
```

A pre-trained model and decoding logs can be found at

### Zipformer-HAT-LID

|                                    | dev        | test       | comment                                  |
|------------------------------------|------------|------------|------------------------------------------|
| modified beam search               | 20.04      | 26.91      | --epoch 15, --avg 5, --max-duration 500  |

The training command for reproducing is given below:

```
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5"

./zipformer_hat_lid/train.py \
  --world-size 4 \
  --lid True \
  --num-epochs 25 \
  --start-epoch 1 \
  --base-lr 0.045 \
  --use-fp16 1 \
  --lid-loss-scale 0.3 \
  --exp-dir zipformer_hat_lid/exp \
  --causal 0 \
  --lid-output-layer 3 \
  --num-encoder-layers 2,2,2,2,2,2 \
  --feedforward-dim 512,768,1024,1024,1024,768 \
  --encoder-dim 192,256,256,256,256,256 \
  --encoder-unmasked-dim 192,192,192,192,192,192 \
  --lids "," \
  --prune-range 10 \
  --freeze-main-model False \
  --use-lid-encoder True \
  --use-lid-joiner True \
  --lid-num-encoder-layers 2,2,2 \
  --lid-downsampling-factor 2,4,2 \
  --lid-feedforward-dim 256,256,256 \
  --lid-num-heads 4,4,4 \
  --lid-encoder-dim 256,256,256 \
  --lid-encoder-unmasked-dim 128,128,128 \
  --lid-cnn-module-kernel 31,15,31 \
  --max-duration 500
```

The decoding command is:

```
## modified beam search
python zipformer_hat_lid/decode.py \
  --epoch $epoch --avg 5 --use-averaged-model True \
  --beam-size 10 \
  --lid False \
  --lids "," \
  --exp-dir zipformer_hat_lid/exp \
  --bpe-model data_seame/lang_bpe_4000/bpe.model \
  --max-duration 800 \
  --num-encoder-layers 2,2,2,2,2,2 \
  --feedforward-dim 512,768,1024,1024,1024,768 \
  --encoder-dim 192,256,256,256,256,256 \
  --encoder-unmasked-dim 192,192,192,192,192,192 \
  --decoding-method modified_beam_search \
  --lid-output-layer 3 \
  --use-lid-encoder True \
  --use-lid-joiner True \
  --lid-num-encoder-layers 2,2,2 \
  --lid-downsampling-factor 2,4,2 \
  --lid-feedforward-dim 256,256,256 \
  --lid-num-heads 4,4,4 \
  --lid-encoder-dim 256,256,256 \
  --lid-encoder-unmasked-dim 128,128,128 \
  --lid-cnn-module-kernel 31,15,31
```

A pre-trained model and decoding logs can be found at
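
### Comparing the three systems

To read the three result tables side by side, the scores can be expressed as relative changes against the plain Zipformer model. The snippet below is a minimal sketch, not part of the recipe: it assumes the table scores are error rates in percent where lower is better (the tables do not name the metric), and the `results` dictionary and `relative_reduction` helper are illustrative names introduced here.

```python
# Scores copied from the result tables (modified beam search).
# Assumption: these are error rates in percent, so lower is better.
results = {
    "Zipformer": {"dev": 21.87, "test": 29.04},
    "Zipformer-HAT": {"dev": 22.00, "test": 29.92},
    "Zipformer-HAT-LID": {"dev": 20.04, "test": 26.91},
}


def relative_reduction(baseline: float, system: float) -> float:
    """Return the relative reduction in percent.

    A positive value means the system scores lower than the baseline;
    a negative value means it scores higher.
    """
    return 100.0 * (baseline - system) / baseline


baseline = results["Zipformer"]
for name, scores in results.items():
    if name == "Zipformer":
        continue  # the baseline is not compared against itself
    for split in ("dev", "test"):
        change = relative_reduction(baseline[split], scores[split])
        print(f"{name} {split}: {change:+.2f}% relative to Zipformer")
```

Note that the three rows were decoded from different epochs (25, 20, and 15 according to the table comments), so the comparison is indicative rather than strictly controlled.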