Results

Streaming Zipformer-Transducer (Pruned Stateless Transducer + Streaming Zipformer)

pruned_transducer_stateless7_streaming

Number of model parameters: 79,022,891, i.e., 79.02 M

Training on KsponSpeech (with MUSAN)

Model: johnBamma/icefall-asr-ksponspeech-pruned-transducer-stateless7-streaming-2024-06-12

The CERs (%) are:

| decoding method | chunk size | eval_clean | eval_other | comment | decoding mode |
|----------------------|------------|------------|------------|---------------------|---------------------|
| greedy search | 320ms | 10.21 | 11.07 | --epoch 30 --avg 9 | simulated streaming |
| greedy search | 320ms | 10.22 | 11.07 | --epoch 30 --avg 9 | chunk-wise |
| fast beam search | 320ms | 10.21 | 11.04 | --epoch 30 --avg 9 | simulated streaming |
| fast beam search | 320ms | 10.25 | 11.08 | --epoch 30 --avg 9 | chunk-wise |
| modified beam search | 320ms | 10.13 | 10.88 | --epoch 30 --avg 9 | simulated streaming |
| modified beam search | 320ms | 10.1 | 10.93 | --epoch 30 --avg 9 | chunk-wise |
| greedy search | 640ms | 9.94 | 10.82 | --epoch 30 --avg 9 | simulated streaming |
| greedy search | 640ms | 10.04 | 10.85 | --epoch 30 --avg 9 | chunk-wise |
| fast beam search | 640ms | 10.01 | 10.81 | --epoch 30 --avg 9 | simulated streaming |
| fast beam search | 640ms | 10.04 | 10.7 | --epoch 30 --avg 9 | chunk-wise |
| modified beam search | 640ms | 9.91 | 10.72 | --epoch 30 --avg 9 | simulated streaming |
| modified beam search | 640ms | 9.92 | 10.72 | --epoch 30 --avg 9 | chunk-wise |

Note: "simulated streaming" means feeding the full utterance to the model during decoding with decode.py, while "chunk-wise" means feeding a fixed number of frames at a time with streaming_decode.py.
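As a rough illustration of the chunk-wise mode, the sketch below counts how many chunks an utterance is split into when a fixed number of frames is fed at a time. It assumes a 10 ms frame shift (so 32 frames corresponds to a 320 ms chunk) and ignores any extra context frames that streaming_decode.py may add around each chunk. The helper name num_chunks is hypothetical and not part of icefall.

```shell
#!/usr/bin/env bash
# Hypothetical helper: number of chunks needed to cover an utterance.
# $1 = utterance length in frames, $2 = chunk length in frames.
num_chunks() {
  local total_frames=$1 chunk_len=$2
  # Ceiling division: the last chunk may be only partly filled.
  echo $(( (total_frames + chunk_len - 1) / chunk_len ))
}

# A 10-second utterance is 1000 frames, assuming a 10 ms frame shift.
num_chunks 1000 32   # 320 ms chunks
num_chunks 1000 64   # 640 ms chunks
```

In simulated streaming the same utterance is passed to the model in one call, so the chunk size only controls how the model processes the input internally.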

The training command is:

./pruned_transducer_stateless7_streaming/train.py \
    --world-size 4 \
    --num-epochs 30 \
    --start-epoch 1 \
    --use-fp16 1 \
    --exp-dir pruned_transducer_stateless7_streaming/exp \
    --max-duration 750 \
    --enable-musan True
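
For a sense of scale, the flags above bound the amount of audio consumed per optimizer step. This is a back-of-the-envelope sketch that assumes --max-duration is the per-GPU cap on batch audio in seconds and that each of the --world-size processes draws its own batch; actual batches can be smaller than the cap.

```shell
#!/usr/bin/env bash
# Assumption: --max-duration caps the seconds of audio per batch on each GPU.
world_size=4
max_duration=750
# Upper bound on seconds of audio consumed per optimizer step across all GPUs.
audio_per_step=$(( world_size * max_duration ))
echo "at most ${audio_per_step} s (~$(( audio_per_step / 60 )) min) of audio per step"
```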

The simulated streaming decoding command (e.g., chunk-size=320ms) is:

for m in greedy_search fast_beam_search modified_beam_search; do
  ./pruned_transducer_stateless7_streaming/decode.py \
    --epoch 30 \
    --avg 9 \
    --exp-dir ./pruned_transducer_stateless7_streaming/exp \
    --max-duration 600 \
    --decode-chunk-len 32 \
    --decoding-method $m
done
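
The results table also reports 640 ms chunks, while the command above shows only the 320 ms setting. A minimal sketch of a sweep over both chunk sizes follows. It assumes --decode-chunk-len counts 10 ms frames (consistent with 32 for 320 ms above, which would give 64 for 640 ms), and it only prints the commands instead of running them. The function name print_decode_cmds is hypothetical.

```shell
#!/usr/bin/env bash
# Dry run: print one decode.py command per chunk size and decoding method.
# Assumption: --decode-chunk-len is in 10 ms frames, so 640 ms -> 64.
print_decode_cmds() {
  local chunk_ms m
  for chunk_ms in 320 640; do
    for m in greedy_search fast_beam_search modified_beam_search; do
      echo ./pruned_transducer_stateless7_streaming/decode.py \
        --epoch 30 \
        --avg 9 \
        --exp-dir ./pruned_transducer_stateless7_streaming/exp \
        --max-duration 600 \
        --decode-chunk-len $(( chunk_ms / 10 )) \
        --decoding-method "$m"
    done
  done
}

print_decode_cmds
```

Dropping the echo would execute each command; checking the printed list first is a cheap way to confirm the chunk lengths before starting a long decoding run.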

The chunk-wise streaming decoding command (e.g., chunk-size=320ms) is:

for m in greedy_search modified_beam_search fast_beam_search; do
  ./pruned_transducer_stateless7_streaming/streaming_decode.py \
    --epoch 30 \
    --avg 9 \
    --exp-dir ./pruned_transducer_stateless7_streaming/exp \
    --decoding-method $m \
    --decode-chunk-len 32 \
    --num-decode-streams 2000
done

zipformer (Zipformer + pruned stateless transducer)

zipformer

Number of model parameters: 74,778,511, i.e., 74.78 M

Training on KsponSpeech (with MUSAN)

Model: johnBamma/icefall-asr-ksponspeech-zipformer-2024-06-24

The CERs (%) are:

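| decoding method | eval_clean | eval_other | comment |
|----------------------|------------|------------|---------------------|
| greedy search | 10.60 | 11.56 | --epoch 30 --avg 9 |
| fast beam search | 10.59 | 11.54 | --epoch 30 --avg 9 |
| modified beam search | 10.35 | 11.35 | --epoch 30 --avg 9 |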
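
To put the gap between decoding methods in relative terms, the sketch below computes the relative CER reduction between two rows. The helper rel_reduction is hypothetical; the input values are copied from the greedy search and modified beam search rows above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: relative CER reduction (in percent) from $1 to $2.
rel_reduction() {
  awk -v base="$1" -v cur="$2" 'BEGIN { printf "%.2f\n", (base - cur) / base * 100 }'
}

# greedy search vs. modified beam search
rel_reduction 10.60 10.35   # eval_clean
rel_reduction 11.56 11.35   # eval_other
```

For these rows the script prints 2.36 and 1.82, i.e., modified beam search lowers the CER by roughly 2% relative on both test sets.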

The training command is:

./zipformer/train.py \
    --world-size 4 \
    --num-epochs 30 \
    --start-epoch 1 \
    --use-fp16 1 \
    --exp-dir zipformer/exp \
    --max-duration 750 \
    --enable-musan True \
    --base-lr 0.035

NOTICE: I decreased base_lr from 0.045 (the default) to 0.035 because training with the default value failed with "RuntimeError: grad_scale is too small".

The decoding command is:

for m in greedy_search fast_beam_search modified_beam_search; do
    ./zipformer/decode.py \
        --epoch 30 \
        --avg 9 \
        --exp-dir zipformer/exp \
        --decoding-method $m
done