Results
Streaming Zipformer-Transducer (Pruned Stateless Transducer + Streaming Zipformer)
pruned_transducer_stateless7_streaming
Number of model parameters: 79,022,891, i.e., 79.02 M
Training on KsponSpeech (with MUSAN)
Model: johnBamma/icefall-asr-ksponspeech-pruned-transducer-stateless7-streaming-2024-06-12
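The pre-trained model is hosted on Hugging Face; if you want to try it without training, it can be fetched with git-lfs (a sketch, assuming git-lfs is installed and the repository can be cloned like other icefall releases):
# Clone the released checkpoint repository (requires git-lfs for the large model files).
git lfs install
git clone https://huggingface.co/johnBamma/icefall-asr-ksponspeech-pruned-transducer-stateless7-streaming-2024-06-12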
The CERs are:
decoding method | chunk size | eval_clean | eval_other | comment | decoding mode |
---|---|---|---|---|---|
greedy search | 320ms | 10.21 | 11.07 | --epoch 30 --avg 9 | simulated streaming |
greedy search | 320ms | 10.22 | 11.07 | --epoch 30 --avg 9 | chunk-wise |
fast beam search | 320ms | 10.21 | 11.04 | --epoch 30 --avg 9 | simulated streaming |
fast beam search | 320ms | 10.25 | 11.08 | --epoch 30 --avg 9 | chunk-wise |
modified beam search | 320ms | 10.13 | 10.88 | --epoch 30 --avg 9 | simulated streaming |
modified beam search | 320ms | 10.10 | 10.93 | --epoch 30 --avg 9 | chunk-wise |
greedy search | 640ms | 9.94 | 10.82 | --epoch 30 --avg 9 | simulated streaming |
greedy search | 640ms | 10.04 | 10.85 | --epoch 30 --avg 9 | chunk-wise |
fast beam search | 640ms | 10.01 | 10.81 | --epoch 30 --avg 9 | simulated streaming |
fast beam search | 640ms | 10.04 | 10.70 | --epoch 30 --avg 9 | chunk-wise |
modified beam search | 640ms | 9.91 | 10.72 | --epoch 30 --avg 9 | simulated streaming |
modified beam search | 640ms | 9.92 | 10.72 | --epoch 30 --avg 9 | chunk-wise |
Note: simulated streaming indicates feeding the full utterance during decoding using decode.py, while chunk-wise indicates feeding a certain number of frames at a time using streaming_decode.py.
The training command is:
./pruned_transducer_stateless7_streaming/train.py \
--world-size 4 \
--num-epochs 30 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir pruned_transducer_stateless7_streaming/exp \
--max-duration 750 \
--enable-musan True
The simulated streaming decoding command (e.g., chunk-size=320ms) is:
for m in greedy_search fast_beam_search modified_beam_search; do
./pruned_transducer_stateless7_streaming/decode.py \
--epoch 30 \
--avg 9 \
--exp-dir ./pruned_transducer_stateless7_streaming/exp \
--max-duration 600 \
--decode-chunk-len 32 \
--decoding-method $m
done
The streaming chunk-wise decoding command (e.g., chunk-size=320ms) is:
for m in greedy_search modified_beam_search fast_beam_search; do
./pruned_transducer_stateless7_streaming/streaming_decode.py \
--epoch 30 \
--avg 9 \
--exp-dir ./pruned_transducer_stateless7_streaming/exp \
--decoding-method $m \
--decode-chunk-len 32 \
--num-decode-streams 2000
done
zipformer (Zipformer + pruned stateless transducer)
zipformer
Number of model parameters: 74,778,511, i.e., 74.78 M
Training on KsponSpeech (with MUSAN)
Model: johnBamma/icefall-asr-ksponspeech-zipformer-2024-06-24
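As above, the pre-trained checkpoint can be fetched with git-lfs (a sketch, assuming the same Hugging Face hosting):
# Clone the released zipformer checkpoint repository.
git lfs install
git clone https://huggingface.co/johnBamma/icefall-asr-ksponspeech-zipformer-2024-06-24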
The CERs are:
decoding method | eval_clean | eval_other | comment |
---|---|---|---|
greedy search | 10.60 | 11.56 | --epoch 30 --avg 9 |
fast beam search | 10.59 | 11.54 | --epoch 30 --avg 9 |
modified beam search | 10.35 | 11.35 | --epoch 30 --avg 9 |
The training command is:
./zipformer/train.py \
--world-size 4 \
--num-epochs 30 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir zipformer/exp \
--max-duration 750 \
--enable-musan True \
--base-lr 0.035
Note: I decreased base-lr from 0.045 (the default) to 0.035 because training failed with "RuntimeError: grad_scale is too small".
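If the same error interrupts a run partway through, one option (a sketch, not part of the recipe; the resume epoch below is hypothetical) is to restart from the last saved checkpoint with the reduced learning rate:
# Hypothetical example: resume training with the lowered base-lr.
# --start-epoch 16 loads the checkpoint saved at the end of epoch 15.
./zipformer/train.py \
  --world-size 4 \
  --num-epochs 30 \
  --start-epoch 16 \
  --use-fp16 1 \
  --exp-dir zipformer/exp \
  --max-duration 750 \
  --enable-musan True \
  --base-lr 0.035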
The decoding command is:
for m in greedy_search fast_beam_search modified_beam_search; do
./zipformer/decode.py \
--epoch 30 \
--avg 9 \
--exp-dir zipformer/exp \
--decoding-method $m
done