CSJ pruned_transducer_stateless7_streaming (#892)

2023-02-13 22:19:50 +08:00

Results

Streaming Zipformer-Transducer (Pruned Stateless Transducer + Streaming Zipformer)

pruned_transducer_stateless7_streaming

See https://github.com/k2-fsa/icefall/pull/892 for more details.

You can find a pretrained model, training logs, decoding logs, and decoding results at: https://huggingface.co/TeoWenShen/icefall-asr-csj-pruned-transducer-stateless7-streaming-230208

Number of model parameters: 75688409, i.e. 75.7M.

training on disfluent transcript

The CERs are:

decoding method	chunk size	eval1	eval2	eval3	excluded	valid	average	decoding mode
fast beam search	320ms	5.39	4.08	4.16	5.4	5.02	--epoch 30 --avg 17	simulated streaming
fast beam search	320ms	5.34	4.1	4.26	5.61	4.91	--epoch 30 --avg 17	chunk-wise
greedy search	320ms	5.43	4.14	4.31	5.48	4.88	--epoch 30 --avg 17	simulated streaming
greedy search	320ms	5.44	4.14	4.39	5.7	4.98	--epoch 30 --avg 17	chunk-wise
modified beam search	320ms	5.2	3.95	4.09	5.12	4.75	--epoch 30 --avg 17	simulated streaming
modified beam search	320ms	5.18	4.07	4.12	5.36	4.77	--epoch 30 --avg 17	chunk-wise
fast beam search	640ms	5.01	3.78	3.96	4.85	4.6	--epoch 30 --avg 17	simulated streaming
fast beam search	640ms	4.97	3.88	3.96	4.91	4.61	--epoch 30 --avg 17	chunk-wise
greedy search	640ms	5.02	3.84	4.14	5.02	4.59	--epoch 30 --avg 17	simulated streaming
greedy search	640ms	5.32	4.22	4.33	5.39	4.99	--epoch 30 --avg 17	chunk-wise
modified beam search	640ms	4.78	3.66	3.85	4.72	4.42	--epoch 30 --avg 17	simulated streaming
modified beam search	640ms	5.77	4.72	4.73	5.85	5.36	--epoch 30 --avg 17	chunk-wise

Note: "simulated streaming" means the full utterance is fed to the model during decoding using decode.py, while "chunk-wise" means a fixed number of frames is fed at a time using streaming_decode.py.
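
For readers unfamiliar with the metric, the sketch below shows one common way to compute a character error rate: the character-level edit distance summed over all utterances, divided by the total number of reference characters. It is an illustration only, not the scoring code used by this recipe; the function names `edit_distance` and `cer` are hypothetical, and any text normalization the recipe applies before scoring is not modeled here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] is the distance between the previous prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,  # deletion: reference character missing from hyp
                    curr[j - 1] + 1,  # insertion: extra character in hyp
                    prev[j - 1] + (r != h),  # substitution, or match at no cost
                )
            )
        prev = curr
    return prev[-1]


def cer(refs, hyps) -> float:
    """Corpus-level CER in percent: total edits / total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return 100.0 * errors / total


# One substituted character out of five reference characters.
print(cer(["こんにちは"], ["こんにちわ"]))
```

Summing errors over the whole test set before dividing weights long utterances more heavily than averaging per-utterance rates would.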

The training command was:

./pruned_transducer_stateless7_streaming/train.py \
  --feedforward-dims  "1024,1024,2048,2048,1024" \
  --world-size 8 \
  --num-epochs 30 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30 \
  --max-duration 375 \
  --transcript-mode disfluent \
  --lang data/lang_char \
  --manifest-dir /mnt/host/corpus/csj/fbank \
  --pad-feature 30 \
  --musan-dir /mnt/host/corpus/musan/musan/fbank

The simulated streaming decoding command was:

for chunk in 64 32; do
    for m in greedy_search fast_beam_search modified_beam_search; do
        python pruned_transducer_stateless7_streaming/decode.py \
            --feedforward-dims  "1024,1024,2048,2048,1024" \
            --exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30 \
            --epoch 30 \
            --avg 17 \
            --max-duration 350 \
            --decoding-method $m \
            --manifest-dir /mnt/host/corpus/csj/fbank \
            --lang data/lang_char \
            --transcript-mode disfluent \
            --res-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30/github/sim_"$chunk"_"$m" \
            --decode-chunk-len $chunk \
            --pad-feature 30 \
            --gpu 0
    done
done
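
The loop values 64 and 32 passed to `--decode-chunk-len` appear to correspond to the 640ms and 320ms rows of the table. The following minimal sketch makes that mapping explicit, assuming a 10 ms feature frame shift; the frame shift is not stated in this file, and the helper name `chunk_len_to_ms` is hypothetical.

```python
# Assumption: features use a 10 ms frame shift (not stated in this file).
FRAME_SHIFT_MS = 10


def chunk_len_to_ms(decode_chunk_len: int, frame_shift_ms: int = FRAME_SHIFT_MS) -> int:
    """Convert a chunk length in feature frames to a duration in milliseconds."""
    return decode_chunk_len * frame_shift_ms


for frames in (64, 32):
    print(f"--decode-chunk-len {frames} -> {chunk_len_to_ms(frames)}ms")
```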

The streaming chunk-wise decoding command was:

for chunk in 64 32; do
    for m in greedy_search fast_beam_search modified_beam_search; do
        python pruned_transducer_stateless7_streaming/streaming_decode.py \
            --feedforward-dims  "1024,1024,2048,2048,1024" \
            --exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30 \
            --epoch 30 \
            --avg 17 \
            --max-duration 350 \
            --decoding-method $m \
            --manifest-dir /mnt/host/corpus/csj/fbank \
            --lang data/lang_char \
            --transcript-mode disfluent \
            --res-dir pruned_transducer_stateless7_streaming/exp_disfluent_2_pad30/github/stream_"$chunk"_"$m" \
            --decode-chunk-len $chunk \
            --gpu 2 \
            --num-decode-streams 40
    done
done
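
To illustrate what "a fixed number of frames at a time" means in chunk-wise mode, the sketch below slices one utterance into consecutive chunks. It is a simplified illustration, not the logic of streaming_decode.py: the helper `iter_chunks` is hypothetical, and a real streaming decoder would presumably also carry model state between chunks and batch several streams together (see `--num-decode-streams`).

```python
from typing import Iterator, Sequence


def iter_chunks(frames: Sequence, chunk_len: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most chunk_len frames."""
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    for start in range(0, len(frames), chunk_len):
        yield frames[start:start + chunk_len]


# Stand-in for the feature frames of one utterance (150 placeholder frames).
utterance = list(range(150))
chunk_sizes = [len(chunk) for chunk in iter_chunks(utterance, 64)]
print(chunk_sizes)  # [64, 64, 22]
```

The last chunk is shorter than the others whenever the utterance length is not a multiple of the chunk length; how the recipe pads or handles that tail is not covered here.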

training on fluent transcript