2022-03-07 23:29:04 +08:00

WER with various right contexts

The related model and the decoding results/log files can be found at: https://huggingface.co/GuoLiyong/icefall_streaming_prunned_transducer_stateless/tree/main/streaming_pruned_transducer_stateless/exp

Decoding with CTC greedy search:

| right_context | 1 | 8 | 16 | 32 | 64 | full |
|---|---|---|---|---|---|---|
| latency | 0.07s | 0.35s | 0.67s | 1.31s | 2.59s | * |
| test_clean | 5.60 | 4.00 | 3.76 | 3.75 | 3.65 | 3.28 |
| +20 trailing dummy frames | 5.52 | 3.98 | 3.75 | 3.75 | 3.65 | 3.28 |
| simulate streaming with chunk_by_chunk decoding | 5.52 | 3.98 | 3.75 | 3.75 | 3.65 | 3.28 |
| test_other | 14.07 | 10.69 | 9.80 | 9.48 | 9.01 | 8.05 |
| +20 trailing dummy frames | 14.00 | 10.69 | 9.80 | 9.48 | 9.0 | 8.04 |
| simulate streaming with chunk_by_chunk decoding | 14.00 | 10.69 | 9.80 | 9.48 | 9.0 | 8.04 |

How is latency computed?

latency = (subsampling_factor * right_context + initialize_frames_need_by_subsampling_convs) * 10ms

where subsampling_factor = 4 and initialize_frames_need_by_subsampling_convs = 3.

To compute the first encoder_out frame, 7 fbank frames (subsampling_factor + initialize_frames_need_by_subsampling_convs) are needed. Once decoding has started, 4 fbank frames are needed per encoder_out frame.
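The latency row of the table can be reproduced from this formula. The snippet below is a minimal sketch; the function name `latency_seconds` and its parameter names are chosen here for illustration and are not taken from the recipe code.

```python
def latency_seconds(right_context, subsampling_factor=4,
                    init_frames=3, frame_shift_ms=10):
    """Latency in seconds for a right context given in encoder_out frames.

    init_frames stands for initialize_frames_need_by_subsampling_convs.
    """
    # Number of fbank frames that must be available before decoding a frame.
    fbank_frames = subsampling_factor * right_context + init_frames
    return fbank_frames * frame_shift_ms / 1000.0


for rc in (1, 8, 16, 32, 64):
    print(f"right_context={rc:2d} latency={latency_seconds(rc):.2f}s")
```

With the defaults above, the printed values match the latency row (0.07s, 0.35s, 0.67s, 1.31s, 2.59s).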

Why do trailing dummy frames help?

Since 4 fbank frames are needed per encoder_out frame, suppose only 3 (or 2, or 1) frames are left after a decoding step. No encoder_out frame will correspond to these leftover frames, which may result in substitution or deletion errors at the end of the utterance. Padding some dummy frames on the right mitigates this problem to some extent.
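The leftover-frame effect can be illustrated with a small frame-counting sketch. It assumes only the simplified model described above (7 fbank frames for the first encoder_out frame, 4 for each further one) and ignores everything else about the encoder; the function names are hypothetical.

```python
def num_encoder_frames(num_fbank_frames, subsampling_factor=4, init_frames=3):
    """Encoder_out frames produced from a given number of fbank frames."""
    if num_fbank_frames < subsampling_factor + init_frames:
        return 0  # not enough input for even one encoder_out frame
    return (num_fbank_frames - init_frames) // subsampling_factor


def unconsumed_real_frames(num_real_frames, num_pad_frames=0,
                           subsampling_factor=4, init_frames=3):
    """Real (non-padding) fbank frames not covered by any encoder_out frame."""
    n = num_encoder_frames(num_real_frames + num_pad_frames,
                           subsampling_factor, init_frames)
    if n == 0:
        return num_real_frames
    consumed = init_frames + subsampling_factor * n
    return max(0, num_real_frames - consumed)


for t in (99, 100, 101, 102):
    print(t, unconsumed_real_frames(t), unconsumed_real_frames(t, 20))
```

In this model, an utterance of 102 frames leaves 3 real frames uncovered without padding, while with 20 dummy frames appended the uncovered frames are all padding.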

Some example results:

| padding 0 frames | padding 20 frames |
|---|---|
| WITH ONE JUMP (ANDERS->ANDREWS) GOT OUT OF HIS (CHAIR->CHA) | WITH ONE JUMP (ANDERS->ANDREWS) GOT OUT OF HIS CHAIR |
| COME WE'LL HAVE OUR COFFEE IN THE OTHER ROOM AND YOU CAN (SMOKE->SMO) | COME WE'LL HAVE OUR COFFEE IN THE OTHER ROOM AND YOU CAN SMOKE |
| THINKING OF ALL THIS I WENT TO (SLEEP->SLEE) | THINKING OF ALL THIS I WENT TO SLEEP |
| STEAM UP AND CANVAS SPREAD THE SCHOONER STARTED (EASTWARDS->EASTWARD) | STEAM UP AND CANVAS SPREAD THE SCHOONER STARTED EASTWARDS |

Final WERs and detailed error counts:

| padding | WER | ins | del | sub |
|---|---|---|---|---|
| 0 frames | 5.60 | 329 | 283 | 2332 |
| 20 frames | 5.52 | 329 | 282 | 2291 |

Raw log output for the previous table:

padding 0 frames:
%WER = 5.60 
Errors: 329 insertions, 283 deletions, 2332 substitutions, over 52576 reference words (49961 correct)

padding 20 frames:
%WER = 5.52
Errors: 329 insertions, 282 deletions, 2291 substitutions, over 52576 reference words (50003 correct)
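The WER figures follow directly from the error counts in the logs. The snippet below is a minimal sketch using the standard definition WER = (insertions + deletions + substitutions) / reference words; the function names are chosen here for illustration.

```python
def wer_percent(ins, dels, subs, ref_words):
    """Word error rate in percent from raw error counts."""
    return 100.0 * (ins + dels + subs) / ref_words


def correct_words(dels, subs, ref_words):
    """Reference words recognized correctly (insertions do not affect this)."""
    return ref_words - dels - subs


REF_WORDS = 52576
print(f"padding 0 frames:  {wer_percent(329, 283, 2332, REF_WORDS):.2f}")
print(f"padding 20 frames: {wer_percent(329, 282, 2291, REF_WORDS):.2f}")
```

Padding removes 1 deletion and 41 substitutions out of 52576 reference words, which accounts for the drop from 5.60 to 5.52.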