WER with various right contexts
Related models and decoding results/log files can be found at: https://huggingface.co/GuoLiyong/icefall_streaming_prunned_transducer_stateless/tree/main/streaming_pruned_transducer_stateless/exp
Decoding with CTC greedy search:
right_context | 1 | 8 | 16 | 32 | 64 | full |
---|---|---|---|---|---|---|
latency | 0.07s | 0.35s | 0.67s | 1.31s | 2.59s | * |
test_clean | 5.60 | 4.00 | 3.76 | 3.75 | 3.65 | 3.28 |
+20 trailing dummy frames | 5.52 | 3.98 | 3.75 | 3.75 | 3.65 | 3.28 |
simulate streaming with chunk_by_chunk decoding | 5.52 | 3.98 | 3.75 | 3.75 | 3.65 | 3.28 |
test_other | 14.07 | 10.69 | 9.80 | 9.48 | 9.01 | 8.05 |
+20 trailing dummy frames | 14.00 | 10.69 | 9.80 | 9.48 | 9.0 | 8.04 |
simulate streaming with chunk_by_chunk decoding | 14.00 | 10.69 | 9.80 | 9.48 | 9.0 | 8.04 |
How is latency computed?
latency = (subsampling_factor * right_context + initialize_frames_need_by_subsampling_convs) * 10ms
where subsampling_factor = 4 and initialize_frames_need_by_subsampling_convs = 3.
To compute the first encoder output frame, 7 fbank frames (subsampling_factor + initialize_frames_need_by_subsampling_convs) are needed. Once decoding has started, 4 fbank frames are needed per encoder_out frame.
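As a quick check, the latency formula can be evaluated for each right_context value in the table. This is only a minimal sketch: the function and constant names are made up for illustration and are not taken from the icefall code, and the 10 ms frame shift is the value stated in the formula.

```python
# Hypothetical names for illustration; values are the ones stated in the text.
SUBSAMPLING_FACTOR = 4      # fbank frames consumed per encoder_out frame
INIT_FRAMES = 3             # extra frames needed by the subsampling convs
FRAME_SHIFT_MS = 10         # one fbank frame every 10 ms


def latency_ms(right_context: int) -> int:
    """Latency in milliseconds for a given right context (in encoder frames)."""
    return (SUBSAMPLING_FACTOR * right_context + INIT_FRAMES) * FRAME_SHIFT_MS


if __name__ == "__main__":
    for rc in (1, 8, 16, 32, 64):
        print(f"right_context={rc:>2}: {latency_ms(rc) / 1000:.2f}s")
```

The printed values should match the latency row of the table (0.07s, 0.35s, 0.67s, 1.31s, 2.59s).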
Why do trailing dummy frames help?
Since 4 fbank frames are needed per encoder_out frame, suppose only 3 (or 2, or 1) frames are left after a decoding step. No encoder output frame will correspond to these remaining frames, which may result in substitution or deletion errors at the end of the utterance. Padding some dummy frames on the right solves this problem to some extent.
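The frame accounting described above can be sketched as follows. This is a simplified model built only from the numbers in the text (7 fbank frames for the first encoder output frame, 4 for each one after that); the helper names are hypothetical and the real subsampling module may count frames differently.

```python
# Simplified frame-counting model; names are hypothetical, not icefall APIs.
SUBSAMPLING_FACTOR = 4
FIRST_FRAME_COST = 7  # subsampling_factor + 3 initial frames


def num_encoder_frames(num_fbank_frames: int) -> int:
    """Encoder output frames produced from a given number of fbank frames."""
    if num_fbank_frames < FIRST_FRAME_COST:
        return 0
    return (num_fbank_frames - FIRST_FRAME_COST) // SUBSAMPLING_FACTOR + 1


def leftover_fbank_frames(num_fbank_frames: int) -> int:
    """Fbank frames at the end that produce no encoder output frame."""
    n = num_encoder_frames(num_fbank_frames)
    if n == 0:
        return num_fbank_frames
    consumed = FIRST_FRAME_COST + (n - 1) * SUBSAMPLING_FACTOR
    return num_fbank_frames - consumed


if __name__ == "__main__":
    real_frames = 102  # made-up utterance length, chosen to leave 3 frames over
    print(num_encoder_frames(real_frames), leftover_fbank_frames(real_frames))
    # With 20 dummy frames appended, extra encoder frames are produced
    # and the last real frames are no longer stranded at the tail.
    print(num_encoder_frames(real_frames + 20))
```

In this model, a 102-frame utterance leaves 3 real frames without a corresponding encoder output frame, while padding 20 dummy frames yields additional encoder frames that extend past the end of the real audio.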
Some example results:
padding 0 frames | padding 20 frames |
---|---|
WITH ONE JUMP (ANDERS->ANDREWS) GOT OUT OF HIS (CHAIR->CHA) | WITH ONE JUMP (ANDERS->ANDREWS) GOT OUT OF HIS CHAIR |
COME WE'LL HAVE OUR COFFEE IN THE OTHER ROOM AND YOU CAN (SMOKE->SMO) | COME WE'LL HAVE OUR COFFEE IN THE OTHER ROOM AND YOU CAN SMOKE |
THINKING OF ALL THIS I WENT TO (SLEEP->SLEE) | THINKING OF ALL THIS I WENT TO SLEEP |
STEAM UP AND CANVAS SPREAD THE SCHOONER STARTED (EASTWARDS->EASTWARD) | STEAM UP AND CANVAS SPREAD THE SCHOONER STARTED EASTWARDS |
Final WERs and detailed error counts:
* | wer | ins | del | sub |
---|---|---|---|---|
padding 0 | 5.60 | 329 | 283 | 2332 |
padding 20 frames | 5.52 | 329 | 282 | 2291 |
Raw logs for the table above:
padding 0 frames:
%WER = 5.60
Errors: 329 insertions, 283 deletions, 2332 substitutions, over 52576 reference words (49961 correct)
padding 20 frames:
%WER = 5.52
Errors: 329 insertions, 282 deletions, 2291 substitutions, over 52576 reference words (50003 correct)
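The %WER values in these logs are consistent with the usual definition, (insertions + deletions + substitutions) / reference words. The short sketch below recomputes them from the logged counts; the function names are illustrative and not part of any scoring tool.

```python
# Recompute WER and correct-word counts from the logged error counts.
# Function names are hypothetical; the formula is the standard WER definition.
def wer_percent(ins: int, dels: int, subs: int, ref_words: int) -> float:
    """Word error rate as a percentage of reference words."""
    return 100.0 * (ins + dels + subs) / ref_words


def correct_words(dels: int, subs: int, ref_words: int) -> int:
    """Reference words that were recognized correctly."""
    return ref_words - dels - subs


if __name__ == "__main__":
    REF_WORDS = 52576
    for label, ins, dels, subs in (
        ("padding 0 frames", 329, 283, 2332),
        ("padding 20 frames", 329, 282, 2291),
    ):
        wer = wer_percent(ins, dels, subs, REF_WORDS)
        ok = correct_words(dels, subs, REF_WORDS)
        print(f"{label}: %WER = {wer:.2f} ({ok} correct)")
```

The gain from padding comes almost entirely from substitutions (2332 down to 2291), which matches the truncated final words seen in the examples.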