wer of streaming conformer transducer

Guo Liyong 2022-03-07 23:28:02 +08:00
parent ee359f4d13
commit 2704d589df

@ -0,0 +1,63 @@
## WER with various right context
The related model and decoding results/log files can be found at:
https://huggingface.co/GuoLiyong/icefall_streaming_prunned_transducer_stateless/tree/main/streaming_pruned_transducer_stateless/exp
Decoding with ctc greedy search:

right_context|1|8|16|32|64|full
--|--|--|--|--|--|--
latency|0.07s|0.35s|0.67s|1.31s|2.59s|*
test_clean|5.60|4.00|3.76|3.75|3.65|3.28
+20 trailing dummy frames|5.52|3.98|3.75|3.75|3.65|3.28
simulate streaming with chunk_by_chunk decoding|5.52|3.98|3.75|3.75|3.65|3.28
test_other|14.07|10.69|9.80|9.48|9.01|8.05
+20 trailing dummy frames|14.00|10.69|9.80|9.48|9.0|8.04
simulate streaming with chunk_by_chunk decoding|14.00|10.69|9.80|9.48|9.0|8.04

## How is latency computed?
latency = (subsampling_factor * right_context + initialize_frames_need_by_subsampling_convs) * 10ms
where subsampling_factor = 4 and
initialize_frames_need_by_subsampling_convs = 3.
To decode the first encoder_out frame, 7 fbank frames (= subsampling_factor + initialize_frames_need_by_subsampling_convs) are needed.
Once decoding has started, 4 fbank frames are needed per encoder_out frame.
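
As a quick check, the latency formula above can be evaluated for each right context in the results table. This is a minimal sketch: the function and constant names are made up for illustration and are not taken from the recipe code.

```python
# Constants as stated in the text above; the names are illustrative only.
SUBSAMPLING_FACTOR = 4
INIT_FRAMES_FOR_SUBSAMPLING_CONVS = 3
FRAME_SHIFT_MS = 10


def latency_ms(right_context: int) -> int:
    """Latency in milliseconds for a right context given in encoder frames."""
    fbank_frames = (
        SUBSAMPLING_FACTOR * right_context + INIT_FRAMES_FOR_SUBSAMPLING_CONVS
    )
    return fbank_frames * FRAME_SHIFT_MS


if __name__ == "__main__":
    for rc in (1, 8, 16, 32, 64):
        print(f"right_context={rc}: {latency_ms(rc) / 1000:.2f}s")
```

The printed values should agree with the latency row of the table, e.g. 0.07s for right_context=1.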
## Why do trailing dummy frames help?
Since 4 fbank frames are needed per encoder_out frame, suppose only 3 (or 2, or 1) frames are left after a decoding step.
There will be no encoder_out frame corresponding to these remaining frames.
This may result in substitution/deletion errors at the end of the utterance.
Padding some dummy frames on the right mitigates this problem to some extent.
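
The frame counting behind this can be sketched as follows. It is a simplified model built only from the numbers in this document (7 fbank frames for the first encoder_out frame, 4 more for each later one). The helper names are hypothetical, and the real subsampling module may differ in detail.

```python
SUBSAMPLING_FACTOR = 4
INIT_FRAMES_FOR_SUBSAMPLING_CONVS = 3
FIRST_FRAME_NEEDS = SUBSAMPLING_FACTOR + INIT_FRAMES_FOR_SUBSAMPLING_CONVS  # 7


def num_encoder_out_frames(num_fbank_frames: int) -> int:
    """Encoder output frames produced from num_fbank_frames input frames."""
    if num_fbank_frames < FIRST_FRAME_NEEDS:
        return 0
    return (num_fbank_frames - INIT_FRAMES_FOR_SUBSAMPLING_CONVS) // SUBSAMPLING_FACTOR


def uncovered_tail_frames(num_fbank_frames: int) -> int:
    """Trailing fbank frames that contribute to no encoder output frame."""
    if num_fbank_frames < FIRST_FRAME_NEEDS:
        return num_fbank_frames
    return (num_fbank_frames - INIT_FRAMES_FOR_SUBSAMPLING_CONVS) % SUBSAMPLING_FACTOR


if __name__ == "__main__":
    real_frames = 1002  # hypothetical utterance length in fbank frames
    padding = 20
    print(num_encoder_out_frames(real_frames), uncovered_tail_frames(real_frames))
    # With padding, the frames left over at the end are dummy frames, not speech.
    print(num_encoder_out_frames(real_frames + padding))
```

In this hypothetical 1002-frame utterance, the last 3 real frames produce no output without padding. With 20 dummy frames appended, every real frame falls inside the span of some encoder_out frame.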
### Some example results:
padding 0 frames|padding 20 frames
--|--
WITH ONE JUMP (ANDERS->ANDREWS) GOT OUT OF HIS (CHAIR->CHA)|WITH ONE JUMP (ANDERS->ANDREWS) GOT OUT OF HIS CHAIR
COME WE'LL HAVE OUR COFFEE IN THE OTHER ROOM AND YOU CAN (SMOKE->SMO)|COME WE'LL HAVE OUR COFFEE IN THE OTHER ROOM AND YOU CAN SMOKE
THINKING OF ALL THIS I WENT TO (SLEEP->SLEE)|THINKING OF ALL THIS I WENT TO SLEEP
STEAM UP AND CANVAS SPREAD THE SCHOONER STARTED (EASTWARDS->EASTWARD)|STEAM UP AND CANVAS SPREAD THE SCHOONER STARTED EASTWARDS
### Final WERs and detailed error counts:
*|wer|ins|del|sub
--|--|--|--|--
padding 0 frames|5.60|329|283|2332
padding 20 frames|5.52|329|282|2291

Raw log output for the table above:
```
padding 0 frames:
%WER = 5.60
Errors: 329 insertions, 283 deletions, 2332 substitutions, over 52576 reference words (49961 correct)
padding 20 frames:
%WER = 5.52
Errors: 329 insertions, 282 deletions, 2291 substitutions, over 52576 reference words (50003 correct)
```
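
The WER values in the log can be reproduced from the error counts, assuming the usual definition WER = (insertions + deletions + substitutions) / reference words. The function names below are illustrative and are not part of the scoring tool.

```python
def wer_percent(ins: int, dele: int, sub: int, ref_words: int) -> float:
    """Word error rate in percent from raw error counts."""
    return 100.0 * (ins + dele + sub) / ref_words


def correct_words(dele: int, sub: int, ref_words: int) -> int:
    """Reference words that were recognized correctly."""
    return ref_words - dele - sub


if __name__ == "__main__":
    ref = 52576
    for name, ins, dele, sub in (
        ("padding 0 frames", 329, 283, 2332),
        ("padding 20 frames", 329, 282, 2291),
    ):
        print(
            f"{name}: %WER = {wer_percent(ins, dele, sub, ref):.2f}, "
            f"{correct_words(dele, sub, ref)} correct"
        )
```

The improvement from padding comes almost entirely from substitutions (2332 to 2291), which is consistent with the truncated final words in the examples above.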