Update training commands and decode.py accuracy values, add streaming model section

2025-09-03 17:54:34 +09:00 · 2025-09-03 17:54:34 +09:00 · bc2560cb7a
commit bc2560cb7a
parent ef7664e7cf
1 changed file with 82 additions and 17 deletions
--- a/egs/multi_ja_en/ASR/RESULTS.md
+++ b/egs/multi_ja_en/ASR/RESULTS.md
@@ -2,29 +2,32 @@

 ### Zipformer

-#### Non-streaming
+#### Non-streaming (Byte-Level BPE vocab_size=2000)
+
+Trained on 15k hours of ReazonSpeech (filtered to only audio segments between 8s and 22s) and 15k hours of MLS English.

 The training command is:

 ```shell
 ./zipformer/train.py \
-  --world-size 4 \
-  --num-epochs 21 \
+  --world-size 8 \
+  --num-epochs 10 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp \
-  --manifest-dir data/manifests
+  --manifest-dir data/manifests \
+  --enable-musan True
 ```

 The decoding command is:

 ```shell
 ./zipformer/decode.py \
-    --epoch 21 \
-    --avg 15 \
+    --epoch 10 \
+    --avg 1 \
    --exp-dir ./zipformer/exp \
-    --max-duration 600 \
-    --decoding-method greedy_search
+    --decoding-method modified_beam_search \
+    --manifest-dir data/manifests
 ```

 To export the model with onnx:
@@ -33,28 +36,28 @@ To export the model with onnx:
 ./zipformer/export-onnx.py \
  --tokens ./data/lang/bbpe_2000/tokens.txt \
  --use-averaged-model 0 \
-  --epoch 21 \
+  --epoch 10 \
  --avg 1 \
  --exp-dir ./zipformer/exp
 ```

-Word Error Rates (WERs) listed below:
+WER and CER on the test set are listed below (calculated with `./zipformer/decode.py`):

-|       Datasets       | ReazonSpeech |  ReazonSpeech |     LibriSpeech    |    LibriSpeech    |
-|----------------------|--------------|---------------|--------------------|-------------------|
-|   Zipformer WER (%)  |     dev      |     test      |     test-clean     |    test-other     |
-|     greedy_search    |     5.9      |     4.07      |        3.46        |       8.35        |
-| modified_beam_search |    4.87      |     3.61      |        3.28        |       8.07        |
+|       Datasets       | ReazonSpeech + MLS English (combined test set) |
+|----------------------|------------------------------------------------|
+|   Zipformer WER (%)  |                      test                      |
+|     greedy_search    |                      6.33                      |
+| modified_beam_search |                      6.32                      |



 We also include WER% for common English ASR datasets:

-| Corpus                       | WER (%) |
+| Corpus                      | WER (%) |
 |-----------------------------|---------|
 | CommonVoice                 | 29.03   |
 | TED                         | 16.78   |
-| MLS English (test-clean)    | 8.64   |
+| MLS English (test set)      | 8.64    |


 And CER% for common Japanese datasets:
@@ -68,3 +71,65 @@ And CER% for common Japanese datasets:

 Pre-trained model can be found here: [https://huggingface.co/reazon-research/reazonspeech-k2-v2-ja-en/tree/multi_ja_en_15k15k](https://huggingface.co/reazon-research/reazonspeech-k2-v2-ja-en/tree/multi_ja_en_15k15k)

+(Not yet publicly released)
+
+#### Streaming (Byte-Level BPE vocab_size=2000)
+
+Trained on 15k hours of ReazonSpeech (filtered to only audio segments between 8s and 22s) and 15k hours of MLS English.
+
+The training command is:
+
+```shell
+./zipformer/train.py \
+  --world-size 8 \
+  --num-epochs 10 \
+  --start-epoch 1 \
+  --use-fp16 1 \
+  --exp-dir zipformer/exp \
+  --manifest-dir data/manifests \
+  --enable-musan True
+```
+
+The decoding command is:
+
+```shell
+./zipformer/decode.py \
+    --epoch 10 \
+    --avg 1 \
+    --exp-dir ./zipformer/exp \
+    --decoding-method modified_beam_search \
+    --manifest-dir data/manifests
+```
+
+To export the model with onnx:
+
+```shell
+./zipformer/export-onnx.py \
+  --tokens ./data/lang/bbpe_2000/tokens.txt \
+  --use-averaged-model 0 \
+  --epoch 10 \
+  --avg 1 \
+  --decode-chunk-len 32 \
+  --exp-dir ./zipformer/exp
+```
+
+You may also use decode chunk sizes `16`, `32`, `64`, `128`.
+
+Word Error Rates (WERs) are listed below:
+
+*Please let us know which script to use to evaluate the streaming model!*
+
+
+We also include WER% for common English ASR datasets:
+
+*Please let us know which script to use to evaluate the streaming model!*
+
+
+And CER% for common Japanese datasets:
+
+*Please let us know which script to use to evaluate the streaming model!*
+
+
+Pre-trained model can be found here: [https://huggingface.co/reazon-research/reazonspeech-k2-v2-ja-en/tree/multi_ja_en_15k15k](https://huggingface.co/reazon-research/reazonspeech-k2-v2-ja-en/tree/multi_ja_en_15k15k)
+
+(Not yet publicly released)
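Note on the streaming export in the last hunk: the added text says decode chunk sizes `16`, `32`, `64` and `128` may also be used. A minimal dry-run sketch of what that could look like is below. It only prints one export command per chunk size and runs nothing. The helper name `print_export_cmds` is hypothetical, and running one export per chunk size is an assumption; the commit lists the sizes but does not say how the resulting exports are named or whether they overwrite each other.

```shell
#!/usr/bin/env bash
# Dry run: print one streaming export command per decode chunk size.
# Flags and paths are copied from the export command in the diff above.
# Assumption: each chunk size is exported in a separate run.
set -euo pipefail

# Hypothetical helper, not part of the recipe.
print_export_cmds() {
  local chunk
  for chunk in 16 32 64 128; do
    echo "./zipformer/export-onnx.py" \
      "--tokens ./data/lang/bbpe_2000/tokens.txt" \
      "--use-averaged-model 0" \
      "--epoch 10" \
      "--avg 1" \
      "--decode-chunk-len ${chunk}" \
      "--exp-dir ./zipformer/exp"
  done
}

print_export_cmds
```

Printing the commands first makes it easy to review them before copying any of them into a real run.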