From 9d93d63cf2f7f7214892a8fb9d9c9c9feb4067ad Mon Sep 17 00:00:00 2001
From: Bailey Machiko Hirota <53164945+baileyeet@users.noreply.github.com>
Date: Fri, 18 Jul 2025 18:04:12 +0900
Subject: [PATCH] Update RESULTS.md

---
 egs/multi_ja_en/ASR/RESULTS.md | 40 +++++++++++++++++++++++++---------
 1 file changed, 30 insertions(+), 10 deletions(-)

diff --git a/egs/multi_ja_en/ASR/RESULTS.md b/egs/multi_ja_en/ASR/RESULTS.md
index 0f6996013..524726441 100644
--- a/egs/multi_ja_en/ASR/RESULTS.md
+++ b/egs/multi_ja_en/ASR/RESULTS.md
@@ -8,20 +8,19 @@ The training command is:
 
 ```shell
 ./zipformer/train.py \
-  --bilingual 1 \
   --world-size 4 \
-  --num-epochs 30 \
+  --num-epochs 21 \
   --start-epoch 1 \
   --use-fp16 1 \
   --exp-dir zipformer/exp \
-  --max-duration 600
+  --manifest-dir data/manifests
 ```
 
 The decoding command is:
 
 ```shell
 ./zipformer/decode.py \
-  --epoch 28 \
+  --epoch 21 \
   --avg 15 \
   --exp-dir ./zipformer/exp \
   --max-duration 600 \
@@ -31,8 +30,14 @@ The decoding command is:
 To export the model with onnx:
 
 ```shell
-./zipformer/export-onnx.py --tokens data/lang_bbpe_2000/tokens.txt --use-averaged-model 0 --epoch 35 --avg 1 --exp-dir zipformer/exp --num-encoder-layers "2,2,3,4,3,2" --downsampling-factor "1,2,4,8,4,2" --feedforward-dim "512,768,1024,1536,1024,768" --num-heads "4,4,4,8,4,4" --encoder-dim "192,256,384,512,384,256" --query-head-dim 32 --value-head-dim 12 --pos-head-dim 4 --pos-dim 48 --encoder-unmasked-dim "192,192,256,256,256,192" --cnn-module-kernel "31,31,15,15,15,31" --decoder-dim 512 --joiner-dim 512 --causal False --chunk-size "16,32,64,-1" --left-context-frames "64,128,256,-1" --fp16 True
+./zipformer/export-onnx.py \
+  --tokens ./data/lang/bbpe_2000/tokens.txt \
+  --use-averaged-model 0 \
+  --epoch 21 \
+  --avg 1 \
+  --exp-dir ./zipformer/exp
 ```
+
 Word Error Rates (WERs) listed below:
 
 | Datasets | ReazonSpeech | ReazonSpeech | LibriSpeech | LibriSpeech |
@@ -42,11 +47,26 @@ Word Error Rates (WERs) listed below:
 | modified_beam_search | 4.87 | 3.61 | 3.28 | 8.07 |
 
-Character Error Rates (CERs) for Japanese listed below:
-| Decoding Method | In-Distribution CER | JSUT | CommonVoice | TEDx |
-| :------------------: | :-----------------: | :--: | :---------: | :---: |
-| greedy search | 12.56 | 6.93 | 9.75 | 9.67 |
-| modified beam search | 11.59 | 6.97 | 9.55 | 9.51 |
+
+We also include WER% for common English ASR datasets:
+
+| Corpus                      | WER (%) |
+|-----------------------------|---------|
+| LibriSpeech (test-clean)    | 3.49    |
+| LibriSpeech (test-other)    | 7.64    |
+| CommonVoice                 | 39.87   |
+| TED                         | 23.92   |
+| MLS English (test-clean)    | 10.16   |
+
+
+And CER% for common Japanese datasets:
+
+| Corpus        | CER (%) |
+|---------------|---------|
+| JSUT          | 10.04   |
+| CommonVoice   | 10.39   |
+| TEDx          | 12.22   |
 
 Pre-trained model can be found here: https://huggingface.co/reazon-research/reazonspeech-k2-v2-ja-en/tree/main
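
The WER and CER figures in the updated tables are standard edit-distance error rates: Levenshtein distance between reference and hypothesis token sequences, divided by the reference length (tokens are words for WER, characters for CER). A minimal illustrative sketch of that computation — not the scoring code icefall actually uses:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (rolling 1-D DP)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # distances for the empty-reference row
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                          # deletion
                dp[j - 1] + 1,                      # insertion
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution / match
            )
            prev = cur
    return dp[n]


def error_rate(ref_tokens, hyp_tokens):
    """WER if tokens are words, CER if tokens are characters."""
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


# One substitution over a 3-word reference -> 33.33% WER
print(round(100 * error_rate("the cat sat".split(), "the cat sit".split()), 2))
```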