diff --git a/egs/multi_ja_en/ASR/RESULTS.md b/egs/multi_ja_en/ASR/RESULTS.md
index 1b2da7303..e89c9e3b6 100644
--- a/egs/multi_ja_en/ASR/RESULTS.md
+++ b/egs/multi_ja_en/ASR/RESULTS.md
@@ -2,29 +2,32 @@

 ### Zipformer

-#### Non-streaming
+#### Non-streaming (Byte-Level BPE vocab_size=2000)
+
+Trained on 15k hours of ReazonSpeech (filtered to keep only audio segments between 8s and 22s) and 15k hours of MLS English.

 The training command is:

 ```shell
 ./zipformer/train.py \
-  --world-size 4 \
-  --num-epochs 21 \
+  --world-size 8 \
+  --num-epochs 10 \
   --start-epoch 1 \
   --use-fp16 1 \
   --exp-dir zipformer/exp \
-  --manifest-dir data/manifests
+  --manifest-dir data/manifests \
+  --enable-musan True
 ```

 The decoding command is:

 ```shell
 ./zipformer/decode.py \
-  --epoch 21 \
-  --avg 15 \
+  --epoch 10 \
+  --avg 1 \
   --exp-dir ./zipformer/exp \
-  --max-duration 600 \
-  --decoding-method greedy_search
+  --decoding-method modified_beam_search \
+  --manifest-dir data/manifests
 ```

 To export the model with onnx:

@@ -33,28 +36,28 @@ To export the model with onnx:
 ```shell
 ./zipformer/export-onnx.py \
   --tokens ./data/lang/bbpe_2000/tokens.txt \
   --use-averaged-model 0 \
-  --epoch 21 \
+  --epoch 10 \
   --avg 1 \
   --exp-dir ./zipformer/exp
 ```

-Word Error Rates (WERs) listed below:
+WER and CER on the test set are listed below (calculated with `./zipformer/decode.py`):

-| Datasets             | ReazonSpeech | ReazonSpeech | LibriSpeech | LibriSpeech |
-|----------------------|--------------|--------------|-------------|-------------|
-| Zipformer WER (%)    | dev          | test         | test-clean  | test-other  |
-| greedy_search        | 5.9          | 4.07         | 3.46        | 8.35        |
-| modified_beam_search | 4.87         | 3.61         | 3.28        | 8.07        |
+| Datasets             | ReazonSpeech + MLS English (combined test set) |
+|----------------------|------------------------------------------------|
+| Zipformer WER (%)    | test                                           |
+| greedy_search        | 6.33                                           |
+| modified_beam_search | 6.32                                           |

 We also include WER% for common English ASR datasets:

 | Corpus                      | WER (%) |
 |-----------------------------|---------|
 | CommonVoice                 | 29.03   |
 | TED                         | 16.78   |
-| MLS English (test-clean)    | 8.64    |
+| MLS English (test set)      | 8.64    |

 And CER% for common Japanese datasets:

@@ -68,3 +71,65 @@ And CER% for common Japanese datasets:

 Pre-trained model can be found here: [https://huggingface.co/reazon-research/reazonspeech-k2-v2-ja-en/tree/multi_ja_en_15k15k](https://huggingface.co/reazon-research/reazonspeech-k2-v2-ja-en/tree/multi_ja_en_15k15k)
+(Not yet publicly released)
+
+#### Streaming (Byte-Level BPE vocab_size=2000)
+
+Trained on 15k hours of ReazonSpeech (filtered to keep only audio segments between 8s and 22s) and 15k hours of MLS English.
+
+The training command is:
+
+```shell
+./zipformer/train.py \
+  --world-size 8 \
+  --num-epochs 10 \
+  --start-epoch 1 \
+  --use-fp16 1 \
+  --exp-dir zipformer/exp \
+  --manifest-dir data/manifests \
+  --enable-musan True
+```
+
+The decoding command is:
+
+```shell
+./zipformer/decode.py \
+  --epoch 10 \
+  --avg 1 \
+  --exp-dir ./zipformer/exp \
+  --decoding-method modified_beam_search \
+  --manifest-dir data/manifests
+```
+
+To export the model with onnx:
+
+```shell
+./zipformer/export-onnx.py \
+  --tokens ./data/lang/bbpe_2000/tokens.txt \
+  --use-averaged-model 0 \
+  --epoch 10 \
+  --avg 1 \
+  --decode-chunk-len 32 \
+  --exp-dir ./zipformer/exp
+```
+
+You may also use decode chunk sizes `16`, `32`, `64`, or `128`.
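+
+If you want one ONNX model per supported chunk size, the export can be scripted in a loop. This is a sketch that reuses exactly the flags shown above; only `--decode-chunk-len` changes between iterations:
+
+```shell
+# Export one streaming ONNX model for each supported decode chunk size.
+# Assumes the same checkpoint and flags as the export command above.
+for chunk in 16 32 64 128; do
+  ./zipformer/export-onnx.py \
+    --tokens ./data/lang/bbpe_2000/tokens.txt \
+    --use-averaged-model 0 \
+    --epoch 10 \
+    --avg 1 \
+    --decode-chunk-len $chunk \
+    --exp-dir ./zipformer/exp
+done
+```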
+
+WER for the combined test set, WER for common English ASR datasets, and CER for common Japanese datasets have not yet been measured for the streaming model.
+
+*Please let us know which script to use to evaluate the streaming model!*
+
+Pre-trained model can be found here: [https://huggingface.co/reazon-research/reazonspeech-k2-v2-ja-en/tree/multi_ja_en_15k15k](https://huggingface.co/reazon-research/reazonspeech-k2-v2-ja-en/tree/multi_ja_en_15k15k)
+
+(Not yet publicly released)
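+
+Once the model is publicly released, the checkpoint can be fetched with `git lfs`. This is a sketch: the repository URL and branch name are taken from the Hugging Face link above.
+
+```shell
+# Download the pre-trained checkpoint from Hugging Face (requires git-lfs).
+git lfs install
+git clone -b multi_ja_en_15k15k \
+  https://huggingface.co/reazon-research/reazonspeech-k2-v2-ja-en
+```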