## Results

### Zipformer

#### Non-streaming (Byte-Level BPE vocab_size=2000)

Trained on 15k hours of ReazonSpeech (filtered to keep only audio segments between 8 s and 22 s) and 15k hours of MLS English.

The training command is:
```shell
./zipformer/train.py \
  --world-size 8 \
  --causal 1 \
  --num-epochs 10 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp \
  --manifest-dir data/manifests \
  --enable-musan True
```
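The byte-level BPE vocabulary named in the heading (`vocab_size=2000`) operates on UTF-8 bytes rather than characters, which guarantees coverage of mixed Japanese/English text from a small base alphabet. A minimal sketch of the underlying idea in plain Python (this is an illustration, not the SentencePiece tokenizer the recipe actually uses):

```python
# Sketch: byte-level tokenization can represent any string with a base
# alphabet of only 256 symbols; BPE then merges frequent byte sequences
# until the vocabulary reaches the target size (2000 in this recipe).
def to_byte_tokens(text: str) -> list[int]:
    """Fall back to raw UTF-8 bytes; no out-of-vocabulary symbols are possible."""
    return list(text.encode("utf-8"))

tokens = to_byte_tokens("日本語 and English")
# Every token is a byte value, so it fits in the 256-symbol base alphabet.
assert all(0 <= t < 256 for t in tokens)
# Japanese characters cost 3 bytes each in UTF-8; ASCII costs 1.
assert to_byte_tokens("日") == [230, 151, 165]
```

This is why a single 2000-entry vocabulary can serve both languages without an `<unk>` symbol.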
The decoding command is:

```shell
./zipformer/decode.py \
  --epoch 10 \
  --avg 1 \
  --exp-dir ./zipformer/exp \
  --decoding-method modified_beam_search \
  --manifest-dir data/manifests
```
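`modified_beam_search` keeps a fixed number of the best partial hypotheses at each step instead of following only the single best path as `greedy_search` does. A toy illustration of the top-k pruning step with made-up token log-probabilities (a sketch of the general idea, not the actual icefall implementation):

```python
import math

def prune_beam(hyps, beam=4):
    """Keep the `beam` highest-scoring hypotheses (score = total log-prob)."""
    return sorted(hyps, key=lambda h: h[1], reverse=True)[:beam]

def extend(hyps, log_probs, beam=4):
    """Extend every hypothesis with every candidate token, then prune."""
    new_hyps = [
        (tokens + [tok], score + lp)
        for tokens, score in hyps
        for tok, lp in log_probs.items()
    ]
    return prune_beam(new_hyps, beam)

# One decoding step: start from the empty hypothesis, keep the best 2.
hyps = [([], 0.0)]
hyps = extend(hyps, {"a": math.log(0.6), "b": math.log(0.3), "c": math.log(0.1)}, beam=2)
assert [h[0] for h in hyps] == [["a"], ["b"]]
```

The small WER gap between `greedy_search` and `modified_beam_search` in the table below suggests the model's per-frame distributions are already quite peaked.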
To export the model to ONNX:

```shell
./zipformer/export-onnx.py \
  --tokens ./data/lang/bbpe_2000/tokens.txt \
  --use-averaged-model 0 \
  --epoch 10 \
  --avg 1 \
  --exp-dir ./zipformer/exp
```
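`--avg 1` means no checkpoint averaging is applied; with `--avg N` for N > 1, the parameters of the last N checkpoints are averaged element-wise before export. A minimal sketch of that averaging, with plain dicts of lists standing in for real state dicts (an illustration, not icefall's actual averaging code):

```python
def average_checkpoints(state_dicts):
    """Element-wise mean over a list of checkpoints (dicts of parameter lists)."""
    n = len(state_dicts)
    return {
        k: [sum(sd[k][i] for sd in state_dicts) / n
            for i in range(len(state_dicts[0][k]))]
        for k in state_dicts[0]
    }

# Hypothetical parameters from epochs 9 and 10.
ckpt_9 = {"encoder.w": [1.0, 2.0]}
ckpt_10 = {"encoder.w": [3.0, 4.0]}
assert average_checkpoints([ckpt_9, ckpt_10]) == {"encoder.w": [2.0, 3.0]}
```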
WER on the combined test set is listed below (calculated with `./zipformer/decode.py`):

| Decoding method      | ReazonSpeech + MLS English (combined test set), WER (%) |
|----------------------|---------------------------------------------------------|
| greedy_search        | 6.33                                                    |
| modified_beam_search | 6.32                                                    |
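For reference, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A self-contained sketch of the computation (not the scoring script the recipe uses):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (single-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)

assert wer("the cat sat", "the cat sat") == 0.0
assert wer("the cat sat", "the cat sit") == 1 / 3  # one substitution in three words
```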
We also include WER (%) for common English ASR datasets:

| Corpus                 | WER (%) |
|------------------------|---------|
| CommonVoice            | 29.03   |
| TED                    | 16.78   |
| MLS English (test set) | 8.64    |
And CER (%) for common Japanese datasets:

| Corpus      | CER (%) |
|-------------|---------|
| JSUT        | 8.13    |
| CommonVoice | 9.82    |
| TEDx        | 11.64   |
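CER is the same edit-distance metric computed over characters rather than words, the standard choice for Japanese since written Japanese has no whitespace word boundaries. A minimal sketch with a hypothetical example (not the recipe's scoring script):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (single-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

# One substituted character out of five reference characters -> CER 0.2
assert cer("こんにちは", "こんにちわ") == 0.2
```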
The pre-trained model can be found at [https://huggingface.co/reazon-research/reazonspeech-k2-v2-ja-en/tree/multi_ja_en_15k15k](https://huggingface.co/reazon-research/reazonspeech-k2-v2-ja-en/tree/multi_ja_en_15k15k) (not yet publicly released).
#### Streaming (Byte-Level BPE vocab_size=2000)

Trained on 15k hours of ReazonSpeech (filtered to keep only audio segments between 8 s and 22 s) and 15k hours of MLS English.

The training command is:
```shell
./zipformer/train.py \
  --world-size 8 \
  --causal 1 \
  --num-epochs 10 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp \
  --manifest-dir data/manifests \
  --enable-musan True
```
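`--causal 1` trains the encoder with causal (left-context-only) attention so the same model can later run in streaming mode. A toy sketch of the attention mask this implies, for a hypothetical 4-frame sequence (the real model uses chunked masks with limited left context rather than a plain triangular mask):

```python
def causal_mask(n: int) -> list[list[bool]]:
    """mask[i][j] is True when frame i may attend to frame j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

mask = causal_mask(4)
# Frame 0 sees only itself; frame 3 sees everything up to and including itself.
assert mask[0] == [True, False, False, False]
assert mask[3] == [True, True, True, True]
```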
The decoding command is:

```shell
TODO
```
To export the model with sherpa-onnx:

```shell
./zipformer/export-onnx-streaming.py \
  --tokens ./data/lang/bbpe_2000/tokens.txt \
  --use-averaged-model 0 \
  --epoch 10 \
  --avg 1 \
  --exp-dir ./zipformer/exp-15k15k-streaming \
  --num-encoder-layers "2,2,3,4,3,2" \
  --downsampling-factor "1,2,4,8,4,2" \
  --feedforward-dim "512,768,1024,1536,1024,768" \
  --num-heads "4,4,4,8,4,4" \
  --encoder-dim "192,256,384,512,384,256" \
  --query-head-dim 32 \
  --value-head-dim 12 \
  --pos-head-dim 4 \
  --pos-dim 48 \
  --encoder-unmasked-dim "192,192,256,256,256,192" \
  --cnn-module-kernel "31,31,15,15,15,31" \
  --decoder-dim 512 \
  --joiner-dim 512 \
  --causal True \
  --chunk-size 16 \
  --left-context-frames 128 \
  --fp16 True
```
(Adjust `--chunk-size` and `--left-context-frames` as necessary.)
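The latency implied by these settings can be estimated from the encoder frame rate. Assuming the usual icefall Zipformer setup of one encoder frame every 20 ms (10 ms fbank frames with 2x subsampling) — an assumption, so check your feature configuration — `--chunk-size 16` corresponds to roughly 320 ms of audio per chunk and `--left-context-frames 128` to about 2.56 s of left context:

```python
# Assumption: one encoder frame every 20 ms (10 ms fbank frames, 2x subsampling).
FRAME_MS = 20

def frames_to_ms(n_frames: int) -> int:
    """Convert a frame count at the encoder frame rate to milliseconds."""
    return n_frames * FRAME_MS

assert frames_to_ms(16) == 320    # --chunk-size 16  -> ~0.32 s per chunk
assert frames_to_ms(128) == 2560  # --left-context-frames 128 -> ~2.56 s
```

Larger chunks reduce overhead but increase latency; less left context reduces memory at some accuracy cost.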
To export the model as TorchScript (`.jit`):

```shell
./zipformer/export.py \
  --exp-dir ./zipformer/exp-15k15k-streaming \
  --causal 1 \
  --chunk-size 16 \
  --left-context-frames 128 \
  --tokens data/lang/bbpe_2000/tokens.txt \
  --epoch 10 \
  --avg 1 \
  --jit 1
```
You may also use decode chunk sizes `16`, `32`, `64`, `128`.

Word Error Rates (WERs) are listed below:

*Please let us know which script to use to evaluate the streaming model!*
We also include WER (%) for common English ASR datasets:

*Please let us know which script to use to evaluate the streaming model!*

And CER (%) for common Japanese datasets:

*Please let us know which script to use to evaluate the streaming model!*
The pre-trained model can be found at [https://huggingface.co/reazon-research/reazonspeech-k2-v2-ja-en/tree/multi_ja_en_15k15k](https://huggingface.co/reazon-research/reazonspeech-k2-v2-ja-en/tree/multi_ja_en_15k15k) (not yet publicly released).