Mirror of https://github.com/k2-fsa/icefall.git
Synced 2025-12-09 05:55:26 +00:00

Commit bc2560cb7a (parent ef7664e7cf): Update training commands and decode.py accuracy values, add streaming model section

### Zipformer

#### Non-streaming (Byte-Level BPE vocab_size=2000)

Trained on 15k hours of ReazonSpeech (filtered to keep only audio segments between 8 s and 22 s) and 15k hours of MLS English.
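
The 8 s–22 s duration filter can be sketched as a simple predicate over manifest entries. This is an illustrative sketch only; the dict fields below are hypothetical and do not reflect the recipe's actual manifest schema:

```python
# Illustrative sketch of the 8 s - 22 s segment filter applied to the
# training data. Field names here are hypothetical, not the real schema.

def keep_segment(duration_s: float, lo: float = 8.0, hi: float = 22.0) -> bool:
    """Return True if the audio segment length is within [lo, hi] seconds."""
    return lo <= duration_s <= hi

segments = [
    {"id": "a", "duration": 3.2},
    {"id": "b", "duration": 12.5},
    {"id": "c", "duration": 21.9},
    {"id": "d", "duration": 30.0},
]
kept = [s["id"] for s in segments if keep_segment(s["duration"])]
print(kept)  # ['b', 'c']
```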

The training command is:

```shell
./zipformer/train.py \
  --world-size 8 \
  --num-epochs 10 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp \
  --manifest-dir data/manifests \
  --enable-musan True
```

The decoding command is:

```shell
./zipformer/decode.py \
  --epoch 10 \
  --avg 1 \
  --exp-dir ./zipformer/exp \
  --max-duration 600 \
  --decoding-method modified_beam_search \
  --manifest-dir data/manifests
```
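
As a toy illustration of why `modified_beam_search` can outperform `greedy_search`: when later token probabilities depend on the prefix chosen so far, a beam can recover paths that a step-by-step argmax misses. The sketch below is a generic two-step example with made-up probabilities, not icefall's transducer-aware implementation:

```python
# Toy greedy vs. beam search where step-2 probabilities depend on the
# step-1 choice. Values are contrived so that greedy is suboptimal.

STEP1 = [0.6, 0.4]  # P(first token)

def step2(prev: int) -> list:
    """P(second token | first token)."""
    return [0.5, 0.5] if prev == 0 else [0.9, 0.1]

def greedy():
    """Pick the single best token at each step."""
    t1 = max(range(2), key=lambda t: STEP1[t])
    d2 = step2(t1)
    t2 = max(range(2), key=lambda t: d2[t])
    return [t1, t2], STEP1[t1] * d2[t2]

def beam(width: int = 2):
    """Keep the `width` best prefixes, then pick the best full path."""
    hyps = sorted(([t], STEP1[t]) for t in range(2))
    hyps = sorted(hyps, key=lambda h: -h[1])[:width]
    cand = [(seq + [t], p * step2(seq[-1])[t]) for seq, p in hyps for t in range(2)]
    return max(cand, key=lambda h: h[1])

print(greedy())  # greedy commits to [0, 0] with probability 0.3
print(beam())    # beam finds [1, 0], which has higher probability (~0.36)
```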

To export the model with onnx:

```shell
./zipformer/export-onnx.py \
  --tokens ./data/lang/bbpe_2000/tokens.txt \
  --use-averaged-model 0 \
  --epoch 10 \
  --avg 1 \
  --exp-dir ./zipformer/exp
```

WER and CER on the test set are listed below (calculated with `./zipformer/decode.py`):

| Datasets             | ReazonSpeech + MLS English (combined test set) |
|----------------------|------------------------------------------------|
| Zipformer WER (%)    | test                                           |
| greedy_search        | 6.33                                           |
| modified_beam_search | 6.32                                           |
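
The WER figures above are produced by `./zipformer/decode.py`; the metric itself is just word-level edit distance divided by the number of reference words, as in this self-contained sketch (not icefall's scoring code):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / #ref words."""
    r, h = ref.split(), hyp.split()
    # Levenshtein distance over words, one rolling DP row.
    d = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(h) + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                        # deletion
                       d[j - 1] + 1,                    # insertion
                       prev + (r[i - 1] != h[j - 1]))   # substitution / match
            prev = cur
    return d[len(h)] / len(r)

# One deleted word out of six reference words -> WER 1/6.
print(round(wer("the cat sat on the mat", "the cat sat on mat"), 3))  # 0.167
```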

We also include WER% for common English ASR datasets:

| Corpus                 | WER (%) |
|------------------------|---------|
| CommonVoice            | 29.03   |
| TED                    | 16.78   |
| MLS English (test set) | 8.64    |

And CER% for common Japanese datasets:

The pre-trained model can be found here: [https://huggingface.co/reazon-research/reazonspeech-k2-v2-ja-en/tree/multi_ja_en_15k15k](https://huggingface.co/reazon-research/reazonspeech-k2-v2-ja-en/tree/multi_ja_en_15k15k)

(Not yet publicly released)

#### Streaming (Byte-Level BPE vocab_size=2000)

Trained on 15k hours of ReazonSpeech (filtered to keep only audio segments between 8 s and 22 s) and 15k hours of MLS English.

The training command is:

```shell
./zipformer/train.py \
  --world-size 8 \
  --num-epochs 10 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp \
  --manifest-dir data/manifests \
  --enable-musan True
```

The decoding command is:

```shell
./zipformer/decode.py \
  --epoch 10 \
  --avg 1 \
  --exp-dir ./zipformer/exp \
  --decoding-method modified_beam_search \
  --manifest-dir data/manifests
```

To export the model with onnx:

```shell
./zipformer/export-onnx.py \
  --tokens ./data/lang/bbpe_2000/tokens.txt \
  --use-averaged-model 0 \
  --epoch 10 \
  --avg 1 \
  --decode-chunk-len 32 \
  --exp-dir ./zipformer/exp
```

You may also use decode chunk sizes `16`, `32`, `64`, and `128`.
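
The chunk size roughly sets the algorithmic latency of the streaming model. Assuming the chunk length is counted in 10 ms feature frames (an assumption; check the recipe's feature configuration), the per-chunk latency works out as:

```python
# Rough per-chunk latency for each supported --decode-chunk-len value,
# ASSUMING a 10 ms feature frame shift (not confirmed by the recipe).
FRAME_SHIFT_MS = 10

for chunk_frames in (16, 32, 64, 128):
    latency_ms = chunk_frames * FRAME_SHIFT_MS
    print(f"decode-chunk-len={chunk_frames:3d} -> ~{latency_ms} ms per chunk")
```

Smaller chunks reduce latency but give the encoder less right context per step, which typically costs some accuracy.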

Word Error Rates (WERs) listed below:

*Please let us know which script to use to evaluate the streaming model!*

We also include WER% for common English ASR datasets:

*Please let us know which script to use to evaluate the streaming model!*

And CER% for common Japanese datasets:

*Please let us know which script to use to evaluate the streaming model!*

The pre-trained model can be found here: [https://huggingface.co/reazon-research/reazonspeech-k2-v2-ja-en/tree/multi_ja_en_15k15k](https://huggingface.co/reazon-research/reazonspeech-k2-v2-ja-en/tree/multi_ja_en_15k15k)

(Not yet publicly released)