Mirror of https://github.com/k2-fsa/icefall.git, synced 2025-09-18 21:44:18 +00:00

Commit 61794d8a1a (parent 1f6c822dc0): update RESULTS.md

@@ -1,5 +1,317 @@

## Results

### LibriSpeech BPE training results (Pruned Stateless Conv-Emformer RNN-T 2)

[conv_emformer_transducer_stateless2](./conv_emformer_transducer_stateless2)

It implements [Emformer](https://arxiv.org/abs/2010.10759) augmented with a convolution module and a simplified memory bank for streaming ASR.
It is modified from [torchaudio](https://github.com/pytorch/audio).

See <https://github.com/k2-fsa/icefall/pull/440> for more details.

#### With lower latency setup, training on full LibriSpeech

In this model, the chunk length is 32 frames (i.e., 0.32 s) and the right-context length is 8 frames (i.e., 0.08 s).
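
The frame counts can be converted to seconds with a short sketch. It assumes one frame corresponds to 10 ms, which is implied by 32 frames being 0.32 s; the helper name `frames_to_seconds` is illustrative and is not part of icefall. Treating chunk plus right context as the per-chunk look-ahead is likewise an assumption, not a measured latency.

```python
# Assumption: one frame corresponds to 10 ms (32 frames are stated to be 0.32 s).
FRAME_SHIFT_SECONDS = 0.01


def frames_to_seconds(num_frames: int) -> float:
    """Convert a length in frames to seconds (hypothetical helper)."""
    return round(num_frames * FRAME_SHIFT_SECONDS, 2)


chunk_seconds = frames_to_seconds(32)         # --chunk-length 32
right_context_seconds = frames_to_seconds(8)  # --right-context-length 8

# Assumption: audio covered by one chunk plus its right context.
chunk_plus_context_seconds = frames_to_seconds(32 + 8)

print(chunk_seconds, right_context_seconds, chunk_plus_context_seconds)
```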

The WERs are:

| decoding method                     | test-clean | test-other | comment             | decoding mode       |
|-------------------------------------|------------|------------|---------------------|---------------------|
| greedy search (max sym per frame 1) | 3.5        | 9.09       | --epoch 30 --avg 10 | simulated streaming |
| greedy search (max sym per frame 1) | 3.57       | 9.1        | --epoch 30 --avg 10 | streaming           |
| fast beam search                    | 3.5        | 8.91       | --epoch 30 --avg 10 | simulated streaming |
| fast beam search                    | 3.54       | 8.91       | --epoch 30 --avg 10 | streaming           |
| modified beam search                | 3.43       | 8.86       | --epoch 30 --avg 10 | simulated streaming |
| modified beam search                | 3.48       | 8.88       | --epoch 30 --avg 10 | streaming           |
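
To see how closely streaming decoding tracks simulated streaming, the sketch below computes the absolute WER gap per decoding method. The numbers are copied from the table above; the variable names are illustrative only.

```python
# WERs (%) copied from the table above, as (test-clean, test-other).
wers = {
    "greedy search": {"simulated streaming": (3.5, 9.09), "streaming": (3.57, 9.1)},
    "fast beam search": {"simulated streaming": (3.5, 8.91), "streaming": (3.54, 8.91)},
    "modified beam search": {"simulated streaming": (3.43, 8.86), "streaming": (3.48, 8.88)},
}

gaps = {}
for method, modes in wers.items():
    simulated = modes["simulated streaming"]
    streaming = modes["streaming"]
    # Absolute WER difference (streaming minus simulated) for each test set.
    gaps[method] = tuple(round(r - s, 2) for s, r in zip(simulated, streaming))
    clean_gap, other_gap = gaps[method]
    print(f"{method}: test-clean {clean_gap:+.2f}, test-other {other_gap:+.2f}")

# Largest gap over all methods and both test sets.
max_gap = max(max(gap) for gap in gaps.values())
print(f"largest gap: {max_gap:.2f}")
```

For this table, the two decoding modes differ by less than 0.1 absolute WER in every row.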

The training command is:

```bash
./conv_emformer_transducer_stateless2/train.py \
  --world-size 6 \
  --num-epochs 30 \
  --start-epoch 1 \
  --exp-dir conv_emformer_transducer_stateless2/exp \
  --full-libri 1 \
  --max-duration 280 \
  --master-port 12321 \
  --num-encoder-layers 12 \
  --chunk-length 32 \
  --cnn-module-kernel 31 \
  --left-context-length 32 \
  --right-context-length 8 \
  --memory-size 32
```

The tensorboard log can be found at
<https://tensorboard.dev/experiment/W5MpxekiQLSPyM4fe5hbKg/>

The simulated streaming decoding command using greedy search is:
```bash
./conv_emformer_transducer_stateless2/decode.py \
  --epoch 30 \
  --avg 10 \
  --exp-dir conv_emformer_transducer_stateless2/exp \
  --max-duration 300 \
  --num-encoder-layers 12 \
  --chunk-length 32 \
  --cnn-module-kernel 31 \
  --left-context-length 32 \
  --right-context-length 8 \
  --memory-size 32 \
  --decoding-method greedy_search \
  --use-averaged-model True
```

The simulated streaming decoding command using fast beam search is:
```bash
./conv_emformer_transducer_stateless2/decode.py \
  --epoch 30 \
  --avg 10 \
  --exp-dir conv_emformer_transducer_stateless2/exp \
  --max-duration 300 \
  --num-encoder-layers 12 \
  --chunk-length 32 \
  --cnn-module-kernel 31 \
  --left-context-length 32 \
  --right-context-length 8 \
  --memory-size 32 \
  --decoding-method fast_beam_search \
  --use-averaged-model True \
  --beam 4 \
  --max-contexts 4 \
  --max-states 8
```

The simulated streaming decoding command using modified beam search is:
```bash
./conv_emformer_transducer_stateless2/decode.py \
  --epoch 30 \
  --avg 10 \
  --exp-dir conv_emformer_transducer_stateless2/exp \
  --max-duration 300 \
  --num-encoder-layers 12 \
  --chunk-length 32 \
  --cnn-module-kernel 31 \
  --left-context-length 32 \
  --right-context-length 8 \
  --memory-size 32 \
  --decoding-method modified_beam_search \
  --use-averaged-model True \
  --beam-size 4
```

The streaming decoding command using greedy search is:
```bash
./conv_emformer_transducer_stateless2/streaming_decode.py \
  --epoch 30 \
  --avg 10 \
  --exp-dir conv_emformer_transducer_stateless2/exp \
  --num-decode-streams 2000 \
  --num-encoder-layers 12 \
  --chunk-length 32 \
  --cnn-module-kernel 31 \
  --left-context-length 32 \
  --right-context-length 8 \
  --memory-size 32 \
  --decoding-method greedy_search \
  --use-averaged-model True
```

The streaming decoding command using fast beam search is:
```bash
./conv_emformer_transducer_stateless2/streaming_decode.py \
  --epoch 30 \
  --avg 10 \
  --exp-dir conv_emformer_transducer_stateless2/exp \
  --num-decode-streams 2000 \
  --num-encoder-layers 12 \
  --chunk-length 32 \
  --cnn-module-kernel 31 \
  --left-context-length 32 \
  --right-context-length 8 \
  --memory-size 32 \
  --decoding-method fast_beam_search \
  --use-averaged-model True \
  --beam 4 \
  --max-contexts 4 \
  --max-states 8
```

The streaming decoding command using modified beam search is:
```bash
./conv_emformer_transducer_stateless2/streaming_decode.py \
  --epoch 30 \
  --avg 10 \
  --exp-dir conv_emformer_transducer_stateless2/exp \
  --num-decode-streams 2000 \
  --num-encoder-layers 12 \
  --chunk-length 32 \
  --cnn-module-kernel 31 \
  --left-context-length 32 \
  --right-context-length 8 \
  --memory-size 32 \
  --decoding-method modified_beam_search \
  --use-averaged-model True \
  --beam-size 4
```
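
The six decoding commands above differ only in the script name, one script-specific flag, and the search-specific flags. The sketch below makes that structure explicit by assembling the argument lists in Python. It only prints the commands and runs nothing; the names `RECIPE`, `COMMON_FLAGS`, `SCRIPT_FLAGS`, `METHOD_FLAGS`, and `build_command` are hypothetical, and every flag and value is taken from the commands listed in this section.

```python
import shlex

RECIPE = "conv_emformer_transducer_stateless2"

# Flags shared by all six decoding commands (lower-latency model).
COMMON_FLAGS = [
    "--epoch", "30",
    "--avg", "10",
    "--exp-dir", f"{RECIPE}/exp",
    "--num-encoder-layers", "12",
    "--chunk-length", "32",
    "--cnn-module-kernel", "31",
    "--left-context-length", "32",
    "--right-context-length", "8",
    "--memory-size", "32",
    "--use-averaged-model", "True",
]

# The one flag that differs between the two scripts.
SCRIPT_FLAGS = {
    "decode.py": ["--max-duration", "300"],                   # simulated streaming
    "streaming_decode.py": ["--num-decode-streams", "2000"],  # streaming
}

# Extra flags required by each decoding method.
METHOD_FLAGS = {
    "greedy_search": [],
    "fast_beam_search": ["--beam", "4", "--max-contexts", "4", "--max-states", "8"],
    "modified_beam_search": ["--beam-size", "4"],
}


def build_command(script: str, method: str) -> list[str]:
    """Return the argument list for one script and decoding method."""
    return [
        f"./{RECIPE}/{script}",
        *COMMON_FLAGS,
        *SCRIPT_FLAGS[script],
        "--decoding-method", method,
        *METHOD_FLAGS[method],
    ]


for script in SCRIPT_FLAGS:
    for method in METHOD_FLAGS:
        print(shlex.join(build_command(script, method)))
```

The flag order differs from the hand-written commands, which should not matter for keyword-style options.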

Pretrained models, training logs, decoding logs, and decoding results
are available at
<https://huggingface.co/Zengwei/icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05>

#### With higher latency setup, training on full LibriSpeech

In this model, the chunk length is 64 frames (i.e., 0.64 s) and the right-context length is 16 frames (i.e., 0.16 s).

The WERs are:

| decoding method                     | test-clean | test-other | comment             | decoding mode       |
|-------------------------------------|------------|------------|---------------------|---------------------|
| greedy search (max sym per frame 1) | 3.3        | 8.71       | --epoch 30 --avg 10 | simulated streaming |
| greedy search (max sym per frame 1) | 3.35       | 9.65       | --epoch 30 --avg 10 | streaming           |
| fast beam search                    | 3.27       | 8.58       | --epoch 30 --avg 10 | simulated streaming |
| fast beam search                    | 3.31       | 8.48       | --epoch 30 --avg 10 | streaming           |
| modified beam search                | 3.26       | 8.56       | --epoch 30 --avg 10 | simulated streaming |
| modified beam search                | 3.29       | 8.47       | --epoch 30 --avg 10 | streaming           |
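
To quantify what the larger chunk and right context buy, the sketch below computes the relative WER reduction of the higher-latency model over the lower-latency one. It uses only the simulated-streaming rows, copied from the two WER tables in this document; the function name `relative_reduction` is illustrative.

```python
# Simulated-streaming WERs (%) from the two tables, as (test-clean, test-other).
lower_latency = {
    "greedy search": (3.5, 9.09),
    "fast beam search": (3.5, 8.91),
    "modified beam search": (3.43, 8.86),
}
higher_latency = {
    "greedy search": (3.3, 8.71),
    "fast beam search": (3.27, 8.58),
    "modified beam search": (3.26, 8.56),
}


def relative_reduction(before: float, after: float) -> float:
    """Relative WER reduction in percent, rounded to one decimal place."""
    return round(100.0 * (before - after) / before, 1)


reductions = {}
for method, low in lower_latency.items():
    high = higher_latency[method]
    reductions[method] = tuple(relative_reduction(b, a) for b, a in zip(low, high))
    clean, other = reductions[method]
    print(f"{method}: test-clean {clean}%, test-other {other}%")
```

With these numbers the higher-latency model is better on both test sets for every decoding method, by roughly 3% to 7% relative.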

The training command is:

```bash
./conv_emformer_transducer_stateless2/train.py \
  --world-size 6 \
  --num-epochs 30 \
  --start-epoch 1 \
  --exp-dir conv_emformer_transducer_stateless2/exp \
  --full-libri 1 \
  --max-duration 280 \
  --master-port 12321 \
  --num-encoder-layers 12 \
  --chunk-length 64 \
  --cnn-module-kernel 31 \
  --left-context-length 64 \
  --right-context-length 16 \
  --memory-size 32
```

The tensorboard log can be found at
<https://tensorboard.dev/experiment/eRx6XwbOQhGlywgD8lWBjw/>

The simulated streaming decoding command using greedy search is:
```bash
./conv_emformer_transducer_stateless2/decode.py \
  --epoch 30 \
  --avg 10 \
  --exp-dir conv_emformer_transducer_stateless2/exp \
  --max-duration 300 \
  --num-encoder-layers 12 \
  --chunk-length 64 \
  --cnn-module-kernel 31 \
  --left-context-length 64 \
  --right-context-length 16 \
  --memory-size 32 \
  --decoding-method greedy_search \
  --use-averaged-model True
```

The simulated streaming decoding command using fast beam search is:
```bash
./conv_emformer_transducer_stateless2/decode.py \
  --epoch 30 \
  --avg 10 \
  --exp-dir conv_emformer_transducer_stateless2/exp \
  --max-duration 300 \
  --num-encoder-layers 12 \
  --chunk-length 64 \
  --cnn-module-kernel 31 \
  --left-context-length 64 \
  --right-context-length 16 \
  --memory-size 32 \
  --decoding-method fast_beam_search \
  --use-averaged-model True \
  --beam 4 \
  --max-contexts 4 \
  --max-states 8
```

The simulated streaming decoding command using modified beam search is:
```bash
./conv_emformer_transducer_stateless2/decode.py \
  --epoch 30 \
  --avg 10 \
  --exp-dir conv_emformer_transducer_stateless2/exp \
  --max-duration 300 \
  --num-encoder-layers 12 \
  --chunk-length 64 \
  --cnn-module-kernel 31 \
  --left-context-length 64 \
  --right-context-length 16 \
  --memory-size 32 \
  --decoding-method modified_beam_search \
  --use-averaged-model True \
  --beam-size 4
```

The streaming decoding command using greedy search is:
```bash
./conv_emformer_transducer_stateless2/streaming_decode.py \
  --epoch 30 \
  --avg 10 \
  --exp-dir conv_emformer_transducer_stateless2/exp \
  --num-decode-streams 2000 \
  --num-encoder-layers 12 \
  --chunk-length 64 \
  --cnn-module-kernel 31 \
  --left-context-length 64 \
  --right-context-length 16 \
  --memory-size 32 \
  --decoding-method greedy_search \
  --use-averaged-model True
```

The streaming decoding command using fast beam search is:
```bash
./conv_emformer_transducer_stateless2/streaming_decode.py \
  --epoch 30 \
  --avg 10 \
  --exp-dir conv_emformer_transducer_stateless2/exp \
  --num-decode-streams 2000 \
  --num-encoder-layers 12 \
  --chunk-length 64 \
  --cnn-module-kernel 31 \
  --left-context-length 64 \
  --right-context-length 16 \
  --memory-size 32 \
  --decoding-method fast_beam_search \
  --use-averaged-model True \
  --beam 4 \
  --max-contexts 4 \
  --max-states 8
```

The streaming decoding command using modified beam search is:
```bash
./conv_emformer_transducer_stateless2/streaming_decode.py \
  --epoch 30 \
  --avg 10 \
  --exp-dir conv_emformer_transducer_stateless2/exp \
  --num-decode-streams 2000 \
  --num-encoder-layers 12 \
  --chunk-length 64 \
  --cnn-module-kernel 31 \
  --left-context-length 64 \
  --right-context-length 16 \
  --memory-size 32 \
  --decoding-method modified_beam_search \
  --use-averaged-model True \
  --beam-size 4
```

Pretrained models, training logs, decoding logs, and decoding results
are available at
<https://huggingface.co/Zengwei/icefall-asr-librispeech-conv-emformer-transducer-stateless2-larger-latency-2022-07-06>

### LibriSpeech BPE training results (Pruned Stateless Streaming Conformer RNN-T)

#### [pruned_transducer_stateless](./pruned_transducer_stateless)