mirror of
https://github.com/k2-fsa/icefall.git
synced 2025-09-19 05:54:20 +00:00
remove changes in librispeech
This commit is contained in:
parent
494e88bcb7
commit
93a5c878f1
@ -26,6 +26,7 @@ The following table lists the differences among them.
| `pruned_transducer_stateless7_ctc` | Zipformer | Embedding + Conv1d | Same as pruned_transducer_stateless7, but with extra CTC head|
| `pruned_transducer_stateless7_ctc_bs` | Zipformer | Embedding + Conv1d | pruned_transducer_stateless7_ctc + blank skip |
| `pruned_transducer_stateless7_streaming` | Streaming Zipformer | Embedding + Conv1d | streaming version of pruned_transducer_stateless7 |
| `pruned_transducer_stateless7_streaming_multi` | Streaming Zipformer | Embedding + Conv1d | same as pruned_transducer_stateless7_streaming, trained on LibriSpeech + GigaSpeech |
| `pruned_transducer_stateless8` | Zipformer | Embedding + Conv1d | Same as pruned_transducer_stateless7, but using extra data from GigaSpeech|
| `pruned_stateless_emformer_rnnt2` | Emformer(from torchaudio) | Embedding + Conv1d | Using Emformer from torchaudio for streaming ASR|
| `conv_emformer_transducer_stateless` | ConvEmformer | Embedding + Conv1d | Using ConvEmformer for streaming ASR + mechanisms in reworked model |
@ -33,6 +34,7 @@ The following table lists the differences among them.
| `lstm_transducer_stateless` | LSTM | Embedding + Conv1d | Using LSTM with mechanisms in reworked model |
| `lstm_transducer_stateless2` | LSTM | Embedding + Conv1d | Using LSTM with mechanisms in reworked model + gigaspeech (multi-dataset setup) |
| `lstm_transducer_stateless3` | LSTM | Embedding + Conv1d | Using LSTM with mechanisms in reworked model + gradient filter + delay penalty |
| `zipformer` | Upgraded Zipformer | Embedding + Conv1d | The latest recipe |

The decoder in `transducer_stateless` is modified from the paper
[Rnn-Transducer with Stateless Prediction Network](https://ieeexplore.ieee.org/document/9054419/).

@ -1,5 +1,537 @@
## Results

### zipformer (zipformer + pruned stateless transducer)

See <https://github.com/k2-fsa/icefall/pull/1058> for more details.

[zipformer](./zipformer)

#### Non-streaming

##### normal-scaled model, number of model parameters: 65549011, i.e., 65.55 M

The tensorboard log can be found at
<https://tensorboard.dev/experiment/cBaoIabCQxSDsyZM7FzqZA/>

You can find a pretrained model, training logs, decoding logs, and decoding results at:
<https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15>

You can use <https://github.com/k2-fsa/sherpa> to deploy it.

| decoding method      | test-clean | test-other | comment             |
|----------------------|------------|------------|---------------------|
| greedy_search        | 2.27       | 5.1        | --epoch 30 --avg 9  |
| modified_beam_search | 2.25       | 5.06       | --epoch 30 --avg 9  |
| fast_beam_search     | 2.25       | 5.04       | --epoch 30 --avg 9  |
| greedy_search        | 2.23       | 4.96       | --epoch 40 --avg 16 |
| modified_beam_search | 2.21       | 4.91       | --epoch 40 --avg 16 |
| fast_beam_search     | 2.24       | 4.93       | --epoch 40 --avg 16 |

The training command is:
```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"
./zipformer/train.py \
  --world-size 4 \
  --num-epochs 40 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp \
  --causal 0 \
  --full-libri 1 \
  --max-duration 1000
```

The decoding command is:
```bash
export CUDA_VISIBLE_DEVICES="0"
for m in greedy_search modified_beam_search fast_beam_search; do
  ./zipformer/decode.py \
    --epoch 30 \
    --avg 9 \
    --use-averaged-model 1 \
    --exp-dir ./zipformer/exp \
    --max-duration 600 \
    --decoding-method $m
done
```

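The decoding command for the normal-scaled model is written for `--epoch 30 --avg 9`, while the results table also lists `--epoch 40 --avg 16`. The following is a minimal dry-run sketch that only prints the command for both checkpoint settings, so the flags can be reviewed before anything is launched. The helper name `print_decode_cmd` is hypothetical and not part of icefall; all flags are copied from the command above.

```bash
#!/usr/bin/env bash
# Dry run: print (do not execute) the decode command for every
# checkpoint setting and decoding method listed in the results table.

print_decode_cmd() {
  # $1: epoch, $2: number of averaged checkpoints, $3: decoding method
  echo "./zipformer/decode.py --epoch $1 --avg $2 --use-averaged-model 1" \
    "--exp-dir ./zipformer/exp --max-duration 600 --decoding-method $3"
}

# (epoch, avg) pairs taken from the comment column of the table
for cfg in "30 9" "40 16"; do
  read -r epoch avg <<< "$cfg"
  for m in greedy_search modified_beam_search fast_beam_search; do
    print_decode_cmd "$epoch" "$avg" "$m"
  done
done
```

Once the printed commands look right, each line can be run as-is from the recipe directory.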
##### small-scaled model, number of model parameters: 23285615, i.e., 23.3 M

The tensorboard log can be found at
<https://tensorboard.dev/experiment/53P4tL22TpO0UdiL0kPaLg/>

You can find a pretrained model, training logs, decoding logs, and decoding results at:
<https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-small-2023-05-16>

You can use <https://github.com/k2-fsa/sherpa> to deploy it.

| decoding method      | test-clean | test-other | comment             |
|----------------------|------------|------------|---------------------|
| greedy_search        | 2.64       | 6.14       | --epoch 30 --avg 8  |
| modified_beam_search | 2.6        | 6.01       | --epoch 30 --avg 8  |
| fast_beam_search     | 2.62       | 6.06       | --epoch 30 --avg 8  |
| greedy_search        | 2.49       | 5.91       | --epoch 40 --avg 13 |
| modified_beam_search | 2.46       | 5.83       | --epoch 40 --avg 13 |
| fast_beam_search     | 2.46       | 5.87       | --epoch 40 --avg 13 |

The training command is:
```bash
export CUDA_VISIBLE_DEVICES="0,1"
./zipformer/train.py \
  --world-size 2 \
  --num-epochs 40 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp-small \
  --causal 0 \
  --num-encoder-layers 2,2,2,2,2,2 \
  --feedforward-dim 512,768,768,768,768,768 \
  --encoder-dim 192,256,256,256,256,256 \
  --encoder-unmasked-dim 192,192,192,192,192,192 \
  --base-lr 0.04 \
  --full-libri 1 \
  --max-duration 1500
```

The decoding command is:
```bash
export CUDA_VISIBLE_DEVICES="0"
for m in greedy_search modified_beam_search fast_beam_search; do
  ./zipformer/decode.py \
    --epoch 40 \
    --avg 13 \
    --exp-dir zipformer/exp-small \
    --max-duration 600 \
    --causal 0 \
    --decoding-method $m \
    --num-encoder-layers 2,2,2,2,2,2 \
    --feedforward-dim 512,768,768,768,768,768 \
    --encoder-dim 192,256,256,256,256,256 \
    --encoder-unmasked-dim 192,192,192,192,192,192
done
```

##### large-scaled model, number of model parameters: 148439574, i.e., 148.4 M

The tensorboard log can be found at
<https://tensorboard.dev/experiment/HJ74wWYpQAGSzETkmQnrmQ/>

You can find a pretrained model, training logs, decoding logs, and decoding results at:
<https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-large-2023-05-16>

You can use <https://github.com/k2-fsa/sherpa> to deploy it.

| decoding method      | test-clean | test-other | comment             |
|----------------------|------------|------------|---------------------|
| greedy_search        | 2.12       | 4.91       | --epoch 30 --avg 9  |
| modified_beam_search | 2.11       | 4.9        | --epoch 30 --avg 9  |
| fast_beam_search     | 2.13       | 4.93       | --epoch 30 --avg 9  |
| greedy_search        | 2.12       | 4.8        | --epoch 40 --avg 13 |
| modified_beam_search | 2.11       | 4.7        | --epoch 40 --avg 13 |
| fast_beam_search     | 2.13       | 4.78       | --epoch 40 --avg 13 |

The training command is:
```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"
./zipformer/train.py \
  --world-size 4 \
  --num-epochs 40 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp-large \
  --causal 0 \
  --num-encoder-layers 2,2,4,5,4,2 \
  --feedforward-dim 512,768,1536,2048,1536,768 \
  --encoder-dim 192,256,512,768,512,256 \
  --encoder-unmasked-dim 192,192,256,320,256,192 \
  --full-libri 1 \
  --max-duration 1000
```

The decoding command is:
```bash
export CUDA_VISIBLE_DEVICES="0"
for m in greedy_search modified_beam_search fast_beam_search; do
  ./zipformer/decode.py \
    --epoch 40 \
    --avg 16 \
    --exp-dir zipformer/exp-large \
    --max-duration 600 \
    --causal 0 \
    --decoding-method $m \
    --num-encoder-layers 2,2,4,5,4,2 \
    --feedforward-dim 512,768,1536,2048,1536,768 \
    --encoder-dim 192,256,512,768,512,256 \
    --encoder-unmasked-dim 192,192,256,320,256,192
done
```

#### Streaming

##### normal-scaled model, number of model parameters: 66110931, i.e., 66.11 M

The tensorboard log can be found at
<https://tensorboard.dev/experiment/9rD0i6rMSWq1O61poWi71A>

You can find a pretrained model, training logs, decoding logs, and decoding results at:
<https://huggingface.co/Zengwei/icefall-asr-librispeech-streaming-zipformer-2023-05-17>

You can use <https://github.com/k2-fsa/sherpa> to deploy it.

| decoding method      | chunk size | test-clean | test-other | decoding mode       | comment |
|----------------------|------------|------------|------------|---------------------|---------|
| greedy_search        | 320ms      | 3.06       | 7.81       | simulated streaming | --epoch 30 --avg 8 --chunk-size 16 --left-context-frames 128 |
| greedy_search        | 320ms      | 3.06       | 7.79       | chunk-wise          | --epoch 30 --avg 8 --chunk-size 16 --left-context-frames 128 |
| modified_beam_search | 320ms      | 3.01       | 7.69       | simulated streaming | --epoch 30 --avg 8 --chunk-size 16 --left-context-frames 128 |
| modified_beam_search | 320ms      | 3.05       | 7.69       | chunk-wise          | --epoch 30 --avg 8 --chunk-size 16 --left-context-frames 128 |
| fast_beam_search     | 320ms      | 3.04       | 7.68       | simulated streaming | --epoch 30 --avg 8 --chunk-size 16 --left-context-frames 128 |
| fast_beam_search     | 320ms      | 3.07       | 7.69       | chunk-wise          | --epoch 30 --avg 8 --chunk-size 16 --left-context-frames 128 |
| greedy_search        | 640ms      | 2.81       | 7.15       | simulated streaming | --epoch 30 --avg 8 --chunk-size 32 --left-context-frames 256 |
| greedy_search        | 640ms      | 2.84       | 7.16       | chunk-wise          | --epoch 30 --avg 8 --chunk-size 32 --left-context-frames 256 |
| modified_beam_search | 640ms      | 2.79       | 7.05       | simulated streaming | --epoch 30 --avg 8 --chunk-size 32 --left-context-frames 256 |
| modified_beam_search | 640ms      | 2.81       | 7.11       | chunk-wise          | --epoch 30 --avg 8 --chunk-size 32 --left-context-frames 256 |
| fast_beam_search     | 640ms      | 2.84       | 7.04       | simulated streaming | --epoch 30 --avg 8 --chunk-size 32 --left-context-frames 256 |
| fast_beam_search     | 640ms      | 2.83       | 7.1        | chunk-wise          | --epoch 30 --avg 8 --chunk-size 32 --left-context-frames 256 |

Note: For decoding mode, `simulated streaming` indicates feeding the full utterance during decoding using `decode.py`,
while `chunk-wise` indicates feeding a certain number of frames at a time using `streaming_decode.py`.

The training command is:
```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"
./zipformer/train.py \
  --world-size 4 \
  --num-epochs 40 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp-causal \
  --causal 1 \
  --full-libri 1 \
  --max-duration 1000
```

The simulated streaming decoding command is:
```bash
export CUDA_VISIBLE_DEVICES="0"
for m in greedy_search modified_beam_search fast_beam_search; do
  ./zipformer/decode.py \
    --epoch 30 \
    --avg 8 \
    --use-averaged-model 1 \
    --exp-dir ./zipformer/exp-causal \
    --causal 1 \
    --chunk-size 16 \
    --left-context-frames 128 \
    --max-duration 600 \
    --decoding-method $m
done
```

The chunk-wise streaming decoding command is:
```bash
export CUDA_VISIBLE_DEVICES="0"
for m in greedy_search modified_beam_search fast_beam_search; do
  ./zipformer/streaming_decode.py \
    --epoch 30 \
    --avg 8 \
    --use-averaged-model 1 \
    --exp-dir ./zipformer/exp-causal \
    --causal 1 \
    --chunk-size 16 \
    --left-context-frames 128 \
    --num-decode-streams 2000 \
    --decoding-method $m
done
```

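The two streaming decoding commands above are written for the 320ms setting only, while the results table also reports 640ms (`--chunk-size 32 --left-context-frames 256`). The sketch below is a dry run that prints the chunk-wise command for both settings together with the chunk duration. It assumes one frame covers 20 ms, which is inferred from the table (16 frames for 320ms, 32 frames for 640ms) rather than stated in the source; the helper names `chunk_ms` and `print_streaming_decode_cmd` are hypothetical.

```bash
#!/usr/bin/env bash
# Dry run: print the chunk-wise decoding command for both chunk settings
# listed in the results table. Nothing is executed.

# Assumption: 20 ms per frame, inferred from the table
# (16 frames <-> 320ms, 32 frames <-> 640ms).
FRAME_MS=20

chunk_ms() {
  # $1: chunk size in frames; prints the chunk duration in milliseconds
  echo $(( $1 * FRAME_MS ))
}

print_streaming_decode_cmd() {
  # $1: chunk size in frames, $2: left-context frames, $3: decoding method
  echo "./zipformer/streaming_decode.py" \
    "--epoch 30 --avg 8 --use-averaged-model 1" \
    "--exp-dir ./zipformer/exp-causal --causal 1" \
    "--chunk-size $1 --left-context-frames $2" \
    "--num-decode-streams 2000 --decoding-method $3"
}

# (chunk size, left context) pairs taken from the comment column of the table
for cfg in "16 128" "32 256"; do
  read -r chunk left <<< "$cfg"
  echo "# chunk size $chunk frames = $(chunk_ms "$chunk")ms"
  for m in greedy_search modified_beam_search fast_beam_search; do
    print_streaming_decode_cmd "$chunk" "$left" "$m"
  done
done
```

Swapping `streaming_decode.py` for `decode.py` (and `--num-decode-streams 2000` for `--max-duration 600`) in the helper gives the corresponding simulated streaming commands.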
### pruned_transducer_stateless7 (Fine-tune with mux)

See <https://github.com/k2-fsa/icefall/pull/1059> for more details.

[pruned_transducer_stateless7](./pruned_transducer_stateless7)

The tensorboard log can be found at
<https://tensorboard.dev/experiment/MaNDZfO7RzW2Czzf3R2ZRA/>

You can find the pretrained model and bpe model needed for fine-tuning at:
<https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless7-2022-11-11>

You can find a fine-tuned model, fine-tuning logs, decoding logs, and decoding
results at:
<https://huggingface.co/yfyeung/icefall-asr-finetune-mux-pruned_transducer_stateless7-2023-05-19>

You can use <https://github.com/k2-fsa/sherpa> to deploy it.

Number of model parameters: 70369391, i.e., 70.37 M

| decoding method      | dev        | test       | test-clean | test-other | comment            |
|----------------------|------------|------------|------------|------------|--------------------|
| greedy_search        | 14.27      | 14.22      | 2.08       | 4.79       | --epoch 20 --avg 5 |
| modified_beam_search | 14.22      | 14.08      | 2.06       | 4.72       | --epoch 20 --avg 5 |
| fast_beam_search     | 14.23      | 14.17      | 2.08       | 4.09       | --epoch 20 --avg 5 |

The training commands are:
```bash
export CUDA_VISIBLE_DEVICES="0,1"

./pruned_transducer_stateless7/finetune.py \
  --world-size 2 \
  --num-epochs 20 \
  --start-epoch 1 \
  --exp-dir pruned_transducer_stateless7/exp_giga_finetune \
  --subset S \
  --use-fp16 1 \
  --base-lr 0.005 \
  --lr-epochs 100 \
  --lr-batches 100000 \
  --bpe-model icefall-asr-librispeech-pruned-transducer-stateless7-2022-11-11/data/lang_bpe_500/bpe.model \
  --do-finetune True \
  --use-mux True \
  --finetune-ckpt icefall-asr-librispeech-pruned-transducer-stateless7-2022-11-11/exp/pretrain.pt \
  --max-duration 500
```

The decoding commands are:
```bash
# greedy_search
./pruned_transducer_stateless7/decode.py \
  --epoch 20 \
  --avg 5 \
  --use-averaged-model 1 \
  --exp-dir ./pruned_transducer_stateless7/exp_giga_finetune \
  --max-duration 600 \
  --decoding-method greedy_search

# modified_beam_search
./pruned_transducer_stateless7/decode.py \
  --epoch 20 \
  --avg 5 \
  --use-averaged-model 1 \
  --exp-dir ./pruned_transducer_stateless7/exp_giga_finetune \
  --max-duration 600 \
  --decoding-method modified_beam_search \
  --beam-size 4

# fast_beam_search
./pruned_transducer_stateless7/decode.py \
  --epoch 20 \
  --avg 5 \
  --use-averaged-model 1 \
  --exp-dir ./pruned_transducer_stateless7/exp_giga_finetune \
  --max-duration 600 \
  --decoding-method fast_beam_search \
  --beam 20.0 \
  --max-contexts 8 \
  --max-states 64
```

### pruned_transducer_stateless7 (zipformer + multidataset(LibriSpeech + GigaSpeech + CommonVoice 13.0))

See <https://github.com/k2-fsa/icefall/pull/1010> for more details.

[pruned_transducer_stateless7](./pruned_transducer_stateless7)

The tensorboard log can be found at
<https://tensorboard.dev/experiment/SwdJoHgZSZWn8ph9aJLb8g/>

You can find a pretrained model, training logs, decoding logs, and decoding
results at:
<https://huggingface.co/yfyeung/icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04>

You can use <https://github.com/k2-fsa/sherpa> to deploy it.

Number of model parameters: 70369391, i.e., 70.37 M

| decoding method      | test-clean | test-other | comment            |
|----------------------|------------|------------|--------------------|
| greedy_search        | 1.91       | 4.06       | --epoch 30 --avg 7 |
| modified_beam_search | 1.90       | 3.99       | --epoch 30 --avg 7 |
| fast_beam_search     | 1.90       | 3.98       | --epoch 30 --avg 7 |

The training commands are:
```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"

./pruned_transducer_stateless7/train.py \
  --world-size 8 \
  --num-epochs 30 \
  --use-multidataset 1 \
  --use-fp16 1 \
  --max-duration 750 \
  --exp-dir pruned_transducer_stateless7/exp
```

The decoding commands are:
```bash
# greedy_search
./pruned_transducer_stateless7/decode.py \
  --epoch 30 \
  --avg 7 \
  --use-averaged-model 1 \
  --exp-dir ./pruned_transducer_stateless7/exp \
  --max-duration 600 \
  --decoding-method greedy_search

# modified_beam_search
./pruned_transducer_stateless7/decode.py \
  --epoch 30 \
  --avg 7 \
  --use-averaged-model 1 \
  --exp-dir ./pruned_transducer_stateless7/exp \
  --max-duration 600 \
  --decoding-method modified_beam_search \
  --beam-size 4

# fast_beam_search
./pruned_transducer_stateless7/decode.py \
  --epoch 30 \
  --avg 7 \
  --use-averaged-model 1 \
  --exp-dir ./pruned_transducer_stateless7/exp \
  --max-duration 600 \
  --decoding-method fast_beam_search \
  --beam 20.0 \
  --max-contexts 8 \
  --max-states 64
```

### Streaming Zipformer-Transducer (Pruned Stateless Transducer + Streaming Zipformer + Multi-Dataset)

#### [pruned_transducer_stateless7_streaming_multi](./pruned_transducer_stateless7_streaming_multi)

See <https://github.com/k2-fsa/icefall/pull/984> for more details.

You can find a pretrained model, training logs, decoding logs, and decoding
results at: <https://huggingface.co/marcoyang/icefall-libri-giga-pruned-transducer-stateless7-streaming-2023-04-04>

Number of model parameters: 70369391, i.e., 70.37 M

##### training on full librispeech + full gigaspeech (with giga_prob=0.9)

The WERs are:

| decoding method      | chunk size | test-clean | test-other | comment             | decoding mode        |
|----------------------|------------|------------|------------|---------------------|----------------------|
| greedy search        | 320ms      | 2.43       | 6.0        | --epoch 20 --avg 4  | simulated streaming  |
| greedy search        | 320ms      | 2.47       | 6.13       | --epoch 20 --avg 4  | chunk-wise           |
| fast beam search     | 320ms      | 2.43       | 5.99       | --epoch 20 --avg 4  | simulated streaming  |
| fast beam search     | 320ms      | 2.8        | 6.46       | --epoch 20 --avg 4  | chunk-wise           |
| modified beam search | 320ms      | 2.4        | 5.96       | --epoch 20 --avg 4  | simulated streaming  |
| modified beam search | 320ms      | 2.42       | 6.03       | --epoch 20 --avg 4  | chunk-wise           |
| greedy search        | 640ms      | 2.26       | 5.58       | --epoch 20 --avg 4  | simulated streaming  |
| greedy search        | 640ms      | 2.33       | 5.76       | --epoch 20 --avg 4  | chunk-wise           |
| fast beam search     | 640ms      | 2.27       | 5.54       | --epoch 20 --avg 4  | simulated streaming  |
| fast beam search     | 640ms      | 2.37       | 5.75       | --epoch 20 --avg 4  | chunk-wise           |
| modified beam search | 640ms      | 2.22       | 5.5        | --epoch 20 --avg 4  | simulated streaming  |
| modified beam search | 640ms      | 2.25       | 5.69       | --epoch 20 --avg 4  | chunk-wise           |

The model also has good WERs on GigaSpeech. The following WERs are achieved on GigaSpeech test and dev sets:

| decoding method      | chunk size | dev   | test  | comment             | decoding mode       |
|----------------------|------------|-------|-------|---------------------|---------------------|
| greedy search        | 320ms      | 12.08 | 11.98 | --epoch 20 --avg 4  | simulated streaming |
| greedy search        | 640ms      | 11.66 | 11.71 | --epoch 20 --avg 4  | simulated streaming |
| modified beam search | 320ms      | 11.95 | 11.83 | --epoch 20 --avg 4  | simulated streaming |
| modified beam search | 640ms      | 11.65 | 11.56 | --epoch 20 --avg 4  | simulated streaming |

Note: `simulated streaming` indicates feeding the full utterance during decoding using `decode.py`,
while `chunk-wise` indicates feeding a certain number of frames at a time using `streaming_decode.py`.

The training command is:

```bash
./pruned_transducer_stateless7_streaming_multi/train.py \
  --world-size 4 \
  --num-epochs 20 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir pruned_transducer_stateless7_streaming_multi/exp \
  --full-libri 1 \
  --giga-prob 0.9 \
  --max-duration 750 \
  --master-port 12345
```

The tensorboard log can be found at
<https://tensorboard.dev/experiment/G4yDMLXGQXexf41i4MA2Tg/#scalars>

The simulated streaming decoding command (e.g., chunk-size=320ms) is:
```bash
for m in greedy_search fast_beam_search modified_beam_search; do
  ./pruned_transducer_stateless7_streaming_multi/decode.py \
    --epoch 20 \
    --avg 4 \
    --exp-dir ./pruned_transducer_stateless7_streaming_multi/exp \
    --max-duration 600 \
    --decode-chunk-len 32 \
    --right-padding 64 \
    --decoding-method $m
done
```

The chunk-wise streaming decoding command (e.g., chunk-size=320ms) is:
```bash
for m in greedy_search modified_beam_search fast_beam_search; do
  ./pruned_transducer_stateless7_streaming_multi/streaming_decode.py \
    --epoch 20 \
    --avg 4 \
    --exp-dir ./pruned_transducer_stateless7_streaming_multi/exp \
    --decoding-method $m \
    --decode-chunk-len 32 \
    --num-decode-streams 2000
done
```

#### Smaller model

We also provide a very small version (only 6.1M parameters) of this setup. The training command for the small model is:

```bash
./pruned_transducer_stateless7_streaming_multi/train.py \
  --world-size 4 \
  --num-epochs 30 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir pruned_transducer_stateless7_streaming_multi/exp \
  --full-libri 1 \
  --giga-prob 0.9 \
  --num-encoder-layers "2,2,2,2,2" \
  --feedforward-dims "256,256,512,512,256" \
  --nhead "4,4,4,4,4" \
  --encoder-dims "128,128,128,128,128" \
  --attention-dims "96,96,96,96,96" \
  --encoder-unmasked-dims "96,96,96,96,96" \
  --max-duration 1200 \
  --master-port 12345
```

You can find this pretrained small model and its training logs, decoding logs, and decoding
results at:
<https://huggingface.co/marcoyang/icefall-libri-giga-pruned-transducer-stateless7-streaming-6M-2023-04-03>

| decoding method      | chunk size | test-clean | test-other | comment             | decoding mode        |
|----------------------|------------|------------|------------|---------------------|----------------------|
| greedy search        | 320ms      | 5.95       | 15.03      | --epoch 30 --avg 1  | simulated streaming  |
| greedy search        | 640ms      | 5.61       | 13.86      | --epoch 30 --avg 1  | simulated streaming  |
| modified beam search | 320ms      | 5.72       | 14.34      | --epoch 30 --avg 1  | simulated streaming  |
| modified beam search | 640ms      | 5.43       | 13.16      | --epoch 30 --avg 1  | simulated streaming  |
| fast beam search     | 320ms      | 5.88       | 14.45      | --epoch 30 --avg 1  | simulated streaming  |
| fast beam search     | 640ms      | 5.48       | 13.31      | --epoch 30 --avg 1  | simulated streaming  |

This small model achieves the following WERs on GigaSpeech test and dev sets:

| decoding method      | chunk size | dev        | test       | comment             | decoding mode        |
|----------------------|------------|------------|------------|---------------------|----------------------|
| greedy search        | 320ms      | 17.57      | 17.2       | --epoch 30 --avg 1  | simulated streaming  |
| modified beam search | 320ms      | 16.98      | 11.98      | --epoch 30 --avg 1  | simulated streaming  |

You can find the tensorboard logs at <https://tensorboard.dev/experiment/tAc5iXxTQrCQxky5O5OLyw/#scalars>.

### Streaming Zipformer-Transducer (Pruned Stateless Transducer + Streaming Zipformer)

#### [pruned_transducer_stateless7_streaming](./pruned_transducer_stateless7_streaming)
@ -53,7 +585,7 @@ The tensorboard log can be found at
 The simulated streaming decoding command (e.g., chunk-size=320ms) is:
 ```bash
-for $m in greedy_search fast_beam_search modified_beam_search; do
+for m in greedy_search fast_beam_search modified_beam_search; do
   ./pruned_transducer_stateless7_streaming/decode.py \
     --epoch 30 \
     --avg 9 \
@ -76,6 +608,90 @@ for m in greedy_search modified_beam_search fast_beam_search; do
    --num-decode-streams 2000
done
```

We also support decoding with neural network LMs. After combining with language models, the WERs are:

| decoding method      | chunk size | test-clean | test-other | comment             | decoding mode        |
|----------------------|------------|------------|------------|---------------------|----------------------|
| `modified_beam_search` | 320ms | 3.11 | 7.93 | --epoch 30 --avg 9 | simulated streaming |
| `modified_beam_search_lm_shallow_fusion` | 320ms | 2.58 | 6.65 | --epoch 30 --avg 9 | simulated streaming |
| `modified_beam_search_lm_rescore` | 320ms | 2.59 | 6.86 | --epoch 30 --avg 9 | simulated streaming |
| `modified_beam_search_lm_rescore_LODR` | 320ms | 2.52 | 6.73 | --epoch 30 --avg 9 | simulated streaming |

Please use the following command for `modified_beam_search_lm_shallow_fusion`:
```bash
for lm_scale in $(seq 0.15 0.01 0.38); do
  for beam_size in 4 8 12; do
    ./pruned_transducer_stateless7_streaming/decode.py \
      --epoch 99 \
      --avg 1 \
      --use-averaged-model False \
      --beam-size $beam_size \
      --exp-dir ./pruned_transducer_stateless7_streaming/exp-large-LM \
      --max-duration 600 \
      --decode-chunk-len 32 \
      --decoding-method modified_beam_search_lm_shallow_fusion \
      --use-shallow-fusion 1 \
      --lm-type rnn \
      --lm-exp-dir rnn_lm/exp \
      --lm-epoch 99 \
      --lm-scale $lm_scale \
      --lm-avg 1 \
      --rnn-lm-embedding-dim 2048 \
      --rnn-lm-hidden-dim 2048 \
      --rnn-lm-num-layers 3 \
      --lm-vocab-size 500
  done
done
```

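The shallow-fusion command above is a grid sweep over `lm_scale` and `beam_size`, so it launches one full decoding run per combination. The following dry-run sketch enumerates the same grid without decoding anything, which makes the cost of the sweep visible up front. As an assumption for illustration, it generates the LM scales from integers (15 to 38, formatted as `0.15` to `0.38`) instead of calling `seq` with floating-point arguments, so that the count does not depend on floating-point rounding.

```bash
#!/usr/bin/env bash
# Dry run of the shallow-fusion sweep: list every (lm_scale, beam_size)
# combination and count the resulting decoding runs.

count=0
for i in $(seq 15 38); do
  # Format the integer as a two-digit fraction: 15 -> 0.15, 38 -> 0.38
  lm_scale=$(printf '0.%02d' "$i")
  for beam_size in 4 8 12; do
    count=$((count + 1))
    echo "run $count: --lm-scale $lm_scale --beam-size $beam_size"
  done
done

# 24 LM scales x 3 beam sizes
echo "total decoding runs: $count"
```

With 24 scales and 3 beam sizes the sweep amounts to 72 decoding runs, so narrowing the `lm_scale` range first may be worthwhile.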
Please use the following command for `modified_beam_search_lm_rescore`:
```bash
./pruned_transducer_stateless7_streaming/decode.py \
  --epoch 30 \
  --avg 9 \
  --use-averaged-model True \
  --beam-size 8 \
  --exp-dir ./pruned_transducer_stateless7_streaming/exp \
  --max-duration 600 \
  --decode-chunk-len 32 \
  --decoding-method modified_beam_search_lm_rescore \
  --use-shallow-fusion 0 \
  --lm-type rnn \
  --lm-exp-dir rnn_lm/exp \
  --lm-epoch 99 \
  --lm-avg 1 \
  --rnn-lm-embedding-dim 2048 \
  --rnn-lm-hidden-dim 2048 \
  --rnn-lm-num-layers 3 \
  --lm-vocab-size 500
```

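To put the LM results in this section into perspective, the relative WER reduction over plain `modified_beam_search` can be computed from the numbers in the table. The snippet below is a small sketch of that arithmetic; the helper name `rel_reduction` is hypothetical, and the WER values are the `modified_beam_search` and `modified_beam_search_lm_shallow_fusion` rows of the table.

```bash
#!/usr/bin/env bash
# Relative WER reduction in percent: (baseline - new) / baseline * 100

rel_reduction() {
  # $1: baseline WER, $2: WER after adding the LM
  awk -v base="$1" -v new="$2" \
    'BEGIN { printf "%.1f\n", (base - new) / base * 100 }'
}

# modified_beam_search -> modified_beam_search_lm_shallow_fusion
rel_reduction 3.11 2.58   # test-clean, prints 17.0
rel_reduction 7.93 6.65   # test-other, prints 16.1
```

The same function can be applied to the rescoring and LODR rows to compare the three LM integration methods on an equal footing.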
Please use the following command for `modified_beam_search_lm_rescore_LODR`:
```bash
./pruned_transducer_stateless7_streaming/decode.py \
  --epoch 30 \
  --avg 9 \
  --use-averaged-model True \
  --beam-size 8 \
  --exp-dir ./pruned_transducer_stateless7_streaming/exp \
  --max-duration 600 \
  --decode-chunk-len 32 \
  --decoding-method modified_beam_search_lm_rescore_LODR \
  --use-shallow-fusion 0 \
  --lm-type rnn \
  --lm-exp-dir rnn_lm/exp \
  --lm-epoch 99 \
  --lm-avg 1 \
  --rnn-lm-embedding-dim 2048 \
  --rnn-lm-hidden-dim 2048 \
  --rnn-lm-num-layers 3 \
  --lm-vocab-size 500 \
  --tokens-ngram 2 \
  --backoff-id 500
```

A well-trained RNNLM can be found here: <https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm/tree/main>. The bi-gram used in LODR decoding
can be found here: <https://huggingface.co/marcoyang/librispeech_bigram>.

#### Smaller model

@ -540,6 +1156,10 @@ for m in greedy_search fast_beam_search modified_beam_search ; do
done
```

Note that a small change is made to the `pruned_transducer_stateless7/decoder.py` in
this [PR](https://github.com/k2-fsa/icefall/pull/942) to address the
problem of emitting the first symbol at the very beginning. If you need a
model without this issue, please download the model from here: <https://huggingface.co/marcoyang/icefall-asr-librispeech-pruned-transducer-stateless7-2023-03-10>

### LibriSpeech BPE training results (Pruned Stateless LSTM RNN-T + gradient filter)