mirror of <https://github.com/k2-fsa/icefall.git>, synced 2025-12-11 06:55:27 +00:00

commit 93a5c878f1 (parent 494e88bcb7): "remove changes in librispeech"
The following table lists the differences among them.

| `pruned_transducer_stateless7_ctc` | Zipformer | Embedding + Conv1d | Same as pruned_transducer_stateless7, but with extra CTC head |
| `pruned_transducer_stateless7_ctc_bs` | Zipformer | Embedding + Conv1d | pruned_transducer_stateless7_ctc + blank skip |
| `pruned_transducer_stateless7_streaming` | Streaming Zipformer | Embedding + Conv1d | Streaming version of pruned_transducer_stateless7 |
| `pruned_transducer_stateless7_streaming_multi` | Streaming Zipformer | Embedding + Conv1d | Same as pruned_transducer_stateless7_streaming, trained on LibriSpeech + GigaSpeech |
| `pruned_transducer_stateless8` | Zipformer | Embedding + Conv1d | Same as pruned_transducer_stateless7, but using extra data from GigaSpeech |
| `pruned_stateless_emformer_rnnt2` | Emformer (from torchaudio) | Embedding + Conv1d | Using Emformer from torchaudio for streaming ASR |
| `conv_emformer_transducer_stateless` | ConvEmformer | Embedding + Conv1d | Using ConvEmformer for streaming ASR + mechanisms in reworked model |
| `lstm_transducer_stateless` | LSTM | Embedding + Conv1d | Using LSTM with mechanisms in reworked model |
| `lstm_transducer_stateless2` | LSTM | Embedding + Conv1d | Using LSTM with mechanisms in reworked model + GigaSpeech (multi-dataset setup) |
| `lstm_transducer_stateless3` | LSTM | Embedding + Conv1d | Using LSTM with mechanisms in reworked model + gradient filter + delay penalty |
| `zipformer` | Upgraded Zipformer | Embedding + Conv1d | The latest recipe |

The decoder in `transducer_stateless` is modified from the paper
[Rnn-Transducer with Stateless Prediction Network](https://ieeexplore.ieee.org/document/9054419/).
## Results

### zipformer (zipformer + pruned stateless transducer)

See <https://github.com/k2-fsa/icefall/pull/1058> for more details.

[zipformer](./zipformer)

#### Non-streaming

##### normal-scaled model, number of model parameters: 65549011, i.e., 65.55 M

The tensorboard log can be found at
<https://tensorboard.dev/experiment/cBaoIabCQxSDsyZM7FzqZA/>

You can find a pretrained model, training logs, decoding logs, and decoding results at:
<https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15>

You can use <https://github.com/k2-fsa/sherpa> to deploy it.
| decoding method      | test-clean | test-other | comment             |
|----------------------|------------|------------|---------------------|
| greedy_search        | 2.27       | 5.1        | --epoch 30 --avg 9  |
| modified_beam_search | 2.25       | 5.06       | --epoch 30 --avg 9  |
| fast_beam_search     | 2.25       | 5.04       | --epoch 30 --avg 9  |
| greedy_search        | 2.23       | 4.96       | --epoch 40 --avg 16 |
| modified_beam_search | 2.21       | 4.91       | --epoch 40 --avg 16 |
| fast_beam_search     | 2.24       | 4.93       | --epoch 40 --avg 16 |
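The `--epoch E --avg N` pairs in the comment column refer to icefall's checkpoint averaging: decoding uses an element-wise average of the model parameters from the last N checkpoints up to epoch E. A toy sketch of the idea, using plain dicts of floats in place of tensors (illustrative only, not icefall's actual averaging code):

```python
def average_checkpoints(state_dicts):
    """Element-wise average of several model state dicts."""
    n = len(state_dicts)
    return {k: sum(sd[k] for sd in state_dicts) / n
            for k in state_dicts[0]}

# e.g. --epoch 30 --avg 9 averages the checkpoints of epochs 22..30
checkpoints = [{"w": float(e)} for e in range(22, 31)]
print(average_checkpoints(checkpoints))  # {'w': 26.0}
```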
The training command is:

```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"

./zipformer/train.py \
  --world-size 4 \
  --num-epochs 40 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp \
  --causal 0 \
  --full-libri 1 \
  --max-duration 1000
```
The decoding command is:

```bash
export CUDA_VISIBLE_DEVICES="0"

for m in greedy_search modified_beam_search fast_beam_search; do
  ./zipformer/decode.py \
    --epoch 30 \
    --avg 9 \
    --use-averaged-model 1 \
    --exp-dir ./zipformer/exp \
    --max-duration 600 \
    --decoding-method $m
done
```
##### small-scaled model, number of model parameters: 23285615, i.e., 23.3 M

The tensorboard log can be found at
<https://tensorboard.dev/experiment/53P4tL22TpO0UdiL0kPaLg/>

You can find a pretrained model, training logs, decoding logs, and decoding results at:
<https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-small-2023-05-16>

You can use <https://github.com/k2-fsa/sherpa> to deploy it.
| decoding method      | test-clean | test-other | comment             |
|----------------------|------------|------------|---------------------|
| greedy_search        | 2.64       | 6.14       | --epoch 30 --avg 8  |
| modified_beam_search | 2.6        | 6.01       | --epoch 30 --avg 8  |
| fast_beam_search     | 2.62       | 6.06       | --epoch 30 --avg 8  |
| greedy_search        | 2.49       | 5.91       | --epoch 40 --avg 13 |
| modified_beam_search | 2.46       | 5.83       | --epoch 40 --avg 13 |
| fast_beam_search     | 2.46       | 5.87       | --epoch 40 --avg 13 |
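The test-clean/test-other numbers throughout these tables are word error rates: word-level edit distance between hypothesis and reference, divided by the reference length, in percent. A minimal self-contained sketch of the metric (not icefall's scoring code):

```python
def wer(ref, hyp):
    """Word error rate (%): word edit distance / reference length."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        cur = [i] + [0] * len(h)
        for j in range(1, len(h) + 1):
            cur[j] = min(prev[j] + 1,                            # deletion
                         cur[j - 1] + 1,                         # insertion
                         prev[j - 1] + (r[i - 1] != h[j - 1]))   # substitution
        prev = cur
    return 100.0 * prev[len(h)] / len(r)

# one substitution (sat -> sit) + one deletion (the) over 6 reference words
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 2))  # 33.33
```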
The training command is:

```bash
export CUDA_VISIBLE_DEVICES="0,1"

./zipformer/train.py \
  --world-size 2 \
  --num-epochs 40 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp-small \
  --causal 0 \
  --num-encoder-layers 2,2,2,2,2,2 \
  --feedforward-dim 512,768,768,768,768,768 \
  --encoder-dim 192,256,256,256,256,256 \
  --encoder-unmasked-dim 192,192,192,192,192,192 \
  --base-lr 0.04 \
  --full-libri 1 \
  --max-duration 1500
```
The decoding command is:

```bash
export CUDA_VISIBLE_DEVICES="0"

for m in greedy_search modified_beam_search fast_beam_search; do
  ./zipformer/decode.py \
    --epoch 40 \
    --avg 13 \
    --exp-dir zipformer/exp-small \
    --max-duration 600 \
    --causal 0 \
    --decoding-method $m \
    --num-encoder-layers 2,2,2,2,2,2 \
    --feedforward-dim 512,768,768,768,768,768 \
    --encoder-dim 192,256,256,256,256,256 \
    --encoder-unmasked-dim 192,192,192,192,192,192
done
```
##### large-scaled model, number of model parameters: 148439574, i.e., 148.4 M

The tensorboard log can be found at
<https://tensorboard.dev/experiment/HJ74wWYpQAGSzETkmQnrmQ/>

You can find a pretrained model, training logs, decoding logs, and decoding results at:
<https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-large-2023-05-16>

You can use <https://github.com/k2-fsa/sherpa> to deploy it.
| decoding method      | test-clean | test-other | comment             |
|----------------------|------------|------------|---------------------|
| greedy_search        | 2.12       | 4.91       | --epoch 30 --avg 9  |
| modified_beam_search | 2.11       | 4.9        | --epoch 30 --avg 9  |
| fast_beam_search     | 2.13       | 4.93       | --epoch 30 --avg 9  |
| greedy_search        | 2.12       | 4.8        | --epoch 40 --avg 13 |
| modified_beam_search | 2.11       | 4.7        | --epoch 40 --avg 13 |
| fast_beam_search     | 2.13       | 4.78       | --epoch 40 --avg 13 |
The training command is:

```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"

./zipformer/train.py \
  --world-size 4 \
  --num-epochs 40 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp-large \
  --causal 0 \
  --num-encoder-layers 2,2,4,5,4,2 \
  --feedforward-dim 512,768,1536,2048,1536,768 \
  --encoder-dim 192,256,512,768,512,256 \
  --encoder-unmasked-dim 192,192,256,320,256,192 \
  --full-libri 1 \
  --max-duration 1000
```
The decoding command is:

```bash
export CUDA_VISIBLE_DEVICES="0"

for m in greedy_search modified_beam_search fast_beam_search; do
  ./zipformer/decode.py \
    --epoch 40 \
    --avg 16 \
    --exp-dir zipformer/exp-large \
    --max-duration 600 \
    --causal 0 \
    --decoding-method $m \
    --num-encoder-layers 2,2,4,5,4,2 \
    --feedforward-dim 512,768,1536,2048,1536,768 \
    --encoder-dim 192,256,512,768,512,256 \
    --encoder-unmasked-dim 192,192,256,320,256,192
done
```
#### streaming

##### normal-scaled model, number of model parameters: 66110931, i.e., 66.11 M

The tensorboard log can be found at
<https://tensorboard.dev/experiment/9rD0i6rMSWq1O61poWi71A>

You can find a pretrained model, training logs, decoding logs, and decoding results at:
<https://huggingface.co/Zengwei/icefall-asr-librispeech-streaming-zipformer-2023-05-17>

You can use <https://github.com/k2-fsa/sherpa> to deploy it.
| decoding method      | chunk size | test-clean | test-other | decoding mode       | comment                                                       |
|----------------------|------------|------------|------------|---------------------|---------------------------------------------------------------|
| greedy_search        | 320ms      | 3.06       | 7.81       | simulated streaming | --epoch 30 --avg 8 --chunk-size 16 --left-context-frames 128  |
| greedy_search        | 320ms      | 3.06       | 7.79       | chunk-wise          | --epoch 30 --avg 8 --chunk-size 16 --left-context-frames 128  |
| modified_beam_search | 320ms      | 3.01       | 7.69       | simulated streaming | --epoch 30 --avg 8 --chunk-size 16 --left-context-frames 128  |
| modified_beam_search | 320ms      | 3.05       | 7.69       | chunk-wise          | --epoch 30 --avg 8 --chunk-size 16 --left-context-frames 128  |
| fast_beam_search     | 320ms      | 3.04       | 7.68       | simulated streaming | --epoch 30 --avg 8 --chunk-size 16 --left-context-frames 128  |
| fast_beam_search     | 320ms      | 3.07       | 7.69       | chunk-wise          | --epoch 30 --avg 8 --chunk-size 16 --left-context-frames 128  |
| greedy_search        | 640ms      | 2.81       | 7.15       | simulated streaming | --epoch 30 --avg 8 --chunk-size 32 --left-context-frames 256  |
| greedy_search        | 640ms      | 2.84       | 7.16       | chunk-wise          | --epoch 30 --avg 8 --chunk-size 32 --left-context-frames 256  |
| modified_beam_search | 640ms      | 2.79       | 7.05       | simulated streaming | --epoch 30 --avg 8 --chunk-size 32 --left-context-frames 256  |
| modified_beam_search | 640ms      | 2.81       | 7.11       | chunk-wise          | --epoch 30 --avg 8 --chunk-size 32 --left-context-frames 256  |
| fast_beam_search     | 640ms      | 2.84       | 7.04       | simulated streaming | --epoch 30 --avg 8 --chunk-size 32 --left-context-frames 256  |
| fast_beam_search     | 640ms      | 2.83       | 7.1        | chunk-wise          | --epoch 30 --avg 8 --chunk-size 32 --left-context-frames 256  |
Note: for the decoding mode, `simulated streaming` means feeding the full utterance during decoding using `decode.py`,
while `chunk-wise` means feeding a fixed number of frames at a time using `streaming_decode.py`.
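The chunk-wise mode can be pictured as slicing the utterance into fixed-size frame chunks while keeping a bounded left context, along these lines (a toy sketch with plain lists; `chunk_size` and `left_context` mirror `--chunk-size` and `--left-context-frames`, and nothing here is the actual streaming implementation):

```python
def chunk_wise_feed(frames, chunk_size=16, left_context=128):
    """Yield (left_context_frames, chunk) pairs as a streaming decoder sees them."""
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        context = frames[max(0, start - left_context):start]
        yield context, chunk

frames = list(range(50))  # 50 feature frames
chunks = list(chunk_wise_feed(frames, chunk_size=16, left_context=32))
print(len(chunks))  # 4 chunks: 16 + 16 + 16 + 2 frames
```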
The training command is:

```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"

./zipformer/train.py \
  --world-size 4 \
  --num-epochs 40 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp-causal \
  --causal 1 \
  --full-libri 1 \
  --max-duration 1000
```
The simulated streaming decoding command is:

```bash
export CUDA_VISIBLE_DEVICES="0"

for m in greedy_search modified_beam_search fast_beam_search; do
  ./zipformer/decode.py \
    --epoch 30 \
    --avg 8 \
    --use-averaged-model 1 \
    --exp-dir ./zipformer/exp-causal \
    --causal 1 \
    --chunk-size 16 \
    --left-context-frames 128 \
    --max-duration 600 \
    --decoding-method $m
done
```
The chunk-wise streaming decoding command is:

```bash
export CUDA_VISIBLE_DEVICES="0"

for m in greedy_search modified_beam_search fast_beam_search; do
  ./zipformer/streaming_decode.py \
    --epoch 30 \
    --avg 8 \
    --use-averaged-model 1 \
    --exp-dir ./zipformer/exp-causal \
    --causal 1 \
    --chunk-size 16 \
    --left-context-frames 128 \
    --num-decode-streams 2000 \
    --decoding-method $m
done
```
### pruned_transducer_stateless7 (Fine-tune with mux)

See <https://github.com/k2-fsa/icefall/pull/1059> for more details.

[pruned_transducer_stateless7](./pruned_transducer_stateless7)

The tensorboard log can be found at
<https://tensorboard.dev/experiment/MaNDZfO7RzW2Czzf3R2ZRA/>

You can find the pretrained model and BPE model needed for fine-tuning at:
<https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless7-2022-11-11>

You can find a fine-tuned model, fine-tuning logs, decoding logs, and decoding
results at:
<https://huggingface.co/yfyeung/icefall-asr-finetune-mux-pruned_transducer_stateless7-2023-05-19>

You can use <https://github.com/k2-fsa/sherpa> to deploy it.

Number of model parameters: 70369391, i.e., 70.37 M
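Parameter counts such as 70369391 are just the sum, over every weight tensor in the model, of that tensor's element count (in PyTorch, `sum(p.numel() for p in model.parameters())`). A shape-based sketch with made-up layer shapes:

```python
from math import prod

def num_params(shapes):
    """Total parameter count given the shape of every weight tensor."""
    return sum(prod(s) for s in shapes)

# toy example: a 512x256 projection with bias, then a 256x500 output layer
print(num_params([(512, 256), (256,), (256, 500), (500,)]))  # 259828
```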
| decoding method      | dev   | test  | test-clean | test-other | comment            |
|----------------------|-------|-------|------------|------------|--------------------|
| greedy_search        | 14.27 | 14.22 | 2.08       | 4.79       | --epoch 20 --avg 5 |
| modified_beam_search | 14.22 | 14.08 | 2.06       | 4.72       | --epoch 20 --avg 5 |
| fast_beam_search     | 14.23 | 14.17 | 2.08       | 4.09       | --epoch 20 --avg 5 |
The training commands are:

```bash
export CUDA_VISIBLE_DEVICES="0,1"

./pruned_transducer_stateless7/finetune.py \
  --world-size 2 \
  --num-epochs 20 \
  --start-epoch 1 \
  --exp-dir pruned_transducer_stateless7/exp_giga_finetune \
  --subset S \
  --use-fp16 1 \
  --base-lr 0.005 \
  --lr-epochs 100 \
  --lr-batches 100000 \
  --bpe-model icefall-asr-librispeech-pruned-transducer-stateless7-2022-11-11/data/lang_bpe_500/bpe.model \
  --do-finetune True \
  --use-mux True \
  --finetune-ckpt icefall-asr-librispeech-pruned-transducer-stateless7-2022-11-11/exp/pretrain.pt \
  --max-duration 500
```
The decoding commands are:

```bash
# greedy_search
./pruned_transducer_stateless7/decode.py \
  --epoch 20 \
  --avg 5 \
  --use-averaged-model 1 \
  --exp-dir ./pruned_transducer_stateless7/exp_giga_finetune \
  --max-duration 600 \
  --decoding-method greedy_search

# modified_beam_search
./pruned_transducer_stateless7/decode.py \
  --epoch 20 \
  --avg 5 \
  --use-averaged-model 1 \
  --exp-dir ./pruned_transducer_stateless7/exp_giga_finetune \
  --max-duration 600 \
  --decoding-method modified_beam_search \
  --beam-size 4

# fast_beam_search
./pruned_transducer_stateless7/decode.py \
  --epoch 20 \
  --avg 5 \
  --use-averaged-model 1 \
  --exp-dir ./pruned_transducer_stateless7/exp_giga_finetune \
  --max-duration 600 \
  --decoding-method fast_beam_search \
  --beam 20.0 \
  --max-contexts 8 \
  --max-states 64
```
### pruned_transducer_stateless7 (zipformer + multidataset (LibriSpeech + GigaSpeech + CommonVoice 13.0))

See <https://github.com/k2-fsa/icefall/pull/1010> for more details.

[pruned_transducer_stateless7](./pruned_transducer_stateless7)

The tensorboard log can be found at
<https://tensorboard.dev/experiment/SwdJoHgZSZWn8ph9aJLb8g/>

You can find a pretrained model, training logs, decoding logs, and decoding
results at:
<https://huggingface.co/yfyeung/icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04>

You can use <https://github.com/k2-fsa/sherpa> to deploy it.

Number of model parameters: 70369391, i.e., 70.37 M
| decoding method      | test-clean | test-other | comment            |
|----------------------|------------|------------|--------------------|
| greedy_search        | 1.91       | 4.06       | --epoch 30 --avg 7 |
| modified_beam_search | 1.90       | 3.99       | --epoch 30 --avg 7 |
| fast_beam_search     | 1.90       | 3.98       | --epoch 30 --avg 7 |
The training commands are:

```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"

./pruned_transducer_stateless7/train.py \
  --world-size 8 \
  --num-epochs 30 \
  --use-multidataset 1 \
  --use-fp16 1 \
  --max-duration 750 \
  --exp-dir pruned_transducer_stateless7/exp
```
The decoding commands are:

```bash
# greedy_search
./pruned_transducer_stateless7/decode.py \
  --epoch 30 \
  --avg 7 \
  --use-averaged-model 1 \
  --exp-dir ./pruned_transducer_stateless7/exp \
  --max-duration 600 \
  --decoding-method greedy_search

# modified_beam_search
./pruned_transducer_stateless7/decode.py \
  --epoch 30 \
  --avg 7 \
  --use-averaged-model 1 \
  --exp-dir ./pruned_transducer_stateless7/exp \
  --max-duration 600 \
  --decoding-method modified_beam_search \
  --beam-size 4

# fast_beam_search
./pruned_transducer_stateless7/decode.py \
  --epoch 30 \
  --avg 7 \
  --use-averaged-model 1 \
  --exp-dir ./pruned_transducer_stateless7/exp \
  --max-duration 600 \
  --decoding-method fast_beam_search \
  --beam 20.0 \
  --max-contexts 8 \
  --max-states 64
```
### Streaming Zipformer-Transducer (Pruned Stateless Transducer + Streaming Zipformer + Multi-Dataset)

#### [pruned_transducer_stateless7_streaming_multi](./pruned_transducer_stateless7_streaming_multi)

See <https://github.com/k2-fsa/icefall/pull/984> for more details.

You can find a pretrained model, training logs, decoding logs, and decoding
results at: <https://huggingface.co/marcoyang/icefall-libri-giga-pruned-transducer-stateless7-streaming-2023-04-04>

Number of model parameters: 70369391, i.e., 70.37 M

##### training on full librispeech + full gigaspeech (with giga_prob=0.9)

The WERs are:
| decoding method      | chunk size | test-clean | test-other | comment            | decoding mode       |
|----------------------|------------|------------|------------|--------------------|---------------------|
| greedy search        | 320ms      | 2.43       | 6.0        | --epoch 20 --avg 4 | simulated streaming |
| greedy search        | 320ms      | 2.47       | 6.13       | --epoch 20 --avg 4 | chunk-wise          |
| fast beam search     | 320ms      | 2.43       | 5.99       | --epoch 20 --avg 4 | simulated streaming |
| fast beam search     | 320ms      | 2.8        | 6.46       | --epoch 20 --avg 4 | chunk-wise          |
| modified beam search | 320ms      | 2.4        | 5.96       | --epoch 20 --avg 4 | simulated streaming |
| modified beam search | 320ms      | 2.42       | 6.03       | --epoch 20 --avg 4 | chunk-wise          |
| greedy search        | 640ms      | 2.26       | 5.58       | --epoch 20 --avg 4 | simulated streaming |
| greedy search        | 640ms      | 2.33       | 5.76       | --epoch 20 --avg 4 | chunk-wise          |
| fast beam search     | 640ms      | 2.27       | 5.54       | --epoch 20 --avg 4 | simulated streaming |
| fast beam search     | 640ms      | 2.37       | 5.75       | --epoch 20 --avg 4 | chunk-wise          |
| modified beam search | 640ms      | 2.22       | 5.5        | --epoch 20 --avg 4 | simulated streaming |
| modified beam search | 640ms      | 2.25       | 5.69       | --epoch 20 --avg 4 | chunk-wise          |
The model also has good WERs on GigaSpeech. The following WERs are achieved on GigaSpeech test and dev sets:

| decoding method      | chunk size | dev   | test  | comment            | decoding mode       |
|----------------------|------------|-------|-------|--------------------|---------------------|
| greedy search        | 320ms      | 12.08 | 11.98 | --epoch 20 --avg 4 | simulated streaming |
| greedy search        | 640ms      | 11.66 | 11.71 | --epoch 20 --avg 4 | simulated streaming |
| modified beam search | 320ms      | 11.95 | 11.83 | --epoch 20 --avg 4 | simulated streaming |
| modified beam search | 320ms      | 11.65 | 11.56 | --epoch 20 --avg 4 | simulated streaming |
Note: `simulated streaming` means feeding the full utterance during decoding using `decode.py`,
while `chunk-wise` means feeding a fixed number of frames at a time using `streaming_decode.py`.

The training command is:
```bash
./pruned_transducer_stateless7_streaming_multi/train.py \
  --world-size 4 \
  --num-epochs 20 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir pruned_transducer_stateless7_streaming_multi/exp \
  --full-libri 1 \
  --giga-prob 0.9 \
  --max-duration 750 \
  --master-port 12345
```
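`--giga-prob 0.9` mixes the two corpora by drawing each training cut from GigaSpeech with probability 0.9 and from LibriSpeech otherwise (cf. lhotse's `CutSet.mux`). A toy sketch of weighted stream multiplexing (illustrative only, not the lhotse implementation):

```python
import itertools
import random

def mux(streams, weights, n, seed=0):
    """Interleave several (possibly infinite) iterators, drawing each item
    from stream i with probability proportional to weights[i]."""
    rng = random.Random(seed)
    its = [iter(s) for s in streams]
    return [next(its[rng.choices(range(len(its)), weights=weights)[0]])
            for _ in range(n)]

libri = itertools.cycle(["libri"])
giga = itertools.cycle(["giga"])
batch = mux([libri, giga], weights=[0.1, 0.9], n=1000)
print(batch.count("giga"))  # roughly 900 of the 1000 draws
```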
The tensorboard log can be found at
<https://tensorboard.dev/experiment/G4yDMLXGQXexf41i4MA2Tg/#scalars>
The simulated streaming decoding command (e.g., chunk-size=320ms) is:

```bash
for m in greedy_search fast_beam_search modified_beam_search; do
  ./pruned_transducer_stateless7_streaming_multi/decode.py \
    --epoch 20 \
    --avg 4 \
    --exp-dir ./pruned_transducer_stateless7_streaming_multi/exp \
    --max-duration 600 \
    --decode-chunk-len 32 \
    --right-padding 64 \
    --decoding-method $m
done
```
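The chunk sizes quoted in the tables relate to `--decode-chunk-len` through the feature frame shift: assuming the usual 10 ms fbank frame shift, `--decode-chunk-len 32` corresponds to the 320 ms chunks above. A one-line sanity check:

```python
def chunk_duration_ms(decode_chunk_len, frame_shift_ms=10):
    """Chunk duration implied by --decode-chunk-len, assuming 10 ms frames."""
    return decode_chunk_len * frame_shift_ms

print(chunk_duration_ms(32))  # 320
print(chunk_duration_ms(64))  # 640
```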
The streaming chunk-wise decoding command (e.g., chunk-size=320ms) is:

```bash
for m in greedy_search modified_beam_search fast_beam_search; do
  ./pruned_transducer_stateless7_streaming_multi/streaming_decode.py \
    --epoch 20 \
    --avg 4 \
    --exp-dir ./pruned_transducer_stateless7_streaming_multi/exp \
    --decoding-method $m \
    --decode-chunk-len 32 \
    --num-decode-streams 2000
done
```
#### Smaller model

We also provide a very small version (only 6.1M parameters) of this setup. The training command for the small model is:
```bash
./pruned_transducer_stateless7_streaming_multi/train.py \
  --world-size 4 \
  --num-epochs 30 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir pruned_transducer_stateless7_streaming_multi/exp \
  --full-libri 1 \
  --giga-prob 0.9 \
  --num-encoder-layers "2,2,2,2,2" \
  --feedforward-dims "256,256,512,512,256" \
  --nhead "4,4,4,4,4" \
  --encoder-dims "128,128,128,128,128" \
  --attention-dims "96,96,96,96,96" \
  --encoder-unmasked-dims "96,96,96,96,96" \
  --max-duration 1200 \
  --master-port 12345
```
You can find this pretrained small model and its training logs, decoding logs, and decoding
results at:
<https://huggingface.co/marcoyang/icefall-libri-giga-pruned-transducer-stateless7-streaming-6M-2023-04-03>
| decoding method      | chunk size | test-clean | test-other | comment            | decoding mode       |
|----------------------|------------|------------|------------|--------------------|---------------------|
| greedy search        | 320ms      | 5.95       | 15.03      | --epoch 30 --avg 1 | simulated streaming |
| greedy search        | 640ms      | 5.61       | 13.86      | --epoch 30 --avg 1 | simulated streaming |
| modified beam search | 320ms      | 5.72       | 14.34      | --epoch 30 --avg 1 | simulated streaming |
| modified beam search | 640ms      | 5.43       | 13.16      | --epoch 30 --avg 1 | simulated streaming |
| fast beam search     | 320ms      | 5.88       | 14.45      | --epoch 30 --avg 1 | simulated streaming |
| fast beam search     | 640ms      | 5.48       | 13.31      | --epoch 30 --avg 1 | simulated streaming |
This small model achieves the following WERs on GigaSpeech test and dev sets:

| decoding method      | chunk size | dev   | test  | comment            | decoding mode       |
|----------------------|------------|-------|-------|--------------------|---------------------|
| greedy search        | 320ms      | 17.57 | 17.2  | --epoch 30 --avg 1 | simulated streaming |
| modified beam search | 320ms      | 16.98 | 11.98 | --epoch 30 --avg 1 | simulated streaming |

You can find the tensorboard logs at <https://tensorboard.dev/experiment/tAc5iXxTQrCQxky5O5OLyw/#scalars>.
### Streaming Zipformer-Transducer (Pruned Stateless Transducer + Streaming Zipformer)

#### [pruned_transducer_stateless7_streaming](./pruned_transducer_stateless7_streaming)
The simulated streaming decoding command (e.g., chunk-size=320ms) is:

```bash
for m in greedy_search fast_beam_search modified_beam_search; do
  ./pruned_transducer_stateless7_streaming/decode.py \
    --epoch 30 \
    --avg 9 \
    ...
done
```

```bash
for m in greedy_search modified_beam_search fast_beam_search; do
    ...
    --num-decode-streams 2000
done
```
We also support decoding with neural network LMs. After combining with language models, the WERs are:

| decoding method                          | chunk size | test-clean | test-other | comment            | decoding mode       |
|------------------------------------------|------------|------------|------------|--------------------|---------------------|
| `modified_beam_search`                   | 320ms      | 3.11       | 7.93       | --epoch 30 --avg 9 | simulated streaming |
| `modified_beam_search_lm_shallow_fusion` | 320ms      | 2.58       | 6.65       | --epoch 30 --avg 9 | simulated streaming |
| `modified_beam_search_lm_rescore`        | 320ms      | 2.59       | 6.86       | --epoch 30 --avg 9 | simulated streaming |
| `modified_beam_search_lm_rescore_LODR`   | 320ms      | 2.52       | 6.73       | --epoch 30 --avg 9 | simulated streaming |
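In shallow fusion, the external LM's log-probability is simply added to the transducer's score with weight `--lm-scale` during beam search. A toy sketch of the scoring rule (the probabilities and scale below are made up for illustration):

```python
from math import log

def fused_score(am_logprob, lm_logprob, lm_scale=0.3):
    """Shallow fusion: interpolate transducer and LM token scores in log space."""
    return am_logprob + lm_scale * lm_logprob

# choosing between two candidate tokens during beam search
cand = {"cat": fused_score(log(0.6), log(0.1)),
        "cap": fused_score(log(0.4), log(0.5))}
print(max(cand, key=cand.get))  # "cap": the LM overturns the acoustic choice
```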
Please use the following command for `modified_beam_search_lm_shallow_fusion`:

```bash
for lm_scale in $(seq 0.15 0.01 0.38); do
  for beam_size in 4 8 12; do
    ./pruned_transducer_stateless7_streaming/decode.py \
      --epoch 99 \
      --avg 1 \
      --use-averaged-model False \
      --beam-size $beam_size \
      --exp-dir ./pruned_transducer_stateless7_streaming/exp-large-LM \
      --max-duration 600 \
      --decode-chunk-len 32 \
      --decoding-method modified_beam_search_lm_shallow_fusion \
      --use-shallow-fusion 1 \
      --lm-type rnn \
      --lm-exp-dir rnn_lm/exp \
      --lm-epoch 99 \
      --lm-scale $lm_scale \
      --lm-avg 1 \
      --rnn-lm-embedding-dim 2048 \
      --rnn-lm-hidden-dim 2048 \
      --rnn-lm-num-layers 3 \
      --lm-vocab-size 500
  done
done
```
Please use the following command for `modified_beam_search_lm_rescore`:

```bash
./pruned_transducer_stateless7_streaming/decode.py \
  --epoch 30 \
  --avg 9 \
  --use-averaged-model True \
  --beam-size 8 \
  --exp-dir ./pruned_transducer_stateless7_streaming/exp \
  --max-duration 600 \
  --decode-chunk-len 32 \
  --decoding-method modified_beam_search_lm_rescore \
  --use-shallow-fusion 0 \
  --lm-type rnn \
  --lm-exp-dir rnn_lm/exp \
  --lm-epoch 99 \
  --lm-avg 1 \
  --rnn-lm-embedding-dim 2048 \
  --rnn-lm-hidden-dim 2048 \
  --rnn-lm-num-layers 3 \
  --lm-vocab-size 500
```
Please use the following command for `modified_beam_search_lm_rescore_LODR`:

```bash
./pruned_transducer_stateless7_streaming/decode.py \
  --epoch 30 \
  --avg 9 \
  --use-averaged-model True \
  --beam-size 8 \
  --exp-dir ./pruned_transducer_stateless7_streaming/exp \
  --max-duration 600 \
  --decode-chunk-len 32 \
  --decoding-method modified_beam_search_lm_rescore_LODR \
  --use-shallow-fusion 0 \
  --lm-type rnn \
  --lm-exp-dir rnn_lm/exp \
  --lm-epoch 99 \
  --lm-avg 1 \
  --rnn-lm-embedding-dim 2048 \
  --rnn-lm-hidden-dim 2048 \
  --rnn-lm-num-layers 3 \
  --lm-vocab-size 500 \
  --tokens-ngram 2 \
  --backoff-id 500
```
A well-trained RNNLM can be found here: <https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm/tree/main>. The bi-gram used in LODR decoding
can be found here: <https://huggingface.co/marcoyang/librispeech_bigram>.
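LODR extends shallow fusion by also subtracting a low-order n-gram score, here the bi-gram above, as an estimate of the transducer's internal LM. A toy sketch of the combined score (the scales are illustrative stand-ins for `--lm-scale` and the LODR weight, not tuned values from this recipe):

```python
def lodr_score(am_logp, lm_logp, bigram_logp, lm_scale=0.4, lodr_scale=0.16):
    """Density-ratio style fusion: add the neural LM score and subtract
    the low-order n-gram that approximates the model's internal LM."""
    return am_logp + lm_scale * lm_logp - lodr_scale * bigram_logp

# a token the bi-gram already rates as likely gets less extra credit
print(lodr_score(-1.0, -0.5, -0.2) < lodr_score(-1.0, -0.5, -2.0))  # True
```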
#### Smaller model

```bash
for m in greedy_search fast_beam_search modified_beam_search ; do
  ...
done
```
Note that a small change is made to `pruned_transducer_stateless7/decoder.py` in
this [PR](https://github.com/k2-fsa/icefall/pull/942) to address the
problem of emitting the first symbol at the very beginning. If you need a
model without this issue, please download the model from here: <https://huggingface.co/marcoyang/icefall-asr-librispeech-pruned-transducer-stateless7-2023-03-10>

### LibriSpeech BPE training results (Pruned Stateless LSTM RNN-T + gradient filter)