remove changes in librispeech

Desh Raj 2023-06-13 08:14:11 -04:00
parent 494e88bcb7
commit 93a5c878f1
2 changed files with 623 additions and 1 deletion


@@ -26,6 +26,7 @@ The following table lists the differences among them.
| `pruned_transducer_stateless7_ctc` | Zipformer | Embedding + Conv1d | Same as pruned_transducer_stateless7, but with extra CTC head|
| `pruned_transducer_stateless7_ctc_bs` | Zipformer | Embedding + Conv1d | pruned_transducer_stateless7_ctc + blank skip |
| `pruned_transducer_stateless7_streaming` | Streaming Zipformer | Embedding + Conv1d | streaming version of pruned_transducer_stateless7 |
| `pruned_transducer_stateless7_streaming_multi` | Streaming Zipformer | Embedding + Conv1d | same as pruned_transducer_stateless7_streaming, trained on LibriSpeech + GigaSpeech |
| `pruned_transducer_stateless8` | Zipformer | Embedding + Conv1d | Same as pruned_transducer_stateless7, but using extra data from GigaSpeech|
| `pruned_stateless_emformer_rnnt2` | Emformer(from torchaudio) | Embedding + Conv1d | Using Emformer from torchaudio for streaming ASR|
| `conv_emformer_transducer_stateless` | ConvEmformer | Embedding + Conv1d | Using ConvEmformer for streaming ASR + mechanisms in reworked model |
@@ -33,6 +34,7 @@ The following table lists the differences among them.
| `lstm_transducer_stateless` | LSTM | Embedding + Conv1d | Using LSTM with mechanisms in reworked model |
| `lstm_transducer_stateless2` | LSTM | Embedding + Conv1d | Using LSTM with mechanisms in reworked model + gigaspeech (multi-dataset setup) |
| `lstm_transducer_stateless3` | LSTM | Embedding + Conv1d | Using LSTM with mechanisms in reworked model + gradient filter + delay penalty |
| `zipformer` | Upgraded Zipformer | Embedding + Conv1d | The latest recipe |
The decoder in `transducer_stateless` is modified from the paper
[Rnn-Transducer with Stateless Prediction Network](https://ieeexplore.ieee.org/document/9054419/).
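The "Embedding + Conv1d" decoders listed above all follow this stateless design: instead of an RNN, the prediction network embeds only the last few emitted tokens and mixes them with a 1-D convolution. A simplified PyTorch sketch of the idea (dimensions are illustrative, and the actual recipes differ in details such as padding and grouped convolution):

```python
import torch
import torch.nn as nn


class StatelessDecoder(nn.Module):
    """Embedding + Conv1d prediction network: no recurrent state;
    only a fixed window of previous tokens is visible."""

    def __init__(self, vocab_size: int, embed_dim: int = 512, context_size: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Mixes exactly `context_size` previous tokens into one vector.
        self.conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=context_size)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, context_size) ids of the most recently emitted tokens
        emb = self.embedding(y).permute(0, 2, 1)  # (batch, embed_dim, context_size)
        out = self.conv(emb)                      # (batch, embed_dim, 1)
        return out.permute(0, 2, 1)               # (batch, 1, embed_dim)


dec = StatelessDecoder(vocab_size=500)
print(dec(torch.tensor([[3, 17]])).shape)  # torch.Size([1, 1, 512])
```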


@@ -1,5 +1,537 @@
## Results
### zipformer (zipformer + pruned stateless transducer)
See <https://github.com/k2-fsa/icefall/pull/1058> for more details.
[zipformer](./zipformer)
#### Non-streaming
##### normal-scaled model, number of model parameters: 65549011, i.e., 65.55 M
The tensorboard log can be found at
<https://tensorboard.dev/experiment/cBaoIabCQxSDsyZM7FzqZA/>
You can find a pretrained model, training logs, decoding logs, and decoding results at:
<https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15>
You can use <https://github.com/k2-fsa/sherpa> to deploy it.
| decoding method | test-clean | test-other | comment |
|----------------------|------------|------------|--------------------|
| greedy_search | 2.27 | 5.1 | --epoch 30 --avg 9 |
| modified_beam_search | 2.25 | 5.06 | --epoch 30 --avg 9 |
| fast_beam_search | 2.25 | 5.04 | --epoch 30 --avg 9 |
| greedy_search | 2.23 | 4.96 | --epoch 40 --avg 16 |
| modified_beam_search | 2.21 | 4.91 | --epoch 40 --avg 16 |
| fast_beam_search | 2.24 | 4.93 | --epoch 40 --avg 16 |
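The `--epoch N --avg M` pairs in the comment column select checkpoint averaging: the parameters of the `M` checkpoints ending at epoch `N` are combined. A rough sketch of plain checkpoint averaging (icefall's `--use-averaged-model` keeps a running average and differs in detail; the checkpoint layout below is an assumption):

```python
import torch


def average_checkpoints(paths):
    """Element-wise average of model parameters across checkpoints."""
    avg = None
    for path in paths:
        # Assumes icefall-style checkpoints storing weights under "model".
        state = torch.load(path, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.float().clone() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}


# --epoch 40 --avg 16 corresponds roughly to epochs 25..40:
paths = [f"zipformer/exp/epoch-{i}.pt" for i in range(25, 41)]
```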
The training command is:
```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"
./zipformer/train.py \
--world-size 4 \
--num-epochs 40 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir zipformer/exp \
--causal 0 \
--full-libri 1 \
--max-duration 1000
```
The decoding command is:
```bash
export CUDA_VISIBLE_DEVICES="0"
for m in greedy_search modified_beam_search fast_beam_search; do
./zipformer/decode.py \
--epoch 30 \
--avg 9 \
--use-averaged-model 1 \
--exp-dir ./zipformer/exp \
--max-duration 600 \
--decoding-method $m
done
```
##### small-scaled model, number of model parameters: 23285615, i.e., 23.3 M
The tensorboard log can be found at
<https://tensorboard.dev/experiment/53P4tL22TpO0UdiL0kPaLg/>
You can find a pretrained model, training logs, decoding logs, and decoding results at:
<https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-small-2023-05-16>
You can use <https://github.com/k2-fsa/sherpa> to deploy it.
| decoding method | test-clean | test-other | comment |
|----------------------|------------|------------|--------------------|
| greedy_search | 2.64 | 6.14 | --epoch 30 --avg 8 |
| modified_beam_search | 2.6 | 6.01 | --epoch 30 --avg 8 |
| fast_beam_search | 2.62 | 6.06 | --epoch 30 --avg 8 |
| greedy_search | 2.49 | 5.91 | --epoch 40 --avg 13 |
| modified_beam_search | 2.46 | 5.83 | --epoch 40 --avg 13 |
| fast_beam_search | 2.46 | 5.87 | --epoch 40 --avg 13 |
The training command is:
```bash
export CUDA_VISIBLE_DEVICES="0,1"
./zipformer/train.py \
--world-size 2 \
--num-epochs 40 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir zipformer/exp-small \
--causal 0 \
--num-encoder-layers 2,2,2,2,2,2 \
--feedforward-dim 512,768,768,768,768,768 \
--encoder-dim 192,256,256,256,256,256 \
--encoder-unmasked-dim 192,192,192,192,192,192 \
--base-lr 0.04 \
--full-libri 1 \
--max-duration 1500
```
The decoding command is:
```bash
export CUDA_VISIBLE_DEVICES="0"
for m in greedy_search modified_beam_search fast_beam_search; do
./zipformer/decode.py \
--epoch 40 \
--avg 13 \
--exp-dir zipformer/exp-small \
--max-duration 600 \
--causal 0 \
--decoding-method $m \
--num-encoder-layers 2,2,2,2,2,2 \
--feedforward-dim 512,768,768,768,768,768 \
--encoder-dim 192,256,256,256,256,256 \
--encoder-unmasked-dim 192,192,192,192,192,192
done
```
##### large-scaled model, number of model parameters: 148439574, i.e., 148.4 M
The tensorboard log can be found at
<https://tensorboard.dev/experiment/HJ74wWYpQAGSzETkmQnrmQ/>
You can find a pretrained model, training logs, decoding logs, and decoding results at:
<https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-large-2023-05-16>
You can use <https://github.com/k2-fsa/sherpa> to deploy it.
| decoding method | test-clean | test-other | comment |
|----------------------|------------|------------|--------------------|
| greedy_search | 2.12 | 4.91 | --epoch 30 --avg 9 |
| modified_beam_search | 2.11 | 4.9 | --epoch 30 --avg 9 |
| fast_beam_search | 2.13 | 4.93 | --epoch 30 --avg 9 |
| greedy_search | 2.12 | 4.8 | --epoch 40 --avg 13 |
| modified_beam_search | 2.11 | 4.7 | --epoch 40 --avg 13 |
| fast_beam_search | 2.13 | 4.78 | --epoch 40 --avg 13 |
The training command is:
```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"
./zipformer/train.py \
--world-size 4 \
--num-epochs 40 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir zipformer/exp-large \
--causal 0 \
--num-encoder-layers 2,2,4,5,4,2 \
--feedforward-dim 512,768,1536,2048,1536,768 \
--encoder-dim 192,256,512,768,512,256 \
--encoder-unmasked-dim 192,192,256,320,256,192 \
--full-libri 1 \
--max-duration 1000
```
The decoding command is:
```bash
export CUDA_VISIBLE_DEVICES="0"
for m in greedy_search modified_beam_search fast_beam_search; do
./zipformer/decode.py \
--epoch 40 \
--avg 16 \
--exp-dir zipformer/exp-large \
--max-duration 600 \
--causal 0 \
--decoding-method $m \
--num-encoder-layers 2,2,4,5,4,2 \
--feedforward-dim 512,768,1536,2048,1536,768 \
--encoder-dim 192,256,512,768,512,256 \
--encoder-unmasked-dim 192,192,256,320,256,192
done
```
#### Streaming
##### normal-scaled model, number of model parameters: 66110931, i.e., 66.11 M
The tensorboard log can be found at
<https://tensorboard.dev/experiment/9rD0i6rMSWq1O61poWi71A>
You can find a pretrained model, training logs, decoding logs, and decoding results at:
<https://huggingface.co/Zengwei/icefall-asr-librispeech-streaming-zipformer-2023-05-17>
You can use <https://github.com/k2-fsa/sherpa> to deploy it.
| decoding method | chunk size | test-clean | test-other | decoding mode | comment |
|----------------------|------------|------------|------------|---------------------|--------------------|
| greedy_search | 320ms | 3.06 | 7.81 | simulated streaming | --epoch 30 --avg 8 --chunk-size 16 --left-context-frames 128 |
| greedy_search | 320ms | 3.06 | 7.79 | chunk-wise | --epoch 30 --avg 8 --chunk-size 16 --left-context-frames 128 |
| modified_beam_search | 320ms | 3.01 | 7.69 | simulated streaming | --epoch 30 --avg 8 --chunk-size 16 --left-context-frames 128 |
| modified_beam_search | 320ms | 3.05 | 7.69 | chunk-wise | --epoch 30 --avg 8 --chunk-size 16 --left-context-frames 128 |
| fast_beam_search | 320ms | 3.04 | 7.68 | simulated streaming | --epoch 30 --avg 8 --chunk-size 16 --left-context-frames 128 |
| fast_beam_search | 320ms | 3.07 | 7.69 | chunk-wise | --epoch 30 --avg 8 --chunk-size 16 --left-context-frames 128 |
| greedy_search | 640ms | 2.81 | 7.15 | simulated streaming | --epoch 30 --avg 8 --chunk-size 32 --left-context-frames 256 |
| greedy_search | 640ms | 2.84 | 7.16 | chunk-wise | --epoch 30 --avg 8 --chunk-size 32 --left-context-frames 256 |
| modified_beam_search | 640ms | 2.79 | 7.05 | simulated streaming | --epoch 30 --avg 8 --chunk-size 32 --left-context-frames 256 |
| modified_beam_search | 640ms | 2.81 | 7.11 | chunk-wise | --epoch 30 --avg 8 --chunk-size 32 --left-context-frames 256 |
| fast_beam_search | 640ms | 2.84 | 7.04 | simulated streaming | --epoch 30 --avg 8 --chunk-size 32 --left-context-frames 256 |
| fast_beam_search | 640ms | 2.83 | 7.1 | chunk-wise | --epoch 30 --avg 8 --chunk-size 32 --left-context-frames 256 |
Note: in the decoding mode column, `simulated streaming` means the full utterance is fed at once via `decode.py`, while `chunk-wise` means a fixed number of frames is fed at a time via `streaming_decode.py`.
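Note also that `--chunk-size` counts encoder frames, not milliseconds. Assuming each frame at the chunk level spans about 20 ms (an assumption, but consistent with the table above), the conversion is:

```python
# Assumption: one chunk frame ~ 20 ms after frontend subsampling.
FRAME_MS = 20


def chunk_frames_to_ms(chunk_size: int) -> int:
    return chunk_size * FRAME_MS


assert chunk_frames_to_ms(16) == 320  # matches the 320ms rows
assert chunk_frames_to_ms(32) == 640  # matches the 640ms rows
```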
The training command is:
```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"
./zipformer/train.py \
--world-size 4 \
--num-epochs 40 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir zipformer/exp-causal \
--causal 1 \
--full-libri 1 \
--max-duration 1000
```
The simulated streaming decoding command is:
```bash
export CUDA_VISIBLE_DEVICES="0"
for m in greedy_search modified_beam_search fast_beam_search; do
./zipformer/decode.py \
--epoch 30 \
--avg 8 \
--use-averaged-model 1 \
--exp-dir ./zipformer/exp-causal \
--causal 1 \
--chunk-size 16 \
--left-context-frames 128 \
--max-duration 600 \
--decoding-method $m
done
```
The chunk-wise streaming decoding command is:
```bash
export CUDA_VISIBLE_DEVICES="0"
for m in greedy_search modified_beam_search fast_beam_search; do
./zipformer/streaming_decode.py \
--epoch 30 \
--avg 8 \
--use-averaged-model 1 \
--exp-dir ./zipformer/exp-causal \
--causal 1 \
--chunk-size 16 \
--left-context-frames 128 \
--num-decode-streams 2000 \
--decoding-method $m
done
```
### pruned_transducer_stateless7 (Fine-tune with mux)
See <https://github.com/k2-fsa/icefall/pull/1059> for more details.
[pruned_transducer_stateless7](./pruned_transducer_stateless7)
The tensorboard log can be found at
<https://tensorboard.dev/experiment/MaNDZfO7RzW2Czzf3R2ZRA/>
You can find the pretrained model and bpe model needed for fine-tuning at:
<https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless7-2022-11-11>
You can find a fine-tuned model, fine-tuning logs, decoding logs, and decoding
results at:
<https://huggingface.co/yfyeung/icefall-asr-finetune-mux-pruned_transducer_stateless7-2023-05-19>
You can use <https://github.com/k2-fsa/sherpa> to deploy it.
Number of model parameters: 70369391, i.e., 70.37 M
| decoding method | dev (GigaSpeech) | test (GigaSpeech) | test-clean | test-other | comment |
|----------------------|------------|------------|------------|------------|--------------------|
| greedy_search | 14.27 | 14.22 | 2.08 | 4.79 | --epoch 20 --avg 5 |
| modified_beam_search | 14.22 | 14.08 | 2.06 | 4.72 | --epoch 20 --avg 5 |
| fast_beam_search | 14.23 | 14.17 | 2.08 | 4.09 | --epoch 20 --avg 5 |
The training commands are:
```bash
export CUDA_VISIBLE_DEVICES="0,1"
./pruned_transducer_stateless7/finetune.py \
--world-size 2 \
--num-epochs 20 \
--start-epoch 1 \
--exp-dir pruned_transducer_stateless7/exp_giga_finetune \
--subset S \
--use-fp16 1 \
--base-lr 0.005 \
--lr-epochs 100 \
--lr-batches 100000 \
--bpe-model icefall-asr-librispeech-pruned-transducer-stateless7-2022-11-11/data/lang_bpe_500/bpe.model \
--do-finetune True \
--use-mux True \
--finetune-ckpt icefall-asr-librispeech-pruned-transducer-stateless7-2022-11-11/exp/pretrain.pt \
--max-duration 500
```
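Here `--use-mux True` keeps the original LibriSpeech training cuts in the mix while fine-tuning on the GigaSpeech `S` subset, which helps against catastrophic forgetting. Conceptually this is lhotse's `CutSet.mux`; a hedged sketch (the manifest paths and the 50/50 weighting are illustrative, not taken from the recipe):

```python
from lhotse import CutSet

# Illustrative manifests; real paths come from the data preparation stage.
libri = CutSet.from_file("data/fbank/librispeech_cuts_train.jsonl.gz")
giga = CutSet.from_file("data/fbank/gigaspeech_cuts_S.jsonl.gz")

# Lazily interleave both corpora; `weights` sets the sampling ratio.
mixed = CutSet.mux(libri, giga, weights=[0.5, 0.5])
```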
The decoding commands are:
```bash
# greedy_search
./pruned_transducer_stateless7/decode.py \
--epoch 20 \
--avg 5 \
--use-averaged-model 1 \
--exp-dir ./pruned_transducer_stateless7/exp_giga_finetune \
--max-duration 600 \
--decoding-method greedy_search
# modified_beam_search
./pruned_transducer_stateless7/decode.py \
--epoch 20 \
--avg 5 \
--use-averaged-model 1 \
--exp-dir ./pruned_transducer_stateless7/exp_giga_finetune \
--max-duration 600 \
--decoding-method modified_beam_search \
--beam-size 4
# fast_beam_search
./pruned_transducer_stateless7/decode.py \
--epoch 20 \
--avg 5 \
--use-averaged-model 1 \
--exp-dir ./pruned_transducer_stateless7/exp_giga_finetune \
--max-duration 600 \
--decoding-method fast_beam_search \
--beam 20.0 \
--max-contexts 8 \
--max-states 64
```
### pruned_transducer_stateless7 (zipformer + multidataset (LibriSpeech + GigaSpeech + CommonVoice 13.0))
See <https://github.com/k2-fsa/icefall/pull/1010> for more details.
[pruned_transducer_stateless7](./pruned_transducer_stateless7)
The tensorboard log can be found at
<https://tensorboard.dev/experiment/SwdJoHgZSZWn8ph9aJLb8g/>
You can find a pretrained model, training logs, decoding logs, and decoding
results at:
<https://huggingface.co/yfyeung/icefall-asr-multidataset-pruned_transducer_stateless7-2023-05-04>
You can use <https://github.com/k2-fsa/sherpa> to deploy it.
Number of model parameters: 70369391, i.e., 70.37 M
| decoding method | test-clean | test-other | comment |
|----------------------|------------|------------|--------------------|
| greedy_search | 1.91 | 4.06 | --epoch 30 --avg 7 |
| modified_beam_search | 1.90 | 3.99 | --epoch 30 --avg 7 |
| fast_beam_search | 1.90 | 3.98 | --epoch 30 --avg 7 |
The training commands are:
```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
./pruned_transducer_stateless7/train.py \
--world-size 8 \
--num-epochs 30 \
--use-multidataset 1 \
--use-fp16 1 \
--max-duration 750 \
--exp-dir pruned_transducer_stateless7/exp
```
The decoding commands are:
```bash
# greedy_search
./pruned_transducer_stateless7/decode.py \
--epoch 30 \
--avg 7 \
--use-averaged-model 1 \
--exp-dir ./pruned_transducer_stateless7/exp \
--max-duration 600 \
--decoding-method greedy_search
# modified_beam_search
./pruned_transducer_stateless7/decode.py \
--epoch 30 \
--avg 7 \
--use-averaged-model 1 \
--exp-dir ./pruned_transducer_stateless7/exp \
--max-duration 600 \
--decoding-method modified_beam_search \
--beam-size 4
# fast_beam_search
./pruned_transducer_stateless7/decode.py \
--epoch 30 \
--avg 7 \
--use-averaged-model 1 \
--exp-dir ./pruned_transducer_stateless7/exp \
--max-duration 600 \
--decoding-method fast_beam_search \
--beam 20.0 \
--max-contexts 8 \
--max-states 64
```
### Streaming Zipformer-Transducer (Pruned Stateless Transducer + Streaming Zipformer + Multi-Dataset)
#### [pruned_transducer_stateless7_streaming_multi](./pruned_transducer_stateless7_streaming_multi)
See <https://github.com/k2-fsa/icefall/pull/984> for more details.
You can find a pretrained model, training logs, decoding logs, and decoding
results at: <https://huggingface.co/marcoyang/icefall-libri-giga-pruned-transducer-stateless7-streaming-2023-04-04>
Number of model parameters: 70369391, i.e., 70.37 M
##### Training on full LibriSpeech + full GigaSpeech (with giga_prob=0.9)
The WERs are:
| decoding method | chunk size | test-clean | test-other | comment | decoding mode |
|----------------------|------------|------------|------------|---------------------|----------------------|
| greedy search | 320ms | 2.43 | 6.0 | --epoch 20 --avg 4 | simulated streaming |
| greedy search | 320ms | 2.47 | 6.13 | --epoch 20 --avg 4 | chunk-wise |
| fast beam search | 320ms | 2.43 | 5.99 | --epoch 20 --avg 4 | simulated streaming |
| fast beam search | 320ms | 2.8 | 6.46 | --epoch 20 --avg 4 | chunk-wise |
| modified beam search | 320ms | 2.4 | 5.96 | --epoch 20 --avg 4 | simulated streaming |
| modified beam search | 320ms | 2.42 | 6.03 | --epoch 20 --avg 4 | chunk-wise |
| greedy search | 640ms | 2.26 | 5.58 | --epoch 20 --avg 4 | simulated streaming |
| greedy search | 640ms | 2.33 | 5.76 | --epoch 20 --avg 4 | chunk-wise |
| fast beam search | 640ms | 2.27 | 5.54 | --epoch 20 --avg 4 | simulated streaming |
| fast beam search | 640ms | 2.37 | 5.75 | --epoch 20 --avg 4 | chunk-wise |
| modified beam search | 640ms | 2.22 | 5.5 | --epoch 20 --avg 4 | simulated streaming |
| modified beam search | 640ms | 2.25 | 5.69 | --epoch 20 --avg 4 | chunk-wise |
The model also achieves good WERs on GigaSpeech. The following WERs are obtained on the GigaSpeech dev and test sets:
| decoding method | chunk size | dev | test | comment | decoding mode |
|----------------------|------------|-----|------|------------|---------------------|
| greedy search | 320ms | 12.08 | 11.98 | --epoch 20 --avg 4 | simulated streaming |
| greedy search | 640ms | 11.66 | 11.71 | --epoch 20 --avg 4 | simulated streaming |
| modified beam search | 320ms | 11.95 | 11.83 | --epoch 20 --avg 4 | simulated streaming |
| modified beam search | 640ms | 11.65 | 11.56 | --epoch 20 --avg 4 | simulated streaming |
Note: `simulated streaming` means the full utterance is fed at once via `decode.py`, while `chunk-wise` means a fixed number of frames is fed at a time via `streaming_decode.py`.
The training command is:
```bash
./pruned_transducer_stateless7_streaming_multi/train.py \
--world-size 4 \
--num-epochs 20 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir pruned_transducer_stateless7_streaming_multi/exp \
--full-libri 1 \
--giga-prob 0.9 \
--max-duration 750 \
--master-port 12345
```
The tensorboard log can be found at
<https://tensorboard.dev/experiment/G4yDMLXGQXexf41i4MA2Tg/#scalars>
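The `--giga-prob 0.9` flag sets the probability that the next training cut is drawn from GigaSpeech rather than LibriSpeech. A minimal sketch of the sampling rule:

```python
import random


def pick_corpus(giga_prob: float = 0.9) -> str:
    """Choose the source corpus for the next training cut."""
    return "gigaspeech" if random.random() < giga_prob else "librispeech"

# In expectation, 90% of cuts come from GigaSpeech and 10% from LibriSpeech.
```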
The simulated streaming decoding command (e.g., chunk-size=320ms) is:
```bash
for m in greedy_search fast_beam_search modified_beam_search; do
./pruned_transducer_stateless7_streaming_multi/decode.py \
--epoch 20 \
--avg 4 \
--exp-dir ./pruned_transducer_stateless7_streaming_multi/exp \
--max-duration 600 \
--decode-chunk-len 32 \
--right-padding 64 \
--decoding-method $m
done
```
The chunk-wise streaming decoding command (e.g., chunk-size=320ms) is:
```bash
for m in greedy_search modified_beam_search fast_beam_search; do
./pruned_transducer_stateless7_streaming_multi/streaming_decode.py \
--epoch 20 \
--avg 4 \
--exp-dir ./pruned_transducer_stateless7_streaming_multi/exp \
--decoding-method $m \
--decode-chunk-len 32 \
--num-decode-streams 2000
done
```
#### Smaller model
We also provide a very small version (only 6.1M parameters) of this setup. The training command for the small model is:
```bash
./pruned_transducer_stateless7_streaming_multi/train.py \
--world-size 4 \
--num-epochs 30 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir pruned_transducer_stateless7_streaming_multi/exp \
--full-libri 1 \
--giga-prob 0.9 \
--num-encoder-layers "2,2,2,2,2" \
--feedforward-dims "256,256,512,512,256" \
--nhead "4,4,4,4,4" \
--encoder-dims "128,128,128,128,128" \
--attention-dims "96,96,96,96,96" \
--encoder-unmasked-dims "96,96,96,96,96" \
--max-duration 1200 \
--master-port 12345
```
You can find this pretrained small model and its training logs, decoding logs, and decoding
results at:
<https://huggingface.co/marcoyang/icefall-libri-giga-pruned-transducer-stateless7-streaming-6M-2023-04-03>
| decoding method | chunk size | test-clean | test-other | comment | decoding mode |
|----------------------|------------|------------|------------|---------------------|----------------------|
| greedy search | 320ms | 5.95 | 15.03 | --epoch 30 --avg 1 | simulated streaming |
| greedy search | 640ms | 5.61 | 13.86 | --epoch 30 --avg 1 | simulated streaming |
| modified beam search | 320ms | 5.72 | 14.34 | --epoch 30 --avg 1 | simulated streaming |
| modified beam search | 640ms | 5.43 | 13.16 | --epoch 30 --avg 1 | simulated streaming |
| fast beam search | 320ms | 5.88 | 14.45 | --epoch 30 --avg 1 | simulated streaming |
| fast beam search | 640ms | 5.48 | 13.31 | --epoch 30 --avg 1 | simulated streaming |
This small model achieves the following WERs on GigaSpeech test and dev sets:
| decoding method | chunk size | dev | test | comment | decoding mode |
|----------------------|------------|------------|------------|---------------------|----------------------|
| greedy search | 320ms | 17.57 | 17.2 | --epoch 30 --avg 1 | simulated streaming |
| modified beam search | 320ms | 16.98 | 11.98 | --epoch 30 --avg 1 | simulated streaming |
You can find the tensorboard logs at <https://tensorboard.dev/experiment/tAc5iXxTQrCQxky5O5OLyw/#scalars>.
### Streaming Zipformer-Transducer (Pruned Stateless Transducer + Streaming Zipformer)
#### [pruned_transducer_stateless7_streaming](./pruned_transducer_stateless7_streaming)
@@ -53,7 +585,7 @@ The tensorboard log can be found at
The simulated streaming decoding command (e.g., chunk-size=320ms) is:
```bash
for m in greedy_search fast_beam_search modified_beam_search; do
./pruned_transducer_stateless7_streaming/decode.py \
--epoch 30 \
--avg 9 \
@@ -76,6 +608,90 @@ for m in greedy_search modified_beam_search fast_beam_search; do
--num-decode-streams 2000
done
```
We also support decoding with neural network LMs. After combining with an external language model, the WERs are:
| decoding method | chunk size | test-clean | test-other | comment | decoding mode |
|----------------------|------------|------------|------------|---------------------|----------------------|
| `modified_beam_search` | 320ms | 3.11 | 7.93 | --epoch 30 --avg 9 | simulated streaming |
| `modified_beam_search_lm_shallow_fusion` | 320ms | 2.58 | 6.65 | --epoch 30 --avg 9 | simulated streaming |
| `modified_beam_search_lm_rescore` | 320ms | 2.59 | 6.86 | --epoch 30 --avg 9 | simulated streaming |
| `modified_beam_search_lm_rescore_LODR` | 320ms | 2.52 | 6.73 | --epoch 30 --avg 9 | simulated streaming |
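`modified_beam_search_lm_shallow_fusion` interpolates the transducer score with the neural LM score inside the beam search, while the `*_rescore*` variants rerank complete hypotheses afterwards. A hedged sketch of the shallow-fusion scoring rule (the scale is the quantity swept with `--lm-scale` below):

```python
def shallow_fusion_score(transducer_logp: float,
                         lm_logp: float,
                         lm_scale: float) -> float:
    """Log-linear interpolation of transducer and external-LM scores."""
    return transducer_logp + lm_scale * lm_logp
```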
Please use the following command for `modified_beam_search_lm_shallow_fusion`:
```bash
for lm_scale in $(seq 0.15 0.01 0.38); do
for beam_size in 4 8 12; do
./pruned_transducer_stateless7_streaming/decode.py \
--epoch 99 \
--avg 1 \
--use-averaged-model False \
--beam-size $beam_size \
--exp-dir ./pruned_transducer_stateless7_streaming/exp-large-LM \
--max-duration 600 \
--decode-chunk-len 32 \
--decoding-method modified_beam_search_lm_shallow_fusion \
--use-shallow-fusion 1 \
--lm-type rnn \
--lm-exp-dir rnn_lm/exp \
--lm-epoch 99 \
--lm-scale $lm_scale \
--lm-avg 1 \
--rnn-lm-embedding-dim 2048 \
--rnn-lm-hidden-dim 2048 \
--rnn-lm-num-layers 3 \
--lm-vocab-size 500
done
done
```
Please use the following command for `modified_beam_search_lm_rescore`:
```bash
./pruned_transducer_stateless7_streaming/decode.py \
--epoch 30 \
--avg 9 \
--use-averaged-model True \
--beam-size 8 \
--exp-dir ./pruned_transducer_stateless7_streaming/exp \
--max-duration 600 \
--decode-chunk-len 32 \
--decoding-method modified_beam_search_lm_rescore \
--use-shallow-fusion 0 \
--lm-type rnn \
--lm-exp-dir rnn_lm/exp \
--lm-epoch 99 \
--lm-avg 1 \
--rnn-lm-embedding-dim 2048 \
--rnn-lm-hidden-dim 2048 \
--rnn-lm-num-layers 3 \
--lm-vocab-size 500
```
Please use the following command for `modified_beam_search_lm_rescore_LODR`:
```bash
./pruned_transducer_stateless7_streaming/decode.py \
--epoch 30 \
--avg 9 \
--use-averaged-model True \
--beam-size 8 \
--exp-dir ./pruned_transducer_stateless7_streaming/exp \
--max-duration 600 \
--decode-chunk-len 32 \
--decoding-method modified_beam_search_lm_rescore_LODR \
--use-shallow-fusion 0 \
--lm-type rnn \
--lm-exp-dir rnn_lm/exp \
--lm-epoch 99 \
--lm-avg 1 \
--rnn-lm-embedding-dim 2048 \
--rnn-lm-hidden-dim 2048 \
--rnn-lm-num-layers 3 \
--lm-vocab-size 500 \
--tokens-ngram 2 \
--backoff-id 500
```
A well-trained RNNLM can be found here: <https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm/tree/main>. The bi-gram used in LODR decoding
can be found here: <https://huggingface.co/marcoyang/librispeech_bigram>.
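LODR (low-order density ratio) additionally subtracts a scaled bi-gram score, using the bi-gram (`--tokens-ngram 2`) as a cheap approximation of the transducer's internal language model. A hedged sketch of the per-hypothesis combination (the scale values are illustrative):

```python
def lodr_score(transducer_logp: float,
               rnnlm_logp: float,
               bigram_logp: float,
               lm_scale: float = 0.4,
               lodr_scale: float = 0.16) -> float:
    """Add the neural LM, subtract the low-order source-domain LM."""
    return transducer_logp + lm_scale * rnnlm_logp - lodr_scale * bigram_logp
```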
#### Smaller model
@@ -540,6 +1156,10 @@ for m in greedy_search fast_beam_search modified_beam_search ; do
done
```
Note that a small change was made to `pruned_transducer_stateless7/decoder.py` in
this [PR](https://github.com/k2-fsa/icefall/pull/942) to address the
problem of emitting the first symbol at the very beginning. If you need a
model without this issue, please download it from <https://huggingface.co/marcoyang/icefall-asr-librispeech-pruned-transducer-stateless7-2023-03-10>.
### LibriSpeech BPE training results (Pruned Stateless LSTM RNN-T + gradient filter)