From 93a5c878f17fad4a9ea9f90d2ce2d1d10e759f69 Mon Sep 17 00:00:00 2001
From: Desh Raj
Date: Tue, 13 Jun 2023 08:14:11 -0400
Subject: [PATCH] remove changes in librispeech

---
 egs/librispeech/ASR/README.md  |   2 +
 egs/librispeech/ASR/RESULTS.md | 622 ++++++++++++++++++++++++++++++++-
 2 files changed, 623 insertions(+), 1 deletion(-)

diff --git a/egs/librispeech/ASR/README.md b/egs/librispeech/ASR/README.md
index 9ffd78d5b..6f5ee7846 100644
--- a/egs/librispeech/ASR/README.md
+++ b/egs/librispeech/ASR/README.md
@@ -26,6 +26,7 @@ The following table lists the differences among them.
 | `pruned_transducer_stateless7_ctc` | Zipformer | Embedding + Conv1d | Same as pruned_transducer_stateless7, but with extra CTC head|
 | `pruned_transducer_stateless7_ctc_bs` | Zipformer | Embedding + Conv1d | pruned_transducer_stateless7_ctc + blank skip |
 | `pruned_transducer_stateless7_streaming` | Streaming Zipformer | Embedding + Conv1d | streaming version of pruned_transducer_stateless7 |
+| `pruned_transducer_stateless7_streaming_multi` | Streaming Zipformer | Embedding + Conv1d | same as pruned_transducer_stateless7_streaming, trained on LibriSpeech + GigaSpeech |
 | `pruned_transducer_stateless8` | Zipformer | Embedding + Conv1d | Same as pruned_transducer_stateless7, but using extra data from GigaSpeech|
 | `pruned_stateless_emformer_rnnt2` | Emformer(from torchaudio) | Embedding + Conv1d | Using Emformer from torchaudio for streaming ASR|
 | `conv_emformer_transducer_stateless` | ConvEmformer | Embedding + Conv1d | Using ConvEmformer for streaming ASR + mechanisms in reworked model |
@@ -33,6 +34,7 @@ The following table lists the differences among them.
 | `lstm_transducer_stateless` | LSTM | Embedding + Conv1d | Using LSTM with mechanisms in reworked model |
 | `lstm_transducer_stateless2` | LSTM | Embedding + Conv1d | Using LSTM with mechanisms in reworked model + gigaspeech (multi-dataset setup) |
 | `lstm_transducer_stateless3` | LSTM | Embedding + Conv1d | Using LSTM with mechanisms in reworked model + gradient filter + delay penalty |
+| `zipformer` | Upgraded Zipformer | Embedding + Conv1d | The latest recipe |
 The decoder in `transducer_stateless` is modified from the paper [Rnn-Transducer with Stateless Prediction Network](https://ieeexplore.ieee.org/document/9054419/).
diff --git a/egs/librispeech/ASR/RESULTS.md b/egs/librispeech/ASR/RESULTS.md
index ecb84eb01..b7f704e41 100644
--- a/egs/librispeech/ASR/RESULTS.md
+++ b/egs/librispeech/ASR/RESULTS.md
@@ -1,5 +1,537 @@
 ## Results
+### zipformer (zipformer + pruned stateless transducer)
+
+See for more details.
+
+[zipformer](./zipformer)
+
+#### Non-streaming
+
+##### normal-scaled model, number of model parameters: 65549011, i.e., 65.55 M
+
+The tensorboard log can be found at
+
+
+You can find a pretrained model, training logs, decoding logs, and decoding results at:
+
+
+You can use to deploy it.
+
+| decoding method | test-clean | test-other | comment |
+|----------------------|------------|------------|--------------------|
+| greedy_search | 2.27 | 5.1 | --epoch 30 --avg 9 |
+| modified_beam_search | 2.25 | 5.06 | --epoch 30 --avg 9 |
+| fast_beam_search | 2.25 | 5.04 | --epoch 30 --avg 9 |
+| greedy_search | 2.23 | 4.96 | --epoch 40 --avg 16 |
+| modified_beam_search | 2.21 | 4.91 | --epoch 40 --avg 16 |
+| fast_beam_search | 2.24 | 4.93 | --epoch 40 --avg 16 |
+
+The training command is:
+```bash
+export CUDA_VISIBLE_DEVICES="0,1,2,3"
+./zipformer/train.py \
+  --world-size 4 \
+  --num-epochs 40 \
+  --start-epoch 1 \
+  --use-fp16 1 \
+  --exp-dir zipformer/exp \
+  --causal 0 \
+  --full-libri 1 \
+  --max-duration 1000
+```
+
+The decoding command is:
+```bash
+export CUDA_VISIBLE_DEVICES="0"
+for m in greedy_search modified_beam_search fast_beam_search; do
+  ./zipformer/decode.py \
+    --epoch 30 \
+    --avg 9 \
+    --use-averaged-model 1 \
+    --exp-dir ./zipformer/exp \
+    --max-duration 600 \
+    --decoding-method $m
+done
+```
+
+##### small-scaled model, number of model parameters: 23285615, i.e., 23.3 M
+
+The tensorboard log can be found at
+
+
+You can find a pretrained model, training logs, decoding logs, and decoding results at:
+
+
+You can use to deploy it.
+
+| decoding method | test-clean | test-other | comment |
+|----------------------|------------|------------|--------------------|
+| greedy_search | 2.64 | 6.14 | --epoch 30 --avg 8 |
+| modified_beam_search | 2.6 | 6.01 | --epoch 30 --avg 8 |
+| fast_beam_search | 2.62 | 6.06 | --epoch 30 --avg 8 |
+| greedy_search | 2.49 | 5.91 | --epoch 40 --avg 13 |
+| modified_beam_search | 2.46 | 5.83 | --epoch 40 --avg 13 |
+| fast_beam_search | 2.46 | 5.87 | --epoch 40 --avg 13 |
+
+The training command is:
+```bash
+export CUDA_VISIBLE_DEVICES="0,1"
+./zipformer/train.py \
+  --world-size 2 \
+  --num-epochs 40 \
+  --start-epoch 1 \
+  --use-fp16 1 \
+  --exp-dir zipformer/exp-small \
+  --causal 0 \
+  --num-encoder-layers 2,2,2,2,2,2 \
+  --feedforward-dim 512,768,768,768,768,768 \
+  --encoder-dim 192,256,256,256,256,256 \
+  --encoder-unmasked-dim 192,192,192,192,192,192 \
+  --base-lr 0.04 \
+  --full-libri 1 \
+  --max-duration 1500
+```
+
+The decoding command is:
+```bash
+export CUDA_VISIBLE_DEVICES="0"
+for m in greedy_search modified_beam_search fast_beam_search; do
+  ./zipformer/decode.py \
+    --epoch 40 \
+    --avg 13 \
+    --exp-dir zipformer/exp-small \
+    --max-duration 600 \
+    --causal 0 \
+    --decoding-method $m \
+    --num-encoder-layers 2,2,2,2,2,2 \
+    --feedforward-dim 512,768,768,768,768,768 \
+    --encoder-dim 192,256,256,256,256,256 \
+    --encoder-unmasked-dim 192,192,192,192,192,192
+done
+```
+
+##### large-scaled model, number of model parameters: 148439574, i.e., 148.4 M
+
+The tensorboard log can be found at
+
+
+You can find a pretrained model, training logs, decoding logs, and decoding results at:
+
+
+You can use to deploy it.
+
+| decoding method | test-clean | test-other | comment |
+|----------------------|------------|------------|--------------------|
+| greedy_search | 2.12 | 4.91 | --epoch 30 --avg 9 |
+| modified_beam_search | 2.11 | 4.9 | --epoch 30 --avg 9 |
+| fast_beam_search | 2.13 | 4.93 | --epoch 30 --avg 9 |
+| greedy_search | 2.12 | 4.8 | --epoch 40 --avg 13 |
+| modified_beam_search | 2.11 | 4.7 | --epoch 40 --avg 13 |
+| fast_beam_search | 2.13 | 4.78 | --epoch 40 --avg 13 |
+
+The training command is:
+```bash
+export CUDA_VISIBLE_DEVICES="0,1,2,3"
+./zipformer/train.py \
+  --world-size 4 \
+  --num-epochs 40 \
+  --start-epoch 1 \
+  --use-fp16 1 \
+  --exp-dir zipformer/exp-large \
+  --causal 0 \
+  --num-encoder-layers 2,2,4,5,4,2 \
+  --feedforward-dim 512,768,1536,2048,1536,768 \
+  --encoder-dim 192,256,512,768,512,256 \
+  --encoder-unmasked-dim 192,192,256,320,256,192 \
+  --full-libri 1 \
+  --max-duration 1000
+```
+
+The decoding command is:
+```bash
+export CUDA_VISIBLE_DEVICES="0"
+for m in greedy_search modified_beam_search fast_beam_search; do
+  ./zipformer/decode.py \
+    --epoch 40 \
+    --avg 16 \
+    --exp-dir zipformer/exp-large \
+    --max-duration 600 \
+    --causal 0 \
+    --decoding-method $m \
+    --num-encoder-layers 2,2,4,5,4,2 \
+    --feedforward-dim 512,768,1536,2048,1536,768 \
+    --encoder-dim 192,256,512,768,512,256 \
+    --encoder-unmasked-dim 192,192,256,320,256,192
+done
+```
+
+#### streaming
+
+##### normal-scaled model, number of model parameters: 66110931, i.e., 66.11 M
+
+The tensorboard log can be found at
+
+
+You can find a pretrained model, training logs, decoding logs, and decoding results at:
+
+
+You can use to deploy it.
+
+| decoding method | chunk size | test-clean | test-other | decoding mode | comment |
+|----------------------|------------|------------|------------|---------------------|--------------------|
+| greedy_search | 320ms | 3.06 | 7.81 | simulated streaming | --epoch 30 --avg 8 --chunk-size 16 --left-context-frames 128 |
+| greedy_search | 320ms | 3.06 | 7.79 | chunk-wise | --epoch 30 --avg 8 --chunk-size 16 --left-context-frames 128 |
+| modified_beam_search | 320ms | 3.01 | 7.69 | simulated streaming | --epoch 30 --avg 8 --chunk-size 16 --left-context-frames 128 |
+| modified_beam_search | 320ms | 3.05 | 7.69 | chunk-wise | --epoch 30 --avg 8 --chunk-size 16 --left-context-frames 128 |
+| fast_beam_search | 320ms | 3.04 | 7.68 | simulated streaming | --epoch 30 --avg 8 --chunk-size 16 --left-context-frames 128 |
+| fast_beam_search | 320ms | 3.07 | 7.69 | chunk-wise | --epoch 30 --avg 8 --chunk-size 16 --left-context-frames 128 |
+| greedy_search | 640ms | 2.81 | 7.15 | simulated streaming | --epoch 30 --avg 8 --chunk-size 32 --left-context-frames 256 |
+| greedy_search | 640ms | 2.84 | 7.16 | chunk-wise | --epoch 30 --avg 8 --chunk-size 32 --left-context-frames 256 |
+| modified_beam_search | 640ms | 2.79 | 7.05 | simulated streaming | --epoch 30 --avg 8 --chunk-size 32 --left-context-frames 256 |
+| modified_beam_search | 640ms | 2.81 | 7.11 | chunk-wise | --epoch 30 --avg 8 --chunk-size 32 --left-context-frames 256 |
+| fast_beam_search | 640ms | 2.84 | 7.04 | simulated streaming | --epoch 30 --avg 8 --chunk-size 32 --left-context-frames 256 |
+| fast_beam_search | 640ms | 2.83 | 7.1 | chunk-wise | --epoch 30 --avg 8 --chunk-size 32 --left-context-frames 256 |
+
+Note: For decoding mode, `simulated streaming` indicates feeding the full utterance during decoding using `decode.py`,
+while `chunk-wise` indicates feeding a certain number of frames at a time using `streaming_decode.py`.
+
+The training command is:
+```bash
+export CUDA_VISIBLE_DEVICES="0,1,2,3"
+./zipformer/train.py \
+  --world-size 4 \
+  --num-epochs 40 \
+  --start-epoch 1 \
+  --use-fp16 1 \
+  --exp-dir zipformer/exp-causal \
+  --causal 1 \
+  --full-libri 1 \
+  --max-duration 1000
+```
+
+The simulated streaming decoding command is:
+```bash
+export CUDA_VISIBLE_DEVICES="0"
+for m in greedy_search modified_beam_search fast_beam_search; do
+  ./zipformer/decode.py \
+    --epoch 30 \
+    --avg 8 \
+    --use-averaged-model 1 \
+    --exp-dir ./zipformer/exp-causal \
+    --causal 1 \
+    --chunk-size 16 \
+    --left-context-frames 128 \
+    --max-duration 600 \
+    --decoding-method $m
+done
+```
+
+The chunk-wise streaming decoding command is:
+```bash
+export CUDA_VISIBLE_DEVICES="0"
+for m in greedy_search modified_beam_search fast_beam_search; do
+  ./zipformer/streaming_decode.py \
+    --epoch 30 \
+    --avg 8 \
+    --use-averaged-model 1 \
+    --exp-dir ./zipformer/exp-causal \
+    --causal 1 \
+    --chunk-size 16 \
+    --left-context-frames 128 \
+    --num-decode-streams 2000 \
+    --decoding-method $m
+done
+```
+
+### pruned_transducer_stateless7 (Fine-tune with mux)
+
+See for more details.
+
+[pruned_transducer_stateless7](./pruned_transducer_stateless7)
+
+The tensorboard log can be found at
+
+
+You can find the pretrained model and bpe model needed for fine-tuning at:
+
+
+You can find a fine-tuned model, fine-tuning logs, decoding logs, and decoding
+results at:
+
+
+You can use to deploy it.
+
+Number of model parameters: 70369391, i.e., 70.37 M
+
+| decoding method | dev | test | test-clean | test-other | comment |
+|----------------------|------------|------------|------------|------------|--------------------|
+| greedy_search | 14.27 | 14.22 | 2.08 | 4.79 | --epoch 20 --avg 5 |
+| modified_beam_search | 14.22 | 14.08 | 2.06 | 4.72 | --epoch 20 --avg 5 |
+| fast_beam_search | 14.23 | 14.17 | 2.08 | 4.09 | --epoch 20 --avg 5 |
+
+The training commands are:
+```bash
+export CUDA_VISIBLE_DEVICES="0,1"
+
+./pruned_transducer_stateless7/finetune.py \
+  --world-size 2 \
+  --num-epochs 20 \
+  --start-epoch 1 \
+  --exp-dir pruned_transducer_stateless7/exp_giga_finetune \
+  --subset S \
+  --use-fp16 1 \
+  --base-lr 0.005 \
+  --lr-epochs 100 \
+  --lr-batches 100000 \
+  --bpe-model icefall-asr-librispeech-pruned-transducer-stateless7-2022-11-11/data/lang_bpe_500/bpe.model \
+  --do-finetune True \
+  --use-mux True \
+  --finetune-ckpt icefall-asr-librispeech-pruned-transducer-stateless7-2022-11-11/exp/pretrain.pt \
+  --max-duration 500
+```
+
+The decoding commands are:
+```bash
+# greedy_search
+./pruned_transducer_stateless7/decode.py \
+  --epoch 20 \
+  --avg 5 \
+  --use-averaged-model 1 \
+  --exp-dir ./pruned_transducer_stateless7/exp_giga_finetune \
+  --max-duration 600 \
+  --decoding-method greedy_search
+
+# modified_beam_search
+./pruned_transducer_stateless7/decode.py \
+  --epoch 20 \
+  --avg 5 \
+  --use-averaged-model 1 \
+  --exp-dir ./pruned_transducer_stateless7/exp_giga_finetune \
+  --max-duration 600 \
+  --decoding-method modified_beam_search \
+  --beam-size 4
+
+# fast_beam_search
+./pruned_transducer_stateless7/decode.py \
+  --epoch 20 \
+  --avg 5 \
+  --use-averaged-model 1 \
+  --exp-dir ./pruned_transducer_stateless7/exp_giga_finetune \
+  --max-duration 600 \
+  --decoding-method fast_beam_search \
+  --beam 20.0 \
+  --max-contexts 8 \
+  --max-states 64
+```
+
+### pruned_transducer_stateless7 (zipformer + multidataset(LibriSpeech + GigaSpeech +
CommonVoice 13.0))
+
+See for more details.
+
+[pruned_transducer_stateless7](./pruned_transducer_stateless7)
+
+The tensorboard log can be found at
+
+
+You can find a pretrained model, training logs, decoding logs, and decoding
+results at:
+
+
+You can use to deploy it.
+
+Number of model parameters: 70369391, i.e., 70.37 M
+
+| decoding method | test-clean | test-other | comment |
+|----------------------|------------|------------|--------------------|
+| greedy_search | 1.91 | 4.06 | --epoch 30 --avg 7 |
+| modified_beam_search | 1.90 | 3.99 | --epoch 30 --avg 7 |
+| fast_beam_search | 1.90 | 3.98 | --epoch 30 --avg 7 |
+
+
+The training commands are:
+```bash
+export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
+
+./pruned_transducer_stateless7/train.py \
+  --world-size 8 \
+  --num-epochs 30 \
+  --use-multidataset 1 \
+  --use-fp16 1 \
+  --max-duration 750 \
+  --exp-dir pruned_transducer_stateless7/exp
+```
+
+The decoding commands are:
+```bash
+# greedy_search
+./pruned_transducer_stateless7/decode.py \
+  --epoch 30 \
+  --avg 7 \
+  --use-averaged-model 1 \
+  --exp-dir ./pruned_transducer_stateless7/exp \
+  --max-duration 600 \
+  --decoding-method greedy_search
+
+# modified_beam_search
+./pruned_transducer_stateless7/decode.py \
+  --epoch 30 \
+  --avg 7 \
+  --use-averaged-model 1 \
+  --exp-dir ./pruned_transducer_stateless7/exp \
+  --max-duration 600 \
+  --decoding-method modified_beam_search \
+  --beam-size 4
+
+# fast_beam_search
+./pruned_transducer_stateless7/decode.py \
+  --epoch 30 \
+  --avg 7 \
+  --use-averaged-model 1 \
+  --exp-dir ./pruned_transducer_stateless7/exp \
+  --max-duration 600 \
+  --decoding-method fast_beam_search \
+  --beam 20.0 \
+  --max-contexts 8 \
+  --max-states 64
+```
+
+### Streaming Zipformer-Transducer (Pruned Stateless Transducer + Streaming Zipformer + Multi-Dataset)
+
+#### [pruned_transducer_stateless7_streaming_multi](./pruned_transducer_stateless7_streaming_multi)
+
+See for more details.
+
+You can find a pretrained model, training logs, decoding logs, and decoding
+results at:
+
+Number of model parameters: 70369391, i.e., 70.37 M
+
+##### training on full librispeech + full gigaspeech (with giga_prob=0.9)
+
+The WERs are:
+
+
+| decoding method | chunk size | test-clean | test-other | comment | decoding mode |
+|----------------------|------------|------------|------------|---------------------|----------------------|
+| greedy search | 320ms | 2.43 | 6.0 | --epoch 20 --avg 4 | simulated streaming |
+| greedy search | 320ms | 2.47 | 6.13 | --epoch 20 --avg 4 | chunk-wise |
+| fast beam search | 320ms | 2.43 | 5.99 | --epoch 20 --avg 4 | simulated streaming |
+| fast beam search | 320ms | 2.8 | 6.46 | --epoch 20 --avg 4 | chunk-wise |
+| modified beam search | 320ms | 2.4 | 5.96 | --epoch 20 --avg 4 | simulated streaming |
+| modified beam search | 320ms | 2.42 | 6.03 | --epoch 20 --avg 4 | chunk-wise |
+| greedy search | 640ms | 2.26 | 5.58 | --epoch 20 --avg 4 | simulated streaming |
+| greedy search | 640ms | 2.33 | 5.76 | --epoch 20 --avg 4 | chunk-wise |
+| fast beam search | 640ms | 2.27 | 5.54 | --epoch 20 --avg 4 | simulated streaming |
+| fast beam search | 640ms | 2.37 | 5.75 | --epoch 20 --avg 4 | chunk-wise |
+| modified beam search | 640ms | 2.22 | 5.5 | --epoch 20 --avg 4 | simulated streaming |
+| modified beam search | 640ms | 2.25 | 5.69 | --epoch 20 --avg 4 | chunk-wise |
+
+The model also has good WERs on GigaSpeech.
The following WERs are achieved on GigaSpeech test and dev sets:
+
+| decoding method | chunk size | dev | test | comment | decoding mode |
+|----------------------|------------|-----|------|------------|---------------------|
+| greedy search | 320ms | 12.08 | 11.98 | --epoch 20 --avg 4 | simulated streaming |
+| greedy search | 640ms | 11.66 | 11.71 | --epoch 20 --avg 4 | simulated streaming |
+| modified beam search | 320ms | 11.95 | 11.83 | --epoch 20 --avg 4 | simulated streaming |
+| modified beam search | 320ms | 11.65 | 11.56 | --epoch 20 --avg 4 | simulated streaming |
+
+
+Note: `simulated streaming` indicates feeding the full utterance during decoding using `decode.py`,
+while `chunk-wise` indicates feeding a certain number of frames at a time using `streaming_decode.py`.
+
+The training command is:
+
+```bash
+./pruned_transducer_stateless7_streaming_multi/train.py \
+  --world-size 4 \
+  --num-epochs 20 \
+  --start-epoch 1 \
+  --use-fp16 1 \
+  --exp-dir pruned_transducer_stateless7_streaming_multi/exp \
+  --full-libri 1 \
+  --giga-prob 0.9 \
+  --max-duration 750 \
+  --master-port 12345
+```
+
+The tensorboard log can be found at
+
+
+The simulated streaming decoding command (e.g., chunk-size=320ms) is:
+```bash
+for m in greedy_search fast_beam_search modified_beam_search; do
+  ./pruned_transducer_stateless7_streaming_multi/decode.py \
+    --epoch 20 \
+    --avg 4 \
+    --exp-dir ./pruned_transducer_stateless7_streaming_multi/exp \
+    --max-duration 600 \
+    --decode-chunk-len 32 \
+    --right-padding 64 \
+    --decoding-method $m
+done
+```
+
+The chunk-wise streaming decoding command (e.g., chunk-size=320ms) is:
+```bash
+for m in greedy_search modified_beam_search fast_beam_search; do
+  ./pruned_transducer_stateless7_streaming_multi/streaming_decode.py \
+    --epoch 20 \
+    --avg 4 \
+    --exp-dir ./pruned_transducer_stateless7_streaming_multi/exp \
+    --decoding-method $m \
+    --decode-chunk-len 32 \
+    --num-decode-streams 2000
+done
+```
+
+
+#### Smaller model
+
+We
also provide a very small version (only 6.1M parameters) of this setup. The training command for the small model is:
+
+```bash
+./pruned_transducer_stateless7_streaming_multi/train.py \
+  --world-size 4 \
+  --num-epochs 30 \
+  --start-epoch 1 \
+  --use-fp16 1 \
+  --exp-dir pruned_transducer_stateless7_streaming_multi/exp \
+  --full-libri 1 \
+  --giga-prob 0.9 \
+  --num-encoder-layers "2,2,2,2,2" \
+  --feedforward-dims "256,256,512,512,256" \
+  --nhead "4,4,4,4,4" \
+  --encoder-dims "128,128,128,128,128" \
+  --attention-dims "96,96,96,96,96" \
+  --encoder-unmasked-dims "96,96,96,96,96" \
+  --max-duration 1200 \
+  --master-port 12345
+```
+
+You can find this pretrained small model and its training logs, decoding logs, and decoding
+results at:
+
+
+
+| decoding method | chunk size | test-clean | test-other | comment | decoding mode |
+|----------------------|------------|------------|------------|---------------------|----------------------|
+| greedy search | 320ms | 5.95 | 15.03 | --epoch 30 --avg 1 | simulated streaming |
+| greedy search | 640ms | 5.61 | 13.86 | --epoch 30 --avg 1 | simulated streaming |
+| modified beam search | 320ms | 5.72 | 14.34 | --epoch 30 --avg 1 | simulated streaming |
+| modified beam search | 640ms | 5.43 | 13.16 | --epoch 30 --avg 1 | simulated streaming |
+| fast beam search | 320ms | 5.88 | 14.45 | --epoch 30 --avg 1 | simulated streaming |
+| fast beam search | 640ms | 5.48 | 13.31 | --epoch 30 --avg 1 | simulated streaming |
+
+This small model achieves the following WERs on GigaSpeech test and dev sets:
+
+| decoding method | chunk size | dev | test | comment | decoding mode |
+|----------------------|------------|------------|------------|---------------------|----------------------|
+| greedy search | 320ms | 17.57 | 17.2 | --epoch 30 --avg 1 | simulated streaming |
+| modified beam search | 320ms | 16.98 | 11.98 | --epoch 30 --avg 1 | simulated streaming |
+
+You can find the tensorboard logs at .
+
 ### Streaming Zipformer-Transducer (Pruned Stateless Transducer + Streaming Zipformer)
 #### [pruned_transducer_stateless7_streaming](./pruned_transducer_stateless7_streaming)
@@ -53,7 +585,7 @@ The tensorboard log can be found at
 The simulated streaming decoding command (e.g., chunk-size=320ms) is:
 ```bash
-for $m in greedy_search fast_beam_search modified_beam_search; do
+for m in greedy_search fast_beam_search modified_beam_search; do
   ./pruned_transducer_stateless7_streaming/decode.py \
     --epoch 30 \
     --avg 9 \
@@ -76,6 +608,90 @@ for m in greedy_search modified_beam_search fast_beam_search; do
     --num-decode-streams 2000
 done
 ```
+We also support decoding with neural network LMs. After combining with language models, the WERs are
+| decoding method | chunk size | test-clean | test-other | comment | decoding mode |
+|----------------------|------------|------------|------------|---------------------|----------------------|
+| `modified_beam_search` | 320ms | 3.11 | 7.93 | --epoch 30 --avg 9 | simulated streaming |
+| `modified_beam_search_lm_shallow_fusion` | 320ms | 2.58 | 6.65 | --epoch 30 --avg 9 | simulated streaming |
+| `modified_beam_search_lm_rescore` | 320ms | 2.59 | 6.86 | --epoch 30 --avg 9 | simulated streaming |
+| `modified_beam_search_lm_rescore_LODR` | 320ms | 2.52 | 6.73 | --epoch 30 --avg 9 | simulated streaming |
+
+Please use the following command for `modified_beam_search_lm_shallow_fusion`:
+```bash
+for lm_scale in $(seq 0.15 0.01 0.38); do
+  for beam_size in 4 8 12; do
+    ./pruned_transducer_stateless7_streaming/decode.py \
+      --epoch 99 \
+      --avg 1 \
+      --use-averaged-model False \
+      --beam-size $beam_size \
+      --exp-dir ./pruned_transducer_stateless7_streaming/exp-large-LM \
+      --max-duration 600 \
+      --decode-chunk-len 32 \
+      --decoding-method modified_beam_search_lm_shallow_fusion \
+      --use-shallow-fusion 1 \
+      --lm-type rnn \
+      --lm-exp-dir rnn_lm/exp \
+      --lm-epoch 99 \
+      --lm-scale $lm_scale \
+      --lm-avg 1 \
+      --rnn-lm-embedding-dim 2048 \
+      --rnn-lm-hidden-dim 2048 \
+      --rnn-lm-num-layers 3 \
+      --lm-vocab-size 500
+  done
+done
+```
+
+Please use the following command for `modified_beam_search_lm_rescore`:
+```bash
+./pruned_transducer_stateless7_streaming/decode.py \
+  --epoch 30 \
+  --avg 9 \
+  --use-averaged-model True \
+  --beam-size 8 \
+  --exp-dir ./pruned_transducer_stateless7_streaming/exp \
+  --max-duration 600 \
+  --decode-chunk-len 32 \
+  --decoding-method modified_beam_search_lm_rescore \
+  --use-shallow-fusion 0 \
+  --lm-type rnn \
+  --lm-exp-dir rnn_lm/exp \
+  --lm-epoch 99 \
+  --lm-avg 1 \
+  --rnn-lm-embedding-dim 2048 \
+  --rnn-lm-hidden-dim 2048 \
+  --rnn-lm-num-layers 3 \
+  --lm-vocab-size 500
+```
+
+Please use the following command for `modified_beam_search_lm_rescore_LODR`:
+```bash
+./pruned_transducer_stateless7_streaming/decode.py \
+  --epoch 30 \
+  --avg 9 \
+  --use-averaged-model True \
+  --beam-size 8 \
+  --exp-dir ./pruned_transducer_stateless7_streaming/exp \
+  --max-duration 600 \
+  --decode-chunk-len 32 \
+  --decoding-method modified_beam_search_lm_rescore_LODR \
+  --use-shallow-fusion 0 \
+  --lm-type rnn \
+  --lm-exp-dir rnn_lm/exp \
+  --lm-epoch 99 \
+  --lm-avg 1 \
+  --rnn-lm-embedding-dim 2048 \
+  --rnn-lm-hidden-dim 2048 \
+  --rnn-lm-num-layers 3 \
+  --lm-vocab-size 500 \
+  --tokens-ngram 2 \
+  --backoff-id 500
+```
+
+A well-trained RNNLM can be found here: . The bi-gram used in LODR decoding
+can be found here: .
+
 #### Smaller model
@@ -540,6 +1156,10 @@ for m in greedy_search fast_beam_search modified_beam_search ; do
 done
 ```
+Note that a small change is made to the `pruned_transducer_stateless7/decoder.py` in
+this [PR](https://github.com/k2-fsa/icefall/pull/942) to address the
+problem of emitting the first symbol at the very beginning. If you need a
+model without this issue, please download the model from here:
 ### LibriSpeech BPE training results (Pruned Stateless LSTM RNN-T + gradient filter)