From a6f4bc77c8aee73f88d984f85fc4d4c8d0921b0e Mon Sep 17 00:00:00 2001
From: Fangjun Kuang
Date: Wed, 1 Jun 2022 08:32:36 +0800
Subject: [PATCH] Update results for streaming Emformer.

---
 egs/librispeech/ASR/README.md  |  1 +
 egs/librispeech/ASR/RESULTS.md | 64 ++++++++++++++++++++++++++++++++++
 2 files changed, 65 insertions(+)

diff --git a/egs/librispeech/ASR/README.md b/egs/librispeech/ASR/README.md
index a738b652f..e2aaa9d7e 100644
--- a/egs/librispeech/ASR/README.md
+++ b/egs/librispeech/ASR/README.md
@@ -22,6 +22,7 @@ The following table lists the differences among them.
 | `pruned_transducer_stateless4` | Conformer(modified) | Embedding + Conv1d | same as pruned_transducer_stateless2 + save averaged models periodically during training |
 | `pruned_transducer_stateless5` | Conformer(modified) | Embedding + Conv1d | same as pruned_transducer_stateless4 + more layers + random combiner|
 | `pruned_transducer_stateless6` | Conformer(modified) | Embedding + Conv1d | same as pruned_transducer_stateless4 + distillation with hubert|
+| `pruned_stateless_emformer_rnnt2` | Emformer(from torchaudio) | Embedding + Conv1d | Uses Emformer from torchaudio for streaming ASR|
 
 The decoder in `transducer_stateless` is modified from the paper
diff --git a/egs/librispeech/ASR/RESULTS.md b/egs/librispeech/ASR/RESULTS.md
index 453751ba5..15f72e55f 100644
--- a/egs/librispeech/ASR/RESULTS.md
+++ b/egs/librispeech/ASR/RESULTS.md
@@ -1,5 +1,69 @@
 ## Results
 
+### LibriSpeech BPE training results (Pruned Stateless Emformer RNN-T)
+
+[pruned_stateless_emformer_rnnt2](./pruned_stateless_emformer_rnnt2)
+
+Use [Emformer](https://arxiv.org/abs/2010.10759) from [torchaudio](https://github.com/pytorch/audio)
+for streaming ASR. The Emformer model is imported from torchaudio without modifications.
+
+| decoding method                     | test-clean | test-other | comment                                 |
+|-------------------------------------|------------|------------|-----------------------------------------|
+| greedy search (max sym per frame 1) | 4.28       | 11.42      | --epoch 39 --avg 6 --max-duration 600   |
+| modified beam search                | 4.22       | 11.16      | --epoch 39 --avg 6 --max-duration 600   |
+| fast beam search                    | 4.29       | 11.26      | --epoch 39 --avg 6 --max-duration 600   |
+
+The training commands are:
+```bash
+export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
+
+./pruned_stateless_emformer_rnnt2/train.py \
+  --world-size 8 \
+  --num-epochs 40 \
+  --start-epoch 1 \
+  --exp-dir pruned_stateless_emformer_rnnt2/exp-full \
+  --full-libri 1 \
+  --use-fp16 0 \
+  --max-duration 200 \
+  --prune-range 5 \
+  --lm-scale 0.25 \
+  --master-port 12358 \
+  --num-encoder-layers 18 \
+  --left-context-length 128 \
+  --segment-length 8 \
+  --right-context-length 4
+```
+
+The tensorboard log can be found at
+
+
+The decoding commands are:
+```bash
+for m in greedy_search fast_beam_search modified_beam_search; do
+  for epoch in 39; do
+    for avg in 6; do
+      ./pruned_stateless_emformer_rnnt2/decode.py \
+        --epoch $epoch \
+        --avg $avg \
+        --use-averaged-model 1 \
+        --exp-dir pruned_stateless_emformer_rnnt2/exp-full \
+        --max-duration 50 \
+        --decoding-method $m \
+        --num-encoder-layers 18 \
+        --left-context-length 128 \
+        --segment-length 8 \
+        --right-context-length 4
+    done
+  done
+done
+```
+
+You can find a pretrained model, training logs, decoding logs, and decoding
+results at:
+
+
 ### LibriSpeech BPE training results (Pruned Stateless Transducer 5)
 
 [pruned_transducer_stateless5](./pruned_transducer_stateless5)
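
For readers who want to see how the encoder flags in the commands above relate to torchaudio's API, here is a minimal sketch, assuming the public `torchaudio.models.Emformer` interface in recent torchaudio. Only `num_layers`, `segment_length`, `left_context_length`, and `right_context_length` come from the training command; the input dimension, number of attention heads, and feed-forward size are illustrative placeholders, and the recipe's own wrapping (feature front-end, any rescaling of these lengths, and the pruned RNN-T decoder) is not shown.

```python
import torch
from torchaudio.models import Emformer

# Values taken from the training command in this patch:
#   --num-encoder-layers 18  --segment-length 8
#   --left-context-length 128  --right-context-length 4
# input_dim / num_heads / ffn_dim are placeholders (not stated in the patch).
encoder = Emformer(
    input_dim=512,
    num_heads=8,
    ffn_dim=2048,
    num_layers=18,
    segment_length=8,
    left_context_length=128,
    right_context_length=4,
)
encoder.eval()

batch_size = 1
feature_dim = 512
chunk_frames = 8 + 4  # segment_length + right_context_length (look-ahead)

# Streaming inference: feed one segment at a time and carry `states`
# across calls so the model can attend to its cached left context.
states = None
with torch.no_grad():
    for _ in range(3):  # three consecutive chunks of a hypothetical stream
        chunk = torch.randn(batch_size, chunk_frames, feature_dim)
        lengths = torch.full((batch_size,), chunk_frames)
        output, output_lengths, states = encoder.infer(chunk, lengths, states)
        # `output` holds the encoder frames for this segment only.
```

The carried `states` are what make the model streaming: each `infer()` call sees only one segment of frames plus the short right-context look-ahead, while the left-context history is kept inside the states rather than re-fed with every chunk.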