update RESULTS.md

Mirror of https://github.com/k2-fsa/icefall.git (synced 2025-09-19 05:54:20 +00:00)
Parent: 7f94b86bb0
Commit: 1f5216236f
@@ -1,5 +1,163 @@
 ## Results
 
+### LibriSpeech BPE training results (Pruned Stateless Conv-Emformer RNN-T)
+
+[conv_emformer_transducer_stateless](./conv_emformer_transducer_stateless)
+
+It implements [Emformer](https://arxiv.org/abs/2010.10759) augmented with a convolution module for streaming ASR.
+It is modified from [torchaudio](https://github.com/pytorch/audio).
+
+See <https://github.com/k2-fsa/icefall/pull/389> for more details.
+
+#### Training on full LibriSpeech
+
+The WERs are:
+
+|                                     | test-clean | test-other | comment             | decoding mode       |
+|-------------------------------------|------------|------------|---------------------|---------------------|
+| greedy search (max sym per frame 1) | 3.63       | 9.61       | --epoch 30 --avg 10 | simulated streaming |
+| greedy search (max sym per frame 1) | 3.64       | 9.65       | --epoch 30 --avg 10 | streaming           |
+| fast beam search                    | 3.61       | 9.4        | --epoch 30 --avg 10 | simulated streaming |
+| fast beam search                    | 3.58       | 9.5        | --epoch 30 --avg 10 | streaming           |
+| modified beam search                | 3.56       | 9.41       | --epoch 30 --avg 10 | simulated streaming |
+| modified beam search                | 3.54       | 9.46       | --epoch 30 --avg 10 | streaming           |
+
+The training command is:
+
+```bash
+./conv_emformer_transducer_stateless/train.py \
+  --world-size 6 \
+  --num-epochs 30 \
+  --start-epoch 1 \
+  --exp-dir conv_emformer_transducer_stateless/exp \
+  --full-libri 1 \
+  --max-duration 300 \
+  --master-port 12321 \
+  --num-encoder-layers 12 \
+  --chunk-length 32 \
+  --cnn-module-kernel 31 \
+  --left-context-length 32 \
+  --right-context-length 8 \
+  --memory-size 32
+```
+
+The tensorboard log can be found at
+<https://tensorboard.dev/experiment/4em2FLsxRwGhmoCRQUEoDw/>
+
+The simulated streaming decoding command using greedy search is:
+```bash
+./conv_emformer_transducer_stateless/decode.py \
+  --epoch 30 \
+  --avg 10 \
+  --exp-dir conv_emformer_transducer_stateless/exp \
+  --max-duration 300 \
+  --num-encoder-layers 12 \
+  --chunk-length 32 \
+  --cnn-module-kernel 31 \
+  --left-context-length 32 \
+  --right-context-length 8 \
+  --memory-size 32 \
+  --decoding-method greedy_search \
+  --use-averaged-model True
+```
+
+The simulated streaming decoding command using fast beam search is:
+```bash
+./conv_emformer_transducer_stateless/decode.py \
+  --epoch 30 \
+  --avg 10 \
+  --exp-dir conv_emformer_transducer_stateless/exp \
+  --max-duration 300 \
+  --num-encoder-layers 12 \
+  --chunk-length 32 \
+  --cnn-module-kernel 31 \
+  --left-context-length 32 \
+  --right-context-length 8 \
+  --memory-size 32 \
+  --decoding-method fast_beam_search \
+  --use-averaged-model True \
+  --beam 4 \
+  --max-contexts 4 \
+  --max-states 8
+```
+
+The simulated streaming decoding command using modified beam search is:
+```bash
+./conv_emformer_transducer_stateless/decode.py \
+  --epoch 30 \
+  --avg 10 \
+  --exp-dir conv_emformer_transducer_stateless/exp \
+  --max-duration 300 \
+  --num-encoder-layers 12 \
+  --chunk-length 32 \
+  --cnn-module-kernel 31 \
+  --left-context-length 32 \
+  --right-context-length 8 \
+  --memory-size 32 \
+  --decoding-method modified_beam_search \
+  --use-averaged-model True \
+  --beam-size 4
+```
+
+The streaming decoding command using greedy search is:
+```bash
+./conv_emformer_transducer_stateless/streaming_decode.py \
+  --epoch 30 \
+  --avg 10 \
+  --exp-dir conv_emformer_transducer_stateless/exp \
+  --num-decode-streams 2000 \
+  --num-encoder-layers 12 \
+  --chunk-length 32 \
+  --cnn-module-kernel 31 \
+  --left-context-length 32 \
+  --right-context-length 8 \
+  --memory-size 32 \
+  --decoding-method greedy_search \
+  --use-averaged-model True
+```
+
+The streaming decoding command using fast beam search is:
+```bash
+./conv_emformer_transducer_stateless/streaming_decode.py \
+  --epoch 30 \
+  --avg 10 \
+  --exp-dir conv_emformer_transducer_stateless/exp \
+  --num-decode-streams 2000 \
+  --num-encoder-layers 12 \
+  --chunk-length 32 \
+  --cnn-module-kernel 31 \
+  --left-context-length 32 \
+  --right-context-length 8 \
+  --memory-size 32 \
+  --decoding-method fast_beam_search \
+  --use-averaged-model True \
+  --beam 4 \
+  --max-contexts 4 \
+  --max-states 8
+```
+
+The streaming decoding command using modified beam search is:
+```bash
+./conv_emformer_transducer_stateless/streaming_decode.py \
+  --epoch 30 \
+  --avg 10 \
+  --exp-dir conv_emformer_transducer_stateless/exp \
+  --num-decode-streams 2000 \
+  --num-encoder-layers 12 \
+  --chunk-length 32 \
+  --cnn-module-kernel 31 \
+  --left-context-length 32 \
+  --right-context-length 8 \
+  --memory-size 32 \
+  --decoding-method modified_beam_search \
+  --use-averaged-model True \
+  --beam-size 4
+```
+
+Pretrained models, training logs, decoding logs, and decoding results
+are available at
+<https://huggingface.co/Zengwei/icefall-asr-librispeech-conv-emformer-transducer-stateless-2022-06-11>
+
 ### LibriSpeech BPE training results (Pruned Stateless Emformer RNN-T)
 
 [pruned_stateless_emformer_rnnt2](./pruned_stateless_emformer_rnnt2)
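
Reviewer note on the Conv-Emformer WER table added in the hunk above: the sketch below computes how far the true streaming WERs deviate from the simulated streaming ones. The numbers are copied from that table; the `relative_change` helper and the dictionary layout are illustrative choices and not part of icefall.

```python
# WERs (%) copied from the Conv-Emformer table in this commit:
# method -> decoding mode -> (test-clean, test-other)
wers = {
    "greedy search": {
        "simulated streaming": (3.63, 9.61),
        "streaming": (3.64, 9.65),
    },
    "fast beam search": {
        "simulated streaming": (3.61, 9.4),
        "streaming": (3.58, 9.5),
    },
    "modified beam search": {
        "simulated streaming": (3.56, 9.41),
        "streaming": (3.54, 9.46),
    },
}


def relative_change(reference: float, value: float) -> float:
    """Return the change of `value` relative to `reference`, in percent."""
    return (value - reference) / reference * 100.0


# Collect (method, test set, relative change) for every table cell pair.
changes = []
for method, modes in wers.items():
    simulated = modes["simulated streaming"]
    streaming = modes["streaming"]
    for test_set, sim, real in zip(("test-clean", "test-other"), simulated, streaming):
        delta = relative_change(sim, real)
        changes.append((method, test_set, delta))
        print(f"{method:22s} {test_set}: {sim:.2f} -> {real:.2f} ({delta:+.2f}% relative)")
```

With these numbers, every streaming result stays within about 1.1% relative of its simulated streaming counterpart, in either direction.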
@@ -280,12 +438,12 @@ The WERs are:
 
 | | test-clean | test-other | comment |
 |-------------------------------------|------------|------------|-------------------------------------------------------------------------------|
-| greedy search (max sym per frame 1) | 2.75 | 6.74 | --epoch 30 --avg 6 --use_averaged_model False |
-| greedy search (max sym per frame 1) | 2.69 | 6.64 | --epoch 30 --avg 6 --use_averaged_model True |
-| fast beam search | 2.72 | 6.67 | --epoch 30 --avg 6 --use_averaged_model False |
-| fast beam search | 2.66 | 6.6 | --epoch 30 --avg 6 --use_averaged_model True |
-| modified beam search | 2.67 | 6.68 | --epoch 30 --avg 6 --use_averaged_model False |
-| modified beam search | 2.62 | 6.57 | --epoch 30 --avg 6 --use_averaged_model True |
+| greedy search (max sym per frame 1) | 2.75 | 6.74 | --epoch 30 --avg 6 --use-averaged-model False |
+| greedy search (max sym per frame 1) | 2.69 | 6.64 | --epoch 30 --avg 6 --use-averaged-model True |
+| fast beam search | 2.72 | 6.67 | --epoch 30 --avg 6 --use-averaged-model False |
+| fast beam search | 2.66 | 6.6 | --epoch 30 --avg 6 --use-averaged-model True |
+| modified beam search | 2.67 | 6.68 | --epoch 30 --avg 6 --use-averaged-model False |
+| modified beam search | 2.62 | 6.57 | --epoch 30 --avg 6 --use-averaged-model True |
 
 The training command is:
 
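
Reviewer note on the `--use_averaged_model` to `--use-averaged-model` rename in the hunk above: the spelling matters because `argparse` only accepts the option string it was registered with, while the parsed attribute always uses underscores. The sketch below is a minimal illustration under stated assumptions: `str2bool` and `build_parser` are hypothetical stand-ins written for this note, and the defaults are arbitrary; none of it is copied from the icefall scripts.

```python
import argparse


def str2bool(value: str) -> bool:
    """Hypothetical stand-in for a string-to-bool converter."""
    lowered = value.lower()
    if lowered in ("true", "yes", "1"):
        return True
    if lowered in ("false", "no", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with the three flags quoted in the table."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epoch", type=int, default=30)  # arbitrary default
    parser.add_argument("--avg", type=int, default=6)  # arbitrary default
    # Registered with hyphens; argparse stores it as `use_averaged_model`.
    parser.add_argument("--use-averaged-model", type=str2bool, default=False)
    return parser


# The hyphenated spelling parses, and the attribute name uses underscores.
args = build_parser().parse_args(
    ["--epoch", "30", "--avg", "6", "--use-averaged-model", "True"]
)
print(args.use_averaged_model)

# The underscore spelling is not a registered option, so argparse exits.
try:
    build_parser().parse_args(["--use_averaged_model", "True"])
    underscore_accepted = True
except SystemExit:
    underscore_accepted = False
print("underscore spelling accepted:", underscore_accepted)
```

If the real scripts register the flag with hyphens, as the corrected table suggests, a command copied from the old table would fail in the same way.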
@@ -16,7 +16,57 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+"""
+Usage:
+(1) greedy search
+./conv_emformer_transducer_stateless/streaming_decode.py \
+  --epoch 30 \
+  --avg 10 \
+  --exp-dir conv_emformer_transducer_stateless/exp \
+  --num-decode-streams 2000 \
+  --num-encoder-layers 12 \
+  --chunk-length 32 \
+  --cnn-module-kernel 31 \
+  --left-context-length 32 \
+  --right-context-length 8 \
+  --memory-size 32 \
+  --decoding-method greedy_search \
+  --use-averaged-model True
+
+(2) modified beam search
+./conv_emformer_transducer_stateless/streaming_decode.py \
+  --epoch 30 \
+  --avg 10 \
+  --exp-dir conv_emformer_transducer_stateless/exp \
+  --num-decode-streams 2000 \
+  --num-encoder-layers 12 \
+  --chunk-length 32 \
+  --cnn-module-kernel 31 \
+  --left-context-length 32 \
+  --right-context-length 8 \
+  --memory-size 32 \
+  --decoding-method modified_beam_search \
+  --use-averaged-model True \
+  --beam-size 4
+
+(3) fast beam search
+./conv_emformer_transducer_stateless/streaming_decode.py \
+  --epoch 30 \
+  --avg 10 \
+  --exp-dir conv_emformer_transducer_stateless/exp \
+  --num-decode-streams 2000 \
+  --num-encoder-layers 12 \
+  --chunk-length 32 \
+  --cnn-module-kernel 31 \
+  --left-context-length 32 \
+  --right-context-length 8 \
+  --memory-size 32 \
+  --decoding-method fast_beam_search \
+  --use-averaged-model True \
+  --beam 4 \
+  --max-contexts 4 \
+  --max-states 8
+"""
 import argparse
 import logging
 import warnings
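
Reviewer note on the streaming options in the usage text above: the commit does not say what unit `--chunk-length`, `--left-context-length` and `--right-context-length` are measured in. The sketch below converts them to milliseconds under the assumption that they count 10 ms feature frames; both that unit and the `frames_to_ms` helper are assumptions made for this note, so treat the resulting figures as indicative only.

```python
FRAME_SHIFT_MS = 10  # assumption: one feature frame every 10 ms


def frames_to_ms(num_frames: int, frame_shift_ms: int = FRAME_SHIFT_MS) -> int:
    """Convert a frame count to a duration in milliseconds."""
    return num_frames * frame_shift_ms


# Values taken from the commands in the usage text.
chunk_ms = frames_to_ms(32)  # --chunk-length 32
left_context_ms = frames_to_ms(32)  # --left-context-length 32
right_context_ms = frames_to_ms(8)  # --right-context-length 8

# The encoder needs the chunk plus its look-ahead before it can emit output.
audio_needed_per_step_ms = chunk_ms + right_context_ms

print(f"chunk: {chunk_ms} ms")
print(f"left context: {left_context_ms} ms")
print(f"right context (look-ahead): {right_context_ms} ms")
print(f"audio needed per step: {audio_needed_per_step_ms} ms")
```

Under that assumption, each decoding step consumes a 320 ms chunk with 80 ms of look-ahead.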
@@ -686,8 +736,9 @@ def decode_dataset(
             )
             del streams[i]
 
-    key = "greedy_search"
-    if params.decoding_method == "fast_beam_search":
+    if params.decoding_method == "greedy_search":
+        key = "greedy_search"
+    elif params.decoding_method == "fast_beam_search":
         key = (
             f"beam_{params.beam}_"
             f"max_contexts_{params.max_contexts}_"
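
Reviewer note on the hunk above: it replaces an unconditional `key = "greedy_search"` followed by an `if` with an explicit `if`/`elif` chain, so each decoding method selects its own result key. The sketch below reproduces only the two branches visible in the hunk. The `result_key` function, the use of `SimpleNamespace` for `params`, and the final fallback are hypothetical; the fast beam search key is cut off in the hunk after its second fragment, so the sketch stops there too.

```python
from types import SimpleNamespace


def result_key(params) -> str:
    """Pick a label for the decoding results, mirroring the visible branches."""
    if params.decoding_method == "greedy_search":
        return "greedy_search"
    elif params.decoding_method == "fast_beam_search":
        # Only these two fragments are visible in the hunk; the real key
        # continues past this point.
        return f"beam_{params.beam}_" f"max_contexts_{params.max_contexts}"
    # Hypothetical fallback: the remaining branches are not shown in the hunk.
    return params.decoding_method


greedy = SimpleNamespace(decoding_method="greedy_search")
fast = SimpleNamespace(decoding_method="fast_beam_search", beam=4, max_contexts=4)

print(result_key(greedy))
print(result_key(fast))
```

Returning from each branch keeps every method's key independent, which is the same property the `if`/`elif` chain gives the original code.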