I needed this in order to pull unreleased fixes. The last tagged version was too old (dating back to July 2023) and was not compatible with recent lhotse releases.

Introduction
icefall contains ASR recipes for various datasets using https://github.com/k2-fsa/k2.
You can use https://github.com/k2-fsa/sherpa to deploy models trained with icefall.
You can try pre-trained models from within your browser, without downloading or installing anything, by visiting https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition. See https://k2-fsa.github.io/icefall/huggingface/spaces.html for more details.
Installation
Please refer to https://icefall.readthedocs.io/en/latest/installation/index.html for installation.
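As a quick post-installation sanity check, you can verify that the core dependencies import correctly. This is a minimal sketch that assumes you have installed torch, k2, and lhotse and put icefall on your PYTHONPATH as described in the documentation linked above:

```python
# Minimal post-installation sanity check (assumes torch, k2, and lhotse are
# installed and icefall is on PYTHONPATH, as described in the docs above).
import torch
import k2
import lhotse
import icefall

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("lhotse", lhotse.__version__)
print("k2 imported from", k2.__file__)
print("icefall imported from", icefall.__file__)
```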
Recipes
Please refer to https://icefall.readthedocs.io/en/latest/recipes/index.html for more information.
We provide the following recipes:
- yesno
- LibriSpeech
- GigaSpeech
- AMI
- Aishell
- Aishell2
- Aishell4
- TIMIT
- TED-LIUM3
- Aidatatang_200zh
- WenetSpeech
- Alimeeting
- Switchboard
- TAL_CSASR
yesno
This is the simplest ASR recipe in icefall and can be run on CPU.
Training takes less than 30 seconds and gives you the following WER:
[test_set] %WER 0.42% [1 / 240, 0 ins, 1 del, 0 sub ]
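The bracketed numbers are the error counts from which the WER is computed. As a minimal sketch of the arithmetic (not icefall's actual scoring code): WER = (insertions + deletions + substitutions) / reference words.

```python
# Hedged sketch of the WER arithmetic behind the result above:
# WER = (ins + del + sub) / reference word count.
def wer(ins: int, dels: int, subs: int, ref_words: int) -> float:
    return 100.0 * (ins + dels + subs) / ref_words

print(f"{wer(0, 1, 0, 240):.2f}%")  # -> 0.42%, matching the yesno result
```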
We provide a Colab notebook for this recipe:
LibriSpeech
Please see https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md for the latest results.
We provide 5 models for this recipe:
- conformer CTC model
- TDNN LSTM CTC model
- Transducer: Conformer encoder + LSTM decoder
- Transducer: Conformer encoder + Embedding decoder
- Transducer: Zipformer encoder + Embedding decoder
Conformer CTC Model
The best WER we currently have is:
|     | test-clean | test-other |
|-----|------------|------------|
| WER | 2.42       | 5.73       |
We provide a Colab notebook to run a pre-trained conformer CTC model:
TDNN LSTM CTC Model
The WER for this model is:
|     | test-clean | test-other |
|-----|------------|------------|
| WER | 6.59       | 17.69      |
We provide a Colab notebook to run a pre-trained TDNN LSTM CTC model:
Transducer: Conformer encoder + LSTM decoder
This model uses a Conformer encoder and an LSTM decoder.
The best WER with greedy search is:
|     | test-clean | test-other |
|-----|------------|------------|
| WER | 3.07       | 7.51       |
We provide a Colab notebook to run a pre-trained RNN-T conformer model:
Transducer: Conformer encoder + Embedding decoder
This model uses a Conformer encoder. The decoder consists of an embedding layer and a convolutional layer (a sketch of this stateless decoder appears below).
The best WER using modified beam search with beam size 4 is:
|     | test-clean | test-other |
|-----|------------|------------|
| WER | 2.56       | 6.27       |
Note: No auxiliary losses are used in the training and no LMs are used in the decoding.
We provide a Colab notebook to run a pre-trained transducer conformer + stateless decoder model:
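To make the stateless decoder described above concrete, here is a minimal PyTorch sketch: an embedding layer followed by a depthwise 1-D convolution over a small label context. The hyper-parameters and layer arrangement are illustrative assumptions, not the exact icefall implementation:

```python
import torch
import torch.nn as nn

class StatelessDecoder(nn.Module):
    """Sketch of a stateless transducer decoder: one embedding layer plus one
    1-D convolution over the last `context_size` emitted labels. Dimensions
    are illustrative, not the values used in the icefall recipes."""
    def __init__(self, vocab_size: int, embed_dim: int = 512, context_size: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, embed_dim,
                              kernel_size=context_size, groups=embed_dim)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, context_size) previously emitted labels
        emb = self.embedding(y).permute(0, 2, 1)   # (batch, embed_dim, context)
        out = self.conv(emb).permute(0, 2, 1)      # (batch, 1, embed_dim)
        return torch.relu(out)

dec = StatelessDecoder(vocab_size=500)
y = torch.tensor([[3, 7]])        # the two previously emitted labels
print(dec(y).shape)               # torch.Size([1, 1, 512])
```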
k2 pruned RNN-T
| Encoder         | Params | test-clean | test-other | epochs | devices    |
|-----------------|--------|------------|------------|--------|------------|
| zipformer       | 65.5M  | 2.21       | 4.79       | 50     | 4 32G-V100 |
| zipformer-small | 23.2M  | 2.42       | 5.73       | 50     | 2 32G-V100 |
| zipformer-large | 148.4M | 2.06       | 4.63       | 50     | 4 32G-V100 |
| zipformer-large | 148.4M | 2.00       | 4.38       | 174    | 8 80G-A100 |
Note: No auxiliary losses are used in the training and no LMs are used in the decoding.
k2 pruned RNN-T + GigaSpeech
|     | test-clean | test-other |
|-----|------------|------------|
| WER | 1.78       | 4.08       |
Note: No auxiliary losses are used in the training and no LMs are used in the decoding.
k2 pruned RNN-T + GigaSpeech + CommonVoice
|     | test-clean | test-other |
|-----|------------|------------|
| WER | 1.90       | 3.98       |
Note: No auxiliary losses are used in the training and no LMs are used in the decoding.
GigaSpeech
We provide three models for this recipe:
- Conformer CTC model
- Pruned stateless RNN-T: Conformer encoder + Embedding decoder + k2 pruned RNN-T loss.
- Transducer: Zipformer encoder + Embedding decoder
Conformer CTC
|     | Dev   | Test  |
|-----|-------|-------|
| WER | 10.47 | 10.58 |
Pruned stateless RNN-T: Conformer encoder + Embedding decoder + k2 pruned RNN-T loss
|                      | Dev   | Test  |
|----------------------|-------|-------|
| greedy search        | 10.51 | 10.73 |
| fast beam search     | 10.50 | 10.69 |
| modified beam search | 10.40 | 10.51 |
Transducer: Zipformer encoder + Embedding decoder
|                      | Dev   | Test  |
|----------------------|-------|-------|
| greedy search        | 10.31 | 10.50 |
| fast beam search     | 10.26 | 10.48 |
| modified beam search | 10.25 | 10.38 |
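The decoding methods in the tables above (greedy search, fast beam search, modified beam search) differ in how hypotheses are expanded at each frame. As a rough illustration only, a frame-synchronous greedy search for a transducer can be sketched as follows; the decoder/joiner signatures here are assumptions for illustration, not icefall's actual decoding API:

```python
import torch

def greedy_search(encoder_out, decoder, joiner, blank_id=0,
                  context_size=2, max_sym_per_frame=4):
    """Hedged sketch of frame-synchronous greedy search for a transducer.

    encoder_out: (T, encoder_dim) frames from the encoder.
    decoder(prev_labels) and joiner(enc_frame, dec_out) are assumed callables
    used only to show the control flow.
    """
    hyp = [blank_id] * context_size                        # left context for a stateless decoder
    dec_out = decoder(torch.tensor([hyp[-context_size:]]))
    for t in range(encoder_out.size(0)):
        for _ in range(max_sym_per_frame):                 # bounded emissions per frame
            logits = joiner(encoder_out[t], dec_out)
            y = int(logits.argmax())
            if y == blank_id:                              # blank -> advance to the next frame
                break
            hyp.append(y)                                  # non-blank -> emit and update decoder
            dec_out = decoder(torch.tensor([hyp[-context_size:]]))
    return hyp[context_size:]                              # drop the initial blank context

# Toy demo with stub components (shapes only), just to exercise the control flow:
V, D, T = 10, 16, 20
enc = torch.randn(T, D)
stub_decoder = lambda labels: torch.randn(1, D)
stub_joiner = lambda enc_frame, dec_out: torch.randn(V)
print(greedy_search(enc, stub_decoder, stub_joiner))
```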
Aishell
We provide three models for this recipe: a conformer CTC model, a TDNN LSTM CTC model, and a Transducer Stateless model.
Conformer CTC Model
The best CER we currently have is:
|     | test |
|-----|------|
| CER | 4.26 |
TDNN LSTM CTC Model
The CER for this model is:
|     | test  |
|-----|-------|
| CER | 10.16 |
We provide a Colab notebook to run a pre-trained TDNN LSTM CTC model:
Transducer Stateless Model
The best CER we currently have is:
|     | test |
|-----|------|
| CER | 4.38 |
We provide a Colab notebook to run a pre-trained TransducerStateless model:
Aishell2
We provide one model for this recipe: Transducer Stateless Model.
Transducer Stateless Model
The best WER we currently have is:
|     | dev-ios | test-ios |
|-----|---------|----------|
| WER | 5.32    | 5.56     |
Aishell4
We provide one model for this recipe: Pruned stateless RNN-T: Conformer encoder + Embedding decoder + k2 pruned RNN-T loss.
Pruned stateless RNN-T: Conformer encoder + Embedding decoder + k2 pruned RNN-T loss (trained with all subsets)
The best CER we currently have is:
|     | test  |
|-----|-------|
| CER | 29.08 |
We provide a Colab notebook to run a pre-trained Pruned Transducer Stateless model:
TIMIT
We provide two models for this recipe: TDNN LSTM CTC model and TDNN LiGRU CTC model.
TDNN LSTM CTC Model
The best PER we currently have is:
|     | TEST   |
|-----|--------|
| PER | 19.71% |
We provide a Colab notebook to run a pre-trained TDNN LSTM CTC model:
TDNN LiGRU CTC Model
The PER for this model is:
|     | TEST   |
|-----|--------|
| PER | 17.66% |
We provide a Colab notebook to run a pre-trained TDNN LiGRU CTC model:
TED-LIUM3
We provide two models for this recipe: a Transducer Stateless model (Conformer encoder + Embedding decoder) and a Pruned Transducer Stateless model (Conformer encoder + Embedding decoder + k2 pruned RNN-T loss).
Transducer Stateless: Conformer encoder + Embedding decoder
The best WER using modified beam search with beam size 4 is:
|     | dev  | test |
|-----|------|------|
| WER | 6.91 | 6.33 |
Note: No auxiliary losses are used in the training and no LMs are used in the decoding.
We provide a Colab notebook to run a pre-trained Transducer Stateless model:
Pruned Transducer Stateless: Conformer encoder + Embedding decoder + k2 pruned RNN-T loss
The best WER using modified beam search with beam size 4 is:
|     | dev  | test |
|-----|------|------|
| WER | 6.77 | 6.14 |
We provide a Colab notebook to run a pre-trained Pruned Transducer Stateless model:
Aidatatang_200zh
We provide one model for this recipe: Pruned stateless RNN-T: Conformer encoder + Embedding decoder + k2 pruned RNN-T loss.
Pruned stateless RNN-T: Conformer encoder + Embedding decoder + k2 pruned RNN-T loss
|                      | Dev  | Test |
|----------------------|------|------|
| greedy search        | 5.53 | 6.59 |
| fast beam search     | 5.30 | 6.34 |
| modified beam search | 5.27 | 6.33 |
We provide a Colab notebook to run a pre-trained Pruned Transducer Stateless model:
WenetSpeech
We provide two models for this recipe: Pruned stateless RNN-T_2 (Conformer encoder + Embedding decoder + k2 pruned RNN-T loss) and Pruned stateless RNN-T_5 (Conformer encoder + Embedding decoder + k2 pruned RNN-T loss).
Pruned stateless RNN-T_2: Conformer encoder + Embedding decoder + k2 pruned RNN-T loss (trained with L subset, offline ASR)
|                      | Dev  | Test-Net | Test-Meeting |
|----------------------|------|----------|--------------|
| greedy search        | 7.80 | 8.75     | 13.49        |
| modified beam search | 7.76 | 8.71     | 13.41        |
| fast beam search     | 7.94 | 8.74     | 13.80        |
Pruned stateless RNN-T_5: Conformer encoder + Embedding decoder + k2 pruned RNN-T loss (trained with L subset)
Streaming:
|                      | Dev  | Test-Net | Test-Meeting |
|----------------------|------|----------|--------------|
| greedy_search        | 8.78 | 10.12    | 16.16        |
| modified_beam_search | 8.53 | 9.95     | 15.81        |
| fast_beam_search     | 9.01 | 10.47    | 16.28        |
We provide a Colab notebook to run a pre-trained Pruned Transducer Stateless2 model:
Alimeeting
We provide one model for this recipe: Pruned stateless RNN-T: Conformer encoder + Embedding decoder + k2 pruned RNN-T loss.
Pruned stateless RNN-T: Conformer encoder + Embedding decoder + k2 pruned RNN-T loss (trained with far subset)
|                      | Eval  | Test-Net |
|----------------------|-------|----------|
| greedy search        | 31.77 | 34.66    |
| fast beam search     | 31.39 | 33.02    |
| modified beam search | 30.38 | 34.25    |
We provide a Colab notebook to run a pre-trained Pruned Transducer Stateless model:
TAL_CSASR
We provide one model for this recipe: Pruned stateless RNN-T: Conformer encoder + Embedding decoder + k2 pruned RNN-T loss.
Pruned stateless RNN-T: Conformer encoder + Embedding decoder + k2 pruned RNN-T loss
The best results for Chinese CER (%) and English WER (%), respectively (zh: Chinese, en: English), are:
| decoding-method      | dev  | dev_zh | dev_en | test | test_zh | test_en |
|----------------------|------|--------|--------|------|---------|---------|
| greedy_search        | 7.30 | 6.48   | 19.19  | 7.39 | 6.66    | 19.13   |
| modified_beam_search | 7.15 | 6.35   | 18.95  | 7.22 | 6.50    | 18.70   |
| fast_beam_search     | 7.18 | 6.39   | 18.90  | 7.27 | 6.55    | 18.77   |
We provide a Colab notebook to run a pre-trained Pruned Transducer Stateless model:
Deployment with C++
Once you have trained a model in icefall, you may want to deploy it with C++, without Python dependencies.
Please refer to the documentation https://icefall.readthedocs.io/en/latest/recipes/Non-streaming-ASR/librispeech/conformer_ctc.html#deployment-with-c for how to do this.
We also provide a Colab notebook showing you how to run a torch-scripted model in k2 with C++.
Please see:
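As an illustration of what the C++ deployment path relies on, the sketch below exports a stand-in model to TorchScript so it can be loaded from C++ with LibTorch. The module and file name are placeholders; the icefall recipes ship their own export scripts for the real models:

```python
import torch
import torch.nn as nn

# Stand-in acoustic model used only to illustrate the export step; in practice
# you would export the trained icefall model via the recipe's export script.
class TinyAcousticModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(80, 500)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats).log_softmax(dim=-1)

model = TinyAcousticModel().eval()
scripted = torch.jit.script(model)   # produce a TorchScript module
scripted.save("cpu_jit.pt")          # illustrative file name
# From C++ (LibTorch): auto module = torch::jit::load("cpu_jit.pt");
```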