## Results

### LibriSpeech BPE training results (Pruned Transducer 3, 2022-04-29)

`pruned_transducer_stateless3`: same as Pruned Transducer 2 but using the XL subset from GigaSpeech as extra training data.

During training, each batch is drawn either from GigaSpeech with probability `giga_prob` or from LibriSpeech with probability `1 - giga_prob`. All utterances within a batch come from the same dataset.
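That selection amounts to a per-batch coin flip over two dataloaders. Below is a minimal sketch of the idea; the function and iterator names are illustrative, not the recipe's actual code:

```python
import random
from typing import Any, Iterator


def interleaved_batches(libri_loader, giga_loader,
                        giga_prob: float = 0.8) -> Iterator[Any]:
    """Yield a GigaSpeech batch with probability giga_prob, otherwise a
    LibriSpeech batch; every yielded batch comes from a single dataset."""
    libri_iter, giga_iter = iter(libri_loader), iter(giga_loader)
    while True:
        try:
            if random.random() < giga_prob:
                yield next(giga_iter)
            else:
                yield next(libri_iter)
        except StopIteration:
            return  # one of the datasets is exhausted; end the epoch
```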

Using commit ac84220de91dee10c00e8f4223287f937b1930b6.

See https://github.com/k2-fsa/icefall/pull/312.

The WERs are:

| decoding method                     | test-clean | test-other | comment                                |
|-------------------------------------|------------|------------|----------------------------------------|
| greedy search (max sym per frame 1) | 2.21       | 5.09       | --epoch 27 --avg 2 --max-duration 600  |
| greedy search (max sym per frame 1) | 2.25       | 5.02       | --epoch 27 --avg 12 --max-duration 600 |
| modified beam search                | 2.19       | 5.03       | --epoch 25 --avg 6 --max-duration 600  |
| modified beam search                | 2.23       | 4.94       | --epoch 27 --avg 10 --max-duration 600 |
| beam search                         | 2.16       | 4.95       | --epoch 25 --avg 7 --max-duration 600  |
| fast beam search                    | 2.21       | 4.96       | --epoch 27 --avg 10 --max-duration 600 |
| fast beam search                    | 2.19       | 4.97       | --epoch 27 --avg 12 --max-duration 600 |

The training commands are:

```bash
./prepare.sh
./prepare_giga_speech.sh

export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"

./pruned_transducer_stateless3/train.py \
  --world-size 8 \
  --num-epochs 30 \
  --start-epoch 0 \
  --full-libri 1 \
  --exp-dir pruned_transducer_stateless3/exp \
  --max-duration 300 \
  --use-fp16 1 \
  --lr-epochs 4 \
  --num-workers 2 \
  --giga-prob 0.8
```

The tensorboard log can be found at https://tensorboard.dev/experiment/gaD34WeYSMCOkzoo3dZXGg/ (Note: The training process was killed manually after saving epoch-28.pt.)

Pretrained models, training logs, decoding logs, and decoding results are available at https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless3-2022-04-29

The decoding commands are:


```bash
# greedy search
./pruned_transducer_stateless3/decode.py \
    --epoch 27 \
    --avg 2 \
    --exp-dir ./pruned_transducer_stateless3/exp \
    --max-duration 600 \
    --decoding-method greedy_search \
    --max-sym-per-frame 1

# modified beam search
./pruned_transducer_stateless3/decode.py \
    --epoch 25 \
    --avg 6 \
    --exp-dir ./pruned_transducer_stateless3/exp \
    --max-duration 600 \
    --decoding-method modified_beam_search \
    --max-sym-per-frame 1

# beam search
./pruned_transducer_stateless3/decode.py \
    --epoch 25 \
    --avg 7 \
    --exp-dir ./pruned_transducer_stateless3/exp \
    --max-duration 600 \
    --decoding-method beam_search \
    --max-sym-per-frame 1

# fast beam search
for epoch in 27; do
  for avg in 10 12; do
    ./pruned_transducer_stateless3/decode.py \
        --epoch $epoch \
        --avg $avg \
        --exp-dir ./pruned_transducer_stateless3/exp \
        --max-duration 600 \
        --decoding-method fast_beam_search \
        --max-states 32 \
        --beam 8
  done
done
```

The following table shows the Nbest oracle WER for fast beam search.

| epoch | avg | num_paths | nbest_scale | test-clean | test-other |
|-------|-----|-----------|-------------|------------|------------|
| 27    | 10  | 50        | 0.5         | 0.91       | 2.74       |
| 27    | 10  | 50        | 0.8         | 0.94       | 2.82       |
| 27    | 10  | 50        | 1.0         | 1.06       | 2.88       |
| 27    | 10  | 100       | 0.5         | 0.82       | 2.58       |
| 27    | 10  | 100       | 0.8         | 0.92       | 2.65       |
| 27    | 10  | 100       | 1.0         | 0.95       | 2.77       |
| 27    | 10  | 200       | 0.5         | 0.81       | 2.50       |
| 27    | 10  | 200       | 0.8         | 0.85       | 2.56       |
| 27    | 10  | 200       | 1.0         | 0.91       | 2.64       |
| 27    | 10  | 400       | 0.5         | N/A        | N/A        |
| 27    | 10  | 400       | 0.8         | 0.81       | 2.49       |
| 27    | 10  | 400       | 1.0         | 0.85       | 2.54       |

The Nbest oracle WER is computed using the following steps:

1. Use `fast_beam_search` to produce a lattice.
2. Extract N paths from the lattice using `k2.random_paths`.
3. Deduplicate the paths so that each one has a distinct sequence of tokens.
4. Compute the edit distance of each path against the ground truth.
5. The path with the lowest edit distance is the final output and is used to compute the WER.
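Steps 4 and 5 reduce to picking the hypothesis with the minimum edit distance to the reference. A minimal sketch of that oracle selection is below (plain-Python Levenshtein distance; the recipe itself computes this differently):

```python
from typing import List


def edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def oracle_hypothesis(hyps: List[List[str]], ref: List[str]) -> List[str]:
    """Return the unique hypothesis closest to the reference; its error
    count is what the Nbest oracle WER measures."""
    return min(hyps, key=lambda h: edit_distance(h, ref))
```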

The command to compute the Nbest oracle WER is:

```bash
for epoch in 27; do
  for avg in 10; do
    for num_paths in 50 100 200 400; do
      for nbest_scale in 0.5 0.8 1.0; do
        ./pruned_transducer_stateless3/decode.py \
            --epoch $epoch \
            --avg $avg \
            --exp-dir ./pruned_transducer_stateless3/exp \
            --max-duration 600 \
            --decoding-method fast_beam_search_nbest_oracle \
            --num-paths $num_paths \
            --max-states 32 \
            --beam 8 \
            --nbest-scale $nbest_scale
      done
    done
  done
done
```

### LibriSpeech BPE training results (Pruned Transducer 3, 2022-05-13)

Same setup as pruned_transducer_stateless3 (2022-04-29), but with `--giga-prob` changed from 0.8 to 0.9. The GigaSpeech XL subset is also repeated so that its dataloader never exhausts.
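The "repeat" here just means restarting the GigaSpeech iterator whenever it is exhausted, so the `giga_prob` sampling never runs out of Giga batches before training finishes. A minimal sketch, assuming `make_loader` is a factory that builds a fresh dataloader (illustrative; not the recipe's actual code):

```python
from typing import Any, Callable, Iterator


def repeat_forever(make_loader: Callable[[], Any]) -> Iterator[Any]:
    """Restart the dataloader whenever it is exhausted, so the
    GigaSpeech side of the batch sampling never runs dry."""
    while True:
        for batch in make_loader():
            yield batch
```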

The WERs are:

| decoding method                     | test-clean | test-other | comment                                    |
|-------------------------------------|------------|------------|--------------------------------------------|
| greedy search (max sym per frame 1) | 2.03       | 4.70       | --iter 1224000 --avg 14 --max-duration 600 |
| modified beam search                | 2.00       | 4.63       | --iter 1224000 --avg 14 --max-duration 600 |
| fast beam search                    | 2.10       | 4.68       | --iter 1224000 --avg 14 --max-duration 600 |

The training commands are:

```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"

./prepare.sh
./prepare_giga_speech.sh

./pruned_transducer_stateless3/train.py \
  --world-size 8 \
  --num-epochs 30 \
  --start-epoch 0 \
  --full-libri 1 \
  --exp-dir pruned_transducer_stateless3/exp-0.9 \
  --max-duration 300 \
  --use-fp16 1 \
  --lr-epochs 4 \
  --num-workers 2 \
  --giga-prob 0.9
```

The tensorboard log is available at https://tensorboard.dev/experiment/HpocR7dKS9KCQkJeYxfXug/

Decoding commands:

```bash
for iter in 1224000; do
  for avg in 14; do
    for method in greedy_search modified_beam_search fast_beam_search; do
      ./pruned_transducer_stateless3/decode.py \
        --iter $iter \
        --avg $avg \
        --exp-dir ./pruned_transducer_stateless3/exp-0.9/ \
        --max-duration 600 \
        --decoding-method $method \
        --max-sym-per-frame 1 \
        --beam 4 \
        --max-contexts 32
    done
  done
done
```

The pretrained models, training logs, decoding logs, and decoding results can be found at https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13

### LibriSpeech BPE training results (Pruned Transducer 2)

`pruned_transducer_stateless2`: this uses a reworked version of the Conformer encoder, with many changes.

#### Training on full LibriSpeech

Using commit 34aad74a2c849542dd5f6359c9e6b527e8782fd6. See https://github.com/k2-fsa/icefall/pull/288

The WERs are:

| decoding method                     | test-clean | test-other | comment                                                                     |
|-------------------------------------|------------|------------|-----------------------------------------------------------------------------|
| greedy search (max sym per frame 1) | 2.62       | 6.37       | --epoch 25 --avg 8 --max-duration 600                                       |
| fast beam search                    | 2.61       | 6.17       | --epoch 25 --avg 8 --max-duration 600 --decoding-method fast_beam_search    |
| modified beam search                | 2.59       | 6.19       | --epoch 25 --avg 8 --max-duration 600 --decoding-method modified_beam_search |
| greedy search (max sym per frame 1) | 2.70       | 6.04       | --epoch 34 --avg 10 --max-duration 600                                      |
| fast beam search                    | 2.66       | 6.00       | --epoch 34 --avg 10 --max-duration 600 --decoding-method fast_beam_search   |
| greedy search (max sym per frame 1) | 2.62       | 6.03       | --epoch 38 --avg 10 --max-duration 600                                      |
| fast beam search                    | 2.57       | 5.95       | --epoch 38 --avg 10 --max-duration 600 --decoding-method fast_beam_search   |

The training and decoding commands are:

```bash
python3 ./pruned_transducer_stateless2/train.py \
  --exp-dir=pruned_transducer_stateless2/exp \
  --world-size 8 \
  --num-epochs 26 \
  --full-libri 1 \
  --max-duration 300

python3 ./pruned_transducer_stateless2/decode.py \
  --exp-dir pruned_transducer_stateless2/exp \
  --epoch 25 \
  --avg 8 \
  --bpe-model ./data/lang_bpe_500/bpe.model \
  --max-duration 600
```

The Tensorboard log is at https://tensorboard.dev/experiment/Xoz0oABMTWewo1slNFXkyA (apologies, log starts only from epoch 3).

The pretrained models, training logs, decoding logs, and decoding results can be found at https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless2-2022-04-29

#### Training on train-clean-100

Trained with 1 job:

```bash
python3 ./pruned_transducer_stateless2/train.py \
  --exp-dir=pruned_transducer_stateless2/exp_100h_ws1 \
  --world-size 1 \
  --num-epochs 40 \
  --full-libri 0 \
  --max-duration 300
```

and decoded with:

```bash
python3 ./pruned_transducer_stateless2/decode.py \
  --exp-dir pruned_transducer_stateless2/exp_100h_ws1 \
  --epoch 19 \
  --avg 8 \
  --bpe-model ./data/lang_bpe_500/bpe.model \
  --max-duration 600
```

The Tensorboard log is at https://tensorboard.dev/experiment/AhnhooUBRPqTnaggoqo7lg (learning rate schedule is not visible due to a since-fixed bug).

| decoding method                     | test-clean | test-other | comment                                                |
|-------------------------------------|------------|------------|--------------------------------------------------------|
| greedy search (max sym per frame 1) | 7.12       | 18.42      | --epoch 19 --avg 8                                     |
| greedy search (max sym per frame 1) | 6.71       | 17.77      | --epoch 29 --avg 8                                     |
| greedy search (max sym per frame 1) | 6.64       | 17.19      | --epoch 39 --avg 10                                    |
| fast beam search                    | 6.58       | 17.27      | --epoch 29 --avg 8 --decoding-method fast_beam_search  |
| fast beam search                    | 6.53       | 16.82      | --epoch 39 --avg 10 --decoding-method fast_beam_search |

Trained with 2 jobs:

```bash
python3 ./pruned_transducer_stateless2/train.py \
  --exp-dir=pruned_transducer_stateless2/exp_100h_ws2 \
  --world-size 2 \
  --num-epochs 40 \
  --full-libri 0 \
  --max-duration 300
```

and decoded with:

```bash
python3 ./pruned_transducer_stateless2/decode.py \
  --exp-dir pruned_transducer_stateless2/exp_100h_ws2 \
  --epoch 19 \
  --avg 8 \
  --bpe-model ./data/lang_bpe_500/bpe.model \
  --max-duration 600
```

The Tensorboard log is at https://tensorboard.dev/experiment/dvOC9wsrSdWrAIdsebJILg/ (learning rate schedule is not visible due to a since-fixed bug).

| decoding method                     | test-clean | test-other | comment             |
|-------------------------------------|------------|------------|---------------------|
| greedy search (max sym per frame 1) | 7.05       | 18.77      | --epoch 19 --avg 8  |
| greedy search (max sym per frame 1) | 6.82       | 18.14      | --epoch 29 --avg 8  |
| greedy search (max sym per frame 1) | 6.81       | 17.66      | --epoch 30 --avg 10 |

Trained with 4 jobs:

```bash
python3 ./pruned_transducer_stateless2/train.py \
  --exp-dir=pruned_transducer_stateless2/exp_100h_ws4 \
  --world-size 4 \
  --num-epochs 40 \
  --full-libri 0 \
  --max-duration 300
```

and decoded with:

```bash
python3 ./pruned_transducer_stateless2/decode.py \
  --exp-dir pruned_transducer_stateless2/exp_100h_ws4 \
  --epoch 19 \
  --avg 8 \
  --bpe-model ./data/lang_bpe_500/bpe.model \
  --max-duration 600
```

The Tensorboard log is at https://tensorboard.dev/experiment/a3T0TyC0R5aLj5bmFbRErA/ (learning rate schedule is not visible due to a since-fixed bug).

| decoding method                     | test-clean | test-other | comment             |
|-------------------------------------|------------|------------|---------------------|
| greedy search (max sym per frame 1) | 7.31       | 19.55      | --epoch 19 --avg 8  |
| greedy search (max sym per frame 1) | 7.08       | 18.59      | --epoch 29 --avg 8  |
| greedy search (max sym per frame 1) | 6.86       | 18.29      | --epoch 30 --avg 10 |

Trained with 1 job, with `--use-fp16=True --max-duration=300`, i.e. with half-precision floats (but without increasing `--max-duration`), after merging https://github.com/k2-fsa/icefall/pull/305. The training command was:

```bash
python3 ./pruned_transducer_stateless2/train.py \
  --exp-dir=pruned_transducer_stateless2/exp_100h_fp16 \
  --world-size 1 \
  --num-epochs 40 \
  --full-libri 0 \
  --max-duration 300 \
  --use-fp16 True
```

The Tensorboard log is at https://tensorboard.dev/experiment/DAtGG9lpQJCROUDwPNxwpA

| decoding method                     | test-clean | test-other | comment             |
|-------------------------------------|------------|------------|---------------------|
| greedy search (max sym per frame 1) | 7.10       | 18.57      | --epoch 19 --avg 8  |
| greedy search (max sym per frame 1) | 6.81       | 17.84      | --epoch 29 --avg 8  |
| greedy search (max sym per frame 1) | 6.63       | 17.39      | --epoch 30 --avg 10 |

Trained with 1 job, with `--use-fp16=True --max-duration=500`, i.e. with half-precision floats and `--max-duration` increased from 300 to 500, after merging https://github.com/k2-fsa/icefall/pull/305. The training command was:

```bash
python3 ./pruned_transducer_stateless2/train.py \
  --exp-dir=pruned_transducer_stateless2/exp_100h_fp16 \
  --world-size 1 \
  --num-epochs 40 \
  --full-libri 0 \
  --max-duration 500 \
  --use-fp16 True
```

The Tensorboard log is at https://tensorboard.dev/experiment/Km7QBHYnSLWs4qQnAJWsaA

| decoding method                     | test-clean | test-other | comment             |
|-------------------------------------|------------|------------|---------------------|
| greedy search (max sym per frame 1) | 7.10       | 18.79      | --epoch 19 --avg 8  |
| greedy search (max sym per frame 1) | 6.92       | 18.16      | --epoch 29 --avg 8  |
| greedy search (max sym per frame 1) | 6.89       | 17.75      | --epoch 30 --avg 10 |

### LibriSpeech BPE training results (Pruned Transducer)

Conformer encoder + non-recurrent decoder. The decoder contains only an embedding layer, a Conv1d (with kernel size 2), and a linear layer (to transform the tensor dimension).
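A minimal sketch of such a stateless decoder is given below; the layer sizes and names are illustrative, not the exact ones in the recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StatelessDecoder(nn.Module):
    """Embedding -> causal Conv1d (kernel 2) -> Linear, as described above."""

    def __init__(self, vocab_size: int, embed_dim: int = 512,
                 out_dim: int = 512, context_size: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # The Conv1d sees only the last `context_size` tokens, replacing
        # the recurrent prediction network of a conventional RNN-T.
        self.conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=context_size)
        self.output_linear = nn.Linear(embed_dim, out_dim)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, seq_len) token IDs -> (batch, seq_len, out_dim)
        x = self.embedding(y).permute(0, 2, 1)           # (B, E, T)
        x = F.pad(x, (self.conv.kernel_size[0] - 1, 0))  # left pad: causal
        x = F.relu(self.conv(x)).permute(0, 2, 1)        # (B, T, E)
        return self.output_linear(x)
```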

#### 2022-03-12

`pruned_transducer_stateless`

Using commit 1603744469d167d848e074f2ea98c587153205fa. See https://github.com/k2-fsa/icefall/pull/248

The WERs are:

| decoding method                     | test-clean | test-other | comment                                 |
|-------------------------------------|------------|------------|-----------------------------------------|
| greedy search (max sym per frame 1) | 2.62       | 6.37       | --epoch 42 --avg 11 --max-duration 100  |
| greedy search (max sym per frame 2) | 2.62       | 6.37       | --epoch 42 --avg 11 --max-duration 100  |
| greedy search (max sym per frame 3) | 2.62       | 6.37       | --epoch 42 --avg 11 --max-duration 100  |
| modified beam search (beam size 4)  | 2.56       | 6.27       | --epoch 42 --avg 11 --max-duration 100  |
| beam search (beam size 4)           | 2.57       | 6.27       | --epoch 42 --avg 11 --max-duration 100  |

The decoding time for test-clean and test-other is given below (a V100 GPU with 32 GB RAM is used for decoding; note that not all GPU RAM is used):

| decoding method                       | test-clean (seconds) | test-other (seconds) |
|---------------------------------------|----------------------|----------------------|
| greedy search (--max-sym-per-frame=1) | 160                  | 159                  |
| greedy search (--max-sym-per-frame=2) | 184                  | 177                  |
| greedy search (--max-sym-per-frame=3) | 210                  | 213                  |
| modified beam search (--beam-size 4)  | 273                  | 269                  |
| beam search (--beam-size 4)           | 2741                 | 2221                 |

We recommend using `modified_beam_search`.

Training command:

```bash
cd egs/librispeech/ASR/
./prepare.sh

export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"

. path.sh

./pruned_transducer_stateless/train.py \
  --world-size 8 \
  --num-epochs 60 \
  --start-epoch 0 \
  --exp-dir pruned_transducer_stateless/exp \
  --full-libri 1 \
  --max-duration 300 \
  --prune-range 5 \
  --lr-factor 5 \
  --lm-scale 0.25
```

The tensorboard training log can be found at https://tensorboard.dev/experiment/WKRFY5fYSzaVBHahenpNlA/

The command for decoding is:

```bash
epoch=42
avg=11
sym=1

# greedy search
./pruned_transducer_stateless/decode.py \
  --epoch $epoch \
  --avg $avg \
  --exp-dir ./pruned_transducer_stateless/exp \
  --max-duration 100 \
  --decoding-method greedy_search \
  --beam-size 4 \
  --max-sym-per-frame $sym

# modified beam search
./pruned_transducer_stateless/decode.py \
  --epoch $epoch \
  --avg $avg \
  --exp-dir ./pruned_transducer_stateless/exp \
  --max-duration 100 \
  --decoding-method modified_beam_search \
  --beam-size 4

# beam search
# (not recommended)
./pruned_transducer_stateless/decode.py \
  --epoch $epoch \
  --avg $avg \
  --exp-dir ./pruned_transducer_stateless/exp \
  --max-duration 100 \
  --decoding-method beam_search \
  --beam-size 4
```

You can find a pre-trained model, decoding logs, and decoding results at https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless-2022-03-12

#### 2022-02-18

`pruned_transducer_stateless`

The WERs are:

| decoding method | test-clean | test-other | comment                                |
|-----------------|------------|------------|----------------------------------------|
| greedy search   | 2.85       | 6.98       | --epoch 28 --avg 15 --max-duration 100 |

The training command for reproducing is given below:

```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"

./pruned_transducer_stateless/train.py \
  --world-size 4 \
  --num-epochs 30 \
  --start-epoch 0 \
  --exp-dir pruned_transducer_stateless/exp \
  --full-libri 1 \
  --max-duration 300 \
  --prune-range 5 \
  --lr-factor 5 \
  --lm-scale 0.25
```

The tensorboard training log can be found at https://tensorboard.dev/experiment/ejG7VpakRYePNNj6AbDEUw/#scalars

The decoding command is:

```bash
epoch=28
avg=15

# greedy search
./pruned_transducer_stateless/decode.py \
  --epoch $epoch \
  --avg $avg \
  --exp-dir pruned_transducer_stateless/exp \
  --max-duration 100
```

### LibriSpeech BPE training results (Transducer)

#### Conformer encoder + embedding decoder

Conformer encoder + non-recurrent decoder. The decoder contains only an embedding layer and a Conv1d (with kernel size 2).

##### 2022-03-01

Using commit 2332ba312d7ce72f08c7bac1e3312f7e3dd722dc.

It uses GigaSpeech as extra training data. 20% of the time it selects a batch from the L subset of GigaSpeech and 80% of the time it selects a batch from LibriSpeech.

The WERs are:

| decoding method                     | test-clean | test-other | comment                                |
|-------------------------------------|------------|------------|----------------------------------------|
| greedy search (max sym per frame 1) | 2.64       | 6.55       | --epoch 39 --avg 15 --max-duration 100 |
| modified beam search (beam size 4)  | 2.61       | 6.46       | --epoch 39 --avg 15 --max-duration 100 |

The training command for reproducing is given below:

```bash
cd egs/librispeech/ASR/
./prepare.sh
./prepare_giga_speech.sh

export CUDA_VISIBLE_DEVICES="0,1,2,3"

./transducer_stateless_multi_datasets/train.py \
  --world-size 4 \
  --num-epochs 40 \
  --start-epoch 0 \
  --exp-dir transducer_stateless_multi_datasets/exp-full-2 \
  --full-libri 1 \
  --max-duration 300 \
  --lr-factor 5 \
  --bpe-model data/lang_bpe_500/bpe.model \
  --modified-transducer-prob 0.25 \
  --giga-prob 0.2
```

The tensorboard training log can be found at https://tensorboard.dev/experiment/xmo5oCgrRVelH9dCeOkYBg/

The decoding command is:

```bash
epoch=39
avg=15
sym=1

# greedy search
./transducer_stateless_multi_datasets/decode.py \
  --epoch $epoch \
  --avg $avg \
  --exp-dir transducer_stateless_multi_datasets/exp-full-2 \
  --bpe-model ./data/lang_bpe_500/bpe.model \
  --max-duration 100 \
  --context-size 2 \
  --max-sym-per-frame $sym

# modified beam search
./transducer_stateless_multi_datasets/decode.py \
  --epoch $epoch \
  --avg $avg \
  --exp-dir transducer_stateless_multi_datasets/exp-full-2 \
  --bpe-model ./data/lang_bpe_500/bpe.model \
  --max-duration 100 \
  --context-size 2 \
  --decoding-method modified_beam_search \
  --beam-size 4
```

You can find a pretrained model by visiting https://huggingface.co/csukuangfj/icefall-asr-librispeech-transducer-stateless-multi-datasets-bpe-500-2022-03-01

##### 2022-04-19

`transducer_stateless2`

This version uses torchaudio's RNN-T loss.

Using commit fce7f3cd9a486405ee008bcbe4999264f27774a3. See https://github.com/k2-fsa/icefall/pull/316

| decoding method                     | test-clean | test-other | comment                                                                      |
|-------------------------------------|------------|------------|------------------------------------------------------------------------------|
| greedy search (max sym per frame 1) | 2.65       | 6.30       | --epoch 59 --avg 10 --max-duration 600                                       |
| greedy search (max sym per frame 2) | 2.62       | 6.23       | --epoch 59 --avg 10 --max-duration 100                                       |
| greedy search (max sym per frame 3) | 2.62       | 6.23       | --epoch 59 --avg 10 --max-duration 100                                       |
| modified beam search                | 2.63       | 6.15       | --epoch 59 --avg 10 --max-duration 100 --decoding-method modified_beam_search |
| beam search                         | 2.59       | 6.15       | --epoch 59 --avg 10 --max-duration 100 --decoding-method beam_search         |

Note: This model is trained with standard RNN-T loss; neither modified transducer nor pruned RNN-T is used. You can see that there is a performance degradation in WER when we limit the maximum number of symbols per frame to 1.

The number of active paths in modified_beam_search and beam_search is 4.
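For reference, `--max-sym-per-frame` caps how many non-blank symbols greedy search may emit before moving to the next encoder frame. A simplified sketch follows; the `decoder` and `joiner` callables are placeholders, not the recipe's actual interfaces:

```python
import torch


def greedy_search(encoder_out: torch.Tensor, decoder, joiner,
                  blank_id: int = 0, max_sym_per_frame: int = 1) -> list:
    """encoder_out: (T, C) for one utterance. `decoder` maps the current
    hypothesis to a decoder embedding; `joiner` combines one encoder frame
    with it and returns logits over the vocabulary."""
    hyp: list = []
    for t in range(encoder_out.size(0)):
        emitted = 0
        while emitted < max_sym_per_frame:
            logits = joiner(encoder_out[t], decoder(hyp))
            token = logits.argmax().item()
            if token == blank_id:
                break  # blank: move on to the next frame
            hyp.append(token)
            emitted += 1
    return hyp
```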

The training and decoding commands are:

```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"

./transducer_stateless2/train.py \
  --world-size 8 \
  --num-epochs 60 \
  --start-epoch 0 \
  --exp-dir transducer_stateless2/exp-2 \
  --full-libri 1 \
  --max-duration 300 \
  --lr-factor 5

epoch=59
avg=10

# greedy search
./transducer_stateless2/decode.py \
  --epoch $epoch \
  --avg $avg \
  --exp-dir ./transducer_stateless2/exp-2 \
  --max-duration 600 \
  --decoding-method greedy_search \
  --max-sym-per-frame 1

# modified beam search
./transducer_stateless2/decode.py \
  --epoch $epoch \
  --avg $avg \
  --exp-dir ./transducer_stateless2/exp-2 \
  --max-duration 100 \
  --decoding-method modified_beam_search

# beam search
./transducer_stateless2/decode.py \
  --epoch $epoch \
  --avg $avg \
  --exp-dir ./transducer_stateless2/exp-2 \
  --max-duration 100 \
  --decoding-method beam_search
```

The tensorboard log is at https://tensorboard.dev/experiment/oAlle3dxQD2EY8ePwjIGuw/.

You can find a pre-trained model, decoding logs, and decoding results at https://huggingface.co/csukuangfj/icefall-asr-librispeech-transducer-stateless2-torchaudio-2022-04-19

##### 2022-02-07

Using commit a8150021e01d34ecbd6198fe03a57eacf47a16f2.

The WERs are:

| decoding method                     | test-clean | test-other | comment                                |
|-------------------------------------|------------|------------|----------------------------------------|
| greedy search (max sym per frame 1) | 2.67       | 6.67       | --epoch 63 --avg 19 --max-duration 100 |
| greedy search (max sym per frame 2) | 2.67       | 6.67       | --epoch 63 --avg 19 --max-duration 100 |
| greedy search (max sym per frame 3) | 2.67       | 6.67       | --epoch 63 --avg 19 --max-duration 100 |
| modified beam search (beam size 4)  | 2.67       | 6.57       | --epoch 63 --avg 19 --max-duration 100 |

The training command for reproducing is given below:

```bash
cd egs/librispeech/ASR/
./prepare.sh
export CUDA_VISIBLE_DEVICES="0,1,2,3"
./transducer_stateless/train.py \
  --world-size 4 \
  --num-epochs 76 \
  --start-epoch 0 \
  --exp-dir transducer_stateless/exp-full \
  --full-libri 1 \
  --max-duration 300 \
  --lr-factor 5 \
  --bpe-model data/lang_bpe_500/bpe.model \
  --modified-transducer-prob 0.25
```

The tensorboard training log can be found at https://tensorboard.dev/experiment/qgvWkbF2R46FYA6ZMNmOjA/#scalars

The decoding command is:

```bash
epoch=63
avg=19

# greedy search
for sym in 1 2 3; do
  ./transducer_stateless/decode.py \
    --epoch $epoch \
    --avg $avg \
    --exp-dir transducer_stateless/exp-full \
    --bpe-model ./data/lang_bpe_500/bpe.model \
    --max-duration 100 \
    --max-sym-per-frame $sym
done

# modified beam search
./transducer_stateless/decode.py \
  --epoch $epoch \
  --avg $avg \
  --exp-dir transducer_stateless/exp-full \
  --bpe-model ./data/lang_bpe_500/bpe.model \
  --max-duration 100 \
  --context-size 2 \
  --decoding-method modified_beam_search \
  --beam-size 4
```

You can find a pretrained model by visiting https://huggingface.co/csukuangfj/icefall-asr-librispeech-transducer-stateless-bpe-500-2022-02-07

#### Conformer encoder + LSTM decoder

Using commit 8187d6236c2926500da5ee854f758e621df803cc.

The best WER is:

|     | test-clean | test-other |
|-----|------------|------------|
| WER | 3.07       | 7.51       |

obtained with `--epoch 34 --avg 11` using greedy search.
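Here `--epoch 34 --avg 11` averages the model parameters of epoch-24.pt through epoch-34.pt element-wise before decoding. A minimal sketch of that averaging (the checkpoint layout is an assumption, not the exact icefall implementation):

```python
import torch


def average_checkpoints(exp_dir: str, epoch: int, avg: int) -> dict:
    """Element-wise average of the model weights saved for the last `avg`
    epochs ending at `epoch` (e.g. epochs 24..34 for --epoch 34 --avg 11)."""
    epochs = range(epoch - avg + 1, epoch + 1)
    avg_state = None
    for e in epochs:
        # Assumes each checkpoint stores the weights under a "model" key.
        state = torch.load(f"{exp_dir}/epoch-{e}.pt", map_location="cpu")["model"]
        if avg_state is None:
            avg_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg_state[k] += v.float()
    return {k: v / len(epochs) for k, v in avg_state.items()}
```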

The training command to reproduce the above WER is:

```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"

./transducer/train.py \
  --world-size 4 \
  --num-epochs 35 \
  --start-epoch 0 \
  --exp-dir transducer/exp-lr-2.5-full \
  --full-libri 1 \
  --max-duration 180 \
  --lr-factor 2.5
```

The decoding command is:

```bash
epoch=34
avg=11

./transducer/decode.py \
  --epoch $epoch \
  --avg $avg \
  --exp-dir transducer/exp-lr-2.5-full \
  --bpe-model ./data/lang_bpe_500/bpe.model \
  --max-duration 100
```

You can find the tensorboard log at: https://tensorboard.dev/experiment/D7NQc3xqTpyVmWi5FnWjrA

### LibriSpeech BPE training results (Conformer-CTC)

#### 2021-11-09

The best WER, as of 2021-11-09, on the LibriSpeech test datasets is below (using HLG decoding + n-gram LM rescoring + attention decoder rescoring):

|     | test-clean | test-other |
|-----|------------|------------|
| WER | 2.42       | 5.73       |

Scale values used in n-gram LM rescoring and attention rescoring for the best WERs are:

| ngram_lm_scale | attention_scale |
|----------------|-----------------|
| 2.0            | 2.0             |
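These two scales weight the n-gram LM and attention-decoder scores against the acoustic score when ranking n-best paths. A hedged sketch of the combination (variable names are illustrative, not the exact ones in conformer_ctc/decode.py):

```python
def rescore(am_score: float, ngram_lm_score: float, attention_score: float,
            ngram_lm_scale: float = 2.0, attention_scale: float = 2.0) -> float:
    """Combined score of one n-best path; the path with the highest
    combined score is selected as the decoding output."""
    return am_score + ngram_lm_scale * ngram_lm_score + attention_scale * attention_score
```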

To reproduce the above result, use the following commands for training:

```bash
cd egs/librispeech/ASR/conformer_ctc
./prepare.sh
export CUDA_VISIBLE_DEVICES="0,1,2,3"
./conformer_ctc/train.py \
  --exp-dir conformer_ctc/exp_500_att0.8 \
  --lang-dir data/lang_bpe_500 \
  --att-rate 0.8 \
  --full-libri 1 \
  --max-duration 200 \
  --concatenate-cuts 0 \
  --world-size 4 \
  --bucketing-sampler 1 \
  --start-epoch 0 \
  --num-epochs 90
# Note: It trains for 90 epochs, but the best WER is at epoch-77.pt
```

and the following command for decoding:

```bash
./conformer_ctc/decode.py \
  --exp-dir conformer_ctc/exp_500_att0.8 \
  --lang-dir data/lang_bpe_500 \
  --max-duration 30 \
  --concatenate-cuts 0 \
  --bucketing-sampler 1 \
  --num-paths 1000 \
  --epoch 77 \
  --avg 55 \
  --method attention-decoder \
  --nbest-scale 0.5
```

You can find the pre-trained model by visiting https://huggingface.co/csukuangfj/icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09

The tensorboard log for training is available at https://tensorboard.dev/experiment/hZDWrZfaSqOMqtW0NEfXKg/#scalars

#### 2021-08-19

(Wei Kang): Result of https://github.com/k2-fsa/icefall/pull/13

TensorBoard log is available at https://tensorboard.dev/experiment/GnRzq8WWQW62dK4bklXBTg/#scalars

Pretrained model is available at https://huggingface.co/pkufool/icefall_asr_librispeech_conformer_ctc

The best decoding results (WER) are listed below. We obtained these results by averaging the models from epoch 15 to 34 and using the attention-decoder method with num_paths equal to 100.

|     | test-clean | test-other |
|-----|------------|------------|
| WER | 2.57%      | 5.94%      |

To get more unique paths, we scaled lattice.scores by 0.5 (see https://github.com/k2-fsa/icefall/pull/10#discussion_r690951662 for more details). We searched over lm_score_scale and attention_score_scale for the best results; the scales that produced the WERs above are listed below.

|            | lm_scale | attention_scale |
|------------|----------|-----------------|
| test-clean | 1.3      | 1.2             |
| test-other | 1.2      | 1.1             |
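A sketch of that score scaling, assuming `lattice` is a `k2.Fsa` produced by HLG decoding (the exact bookkeeping in icefall differs):

```python
import k2  # assumes `lattice` is a k2.Fsa from HLG decoding

nbest_scale = 0.5
saved_scores = lattice.scores.clone()
lattice.scores = saved_scores * nbest_scale  # flatten the distribution
# A flatter score distribution makes random_paths sample more diverse paths.
paths = k2.random_paths(lattice, use_double_scores=True, num_paths=100)
lattice.scores = saved_scores  # restore the true scores for rescoring
```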

You can use the following commands to reproduce our results:

```bash
git clone https://github.com/k2-fsa/icefall
cd icefall

# It was using ef233486, you may not need to switch to it
# git checkout ef233486

cd egs/librispeech/ASR
./prepare.sh

export CUDA_VISIBLE_DEVICES="0,1,2,3"
python conformer_ctc/train.py --bucketing-sampler True \
                              --concatenate-cuts False \
                              --max-duration 200 \
                              --full-libri True \
                              --world-size 4 \
                              --lang-dir data/lang_bpe_5000

python conformer_ctc/decode.py --nbest-scale 0.5 \
                               --epoch 34 \
                               --avg 20 \
                               --method attention-decoder \
                               --max-duration 20 \
                               --num-paths 100 \
                               --lang-dir data/lang_bpe_5000
```

### LibriSpeech training results (Tdnn-Lstm)

#### 2021-08-24

(Wei Kang): Result of a phone-based TDNN-LSTM model.

Icefall version: caa0b9e942

Pretrained model is available at https://huggingface.co/pkufool/icefall_asr_librispeech_tdnn-lstm_ctc

The best decoding results (WER) are listed below. We obtained these results by averaging the models from epoch 14 to 19 and using the whole-lattice-rescoring decoding method.

|     | test-clean | test-other |
|-----|------------|------------|
| WER | 6.59%      | 17.69%     |

We searched over lm_score_scale for the best results; the scales that produced the WERs above are listed below.

|            | lm_scale |
|------------|----------|
| test-clean | 0.8      |
| test-other | 0.9      |