You can find the tensorboard log at: <https://tensorboard.dev/experiment/D7NQc3xqTpyVmWi5FnWjrA>
### LibriSpeech BPE training results (Conformer-CTC 2)
[conformer_ctc2](./conformer_ctc2)
#### 2022-07-21
It implements a 'reworked' version of the CTC attention model.
As demonstrated by pruned_transducer_stateless2, the reworked Conformer model has superior performance compared to the original Conformer.
This modified CTC attention model therefore uses the reworked Conformer as the encoder and the reworked Transformer as the decoder.
conformer_ctc2 also integrates the idea of 'model averaging' from pruned_transducer_stateless4, as sketched below.
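The 'averaged model' idea is, roughly, to average the parameters of several checkpoints saved near the end of training instead of decoding with a single one. Below is a minimal sketch of plain checkpoint averaging; it is an illustration only, not the exact averaging scheme used in pruned_transducer_stateless4, and the checkpoint layout and filenames are assumptions.

```python
import torch

def average_checkpoints(filenames):
    # Each file is assumed to hold a dict with the model weights under "model".
    avg = None
    for name in filenames:
        state = torch.load(name, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.clone() for k, v in state.items()}
        else:
            for k, v in state.items():
                if v.is_floating_point():  # only average floating-point tensors
                    avg[k] += v
    for k, v in avg.items():
        if v.is_floating_point():
            avg[k] = v / len(filenames)
    return avg

# Hypothetical usage: average the last 8 epochs and load the result.
# avg = average_checkpoints([f"conformer_ctc2/exp/epoch-{i}.pt" for i in range(23, 31)])
# model.load_state_dict(avg)
```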
The WERs on the LibriSpeech test sets, compared with a baseline model, are listed below.
The baseline is the original Conformer CTC attention model trained with icefall/egs/librispeech/ASR/conformer_ctc.
It can be downloaded from <https://huggingface.co/csukuangfj/icefall-asr-librispeech-conformer-ctc-jit-bpe-500-2021-11-09>.
The baseline has 12 Conformer encoder layers and 6 Transformer decoder layers, with 109,226,120 model parameters.
It was trained for 90 epochs on the full LibriSpeech dataset.
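As an aside, parameter counts like the ones quoted here can be checked with a short PyTorch snippet once a model object has been built; `model` below is a placeholder for whichever `nn.Module` the recipe constructs, which is an assumption on my part.

```python
import torch.nn as nn

def num_parameters(model: nn.Module) -> int:
    # Total number of scalar parameters in the model.
    return sum(p.numel() for p in model.parameters())

# print(num_parameters(model))  # e.g. 103,071,035 for the conformer_ctc2 model below
```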
The reworked CTC attention model has 12 reworked Conformer encoder layers and 6 reworked Transformer decoder layers, with 103,071,035 model parameters.
On the full LibriSpeech dataset it was trained for **only** 30 epochs, because the reworked model converges much faster.
See <https://tensorboard.dev/experiment/GR1s6VrJRTW5rtB50jakew/#scalars> for the loss convergence curves.
The trained model is available at <https://huggingface.co/WayneWiser/icefall-asr-librispeech-conformer-ctc2-jit-bpe-500-2022-07-21> on Hugging Face.
The decoding configuration for the reworked model is --epoch 30, --avg 8, --use-averaged-model True, which gave the best results after searching; the baseline is decoded with --epoch 77, --avg 55.
| decoding method         | test-clean (reworked) | test-other (reworked) | Avg (reworked) | test-clean (baseline) | test-other (baseline) | Avg (baseline) |
|-------------------------|-----------------------|-----------------------|----------------|-----------------------|-----------------------|----------------|
| ctc-greedy-search       | 2.98%                 | 7.14%                 | 5.06%          | 2.90%                 | 7.47%                 | 5.19%          |
| ctc-decoding            | 2.98%                 | 7.14%                 | 5.06%          | 2.90%                 | 7.47%                 | 5.19%          |
| 1best                   | 2.93%                 | 6.37%                 | 4.65%          | 2.70%                 | 6.49%                 | 4.60%          |
| nbest                   | 2.94%                 | 6.39%                 | 4.67%          | 2.70%                 | 6.48%                 | 4.59%          |
| nbest-rescoring         | 2.68%                 | 5.77%                 | 4.23%          | 2.55%                 | 6.07%                 | 4.31%          |
| whole-lattice-rescoring | 2.66%                 | 5.76%                 | 4.21%          | 2.56%                 | 6.04%                 | 4.30%          |
| attention-decoder       | 2.59%                 | 5.54%                 | 4.07%          | 2.41%                 | 5.77%                 | 4.09%          |
| nbest-oracle            | 1.53%                 | 3.47%                 | 2.50%          | 1.69%                 | 4.02%                 | 2.86%          |
conformer_ctc2 also implements CTC greedy search decoding (sketched below); its WERs are identical to those of the ctc-decoding method.
For the other decoding methods, the average WER over the two test sets is similar for the two models.
Except for the 1best and nbest methods, the reworked model performs better than the baseline overall.
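For reference, CTC greedy search simply takes the per-frame argmax over the output token distribution, collapses consecutive repeats, and drops blanks. The sketch below illustrates the idea; the blank id of 0 and the tensor layout are assumptions, not details taken from the recipe.

```python
import torch

def ctc_greedy_search(log_probs: torch.Tensor, blank_id: int = 0) -> list:
    """log_probs: (num_frames, vocab_size) log-probabilities for one utterance."""
    best = log_probs.argmax(dim=-1).tolist()   # frame-wise best token ids
    hyp = []
    prev = None
    for t in best:
        if t != blank_id and t != prev:        # collapse repeats, drop blanks
            hyp.append(t)
        prev = t
    return hyp
```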
To reproduce the above results, use the following training commands:
```bash
WORLD_SIZE=8
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
./conformer_ctc2/train.py \
--manifest-dir data/fbank \
--exp-dir conformer_ctc2/exp \
--full-libri 1 \
--spec-aug-time-warp-factor 80 \
--max-duration 300 \
--world-size ${WORLD_SIZE} \
--start-epoch 1 \
--num-epochs 30 \
--att-rate 0.7 \
--num-decoder-layers 6
```
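The --att-rate flag controls how the attention-decoder loss and the CTC loss are interpolated during training; conceptually it is a weighted sum along the lines of the sketch below. The variable names and the exact formula are illustrative assumptions, not the code in conformer_ctc2/train.py.

```python
# Hedged sketch of the loss interpolation controlled by --att-rate.
# ctc_loss and att_loss would be computed by the encoder and decoder heads.
def combine_losses(ctc_loss, att_loss, att_rate=0.7):
    # With --att-rate 0.7 the attention loss dominates; att_rate=0 falls back to pure CTC.
    return (1.0 - att_rate) * ctc_loss + att_rate * att_loss
```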
And the following commands are for decoding:
```bash
for method in ctc-greedy-search ctc-decoding 1best nbest-oracle; do
python3 ./conformer_ctc2/decode.py \
--exp-dir conformer_ctc2/exp \
--use-averaged-model True --epoch 30 --avg 8 --max-duration 200 --method $method
done
for method in nbest nbest-rescoring whole-lattice-rescoring attention-decoder ; do
python3 ./conformer_ctc2/decode.py \
--exp-dir conformer_ctc2/exp \
--use-averaged-model True --epoch 30 --avg 8 --max-duration 20 --method $method
done
rnn_dir=$(git rev-parse --show-toplevel)/icefall/rnn_lm
./conformer_ctc2/decode.py \
--exp-dir conformer_ctc2/exp \
--lang-dir data/lang_bpe_500 \
--lm-dir data/lm \
--max-duration 30 \
--concatenate-cuts 0 \
--bucketing-sampler 1 \
--num-paths 1000 \
--use-averaged-model True \
--epoch 30 \
--avg 8 \
--nbest-scale 0.5 \
--rnn-lm-exp-dir ${rnn_dir}/exp \
--rnn-lm-epoch 29 \
--rnn-lm-avg 3 \
--rnn-lm-embedding-dim 2048 \
--rnn-lm-hidden-dim 2048 \
--rnn-lm-num-layers 3 \
--rnn-lm-tie-weights true \
--method rnn-lm
```
You can find the RNN-LM pre-trained model at
<https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm/tree/main>
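The RNN LM used for the rnn-lm method above is, in essence, an embedding layer, a multi-layer LSTM, and an output projection, with --rnn-lm-tie-weights sharing the embedding and projection matrices. The sketch below uses the dimensions from the flags above, but everything else is an assumption and not the icefall RNN-LM implementation.

```python
import torch
import torch.nn as nn

class TiedRnnLm(nn.Module):
    def __init__(self, vocab_size=500, embedding_dim=2048, hidden_dim=2048, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)
        # --rnn-lm-tie-weights: share input embedding and output projection
        # (requires embedding_dim == hidden_dim, as in the command above).
        self.out.weight = self.embed.weight

    def forward(self, tokens):              # tokens: (batch, seq_len) of BPE ids
        x, _ = self.rnn(self.embed(tokens))
        return self.out(x)                  # (batch, seq_len, vocab_size) logits
```

During rescoring, the LM log-probability of each n-best hypothesis is, roughly speaking, scaled and added to its lattice score, and the best-scoring hypothesis is selected.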
### LibriSpeech BPE training results (Conformer-CTC)
#### 2021-11-09