mirror of https://github.com/k2-fsa/icefall.git (synced 2025-09-09 09:04:19 +00:00)

Commit 68aa924eeb (parent 86e1f9b056): mgb2
@@ -1,30 +1,35 @@
-# Introduction
-
-Please refer to <https://icefall.readthedocs.io/en/latest/recipes/librispeech/index.html> for how to run models in this recipe.
-
-[./RESULTS.md](./RESULTS.md) contains the latest results.
-
-# Transducers
-
-There are various folders containing the name `transducer` in this folder.
-The following table lists the differences among them.
-
-|                                       | Encoder                    | Decoder            | Comment                                             |
-|---------------------------------------|----------------------------|--------------------|-----------------------------------------------------|
-| `transducer`                          | Conformer                  | LSTM               |                                                     |
-| `transducer_stateless`                | Conformer                  | Embedding + Conv1d | Using optimized_transducer for computing RNN-T loss |
-| `transducer_stateless2`               | Conformer                  | Embedding + Conv1d | Using torchaudio for computing RNN-T loss           |
-| `transducer_lstm`                     | LSTM                       | LSTM               |                                                     |
-| `transducer_stateless_multi_datasets` | Conformer                  | Embedding + Conv1d | Using data from GigaSpeech as extra training data   |
-| `pruned_transducer_stateless`         | Conformer                  | Embedding + Conv1d | Using k2 pruned RNN-T loss                          |
-| `pruned_transducer_stateless2`        | Conformer(modified)        | Embedding + Conv1d | Using k2 pruned RNN-T loss                          |
-| `pruned_transducer_stateless3`        | Conformer(modified)        | Embedding + Conv1d | Using k2 pruned RNN-T loss + using GigaSpeech as extra training data |
-| `pruned_transducer_stateless4`        | Conformer(modified)        | Embedding + Conv1d | same as pruned_transducer_stateless2 + save averaged models periodically during training |
-| `pruned_transducer_stateless5`        | Conformer(modified)        | Embedding + Conv1d | same as pruned_transducer_stateless4 + more layers + random combiner |
-| `pruned_transducer_stateless6`        | Conformer(modified)        | Embedding + Conv1d | same as pruned_transducer_stateless4 + distillation with hubert |
-| `pruned_stateless_emformer_rnnt2`     | Emformer(from torchaudio)  | Embedding + Conv1d | Using Emformer from torchaudio for streaming ASR    |
-
-The decoder in `transducer_stateless` is modified from the paper
-[Rnn-Transducer with Stateless Prediction Network](https://ieeexplore.ieee.org/document/9054419/).
-We place an additional Conv1d layer right after the input embedding layer.
+# MGB2
+
+The Multi-Dialect Broadcast News Arabic Speech Recognition (MGB-2):
+
+The second edition of the Multi-Genre Broadcast (MGB-2) Challenge is
+an evaluation of speech recognition and lightly supervised alignment
+using TV recordings in Arabic. The speech data is broad and multi-genre,
+spanning the whole range of TV output, and represents a challenging task for
+speech technology. In 2016, the challenge featured two new Arabic tracks based
+on TV data from Aljazeera. It was an official challenge at the 2016 IEEE
+Workshop on Spoken Language Technology. The 1,200 hours of MGB-2 data from Aljazeera
+TV programs have been manually captioned with no timing information.
+The QCRI Arabic ASR system has been used to recognize all programs. The ASR output
+was used to align the manual captioning and produce speech segments for
+training speech recognition. More than 20 hours from 2015 programs have been
+transcribed verbatim and manually segmented. This data is split into a
+development set of 10 hours and a similar evaluation set of 10 hours.
+Both the development and evaluation data have been released in the 2016 MGB
+challenge.
+
+Official reference:
+
+Ali, Ahmed, et al. "The MGB-2 challenge: Arabic multi-dialect broadcast media recognition."
+2016 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2016.
+
+IEEE link: https://ieeexplore.ieee.org/abstract/document/7846277
+
+## Performance Record (after 3 epochs)
+
+| Decoding method         | dev WER | test WER |
+|-------------------------|---------|----------|
+| attention-decoder       | 27.87   | 26.12    |
+| whole-lattice-rescoring | 25.32   | 23.53    |
+
+See [RESULTS](/egs/mgb2/ASR/RESULTS.md) for details.
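The new README defers the actual run instructions to RESULTS.md. For orientation, a minimal sketch of how a recipe of this kind is typically driven is shown below; it assumes the standard icefall layout (a top-level prepare.sh plus conformer_ctc/train.py and conformer_ctc/decode.py), and every path and flag value is illustrative rather than taken from this commit.

```
# Hypothetical walk-through, assuming the usual icefall recipe layout.
cd egs/mgb2/ASR

# Data preparation (script name assumed; not shown in this diff).
./prepare.sh

# Training on two GPUs; flag values are illustrative only.
export CUDA_VISIBLE_DEVICES="0,1"
./conformer_ctc/train.py \
  --lang-dir data/lang_bpe_5000 \
  --world-size 2 \
  --num-epochs 3

# Decoding with the method that gave the best WER above;
# RESULTS.md lists the exact flags used for the reported numbers.
./conformer_ctc/decode.py \
  --lang-dir data/lang_bpe_5000 \
  --epoch 2 \
  --avg 2 \
  --method whole-lattice-rescoring
```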
@@ -1,20 +1,33 @@
 # Results

-### MGB2 BPE training results (Conformer-CTC)
+### MGB2 BPE training results (Conformer-CTC) (after 3 epochs)

 #### 2022-06-04

 The best WER, as of 2022-06-04, for the MGB2 test dataset is below
-(using HLG decoding + n-gram LM rescoring + attention decoder rescoring):
+
+Using whole lattice HLG decoding + n-gram LM rescoring + attention decoder rescoring:

 |     | dev   | test  |
 |-----|-------|-------|
-| WER | -     | -     |
+| WER | 25.32 | 23.53 |

 Scale values used in n-gram LM rescoring and attention rescoring for the best WERs are:
 | ngram_lm_scale | attention_scale |
 |----------------|-----------------|
-| -              | -               |
+| 0.1            | -               |
+
+Using n-best (nbest-scale = 0.5) HLG decoding + n-gram LM rescoring + attention decoder rescoring:
+
+|     | dev   | test  |
+|-----|-------|-------|
+| WER | 27.87 | 26.12 |
+
+Scale values used in n-gram LM rescoring and attention rescoring for the best WERs are:
+| ngram_lm_scale | attention_scale |
+|----------------|-----------------|
+| 0.01           | 0.3             |

 To reproduce the above result, use the following commands for training:
@@ -40,7 +53,7 @@ export CUDA_VISIBLE_DEVICES="0,1"
 ```

-and the following command for decoding
+and the following command for nbest decoding

 ```
 ./conformer_ctc/decode.py \
@@ -55,6 +68,20 @@ and the following command for decoding
 --nbest-scale 0.5
 ```

+and the following command for whole-lattice decoding
+
+```
+./conformer_ctc/decode.py \
+--lang-dir data/lang_bpe_5000 \
+--max-duration 30 \
+--concatenate-cuts 0 \
+--bucketing-sampler 1 \
+--num-paths 1000 \
+--epoch 2 \
+--avg 2 \
+--method whole-lattice-rescoring
+```
+
 You can find the pre-trained model by visiting
 <coming soon>
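The pre-trained model link is still a placeholder. Once a checkpoint is published, usage will presumably mirror other icefall conformer_ctc recipes; the sketch below is a hypothetical invocation of conformer_ctc/pretrained.py, and every path in it is an assumption rather than a file shipped by this commit.

```
# Hypothetical single-file recognition with a released checkpoint,
# modelled on other icefall conformer_ctc recipes; all paths are placeholders.
./conformer_ctc/pretrained.py \
  --checkpoint ./conformer_ctc/exp/pretrained.pt \
  --words-file ./data/lang_bpe_5000/words.txt \
  --HLG ./data/lang_bpe_5000/HLG.pt \
  --method 1best \
  ./test_wavs/example.wav
```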