AmirHussein96 2022-06-05 01:00:32 +03:00
parent 86e1f9b056
commit 68aa924eeb
2 changed files with 61 additions and 29 deletions


@@ -1,30 +1,35 @@
# Introduction
# MGB2
Please refer to <https://icefall.readthedocs.io/en/latest/recipes/librispeech/index.html> for how to run models in this recipe.
The Multi-Dialect Broadcast News Arabic Speech Recognition (MGB-2):
The second edition of the Multi-Genre Broadcast (MGB-2) Challenge is
an evaluation of speech recognition and lightly supervised alignment
using TV recordings in Arabic. The speech data is broad and multi-genre,
spanning the whole range of TV output, and represents a challenging task for
speech technology. In 2016, the challenge featured two new Arabic tracks based
on TV data from Aljazeera. It was an official challenge at the 2016 IEEE
Workshop on Spoken Language Technology. The 1,200 hours of MGB-2 data from
Aljazeera TV programs have been manually captioned with no timing information.
The QCRI Arabic ASR system was used to recognize all programs. The ASR output
was used to align the manual captioning and produce speech segments for
training speech recognition. More than 20 hours from 2015 programs have been
transcribed verbatim and manually segmented. This data is split into a
development set of 10 hours, and a similar evaluation set of 10 hours.
Both the development and evaluation data have been released in the 2016 MGB
challenge.
[./RESULTS.md](./RESULTS.md) contains the latest results.
Official reference:
# Transducers
Ali, Ahmed, et al. "The MGB-2 challenge: Arabic multi-dialect broadcast media recognition."
2016 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2016.
There are various folders containing the name `transducer` in this folder.
The following table lists the differences among them.
IEEE link: https://ieeexplore.ieee.org/abstract/document/7846277
| | Encoder | Decoder | Comment |
|---------------------------------------|---------------------|--------------------|---------------------------------------------------|
| `transducer` | Conformer | LSTM | |
| `transducer_stateless` | Conformer | Embedding + Conv1d | Using optimized_transducer for computing RNN-T loss |
| `transducer_stateless2` | Conformer | Embedding + Conv1d | Using torchaudio for computing RNN-T loss |
| `transducer_lstm` | LSTM | LSTM | |
| `transducer_stateless_multi_datasets` | Conformer | Embedding + Conv1d | Using data from GigaSpeech as extra training data |
| `pruned_transducer_stateless` | Conformer | Embedding + Conv1d | Using k2 pruned RNN-T loss |
| `pruned_transducer_stateless2` | Conformer(modified) | Embedding + Conv1d | Using k2 pruned RNN-T loss |
| `pruned_transducer_stateless3` | Conformer(modified) | Embedding + Conv1d | Using k2 pruned RNN-T loss + using GigaSpeech as extra training data |
| `pruned_transducer_stateless4` | Conformer(modified) | Embedding + Conv1d | same as pruned_transducer_stateless2 + save averaged models periodically during training |
| `pruned_transducer_stateless5` | Conformer(modified) | Embedding + Conv1d | same as pruned_transducer_stateless4 + more layers + random combiner|
| `pruned_transducer_stateless6` | Conformer(modified) | Embedding + Conv1d | same as pruned_transducer_stateless4 + distillation with hubert|
| `pruned_stateless_emformer_rnnt2` | Emformer(from torchaudio) | Embedding + Conv1d | Using Emformer from torchaudio for streaming ASR|
## Performance Record (after 3 epochs)
| Decoding method | dev WER | test WER |
|---------------------------|------------|---------|
| attention-decoder | 27.87 | 26.12 |
| whole-lattice-rescoring | 25.32 | 23.53 |
The decoder in `transducer_stateless` is modified from the paper
[Rnn-Transducer with Stateless Prediction Network](https://ieeexplore.ieee.org/document/9054419/).
We place an additional Conv1d layer right after the input embedding layer.
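The idea can be sketched as follows. This is a minimal, illustrative PyTorch version of a stateless prediction network (an embedding followed by a small causal Conv1d); the dimensions, context size, and padding here are assumptions for the example, not the exact configuration used in `transducer_stateless`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StatelessDecoder(nn.Module):
    """Embedding + Conv1d prediction network: no recurrent state is kept,
    only the last `context_size` emitted tokens influence the output."""

    def __init__(self, vocab_size: int, embed_dim: int, context_size: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Depthwise 1-D convolution over the previous `context_size` tokens.
        self.conv = nn.Conv1d(
            embed_dim, embed_dim, kernel_size=context_size, groups=embed_dim
        )
        self.context_size = context_size

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, num_tokens) of previously emitted token ids
        emb = self.embedding(y).permute(0, 2, 1)      # (B, D, U)
        emb = F.pad(emb, (self.context_size - 1, 0))  # left-pad so it stays causal
        out = self.conv(emb).permute(0, 2, 1)         # (B, U, D)
        return torch.relu(out)


# Example: a batch of 4 partial hypotheses, 10 tokens each.
decoder = StatelessDecoder(vocab_size=500, embed_dim=512, context_size=2)
print(decoder(torch.randint(0, 500, (4, 10))).shape)  # torch.Size([4, 10, 512])
```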
See [RESULTS](/egs/mgb2/ASR/RESULTS.md) for details.


@@ -1,20 +1,33 @@
# Results
### MGB2 BPE training results (Conformer-CTC) (after 3 epochs)
#### 2022-06-04
The best WER, as of 2022-06-04, for the MGB2 test dataset is below.

Using whole lattice HLG decoding + n-gram LM rescoring + attention decoder rescoring:

|     | dev   | test  |
|-----|-------|-------|
| WER | 25.32 | 23.53 |
Scale values used in n-gram LM rescoring and attention rescoring for the best WERs are:
| ngram_lm_scale | attention_scale |
|----------------|-----------------|
| 0.1            | -               |
Using n-best (nbest-scale = 0.5) HLG decoding + n-gram LM rescoring + attention decoder rescoring:
| | dev | test |
|-----|------------|------------|
| WER | 27.87 | 26.12 |
Scale values used in n-gram LM rescoring and attention rescoring for the best WERs are:
| ngram_lm_scale | attention_scale |
|----------------|-----------------|
| 0.01 | 0.3 |
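Both rescoring methods interpolate three per-path scores: the acoustic (lattice) score, the n-gram LM score, and the attention-decoder score. A rough sketch of how the two scales above enter the combined score (an illustration of the idea only, not the actual `conformer_ctc/decode.py` internals; the function name and signature are assumptions):

```python
def combined_score(
    am_score: float,
    ngram_lm_score: float,
    attention_score: float,
    ngram_lm_scale: float = 0.01,
    attention_scale: float = 0.3,
) -> float:
    # Larger scales give the n-gram LM / attention decoder more weight
    # relative to the acoustic score when re-ranking candidate paths.
    return am_score + ngram_lm_scale * ngram_lm_score + attention_scale * attention_score
```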
To reproduce the above result, use the following commands for training:
@@ -40,7 +53,7 @@ export CUDA_VISIBLE_DEVICES="0,1"
```
and the following command for nbest decoding
```
./conformer_ctc/decode.py \
@@ -55,6 +68,20 @@ and the following command for decoding
--nbest-scale 0.5
```
and the following command for whole-lattice decoding
```
./conformer_ctc/decode.py \
--lang-dir data/lang_bpe_5000 \
--max-duration 30 \
--concatenate-cuts 0 \
--bucketing-sampler 1 \
--num-paths 1000 \
--epoch 2 \
--avg 2 \
--method whole-lattice-rescoring
```
You can find the pre-trained model by visiting
<coming soon>