mirror of https://github.com/k2-fsa/icefall.git (synced 2025-09-09 09:04:19 +00:00)

Commit 68aa924eeb (parent 86e1f9b056): mgb2
@@ -1,30 +1,35 @@
-# Introduction
-
-Please refer to <https://icefall.readthedocs.io/en/latest/recipes/librispeech/index.html> for how to run models in this recipe.
-
-[./RESULTS.md](./RESULTS.md) contains the latest results.
-
-# Transducers
-
-There are various folders containing the name `transducer` in this folder.
-The following table lists the differences among them.
-
-|                                       | Encoder                    | Decoder            | Comment                                             |
-|---------------------------------------|----------------------------|--------------------|-----------------------------------------------------|
-| `transducer`                          | Conformer                  | LSTM               |                                                     |
-| `transducer_stateless`                | Conformer                  | Embedding + Conv1d | Using optimized_transducer for computing RNN-T loss |
-| `transducer_stateless2`               | Conformer                  | Embedding + Conv1d | Using torchaudio for computing RNN-T loss           |
-| `transducer_lstm`                     | LSTM                       | LSTM               |                                                     |
-| `transducer_stateless_multi_datasets` | Conformer                  | Embedding + Conv1d | Using data from GigaSpeech as extra training data   |
-| `pruned_transducer_stateless`         | Conformer                  | Embedding + Conv1d | Using k2 pruned RNN-T loss                          |
-| `pruned_transducer_stateless2`        | Conformer(modified)        | Embedding + Conv1d | Using k2 pruned RNN-T loss                          |
-| `pruned_transducer_stateless3`        | Conformer(modified)        | Embedding + Conv1d | Using k2 pruned RNN-T loss + using GigaSpeech as extra training data |
-| `pruned_transducer_stateless4`        | Conformer(modified)        | Embedding + Conv1d | same as pruned_transducer_stateless2 + save averaged models periodically during training |
-| `pruned_transducer_stateless5`        | Conformer(modified)        | Embedding + Conv1d | same as pruned_transducer_stateless4 + more layers + random combiner |
-| `pruned_transducer_stateless6`        | Conformer(modified)        | Embedding + Conv1d | same as pruned_transducer_stateless4 + distillation with hubert |
-| `pruned_stateless_emformer_rnnt2`     | Emformer(from torchaudio)  | Embedding + Conv1d | Using Emformer from torchaudio for streaming ASR    |
-
-The decoder in `transducer_stateless` is modified from the paper
-[Rnn-Transducer with Stateless Prediction Network](https://ieeexplore.ieee.org/document/9054419/).
-We place an additional Conv1d layer right after the input embedding layer.
+# MGB2
+
+The Multi-Dialect Broadcast News Arabic Speech Recognition (MGB-2):
+
+The second edition of the Multi-Genre Broadcast (MGB-2) Challenge is
+an evaluation of speech recognition and lightly supervised alignment
+using TV recordings in Arabic. The speech data is broad and multi-genre,
+spanning the whole range of TV output, and represents a challenging task for
+speech technology. In 2016, the challenge featured two new Arabic tracks based
+on TV data from Aljazeera. It was an official challenge at the 2016 IEEE
+Workshop on Spoken Language Technology. The 1,200 hours of MGB-2 data from Aljazeera
+TV programs have been manually captioned with no timing information.
+The QCRI Arabic ASR system has been used to recognize all programs. The ASR output
+was used to align the manual captioning and produce speech segments for
+training speech recognition. More than 20 hours from 2015 programs have been
+transcribed verbatim and manually segmented. This data is split into a
+development set of 10 hours and a similar evaluation set of 10 hours.
+Both the development and evaluation data have been released in the 2016 MGB
+challenge.
+
+Official reference:
+
+Ali, Ahmed, et al. "The MGB-2 challenge: Arabic multi-dialect broadcast media recognition."
+2016 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2016.
+
+IEEE link: https://ieeexplore.ieee.org/abstract/document/7846277
+
+## Performance Record (after 3 epochs)
+
+| Decoding method         | dev WER | test WER |
+|-------------------------|---------|----------|
+| attention-decoder       | 27.87   | 26.12    |
+| whole-lattice-rescoring | 25.32   | 23.53    |
+
+See [RESULTS](/egs/mgb2/ASR/RESULTS.md) for details.
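The new README defers the actual run instructions to RESULTS.md. For orientation, a minimal sketch of how a recipe of this kind is typically driven is shown below; it assumes the standard icefall layout (a top-level prepare.sh plus conformer_ctc/train.py and conformer_ctc/decode.py), and every path and flag value is illustrative rather than taken from this commit.

```
# Hypothetical walk-through, assuming the usual icefall recipe layout.
cd egs/mgb2/ASR

# Data preparation (script name assumed; not shown in this diff).
./prepare.sh

# Training on two GPUs; flag values are illustrative only.
export CUDA_VISIBLE_DEVICES="0,1"
./conformer_ctc/train.py \
  --lang-dir data/lang_bpe_5000 \
  --world-size 2 \
  --num-epochs 3

# Decoding with the method that gave the best WER above;
# RESULTS.md lists the exact flags used for the reported numbers.
./conformer_ctc/decode.py \
  --lang-dir data/lang_bpe_5000 \
  --epoch 2 \
  --avg 2 \
  --method whole-lattice-rescoring
```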
@@ -1,20 +1,33 @@
 # Results

-### MGB2 BPE training results (Conformer-CTC)
+### MGB2 BPE training results (Conformer-CTC) (after 3 epochs)

 #### 2022-06-04

 The best WER, as of 2022-06-04, for the MGB2 test dataset is below
-(using HLG decoding + n-gram LM rescoring + attention decoder rescoring):
+
+Using whole lattice HLG decoding + n-gram LM rescoring + attention decoder rescoring:

 |     | dev   | test  |
 |-----|-------|-------|
-| WER | -     | -     |
+| WER | 25.32 | 23.53 |

 Scale values used in n-gram LM rescoring and attention rescoring for the best WERs are:
 | ngram_lm_scale | attention_scale |
 |----------------|-----------------|
-| -              | -               |
+| 0.1            | -               |
+
+Using n-best (nbest-scale = 0.5) HLG decoding + n-gram LM rescoring + attention decoder rescoring:
+
+|     | dev   | test  |
+|-----|-------|-------|
+| WER | 27.87 | 26.12 |
+
+Scale values used in n-gram LM rescoring and attention rescoring for the best WERs are:
+| ngram_lm_scale | attention_scale |
+|----------------|-----------------|
+| 0.01           | 0.3             |

 To reproduce the above result, use the following commands for training:
@@ -40,7 +53,7 @@ export CUDA_VISIBLE_DEVICES="0,1"
 ```

-and the following command for decoding
+and the following command for nbest decoding

 ```
 ./conformer_ctc/decode.py \
@@ -55,6 +68,20 @@ and the following command for decoding
 --nbest-scale 0.5
 ```

+and the following command for whole-lattice decoding
+
+```
+./conformer_ctc/decode.py \
+--lang-dir data/lang_bpe_5000 \
+--max-duration 30 \
+--concatenate-cuts 0 \
+--bucketing-sampler 1 \
+--num-paths 1000 \
+--epoch 2 \
+--avg 2 \
+--method whole-lattice-rescoring
+```
+
 You can find the pre-trained model by visiting
 <coming soon>
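The pre-trained model link is still a placeholder. Once a checkpoint is published, usage will presumably mirror other icefall conformer_ctc recipes; the sketch below is a hypothetical invocation of conformer_ctc/pretrained.py, and every path in it is an assumption rather than a file shipped by this commit.

```
# Hypothetical single-file recognition with a released checkpoint,
# modelled on other icefall conformer_ctc recipes; all paths are placeholders.
./conformer_ctc/pretrained.py \
  --checkpoint ./conformer_ctc/exp/pretrained.pt \
  --words-file ./data/lang_bpe_5000/words.txt \
  --HLG ./data/lang_bpe_5000/HLG.pt \
  --method 1best \
  ./test_wavs/example.wav
```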