Mirror of https://github.com/k2-fsa/icefall.git
Synced 2025-08-27 10:44:19 +00:00

Update README and RESULTS.

parent: 8a3c2a00db
commit: 0a15bee545

Changed files: README.md (23)
```diff
@@ -34,11 +34,12 @@ We do provide a Colab notebook for this recipe.
 
 ### LibriSpeech
 
-We provide 3 models for this recipe:
+We provide 4 models for this recipe:
 
 - [conformer CTC model][LibriSpeech_conformer_ctc]
 - [TDNN LSTM CTC model][LibriSpeech_tdnn_lstm_ctc]
-- [RNN-T Conformer model][LibriSpeech_transducer]
+- [Transducer: Conformer encoder + LSTM decoder][LibriSpeech_transducer]
+- [Transducer: Conformer encoder + Embedding decoder][LibriSpeech_transducer_stateless]
 
 #### Conformer CTC Model
 
@@ -62,9 +63,9 @@ The WER for this model is:
 
 We provide a Colab notebook to run a pre-trained TDNN LSTM CTC model: [](https://colab.research.google.com/drive/1kNmDXNMwREi0rZGAOIAOJo93REBuOTcd?usp=sharing)
 
-#### RNN-T Conformer model
+#### Transducer: Conformer encoder + LSTM decoder
 
-Using Conformer as encoder.
+Using Conformer as encoder and LSTM as decoder.
 
 The best WER with greedy search is:
 
@@ -74,6 +75,19 @@ The best WER with greedy search is:
 
 We provide a Colab notebook to run a pre-trained RNN-T conformer model: [](https://colab.research.google.com/drive/1_u6yK9jDkPwG_NLrZMN2XK7Aeq4suMO2?usp=sharing)
 
+#### Transducer: Conformer encoder + Embedding decoder
+
+Using Conformer as encoder. The decoder consists of 1 embedding layer
+and 1 convolutional layer.
+
+The best WER with beam search with beam size 4 is:
+
+|     | test-clean | test-other |
+|-----|------------|------------|
+| WER | 2.92       | 7.37       |
+
+Note: No auxiliary losses are used in the training and no LMs are used
+in the decoding.
 
 ### Aishell
 
@@ -143,6 +157,7 @@ Please see: [.
+We place an additional Conv1d layer right after the input embedding layer.
```
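The "Embedding decoder" added to the model list above has no recurrent state: its output depends only on a fixed left context of previous tokens. A minimal PyTorch sketch of such a decoder is shown below; the class name, dimensions, and padding scheme are illustrative assumptions, not the recipe's actual code.

```python
import torch
import torch.nn as nn

class StatelessDecoder(nn.Module):
    """Embedding + causal Conv1d with kernel size 2: the output at position i
    depends only on tokens i-1 and i, replacing a recurrent hidden state with
    a fixed, limited left context. (Illustrative sketch, not icefall's code.)"""

    def __init__(self, vocab_size: int, embedding_dim: int, context_size: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.conv = nn.Conv1d(embedding_dim, embedding_dim, kernel_size=context_size)
        self.context_size = context_size

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (N, U) token ids -> (N, U, embedding_dim)
        emb = self.embedding(y).permute(0, 2, 1)          # (N, C, U)
        # Left-pad so the convolution is causal and the length is preserved.
        emb = nn.functional.pad(emb, (self.context_size - 1, 0))
        return self.conv(emb).permute(0, 2, 1)            # (N, U, C)

decoder = StatelessDecoder(vocab_size=500, embedding_dim=256)
out = decoder(torch.randint(0, 500, (4, 7)))
print(out.shape)  # torch.Size([4, 7, 256])
```

Because the context is only two tokens, decoding needs no hidden-state bookkeeping, which simplifies beam search.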
````diff
@@ -1,11 +1,69 @@
 ## Results
 
-### LibriSpeech BPE training results (RNN-T)
+### LibriSpeech BPE training results (Transducer)
 
+#### 2021-12-22
+
+Conformer encoder + non-recurrent decoder. The decoder
+contains only an embedding layer and a Conv1d (with kernel size 2).
+
+The WERs are
+
+|                           | test-clean | test-other | comment                                  |
+|---------------------------|------------|------------|------------------------------------------|
+| greedy search             | 2.99       | 7.52       | --epoch 20, --avg 10, --max-duration 100 |
+| beam search (beam size 2) | 2.95       | 7.43       |                                          |
+| beam search (beam size 3) | 2.94       | 7.37       |                                          |
+| beam search (beam size 4) | 2.92       | 7.37       |                                          |
+| beam search (beam size 5) | 2.93       | 7.38       |                                          |
+| beam search (beam size 8) | 2.92       | 7.38       |                                          |
+
+The training command for reproducing is given below:
+
+```
+export CUDA_VISIBLE_DEVICES="0,1,2,3"
+
+./transducer_stateless/train.py \
+  --world-size 4 \
+  --num-epochs 30 \
+  --start-epoch 0 \
+  --exp-dir transducer_stateless/exp-full \
+  --full-libri 1 \
+  --max-duration 250 \
+  --lr-factor 3
+```
+
+The tensorboard training log can be found at
+<https://tensorboard.dev/experiment/PsJ3LgkEQfOmzedAlYfVeg/#scalars&_smoothingWeight=0>
+
+The decoding command is:
+
+```
+epoch=20
+avg=10
+
+## greedy search
+./transducer_stateless/decode.py \
+  --epoch $epoch \
+  --avg $avg \
+  --exp-dir transducer_stateless/exp-full \
+  --bpe-model ./data/lang_bpe_500/bpe.model \
+  --max-duration 100
+
+## beam search
+./transducer_stateless/decode.py \
+  --epoch $epoch \
+  --avg $avg \
+  --exp-dir transducer_stateless/exp-full \
+  --bpe-model ./data/lang_bpe_500/bpe.model \
+  --max-duration 100 \
+  --decoding-method beam_search \
+  --beam-size 4
+```
+
 #### 2021-12-17
 
 Using commit `cb04c8a7509425ab45fae888b0ca71bbbd23f0de`.
 
-RNN-T + Conformer encoder.
+Conformer encoder + LSTM decoder.
 
 The best WER is
 
````
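The greedy search used for the first row of the WER table above can be sketched as a simple per-frame loop: feed each encoder frame and the current decoder output to the joiner, emit the argmax token if it is not blank, and update the decoder only on non-blank emissions. This is a simplified illustration with toy stand-in modules, not the `decode.py` implementation; the `ToyDecoder`/`ToyJoiner` names and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    """Stand-in prediction network: just an embedding lookup."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        return self.embedding(y)                   # (N, U, dim)

class ToyJoiner(nn.Module):
    """Stand-in joiner: combine encoder and decoder outputs, project to vocab."""
    def __init__(self, dim: int, vocab_size: int):
        super().__init__()
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, enc: torch.Tensor, dec: torch.Tensor) -> torch.Tensor:
        return self.out(torch.tanh(enc + dec))     # (N, vocab)

def greedy_search(encoder_out, decoder, joiner, blank_id=0):
    """Decode one utterance, emitting at most one symbol per frame
    for simplicity. encoder_out: (T, dim)."""
    hyp = []
    dec_out = decoder(torch.tensor([[blank_id]]))[:, -1, :]   # (1, dim)
    for t in range(encoder_out.size(0)):
        logits = joiner(encoder_out[t : t + 1], dec_out)      # (1, vocab)
        sym = int(logits.argmax(dim=-1))
        if sym != blank_id:
            hyp.append(sym)
            # The decoder state advances only on non-blank emissions.
            dec_out = decoder(torch.tensor([[sym]]))[:, -1, :]
    return hyp

torch.manual_seed(0)
decoder, joiner = ToyDecoder(10, 8), ToyJoiner(8, 10)
hyp = greedy_search(torch.randn(5, 8), decoder, joiner)
print(hyp)
```

Beam search keeps several such hypotheses alive per frame and rescores them, which is where the small WER gains in the table come from.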
```diff
@@ -27,11 +27,6 @@ from encoder_interface import EncoderInterface
 
 from icefall.utils import add_sos
 
-assert hasattr(torchaudio.functional, "rnnt_loss"), (
-    f"Current torchaudio version: {torchaudio.__version__}\n"
-    "Please install a version >= 0.10.0"
-)
-
 
 class Transducer(nn.Module):
     """It implements https://arxiv.org/pdf/1211.3711.pdf
@@ -115,6 +110,11 @@ class Transducer(nn.Module):
         # Note: y does not start with SOS
         y_padded = y.pad(mode="constant", padding_value=0)
 
+        assert hasattr(torchaudio.functional, "rnnt_loss"), (
+            f"Current torchaudio version: {torchaudio.__version__}\n"
+            "Please install a version >= 0.10.0"
+        )
+
         loss = torchaudio.functional.rnnt_loss(
             logits=logits,
             targets=y_padded,
```

```diff
@@ -27,11 +27,6 @@ from encoder_interface import EncoderInterface
 
 from icefall.utils import add_sos
 
-assert hasattr(torchaudio.functional, "rnnt_loss"), (
-    f"Current torchaudio version: {torchaudio.__version__}\n"
-    "Please install a version >= 0.10.0"
-)
-
 
 class Transducer(nn.Module):
     """It implements https://arxiv.org/pdf/1211.3711.pdf
@@ -115,6 +110,11 @@ class Transducer(nn.Module):
         # Note: y does not start with SOS
         y_padded = y.pad(mode="constant", padding_value=0)
 
+        assert hasattr(torchaudio.functional, "rnnt_loss"), (
+            f"Current torchaudio version: {torchaudio.__version__}\n"
+            "Please install a version >= 0.10.0"
+        )
+
         loss = torchaudio.functional.rnnt_loss(
             logits=logits,
             targets=y_padded,
```

```diff
@@ -27,11 +27,6 @@ from encoder_interface import EncoderInterface
 
 from icefall.utils import add_sos
 
-assert hasattr(torchaudio.functional, "rnnt_loss"), (
-    f"Current torchaudio version: {torchaudio.__version__}\n"
-    "Please install a version >= 0.10.0"
-)
-
 
 class Transducer(nn.Module):
     """It implements https://arxiv.org/pdf/1211.3711.pdf
@@ -113,6 +108,11 @@ class Transducer(nn.Module):
         # Note: y does not start with SOS
         y_padded = y.pad(mode="constant", padding_value=0)
 
+        assert hasattr(torchaudio.functional, "rnnt_loss"), (
+            f"Current torchaudio version: {torchaudio.__version__}\n"
+            "Please install a version >= 0.10.0"
+        )
+
         loss = torchaudio.functional.rnnt_loss(
             logits=logits,
             targets=y_padded,
```