Mirror of https://github.com/k2-fsa/icefall.git (synced 2025-08-27 02:34:21 +00:00)
Update README and RESULTS.
This commit is contained in:
parent 8a3c2a00db
commit 0a15bee545
README.md (23 changed lines)
@@ -34,11 +34,12 @@ We do provide a Colab notebook for this recipe.

 ### LibriSpeech

-We provide 3 models for this recipe:
+We provide 4 models for this recipe:

 - [conformer CTC model][LibriSpeech_conformer_ctc]
 - [TDNN LSTM CTC model][LibriSpeech_tdnn_lstm_ctc]
-- [RNN-T Conformer model][LibriSpeech_transducer]
+- [Transducer: Conformer encoder + LSTM decoder][LibriSpeech_transducer]
+- [Transducer: Conformer encoder + Embedding decoder][LibriSpeech_transducer_stateless]

 #### Conformer CTC Model

@@ -62,9 +63,9 @@ The WER for this model is:
 We provide a Colab notebook to run a pre-trained TDNN LSTM CTC model: <https://colab.research.google.com/drive/1kNmDXNMwREi0rZGAOIAOJo93REBuOTcd?usp=sharing>


-#### RNN-T Conformer model
+#### Transducer: Conformer encoder + LSTM decoder

-Using Conformer as encoder.
+Using Conformer as encoder and LSTM as decoder.

 The best WER with greedy search is:

@@ -74,6 +75,19 @@ The best WER with greedy search is:

 We provide a Colab notebook to run a pre-trained RNN-T conformer model: <https://colab.research.google.com/drive/1_u6yK9jDkPwG_NLrZMN2XK7Aeq4suMO2?usp=sharing>

+#### Transducer: Conformer encoder + Embedding decoder
+
+Using Conformer as encoder. The decoder consists of 1 embedding layer
+and 1 convolutional layer.
+
+The best WER with beam search with beam size 4 is:
+
+| | test-clean | test-other |
+|-----|------------|------------|
+| WER | 2.92 | 7.37 |
+
+Note: No auxiliary losses are used in the training and no LMs are used
+in the decoding.

 ### Aishell

@@ -143,6 +157,7 @@ Please see: [.
+We place an additional Conv1d layer right after the input embedding layer.
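The "Embedding decoder" (stateless decoder) described in the README changes above replaces the LSTM with an embedding layer followed by a Conv1d over the previous output symbols, i.e. a fixed two-symbol label context. A minimal sketch of that idea; the class name, sizes, and padding details below are illustrative assumptions, not the recipe's actual code:

```python
import torch
import torch.nn as nn


class StatelessDecoder(nn.Module):
    """Illustrative stateless decoder: embedding + Conv1d (kernel size 2).

    The convolution gives the decoder a small, fixed label context instead
    of the unbounded history an LSTM would carry.
    """

    def __init__(self, vocab_size: int, embedding_dim: int, context_size: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Pad by context_size - 1 and trim the right side in forward(), so
        # position u only sees symbols u-1 and u (causal over the labels).
        self.conv = nn.Conv1d(
            embedding_dim,
            embedding_dim,
            kernel_size=context_size,
            padding=context_size - 1,
        )

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (N, U) previous output symbols
        emb = self.embedding(y).permute(0, 2, 1)  # (N, C, U)
        out = self.conv(emb)[:, :, : y.size(1)]   # trim right padding -> (N, C, U)
        return out.permute(0, 2, 1)               # (N, U, C)


dec = StatelessDecoder(vocab_size=500, embedding_dim=256)
print(dec(torch.randint(0, 500, (4, 10))).shape)  # torch.Size([4, 10, 256])
```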
RESULTS.md

@@ -1,11 +1,69 @@
 ## Results

-### LibriSpeech BPE training results (RNN-T)
+### LibriSpeech BPE training results (Transducer)

+#### 2021-12-22
+Conformer encoder + non-recurrent decoder. The decoder
+contains only an embedding layer and a Conv1d (with kernel size 2).
+
+The WERs are
+
+| | test-clean | test-other | comment |
+|---------------------------|------------|------------|------------------------------------------|
+| greedy search | 2.99 | 7.52 | --epoch 20, --avg 10, --max-duration 100 |
+| beam search (beam size 2) | 2.95 | 7.43 | |
+| beam search (beam size 3) | 2.94 | 7.37 | |
+| beam search (beam size 4) | 2.92 | 7.37 | |
+| beam search (beam size 5) | 2.93 | 7.38 | |
+| beam search (beam size 8) | 2.92 | 7.38 | |
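As context for the table above: greedy search repeatedly takes the argmax of the joiner output, emitting symbols until a blank is produced and then moving to the next encoder frame, while beam search keeps the `--beam-size` best partial hypotheses. A rough, self-contained sketch of transducer greedy search; the encoder output, decoder, and joiner stand-ins below are illustrative only, not the recipe's modules:

```python
import torch
import torch.nn as nn

vocab_size, dim, blank_id = 500, 256, 0   # illustrative sizes; blank id assumed to be 0
decoder = nn.Embedding(vocab_size, dim)   # stand-in for the embedding (stateless) decoder
joiner = nn.Linear(2 * dim, vocab_size)   # stand-in for the joiner network


def greedy_search(encoder_out: torch.Tensor, max_sym_per_frame: int = 3) -> list:
    """encoder_out: (T, dim) acoustic frames of one utterance."""
    hyp = [blank_id]                          # start from a blank context
    for t in range(encoder_out.size(0)):
        for _ in range(max_sym_per_frame):    # cap the symbols emitted per frame
            dec_out = decoder(torch.tensor([hyp[-1]]))                      # (1, dim)
            logits = joiner(torch.cat([encoder_out[t : t + 1], dec_out], dim=-1))
            y = int(logits.argmax(dim=-1))
            if y == blank_id:
                break                         # blank: advance to the next frame
            hyp.append(y)                     # non-blank: emit and stay on frame t
    return hyp[1:]


print(greedy_search(torch.randn(20, dim)))
```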
+
+The training command for reproducing is given below:
+
+```
+export CUDA_VISIBLE_DEVICES="0,1,2,3"
+
+./transducer_stateless/train.py \
+  --world-size 4 \
+  --num-epochs 30 \
+  --start-epoch 0 \
+  --exp-dir transducer_stateless/exp-full \
+  --full-libri 1 \
+  --max-duration 250 \
+  --lr-factor 3
+```
+
+The tensorboard training log can be found at
+<https://tensorboard.dev/experiment/PsJ3LgkEQfOmzedAlYfVeg/#scalars&_smoothingWeight=0>
+
+The decoding command is:
+```
+epoch=20
+avg=10
+
+## greedy search
+./transducer_stateless/decode.py \
+  --epoch $epoch \
+  --avg $avg \
+  --exp-dir transducer_stateless/exp-full \
+  --bpe-model ./data/lang_bpe_500/bpe.model \
+  --max-duration 100
+
+## beam search
+./transducer_stateless/decode.py \
+  --epoch $epoch \
+  --avg $avg \
+  --exp-dir transducer_stateless/exp-full \
+  --bpe-model ./data/lang_bpe_500/bpe.model \
+  --max-duration 100 \
+  --decoding-method beam_search \
+  --beam-size 4
+```
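The `epoch=20` / `avg=10` pair above means decoding does not use a single checkpoint: the parameters of the last 10 epoch checkpoints (roughly epochs 11 through 20 here) are averaged before decoding. A minimal sketch of that kind of averaging, assuming checkpoints named `epoch-N.pt` that store the model under a `"model"` key; the actual icefall helper and file layout may differ:

```python
import torch


def average_checkpoints(filenames):
    """Average model parameters across several checkpoints (illustrative)."""
    avg = torch.load(filenames[0], map_location="cpu")["model"]
    for f in filenames[1:]:
        state = torch.load(f, map_location="cpu")["model"]
        for k in avg:
            avg[k] += state[k]
    n = len(filenames)
    for k in avg:
        if avg[k].is_floating_point():
            avg[k] /= n
        else:
            avg[k] //= n  # integer buffers, e.g. batch counters
    return avg


# --epoch 20 --avg 10 -> average epoch-11.pt .. epoch-20.pt (hypothetical paths)
filenames = [
    f"transducer_stateless/exp-full/epoch-{i}.pt" for i in range(11, 21)
]
# model.load_state_dict(average_checkpoints(filenames))
```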
+
+
 #### 2021-12-17
 Using commit `cb04c8a7509425ab45fae888b0ca71bbbd23f0de`.

-RNN-T + Conformer encoder.
+Conformer encoder + LSTM decoder.

 The best WER is

@@ -27,11 +27,6 @@ from encoder_interface import EncoderInterface

 from icefall.utils import add_sos

-assert hasattr(torchaudio.functional, "rnnt_loss"), (
-    f"Current torchaudio version: {torchaudio.__version__}\n"
-    "Please install a version >= 0.10.0"
-)
-

 class Transducer(nn.Module):
     """It implements https://arxiv.org/pdf/1211.3711.pdf
@@ -115,6 +110,11 @@ class Transducer(nn.Module):
         # Note: y does not start with SOS
         y_padded = y.pad(mode="constant", padding_value=0)

+        assert hasattr(torchaudio.functional, "rnnt_loss"), (
+            f"Current torchaudio version: {torchaudio.__version__}\n"
+            "Please install a version >= 0.10.0"
+        )
+
         loss = torchaudio.functional.rnnt_loss(
             logits=logits,
             targets=y_padded,
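The hunks above (and the similar ones below) move the torchaudio version check from module level into `forward()`, right next to the only call that needs it, so merely importing the file no longer requires torchaudio >= 0.10.0. For reference, a self-contained sketch of calling `torchaudio.functional.rnnt_loss` on dummy tensors; the shapes and the blank id of 0 are illustrative assumptions, not values taken from the recipe:

```python
import torch
import torchaudio

# Same guard as in the diff: rnnt_loss only exists in torchaudio >= 0.10.0.
assert hasattr(torchaudio.functional, "rnnt_loss"), (
    f"Current torchaudio version: {torchaudio.__version__}\n"
    "Please install a version >= 0.10.0"
)

N, T, U, C = 2, 50, 10, 500               # batch, frames, label length, vocab size
logits = torch.randn(N, T, U + 1, C)      # joiner output over the (T, U+1) lattice
targets = torch.randint(1, C, (N, U), dtype=torch.int32)
logit_lengths = torch.full((N,), T, dtype=torch.int32)
target_lengths = torch.full((N,), U, dtype=torch.int32)

loss = torchaudio.functional.rnnt_loss(
    logits=logits,
    targets=targets,
    logit_lengths=logit_lengths,
    target_lengths=target_lengths,
    blank=0,                              # assumes blank id 0
    reduction="mean",
)
print(loss)
```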
@@ -27,11 +27,6 @@ from encoder_interface import EncoderInterface

 from icefall.utils import add_sos

-assert hasattr(torchaudio.functional, "rnnt_loss"), (
-    f"Current torchaudio version: {torchaudio.__version__}\n"
-    "Please install a version >= 0.10.0"
-)
-

 class Transducer(nn.Module):
     """It implements https://arxiv.org/pdf/1211.3711.pdf
@@ -115,6 +110,11 @@ class Transducer(nn.Module):
         # Note: y does not start with SOS
         y_padded = y.pad(mode="constant", padding_value=0)

+        assert hasattr(torchaudio.functional, "rnnt_loss"), (
+            f"Current torchaudio version: {torchaudio.__version__}\n"
+            "Please install a version >= 0.10.0"
+        )
+
         loss = torchaudio.functional.rnnt_loss(
             logits=logits,
             targets=y_padded,
@@ -27,11 +27,6 @@ from encoder_interface import EncoderInterface

 from icefall.utils import add_sos

-assert hasattr(torchaudio.functional, "rnnt_loss"), (
-    f"Current torchaudio version: {torchaudio.__version__}\n"
-    "Please install a version >= 0.10.0"
-)
-

 class Transducer(nn.Module):
     """It implements https://arxiv.org/pdf/1211.3711.pdf
@@ -113,6 +108,11 @@ class Transducer(nn.Module):
         # Note: y does not start with SOS
         y_padded = y.pad(mode="constant", padding_value=0)

+        assert hasattr(torchaudio.functional, "rnnt_loss"), (
+            f"Current torchaudio version: {torchaudio.__version__}\n"
+            "Please install a version >= 0.10.0"
+        )
+
         loss = torchaudio.functional.rnnt_loss(
             logits=logits,
             targets=y_padded,
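For reference, the quantity `rnnt_loss` computes in these files is the transducer objective from the paper cited in the docstring (https://arxiv.org/pdf/1211.3711.pdf): the negative log-probability of the label sequence, summed over all blank-augmented alignments, roughly:

```latex
\mathcal{L}(\mathbf{x}, \mathbf{y})
  = -\ln P(\mathbf{y} \mid \mathbf{x})
  = -\ln \sum_{\mathbf{a} \in \mathcal{B}^{-1}(\mathbf{y})} P(\mathbf{a} \mid \mathbf{x})
```

where `B` removes the blanks from an alignment over the `T x (U + 1)` lattice that the `logits` tensor covers.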