update README & RESULTS

Triplecq 2024-05-01 17:59:40 -04:00
parent 3505a8ec45
commit ea1d9b20a8
2 changed files with 80 additions and 0 deletions

@@ -0,0 +1,31 @@
# Introduction
**ReazonSpeech** is an open-source dataset that contains a diverse set of natural Japanese speech, collected from terrestrial television streams. It contains more than 35,000 hours of audio.
The dataset is available on Hugging Face. For more details, please visit:
- Dataset: https://huggingface.co/datasets/reazon-research/reazonspeech
- Paper: https://research.reazon.jp/_static/reazonspeech_nlp2023.pdf
[./RESULTS.md](./RESULTS.md) contains the latest results.
# Transducers
This folder contains several transducer recipes. The following table lists the differences among them.
| Recipe                                   | Encoder              | Decoder            | Comment                                           |
| ---------------------------------------- | -------------------- | ------------------ | ------------------------------------------------- |
| `pruned_transducer_stateless2`           | Conformer (modified) | Embedding + Conv1d | Uses the k2 pruned RNN-T loss                     |
| `pruned_transducer_stateless7_streaming` | Streaming Zipformer  | Embedding + Conv1d | Streaming version of pruned_transducer_stateless7 |
| `zipformer`                              | Upgraded Zipformer   | Embedding + Conv1d | The latest recipe                                 |
The decoder in `transducer_stateless` is a modified version of the one in the paper [Rnn-Transducer with Stateless Prediction Network](https://ieeexplore.ieee.org/document/9054419/): we place an additional Conv1d layer right after the input embedding layer.

@@ -0,0 +1,49 @@
## Results
### Zipformer
#### Non-streaming
##### Large-scale model, number of model parameters: 159337842, i.e., 159.34 M
| decoding method | In-Distribution CER | JSUT | CommonVoice | TEDx | comment |
| :------------------: | :-----------------: | :--: | :---------: | :---: | :----------------: |
| greedy search | 4.2 | 6.7 | 7.84 | 17.9 | --epoch 39 --avg 7 |
| modified beam search | 4.13 | 6.77 | 7.69 | 17.82 | --epoch 39 --avg 7 |
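As a quick check on the figure in the heading, the raw parameter count can be converted to millions in the shell. This is only a sketch; the `to_millions` helper is a hypothetical name and not part of the recipe.

```shell
# Hypothetical helper (not part of the recipe): format a raw parameter
# count in millions, rounded to two decimals.
to_millions() {
  awk -v n="$1" 'BEGIN { printf "%.2f M\n", n / 1000000 }'
}

to_millions 159337842   # prints "159.34 M"
```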
The training command is:
```shell
./zipformer/train.py \
  --world-size 8 \
  --num-epochs 40 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp-large \
  --causal 0 \
  --num-encoder-layers 2,2,4,5,4,2 \
  --feedforward-dim 512,768,1536,2048,1536,768 \
  --encoder-dim 192,256,512,768,512,256 \
  --encoder-unmasked-dim 192,192,256,320,256,192 \
  --lang data/lang_char \
  --max-duration 1600
```
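The four per-stack options in the training command each list six comma-separated values. Assuming, as the matching lengths suggest, that each list carries one value per encoder stack, a small check like the following could catch a mistyped list before launching a long run. The `count_fields` and `check_same_length` helpers are hypothetical and not part of the recipe.

```shell
# Hypothetical helpers (not part of the recipe).
# Count the comma-separated entries in a single option value.
count_fields() {
  printf '%s\n' "$1" | awk -F, '{ print NF }'
}

# Return 0 if every argument has the same number of entries, 1 otherwise.
check_same_length() {
  expected=$(count_fields "$1")
  for value in "$@"; do
    if [ "$(count_fields "$value")" -ne "$expected" ]; then
      echo "length mismatch: $value (expected $expected entries)" >&2
      return 1
    fi
  done
  return 0
}

# Values copied from the training command above.
check_same_length \
  2,2,4,5,4,2 \
  512,768,1536,2048,1536,768 \
  192,256,512,768,512,256 \
  192,192,256,320,256,192 \
  && echo "all lists have $(count_fields 2,2,4,5,4,2) entries"
```

With the values above, the check passes and reports six entries per list.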
The decoding command is:
```shell
./zipformer/decode.py \
  --epoch 40 \
  --avg 16 \
  --exp-dir zipformer/exp-large \
  --max-duration 600 \
  --causal 0 \
  --decoding-method greedy_search \
  --num-encoder-layers 2,2,4,5,4,2 \
  --feedforward-dim 512,768,1536,2048,1536,768 \
  --encoder-dim 192,256,512,768,512,256 \
  --encoder-unmasked-dim 192,192,256,320,256,192 \
  --lang data/lang_char \
  --blank-penalty 0
```
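The results table reports two decoding methods, while the command above runs only `greedy_search`. The sketch below loops over both with otherwise identical options. It assumes that `modified_beam_search` is the `--decoding-method` value corresponding to the "modified beam search" row, which the command above does not show. The `run_decode` helper and the `DRY_RUN` variable are hypothetical; by default the script only prints the commands.

```shell
# Hypothetical wrapper (not part of the recipe): build the decoding command
# for one method, then print it (DRY_RUN=1, the default) or execute it.
run_decode() {
  method="$1"
  set -- ./zipformer/decode.py \
    --epoch 40 \
    --avg 16 \
    --exp-dir zipformer/exp-large \
    --max-duration 600 \
    --causal 0 \
    --decoding-method "$method" \
    --num-encoder-layers 2,2,4,5,4,2 \
    --feedforward-dim 512,768,1536,2048,1536,768 \
    --encoder-dim 192,256,512,768,512,256 \
    --encoder-unmasked-dim 192,192,256,320,256,192 \
    --lang data/lang_char \
    --blank-penalty 0
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Assumption: modified_beam_search is the flag value for the
# "modified beam search" row of the results table.
for m in greedy_search modified_beam_search; do
  run_decode "$m"
done
```

Setting `DRY_RUN=0` before running the loop executes the commands instead of printing them.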