diff --git a/egs/reazonspeech/ASR/README.md b/egs/reazonspeech/ASR/README.md
new file mode 100644
index 000000000..445680fbc
--- /dev/null
+++ b/egs/reazonspeech/ASR/README.md
@@ -0,0 +1,31 @@
+# Introduction
+
+
+
+**ReazonSpeech** is an open-source dataset of diverse, natural Japanese speech collected from terrestrial television streams, comprising more than 35,000 hours of audio.
+
+
+
+The dataset is available on Hugging Face. For more details, please visit:
+
+- Dataset: https://huggingface.co/datasets/reazon-research/reazonspeech
+- Paper: https://research.reazon.jp/_static/reazonspeech_nlp2023.pdf
+
+
+
+[./RESULTS.md](./RESULTS.md) contains the latest results.
+
+# Transducers
+
+
+
+There are various recipe folders in this directory whose names contain `transducer`. The following table lists the differences among them.
+
+|                                          | Encoder              | Decoder            | Comment                                             |
+| ---------------------------------------- | -------------------- | ------------------ | --------------------------------------------------- |
+| `pruned_transducer_stateless2`           | Conformer (modified) | Embedding + Conv1d | Uses the k2 pruned RNN-T loss                       |
+| `pruned_transducer_stateless7_streaming` | Streaming Zipformer  | Embedding + Conv1d | Streaming version of `pruned_transducer_stateless7` |
+| `zipformer`                              | Upgraded Zipformer   | Embedding + Conv1d | The latest recipe                                   |
+
+The stateless decoder used in these recipes is modified from the paper [Rnn-Transducer with Stateless Prediction Network](https://ieeexplore.ieee.org/document/9054419/). We place an additional Conv1d layer right after the input embedding layer.
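The "Embedding + Conv1d" decoder above is a *stateless* prediction network: unlike an RNN decoder, it conditions only on a fixed window of the most recent output tokens, not on a recurrent state. A toy NumPy sketch of the idea follows; the names, sizes, and the simple weighted-sum "convolution" are illustrative only, not the recipe's actual `decoder.py`:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim, context_size = 500, 8, 2

# Embedding table and Conv1d kernel: the only learned state (hence "stateless").
embedding = rng.normal(size=(vocab_size, embed_dim))
# Depthwise Conv1d kernel over the last `context_size` token embeddings.
conv_kernel = rng.normal(size=(embed_dim, context_size))

def decoder_output(token_ids):
    """Decoder output: depends only on the last `context_size` tokens."""
    context = list(token_ids[-context_size:])
    # Left-pad with blank (id 0) when the history is shorter than the context.
    context = [0] * (context_size - len(context)) + context
    emb = embedding[context]                  # (context_size, embed_dim)
    # A Conv1d whose kernel width equals context_size collapses to a
    # per-channel weighted sum over the context window.
    out = (emb.T * conv_kernel).sum(axis=1)   # (embed_dim,)
    return np.maximum(out, 0.0)               # ReLU

# Histories that agree on the last two tokens yield identical outputs.
a = decoder_output([7, 42, 3, 15])
b = decoder_output([99, 3, 15])
assert np.allclose(a, b)
```

Because the output depends only on a short token window, the decoder is cheap to evaluate inside the pruned RNN-T loss and during beam search.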
+
diff --git a/egs/reazonspeech/ASR/RESULTS.md b/egs/reazonspeech/ASR/RESULTS.md
new file mode 100644
index 000000000..c0b4fe54a
--- /dev/null
+++ b/egs/reazonspeech/ASR/RESULTS.md
@@ -0,0 +1,49 @@
+## Results
+
+### Zipformer
+
+#### Non-streaming
+
+##### Large-scale model, number of model parameters: 159337842, i.e., 159.34 M
+
+| decoding method      | In-Distribution CER | JSUT | CommonVoice | TEDx  | comment            |
+| :------------------: | :-----------------: | :--: | :---------: | :---: | :----------------: |
+| greedy search        | 4.2                 | 6.7  | 7.84        | 17.9  | --epoch 39 --avg 7 |
+| modified beam search | 4.13                | 6.77 | 7.69        | 17.82 | --epoch 39 --avg 7 |
+
+The training command is:
+
+```shell
+./zipformer/train.py \
+  --world-size 8 \
+  --num-epochs 40 \
+  --start-epoch 1 \
+  --use-fp16 1 \
+  --exp-dir zipformer/exp-large \
+  --causal 0 \
+  --num-encoder-layers 2,2,4,5,4,2 \
+  --feedforward-dim 512,768,1536,2048,1536,768 \
+  --encoder-dim 192,256,512,768,512,256 \
+  --encoder-unmasked-dim 192,192,256,320,256,192 \
+  --lang data/lang_char \
+  --max-duration 1600
+```
+
+The decoding command is:
+
+```shell
+./zipformer/decode.py \
+  --epoch 40 \
+  --avg 16 \
+  --exp-dir zipformer/exp-large \
+  --max-duration 600 \
+  --causal 0 \
+  --decoding-method greedy_search \
+  --num-encoder-layers 2,2,4,5,4,2 \
+  --feedforward-dim 512,768,1536,2048,1536,768 \
+  --encoder-dim 192,256,512,768,512,256 \
+  --encoder-unmasked-dim 192,192,256,320,256,192 \
+  --lang data/lang_char \
+  --blank-penalty 0
+```
+
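All numbers in the table above are character error rates (CER), i.e. Levenshtein edit distance between the hypothesis and the reference, divided by the reference length in characters. A minimal, self-contained sketch of that computation (a hypothetical helper for illustration, not the recipe's actual scoring code):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance / len(ref).

    Assumes a non-empty reference string.
    """
    m, n = len(ref), len(hyp)
    # prev[j] = edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / m

# One substitution over a five-character reference -> 20% CER.
print(cer("こんにちは", "こんにちわ"))  # → 0.2
```

CER is used instead of word error rate because Japanese text has no whitespace word boundaries, so character-level scoring avoids depending on a particular segmenter.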