zr_jin 0f1bc6f8af
Multi_zh-Hans Recipe (#1238)
* Init commit for recipes trained on multiple zh datasets.

* fbank extraction for thchs30

* added support for aishell1

* added support for aishell-2

* fixes

* fixes

* fixes

* added support for stcmds and primewords

* fixes

* added support for magicdata

script for fbank computation not done yet

* added script for magicdata fbank computation

* file permission fixed

* updated for the wenetspeech recipe

* updated

* Update preprocess_kespeech.py

* updated

* updated

* updated

* updated

* file permission fixed

* updated paths

* fixes

* added support for kespeech dev/test set fbank computation

* fixes for file permission

* refined support for KeSpeech

* added scripts for BPE model training

* updated

* init commit for the multi_zh-cn zipformer recipe

* disable speed perturbation by default

* updated

* updated

* added necessary files for the zipformer recipe

* removed redundant wenetspeech M and S sets

* updates for multi dataset decoding

* refined

* formatting issues fixed

* updated

* minor fixes

* this commit finalize the recipe (hopefully)

* fixed formatting issues

* minor fixes

* updated

* using soft links to reduce redundancy

* minor updates

* using soft links to reduce redundancy

* minor updates

* minor updates

* using soft links to reduce redundancy

* minor updates

* Update README.md

* minor updates

* Update egs/multi_zh-hans/ASR/local/compute_fbank_magicdata.py

Co-authored-by: Fangjun Kuang <csukuangfj@gmail.com>

* Update egs/multi_zh-hans/ASR/local/compute_fbank_magicdata.py

Co-authored-by: Fangjun Kuang <csukuangfj@gmail.com>

* Update egs/multi_zh-hans/ASR/local/compute_fbank_stcmds.py

Co-authored-by: Fangjun Kuang <csukuangfj@gmail.com>

* Update egs/multi_zh-hans/ASR/local/compute_fbank_stcmds.py

Co-authored-by: Fangjun Kuang <csukuangfj@gmail.com>

* Update egs/multi_zh-hans/ASR/local/compute_fbank_primewords.py

Co-authored-by: Fangjun Kuang <csukuangfj@gmail.com>

* Update egs/multi_zh-hans/ASR/local/compute_fbank_primewords.py

Co-authored-by: Fangjun Kuang <csukuangfj@gmail.com>

* minor updates

* minor fixes

* fixed a formatting issue

* Update preprocess_kespeech.py

* Update prepare.sh

* Update egs/multi_zh-hans/ASR/local/compute_fbank_kespeech_splits.py

Co-authored-by: Fangjun Kuang <csukuangfj@gmail.com>

* Update egs/multi_zh-hans/ASR/local/preprocess_kespeech.py

Co-authored-by: Fangjun Kuang <csukuangfj@gmail.com>

* removed redundant files

* symlinks added

* minor updates

* added CI tests for `multi_zh-hans`

* minor fixes

* Update run-multi-zh_hans-zipformer.sh

* Update run-multi-zh_hans-zipformer.sh

* Update run-multi-zh_hans-zipformer.sh

* Update run-multi-zh_hans-zipformer.sh

* Update run-multi-zh_hans-zipformer.sh

* Update run-multi-zh_hans-zipformer.sh

* Update run-multi-zh_hans-zipformer.sh

---------

Co-authored-by: Fangjun Kuang <csukuangfj@gmail.com>
2023-09-13 11:57:05 +08:00

982 B

Introduction

This recipe includes scripts for training Zipformer model using multiple Chinese datasets.

Included Training Sets

  1. THCHS-30
  2. AiShell-{1,2,4}
  3. ST-CMDS
  4. Primewords
  5. MagicData
  6. Aidatatang_200zh
  7. AliMeeting
  8. WeNetSpeech
  9. KeSpeech-ASR
Datset Number of hours URL
TOTAL 14,106 ---
THCHS-30 35 https://www.openslr.org/18/
AiShell-1 170 https://www.openslr.org/33/
AiShell-2 1,000 http://www.aishelltech.com/aishell_2
AiShell-4 120 https://www.openslr.org/111/
ST-CMDS 110 https://www.openslr.org/38/
Primewords 99 https://www.openslr.org/47/
aidatatang_200zh 200 https://www.openslr.org/62/
MagicData 755 https://www.openslr.org/68/
AliMeeting 100 https://openslr.org/119/
WeNetSpeech 10,000 https://github.com/wenet-e2e/WenetSpeech
KeSpeech 1,542 https://github.com/KeSpeech/KeSpeech

Included Test Sets

  1. Aishell-{1,2,4}
  2. Aidatatang_200zh
  3. AliMeeting
  4. MagicData
  5. KeSpeech-ASR
  6. WeNetSpeech