# Results
| LLM Model | Flow matching Model | Seed-TTS test_zh CER | Comment |
|-----------|---------------------|----------------------|---------|
| pretrained cosyvoice2 llm | pretrained cosyvoice2 unet | 1.45% | See [paper](https://arxiv.org/abs/2412.10117) |
| pretrained cosyvoice2 llm | f5-tts-small (wenetspeech4tts) | 1.79% (16 steps) | See [PR](https://github.com/k2-fsa/icefall/pull/1880) |
| llasa_cosyvoice2_token llm (Emilia 50k hours ZH) | f5-tts-small (wenetspeech4tts) | 1.81% (16 steps) | |
# Introduction
[**Emilia**](https://huggingface.co/datasets/amphion/Emilia-Dataset) starts with over 101k hours of speech across six languages, covering a wide range of speaking styles to enable more natural and spontaneous speech generation.

See https://arxiv.org/pdf/2407.05361.
# Llasa (cosyvoice2 token)
`./llasa_cosyvoice2_token` contains the code for training Qwen2.5-0.5B models to predict CosyVoice2 semantic tokens.
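
As background, the Llasa-style recipe treats the CosyVoice2 semantic tokens as extra vocabulary entries of the LLM, so text and speech tokens form a single sequence trained with ordinary next-token prediction. The snippet below is only a sketch of that idea; the special-token names, the codebook size, and the prompt format are assumptions, and the real formatting lives in `./llasa_cosyvoice2_token`.

```python
# Sketch only: fold CosyVoice2 semantic tokens into the Qwen2.5 vocabulary
# (Llasa-style). Token names and codebook size below are assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

num_speech_tokens = 6561  # assumed CosyVoice2 FSQ codebook size; verify against the recipe
speech_tokens = [f"<|s_{i}|>" for i in range(num_speech_tokens)]
tokenizer.add_tokens(speech_tokens + ["<|SPEECH_GENERATION_START|>"])
# (the model's embedding table must be resized to len(tokenizer) before training)

# A training sample is then text followed by its semantic tokens:
text = "The weather is nice today"
semantic_ids = [12, 873, 4401]  # toy IDs; real ones come from prepare.sh
sample = (
    text
    + "<|SPEECH_GENERATION_START|>"
    + "".join(f"<|s_{i}|>" for i in semantic_ids)
)
input_ids = tokenizer(sample, return_tensors="pt").input_ids
```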
Generated samples and training logs for the 50k-hour Chinese subset of [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) can be found [here](https://huggingface.co/yuekai/llasa_cosyvoice2_token_qwen_0.5b/tree/main).

Preparation:
```
# extract cosyvoice2 semantic tokens
bash prepare.sh --stage 3 --stop_stage 4

# Or you could use the prepared tokens.
huggingface-cli download yuekai/emilia_cosyvoice_v2_token --local-dir emilia_cosyvoice_v2_token
```
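
For context, the token-extraction stage corresponds to running the CosyVoice2 speech tokenizer over the audio. A minimal standalone sketch with the [S3Tokenizer](https://github.com/xingchensong/S3Tokenizer/tree/main) package (credited below) could look like the following; the tokenizer name and API details here are assumptions, so treat `prepare.sh` as the source of truth.

```python
# Sketch only: extract CosyVoice2-style semantic tokens for a single wav file.
# Assumes the S3Tokenizer package API; prepare.sh batches this over Emilia.
import s3tokenizer

# "speech_tokenizer_v2_25hz" is assumed to be the CosyVoice2 tokenizer variant.
tokenizer = s3tokenizer.load_model("speech_tokenizer_v2_25hz")

mels = [s3tokenizer.log_mel_spectrogram(s3tokenizer.load_audio("example.wav"))]
mels, mels_lens = s3tokenizer.padding(mels)
codes, codes_lens = tokenizer.quantize(mels, mels_lens)

print(codes[0, : codes_lens[0].item()])  # semantic token IDs for this utterance
```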
The training command is given below:
```
# docker: ghcr.io/swivid/f5-tts:main
# pip install -r llasa_cosyvoice2_token/requirements.txt

WANDB_KEY=$your_wandb_key
wandb login ${WANDB_KEY}

huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct --local-dir Qwen2.5-0.5B-Instruct

torchrun --nproc_per_node=8 train.py config.json
```
To run inference with the icefall Emilia-trained Chinese llasa_cosyvoice2_token model, we need the CosyVoice2-token flow matching [model](https://github.com/k2-fsa/icefall/pull/1880):
```
cd icefall/egs/wenetspeech4tts/TTS

huggingface-cli login
huggingface-cli download --local-dir ${exp_dir} yuekai/llasa_cosyvoice2_token_qwen_0.5b
huggingface-cli download nvidia/bigvgan_v2_24khz_100band_256x --local-dir bigvgan_v2_24khz_100band_256x
vocoder=./bigvgan_v2_24khz_100band_256x
split=test_zh
llm_path=llasa_cosyvoice2_token_qwen_0.5b/checkpoint-800000

huggingface-cli download --local-dir f5-tts-small-wenetspeech4tts-basic yuekai/f5-tts-semantic-token-small-wenetspeech4tts-basic
model_path=f5-tts-small-wenetspeech4tts-basic/epoch-10-avg-5.pt

torchrun --nproc_per_node=2 \
  f5-tts/infer_dist.py \
    --output_dir $output_dir \
    --batch_size 1 \
    --num_workers 2 \
    --llm-model-name-or-path $llm_path \
    --flow-matching-model-path $model_path \
    --decoder-dim 768 --nhead 12 --num-decoder-layers 18 \
    --use-cosyvoice-semantic-token True \
    --vocoder-dir $vocoder \
    --split-name $split --top-k 50 --top-p 0.95 --temperature 0.8 \
    --tokenizer-dir Qwen/Qwen2.5-0.5B-Instruct

# compute cer
huggingface-cli download yuekai/seed_tts_eval --local-dir seed_tts_eval --repo-type dataset
manifest=./seed_tts_eval/seedtts_testset/zh/meta.lst
bash local/compute_wer.sh $output_dir $manifest
```
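
For reference, `local/compute_wer.sh` scores the generated wavs against the reference transcripts in `meta.lst` (via ASR transcription), and the CER itself is character-level edit distance divided by the reference length. A minimal sketch of that calculation (illustration only, not the recipe's scoring pipeline) is:

```python
# Minimal CER sketch: character edit distance over the reference length.
# The recipe's local/compute_wer.sh additionally handles ASR decoding,
# text normalization, and aggregation over the whole test_zh split.
def cer(ref: str, hyp: str) -> float:
    r, h = list(ref), list(hyp)
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / max(len(r), 1)

print(cer("hello world", "hello word"))  # 1 deletion / 11 chars ≈ 0.091
```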
# Credits
- [Llasa](https://arxiv.org/abs/2502.04128)
- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)
- [S3Tokenizer](https://github.com/xingchensong/S3Tokenizer/tree/main)