## ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching

[![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](http://arxiv.org/abs/2506.13053) [![demo](https://img.shields.io/badge/GitHub-Demo%20page-orange.svg)](https://zipvoice.github.io/)

## Overview

ZipVoice is a high-quality zero-shot TTS model with a small model size and fast inference speed.

#### Key features:

- Small and fast: only 123M parameters.
- High-quality: state-of-the-art voice cloning performance in speaker similarity, intelligibility, and naturalness.
- Multi-lingual: supports Chinese and English.

## News

**2025/06/16**: 🔥 ZipVoice is released.

## Installation

```
pip install -r requirements.txt
```

## Usage

To generate speech with our pre-trained ZipVoice or ZipVoice-Distill models, use the following commands (required models will be downloaded from HuggingFace):

### 1. Inference of a single sentence:

```bash
python3 zipvoice/zipvoice_infer.py \
    --model-name "zipvoice_distill" \
    --prompt-wav prompt.wav \
    --prompt-text "I am the transcription of the prompt wav." \
    --text "I am the text to be synthesized." \
    --res-wav-path result.wav
```

### 2. Inference of a list of sentences:

```bash
python3 zipvoice/zipvoice_infer.py \
    --model-name "zipvoice_distill" \
    --test-list test.tsv \
    --res-dir results/test
```

- `--model-name` can be `zipvoice` or `zipvoice_distill`, which are the models before and after distillation, respectively.
- Each line of `test.tsv` is in the format `{wav_name}\t{prompt_transcription}\t{prompt_wav}\t{text}`.

> **Note:** If you have trouble connecting to HuggingFace, try:

```bash
export HF_ENDPOINT=https://hf-mirror.com
```

## Training Your Own Model

The following steps show how to train a model from scratch on the Emilia and LibriTTS datasets, respectively.

### 1. Data Preparation

#### 1.1 Prepare the Emilia dataset

#### 1.2 Prepare the LibriTTS dataset

See [local/prepare_libritts.sh](local/prepare_libritts.sh)

### 2. Training

#### 2.1 Training on Emilia
##### 2.1.1 Train the ZipVoice model

- Training:

```bash
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/train_flow.py \
    --world-size 8 \
    --use-fp16 1 \
    --dataset emilia \
    --max-duration 500 \
    --lr-hours 30000 \
    --lr-batches 7500 \
    --token-file "data/tokens_emilia.txt" \
    --manifest-dir "data/fbank_emilia" \
    --num-epochs 11 \
    --exp-dir zipvoice/exp_zipvoice
```

- Average the checkpoints to produce the final model:

```bash
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/generate_averaged_model.py \
    --epoch 11 \
    --avg 4 \
    --distill 0 \
    --token-file data/tokens_emilia.txt \
    --dataset "emilia" \
    --exp-dir ./zipvoice/exp_zipvoice
# The generated model is zipvoice/exp_zipvoice/epoch-11-avg-4.pt
```

##### 2.1.2 Train the ZipVoice-Distill model (Optional)

- The first-stage distillation:

```bash
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/train_distill.py \
    --world-size 8 \
    --use-fp16 1 \
    --tensorboard 1 \
    --dataset "emilia" \
    --base-lr 0.0005 \
    --max-duration 500 \
    --token-file "data/tokens_emilia.txt" \
    --manifest-dir "data/fbank_emilia" \
    --teacher-model zipvoice/exp_zipvoice/epoch-11-avg-4.pt \
    --num-updates 60000 \
    --distill-stage "first" \
    --exp-dir zipvoice/exp_zipvoice_distill_1stage
```

- Average checkpoints for the second-stage initialization:

```bash
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/generate_averaged_model.py \
    --iter 60000 \
    --avg 7 \
    --distill 1 \
    --token-file data/tokens_emilia.txt \
    --dataset "emilia" \
    --exp-dir ./zipvoice/exp_zipvoice_distill_1stage
# The generated model is zipvoice/exp_zipvoice_distill_1stage/iter-60000-avg-7.pt
```

- The second-stage distillation:

```bash
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/train_distill.py \
    --world-size 8 \
    --use-fp16 1 \
    --tensorboard 1 \
    --dataset "emilia" \
    --base-lr 0.0001 \
    --max-duration 200 \
    --token-file "data/tokens_emilia.txt" \
    --manifest-dir "data/fbank_emilia" \
    --teacher-model zipvoice/exp_zipvoice_distill_1stage/iter-60000-avg-7.pt \
    --num-updates 2000 \
    --distill-stage "second" \
    --exp-dir zipvoice/exp_zipvoice_distill
```
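The averaging steps above combine the parameters of the last several checkpoints into a single model, which usually generalizes better than any single checkpoint. A minimal stdlib-only sketch of the idea (the actual logic lives in `zipvoice/generate_averaged_model.py`; the toy parameter names below are hypothetical, and plain lists of floats stand in for tensors):

```python
def average_checkpoints(state_dicts):
    """Average several checkpoints' parameters element-wise.

    Each checkpoint is modeled as a dict mapping parameter names
    to lists of floats (stand-ins for real tensors).
    """
    n = len(state_dicts)
    averaged = {}
    for name in state_dicts[0]:
        params = [sd[name] for sd in state_dicts]
        averaged[name] = [sum(vals) / n for vals in zip(*params)]
    return averaged

# Example: average two toy "checkpoints".
ckpt_a = {"proj.weight": [1.0, 2.0], "proj.bias": [0.0]}
ckpt_b = {"proj.weight": [3.0, 4.0], "proj.bias": [2.0]}
print(average_checkpoints([ckpt_a, ckpt_b]))
# {'proj.weight': [2.0, 3.0], 'proj.bias': [1.0]}
```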
#### 2.2 Training on LibriTTS
##### 2.2.1 Train the ZipVoice model

- Training:

```bash
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/train_flow.py \
    --world-size 8 \
    --use-fp16 1 \
    --dataset libritts \
    --max-duration 250 \
    --lr-epochs 10 \
    --lr-batches 7500 \
    --token-file "data/tokens_libritts.txt" \
    --manifest-dir "data/fbank_libritts" \
    --num-epochs 60 \
    --exp-dir zipvoice/exp_zipvoice_libritts
```

- Average the checkpoints to produce the final model:

```bash
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/generate_averaged_model.py \
    --epoch 60 \
    --avg 10 \
    --distill 0 \
    --token-file data/tokens_libritts.txt \
    --dataset "libritts" \
    --exp-dir ./zipvoice/exp_zipvoice_libritts
# The generated model is zipvoice/exp_zipvoice_libritts/epoch-60-avg-10.pt
```

##### 2.2.2 Train the ZipVoice-Distill model (Optional)

- The first-stage distillation:

```bash
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/train_distill.py \
    --world-size 8 \
    --use-fp16 1 \
    --tensorboard 1 \
    --dataset "libritts" \
    --base-lr 0.001 \
    --max-duration 250 \
    --token-file "data/tokens_libritts.txt" \
    --manifest-dir "data/fbank_libritts" \
    --teacher-model zipvoice/exp_zipvoice_libritts/epoch-60-avg-10.pt \
    --num-epochs 6 \
    --distill-stage "first" \
    --exp-dir zipvoice/exp_zipvoice_distill_1stage_libritts
```

- Average checkpoints for the second-stage initialization:

```bash
export PYTHONPATH=../../:$PYTHONPATH
python3 ./zipvoice/generate_averaged_model.py \
    --epoch 6 \
    --avg 3 \
    --distill 1 \
    --token-file data/tokens_libritts.txt \
    --dataset "libritts" \
    --exp-dir ./zipvoice/exp_zipvoice_distill_1stage_libritts
# The generated model is zipvoice/exp_zipvoice_distill_1stage_libritts/epoch-6-avg-3.pt
```

- The second-stage distillation:

```bash
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/train_distill.py \
    --world-size 8 \
    --use-fp16 1 \
    --tensorboard 1 \
    --dataset "libritts" \
    --base-lr 0.001 \
    --max-duration 250 \
    --token-file "data/tokens_libritts.txt" \
    --manifest-dir "data/fbank_libritts" \
    --teacher-model zipvoice/exp_zipvoice_distill_1stage_libritts/epoch-6-avg-3.pt \
    --num-epochs 6 \
    --distill-stage "second" \
    --exp-dir zipvoice/exp_zipvoice_distill_libritts
```

- Average checkpoints to produce the final model:

```bash
export PYTHONPATH=../../:$PYTHONPATH
python3 ./zipvoice/generate_averaged_model.py \
    --epoch 6 \
    --avg 3 \
    --distill 1 \
    --token-file data/tokens_libritts.txt \
    --dataset "libritts" \
    --exp-dir ./zipvoice/exp_zipvoice_distill_libritts
# The generated model is ./zipvoice/exp_zipvoice_distill_libritts/epoch-6-avg-3.pt
```
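Both recipes above train the base model with a flow-matching objective: the network regresses the velocity field that transports noise to data along a simple interpolation path. A toy stdlib-only sketch of the standard conditional flow-matching target (illustrative only; ZipVoice's real loss operates on mel spectrograms with text conditioning):

```python
import random

def flow_matching_target(x0, x1, t):
    """For a linear path x_t = (1 - t) * x0 + t * x1, return the point
    on the path and the velocity the model must regress (x1 - x0)."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target

def fm_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_pred)

# One toy training example: x0 is Gaussian noise, x1 is the "data".
random.seed(0)
x0 = [random.gauss(0, 1) for _ in range(4)]
x1 = [1.0, 2.0, 3.0, 4.0]
x_t, v_target = flow_matching_target(x0, x1, t=0.5)
print(fm_loss(v_target, v_target))  # a perfect prediction gives loss 0.0
```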
### 3. Inference with the trained model

#### 3.1 Inference with the model trained on Emilia
##### 3.1.1 ZipVoice model (before distillation):

```bash
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/infer.py \
    --checkpoint zipvoice/exp_zipvoice/epoch-11-avg-4.pt \
    --distill 0 \
    --token-file "data/tokens_emilia.txt" \
    --test-list test.tsv \
    --res-dir results/test \
    --num-step 16 \
    --guidance-scale 1
```

##### 3.1.2 ZipVoice-Distill model:

```bash
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/infer.py \
    --checkpoint zipvoice/exp_zipvoice_distill/checkpoint-2000.pt \
    --distill 1 \
    --token-file "data/tokens_emilia.txt" \
    --test-list test.tsv \
    --res-dir results/test_distill \
    --num-step 8 \
    --guidance-scale 3
```
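The `--guidance-scale` flag above controls classifier-free guidance: at each sampling step, conditional and unconditional predictions are combined to strengthen the conditioning. A minimal sketch under one common CFG parameterization (the function and values below are illustrative, not the repo's API; check `zipvoice/infer.py` for the exact formulation used):

```python
def apply_cfg(v_cond, v_uncond, scale):
    """One common classifier-free guidance parameterization:
    v = v_uncond + scale * (v_cond - v_uncond).
    scale = 1 recovers the conditional prediction;
    scale > 1 extrapolates past it for stronger conditioning."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]

# Toy velocities: scale 3 pushes the result well past the conditional one.
print(apply_cfg([2.0, 4.0], [1.0, 2.0], 3.0))  # [4.0, 8.0]
```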
#### 3.2 Inference with the model trained on LibriTTS
##### 3.2.1 ZipVoice model (before distillation):

```bash
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/infer.py \
    --checkpoint zipvoice/exp_zipvoice_libritts/epoch-60-avg-10.pt \
    --distill 0 \
    --token-file "data/tokens_libritts.txt" \
    --test-list test.tsv \
    --res-dir results/test_libritts \
    --num-step 8 \
    --guidance-scale 1 \
    --target-rms 1.0 \
    --t-shift 0.7
```

##### 3.2.2 ZipVoice-Distill model:

```bash
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/infer.py \
    --checkpoint zipvoice/exp_zipvoice_distill_libritts/epoch-6-avg-3.pt \
    --distill 1 \
    --token-file "data/tokens_libritts.txt" \
    --test-list test.tsv \
    --res-dir results/test_distill_libritts \
    --num-step 4 \
    --guidance-scale 3 \
    --target-rms 1.0 \
    --t-shift 0.7
```
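All of the `--test-list` commands above consume a tab-separated file with one utterance per line, in the `{wav_name}\t{prompt_transcription}\t{prompt_wav}\t{text}` format described in the Usage section. A small sketch of generating one (the wav names, prompt paths, and texts below are placeholders):

```python
import csv

# Each row: output wav name, prompt transcription, prompt wav path, target text.
rows = [
    ("utt_001", "I am the transcription of the prompt wav.",
     "prompts/speaker1.wav", "Hello, this is a synthesized sentence."),
    ("utt_002", "I am the transcription of the prompt wav.",
     "prompts/speaker1.wav", "And this is another one."),
]

# Write tab-separated rows; csv handles any embedded quoting consistently.
with open("test.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerows(rows)
```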
### 4. Evaluation on benchmarks

See [local/evaluate.sh](local/evaluate.sh) for details of the objective-metrics evaluation on three test sets: LibriSpeech-PC test-clean, Seed-TTS test-en, and Seed-TTS test-zh.

## Citation

```bibtex
@article{zhu-2025-zipvoice,
    title={ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching},
    author={Han Zhu and Wei Kang and Zengwei Yao and Liyong Guo and Fangjun Kuang and Zhaoqing Li and Weiji Zhuang and Long Lin and Daniel Povey},
    journal={arXiv preprint arXiv:2506.13053},
    year={2025},
}
```