ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching
Overview
ZipVoice is a high-quality zero-shot TTS model with a small model size and fast inference speed.
Key features:
- Small and fast: only 123M parameters.
- High-quality: state-of-the-art voice cloning performance in speaker similarity, intelligibility, and naturalness.
- Multi-lingual: supports Chinese and English.
News
2025/06/16: 🔥 ZipVoice is released.
Installation
pip install -r requirements.txt
Usage
To generate speech with our pre-trained ZipVoice or ZipVoice-Distill models, use the following commands (the required models are downloaded automatically from HuggingFace):
1. Inference of a single sentence:
python3 zipvoice/zipvoice_infer.py \
--model-name "zipvoice_distill" \
--prompt-wav prompt.wav \
--prompt-text "I am the transcription of the prompt wav." \
--text "I am the text to be synthesized." \
--res-wav-path result.wav
2. Inference of a list of sentences:
python3 zipvoice/zipvoice_infer.py \
--model-name "zipvoice_distill" \
--test-list test.tsv \
--res-dir results/test
- `--model-name` can be `zipvoice` or `zipvoice_distill`, which are the models before and after distillation, respectively.
- Each line of `test.tsv` has the format `{wav_name}\t{prompt_transcription}\t{prompt_wav}\t{text}`.
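For batch inference, `test.tsv` is a plain tab-separated file. A minimal sketch of building one follows; the utterance names, prompt paths, and texts here are placeholders, not files shipped with the repository.

```shell
# Each line: {wav_name}\t{prompt_transcription}\t{prompt_wav}\t{text}
printf '%s\t%s\t%s\t%s\n' \
  "utt_001" "I am the transcription of the prompt wav." "prompt.wav" "Hello from ZipVoice." \
  > test.tsv
printf '%s\t%s\t%s\t%s\n' \
  "utt_002" "Another prompt transcription." "prompt2.wav" "A second sentence to synthesize." \
  >> test.tsv
cat test.tsv
```

The synthesized audio for each line is written to `--res-dir` under the given `{wav_name}`.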
Note: If you have trouble connecting to HuggingFace, try:
export HF_ENDPOINT=https://hf-mirror.com
Training Your Own Model
The following steps show how to train a model from scratch on Emilia and LibriTTS datasets, respectively.
1. Data Preparation
1.1. Prepare the Emilia dataset
1.2 Prepare the LibriTTS dataset
2. Training
2.1 Training on Emilia
2.1.1 Train the ZipVoice model
- Training:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/train_flow.py \
--world-size 8 \
--use-fp16 1 \
--dataset emilia \
--max-duration 500 \
--lr-hours 30000 \
--lr-batches 7500 \
--token-file "data/tokens_emilia.txt" \
--manifest-dir "data/fbank_emilia" \
--num-epochs 11 \
--exp-dir zipvoice/exp_zipvoice
- Average the checkpoints to produce the final model:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/generate_averaged_model.py \
--epoch 11 \
--avg 4 \
--distill 0 \
--token-file data/tokens_emilia.txt \
--dataset "emilia" \
--exp-dir ./zipvoice/exp_zipvoice
# The generated model is zipvoice/exp_zipvoice/epoch-11-avg-4.pt
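Checkpoint averaging (`--epoch 11 --avg 4` above) takes the element-wise mean of the model parameters from the last few checkpoints. This is a generic illustration of the idea, not the repository's `generate_averaged_model.py`; real checkpoints are PyTorch state_dicts, so plain dicts of floats stand in for parameter tensors here.

```python
def average_checkpoints(state_dicts):
    """Return the element-wise average of a list of parameter dicts."""
    n = len(state_dicts)
    return {key: sum(sd[key] for sd in state_dicts) / n for key in state_dicts[0]}

# Stand-ins for the weights saved at epochs 8-11
# (--epoch 11 --avg 4 averages the last 4 checkpoints).
ckpts = [
    {"layer.weight": 0.8, "layer.bias": 0.1},
    {"layer.weight": 1.0, "layer.bias": 0.2},
    {"layer.weight": 1.2, "layer.bias": 0.3},
    {"layer.weight": 1.0, "layer.bias": 0.4},
]
print(average_checkpoints(ckpts))
```

Averaging smooths out the noise of the final optimization steps and usually yields a slightly better model than any single checkpoint.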
2.1.2 Train the ZipVoice-Distill model (Optional)
- The first-stage distillation:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/train_distill.py \
--world-size 8 \
--use-fp16 1 \
--tensorboard 1 \
--dataset "emilia" \
--base-lr 0.0005 \
--max-duration 500 \
--token-file "data/tokens_emilia.txt" \
--manifest-dir "data/fbank_emilia" \
--teacher-model zipvoice/exp_zipvoice/epoch-11-avg-4.pt \
--num-updates 60000 \
--distill-stage "first" \
--exp-dir zipvoice/exp_zipvoice_distill_1stage
- Average checkpoints for the second-stage initialization:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/generate_averaged_model.py \
--iter 60000 \
--avg 7 \
--distill 1 \
--token-file data/tokens_emilia.txt \
--dataset "emilia" \
--exp-dir ./zipvoice/exp_zipvoice_distill_1stage
# The generated model is zipvoice/exp_zipvoice_distill_1stage/iter-60000-avg-7.pt
- The second-stage distillation:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/train_distill.py \
--world-size 8 \
--use-fp16 1 \
--tensorboard 1 \
--dataset "emilia" \
--base-lr 0.0001 \
--max-duration 200 \
--token-file "data/tokens_emilia.txt" \
--manifest-dir "data/fbank_emilia" \
--teacher-model zipvoice/exp_zipvoice_distill_1stage/iter-60000-avg-7.pt \
--num-updates 2000 \
--distill-stage "second" \
--exp-dir zipvoice/exp_zipvoice_distill
2.2 Training on LibriTTS
2.2.1 Train the ZipVoice model
- Training:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/train_flow.py \
--world-size 8 \
--use-fp16 1 \
--dataset libritts \
--max-duration 250 \
--lr-epochs 10 \
--lr-batches 7500 \
--token-file "data/tokens_libritts.txt" \
--manifest-dir "data/fbank_libritts" \
--num-epochs 60 \
--exp-dir zipvoice/exp_zipvoice_libritts
- Average the checkpoints to produce the final model:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/generate_averaged_model.py \
--epoch 60 \
--avg 10 \
--distill 0 \
--token-file data/tokens_libritts.txt \
--dataset "libritts" \
--exp-dir ./zipvoice/exp_zipvoice_libritts
# The generated model is zipvoice/exp_zipvoice_libritts/epoch-60-avg-10.pt
2.2.2 Train the ZipVoice-Distill model (Optional)
- The first-stage distillation:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/train_distill.py \
--world-size 8 \
--use-fp16 1 \
--tensorboard 1 \
--dataset "libritts" \
--base-lr 0.001 \
--max-duration 250 \
--token-file "data/tokens_libritts.txt" \
--manifest-dir "data/fbank_libritts" \
--teacher-model zipvoice/exp_zipvoice_libritts/epoch-60-avg-10.pt \
--num-epochs 6 \
--distill-stage "first" \
--exp-dir zipvoice/exp_zipvoice_distill_1stage_libritts
- Average checkpoints for the second-stage initialization:
export PYTHONPATH=../../:$PYTHONPATH
python3 ./zipvoice/generate_averaged_model.py \
--epoch 6 \
--avg 3 \
--distill 1 \
--token-file data/tokens_libritts.txt \
--dataset "libritts" \
--exp-dir ./zipvoice/exp_zipvoice_distill_1stage_libritts
# The generated model is zipvoice/exp_zipvoice_distill_1stage_libritts/epoch-6-avg-3.pt
- The second-stage distillation:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/train_distill.py \
--world-size 8 \
--use-fp16 1 \
--tensorboard 1 \
--dataset "libritts" \
--base-lr 0.001 \
--max-duration 250 \
--token-file "data/tokens_libritts.txt" \
--manifest-dir "data/fbank_libritts" \
--teacher-model zipvoice/exp_zipvoice_distill_1stage_libritts/epoch-6-avg-3.pt \
--num-epochs 6 \
--distill-stage "second" \
--exp-dir zipvoice/exp_zipvoice_distill_libritts
- Average checkpoints to produce the final model:
export PYTHONPATH=../../:$PYTHONPATH
python3 ./zipvoice/generate_averaged_model.py \
--epoch 6 \
--avg 3 \
--distill 1 \
--token-file data/tokens_libritts.txt \
--dataset "libritts" \
--exp-dir ./zipvoice/exp_zipvoice_distill_libritts
# The generated model is ./zipvoice/exp_zipvoice_distill_libritts/epoch-6-avg-3.pt
3. Inference with the trained model
3.1 Inference with the model trained on Emilia
3.1.1 ZipVoice model (before distillation):
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/infer.py \
--checkpoint zipvoice/exp_zipvoice/epoch-11-avg-4.pt \
--distill 0 \
--token-file "data/tokens_emilia.txt" \
--test-list test.tsv \
--res-dir results/test \
--num-step 16 \
--guidance-scale 1
3.1.2 ZipVoice-Distill model:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/infer.py \
--checkpoint zipvoice/exp_zipvoice_distill/checkpoint-2000.pt \
--distill 1 \
--token-file "data/tokens_emilia.txt" \
--test-list test.tsv \
--res-dir results/test_distill \
--num-step 8 \
--guidance-scale 3
3.2 Inference with the model trained on LibriTTS
3.2.1 ZipVoice model (before distillation):
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/infer.py \
--checkpoint zipvoice/exp_zipvoice_libritts/epoch-60-avg-10.pt \
--distill 0 \
--token-file "data/tokens_libritts.txt" \
--test-list test.tsv \
--res-dir results/test_libritts \
--num-step 8 \
--guidance-scale 1 \
--target-rms 1.0 \
--t-shift 0.7
3.2.2 ZipVoice-Distill model:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/infer.py \
--checkpoint zipvoice/exp_zipvoice_distill_libritts/epoch-6-avg-3.pt \
--distill 1 \
--token-file "data/tokens_libritts.txt" \
--test-list test.tsv \
--res-dir results/test_distill_libritts \
--num-step 4 \
--guidance-scale 3 \
--target-rms 1.0 \
--t-shift 0.7
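The `--num-step` and `--guidance-scale` flags control the flow-matching sampler: the number of ODE solver steps, and how strongly classifier-free guidance pushes the conditional prediction away from the unconditional one. The sketch below shows the standard recipe (fixed-step Euler integration with guided velocities) on a toy velocity field; it is a generic illustration, not ZipVoice's actual sampler, and `toy_velocity` is invented for the example.

```python
import numpy as np

def guided_velocity(v_cond, v_uncond, guidance_scale):
    # Classifier-free guidance: extrapolate from the unconditional
    # velocity toward the conditional one by the guidance scale.
    return v_uncond + guidance_scale * (v_cond - v_uncond)

def euler_sample(x0, velocity_fn, num_steps, guidance_scale):
    # Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps.
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v_cond, v_uncond = velocity_fn(x, t)
        x = x + dt * guided_velocity(v_cond, v_uncond, guidance_scale)
    return x

# Toy field whose conditional/unconditional branches differ by a constant.
def toy_velocity(x, t):
    return x + 1.0, x  # (conditional, unconditional)

out = euler_sample(np.zeros(2), toy_velocity, num_steps=8, guidance_scale=3.0)
print(out)
```

Fewer steps mean faster inference at some quality cost, which is why the distilled models above run with `--num-step 8` or even `4` while the undistilled models use more.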
4. Evaluation on benchmarks
See local/evaluate.sh for details of the objective evaluation on three benchmark test sets: LibriSpeech-PC test-clean, Seed-TTS test-en, and Seed-TTS test-zh.
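Intelligibility on these benchmarks is typically reported as word error rate (WER) between the transcript of the synthesized audio (from an ASR model) and the target text. A self-contained sketch of the WER computation via edit distance follows; it illustrates the metric itself, not the internals of evaluate.sh.

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance over words, normalized by reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])  # substitution (or match)
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)  # deletion, insertion
    return dp[len(r)][len(h)] / len(r)

# One substitution ("the" -> "a") out of six reference words -> WER = 1/6
print(wer("the cat sat on the mat", "the cat sat on a mat"))
```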
Citation
@article{zhu-2025-zipvoice,
title={ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching},
author={Han Zhu and Wei Kang and Zengwei Yao and Liyong Guo and Fangjun Kuang and Zhaoqing Li and Weiji Zhuang and Long Lin and Daniel Povey},
journal={arXiv preprint arXiv:2506.13053},
year={2025},
}