ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching
Overview
ZipVoice is a high-quality zero-shot TTS model with a small model size and fast inference speed.
Key features:
- Small and fast: only 123M parameters.
- High-quality: state-of-the-art voice cloning performance in speaker similarity, intelligibility, and naturalness.
- Multi-lingual: supports Chinese and English.
News
2025/06/16: 🔥 ZipVoice is released.
Installation
- Clone the icefall repository and change to the zipvoice directory:
git clone https://github.com/k2-fsa/icefall.git
cd icefall/egs/zipvoice
- Create a Python virtual environment (optional but recommended):
python3 -m venv venv
source venv/bin/activate
- Install the required packages:
# Install pytorch and k2.
# If you want to use different versions, please refer to https://k2-fsa.org/get-started/k2/ for details.
# For users in China mainland, please refer to https://k2-fsa.org/zh-CN/get-started/k2/
pip install torch==2.5.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install k2==1.24.4.dev20250208+cuda12.1.torch2.5.1 -f https://k2-fsa.github.io/k2/cuda.html
# Install other dependencies.
pip install piper_phonemize -f https://k2-fsa.github.io/icefall/piper_phonemize.html
pip install -r requirements.txt
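After installation, a quick sanity check can confirm that the pinned dependencies import cleanly and that CUDA is visible (this snippet is an illustration, not part of the repository):
python3 - <<'EOF'
# Minimal sanity check for the installation above.
import torch
import torchaudio
import k2

print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
EOF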
Usage
To generate speech with our pre-trained ZipVoice or ZipVoice-Distill models, use the following commands (required models will be downloaded from HuggingFace automatically):
1. Inference of a single sentence:
# Chinese example
python3 zipvoice/zipvoice_infer.py \
--model-name "zipvoice_distill" \
--prompt-wav assets/prompt-zh.wav \
--prompt-text "对,这就是我,万人敬仰的太乙真人,虽然有点婴儿肥,但也掩不住我逼人的帅气。" \
--text "欢迎使用我们的语音合成模型,希望它能给你带来惊喜!" \
--res-wav-path result-zh.wav
# English example
python3 zipvoice/zipvoice_infer.py \
--model-name "zipvoice_distill" \
--prompt-wav assets/prompt-en.wav \
--prompt-text "Some call me nature, others call me mother nature. I've been here for over four point five billion years, twenty two thousand five hundred times longer than you." \
--text "Welcome to use our tts model, have fun!" \
--res-wav-path result-en.wav
2. Inference of a list of sentences:
python3 zipvoice/zipvoice_infer.py \
--model-name "zipvoice_distill" \
--test-list test.tsv \
--res-dir results/test
- --model-name can be zipvoice or zipvoice_distill, which are the models before and after distillation, respectively.
- Each line of test.tsv is in the format of {wav_name}\t{prompt_transcription}\t{prompt_wav}\t{text}.
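For example, a small helper script (hypothetical, not part of the repository) can assemble such a file; note that the four fields are separated by literal tab characters:
# Hypothetical helper: write a test.tsv in the format expected by
# zipvoice_infer.py: {wav_name}\t{prompt_transcription}\t{prompt_wav}\t{text}
rows = [
    ("result-en", "Some call me nature, others call me mother nature.",
     "assets/prompt-en.wav", "Welcome to use our tts model, have fun!"),
]
with open("test.tsv", "w", encoding="utf-8") as f:
    for wav_name, prompt_text, prompt_wav, text in rows:
        f.write("\t".join([wav_name, prompt_text, prompt_wav, text]) + "\n")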
Note: If you have trouble connecting to HuggingFace, try:
export HF_ENDPOINT=https://hf-mirror.com
Training Your Own Model
The following steps show how to train models from scratch on the Emilia and LibriTTS datasets, respectively.
1. Data Preparation
1.1 Prepare the Emilia dataset
1.2 Prepare the LibriTTS dataset
2. Training
2.1 Training on Emilia
2.1.1 Train the ZipVoice model
- Training:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/train_flow.py \
--world-size 8 \
--use-fp16 1 \
--dataset emilia \
--max-duration 500 \
--lr-hours 30000 \
--lr-batches 7500 \
--token-file "data/tokens_emilia.txt" \
--manifest-dir "data/fbank_emilia" \
--num-epochs 11 \
--exp-dir zipvoice/exp_zipvoice
- Average the checkpoints to produce the final model:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/generate_averaged_model.py \
--epoch 11 \
--avg 4 \
--distill 0 \
--token-file data/tokens_emilia.txt \
--dataset "emilia" \
--exp-dir ./zipvoice/exp_zipvoice
# The generated model is zipvoice/exp_zipvoice/epoch-11-avg-4.pt
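Conceptually, checkpoint averaging takes the element-wise mean of the parameters of the last few checkpoints; a minimal sketch of the idea (the actual script handles more bookkeeping, and the checkpoint layout with a "model" key is an assumption here):
# Sketch of checkpoint averaging: element-wise mean of model parameters
# across the last N epoch checkpoints (here epochs 8-11, matching --avg 4).
import torch

paths = [f"zipvoice/exp_zipvoice/epoch-{e}.pt" for e in (8, 9, 10, 11)]
avg = None
for p in paths:
    state = torch.load(p, map_location="cpu")["model"]  # assumed layout
    if avg is None:
        avg = {k: v.clone().float() for k, v in state.items()}
    else:
        for k in avg:
            avg[k] += state[k].float()
for k in avg:
    avg[k] /= len(paths)
torch.save({"model": avg}, "zipvoice/exp_zipvoice/epoch-11-avg-4.pt")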
2.1.2 Train the ZipVoice-Distill model (Optional)
- The first-stage distillation:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/train_distill.py \
--world-size 8 \
--use-fp16 1 \
--tensorboard 1 \
--dataset "emilia" \
--base-lr 0.0005 \
--max-duration 500 \
--token-file "data/tokens_emilia.txt" \
--manifest-dir "data/fbank_emilia" \
--teacher-model zipvoice/exp_zipvoice/epoch-11-avg-4.pt \
--num-updates 60000 \
--distill-stage "first" \
--exp-dir zipvoice/exp_zipvoice_distill_1stage
- Average checkpoints for the second-stage initialization:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/generate_averaged_model.py \
--iter 60000 \
--avg 7 \
--distill 1 \
--token-file data/tokens_emilia.txt \
--dataset "emilia" \
--exp-dir ./zipvoice/exp_zipvoice_distill_1stage
# The generated model is zipvoice/exp_zipvoice_distill_1stage/iter-60000-avg-7.pt
- The second-stage distillation:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/train_distill.py \
--world-size 8 \
--use-fp16 1 \
--tensorboard 1 \
--dataset "emilia" \
--base-lr 0.0001 \
--max-duration 200 \
--token-file "data/tokens_emilia.txt" \
--manifest-dir "data/fbank_emilia" \
--teacher-model zipvoice/exp_zipvoice_distill_1stage/iter-60000-avg-7.pt \
--num-updates 2000 \
--distill-stage "second" \
--exp-dir zipvoice/exp_zipvoice_distill
2.2 Training on LibriTTS
2.2.1 Train the ZipVoice model
- Training:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/train_flow.py \
--world-size 8 \
--use-fp16 1 \
--dataset libritts \
--max-duration 250 \
--lr-epochs 10 \
--lr-batches 7500 \
--token-file "data/tokens_libritts.txt" \
--manifest-dir "data/fbank_libritts" \
--num-epochs 60 \
--exp-dir zipvoice/exp_zipvoice_libritts
- Average the checkpoints to produce the final model:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/generate_averaged_model.py \
--epoch 60 \
--avg 10 \
--distill 0 \
--token-file data/tokens_libritts.txt \
--dataset "libritts" \
--exp-dir ./zipvoice/exp_zipvoice_libritts
# The generated model is zipvoice/exp_zipvoice_libritts/epoch-60-avg-10.pt
2.2.2 Train the ZipVoice-Distill model (Optional)
- The first-stage distillation:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/train_distill.py \
--world-size 8 \
--use-fp16 1 \
--tensorboard 1 \
--dataset "libritts" \
--base-lr 0.001 \
--max-duration 250 \
--token-file "data/tokens_libritts.txt" \
--manifest-dir "data/fbank_libritts" \
--teacher-model zipvoice/exp_zipvoice_libritts/epoch-60-avg-10.pt \
--num-epochs 6 \
--distill-stage "first" \
--exp-dir zipvoice/exp_zipvoice_distill_1stage_libritts
- Average checkpoints for the second-stage initialization:
export PYTHONPATH=../../:$PYTHONPATH
python3 ./zipvoice/generate_averaged_model.py \
--epoch 6 \
--avg 3 \
--distill 1 \
--token-file data/tokens_libritts.txt \
--dataset "libritts" \
--exp-dir ./zipvoice/exp_zipvoice_distill_1stage_libritts
# The generated model is zipvoice/exp_zipvoice_distill_1stage_libritts/epoch-6-avg-3.pt
- The second-stage distillation:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/train_distill.py \
--world-size 8 \
--use-fp16 1 \
--tensorboard 1 \
--dataset "libritts" \
--base-lr 0.001 \
--max-duration 250 \
--token-file "data/tokens_libritts.txt" \
--manifest-dir "data/fbank_libritts" \
--teacher-model zipvoice/exp_zipvoice_distill_1stage_libritts/epoch-6-avg-3.pt \
--num-epochs 6 \
--distill-stage "second" \
--exp-dir zipvoice/exp_zipvoice_distill_libritts
- Average checkpoints to produce the final model:
export PYTHONPATH=../../:$PYTHONPATH
python3 ./zipvoice/generate_averaged_model.py \
--epoch 6 \
--avg 3 \
--distill 1 \
--token-file data/tokens_libritts.txt \
--dataset "libritts" \
--exp-dir ./zipvoice/exp_zipvoice_distill_libritts
# The generated model is ./zipvoice/exp_zipvoice_distill_libritts/epoch-6-avg-3.pt
3. Inference with the trained model
3.1 Inference with the model trained on Emilia
3.1.1 ZipVoice model (before distillation):
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/infer.py \
--checkpoint zipvoice/exp_zipvoice/epoch-11-avg-4.pt \
--distill 0 \
--token-file "data/tokens_emilia.txt" \
--test-list test.tsv \
--res-dir results/test \
--num-step 16 \
--guidance-scale 1
3.1.2 ZipVoice-Distill model:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/infer.py \
--checkpoint zipvoice/exp_zipvoice_distill/checkpoint-2000.pt \
--distill 1 \
--token-file "data/tokens_emilia.txt" \
--test-list test.tsv \
--res-dir results/test_distill \
--num-step 8 \
--guidance-scale 3
3.2 Inference with the model trained on LibriTTS
3.2.1 ZipVoice model (before distillation):
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/infer.py \
--checkpoint zipvoice/exp_zipvoice_libritts/epoch-60-avg-10.pt \
--distill 0 \
--token-file "data/tokens_libritts.txt" \
--test-list test.tsv \
--res-dir results/test_libritts \
--num-step 8 \
--guidance-scale 1 \
--target-rms 1.0 \
--t-shift 0.7
3.2.2 ZipVoice-Distill model:
export PYTHONPATH=../../:$PYTHONPATH
python3 zipvoice/infer.py \
--checkpoint zipvoice/exp_zipvoice_distill_libritts/epoch-6-avg-3.pt \
--distill 1 \
--token-file "data/tokens_libritts.txt" \
--test-list test.tsv \
--res-dir results/test_distill_libritts \
--num-step 4 \
--guidance-scale 3 \
--target-rms 1.0 \
--t-shift 0.7
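In these commands, --num-step sets the number of ODE solver steps and --guidance-scale controls classifier-free guidance. The sketch below illustrates how a flow-matching sampler typically combines the two; it is illustrative only, and the model interface (model(x, t, cond), with cond=None for the unconditional branch) is an assumption, not the actual API in zipvoice/infer.py:
import torch

def euler_sample(model, x, cond, num_step: int, guidance_scale: float):
    # Illustrative Euler sampler with classifier-free guidance.
    # Assumption (not from the repository): model(x, t, cond) predicts the
    # flow-matching velocity, and cond=None yields the unconditional branch.
    ts = torch.linspace(0.0, 1.0, num_step + 1)
    for i in range(num_step):
        t, dt = ts[i], ts[i + 1] - ts[i]
        v_cond = model(x, t, cond)
        v_uncond = model(x, t, None)
        # One common convention: amplify the conditional/unconditional gap;
        # guidance_scale == 1 reduces to the purely conditional prediction.
        v = v_uncond + guidance_scale * (v_cond - v_uncond)
        x = x + dt * v  # one Euler step along the learned flow
    return x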
4. Evaluation on benchmarks
See local/evaluate.sh for details of the objective-metric evaluation on three test sets: LibriSpeech-PC test-clean, Seed-TTS test-en, and Seed-TTS test-zh.
Citation
@article{zhu-2025-zipvoice,
  title={ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching},
  author={Han Zhu and Wei Kang and Zengwei Yao and Liyong Guo and Fangjun Kuang and Zhaoqing Li and Weiji Zhuang and Long Lin and Daniel Povey},
  journal={arXiv preprint arXiv:2506.13053},
  year={2025}
}