From 0e0ad4449ecfb184cf527bffdc59975f6493d4b5 Mon Sep 17 00:00:00 2001
From: Fangjun Kuang
Date: Mon, 30 Dec 2024 16:13:13 +0800
Subject: [PATCH] Update README

---
 egs/baker_zh/TTS/README.md       | 119 +++++++++++++++++++++++++++++++
 egs/baker_zh/TTS/matcha/infer.py |  10 +++
 2 files changed, 129 insertions(+)

diff --git a/egs/baker_zh/TTS/README.md b/egs/baker_zh/TTS/README.md
index 01caaa0e7..67241ca19 100644
--- a/egs/baker_zh/TTS/README.md
+++ b/egs/baker_zh/TTS/README.md
@@ -5,3 +5,122 @@ https://en.data-baker.com/datasets/freeDatasets/
 
 The dataset contains 10000 Chinese sentences of a native Chinese female speaker.
 
+
+# matcha
+
+[./matcha](./matcha) contains the code for training [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS)
+
+Checkpoints and training logs can be found [here](https://huggingface.co/csukuangfj/icefall-tts-baker-matcha-zh-2024-12-27).
+The pull-request for this recipe can be found at
+
+The training command is given below:
+```bash
+python3 ./matcha/train.py \
+  --exp-dir ./matcha/exp-1/ \
+  --num-workers 4 \
+  --world-size 1 \
+  --num-epochs 2000 \
+  --max-duration 1200 \
+  --bucketing-sampler 1 \
+  --start-epoch 1
+```
+
+To run inference, use:
+
+```bash
+# Download Hifigan vocoder. We use Hifigan v2 below. You can select from v1, v2, or v3
+
+wget https://github.com/csukuangfj/models/raw/refs/heads/master/hifigan/generator_v2
+
+python3 ./matcha/infer.py \
+  --epoch 2000 \
+  --exp-dir ./matcha/exp-1 \
+  --vocoder ./generator_v2 \
+  --tokens ./data/tokens.txt \
+  --cmvn ./data/fbank/cmvn.json \
+  --input-text "当夜幕降临,星光点点,伴随着微风拂面,我在静谧中感受着时光的流转,思念如涟漪荡漾,梦境如画卷展开,我与自然融为一体,沉静在这片宁静的美丽之中,感受着生命的奇迹与温柔。" \
+  --output-wav ./generated.wav
+```
+
+```bash
+soxi ./generated.wav
+```
+
+prints:
+```
+Input File     : './generated.wav'
+Channels       : 1
+Sample Rate    : 22050
+Precision      : 16-bit
+Duration       : 00:00:17.31 = 381696 samples ~ 1298.29 CDDA sectors
+File Size      : 763k
+Bit Rate       : 353k
+Sample Encoding: 16-bit Signed Integer PCM
+```
+
+https://github.com/user-attachments/assets/88d4e88f-ebc4-4f32-b216-16d46b966024
+
+
+To export the checkpoint to onnx:
+```bash
+python3 ./matcha/export_onnx.py \
+  --exp-dir ./matcha/exp-1 \
+  --epoch 2000 \
+  --tokens ./data/tokens.txt \
+  --cmvn ./data/fbank/cmvn.json
+```
+
+The above command generates the following files:
+```
+-rw-r--r-- 1 kuangfangjun root 72M Dec 27 18:53 model-steps-2.onnx
+-rw-r--r-- 1 kuangfangjun root 73M Dec 27 18:54 model-steps-3.onnx
+-rw-r--r-- 1 kuangfangjun root 73M Dec 27 18:54 model-steps-4.onnx
+-rw-r--r-- 1 kuangfangjun root 74M Dec 27 18:55 model-steps-5.onnx
+-rw-r--r-- 1 kuangfangjun root 74M Dec 27 18:57 model-steps-6.onnx
+```
+
+where the 2 in `model-steps-2.onnx` means it uses 2 steps for the ODE solver.
+
+
+To export the Hifigan vocoder to onnx, please use:
+
+```bash
+wget https://github.com/csukuangfj/models/raw/refs/heads/master/hifigan/generator_v1
+wget https://github.com/csukuangfj/models/raw/refs/heads/master/hifigan/generator_v2
+wget https://github.com/csukuangfj/models/raw/refs/heads/master/hifigan/generator_v3
+
+python3 ./matcha/export_onnx_hifigan.py
+```
+
+The above command generates 3 files:
+
+  - hifigan_v1.onnx
+  - hifigan_v2.onnx
+  - hifigan_v3.onnx
+
+To use the generated onnx files to generate speech from text, please run:
+
+```bash
+python3 ./matcha/onnx_pretrained.py \
+  --acoustic-model ./model-steps-4.onnx \
+  --vocoder ./hifigan_v2.onnx \
+  --tokens ./data/tokens.txt \
+  --lexicon ./lexicon.txt \
+  --input-text "在一个阳光明媚的夏天,小马、小羊和小狗它们一块儿在广阔的草地上,嬉戏玩耍,这时小猴来了,还带着它心爱的足球活蹦乱跳地跑前、跑后教小马、小羊、小狗踢足球。" \
+  --output-wav ./1.wav
+```
+
+```bash
+soxi ./1.wav
+
+Input File     : './1.wav'
+Channels       : 1
+Sample Rate    : 22050
+Precision      : 16-bit
+Duration       : 00:00:16.37 = 360960 samples ~ 1227.76 CDDA sectors
+File Size      : 722k
+Bit Rate       : 353k
+Sample Encoding: 16-bit Signed Integer PCM
+```
+
+https://github.com/user-attachments/assets/578d04bb-fee8-47e5-9984-a868dcce610e
+
diff --git a/egs/baker_zh/TTS/matcha/infer.py b/egs/baker_zh/TTS/matcha/infer.py
index b7e785a04..3f18e4345 100755
--- a/egs/baker_zh/TTS/matcha/infer.py
+++ b/egs/baker_zh/TTS/matcha/infer.py
@@ -1,5 +1,15 @@
 #!/usr/bin/env python3
 # Copyright 2024 Xiaomi Corp. (authors: Fangjun Kuang)
+"""
+python3 ./matcha/infer.py \
+  --epoch 2000 \
+  --exp-dir ./matcha/exp-1 \
+  --vocoder ./generator_v2 \
+  --tokens ./data/tokens.txt \
+  --cmvn ./data/fbank/cmvn.json \
+  --input-text "当夜幕降临,星光点点,伴随着微风拂面,我在静谧中感受着时光的流转,思念如涟漪荡漾,梦境如画卷展开,我与自然融为一体,沉静在这片宁静的美丽之中,感受着生命的奇迹与温柔。" \
+  --output-wav ./generated.wav
+"""
 
 import argparse
 import datetime as dt
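
Note on the `soxi` output quoted in the README hunk: the reported durations follow directly from the sample count divided by the sample rate (22050 Hz). The sketch below is not part of the patch; it is a minimal cross-check using only Python's standard `wave` module, and the helper names `wav_summary` and `expected_duration` are hypothetical, introduced here for illustration. It assumes the generated files are plain PCM WAV, as the `soxi` output indicates.

```python
import wave


def wav_summary(path):
    """Return basic properties of a PCM WAV file (hypothetical helper).

    This reads the same header fields that `soxi` reports: channel count,
    sample rate, sample width and number of samples per channel.
    """
    with wave.open(path, "rb") as f:
        num_samples = f.getnframes()
        sample_rate = f.getframerate()
        return {
            "channels": f.getnchannels(),
            "sample_rate": sample_rate,
            "bits_per_sample": 8 * f.getsampwidth(),
            "num_samples": num_samples,
            "duration_seconds": num_samples / sample_rate,
        }


def expected_duration(num_samples, sample_rate=22050):
    """Duration in seconds implied by a sample count (hypothetical helper)."""
    return num_samples / sample_rate


if __name__ == "__main__":
    # Sample counts taken from the soxi output quoted in the README.
    print(round(expected_duration(381696), 2))  # 17.31, matches ./generated.wav
    print(round(expected_duration(360960), 2))  # 16.37, matches ./1.wav
```

Calling `wav_summary("./generated.wav")` after running the inference command should report one channel, a 22050 Hz sample rate and 16 bits per sample if the output matches what the README shows.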