VITS-LJSpeech
=============

This tutorial shows you how to train a VITS model with the
`LJSpeech <https://keithito.com/LJ-Speech-Dataset/>`_ dataset.

.. note::

   TTS related recipes require packages in ``requirements-tts.txt``.

.. note::

   The VITS paper:
   `Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech <https://arxiv.org/abs/2106.06103>`_

Install extra dependencies
--------------------------

.. code-block:: bash

   pip install piper_phonemize -f https://k2-fsa.github.io/icefall/piper_phonemize.html
   pip install numba espnet_tts_frontend cython

Data preparation
----------------

.. code-block:: bash

   $ cd egs/ljspeech/TTS
   $ ./prepare.sh

To run stage 1 to stage 5, use

.. code-block:: bash

   $ ./prepare.sh --stage 1 --stop_stage 5

Build Monotonic Alignment Search
--------------------------------

.. code-block:: bash

   $ ./prepare.sh --stage -1 --stop_stage -1

or

.. code-block:: bash

   $ cd vits/monotonic_align
   $ python setup.py build_ext --inplace
   $ cd ../../

Training
--------

.. code-block:: bash

   $ export CUDA_VISIBLE_DEVICES="0,1,2,3"
   $ ./vits/train.py \
     --world-size 4 \
     --num-epochs 1000 \
     --start-epoch 1 \
     --use-fp16 1 \
     --exp-dir vits/exp \
     --tokens data/tokens.txt \
     --model-type high \
     --max-duration 500

.. note::

   You can adjust the hyper-parameters to control the size of the VITS model
   and the training configurations. For more details, please run
   ``./vits/train.py --help``.

.. warning::

   If you want a model that runs faster on CPU, please use
   ``--model-type low`` or ``--model-type medium``.

.. note::

   The training can take a long time (usually a couple of days).

Training logs, checkpoints and tensorboard logs are saved in ``vits/exp``.

Inference
---------

The inference part uses checkpoints saved by the training part, so you have to
run the training part first. It will save the ground-truth and generated wavs
to the directory ``vits/exp/infer/epoch-*/wav``, e.g.,
``vits/exp/infer/epoch-1000/wav``.

.. code-block:: bash

   $ export CUDA_VISIBLE_DEVICES="0"
   $ ./vits/infer.py \
     --epoch 1000 \
     --exp-dir vits/exp \
     --tokens data/tokens.txt \
     --max-duration 500

.. note::

   For more details, please run ``./vits/infer.py --help``.

Export models
-------------

Currently we only support ONNX model exporting. It will generate one file in
the given ``exp-dir``: ``vits-epoch-*.onnx``.

.. code-block:: bash

   $ ./vits/export-onnx.py \
     --epoch 1000 \
     --exp-dir vits/exp \
     --tokens data/tokens.txt

You can test the exported ONNX model with:

.. code-block:: bash

   $ ./vits/test_onnx.py \
     --model-filename vits/exp/vits-epoch-1000.onnx \
     --tokens data/tokens.txt
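If you want to take a quick look at what the export produced before deploying
it, you can inspect the file with ``onnxruntime``. The following is a minimal
sketch (not part of the recipe; install ``onnxruntime`` with pip if needed)
that assumes only the ``vits-epoch-1000.onnx`` path used above and prints
whatever input/output names and metadata the export script recorded:

.. code-block:: python

   # Minimal sketch: inspect the exported model with onnxruntime.
   # Assumes only the file path used above; the printed input/output
   # names and metadata depend on how export-onnx.py wrote the model.
   import onnxruntime as ort

   session = ort.InferenceSession(
       "vits/exp/vits-epoch-1000.onnx",
       providers=["CPUExecutionProvider"],
   )

   # Custom metadata stored at export time (e.g., sample rate), if any.
   print("metadata:", session.get_modelmeta().custom_metadata_map)

   # Input/output names, shapes and element types.
   for node in session.get_inputs():
       print("input:", node.name, node.shape, node.type)
   for node in session.get_outputs():
       print("output:", node.name, node.shape, node.type)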
Download pretrained models
--------------------------

If you don't want to train from scratch, you can download the pretrained
models by visiting the following links:

- ``--model-type=high``: ``_
- ``--model-type=medium``: ``_
- ``--model-type=low``: ``_

Usage in sherpa-onnx
--------------------

The following describes how to test the exported ONNX model in `sherpa-onnx`_.

.. hint::

   `sherpa-onnx`_ supports different programming languages, e.g., C++, C,
   Python, Kotlin, Java, Swift, Go, C#, etc. It also supports Android and iOS.
   We only describe how to use pre-built binaries from `sherpa-onnx`_ below.
   Please refer to ``_ for more documentation.

Install sherpa-onnx
^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   pip install sherpa-onnx

To check that you have installed `sherpa-onnx`_ successfully, please run:

.. code-block:: bash

   which sherpa-onnx-offline-tts
   sherpa-onnx-offline-tts --help

Download lexicon files
^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

   cd /tmp
   wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/espeak-ng-data.tar.bz2
   tar xf espeak-ng-data.tar.bz2

Run sherpa-onnx
^^^^^^^^^^^^^^^

.. code-block:: bash

   cd egs/ljspeech/TTS

   sherpa-onnx-offline-tts \
     --vits-model=vits/exp/vits-epoch-1000.onnx \
     --vits-tokens=data/tokens.txt \
     --vits-data-dir=/tmp/espeak-ng-data \
     --num-threads=1 \
     --output-filename=./high.wav \
     "Ask not what your country can do for you; ask what you can do for your country."

.. hint::

   You can also use ``sherpa-onnx-offline-tts-play`` to play the audio while
   it is being generated.

You should get a file ``high.wav`` after running the above command.

Congratulations! You have successfully trained and exported a text-to-speech
model and run it with `sherpa-onnx`_.
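If you prefer driving `sherpa-onnx`_ from Python rather than from the command
line, the Python bindings expose the same functionality. The sketch below is a
minimal example assuming the ``sherpa_onnx`` and ``soundfile`` Python packages
and the same model, tokens, and espeak-ng-data paths as the CLI command above;
the exact configuration field names can vary between sherpa-onnx releases, so
check the Python examples shipped with your installed version.

.. code-block:: python

   # Minimal sketch of the sherpa-onnx Python API, mirroring the CLI
   # command above. Paths match the CLI example; the exact config fields
   # may differ across sherpa-onnx versions.
   import sherpa_onnx
   import soundfile as sf

   config = sherpa_onnx.OfflineTtsConfig(
       model=sherpa_onnx.OfflineTtsModelConfig(
           vits=sherpa_onnx.OfflineTtsVitsModelConfig(
               model="vits/exp/vits-epoch-1000.onnx",
               tokens="data/tokens.txt",
               data_dir="/tmp/espeak-ng-data",
           ),
           num_threads=1,
       ),
   )
   tts = sherpa_onnx.OfflineTts(config)

   audio = tts.generate(
       "Ask not what your country can do for you; "
       "ask what you can do for your country."
   )
   sf.write("high.wav", audio.samples, samplerate=audio.sample_rate)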