This commit is contained in:
yaozengwei 2023-11-28 10:50:07 +08:00
parent a983dcd469
commit 5ab142842e
3 changed files with 115 additions and 1 deletion


@@ -0,0 +1,7 @@
TTS
======
.. toctree::
   :maxdepth: 2

   ljspeech/vits


@@ -0,0 +1,106 @@
VITS
===============
This tutorial shows you how to train a VITS model
with the `LJSpeech <https://keithito.com/LJ-Speech-Dataset/>`_ dataset.

.. note::

   The VITS paper: `Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech <https://arxiv.org/pdf/2106.06103.pdf>`_
Data preparation
----------------

.. code-block:: bash

   $ cd egs/ljspeech/TTS
   $ ./prepare.sh
To run stage 1 to stage 5, use

.. code-block:: bash

   $ ./prepare.sh --stage 1 --stop_stage 5
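
The same ``--stage``/``--stop_stage`` flags select any sub-range of stages; for
example, to re-run only stage 3:

.. code-block:: bash

   $ ./prepare.sh --stage 3 --stop_stage 3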
Build Monotonic Alignment Search
--------------------------------

.. code-block:: bash

   $ cd vits/monotonic_align
   $ python setup.py build_ext --inplace
   $ cd ../../
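
To verify that the extension was built successfully, you can try importing it
from the ``vits`` directory (the module name ``monotonic_align`` is assumed
here from the directory name above):

.. code-block:: bash

   $ cd vits
   $ python -c "import monotonic_align"  # exits silently if the build succeeded
   $ cd ../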
Training
--------

.. code-block:: bash

   $ export CUDA_VISIBLE_DEVICES="0,1,2,3"
   $ ./vits/train.py \
       --world-size 4 \
       --num-epochs 1000 \
       --start-epoch 1 \
       --use-fp16 1 \
       --exp-dir vits/exp \
       --tokens data/tokens.txt \
       --max-duration 500

.. note::

   You can adjust the hyper-parameters to control the size of the VITS model and
   the training configurations. For more details, please run ``./vits/train.py --help``.

.. note::

   The training can take a long time (usually a couple of days).

Training logs, checkpoints and tensorboard logs are saved in ``vits/exp``.
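
To monitor training progress, you can point TensorBoard at the experiment
directory (the ``vits/exp/tensorboard`` sub-directory assumed below follows
the common icefall layout):

.. code-block:: bash

   $ tensorboard --logdir vits/exp/tensorboard --port 6006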
Inference
---------
The inference part uses checkpoints saved by the training part, so you have to run the
training part first. It will save the ground-truth and generated wavs to the directory
``vits/exp/infer/epoch-*/wav``, e.g., ``vits/exp/infer/epoch-1000/wav``.

.. code-block:: bash

   $ export CUDA_VISIBLE_DEVICES="0"
   $ ./vits/infer.py \
       --epoch 1000 \
       --exp-dir vits/exp \
       --tokens data/tokens.txt \
       --max-duration 500

.. note::

   For more details, please run ``./vits/infer.py --help``.
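
After inference finishes, you can inspect the generated audio directly, for
example (``soxi`` comes with `SoX <https://sox.sourceforge.net/>`_ and is
optional):

.. code-block:: bash

   $ ls vits/exp/infer/epoch-1000/wav
   $ soxi vits/exp/infer/epoch-1000/wav/*.wav  # prints duration, sample rate, etc.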
Export models
-------------
Currently we only support exporting to ONNX. It will generate two files in the given ``exp-dir``:
``vits-epoch-*.onnx`` (float32) and ``vits-epoch-*.int8.onnx`` (its int8-quantized counterpart).

.. code-block:: bash

   $ ./vits/export-onnx.py \
       --epoch 1000 \
       --exp-dir vits/exp \
       --tokens data/tokens.txt
You can test the exported ONNX model with:

.. code-block:: bash

   $ ./vits/test_onnx.py \
       --model-filename vits/exp/vits-epoch-1000.onnx \
       --tokens data/tokens.txt
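
As a quick sanity check independent of ``test_onnx.py``, you can load the
exported model with ``onnxruntime`` and list its input/output names (the exact
names are determined by ``export-onnx.py``, so treat this only as an
inspection aid):

.. code-block:: bash

   $ python -c "
   import onnxruntime as ort
   sess = ort.InferenceSession('vits/exp/vits-epoch-1000.onnx')
   print('inputs: ', [i.name for i in sess.get_inputs()])
   print('outputs:', [o.name for o in sess.get_outputs()])
   "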


@@ -2,7 +2,7 @@ Recipes
=======
This page contains various recipes in ``icefall``.
Currently, only speech recognition recipes are provided.
Currently, we provide recipes for speech recognition, language model, and speech synthesis.
We may add recipes for other tasks as well in the future.
@@ -16,3 +16,4 @@ We may add recipes for other tasks as well in the future.
Non-streaming-ASR/index
Streaming-ASR/index
RNN-LM/index
TTS/index