# HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

### Jungil Kong, Jaehyeon Kim, Jaekyoung Bae

In our [paper](https://arxiv.org/abs/2010.05646),
we proposed HiFi-GAN: a GAN-based model capable of generating high-fidelity speech efficiently.<br/>
We provide our implementation and pretrained models as open source in this repository.

**Abstract:**
Several recent works on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms.
Although such methods improve the sampling efficiency and memory usage,
their sample quality has not yet reached that of autoregressive and flow-based generative models.
In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis.
As speech audio consists of sinusoidal signals with various periods,
we demonstrate that modeling the periodic patterns of an audio signal is crucial for enhancing sample quality.
A subjective human evaluation (mean opinion score, MOS) on a single-speaker dataset indicates that our proposed method
demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than
real-time on a single V100 GPU. We further show the generality of HiFi-GAN to the mel-spectrogram inversion of unseen
speakers and to end-to-end speech synthesis. Finally, a small-footprint version of HiFi-GAN generates samples 13.4 times
faster than real-time on CPU with quality comparable to an autoregressive counterpart.

Visit our [demo website](https://jik876.github.io/hifi-gan-demo/) for audio samples.
## Pre-requisites

1. Python >= 3.6
2. Clone this repository.
3. Install the Python requirements; please refer to [requirements.txt](requirements.txt).
4. Download and extract the [LJ Speech dataset](https://keithito.com/LJ-Speech-Dataset/), and move all wav files to `LJSpeech-1.1/wavs` (a setup sketch follows this list).
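
A minimal setup sketch of the steps above. The repository URL and dataset archive URL are assumptions (the upstream HiFi-GAN repository and the archive linked from the LJ Speech page); adjust them to your setup.

```
git clone https://github.com/jik876/hifi-gan.git   # assumed upstream repository
cd hifi-gan
pip install -r requirements.txt

# Download and extract LJ Speech; the archive already places the wav files in LJSpeech-1.1/wavs.
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar -xjf LJSpeech-1.1.tar.bz2
```
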
## Training

```
python train.py --config config_v1.json
```

To train the V2 or V3 generator, replace `config_v1.json` with `config_v2.json` or `config_v3.json`.<br>
Checkpoints and a copy of the configuration file are saved in the `cp_hifigan` directory by default.<br>
You can change the path with the `--checkpoint_path` option.
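
For example, a V2 training run that writes checkpoints to a custom directory could look like this (the directory name is only an illustration):

```
python train.py --config config_v2.json --checkpoint_path cp_hifigan_v2
```
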

Validation loss during training with V1 generator.<br>


## Pretrained Model

You can also use the pretrained models we provide.<br/>
[Download pretrained models](https://drive.google.com/drive/folders/1-eEYTB5Av9jNql0WGBlRoi-WH2J7bp5Y?usp=sharing)<br/>
Details of each folder are as follows:

| Folder Name  | Generator | Dataset   | Fine-Tuned                                             |
| ------------ | --------- | --------- | ------------------------------------------------------ |
| LJ_V1        | V1        | LJSpeech  | No                                                     |
| LJ_V2        | V2        | LJSpeech  | No                                                     |
| LJ_V3        | V3        | LJSpeech  | No                                                     |
| LJ_FT_T2_V1  | V1        | LJSpeech  | Yes ([Tacotron2](https://github.com/NVIDIA/tacotron2)) |
| LJ_FT_T2_V2  | V2        | LJSpeech  | Yes ([Tacotron2](https://github.com/NVIDIA/tacotron2)) |
| LJ_FT_T2_V3  | V3        | LJSpeech  | Yes ([Tacotron2](https://github.com/NVIDIA/tacotron2)) |
| VCTK_V1      | V1        | VCTK      | No                                                     |
| VCTK_V2      | V2        | VCTK      | No                                                     |
| VCTK_V3      | V3        | VCTK      | No                                                     |
| UNIVERSAL_V1 | V1        | Universal | No                                                     |

We provide the universal model with discriminator weights that can be used as a base for transfer learning to other datasets.
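
For transfer learning, one possible workflow is sketched below, assuming `train.py` resumes from the latest `g_*`/`do_*` checkpoints it finds in `--checkpoint_path`; check the training script and the actual file names inside `UNIVERSAL_V1` before relying on this.

```
# Placeholder names: copy the universal generator/discriminator checkpoints into the
# checkpoint directory so that training on a new dataset starts from them.
mkdir -p cp_hifigan
cp UNIVERSAL_V1/g_* UNIVERSAL_V1/do_* cp_hifigan/
python train.py --config config_v1.json --checkpoint_path cp_hifigan
```
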

## Fine-Tuning

1. Generate mel-spectrograms in numpy format using [Tacotron2](https://github.com/NVIDIA/tacotron2) with teacher-forcing.<br/>
   The file name of each generated mel-spectrogram should match that of the corresponding audio file, and the extension should be `.npy`.<br/>
   Example:<br/>
   `Audio File : LJ001-0001.wav`<br/>
   `Mel-Spectrogram File : LJ001-0001.npy`
2. Create a `ft_dataset` folder and copy the generated mel-spectrogram files into it.<br/>
3. Run the following command (a filename sanity check is sketched at the end of this section).

```
python train.py --fine_tuning True --config config_v1.json
```

For other command line options, please refer to the training section.
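
Before launching fine-tuning, it can be worth verifying the naming convention from step 1. A minimal check, assuming the LJSpeech layout from the pre-requisites:

```
# List wav files that do not yet have a matching mel-spectrogram in ft_dataset.
for wav in LJSpeech-1.1/wavs/*.wav; do
  base=$(basename "$wav" .wav)
  [ -f "ft_dataset/${base}.npy" ] || echo "missing mel for ${base}"
done
```
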

## Inference from wav file

1. Make a `test_files` directory and copy wav files into it.
2. Run the following command.<br/>
   `python inference.py --checkpoint_file [generator checkpoint file path]`

Generated wav files are saved in `generated_files` by default.<br>
You can change the path with the `--output_dir` option.
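
For instance, with a trained generator checkpoint (the checkpoint path below is a placeholder):

```
mkdir -p test_files
cp /path/to/your/*.wav test_files/
# g_02500000 is a placeholder name; point --checkpoint_file at your generator checkpoint
python inference.py --checkpoint_file cp_hifigan/g_02500000 --output_dir generated_files
```
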

## Inference for end-to-end speech synthesis

1. Make a `test_mel_files` directory and copy generated mel-spectrogram files into it.<br>
   You can generate mel-spectrograms using [Tacotron2](https://github.com/NVIDIA/tacotron2),
   [Glow-TTS](https://github.com/jaywalnut310/glow-tts), and so forth.
2. Run the following command.<br/>
   `python inference_e2e.py --checkpoint_file [generator checkpoint file path]`

Generated wav files are saved in `generated_files_from_mel` by default.<br>
You can change the path with the `--output_dir` option.
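
As above, assuming a directory of `.npy` mel-spectrograms and a generator checkpoint (both paths below are placeholders):

```
mkdir -p test_mel_files
cp /path/to/mels/*.npy test_mel_files/
python inference_e2e.py --checkpoint_file cp_hifigan/g_02500000 --output_dir generated_files_from_mel
```
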

## Acknowledgements

We referred to [WaveGlow](https://github.com/NVIDIA/waveglow), [MelGAN](https://github.com/descriptinc/melgan-neurips),
and [Tacotron2](https://github.com/NVIDIA/tacotron2) to implement this.