docs for finetune zipformer (#1509)
commit 13daf73468 (parent c19b414778)

@@ -0,0 +1,140 @@

Finetune from a supervised pre-trained Zipformer model
======================================================

This tutorial shows you how to fine-tune a supervised pre-trained **Zipformer**
transducer model on a new dataset.

.. HINT::

   We assume you have read the page :ref:`install icefall` and have set up
   the environment for ``icefall``.

.. HINT::

   We recommend using one or more GPUs to run this recipe.

For illustration purposes, we fine-tune the Zipformer transducer model
pre-trained on `LibriSpeech`_ on the small subset of `GigaSpeech`_. You could use your
own data for fine-tuning if you create a manifest for your new dataset.

Data preparation
----------------

Please follow the instructions in the `GigaSpeech recipe <https://github.com/k2-fsa/icefall/tree/master/egs/gigaspeech/ASR>`_
to prepare the fine-tuning data used in this tutorial. Only the small subset of GigaSpeech is required.
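
If you have not run that recipe before, the overall workflow looks roughly like the
sketch below. The exact options (for example, how to select the small ``S`` subset) are
described in the recipe itself, and access to the GigaSpeech corpus has to be arranged
separately.

.. code-block:: bash

   # run from the icefall root
   $ cd egs/gigaspeech/ASR

   # prepare.sh creates the lhotse manifests and fbank features
   # consumed by the training and decoding scripts
   $ ./prepare.sh

   $ cd ../../..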

Model preparation
-----------------

We use the Zipformer model trained on the full LibriSpeech dataset (960 hours) as the initialization. The
checkpoint of the model can be downloaded via the following command:

.. code-block:: bash

   $ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15
   $ cd icefall-asr-librispeech-zipformer-2023-05-15/exp
   $ git lfs pull --include "pretrained.pt"
   $ ln -s pretrained.pt epoch-99.pt
   $ cd ../data/lang_bpe_500
   $ git lfs pull --include bpe.model
   $ cd ../../..

The symlink ``epoch-99.pt`` lets the decoding script below load the pre-trained checkpoint
via ``--epoch 99``.
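
As an optional sanity check, you can confirm that ``git lfs`` really fetched the checkpoint
rather than leaving a small pointer file behind:

.. code-block:: bash

   # a real checkpoint is hundreds of megabytes; a file of only a few
   # hundred bytes means the LFS pull did not succeed
   $ ls -lh icefall-asr-librispeech-zipformer-2023-05-15/exp/pretrained.pt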

Before fine-tuning, let's test the model's WER on the new domain. The following command performs
decoding on the GigaSpeech test sets:

.. code-block:: bash

   ./zipformer/decode_gigaspeech.py \
       --epoch 99 \
       --avg 1 \
       --exp-dir icefall-asr-librispeech-zipformer-2023-05-15/exp \
       --use-averaged-model 0 \
       --max-duration 1000 \
       --decoding-method greedy_search

You should see the following numbers:

.. code-block:: text

   For dev, WER of different settings are:
   greedy_search 20.06 best for dev

   For test, WER of different settings are:
   greedy_search 19.27 best for test

Fine-tune
---------

Since LibriSpeech and GigaSpeech are both English datasets, we can initialize the whole
Zipformer model with the checkpoint downloaded in the previous step (otherwise we should consider
initializing the stateless decoder and joiner from scratch due to the mismatch of the output
vocabulary). The following command starts a fine-tuning experiment:

.. code-block:: bash

   $ use_mux=0
   $ do_finetune=1

   $ ./zipformer/finetune.py \
       --world-size 2 \
       --num-epochs 20 \
       --start-epoch 1 \
       --exp-dir zipformer/exp_giga_finetune${do_finetune}_mux${use_mux} \
       --use-fp16 1 \
       --base-lr 0.0045 \
       --bpe-model data/lang_bpe_500/bpe.model \
       --do-finetune $do_finetune \
       --use-mux $use_mux \
       --master-port 13024 \
       --finetune-ckpt icefall-asr-librispeech-zipformer-2023-05-15/exp/pretrained.pt \
       --max-duration 1000
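
While the job is running, you can monitor the loss curves with TensorBoard. The command
below is a sketch: it assumes TensorBoard logging is enabled in ``finetune.py`` and that the
event files are written to a ``tensorboard`` sub-directory of the experiment directory,
which is the usual icefall layout.

.. code-block:: bash

   # assumes logs live under <exp-dir>/tensorboard
   $ tensorboard --logdir zipformer/exp_giga_finetune1_mux0/tensorboard --port 6006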

The following arguments are related to fine-tuning:

- ``--base-lr``

  The learning rate used for fine-tuning. We suggest setting a **small** learning rate for
  fine-tuning; otherwise the model may forget its initialization very quickly. A reasonable
  value is around 1/10 of the original learning rate, i.e., 0.0045 here.

- ``--do-finetune``

  If True, perform fine-tuning by initializing the model from a pre-trained checkpoint.
  **Note that if you want to resume a fine-tuning experiment from a certain epoch, you
  need to set this to False** (see the example after this list).

- ``--finetune-ckpt``

  The path to the pre-trained checkpoint (used for initialization).

- ``--use-mux``

  If True, mix the fine-tuning data with the original training data using `CutSet.mux <https://lhotse.readthedocs.io/en/latest/api.html#lhotse.supervision.SupervisionSet.mux>`_.
  This helps maintain the model's performance on the original domain if the original training
  data is available. **If you don't have the original training data, please set it to False.**
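
For example, to resume fine-tuning from the checkpoint saved at epoch 10 rather than
re-initializing from the pre-trained model, the command might look like the following
sketch. The epoch number is arbitrary, and the experiment directory is written out
explicitly to match the one created above.

.. code-block:: bash

   # --do-finetune 0 skips checkpoint initialization;
   # --start-epoch 11 continues from epoch-10.pt saved in --exp-dir
   $ ./zipformer/finetune.py \
       --world-size 2 \
       --num-epochs 20 \
       --start-epoch 11 \
       --exp-dir zipformer/exp_giga_finetune1_mux0 \
       --use-fp16 1 \
       --base-lr 0.0045 \
       --bpe-model data/lang_bpe_500/bpe.model \
       --do-finetune 0 \
       --use-mux 0 \
       --master-port 13024 \
       --max-duration 1000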

After fine-tuning, let's test the WERs. You can do this via the following command:

.. code-block:: bash

   $ use_mux=0
   $ do_finetune=1
   $ ./zipformer/decode_gigaspeech.py \
       --epoch 20 \
       --avg 10 \
       --exp-dir zipformer/exp_giga_finetune${do_finetune}_mux${use_mux} \
       --use-averaged-model 1 \
       --max-duration 1000 \
       --decoding-method greedy_search

You should see numbers similar to the ones below:

.. code-block:: text

   For dev, WER of different settings are:
   greedy_search 13.47 best for dev

   For test, WER of different settings are:
   greedy_search 13.66 best for test

Compared to the original checkpoint, the fine-tuned model achieves much lower WERs
on the GigaSpeech test sets.

docs/source/recipes/Finetune/index.rst (new file, 15 lines)

@@ -0,0 +1,15 @@

Fine-tune a pre-trained model
=============================

After pre-training on publicly available datasets, the ASR model is already capable of
performing general speech recognition with relatively high accuracy. However, the accuracy
could still be low on certain domains that are quite different from the original training
set. In this case, we can fine-tune the model with a small amount of additional labelled
data to improve the performance on new domains.

.. toctree::
   :maxdepth: 2
   :caption: Table of Contents

   from_supervised/finetune_zipformer

@@ -17,3 +17,4 @@ We may add recipes for other tasks as well in the future.

   Streaming-ASR/index
   RNN-LM/index
   TTS/index
   Finetune/index