This recipe uses GigaSpeech + LibriSpeech during training.
(1) and (2) use the same model architecture. The only difference is that (2) supports
multi-dataset training. Since (2) uses more data, it achieves a lower WER than (1), but it
needs more training time.
We use lstm_transducer_stateless2 as an example below.
Note
You need to download the GigaSpeech dataset
to run (2). If you have only the LibriSpeech dataset available, feel free to use (1).
$ cd egs/librispeech/ASR
$ ./prepare.sh
# If you use (1), you can **skip** the following command
$ ./prepare_giga_speech.sh
The script ./prepare.sh handles the data preparation for you, automagically.
All you need to do is to run it.
Note
We encourage you to read ./prepare.sh.
The data preparation contains several stages. You can use the following two
options:
--stage
--stop-stage
to control which stage(s) should be run. By default, all stages are executed.
For example,
$ cd egs/librispeech/ASR
$ ./prepare.sh --stage 0 --stop-stage 0
means to run only stage 0.
To run stage 2 to stage 5, use:
$ ./prepare.sh --stage 2 --stop-stage 5
Hint
If you have pre-downloaded the LibriSpeech
dataset and the musan dataset, say, to /tmp/LibriSpeech and /tmp/musan, you can modify
the dl_dir variable in ./prepare.sh to point to /tmp so that
./prepare.sh won’t re-download them.
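For example, the corresponding line in ./prepare.sh would then read something like the
following (a sketch only; the exact default value in the script may differ):
# in ./prepare.sh
dl_dir=/tmp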
Note
All files generated by ./prepare.sh, e.g., features, lexicon, etc.,
are saved in the ./data directory.
We provide the following YouTube video showing how to run ./prepare.sh.
Note
To get the latest news of next-gen Kaldi, please subscribe to
the following YouTube channel by Nadira Povey:
$ cd egs/librispeech/ASR
$ ./lstm_transducer_stateless2/train.py --help
shows you the training options that can be passed from the commandline.
The following options are used quite often (see the combined example after this list):
--full-libri
If it’s True, the training part uses all the training data, i.e.,
960 hours. Otherwise, the training part uses only the subset
train-clean-100, which has 100 hours of training data.
Caution
The training set is perturbed by speed with two factors: 0.9 and 1.1.
If --full-libri is True, each epoch actually processes
3x960==2880 hours of data.
--num-epochs
It is the number of epochs to train. For instance,
./lstm_transducer_stateless2/train.py --num-epochs 30 trains for 30 epochs
and generates epoch-1.pt, epoch-2.pt, …, epoch-30.pt
in the folder ./lstm_transducer_stateless2/exp.
--start-epoch
It’s used to resume training.
./lstm_transducer_stateless2/train.py --start-epoch 10 loads the
checkpoint ./lstm_transducer_stateless2/exp/epoch-9.pt and starts
training from epoch 10, based on the state from epoch 9.
--world-size
It is used for multi-GPU single-machine DDP training.
If it is 1, then no DDP training is used.
If it is 2, then GPU 0 and GPU 1 are used for DDP training.
The following shows some use cases with it.
Use case 1: You have 4 GPUs, but you only want to use GPU 0 and
GPU 2 for training. You can do the following:
$ cd egs/librispeech/ASR
$ exportCUDA_VISIBLE_DEVICES="0,2"
$ ./lstm_transducer_stateless2/train.py --world-size 2
Use case 2: You have 4 GPUs and you want to use all of them
for training. You can do the following:
$ cd egs/librispeech/ASR
$ ./lstm_transducer_stateless2/train.py --world-size 4
Use case 3: You have 4 GPUs but you only want to use GPU 3
for training. You can do the following:
$ cd egs/librispeech/ASR
$ exportCUDA_VISIBLE_DEVICES="3"
$ ./lstm_transducer_stateless2/train.py --world-size 1
Caution
Only multi-GPU single-machine DDP training is implemented at present.
Multi-GPU multi-machine DDP training will be added later.
--max-duration
It specifies the total duration in seconds of all utterances in a
batch, before padding.
If you encounter CUDA OOM, please reduce it.
Hint
Due to padding, the number of seconds of all utterances in a
batch will usually be larger than --max-duration.
A larger value for --max-duration may cause OOM during training,
while a smaller value may increase the training time. You have to
tune it.
--giga-prob
The probability to select a batch from the GigaSpeech dataset.
Note: It is available only for (2).
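Putting the options above together, a typical invocation looks like the following sketch.
The values shown are placeholders, not recommended settings, and --giga-prob applies only to (2):
$ cd egs/librispeech/ASR
$ export CUDA_VISIBLE_DEVICES="0,1,2,3"
$ ./lstm_transducer_stateless2/train.py \
    --world-size 4 \
    --full-libri 1 \
    --num-epochs 30 \
    --start-epoch 1 \
    --max-duration 500 \
    --giga-prob 0.9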
There are some training options, e.g., weight decay,
number of warmup steps, results dir, etc.,
that are not passed from the commandline.
They are pre-configured by the function get_params() in
lstm_transducer_stateless2/train.py.
You don’t need to change these pre-configured parameters. If you really need to change
them, please modify ./lstm_transducer_stateless2/train.py directly.
Training logs and checkpoints are saved in lstm_transducer_stateless2/exp.
You will find the following files in that directory:
epoch-1.pt, epoch-2.pt, …
These are checkpoint files saved at the end of each epoch, containing model
state_dict and optimizer state_dict.
To resume training from some checkpoint, say epoch-10.pt, you can use:
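# Loads ./lstm_transducer_stateless2/exp/epoch-10.pt and continues training from epoch 11
$ ./lstm_transducer_stateless2/train.py --start-epoch 11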
checkpoint-436000.pt, checkpoint-438000.pt, …
These are checkpoint files saved every --save-every-n batches,
containing model state_dict and optimizer state_dict.
To resume training from some checkpoint, say checkpoint-436000.pt, you can use:
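# The option name below is an assumption based on similar recipes;
# confirm it with ./lstm_transducer_stateless2/train.py --help
$ ./lstm_transducer_stateless2/train.py --start-batch 436000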
tensorboard/
This folder contains TensorBoard logs. Training loss, validation loss, learning
rate, etc., are recorded in these logs. You can visualize them by:
$ cd lstm_transducer_stateless2/exp/tensorboard
$ tensorboard dev upload --logdir . --description "LSTM transducer training for LibriSpeech with icefall"
It will print something like below:
TensorFlow installation not found - running with reduced feature set.
Upload started and will continue reading any new data as it's added to the logdir.
To stop uploading, press Ctrl-C.
New experiment created.
View your TensorBoard at: https://tensorboard.dev/experiment/cj2vtPiwQHKN9Q1tx6PTpg/
[2022-09-20T15:50:50] Started scanning logdir.
Uploading 4468 scalars...
[2022-09-20T15:53:02] Total uploaded: 210171 scalars, 0 tensors, 0 binary objects
Listening for new data in logdir...
Note that there is a URL in the above output. Click it to view the training
and validation curves in TensorBoard.
The decoding part uses checkpoints saved by the training part, so you have
to run the training part first.
Hint
There are two kinds of checkpoints:
(1) epoch-1.pt, epoch-2.pt, …, which are saved at the end
of each epoch. You can pass --epoch to
lstm_transducer_stateless2/decode.py to use them.
(2) checkpoint-436000.pt, checkpoint-438000.pt, …, which are saved
every --save-every-n batches. You can pass --iter to
lstm_transducer_stateless2/decode.py to use them.
We suggest that you try both types of checkpoints and choose the one
that produces the lowest WERs.
$ cd egs/librispeech/ASR
$ ./lstm_transducer_stateless2/decode.py --help
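shows you the decoding options that can be passed from the commandline.
For example, a minimal sketch of decoding with epoch checkpoints might look like the
following (the --epoch and --avg values are placeholders, not tuned settings):
$ ./lstm_transducer_stateless2/decode.py \
    --epoch 30 \
    --avg 10 \
    --exp-dir ./lstm_transducer_stateless2/exp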
Checkpoints saved by lstm_transducer_stateless2/train.py also include
optimizer.state_dict(). It is useful for resuming training. But after training,
we are interested only in model.state_dict(). You can use the following
command to extract model.state_dict().
# Assume that --iter 468000 --avg 16 produces the smallest WER
# (You can get such information after running ./lstm_transducer_stateless2/decode.py)
iter=468000
avg=16
./lstm_transducer_stateless2/export.py \
--exp-dir ./lstm_transducer_stateless2/exp \
--bpe-model data/lang_bpe_500/bpe.model \
--iter $iter \
--avg $avg
It will generate a file ./lstm_transducer_stateless2/exp/pretrained.pt.
Hint
To use the generated pretrained.pt for lstm_transducer_stateless2/decode.py,
you can run:
cd lstm_transducer_stateless2/exp
ln -s pretrained.pt epoch-9999.pt
And then pass --epoch 9999 --avg 1 --use-averaged-model 0 to
./lstm_transducer_stateless2/decode.py.
To use the exported model with ./lstm_transducer_stateless2/pretrained.py, you
can run: