diff --git a/docs/source/installation/index.rst b/docs/source/installation/index.rst
index efec1f389..e3ccf3e1e 100644
--- a/docs/source/installation/index.rst
+++ b/docs/source/installation/index.rst
@@ -310,7 +310,7 @@ The correct fix is:
 .6.1 tensorboard-plugin-wit-1.8.0 urllib3-1.26.6 werkzeug-2.0.1
 
-Test your Installation
+Test Your Installation
 ----------------------
 
 To test that your installation is successful, let us run
diff --git a/docs/source/recipes/images/yesno-tdnn-tensorboard-log.png b/docs/source/recipes/images/yesno-tdnn-tensorboard-log.png
new file mode 100644
index 000000000..3d2612c9c
Binary files /dev/null and b/docs/source/recipes/images/yesno-tdnn-tensorboard-log.png differ
diff --git a/docs/source/recipes/yesno.rst b/docs/source/recipes/yesno.rst
index c5a341759..5d549b06d 100644
--- a/docs/source/recipes/yesno.rst
+++ b/docs/source/recipes/yesno.rst
@@ -19,7 +19,7 @@ This page shows you how to run the ``yesno`` recipe.
 Data preparation
 ----------------
 
-.. code-block::
+.. code-block:: bash
 
   $ cd egs/yesno/ASR
   $ ./prepare.sh
@@ -64,17 +64,94 @@ The command to run the training part is:
 .. code-block:: bash
 
   $ cd egs/yesno/ASR
+  $ export CUDA_VISIBLE_DEVICES=""
   $ ./tdnn/train.py
 
 By default, it will run ``15`` epochs. Training logs and checkpoints
 are saved in ``tdnn/exp``.
 
-To see the training options, you can use:
+In ``tdnn/exp``, you will find the following files:
+
+  - ``epoch-0.pt``, ``epoch-1.pt``, ...
+
+    These are checkpoint files, containing model parameters and the
+    optimizer ``state_dict`` (a loading sketch follows this list).
+    To resume training from some checkpoint, say ``epoch-10.pt``, you can use:
+
+      .. code-block:: bash
+
+        $ ./tdnn/train.py --start-epoch 11
+
+  - ``tensorboard/``
+
+    This folder contains TensorBoard logs. Training loss, validation loss,
+    learning rate, etc., are recorded in these logs. You can visualize them by:
+
+      .. code-block:: bash
+
+        $ cd tdnn/exp/tensorboard
+        $ tensorboard dev upload --logdir . --description "TDNN training for yesno with icefall"
+
+    It will print something like the following:
+
+      .. code-block::
+
+        TensorFlow installation not found - running with reduced feature set.
+        Upload started and will continue reading any new data as it's added to the logdir.
+
+        To stop uploading, press Ctrl-C.
+
+        New experiment created. View your TensorBoard at: https://tensorboard.dev/experiment/yKUbhb5wRmOSXYkId1z9eg/
+
+        [2021-08-23T23:49:41] Started scanning logdir.
+        [2021-08-23T23:49:42] Total uploaded: 135 scalars, 0 tensors, 0 binary objects
+        Listening for new data in logdir...
+
+    Note that there is a URL in the above output; click it and you will see
+    the following screenshot:
+
+      .. figure:: images/yesno-tdnn-tensorboard-log.png
+         :width: 600
+         :alt: TensorBoard screenshot
+         :align: center
+         :target: https://tensorboard.dev/experiment/yKUbhb5wRmOSXYkId1z9eg/
+
+         TensorBoard screenshot.
+
+  - ``log/log-train-xxxx``
+
+    It is the detailed training log in text format, the same as the one
+    printed to the console during training.
+
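+  For illustration only, here is a minimal sketch of how such a checkpoint
+  could be loaded by hand. It assumes the checkpoint ``dict`` stores its
+  entries under ``model`` and ``optimizer`` keys; the actual keys, and the
+  model built in ``tdnn/train.py``, may differ:
+
+      .. code-block:: python
+
+        import torch
+
+        # Hypothetical stand-ins for the model/optimizer built in tdnn/train.py.
+        model = torch.nn.Linear(23, 4)
+        optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
+
+        # Load on CPU; the "model"/"optimizer" keys are assumptions.
+        checkpoint = torch.load("tdnn/exp/epoch-10.pt", map_location="cpu")
+        model.load_state_dict(checkpoint["model"])
+        optimizer.load_state_dict(checkpoint["optimizer"])
+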
+To see available training options, you can use:
 
 .. code-block:: bash
 
   $ ./tdnn/train.py --help
 
+.. NOTE::
+
+   By default, ``./tdnn/train.py`` uses GPU 0 for training if GPUs are
+   available. If you have two GPUs, say, GPU 0 and GPU 1, and you want to
+   use GPU 1 for training, you can run:
+
+    .. code-block:: bash
+
+      $ export CUDA_VISIBLE_DEVICES="1"
+      $ ./tdnn/train.py
+
+   Since the ``yesno`` dataset is very small, containing only 30 sound files
+   for training, and the model in use is also very small, we use:
+
+    .. code-block:: bash
+
+      $ export CUDA_VISIBLE_DEVICES=""
+
+   so that ``./tdnn/train.py`` uses CPU during training.
+
+   If you don't have GPUs, you don't need to
+   run ``export CUDA_VISIBLE_DEVICES=""``.
+
 Decoding
 --------
@@ -85,10 +162,12 @@ The command for decoding is:
 
 .. code-block:: bash
 
+  $ export CUDA_VISIBLE_DEVICES=""
   $ ./tdnn/decode.py
 
 You will see the WER in the output log.
-Decoding results are saved in ``tdnn/exp``.
+
+Decoded results are saved in ``tdnn/exp``.
 
 Colab notebook
 --------------
diff --git a/egs/yesno/ASR/tdnn/train.py b/egs/yesno/ASR/tdnn/train.py
index 04e1ab698..39c5ef3ef 100755
--- a/egs/yesno/ASR/tdnn/train.py
+++ b/egs/yesno/ASR/tdnn/train.py
@@ -60,6 +60,16 @@ def get_parser():
         help="Number of epochs to train.",
     )
 
+    parser.add_argument(
+        "--start-epoch",
+        type=int,
+        default=0,
+        help="""Resume training from this epoch.
+        If it is positive, it will load checkpoint from
+        tdnn/exp/epoch-{start_epoch-1}.pt
+        """,
+    )
+
     return parser
@@ -92,8 +102,6 @@ def get_params() -> AttributeDict:
 
     - start_epoch: If it is not zero, load checkpoint `start_epoch-1`
       and continue training from that checkpoint.
 
-    - num_epochs: Number of epochs to train.
-
     - best_train_loss: Best training loss so far. It is used to
       select the model that has the lowest training loss. It is
       updated during the training.
@@ -420,6 +428,19 @@ def train_one_epoch(
                 f"batch size: {batch_size}"
             )
 
+            if tb_writer is not None:
+                tb_writer.add_scalar(
+                    "train/current_loss",
+                    loss_cpu / params.train_frames,
+                    params.batch_idx_train,
+                )
+
+                tb_writer.add_scalar(
+                    "train/tot_avg_loss",
+                    tot_avg_loss,
+                    params.batch_idx_train,
+                )
+
         if batch_idx > 0 and batch_idx % params.valid_interval == 0:
             compute_validation_loss(
                 params=params,
@@ -434,6 +455,12 @@ def train_one_epoch(
                 f" best valid loss: {params.best_valid_loss:.4f} "
                 f"best valid epoch: {params.best_valid_epoch}"
             )
+            if tb_writer is not None:
+                tb_writer.add_scalar(
+                    "train/valid_loss",
+                    params.valid_loss,
+                    params.batch_idx_train,
+                )
 
         params.train_loss = tot_loss / tot_frames
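
For context on the ``tb_writer.add_scalar`` calls added above: ``tb_writer``
is expected to be a TensorBoard ``SummaryWriter``, whose construction is not
part of this diff. Below is a minimal, self-contained sketch of the same
logging pattern; the log directory and the dummy loop are illustrative, not
taken from ``train.py``:

.. code-block:: python

  from torch.utils.tensorboard import SummaryWriter

  # Illustrative writer; how train.py creates tb_writer is not shown in this diff.
  tb_writer = SummaryWriter(log_dir="tdnn/exp/tensorboard")

  for batch_idx_train, loss in enumerate([2.3, 1.7, 1.2]):  # dummy loss values
      # Same pattern as the diff: add_scalar(tag, scalar_value, global_step).
      tb_writer.add_scalar("train/current_loss", loss, batch_idx_train)

  tb_writer.close()  # flush event files so TensorBoard can read them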