Add more doc for the recipe yesno.

This commit is contained in:
Fangjun Kuang 2021-08-24 00:00:47 +08:00
parent dcf71b31a5
commit 39554781b2
4 changed files with 112 additions and 6 deletions


@@ -310,7 +310,7 @@ The correct fix is:
.6.1 tensorboard-plugin-wit-1.8.0 urllib3-1.26.6 werkzeug-2.0.1
Test your Installation
Test Your Installation
----------------------
To test that your installation is successful, let us run

Binary file not shown.



@@ -19,7 +19,7 @@ This page shows you how to run the ``yesno`` recipe.
Data preparation
----------------
.. code-block::
.. code-block:: bash

   $ cd egs/yesno/ASR
   $ ./prepare.sh
@@ -64,17 +64,94 @@ The command to run the training part is:
.. code-block:: bash

   $ cd egs/yesno/ASR
   $ export CUDA_VISIBLE_DEVICES=""
   $ ./tdnn/train.py
By default, it runs for ``15`` epochs. Training logs and checkpoints are saved
in ``tdnn/exp``.
In ``tdnn/exp``, you will find the following files:

- ``epoch-0.pt``, ``epoch-1.pt``, ...

  These are checkpoint files, containing model parameters and the optimizer
  ``state_dict``.

  To resume training from some checkpoint, say ``epoch-10.pt``, you can use:

  .. code-block:: bash

     $ ./tdnn/train.py --start-epoch 11

- ``tensorboard/``

  This folder contains TensorBoard logs. Training loss, validation loss,
  learning rate, etc., are recorded in these logs. You can visualize them by:

  .. code-block:: bash

     $ cd tdnn/exp/tensorboard
     $ tensorboard dev upload --logdir . --description "TDNN training for yesno with icefall"
It will print something like below:

.. code-block::

   TensorFlow installation not found - running with reduced feature set.
   Upload started and will continue reading any new data as it's added to the logdir.
   To stop uploading, press Ctrl-C.
   New experiment created. View your TensorBoard at: https://tensorboard.dev/experiment/yKUbhb5wRmOSXYkId1z9eg/
   [2021-08-23T23:49:41] Started scanning logdir.
   [2021-08-23T23:49:42] Total uploaded: 135 scalars, 0 tensors, 0 binary objects
   Listening for new data in logdir...
Note that there is a URL in the above output. Click it and you will see
the following screenshot:

.. figure:: images/yesno-tdnn-tensorboard-log.png
   :width: 600
   :alt: TensorBoard screenshot
   :align: center
   :target: https://tensorboard.dev/experiment/yKUbhb5wRmOSXYkId1z9eg/

   TensorBoard screenshot.
- ``log/log-train-xxxx``

  This is the detailed training log in text format, the same as the one
  printed to the console during training.
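The checkpoint naming scheme above implies a simple resume rule: ``--start-epoch N`` loads ``epoch-(N-1).pt``. A minimal sketch of that rule (``checkpoint_to_load`` is a hypothetical helper for illustration, not part of icefall):

```python
from pathlib import Path
from typing import Optional


def checkpoint_to_load(exp_dir: str, start_epoch: int) -> Optional[Path]:
    """Return the checkpoint to resume from, or None when training
    starts from scratch (start_epoch == 0)."""
    if start_epoch <= 0:
        return None
    # --start-epoch 11 means "resume after epoch 10", so load epoch-10.pt
    return Path(exp_dir) / f"epoch-{start_epoch - 1}.pt"
```

For example, ``--start-epoch 11`` maps to ``tdnn/exp/epoch-10.pt``.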
To see available training options, you can use:

.. code-block:: bash

   $ ./tdnn/train.py --help
.. NOTE::

   By default, ``./tdnn/train.py`` uses GPU 0 for training if GPUs are available.
   If you have two GPUs, say, GPU 0 and GPU 1, and you want to use GPU 1 for
   training, you can run:

   .. code-block:: bash

      $ export CUDA_VISIBLE_DEVICES="1"
      $ ./tdnn/train.py

   Since the ``yesno`` dataset is very small, containing only 30 sound files
   for training, and the model in use is also very small, we use:

   .. code-block:: bash

      $ export CUDA_VISIBLE_DEVICES=""

   so that ``./tdnn/train.py`` uses the CPU during training.

   If you don't have GPUs, then you don't need to run
   ``export CUDA_VISIBLE_DEVICES=""``.
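The effect of ``CUDA_VISIBLE_DEVICES`` on device selection can be sketched as follows. ``training_device`` is a hypothetical helper, not icefall code; the real selection is done by PyTorch, and the ``cuda:0`` branch assumes a GPU is actually present:

```python
import os


def training_device() -> str:
    """Mimic how CUDA_VISIBLE_DEVICES influences device selection."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible == "":
        # An empty value hides all GPUs, so frameworks fall back to the CPU.
        return "cpu"
    # Unset, or set to e.g. "1": the first *visible* GPU becomes device 0
    # (assuming a GPU exists on the machine).
    return "cuda:0"
```

Note that setting the variable to ``"1"`` does not select ``cuda:1``; it remaps physical GPU 1 to logical device 0.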
Decoding
--------
@@ -85,10 +162,12 @@ The command for decoding is:
.. code-block:: bash

   $ export CUDA_VISIBLE_DEVICES=""
   $ ./tdnn/decode.py
You will see the WER in the output log.
Decoding results are saved in ``tdnn/exp``.
Decoded results are saved in ``tdnn/exp``.
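The WER reported in the decoding log is the word-level edit distance between the reference transcript and the decoded one, divided by the number of reference words. A minimal sketch of that computation (``word_error_rate`` is an illustrative helper, not icefall's actual scoring code):

```python
def word_error_rate(ref: str, hyp: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of
    reference words, computed with a standard edit-distance DP."""
    r, h = ref.split(), hyp.split()
    # dp[i][j]: edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i  # delete all of r[:i]
    for j in range(len(h) + 1):
        dp[0][j] = j  # insert all of h[:j]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[len(r)][len(h)] / len(r)
```

For the ``yesno`` data, a reference like ``"YES NO YES"`` decoded as ``"YES YES YES"`` has one substitution out of three words, i.e. a WER of 1/3.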
Colab notebook
--------------


@@ -60,6 +60,16 @@ def get_parser():
        help="Number of epochs to train.",
    )

    parser.add_argument(
        "--start-epoch",
        type=int,
        default=0,
        help="""Resume training from this epoch.
        If it is positive, it will load the checkpoint from
        tdnn/exp/epoch-{start_epoch-1}.pt
        """,
    )

    return parser
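As a quick check of the option added above, the two arguments can be exercised with a minimal stand-alone parser. Only ``--num-epochs`` and ``--start-epoch`` are re-created here; the rest of ``get_parser()`` is omitted:

```python
import argparse

# Minimal re-creation of the two options, to show how --start-epoch parses.
parser = argparse.ArgumentParser()
parser.add_argument("--num-epochs", type=int, default=15,
                    help="Number of epochs to train.")
parser.add_argument("--start-epoch", type=int, default=0,
                    help="Resume training from this epoch.")

args = parser.parse_args(["--start-epoch", "11"])
# args.start_epoch == 11, so training would resume from epoch-10.pt,
# while args.num_epochs keeps its default of 15.
```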
@@ -92,8 +102,6 @@ def get_params() -> AttributeDict:
    - start_epoch: If it is not zero, load checkpoint `start_epoch-1`
      and continue training from that checkpoint.

    - num_epochs: Number of epochs to train.

    - best_train_loss: Best training loss so far. It is used to select
      the model that has the lowest training loss. It is
      updated during training.
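The ``best_train_loss`` bookkeeping described in this docstring can be sketched as follows (``update_best`` is a hypothetical helper for illustration; the real training script updates ``params`` inline):

```python
# Hypothetical sketch of the "best loss" bookkeeping: after each epoch,
# keep the lowest training loss seen so far and the epoch that produced it.
def update_best(params: dict, epoch: int, train_loss: float) -> None:
    if train_loss < params["best_train_loss"]:
        params["best_train_loss"] = train_loss
        params["best_train_epoch"] = epoch


params = {"best_train_loss": float("inf"), "best_train_epoch": -1}
for epoch, loss in enumerate([1.9, 1.2, 1.4]):
    update_best(params, epoch, loss)
# params now records the best loss 1.2, reached at epoch 1
```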
@@ -420,6 +428,19 @@ def train_one_epoch(
                f"batch size: {batch_size}"
            )

            if tb_writer is not None:
                tb_writer.add_scalar(
                    "train/current_loss",
                    loss_cpu / params.train_frames,
                    params.batch_idx_train,
                )

                tb_writer.add_scalar(
                    "train/tot_avg_loss",
                    tot_avg_loss,
                    params.batch_idx_train,
                )
        if batch_idx > 0 and batch_idx % params.valid_interval == 0:
            compute_validation_loss(
                params=params,
@@ -434,6 +455,12 @@ def train_one_epoch(
                f" best valid loss: {params.best_valid_loss:.4f} "
                f"best valid epoch: {params.best_valid_epoch}"
            )

            if tb_writer is not None:
                tb_writer.add_scalar(
                    "train/valid_loss",
                    params.valid_loss,
                    params.batch_idx_train,
                )

    params.train_loss = tot_loss / tot_frames