Add readme.

Fangjun Kuang 2021-08-10 20:08:23 +08:00
parent 5a0b9bcb23
commit 55be10534d
6 changed files with 151 additions and 134 deletions

View File

@@ -1 +1,77 @@
Working in progress.
## Installation
`icefall` depends on [k2][k2] for FSA operations and [lhotse][lhotse] for
data preparation. To use `icefall`, you have to install its dependencies first.
The following subsections describe how to set up the environment.
CAUTION: There are various ways to set up the environment. What we describe
here is just one of them.
### Install k2
Please refer to [k2's installation documentation][k2-install] to install k2.
If you have any issues installing k2, please open an issue at
<https://github.com/k2-fsa/k2/issues>.
The following shows the minimal commands needed to install k2 from source:
```bash
mkdir $HOME/open-source
cd $HOME/open-source
git clone https://github.com/k2-fsa/k2.git
cd k2
mkdir build_release
cd build_release
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j _k2
export PYTHONPATH=$HOME/open-source/k2/k2/python:$PYTHONPATH
export PYTHONPATH=$HOME/open-source/k2/build_release/lib:$PYTHONPATH
```
To check that k2 is installed successfully, please run
```
python3 -m k2.version
```
It should print information about the environment in which
k2 was built.
### Install lhotse
Please refer to [lhotse's installation documentation][lhotse-install] to install
lhotse.
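If you just want a quick start and your environment already has a compatible PyTorch,
installing lhotse from PyPI is typically enough; this is only one option, and the
linked documentation covers the details:
```bash
# One common way to install lhotse (see the lhotse documentation for
# alternatives and for supported PyTorch versions):
pip install lhotse
```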
### Install icefall
`icefall` is a set of Python scripts. All you need to do is set
the environment variable `PYTHONPATH`:
```
cd $HOME/open-source
git clone https://github.com/k2-fsa/icefall
cd icefall
pip install -r requirements.txt
export PYTHONPATH=$HOME/open-source/icefall:$PYTHONPATH
```
To verify `icefall` was installed successfully, you can run:
```
python3 -c "import icefall; print(icefall.__file__)"
```
It should print the path to `icefall`.
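If you want these `PYTHONPATH` settings to survive new shell sessions, one option
(assuming the checkout locations used above and a bash shell) is to append the
exports to your shell startup file:
```bash
# Optional: persist the PYTHONPATH settings from the commands above.
# The paths assume everything was cloned under $HOME/open-source; adjust as needed.
cat >> ~/.bashrc <<'EOF'
export PYTHONPATH=$HOME/open-source/k2/k2/python:$PYTHONPATH
export PYTHONPATH=$HOME/open-source/k2/build_release/lib:$PYTHONPATH
export PYTHONPATH=$HOME/open-source/icefall:$PYTHONPATH
EOF
```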
## Run recipes
Currently, only the LibriSpeech recipe is provided. Please
follow [egs/librispeech/ASR/README.md][LibriSpeech] to run it.
[LibriSpeech]: egs/librispeech/ASR/README.md
[k2-install]: https://k2.readthedocs.io/en/latest/installation/index.html#
[k2]: https://github.com/k2-fsa/k2
[lhotse]: https://github.com/lhotse-speech/lhotse
[lhotse-install]: https://lhotse.readthedocs.io/en/latest/getting-started.html#installation

View File

@@ -1,121 +1,64 @@
Run `./prepare.sh` to prepare the data.
## Data preparation
Run `./xxx_train.py` (to be added) to train a model.
## Conformer-CTC
Results of the pre-trained model from
`<https://huggingface.co/GuoLiyong/snowfall_bpe_model/tree/main/exp-duration-200-feat_batchnorm-bpe-lrfactor5.0-conformer-512-8-noam>`
are given below
### HLG - no LM rescoring
(output beam size is 8)
#### 1-best decoding
If you want to use `./prepare.sh` to download everything for you,
you can just run
```
[test-clean-no_rescore] %WER 3.15% [1656 / 52576, 127 ins, 377 del, 1152 sub ]
[test-other-no_rescore] %WER 7.03% [3682 / 52343, 220 ins, 1024 del, 2438 sub ]
./prepare.sh
```
#### n-best decoding
For n=100,
If you have pre-downloaded the LibriSpeech dataset, please
read `./prepare.sh` and modify it to point to the location
of your dataset so that it won't re-download it. After modification,
please run
```
[test-clean-no_rescore-100] %WER 3.15% [1656 / 52576, 127 ins, 377 del, 1152 sub ]
[test-other-no_rescore-100] %WER 7.14% [3737 / 52343, 275 ins, 1020 del, 2442 sub ]
./prepare.sh
```
For n=200,
The script `./prepare.sh` prepares features, the lexicon, LMs, etc.
All generated files are saved in the folder `./data`.
HINT: `./prepare.sh` supports the options `--stage` and `--stop-stage`.
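For example, to run only a subset of the data preparation stages, you can pass these
options. The stage numbers below are purely illustrative; please look at `./prepare.sh`
to see what each stage actually does:
```bash
# Run only stages 0 through 3 of ./prepare.sh
# (illustrative stage numbers; check the script for their meaning).
./prepare.sh --stage 0 --stop-stage 3
```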
## TDNN-LSTM CTC training
The folder `tdnn_lstm_ctc` contains scripts for CTC training
with TDNN-LSTM models.
Pre-configured parameters for training and decoding are set in the function
`get_params()` within `tdnn_lstm_ctc/train.py`
and `tdnn_lstm_ctc/decode.py`.
Parameters that can be passed from the command line can be found by running
```
[test-clean-no_rescore-200] %WER 3.16% [1660 / 52576, 125 ins, 378 del, 1157 sub ]
[test-other-no_rescore-200] %WER 7.04% [3684 / 52343, 228 ins, 1012 del, 2444 sub ]
./tdnn_lstm_ctc/train.py --help
./tdnn_lstm_ctc/decode.py --help
```
### HLG - with LM rescoring
#### Whole lattice rescoring
If you have 4 GPUs on a machine and want to use GPUs 0, 2 and 3 for
multi-GPU training, you can run
```
[test-clean-lm_scale_0.8] %WER 2.77% [1456 / 52576, 150 ins, 210 del, 1096 sub ]
[test-other-lm_scale_0.8] %WER 6.23% [3262 / 52343, 246 ins, 635 del, 2381 sub ]
export CUDA_VISIBLE_DEVICES="0,2,3"
./tdnn_lstm_ctc/train.py \
--master-port 12345 \
--world-size 3
```
WERs of different LM scales are:
If you want to decode by averaging checkpoints `epoch-8.pt`,
`epoch-9.pt` and `epoch-10.pt`, you can run
```
For test-clean, WER of different settings are:
lm_scale_0.8 2.77 best for test-clean
lm_scale_0.9 2.87
lm_scale_1.0 3.06
lm_scale_1.1 3.34
lm_scale_1.2 3.71
lm_scale_1.3 4.18
lm_scale_1.4 4.8
lm_scale_1.5 5.48
lm_scale_1.6 6.08
lm_scale_1.7 6.79
lm_scale_1.8 7.49
lm_scale_1.9 8.14
lm_scale_2.0 8.82
For test-other, WER of different settings are:
lm_scale_0.8 6.23 best for test-other
lm_scale_0.9 6.37
lm_scale_1.0 6.62
lm_scale_1.1 6.99
lm_scale_1.2 7.46
lm_scale_1.3 8.13
lm_scale_1.4 8.84
lm_scale_1.5 9.61
lm_scale_1.6 10.32
lm_scale_1.7 11.17
lm_scale_1.8 12.12
lm_scale_1.9 12.93
lm_scale_2.0 13.77
./tdnn_lstm_ctc/decode.py \
--epoch 10 \
--avg 3
```
#### n-best LM rescoring
## Conformer CTC training
n = 100
```
[test-clean-lm_scale_0.8] %WER 2.79% [1469 / 52576, 149 ins, 212 del, 1108 sub ]
[test-other-lm_scale_0.8] %WER 6.36% [3329 / 52343, 259 ins, 666 del, 2404 sub ]
```
WERs of different LM scales are:
```
For test-clean, WER of different settings are:
lm_scale_0.8 2.79 best for test-clean
lm_scale_0.9 2.89
lm_scale_1.0 3.03
lm_scale_1.1 3.28
lm_scale_1.2 3.52
lm_scale_1.3 3.78
lm_scale_1.4 4.04
lm_scale_1.5 4.24
lm_scale_1.6 4.45
lm_scale_1.7 4.58
lm_scale_1.8 4.7
lm_scale_1.9 4.8
lm_scale_2.0 4.92
For test-other, WER of different settings are:
lm_scale_0.8 6.36 best for test-other
lm_scale_0.9 6.45
lm_scale_1.0 6.64
lm_scale_1.1 6.92
lm_scale_1.2 7.25
lm_scale_1.3 7.59
lm_scale_1.4 7.88
lm_scale_1.5 8.13
lm_scale_1.6 8.36
lm_scale_1.7 8.54
lm_scale_1.8 8.71
lm_scale_1.9 8.88
lm_scale_2.0 9.02
```
The folder `conformer_ctc` contains scripts for CTC training
with conformer models. The steps for running training and
decoding are similar to those for `tdnn_lstm_ctc`.
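As a rough sketch only, assuming the `conformer_ctc` scripts accept the same kind of
options as their `tdnn_lstm_ctc` counterparts (an assumption; run them with `--help`
to see the actual flags), training and decoding might look like:
```bash
# A sketch assuming conformer_ctc/train.py and conformer_ctc/decode.py
# mirror the tdnn_lstm_ctc command-line interface; the epoch/avg values
# below are illustrative only.
export CUDA_VISIBLE_DEVICES="0,1,2,3"
./conformer_ctc/train.py --world-size 4
./conformer_ctc/decode.py --epoch 19 --avg 5
```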

View File

@@ -16,6 +16,7 @@ import torch.nn as nn
from conformer import Conformer
from lhotse.utils import fix_random_seed
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.nn.utils import clip_grad_norm_
from torch.utils.tensorboard import SummaryWriter
from transformer import Noam
@@ -114,7 +115,9 @@ def get_params() -> AttributeDict:
- log_interval: Print training loss if batch_idx % log_interval` is 0
- valid_interval: Run validation if batch_idx % valid_interval` is 0
- valid_interval: Run validation if batch_idx % valid_interval is 0
- reset_interval: Reset statistics if batch_idx % reset_interval is 0
- beam_size: It is used in k2.ctc_loss
@@ -124,19 +127,20 @@ def get_params() -> AttributeDict:
"""
params = AttributeDict(
{
"exp_dir": Path("conformer_ctc/exp"),
"exp_dir": Path("conformer_ctc/exp_new"),
"lang_dir": Path("data/lang_bpe"),
"feature_dim": 80,
"weight_decay": 0.0,
"weight_decay": 1e-6,
"subsampling_factor": 4,
"start_epoch": 0,
"num_epochs": 50,
"num_epochs": 20,
"best_train_loss": float("inf"),
"best_valid_loss": float("inf"),
"best_train_epoch": -1,
"best_valid_epoch": -1,
"batch_idx_train": 0,
"log_interval": 10,
"reset_interval": 200,
"valid_interval": 3000,
"beam_size": 10,
"reduction": "sum",
@@ -440,6 +444,8 @@ def train_one_epoch(
tot_att_loss = 0.0
tot_frames = 0.0 # sum of frames over all batches
params.tot_loss = 0.0
params.tot_frames = 0.0
for batch_idx, batch in enumerate(train_dl):
params.batch_idx_train += 1
batch_size = len(batch["supervisions"]["text"])
@@ -457,6 +463,7 @@
optimizer.zero_grad()
loss.backward()
clip_grad_norm_(model.parameters(), 5.0, 2.0)
optimizer.step()
loss_cpu = loss.detach().cpu().item()
@@ -468,6 +475,9 @@
tot_ctc_loss += ctc_loss_cpu
tot_att_loss += att_loss_cpu
params.tot_frames += params.train_frames
params.tot_loss += loss_cpu
tot_avg_loss = tot_loss / tot_frames
tot_avg_ctc_loss = tot_ctc_loss / tot_frames
tot_avg_att_loss = tot_att_loss / tot_frames
@@ -516,6 +526,12 @@
tot_avg_loss,
params.batch_idx_train,
)
if batch_idx > 0 and batch_idx % params.reset_interval == 0:
tot_loss = 0.0 # sum of losses over all batches
tot_ctc_loss = 0.0
tot_att_loss = 0.0
tot_frames = 0.0 # sum of frames over all batches
if batch_idx > 0 and batch_idx % params.valid_interval == 0:
compute_validation_loss(
@@ -551,7 +567,7 @@
params.batch_idx_train,
)
params.train_loss = tot_loss / tot_frames
params.train_loss = params.tot_loss / params.tot_frames
if params.train_loss < params.best_train_loss:
params.best_train_epoch = params.cur_epoch

View File

@@ -4,12 +4,9 @@
import math
from typing import Dict, List, Optional, Tuple
import k2
import torch
import torch.nn as nn
from subsampling import Conv2dSubsampling, VggSubsampling
from icefall.utils import get_texts
from torch.nn.utils.rnn import pad_sequence
# Note: TorchScript requires Dict/List/etc. to be fully typed.
@@ -274,9 +271,11 @@ class Transformer(nn.Module):
device
)
# TODO: Use eos_id as ignore_id.
# tgt_key_padding_mask = decoder_padding_mask(ys_in_pad, ignore_id=eos_id)
tgt_key_padding_mask = decoder_padding_mask(ys_in_pad)
tgt_key_padding_mask = decoder_padding_mask(ys_in_pad, ignore_id=eos_id)
# TODO: Use length information to create the decoder padding mask
# We set the first column to False since the first column in ys_in_pad
# contains sos_id, which is the same as eos_id in our current setting.
tgt_key_padding_mask[:, 0] = False
tgt = self.decoder_embed(ys_in_pad) # (N, T) -> (N, T, C)
tgt = self.decoder_pos(tgt)
@@ -339,9 +338,11 @@ class Transformer(nn.Module):
device
)
# TODO: Use eos_id as ignore_id.
# tgt_key_padding_mask = decoder_padding_mask(ys_in_pad, ignore_id=eos_id)
tgt_key_padding_mask = decoder_padding_mask(ys_in_pad)
tgt_key_padding_mask = decoder_padding_mask(ys_in_pad, ignore_id=eos_id)
# TODO: Use length information to create the decoder padding mask
# We set the first column to False since the first column in ys_in_pad
# contains sos_id, which is the same as eos_id in our current setting.
tgt_key_padding_mask[:, 0] = False
tgt = self.decoder_embed(ys_in_pad) # (B, T) -> (B, T, F)
tgt = self.decoder_pos(tgt)

View File

@@ -1,22 +1,2 @@
## (To be filled in)
It will contain:
- How to run
- WERs
```bash
cd $PWD/..
./prepare.sh
./tdnn_lstm_ctc/train.py
```
If you have 4 GPUs and want to use GPU 1 and GPU 3 for DDP training,
you can do the following:
```
export CUDA_VISIBLE_DEVICES="1,3"
./tdnn_lstm_ctc/train.py --world-size=2
```
Will add results later.

View File

@@ -1,3 +1,4 @@
kaldilm
kaldialign
sentencepiece>=0.1.96
tensorboard