Introduction

This is a weakly supervised ASR recipe for the LibriSpeech (clean 100 hours) dataset. We train a conformer model using Bypass Temporal Classification (BTC)/Omni-temporal Classification (OTC) with transcripts with synthetic errors. In this README, we will describe the task and the BTC/OTC training process.

Note that OTC is an extension of BTC and supports all BTC functions. Therefore, in the following, we only describe OTC.

Task

We propose BTC/OTC to directly train an ASR system leveraging weak supervision, i.e., speech with non-verbatim transcripts. This is achieved by using a special token \star to model uncertainties (i.e., substitution errors, insertion errors, and deletion errors) within the WFST framework during training.

Image 1
Image 2
Image 3
Examples of errors (substitution, insertion, and deletion) in the transcript. The grey box is the verbatim transcript and the red box is the inaccurate transcript. Inaccurate words are marked in bold.


We modify G(\mathbf{y}) by adding self-loop arcs into each state and bypass arcs into each arc.

Image Alt Text

We incorporate the penalty strategy and apply different configurations for the self-loop arc and bypass arc. The penalties are set as

\lambda_{1_{i}} = \beta_{1} * \tau_{1}^{i},\quad \lambda_{2_{i}} = \beta_{2} * \tau_{2}^{i}

for the $i$-th training epoch. \beta is the initial penalty that encourages the model to rely more on the given transcript at the start of training. It decays exponentially by a factor of \tau \in (0, 1), gradually encouraging the model to align speech with \star when getting confused.

After composing the modified WFST G_{\text{otc}}(\mathbf{y}) with L and T, the OTC training graph is shown in this figure:

Image Alt Text
OTC training graph. The self-loop arcs and bypass arcs are highlighted in green and blue, respectively.

The \star is represented as the average probability of all non-blank tokens.

The weight of \star is the log average probability of "a" and "b": \log \frac{e^{-1.2} + e^{-2.3}}{2} = -1.6 and \log \frac{e^{-1.9} + e^{-0.5}}{2} = -1.0 for 2 frames.

Description of the recipe

Preparation

# feature_type can be ssl or fbank
feature_type=ssl
feature_dir="data/${feature_type}"
manifest_dir="${feature_dir}"
lang_dir="data/lang"
lm_dir="data/lm"
exp_dir="conformer_ctc2/exp"
otc_token="<star>"

./prepare.sh \
  --feature-type "${feature_type}" \
  --feature-dir "${feature_dir}" \
  --lang-dir "${lang_dir}" \
  --lm-dir "${lm_dir}" \
  --otc-token "${otc_token}" 

This script adds the 'otc_token' ('<star>') and its corresponding sentence-piece ('▁<star>') to 'words.txt' and 'tokens.txt,' respectively. Additionally, it computes SSL features using the 'wav2vec2-base' model. (You can use GPU to accelerate feature extraction).

Making synthetic errors to the transcript (train-clean-100) [optional]

sub_er=0.17
ins_er=0.17
del_er=0.17
synthetic_train_manifest="librispeech_cuts_train-clean-100_${sub_er}_${ins_er}_${del_er}.jsonl.gz"

./local/make_error_cutset.py \
  --input-cutset "${manifest_dir}/librispeech_cuts_train-clean-100.jsonl.gz" \
  --words-file "${lang_dir}/words.txt" \
  --sub-error-rate "${sub_er}" \
  --ins-error-rate "${ins_er}" \
  --del-error-rate "${del_er}" \
  --output-cutset "${manifest_dir}/${synthetic_train_manifest}"

This script generates synthetic substitution, insertion, and deletion errors in the transcript with ratios 'sub_er', 'ins_er', and 'del_er', respectively. The original transcript is saved as 'verbatim transcript' in the cutset, along with information on how the transcript is corrupted:

  • '[hello]' indicates the original word 'hello' is substituted by another word
  • '[]' indicates an extra word is inserted into the transcript
  • '-hello-' indicates the word 'hello' is deleted from the transcript

So if the original transcript is "have a nice day" and the synthetic one is "a very good day", the 'verbatim transcript' would be:

original:  have  a      nice  day
synthetic:       a very good  day
verbatim: -have- a  [] [nice] day

Training

The training uses synthetic data based on the train-clean-100 subset.

otc_lang_dir=data/lang_bpe_200

allow_bypass_arc=true
allow_self_loop_arc=true
initial_bypass_weight=-19
initial_self_loop_weight=3.75
bypass_weight_decay=0.975
self_loop_weight_decay=0.999

show_alignment=true

export CUDA_VISIBLE_DEVICES="0,1,2,3"
./conformer_ctc2/train.py \
  --world-size 4 \
  --manifest-dir "${manifest_dir}" \
  --train-manifest "${synthetic_train_manifest}" \
  --exp-dir "${exp_dir}" \
  --lang-dir "${otc_lang_dir}" \
  --otc-token "${otc_token}" \
  --allow-bypass-arc "${allow_bypass_arc}" \
  --allow-self-loop-arc "${allow_self_loop_arc}" \
  --initial-bypass-weight "${initial_bypass_weight}" \
  --initial-self-loop-weight "${initial_self_loop_weight}" \
  --bypass-weight-decay "${bypass_weight_decay}" \
  --self-loop-weight-decay "${self_loop_weight_decay}" \
  --show-alignment "${show_alignment}"

The bypass arc deals with substitution and insertion errors, while the self-loop arc deals with deletion errors. Using "--show-alignment" would print the best alignment during training, which is very helpful for tuning hyperparameters and debugging.

Decoding

export CUDA_VISIBLE_DEVICES="0"
./conformer_ctc2/decode.py \
  --manifest-dir "${manifest_dir}" \
  --exp-dir "${exp_dir}" \
  --lang-dir "${otc_lang_dir}" \
  --lm-dir "${lm_dir}" \
  --otc-token "${otc_token}"
Training Criterion ssl fbank
test-clean test-other test-clean test-other
CTC 100.0 100.0 99.89 99.98
OTC 11.89 25.46 20.14 44.24

Results (1best, blank_bias=-4)

Training Criterion ssl fbank
test-clean test-other test-clean test-other
CTC 98.40 98.68 99.79 99.86
OTC 6.59 15.98 11.78 32.38

Pre-trained Model

Pre-trained model: https://huggingface.co/dgao/icefall-otc-librispeech-conformer-ctc2

Citations

@inproceedings{gao2023bypass,
  title={Bypass Temporal Classification: Weakly Supervised Automatic Speech Recognition with Imperfect Transcripts},
  author={Gao, Dongji and Wiesner, Matthew and Xu, Hainan and Garcia, Leibny Paola and Povey, Daniel and Khudanpur, Sanjeev},
  booktitle={INTERSPEECH},
  year={2023}
}

@inproceedings{gao2023learning,
  title={Learning from Flawed Data: Weakly Supervised Automatic Speech Recognition},
  author={Gao, Dongji and Xu, Hainan and Raj, Desh and Garcia, Leibny Paola and Povey, Daniel and Khudanpur, Sanjeev},
  booktitle={IEEE ASRU},
  year={2023}
}