diff --git a/docs/source/_static/kaldi-align/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav b/docs/source/_static/kaldi-align/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav new file mode 100644 index 000000000..004a33532 Binary files /dev/null and b/docs/source/_static/kaldi-align/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav differ diff --git a/docs/source/_static/kaldi-align/at.wav b/docs/source/_static/kaldi-align/at.wav new file mode 100644 index 000000000..caad1178c Binary files /dev/null and b/docs/source/_static/kaldi-align/at.wav differ diff --git a/docs/source/_static/kaldi-align/beside.wav b/docs/source/_static/kaldi-align/beside.wav new file mode 100644 index 000000000..5576d84c2 Binary files /dev/null and b/docs/source/_static/kaldi-align/beside.wav differ diff --git a/docs/source/_static/kaldi-align/curiosity.wav b/docs/source/_static/kaldi-align/curiosity.wav new file mode 100644 index 000000000..32d106d7b Binary files /dev/null and b/docs/source/_static/kaldi-align/curiosity.wav differ diff --git a/docs/source/_static/kaldi-align/had.wav b/docs/source/_static/kaldi-align/had.wav new file mode 100644 index 000000000..4d50ec80f Binary files /dev/null and b/docs/source/_static/kaldi-align/had.wav differ diff --git a/docs/source/_static/kaldi-align/i.wav b/docs/source/_static/kaldi-align/i.wav new file mode 100644 index 000000000..9db292a47 Binary files /dev/null and b/docs/source/_static/kaldi-align/i.wav differ diff --git a/docs/source/_static/kaldi-align/me.wav b/docs/source/_static/kaldi-align/me.wav new file mode 100644 index 000000000..e4f16f17c Binary files /dev/null and b/docs/source/_static/kaldi-align/me.wav differ diff --git a/docs/source/_static/kaldi-align/moment.wav b/docs/source/_static/kaldi-align/moment.wav new file mode 100644 index 000000000..eb60e44fe Binary files /dev/null and b/docs/source/_static/kaldi-align/moment.wav differ diff --git a/docs/source/_static/kaldi-align/that.wav b/docs/source/_static/kaldi-align/that.wav new file mode 100644 index 000000000..ddf0f666b Binary files /dev/null and b/docs/source/_static/kaldi-align/that.wav differ diff --git a/docs/source/_static/kaldi-align/this.wav b/docs/source/_static/kaldi-align/this.wav new file mode 100644 index 000000000..1c8bed956 Binary files /dev/null and b/docs/source/_static/kaldi-align/this.wav differ diff --git a/docs/source/conf.py b/docs/source/conf.py index 5a534e126..ded6977ac 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -98,4 +98,6 @@ rst_epilog = """ .. _Next-gen Kaldi: https://github.com/k2-fsa .. _Kaldi: https://github.com/kaldi-asr/kaldi .. _lilcom: https://github.com/danpovey/lilcom +.. _CTC: https://www.cs.toronto.edu/~graves/icml_2006.pdf +.. _kaldi-decoder: https://github.com/k2-fsa/kaldi-decoder """ diff --git a/docs/source/docker/intro.rst b/docs/source/docker/intro.rst index 2f4bdb3f6..f3d2b0727 100644 --- a/docs/source/docker/intro.rst +++ b/docs/source/docker/intro.rst @@ -34,6 +34,8 @@ which will give you something like below: .. 
code-block:: bash + "torch2.3.1-cuda12.1" + "torch2.3.1-cuda11.8" "torch2.2.2-cuda12.1" "torch2.2.2-cuda11.8" "torch2.2.1-cuda12.1" diff --git a/docs/source/fst-based-forced-alignment/diff.rst b/docs/source/fst-based-forced-alignment/diff.rst new file mode 100644 index 000000000..56b6c430e --- /dev/null +++ b/docs/source/fst-based-forced-alignment/diff.rst @@ -0,0 +1,41 @@ +Two approaches +============== + +Two approaches for FST-based forced alignment will be described: + + - `Kaldi`_-based + - `k2`_-based + +Note that the `Kaldi`_-based approach does not depend on `Kaldi`_ at all. +That is, you don't need to install `Kaldi`_ in order to use it. Instead, +we use `kaldi-decoder`_, which has ported the C++ decoding code from `Kaldi`_ +without depending on it. + +Differences between the two approaches +-------------------------------------- + +The following table compares the differences between the two approaches. + +.. list-table:: + + * - Features + - `Kaldi`_-based + - `k2`_-based + * - Support CUDA + - No + - Yes + * - Support CPU + - Yes + - Yes + * - Support batch processing + - No + - Yes on CUDA; No on CPU + * - Support streaming models + - Yes + - No + * - Support C++ APIs + - Yes + - Yes + * - Support Python APIs + - Yes + - Yes diff --git a/docs/source/fst-based-forced-alignment/index.rst b/docs/source/fst-based-forced-alignment/index.rst new file mode 100644 index 000000000..92a05faaa --- /dev/null +++ b/docs/source/fst-based-forced-alignment/index.rst @@ -0,0 +1,18 @@ +FST-based forced alignment +========================== + +This section describes how to perform **FST-based** ``forced alignment`` with models +trained by `CTC`_ loss. + +We use `CTC FORCED ALIGNMENT API TUTORIAL `_ +from `torchaudio`_ as a reference in this section. + +Different from `torchaudio`_, we use an ``FST``-based approach. + +.. toctree:: + :maxdepth: 2 + :caption: Contents: + + diff + kaldi-based + k2-based diff --git a/docs/source/fst-based-forced-alignment/k2-based.rst b/docs/source/fst-based-forced-alignment/k2-based.rst new file mode 100644 index 000000000..373e49f3e --- /dev/null +++ b/docs/source/fst-based-forced-alignment/k2-based.rst @@ -0,0 +1,4 @@ +k2-based forced alignment +========================= + +TODO(fangjun) diff --git a/docs/source/fst-based-forced-alignment/kaldi-based.rst b/docs/source/fst-based-forced-alignment/kaldi-based.rst new file mode 100644 index 000000000..69b6a665b --- /dev/null +++ b/docs/source/fst-based-forced-alignment/kaldi-based.rst @@ -0,0 +1,712 @@ +Kaldi-based forced alignment +============================ + +This section describes in detail how to use `kaldi-decoder`_ +for **FST-based** ``forced alignment`` with models trained by `CTC`_ loss. + +.. hint:: + + We have a colab notebook walking you through this section step by step. + + |kaldi-based forced alignment colab notebook| + + .. |kaldi-based forced alignment colab notebook| image:: https://colab.research.google.com/assets/colab-badge.svg + :target: https://github.com/k2-fsa/colab/blob/master/icefall/ctc_forced_alignment_fst_based_kaldi.ipynb + +Prepare the environment +----------------------- + +Before you continue, make sure you have setup `icefall`_ by following :ref:`install icefall`. + +.. hint:: + + You don't need to install `Kaldi`_. We will ``NOT`` use `Kaldi`_ below. + +Get the test data +----------------- + +We use the test wave +from `CTC FORCED ALIGNMENT API TUTORIAL `_ + +.. 
code-block:: python3 + + import torchaudio + + # Download test wave + speech_file = torchaudio.utils.download_asset("tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav") + print(speech_file) + waveform, sr = torchaudio.load(speech_file) + transcript = "i had that curiosity beside me at this moment".split() + print(waveform.shape, sr) + + assert waveform.ndim == 2 + assert waveform.shape[0] == 1 + assert sr == 16000 + +The test wave is downloaded to:: + + $HOME/.cache/torch/hub/torchaudio/tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav + +.. raw:: html + + + + + + + + + + + + +
    <table>
      <tr>
        <th>Wave filename</th>
        <th>Content</th>
        <th>Text</th>
      </tr>
      <tr>
        <td>Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav</td>
        <td><audio controls src="../_static/kaldi-align/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav"></audio></td>
        <td>i had that curiosity beside me at this moment</td>
      </tr>
    </table>
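If you are following along in a Jupyter or Colab notebook, it can be handy to double-check
the downloaded wave before moving on. The sketch below only uses the ``waveform`` and ``sr``
variables defined above and assumes ``IPython`` is available (it is in Jupyter/Colab).

.. code-block:: python3

    import IPython.display

    # Duration in seconds: number of samples divided by the sampling rate
    print(f"Duration: {waveform.shape[1] / sr:.3f} s")

    # Play the whole utterance inside the notebook
    IPython.display.Audio(waveform[0].numpy(), rate=sr)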
+ +We use the test model +from `CTC FORCED ALIGNMENT API TUTORIAL `_ + +.. code-block:: python3 + + import torch + + bundle = torchaudio.pipelines.MMS_FA + + device = torch.device("cuda" if torch.cuda.is_available() else "cpu") + model = bundle.get_model(with_star=False).to(device) + +The model is downloaded to:: + + $HOME/.cache/torch/hub/checkpoints/model.pt + +Compute log_probs +----------------- + +.. code-block:: bash + + with torch.inference_mode(): + emission, _ = model(waveform.to(device)) + print(emission.shape) + +It should print:: + + torch.Size([1, 169, 28]) + +Create token2id and id2token +---------------------------- + +.. code-block:: python3 + + token2id = bundle.get_dict(star=None) + id2token = {i:t for t, i in token2id.items()} + token2id[""] = 0 + del token2id["-"] + +Create word2id and id2word +-------------------------- + +.. code-block:: python3 + + words = list(set(transcript)) + word2id = dict() + word2id['eps'] = 0 + for i, w in enumerate(words): + word2id[w] = i + 1 + + id2word = {i:w for w, i in word2id.items()} + +Note that we only use words from the transcript of the test wave. + +Generate lexicon-related files +------------------------------ + +We use the code below to generate the following 4 files: + + - ``lexicon.txt`` + - ``tokens.txt`` + - ``words.txt`` + - ``lexicon_disambig.txt`` + +.. caution:: + + ``words.txt`` contains only words from the transcript of the test wave. + +.. code-block:: python3 + + from prepare_lang import add_disambig_symbols + + lexicon = [(w, list(w)) for w in word2id if w != "eps"] + lexicon_disambig, max_disambig_id = add_disambig_symbols(lexicon) + + with open('lexicon.txt', 'w', encoding='utf-8') as f: + for w, tokens in lexicon: + f.write(f"{w} {' '.join(tokens)}\n") + + with open('lexicon_disambig.txt', 'w', encoding='utf-8') as f: + for w, tokens in lexicon_disambig: + f.write(f"{w} {' '.join(tokens)}\n") + + with open('tokens.txt', 'w', encoding='utf-8') as f: + for t, i in token2id.items(): + if t == '-': + t = "" + f.write(f"{t} {i}\n") + + for k in range(max_disambig_id + 2): + f.write(f"#{k} {len(token2id) + k}\n") + + with open('words.txt', 'w', encoding='utf-8') as f: + for w, i in word2id.items(): + f.write(f"{w} {i}\n") + f.write(f'#0 {len(word2id)}\n') + + +To give you an idea about what the generated files look like:: + + head -n 50 lexicon.txt lexicon_disambig.txt tokens.txt words.txt + +prints:: + + ==> lexicon.txt <== + moment m o m e n t + beside b e s i d e + i i + this t h i s + curiosity c u r i o s i t y + had h a d + that t h a t + at a t + me m e + + ==> lexicon_disambig.txt <== + moment m o m e n t + beside b e s i d e + i i + this t h i s + curiosity c u r i o s i t y + had h a d + that t h a t + at a t + me m e + + ==> tokens.txt <== + a 1 + i 2 + e 3 + n 4 + o 5 + u 6 + t 7 + s 8 + r 9 + m 10 + k 11 + l 12 + d 13 + g 14 + h 15 + y 16 + b 17 + p 18 + w 19 + c 20 + v 21 + j 22 + z 23 + f 24 + ' 25 + q 26 + x 27 + 0 + #0 28 + #1 29 + + ==> words.txt <== + eps 0 + moment 1 + beside 2 + i 3 + this 4 + curiosity 5 + had 6 + that 7 + at 8 + me 9 + #0 10 + +.. note:: + + This test model uses characters as modeling unit. If you use other types of + modeling unit, the same code can be used without any change. + +Convert transcript to an FST graph +---------------------------------- + +.. code-block:: bash + + egs/librispeech/ASR/local/prepare_lang_fst.py --lang-dir ./ + +The above command should generate two files ``H.fst`` and ``HL.fst``. 
We will +use ``HL.fst`` below:: + + -rw-r--r-- 1 root root 13K Jun 12 08:28 H.fst + -rw-r--r-- 1 root root 3.7K Jun 12 08:28 HL.fst + +Force aligner +------------- + +Now, everything is ready. We can use the following code to get forced alignments. + +.. code-block:: python3 + + from kaldi_decoder import DecodableCtc, FasterDecoder, FasterDecoderOptions + import kaldifst + + def force_align(): + HL = kaldifst.StdVectorFst.read("./HL.fst") + decodable = DecodableCtc(emission[0].contiguous().cpu().numpy()) + decoder_opts = FasterDecoderOptions(max_active=3000) + decoder = FasterDecoder(HL, decoder_opts) + decoder.decode(decodable) + if not decoder.reached_final(): + print(f"failed to decode xxx") + return None + ok, best_path = decoder.get_best_path() + + ( + ok, + isymbols_out, + osymbols_out, + total_weight, + ) = kaldifst.get_linear_symbol_sequence(best_path) + if not ok: + print(f"failed to get linear symbol sequence for xxx") + return None + + # We need to use i-1 here since we have incremented tokens during + # HL construction + alignment = [i-1 for i in isymbols_out] + return alignment + + alignment = force_align() + + for i, a in enumerate(alignment): + print(i, id2token[a]) + +The output should be identical to +``_. + +For ease of reference, we list the output below:: + + 0 - + 1 - + 2 - + 3 - + 4 - + 5 - + 6 - + 7 - + 8 - + 9 - + 10 - + 11 - + 12 - + 13 - + 14 - + 15 - + 16 - + 17 - + 18 - + 19 - + 20 - + 21 - + 22 - + 23 - + 24 - + 25 - + 26 - + 27 - + 28 - + 29 - + 30 - + 31 - + 32 i + 33 - + 34 - + 35 h + 36 h + 37 a + 38 - + 39 - + 40 - + 41 d + 42 - + 43 - + 44 t + 45 h + 46 - + 47 a + 48 - + 49 - + 50 t + 51 - + 52 - + 53 - + 54 c + 55 - + 56 - + 57 - + 58 u + 59 u + 60 - + 61 - + 62 - + 63 r + 64 - + 65 i + 66 - + 67 - + 68 - + 69 - + 70 - + 71 - + 72 o + 73 - + 74 - + 75 - + 76 - + 77 - + 78 - + 79 s + 80 - + 81 - + 82 - + 83 i + 84 - + 85 t + 86 - + 87 - + 88 y + 89 - + 90 - + 91 - + 92 - + 93 b + 94 - + 95 e + 96 - + 97 - + 98 - + 99 - + 100 - + 101 s + 102 - + 103 - + 104 - + 105 - + 106 - + 107 - + 108 - + 109 - + 110 i + 111 - + 112 - + 113 d + 114 e + 115 - + 116 m + 117 - + 118 - + 119 e + 120 - + 121 - + 122 - + 123 - + 124 a + 125 - + 126 - + 127 t + 128 - + 129 t + 130 h + 131 - + 132 i + 133 - + 134 - + 135 - + 136 s + 137 - + 138 - + 139 - + 140 - + 141 m + 142 - + 143 - + 144 o + 145 - + 146 - + 147 - + 148 m + 149 - + 150 - + 151 e + 152 - + 153 n + 154 - + 155 t + 156 - + 157 - + 158 - + 159 - + 160 - + 161 - + 162 - + 163 - + 164 - + 165 - + 166 - + 167 - + 168 - + +To merge tokens, we use:: + + from icefall.ctc import merge_tokens + token_spans = merge_tokens(alignment) + for span in token_spans: + print(id2token[span.token], span.start, span.end) + +The output is given below:: + + i 32 33 + h 35 37 + a 37 38 + d 41 42 + t 44 45 + h 45 46 + a 47 48 + t 50 51 + c 54 55 + u 58 60 + r 63 64 + i 65 66 + o 72 73 + s 79 80 + i 83 84 + t 85 86 + y 88 89 + b 93 94 + e 95 96 + s 101 102 + i 110 111 + d 113 114 + e 114 115 + m 116 117 + e 119 120 + a 124 125 + t 127 128 + t 129 130 + h 130 131 + i 132 133 + s 136 137 + m 141 142 + o 144 145 + m 148 149 + e 151 152 + n 153 154 + t 155 156 + +All of the code below is copied and modified +from ``_. + +Segment each word using the computed alignments +----------------------------------------------- + +.. 
code-block:: python3 + + def unflatten(list_, lengths): + assert len(list_) == sum(lengths) + i = 0 + ret = [] + for l in lengths: + ret.append(list_[i : i + l]) + i += l + return ret + + + word_spans = unflatten(token_spans, [len(word) for word in transcript]) + print(word_spans) + +The output is:: + + [[TokenSpan(token=2, start=32, end=33)], + [TokenSpan(token=15, start=35, end=37), TokenSpan(token=1, start=37, end=38), TokenSpan(token=13, start=41, end=42)], + [TokenSpan(token=7, start=44, end=45), TokenSpan(token=15, start=45, end=46), TokenSpan(token=1, start=47, end=48), TokenSpan(token=7, start=50, end=51)], + [TokenSpan(token=20, start=54, end=55), TokenSpan(token=6, start=58, end=60), TokenSpan(token=9, start=63, end=64), TokenSpan(token=2, start=65, end=66), TokenSpan(token=5, start=72, end=73), TokenSpan(token=8, start=79, end=80), TokenSpan(token=2, start=83, end=84), TokenSpan(token=7, start=85, end=86), TokenSpan(token=16, start=88, end=89)], + [TokenSpan(token=17, start=93, end=94), TokenSpan(token=3, start=95, end=96), TokenSpan(token=8, start=101, end=102), TokenSpan(token=2, start=110, end=111), TokenSpan(token=13, start=113, end=114), TokenSpan(token=3, start=114, end=115)], + [TokenSpan(token=10, start=116, end=117), TokenSpan(token=3, start=119, end=120)], + [TokenSpan(token=1, start=124, end=125), TokenSpan(token=7, start=127, end=128)], + [TokenSpan(token=7, start=129, end=130), TokenSpan(token=15, start=130, end=131), TokenSpan(token=2, start=132, end=133), TokenSpan(token=8, start=136, end=137)], + [TokenSpan(token=10, start=141, end=142), TokenSpan(token=5, start=144, end=145), TokenSpan(token=10, start=148, end=149), TokenSpan(token=3, start=151, end=152), TokenSpan(token=4, start=153, end=154), TokenSpan(token=7, start=155, end=156)] + ] + + +.. code-block:: python3 + + def preview_word(waveform, spans, num_frames, transcript, sample_rate=bundle.sample_rate): + ratio = waveform.size(1) / num_frames + x0 = int(ratio * spans[0].start) + x1 = int(ratio * spans[-1].end) + print(f"{transcript} {x0 / sample_rate:.3f} - {x1 / sample_rate:.3f} sec") + segment = waveform[:, x0:x1] + return IPython.display.Audio(segment.numpy(), rate=sample_rate) + num_frames = emission.size(1) + +.. code-block:: python3 + + preview_word(waveform, word_spans[0], num_frames, transcript[0]) + preview_word(waveform, word_spans[1], num_frames, transcript[1]) + preview_word(waveform, word_spans[2], num_frames, transcript[2]) + preview_word(waveform, word_spans[3], num_frames, transcript[3]) + preview_word(waveform, word_spans[4], num_frames, transcript[4]) + preview_word(waveform, word_spans[5], num_frames, transcript[5]) + preview_word(waveform, word_spans[6], num_frames, transcript[6]) + preview_word(waveform, word_spans[7], num_frames, transcript[7]) + preview_word(waveform, word_spans[8], num_frames, transcript[8]) + +The segmented wave of each word along with its time stamp is given below: + +.. raw:: html + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
    <table>
      <tr>
        <th>Word</th>
        <th>Time</th>
        <th>Wave</th>
      </tr>
      <tr>
        <td>i</td>
        <td>0.644 - 0.664 sec</td>
        <td><audio controls src="../_static/kaldi-align/i.wav"></audio></td>
      </tr>
      <tr>
        <td>had</td>
        <td>0.704 - 0.845 sec</td>
        <td><audio controls src="../_static/kaldi-align/had.wav"></audio></td>
      </tr>
      <tr>
        <td>that</td>
        <td>0.885 - 1.026 sec</td>
        <td><audio controls src="../_static/kaldi-align/that.wav"></audio></td>
      </tr>
      <tr>
        <td>curiosity</td>
        <td>1.086 - 1.790 sec</td>
        <td><audio controls src="../_static/kaldi-align/curiosity.wav"></audio></td>
      </tr>
      <tr>
        <td>beside</td>
        <td>1.871 - 2.314 sec</td>
        <td><audio controls src="../_static/kaldi-align/beside.wav"></audio></td>
      </tr>
      <tr>
        <td>me</td>
        <td>2.334 - 2.414 sec</td>
        <td><audio controls src="../_static/kaldi-align/me.wav"></audio></td>
      </tr>
      <tr>
        <td>at</td>
        <td>2.495 - 2.575 sec</td>
        <td><audio controls src="../_static/kaldi-align/at.wav"></audio></td>
      </tr>
      <tr>
        <td>this</td>
        <td>2.595 - 2.756 sec</td>
        <td><audio controls src="../_static/kaldi-align/this.wav"></audio></td>
      </tr>
      <tr>
        <td>moment</td>
        <td>2.837 - 3.138 sec</td>
        <td><audio controls src="../_static/kaldi-align/moment.wav"></audio></td>
      </tr>
    </table>
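If you would like to save each word as a separate wave file, similar to the per-word clips
embedded in the table above, you can reuse the frame-to-sample ratio from ``preview_word``.
The following is only a sketch; the output file names are arbitrary examples.

.. code-block:: python3

    # A sketch: write one wave file per word using the word spans computed above.
    # The frame-to-sample ratio is the same one used in preview_word.
    ratio = waveform.size(1) / num_frames
    for word, spans in zip(transcript, word_spans):
        x0 = int(ratio * spans[0].start)
        x1 = int(ratio * spans[-1].end)
        torchaudio.save(f"{word}.wav", waveform[:, x0:x1], sample_rate=int(bundle.sample_rate))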
We repost the whole wave below for ease of reference:

.. raw:: html

    <table>
      <tr>
        <th>Wave filename</th>
        <th>Content</th>
        <th>Text</th>
      </tr>
      <tr>
        <td>Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav</td>
        <td><audio controls src="../_static/kaldi-align/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav"></audio></td>
        <td>i had that curiosity beside me at this moment</td>
      </tr>
    </table>
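As a final sanity check, you can rebuild each word from its aligned token spans and compare
the result with the original transcript. This small sketch uses only the ``id2token``,
``word_spans``, and ``transcript`` variables defined above.

.. code-block:: python3

    # Rebuild every word from its aligned token spans and compare with the transcript
    recovered = ["".join(id2token[span.token] for span in spans) for spans in word_spans]
    print(recovered)
    assert recovered == transcript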
+ +Summary +------- + +Congratulations! You have succeeded in using the FST-based approach to +compute alignment of a test wave. diff --git a/docs/source/index.rst b/docs/source/index.rst index fb539d3f2..d46a4038f 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -25,7 +25,7 @@ speech recognition recipes using `k2 `_. docker/index faqs model-export/index - + fst-based-forced-alignment/index .. toctree:: :maxdepth: 3 @@ -40,5 +40,5 @@ speech recognition recipes using `k2 `_. .. toctree:: :maxdepth: 2 - + decoding-with-langugage-models/index diff --git a/docs/source/model-export/export-ncnn-conv-emformer.rst b/docs/source/model-export/export-ncnn-conv-emformer.rst index 93392aee7..4cdc25ee6 100644 --- a/docs/source/model-export/export-ncnn-conv-emformer.rst +++ b/docs/source/model-export/export-ncnn-conv-emformer.rst @@ -15,8 +15,8 @@ We will show you step by step how to export it to `ncnn`_ and run it with `sherp .. caution:: - Please use a more recent version of PyTorch. For instance, ``torch 1.8`` - may ``not`` work. + ``torch > 2.0`` may not work. If you get errors while building pnnx, please switch + to ``torch < 2.0``. 1. Download the pre-trained model --------------------------------- diff --git a/docs/source/model-export/export-ncnn-lstm.rst b/docs/source/model-export/export-ncnn-lstm.rst index 310c3d8e4..ccf522dec 100644 --- a/docs/source/model-export/export-ncnn-lstm.rst +++ b/docs/source/model-export/export-ncnn-lstm.rst @@ -15,8 +15,8 @@ We will show you step by step how to export it to `ncnn`_ and run it with `sherp .. caution:: - Please use a more recent version of PyTorch. For instance, ``torch 1.8`` - may ``not`` work. + ``torch > 2.0`` may not work. If you get errors while building pnnx, please switch + to ``torch < 2.0``. 1. Download the pre-trained model ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ diff --git a/docs/source/model-export/export-ncnn-zipformer.rst b/docs/source/model-export/export-ncnn-zipformer.rst index a5845b0e4..51fc6c8e5 100644 --- a/docs/source/model-export/export-ncnn-zipformer.rst +++ b/docs/source/model-export/export-ncnn-zipformer.rst @@ -15,8 +15,8 @@ We will show you step by step how to export it to `ncnn`_ and run it with `sherp .. caution:: - Please use a more recent version of PyTorch. For instance, ``torch 1.8`` - may ``not`` work. + ``torch > 2.0`` may not work. If you get errors while building pnnx, please switch + to ``torch < 2.0``. 1. Download the pre-trained model ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^