WIP: Add doc about FST-based CTC forced alignment.

Fangjun Kuang 2024-01-30 19:29:33 +08:00
parent 37b975cac9
commit 0a244463c3
5 changed files with 95 additions and 2 deletions

docs/source/conf.py

@@ -98,4 +98,6 @@ rst_epilog = """
.. _Next-gen Kaldi: https://github.com/k2-fsa
.. _Kaldi: https://github.com/kaldi-asr/kaldi
.. _lilcom: https://github.com/danpovey/lilcom
.. _CTC: https://www.cs.toronto.edu/~graves/icml_2006.pdf
.. _kaldi-decoder: https://github.com/k2-fsa/kaldi-decoder
""" """

docs/source/fst-based-forced-alignment/index.rst

@@ -0,0 +1,56 @@
FST-based forced alignment
==========================

This section describes how to perform **FST-based** ``forced alignment`` with models
trained with the `CTC`_ loss.

We use the `CTC FORCED ALIGNMENT API TUTORIAL <https://pytorch.org/audio/main/tutorials/ctc_forced_alignment_api_tutorial.html>`_
from `torchaudio`_ as a reference in this section. The difference is that we use
an ``FST``-based approach here.

Two approaches to FST-based forced alignment are described:

- `Kaldi`_-based
- `k2`_-based

Note that the `Kaldi`_-based approach does not depend on `Kaldi`_ at all.
That is, you don't need to install `Kaldi`_ in order to use it. Instead,
we use `kaldi-decoder`_, which ports the C++ decoding code from `Kaldi`_
without depending on it.
Differences between the two approaches
--------------------------------------

The following table summarizes the differences between the two approaches.

.. list-table::
   :header-rows: 1

   * - Feature
     - `Kaldi`_-based
     - `k2`_-based
   * - Supports CUDA
     - No
     - Yes
   * - Supports CPU
     - Yes
     - Yes
   * - Supports batch processing
     - No
     - Yes on CUDA; no on CPU
   * - Supports streaming models
     - Yes
     - No
   * - Supports C++ APIs
     - Yes
     - Yes
   * - Supports Python APIs
     - Yes
     - Yes
.. toctree::
   :maxdepth: 2
   :caption: Contents:

   kaldi-based
   k2-based

docs/source/fst-based-forced-alignment/k2-based.rst

@@ -0,0 +1,4 @@
k2-based forced alignment
=========================
TODO(fangjun)

docs/source/fst-based-forced-alignment/kaldi-based.rst

@@ -0,0 +1,31 @@
Kaldi-based forced alignment
============================

This section describes in detail how to use `kaldi-decoder`_
for **FST-based** ``forced alignment`` with models trained with the `CTC`_ loss.

We will use the test data from the
`CTC FORCED ALIGNMENT API TUTORIAL <https://pytorch.org/audio/main/tutorials/ctc_forced_alignment_api_tutorial.html>`_.
Prepare the environment
-----------------------
Before you continue, make sure you have set up `icefall`_ by following :ref:`install icefall`.

.. hint::

   You don't need to install `Kaldi`_. We will ``NOT`` use `Kaldi`_ below.
Get the test data
-----------------
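
A minimal sketch of this step, following the `torchaudio`_ tutorial linked above.
The asset name and transcript are taken from that tutorial; treat the rest as
illustrative:

.. code-block:: python

   import torchaudio

   # Download the test clip used in the torchaudio forced-alignment tutorial.
   speech_file = torchaudio.utils.download_asset(
       "tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav"
   )

   waveform, sample_rate = torchaudio.load(speech_file)
   assert sample_rate == 16000

   # The reference transcript of the clip, as given in the tutorial.
   transcript = "i had that curiosity beside me at this moment"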
Compute log_probs
-----------------
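
Continuing from the snippet above, here is a sketch that uses a pretrained CTC
model from `torchaudio`_ to compute frame-level log-probabilities. The choice
of bundle is an assumption, not necessarily the model this doc will settle on:

.. code-block:: python

   import torch
   import torchaudio

   # A pretrained wav2vec2 CTC model stands in for "a model trained
   # with the CTC loss".
   bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
   model = bundle.get_model()
   model.eval()

   with torch.no_grad():
       # emissions: (batch, num_frames, vocab_size)
       emissions, _ = model(waveform)

   # Normalize to log-probabilities over the CTC vocabulary.
   log_probs = torch.log_softmax(emissions, dim=-1)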
Convert transcript to an FST graph
----------------------------------
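
To avoid guessing any wrapper API, the sketch below writes a linear CTC
alignment graph in OpenFst text format using plain Python; the graph may
instead be built with `kaldi-decoder`_ utilities. The token-ID mapping is
whatever your model's vocabulary defines:

.. code-block:: python

   def make_linear_ctc_graph(token_ids: list, blank_id: int = 0) -> str:
       """Return a linear CTC alignment graph in OpenFst text format.

       State ``i`` has a blank self-loop; consuming token ``i`` moves to
       state ``i + 1``, which has a self-loop to absorb repeated frames of
       the same token. Handling of consecutive identical tokens is omitted
       for brevity.
       """
       lines = []
       for i, t in enumerate(token_ids):
           lines.append(f"{i} {i} {blank_id} {blank_id}")  # blanks before token i
           lines.append(f"{i} {i + 1} {t} {t}")  # consume token i
           lines.append(f"{i + 1} {i + 1} {t} {t}")  # repeated frames of token i
       final = len(token_ids)
       lines.append(f"{final} {final} {blank_id} {blank_id}")  # trailing blanks
       lines.append(f"{final}")  # final state
       return "\n".join(lines)

   # Example: a transcript mapped to token IDs [5, 3] by some tokenizer.
   print(make_linear_ctc_graph([5, 3]))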
Force aligner
-------------
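
As a stand-in for the actual `kaldi-decoder`_ decoding call, here is a
pure-Python sketch of the computation an FST-based aligner performs on the
graph above: a Viterbi search over the CTC trellis. It is illustrative only
and far slower than the C++ decoders:

.. code-block:: python

   import torch

   def viterbi_align(log_probs: torch.Tensor, token_ids: list, blank_id: int = 0):
       """Align log_probs of shape (num_frames, vocab_size) to token_ids.

       Mirrors what an FST decoder does on the linear CTC graph: at each
       frame, stay in the current state (blank or repeated token) or
       advance by consuming the next token. Returns one label per frame.
       """
       assert token_ids, "a non-empty transcript is assumed"
       lp = log_probs.tolist()
       # Interleave blanks: blank, y1, blank, y2, ..., yN, blank
       aug = [blank_id]
       for t in token_ids:
           aug += [t, blank_id]
       T, S = len(lp), len(aug)
       NEG_INF = float("-inf")
       score = [[NEG_INF] * S for _ in range(T)]
       back = [[0] * S for _ in range(T)]
       score[0][0] = lp[0][aug[0]]
       score[0][1] = lp[0][aug[1]]
       for t in range(1, T):
           for s in range(S):
               cands = [(score[t - 1][s], s)]
               if s >= 1:
                   cands.append((score[t - 1][s - 1], s - 1))
               # Skip the in-between blank only between two different tokens.
               if s >= 2 and aug[s] != blank_id and aug[s] != aug[s - 2]:
                   cands.append((score[t - 1][s - 2], s - 2))
               best, prev = max(cands)
               score[t][s] = best + lp[t][aug[s]]
               back[t][s] = prev
       # End in the last token or the trailing blank, whichever scores higher.
       s = S - 1 if score[T - 1][S - 1] >= score[T - 1][S - 2] else S - 2
       labels = []
       for t in range(T - 1, 0, -1):
           labels.append(aug[s])
           s = back[t][s]
       labels.append(aug[s])
       return labels[::-1]

   # log_probs[0] drops the batch dimension from the previous snippet.
   frame_labels = viterbi_align(log_probs[0], [5, 3])

Consecutive identical frame labels can then be merged into token spans, and
frame indices scaled by the model's frame shift to obtain time stamps.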

docs/source/index.rst

@@ -25,7 +25,7 @@ speech recognition recipes using `k2 <https://github.com/k2-fsa/k2>`_.
   docker/index
   faqs
   model-export/index
   fst-based-forced-alignment/index
.. toctree::
   :maxdepth: 3
@@ -40,5 +40,5 @@ speech recognition recipes using `k2 <https://github.com/k2-fsa/k2>`_.

.. toctree::
   :maxdepth: 2

   decoding-with-langugage-models/index