init commit

minor simplifications
zr_jin 2024-10-22 14:49:25 +08:00
parent 88bacfb9e6
commit 30098827d9
123 changed files with 17 additions and 15846 deletions

383
README.md
View File

@@ -1,384 +1,11 @@
<div align="center">
<img src="https://raw.githubusercontent.com/k2-fsa/icefall/master/docs/source/_static/logo.png" width=168>
<img src="https://raw.githubusercontent.com/bio-icefall/biofall/master/icefall/logo.jpg" width=168>
</div>
# Introduction
The icefall project contains speech-related recipes for various datasets
using [k2-fsa](https://github.com/k2-fsa/k2) and [lhotse](https://github.com/lhotse-speech/lhotse).
The biofall project is forked from [icefall](https://github.com/k2-fsa/icefall) and contains recipes for biomarker processing with [Lhotse](https://github.com/lhotse-speech/lhotse).
You can use [sherpa](https://github.com/k2-fsa/sherpa), [sherpa-ncnn](https://github.com/k2-fsa/sherpa-ncnn) or [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx) for deployment with models
in icefall; these frameworks also support models not included in icefall; please refer to their respective documents for more details.
You can try pre-trained models from within your browser without the need
to download or install anything by visiting this [huggingface space](https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition).
Please refer to the [documentation](https://k2-fsa.github.io/icefall/huggingface/spaces.html) for more details.
# Installation
Please refer to the [documentation](https://k2-fsa.github.io/icefall/installation/index.html)
for installation instructions.
# Recipes
Please refer to the [documentation](https://k2-fsa.github.io/icefall/recipes/index.html)
for more details.
## ASR: Automatic Speech Recognition
### Supported Datasets
- [yesno][yesno]
- [Aidatatang_200zh][aidatatang_200zh]
- [Aishell][aishell]
- [Aishell2][aishell2]
- [Aishell4][aishell4]
- [Alimeeting][alimeeting]
- [AMI][ami]
- [CommonVoice][commonvoice]
- [Corpus of Spontaneous Japanese][csj]
- [GigaSpeech][gigaspeech]
- [LibriCSS][libricss]
- [LibriSpeech][librispeech]
- [Libriheavy][libriheavy]
- [Multi-Dialect Broadcast News Arabic Speech Recognition][mgb2]
- [PeopleSpeech][peoplespeech]
- [SPGISpeech][spgispeech]
- [Switchboard][swbd]
- [TIMIT][timit]
- [TED-LIUM3][tedlium3]
- [TAL_CSASR][tal_csasr]
- [Voxpopuli][voxpopuli]
- [XBMU-AMDO31][xbmu-amdo31]
- [WenetSpeech][wenetspeech]
More datasets will be added in the future.
### Supported Models
The [LibriSpeech][librispeech] recipe supports the most comprehensive set of models; you are welcome to try them out.
#### CTC
- TDNN LSTM CTC
- Conformer CTC
- Zipformer CTC
#### MMI
- Conformer MMI
- Zipformer MMI
#### Transducer
- Conformer-based Encoder
- LSTM-based Encoder
- Zipformer-based Encoder
- LSTM-based Predictor
- [Stateless Predictor](https://research.google/pubs/rnn-transducer-with-stateless-prediction-network/)
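The stateless predictor replaces the recurrent prediction network with an embedding over only the last few tokens (typically two), so the "decoder" carries no recurrent state. A minimal PyTorch sketch of the idea (names and shapes are illustrative, not icefall's actual implementation):
```python
import torch
import torch.nn as nn

class StatelessPredictor(nn.Module):
    """Predicts from only the last `context_size` tokens; no recurrent state."""

    def __init__(self, vocab_size: int, embed_dim: int, context_size: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # A 1-D convolution over the token axis mixes the limited left context.
        self.conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=context_size)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, U) token ids, left-padded with blanks.
        emb = self.embedding(y).permute(0, 2, 1)  # (batch, embed_dim, U)
        out = self.conv(emb).permute(0, 2, 1)     # (batch, U - context_size + 1, embed_dim)
        return torch.relu(out)
```
According to the linked paper, this barely hurts accuracy while making beam search and batching much simpler.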
#### Whisper
- [OpenAI Whisper](https://arxiv.org/abs/2212.04356) (we support fine-tuning on Aishell-1)
If you would like to contribute to icefall, please refer to [contributing](https://k2-fsa.github.io/icefall/contributing/index.html) for more details.
We would like to highlight the performance of some of the recipes here.
### [yesno][yesno]
This is the simplest ASR recipe in `icefall` and can be run on CPU.
Training takes less than 30 seconds and gives you the following WER:
```
[test_set] %WER 0.42% [1 / 240, 0 ins, 1 del, 0 sub ]
```
We provide a Colab notebook for this recipe: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1tIjjzaJc3IvGyKiMCDWO-TSnBgkcuN3B?usp=sharing)
### [LibriSpeech][librispeech]
Please see [RESULTS.md](https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md)
for the **latest** results.
#### [Conformer CTC](https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/conformer_ctc)
| | test-clean | test-other |
|-----|------------|------------|
| WER | 2.42 | 5.73 |
We provide a Colab notebook to test the pre-trained model: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1huyupXAcHsUrKaWfI83iMEJ6J0Nh0213?usp=sharing)
#### [TDNN LSTM CTC](https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/tdnn_lstm_ctc)
| | test-clean | test-other |
|-----|------------|------------|
| WER | 6.59 | 17.69 |
We provide a Colab notebook to test the pre-trained model: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1-iSfQMp2So-We_Uu49N4AAcMInB72u9z?usp=sharing)
#### [Transducer (Conformer Encoder + LSTM Predictor)](https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/transducer)
| | test-clean | test-other |
|---------------|------------|------------|
| greedy_search | 3.07 | 7.51 |
We provide a Colab notebook to test the pre-trained model: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1_u6yK9jDkPwG_NLrZMN2XK7Aeq4suMO2?usp=sharing)
#### [Transducer (Conformer Encoder + Stateless Predictor)](https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/transducer)
| | test-clean | test-other |
|---------------------------------------|------------|------------|
| modified_beam_search (`beam_size=4`) | 2.56 | 6.27 |
We provide a Colab notebook to test the pre-trained model: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1CO1bXJ-2khDckZIW8zjOPHGSKLHpTDlp?usp=sharing)
#### [Transducer (Zipformer Encoder + Stateless Predictor)](https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/zipformer)
WER (modified_beam_search `beam_size=4` unless stated otherwise)
1. LibriSpeech-960hr
| Encoder | Params | test-clean | test-other | epochs | devices |
|-----------------|--------|------------|------------|---------|------------|
| Zipformer | 65.5M | 2.21 | 4.79 | 50 | 4 32G-V100 |
| Zipformer-small | 23.2M | 2.42 | 5.73 | 50 | 2 32G-V100 |
| Zipformer-large | 148.4M | 2.06 | 4.63 | 50 | 4 32G-V100 |
| Zipformer-large | 148.4M | 2.00 | 4.38 | 174 | 8 80G-A100 |
2. LibriSpeech-960hr + GigaSpeech
| Encoder | Params | test-clean | test-other |
|-----------------|--------|------------|------------|
| Zipformer | 65.5M | 1.78 | 4.08 |
3. LibriSpeech-960hr + GigaSpeech + CommonVoice
| Encoder | Params | test-clean | test-other |
|-----------------|--------|------------|------------|
| Zipformer | 65.5M | 1.90 | 3.98 |
### [GigaSpeech][gigaspeech]
#### [Conformer CTC](https://github.com/k2-fsa/icefall/tree/master/egs/gigaspeech/ASR/conformer_ctc)
| | Dev | Test |
|-----|-------|-------|
| WER | 10.47 | 10.58 |
#### [Transducer (pruned_transducer_stateless2)](https://github.com/k2-fsa/icefall/tree/master/egs/gigaspeech/ASR/pruned_transducer_stateless2)
Conformer Encoder + Stateless Predictor + k2 Pruned RNN-T Loss
| | Dev | Test |
|----------------------|-------|-------|
| greedy_search | 10.51 | 10.73 |
| fast_beam_search | 10.50 | 10.69 |
| modified_beam_search | 10.40 | 10.51 |
#### [Transducer (Zipformer Encoder + Stateless Predictor)](https://github.com/k2-fsa/icefall/tree/master/egs/gigaspeech/ASR/zipformer)
| | Dev | Test |
|----------------------|-------|-------|
| greedy_search | 10.31 | 10.50 |
| fast_beam_search | 10.26 | 10.48 |
| modified_beam_search | 10.25 | 10.38 |
### [Aishell][aishell]
#### [TDNN LSTM CTC](https://github.com/k2-fsa/icefall/tree/master/egs/aishell/ASR/tdnn_lstm_ctc)
| | test |
|-----|-------|
| CER | 10.16 |
We provide a Colab notebook to test the pre-trained model: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1jbyzYq3ytm6j2nlEt-diQm-6QVWyDDEa?usp=sharing)
#### [Transducer (Conformer Encoder + Stateless Predictor)](https://github.com/k2-fsa/icefall/tree/master/egs/aishell/ASR/transducer_stateless)
| | test |
|-----|------|
| CER | 4.38 |
We provide a Colab notebook to test the pre-trained model: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/14XaT2MhnBkK-3_RqqWq3K90Xlbin-GZC?usp=sharing)
#### [Transducer (Zipformer Encoder + Stateless Predictor)](https://github.com/k2-fsa/icefall/tree/master/egs/aishell/ASR/zipformer)
CER (modified_beam_search `beam_size=4`)
| Encoder | Params | dev | test | epochs |
|-----------------|--------|-----|------|---------|
| Zipformer | 73.4M | 4.13| 4.40 | 55 |
| Zipformer-small | 30.2M | 4.40| 4.67 | 55 |
| Zipformer-large | 157.3M | 4.03| 4.28 | 56 |
### [Aishell4][aishell4]
#### [Transducer (pruned_transducer_stateless5)](https://github.com/k2-fsa/icefall/tree/master/egs/aishell4/ASR/pruned_transducer_stateless5)
Trained with all subsets:
| | test |
|-----|------------|
| CER | 29.08 |
We provide a Colab notebook to test the pre-trained model: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1z3lkURVv9M7uTiIgf3Np9IntMHEknaks?usp=sharing)
### [TIMIT][timit]
#### [TDNN LSTM CTC](https://github.com/k2-fsa/icefall/tree/master/egs/timit/ASR/tdnn_lstm_ctc)
| |TEST|
|---|----|
|PER| 19.71% |
We provide a Colab notebook to test the pre-trained model: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Hs9DA4V96uapw_30uNp32OMJgkuR5VVd?usp=sharing)
#### [TDNN LiGRU CTC](https://github.com/k2-fsa/icefall/tree/master/egs/timit/ASR/tdnn_ligru_ctc)
| |TEST|
|---|----|
|PER| 17.66% |
We provide a Colab notebook to test the pre-trained model: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1z3lkURVv9M7uTiIgf3Np9IntMHEknaks?usp=sharing)
### [TED-LIUM3][tedlium3]
#### [Transducer (Conformer Encoder + Stateless Predictor)](https://github.com/k2-fsa/icefall/tree/master/egs/tedlium3/ASR/transducer_stateless)
| | dev | test |
|--------------------------------------|-------|--------|
| modified_beam_search (`beam_size=4`) | 6.91 | 6.33 |
We provide a Colab notebook to test the pre-trained model: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1MmY5bBxwvKLNT4A2DJnwiqRXhdchUqPN?usp=sharing)
#### [Transducer (pruned_transducer_stateless)](https://github.com/k2-fsa/icefall/tree/master/egs/tedlium3/ASR/pruned_transducer_stateless)
| | dev | test |
|--------------------------------------|-------|--------|
| modified_beam_search (`beam_size=4`) | 6.77 | 6.14 |
We provide a Colab notebook to test the pre-trained model: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1je_1zGrOkGVVd4WLzgkXRHxl-I27yWtz?usp=sharing)
### [Aidatatang_200zh][aidatatang_200zh]
#### [Transducer (pruned_transducer_stateless2)](https://github.com/k2-fsa/icefall/tree/master/egs/aidatatang_200zh/ASR/pruned_transducer_stateless2)
| | Dev | Test |
|----------------------|-------|-------|
| greedy_search | 5.53 | 6.59 |
| fast_beam_search | 5.30 | 6.34 |
| modified_beam_search | 5.27 | 6.33 |
We provide a Colab notebook to test the pre-trained model: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1wNSnSj3T5oOctbh5IGCa393gKOoQw2GH?usp=sharing)
### [WenetSpeech][wenetspeech]
#### [Transducer (pruned_transducer_stateless2)](https://github.com/k2-fsa/icefall/tree/master/egs/wenetspeech/ASR/pruned_transducer_stateless2)
| | Dev | Test-Net | Test-Meeting |
|----------------------|-------|----------|--------------|
| greedy_search | 7.80 | 8.75 | 13.49 |
| fast_beam_search | 7.94 | 8.74 | 13.80 |
| modified_beam_search | 7.76 | 8.71 | 13.41 |
We provide a Colab notebook to test the pre-trained model: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1EV4e1CHa1GZgEF-bZgizqI9RyFFehIiN?usp=sharing)
#### [Transducer **Streaming** (pruned_transducer_stateless5) ](https://github.com/k2-fsa/icefall/tree/master/egs/wenetspeech/ASR/pruned_transducer_stateless5)
| | Dev | Test-Net | Test-Meeting |
|----------------------|-------|----------|--------------|
| greedy_search | 8.78 | 10.12 | 16.16 |
| fast_beam_search| 9.01 | 10.47 | 16.28 |
| modified_beam_search | 8.53| 9.95 | 15.81 |
### [Alimeeting][alimeeting]
#### [Transducer (pruned_transducer_stateless2)](https://github.com/k2-fsa/icefall/tree/master/egs/alimeeting/ASR/pruned_transducer_stateless2)
| | Eval | Test |
|----------------------|--------|----------|
| greedy_search | 31.77 | 34.66 |
| fast_beam_search | 31.39 | 33.02 |
| modified_beam_search | 30.38 | 34.25 |
We provide a Colab notebook to test the pre-trained model: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1tKr3f0mL17uO_ljdHGKtR7HOmthYHwJG?usp=sharing)
### [TAL_CSASR][tal_csasr]
#### [Transducer (pruned_transducer_stateless5)](https://github.com/k2-fsa/icefall/tree/master/egs/tal_csasr/ASR/pruned_transducer_stateless5)
The best results, reported as CER(%) for Chinese and WER(%) for English (zh: Chinese, en: English):
|decoding-method | dev | dev_zh | dev_en | test | test_zh | test_en |
|--|--|--|--|--|--|--|
|greedy_search| 7.30 | 6.48 | 19.19 |7.39| 6.66 | 19.13|
|fast_beam_search| 7.18 | 6.39| 18.90 | 7.27| 6.55 | 18.77|
|modified_beam_search| 7.15 | 6.35 | 18.95 | 7.22| 6.50 | 18.70 |
We provide a Colab notebook to test the pre-trained model: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1DmIx-NloI1CMU5GdZrlse7TRu4y3Dpf8?usp=sharing)
## TTS: Text-to-Speech
### Supported Datasets
- [LJSpeech][ljspeech]
- [VCTK][vctk]
### Supported Models
- [VITS](https://arxiv.org/abs/2106.06103)
# Deployment with C++
Once you have trained a model in icefall, you may want to deploy it with C++ without Python dependencies.
Please refer to
- https://k2-fsa.github.io/icefall/model-export/export-with-torch-jit-script.html
- https://k2-fsa.github.io/icefall/model-export/export-onnx.html
- https://k2-fsa.github.io/icefall/model-export/export-ncnn.html
for how to do this.
We also provide a Colab notebook showing you how to run a torch-scripted model in [k2][k2] with C++.
Please see: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1BIGLWzS36isskMXHKcqC9ysN6pspYXs_?usp=sharing)
[yesno]: egs/yesno/ASR
[librispeech]: egs/librispeech/ASR
[aishell]: egs/aishell/ASR
[aishell2]: egs/aishell2/ASR
[aishell4]: egs/aishell4/ASR
[timit]: egs/timit/ASR
[tedlium3]: egs/tedlium3/ASR
[gigaspeech]: egs/gigaspeech/ASR
[aidatatang_200zh]: egs/aidatatang_200zh/ASR
[wenetspeech]: egs/wenetspeech/ASR
[alimeeting]: egs/alimeeting/ASR
[tal_csasr]: egs/tal_csasr/ASR
[ami]: egs/ami
[swbd]: egs/swbd/ASR
[k2]: https://github.com/k2-fsa/k2
[commonvoice]: egs/commonvoice/ASR
[csj]: egs/csj/ASR
[libricss]: egs/libricss/SURT
[libriheavy]: egs/libriheavy/ASR
[mgb2]: egs/mgb2/ASR
[peoplespeech]: egs/peoples_speech/ASR
[spgispeech]: egs/spgispeech/ASR
[voxpopuli]: egs/voxpopuli/ASR
[xbmu-amdo31]: egs/xbmu-amdo31/ASR
[vctk]: egs/vctk/TTS
[ljspeech]: egs/ljspeech/TTS
> [!CAUTION]
> The project is under active development and is not yet ready for general use.
>

1
docs/.gitignore vendored
View File

@@ -1 +0,0 @@
build/

20
docs/Makefile
View File

@@ -1,20 +0,0 @@
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

24
docs/README.md
View File

@@ -1,24 +0,0 @@
## Usage
```bash
cd /path/to/icefall/docs
pip install -r requirements.txt
make clean
make html
cd build/html
python3 -m http.server 8000
```
It prints:
```
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
```
Open your browser and go to <http://0.0.0.0:8000/> to view the generated
documentation.
Done!
**Hint**: You can change the port number when starting the server.

35
docs/make.bat
View File

@@ -1,35 +0,0 @@
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build
if "%1" == "" goto help
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.http://sphinx-doc.org/
exit /b 1
)
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd

3
docs/requirements.txt
View File

@@ -1,3 +0,0 @@
sphinx_rtd_theme
sphinx
sphinxcontrib-youtube==1.1.0

Binary file not shown (before: 666 KiB)

103
docs/source/conf.py
View File

@@ -1,103 +0,0 @@
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))
import sphinx_rtd_theme
# -- Project information -----------------------------------------------------
project = "icefall"
copyright = "2021, icefall development team"
author = "icefall development team"
# The full version, including alpha/beta/rc tags
release = "0.1"
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
"sphinx.ext.todo",
"sphinx_rtd_theme",
"sphinxcontrib.youtube",
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = []
source_suffix = {
".rst": "restructuredtext",
}
master_doc = "index"
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = "sphinx_rtd_theme"
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
html_show_sourcelink = True
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["_static", "installation/images"]
pygments_style = "sphinx"
numfig = True
html_context = {
"display_github": True,
"github_user": "k2-fsa",
"github_repo": "icefall",
"github_version": "master",
"conf_py_path": "/docs/source/",
}
todo_include_todos = True
rst_epilog = """
.. _sherpa-ncnn: https://github.com/k2-fsa/sherpa-ncnn
.. _sherpa-onnx: https://github.com/k2-fsa/sherpa-onnx
.. _icefall: https://github.com/k2-fsa/icefall
.. _git-lfs: https://git-lfs.com/
.. _ncnn: https://github.com/tencent/ncnn
.. _LibriSpeech: https://www.openslr.org/12
.. _Gigaspeech: https://github.com/SpeechColab/GigaSpeech
.. _musan: http://www.openslr.org/17/
.. _ONNX: https://github.com/onnx/onnx
.. _onnxruntime: https://github.com/microsoft/onnxruntime
.. _torch: https://github.com/pytorch/pytorch
.. _torchaudio: https://github.com/pytorch/audio
.. _k2: https://github.com/k2-fsa/k2
.. _lhotse: https://github.com/lhotse-speech/lhotse
.. _yesno: https://www.openslr.org/1/
.. _Next-gen Kaldi: https://github.com/k2-fsa
.. _Kaldi: https://github.com/kaldi-asr/kaldi
.. _lilcom: https://github.com/danpovey/lilcom
.. _CTC: https://www.cs.toronto.edu/~graves/icml_2006.pdf
.. _kaldi-decoder: https://github.com/k2-fsa/kaldi-decoder
"""

74
docs/source/contributing/code-style.rst
View File

@@ -1,74 +0,0 @@
.. _follow the code style:
Follow the code style
=====================
We use the following tools to keep the code style as consistent as possible:
- `black <https://github.com/psf/black>`_, to format the code
- `flake8 <https://github.com/PyCQA/flake8>`_, to check the style and quality of the code
- `isort <https://github.com/PyCQA/isort>`_, to sort ``imports``
The following versions of the above tools are used:
- ``black == 22.3.0``
- ``flake8 == 5.0.4``
- ``isort == 5.10.1``
After running the following commands:
.. code-block::
$ git clone https://github.com/k2-fsa/icefall
$ cd icefall
$ pip install pre-commit
$ pre-commit install
the following checks will run **automatically** whenever you run ``git commit``:
.. figure:: images/pre-commit-check.png
:width: 600
:align: center
pre-commit hooks invoked by ``git commit`` (Failed).
If any of the above checks fail, your ``git commit`` will not succeed.
Please fix the issues reported by the check tools.
.. HINT::
Some of the check tools, i.e., ``black`` and ``isort``, will modify
the files to be committed **in-place**. So please run ``git status``
after a failure to see which files have been modified by the tools
before you make any further changes.
After fixing all the failures, run ``git commit`` again and
it should succeed this time:
.. figure:: images/pre-commit-check-success.png
:width: 600
:align: center
pre-commit hooks invoked by ``git commit`` (Succeeded).
If you want to check the style of your code before ``git commit``, you
can do the following:
.. code-block:: bash
$ pre-commit install
$ pre-commit run
Or without installing the pre-commit hooks:
.. code-block:: bash
$ cd icefall
$ pip install black==22.3.0 flake8==5.0.4 isort==5.10.1
$ black --check your_changed_file.py
$ black your_changed_file.py # modify it in-place
$
$ flake8 your_changed_file.py
$
$ isort --check your_changed_file.py
$ isort your_changed_file.py # modify it in-place

45
docs/source/contributing/doc.rst
View File

@@ -1,45 +0,0 @@
Contributing to Documentation
=============================
We use `sphinx <https://www.sphinx-doc.org/en/master/>`_
for documentation.
Before writing documentation, you have to prepare the environment:
.. code-block:: bash
$ cd docs
$ pip install -r requirements.txt
After setting up the environment, you are ready to write documentation.
Please refer to `reStructuredText Primer <https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html>`_
if you are not familiar with ``reStructuredText``.
After writing some documentation, you can build it **locally**
to preview how it will look when published:
.. code-block:: bash
$ cd docs
$ make html
The generated documentation is in ``docs/build/html`` and can be viewed
with the following commands:
.. code-block:: bash
$ cd docs/build/html
$ python3 -m http.server
It will print::
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...
Open your browser, go to `<http://0.0.0.0:8000/>`_, and you will see
the following:
.. figure:: images/doc-contrib.png
:width: 600
:align: center
View generated documentation locally with ``python3 -m http.server``.

156
docs/source/contributing/how-to-create-a-recipe.rst
View File

@@ -1,156 +0,0 @@
How to create a recipe
======================
.. HINT::
Please read :ref:`follow the code style` to adjust your code style.
.. CAUTION::
``icefall`` is designed to be as Pythonic as possible. Please use
Python in your recipe if possible.
Data Preparation
----------------
We recommend that you prepare your training/testing/validation datasets
with `lhotse <https://github.com/lhotse-speech/lhotse>`_.
Please refer to `<https://lhotse.readthedocs.io/en/latest/index.html>`_
for how to create a recipe in ``lhotse``.
.. HINT::
The ``yesno`` recipe in ``lhotse`` is a very good example.
Please refer to `<https://github.com/lhotse-speech/lhotse/pull/380>`_,
which shows how to add a new recipe to ``lhotse``.
Suppose you would like to add a recipe for a dataset named ``foo``.
You can do the following:
.. code-block::
$ cd egs
$ mkdir -p foo/ASR
$ cd foo/ASR
$ touch prepare.sh
$ chmod +x prepare.sh
If your dataset is very simple, please follow
`egs/yesno/ASR/prepare.sh <https://github.com/k2-fsa/icefall/blob/master/egs/yesno/ASR/prepare.sh>`_
to write your own ``prepare.sh``.
Otherwise, please refer to
`egs/librispeech/ASR/prepare.sh <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/prepare.sh>`_
to prepare your data.
Training
--------
Assume you have a fancy model called ``bar`` for the ``foo`` recipe; you can
organize your files in the following way:
.. code-block::
$ cd egs/foo/ASR
$ mkdir bar
$ cd bar
$ touch README.md model.py train.py decode.py asr_datamodule.py pretrained.py
For instance, the ``yesno`` recipe has a ``tdnn`` model and its directory structure
looks like the following:
.. code-block:: bash
egs/yesno/ASR/tdnn/
|-- README.md
|-- asr_datamodule.py
|-- decode.py
|-- model.py
|-- pretrained.py
`-- train.py
**File description**:
- ``README.md``
It contains information about this recipe, e.g., how to run it, what the WER is, etc.
- ``asr_datamodule.py``
It provides code to create PyTorch dataloaders for the train/test/validation datasets.
- ``decode.py``
It takes as inputs the checkpoints saved during the training stage to decode the test
dataset(s).
- ``model.py``
It contains the definition of your fancy neural network model.
- ``pretrained.py``
We can use this script to do inference with a pre-trained model.
- ``train.py``
It contains the training code. (A minimal skeleton of these files is sketched below.)
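To make these interfaces concrete, here is a minimal, hypothetical skeleton (not a real icefall recipe; a CTC-style model is assumed purely for illustration):

.. code-block:: python

   # egs/foo/ASR/bar/model.py (hypothetical skeleton)
   import torch
   import torch.nn as nn

   class Bar(nn.Module):
       def __init__(self, num_features: int, num_classes: int):
           super().__init__()
           self.proj = nn.Linear(num_features, num_classes)

       def forward(self, x: torch.Tensor) -> torch.Tensor:
           # x: (batch, time, num_features) -> per-frame log-probs for CTC
           return self.proj(x).log_softmax(dim=-1)

   # train.py would build the model and the dataloaders from
   # asr_datamodule.py, then run the training loop, e.g.:
   #     model = Bar(num_features=80, num_classes=500)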
.. HINT::
Please take a look at
- `egs/yesno/tdnn <https://github.com/k2-fsa/icefall/tree/master/egs/yesno/ASR/tdnn>`_
- `egs/librispeech/tdnn_lstm_ctc <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/tdnn_lstm_ctc>`_
- `egs/librispeech/conformer_ctc <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/conformer_ctc>`_
to get a feel for what the resulting files look like.
.. NOTE::
Every model in a recipe is kept as self-contained as possible.
We tolerate duplicate code among different recipes.
The training stage should be invocable by:
.. code-block::
$ cd egs/foo/ASR
$ ./bar/train.py
$ ./bar/train.py --help
Decoding
--------
Please refer to
- `<https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/conformer_ctc/decode.py>`_
If your model is transformer/conformer based.
- `<https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/tdnn_lstm_ctc/decode.py>`_
If your model is TDNN/LSTM based, i.e., there is no attention decoder.
- `<https://github.com/k2-fsa/icefall/blob/master/egs/yesno/ASR/tdnn/decode.py>`_
If there is no LM rescoring.
The decoding stage should be invocable by:
.. code-block::
$ cd egs/foo/ASR
$ ./bar/decode.py
$ ./bar/decode.py --help
Pre-trained model
-----------------
Please demonstrate how to use your model for inference in ``egs/foo/ASR/bar/pretrained.py``.
If possible, please consider creating a Colab notebook to show that.

Binary file not shown (before: 198 KiB)

Binary file not shown (before: 153 KiB)

Binary file not shown (before: 214 KiB)

22
docs/source/contributing/index.rst
View File

@@ -1,22 +0,0 @@
Contributing
============
Contributions to ``icefall`` are very welcome.
There are many possible ways to make contributions and
two of them are:
- To write documentation
- To write code
- (1) To follow the code style in the repository
- (2) To write a new recipe
On this page, we describe how to contribute documentation
and code to ``icefall``.
.. toctree::
:maxdepth: 2
doc
code-style
how-to-create-a-recipe

187
docs/source/decoding-with-langugage-models/LODR.rst
View File

@@ -1,187 +0,0 @@
.. _LODR:
LODR for RNN Transducer
=======================
As a type of E2E model, neural transducers are usually considered to have an internal
language model, which learns language-level information from the training corpus.
In real-life scenarios, there is often a mismatch between the training corpus and the target corpus.
This mismatch can be a problem when decoding neural transducer models with external language models,
as the internal language model can act "against" the external LM. In this tutorial, we show how to use
`Low-order Density Ratio <https://arxiv.org/abs/2203.16776>`_ (LODR) to alleviate this effect and further improve the performance
of language model integration.
.. note::
This tutorial is based on the recipe
`pruned_transducer_stateless7_streaming <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming>`_,
which is a streaming transducer model trained on `LibriSpeech`_.
However, you can easily apply LODR to other recipes.
If you encounter any problems, please open an issue at `icefall <https://github.com/k2-fsa/icefall/issues>`__.
.. note::
For simplicity, the training and testing corpus in this tutorial are the same (`LibriSpeech`_). However,
you can change the testing set to any other domain (e.g., `GigaSpeech`_) and prepare the language models
using that corpus.
First, let's have a look at some background information. As the predecessor of LODR, Density Ratio (DR) was first proposed `here <https://arxiv.org/abs/2002.11268>`_
to address the language information mismatch between the training
corpus (source domain) and the testing corpus (target domain). Assuming that the source domain and the test domain
are acoustically similar, DR derives the following decoding formula via Bayes' theorem:
.. math::
\text{score}\left(y_u|\mathit{x},y\right) =
\log p\left(y_u|\mathit{x},y_{1:u-1}\right) +
\lambda_1 \log p_{\text{Target LM}}\left(y_u|\mathit{x},y_{1:u-1}\right) -
\lambda_2 \log p_{\text{Source LM}}\left(y_u|\mathit{x},y_{1:u-1}\right)
where :math:`\lambda_1` and :math:`\lambda_2` are the weights of the LM scores for the target domain and the source domain, respectively.
Here, the source-domain LM is trained on the training corpus. The only difference in the above formula compared to
shallow fusion is the subtraction of the source-domain LM score.
Some works treat the predictor and the joiner of the neural transducer as its internal LM. However, this internal LM is
considered weak and able to capture only low-level language information. Therefore, `LODR <https://arxiv.org/abs/2203.16776>`__ proposes to use
a low-order n-gram LM as an approximation of the ILM of the neural transducer, leading to the following decoding formula
for transducer models:
.. math::
\text{score}\left(y_u|\mathit{x},y\right) =
\log p_{rnnt}\left(y_u|\mathit{x},y_{1:u-1}\right) +
\lambda_1 \log p_{\text{Target LM}}\left(y_u|\mathit{x},y_{1:u-1}\right) -
\lambda_2 \log p_{\text{bi-gram}}\left(y_u|\mathit{x},y_{1:u-1}\right)
In LODR, an additional bi-gram LM estimated on the source domain (e.g., the training corpus) is required. Compared to DR,
the only difference lies in the choice of the source-domain LM. According to the original `paper <https://arxiv.org/abs/2203.16776>`_,
LODR achieves performance similar to DR in both intra-domain and cross-domain settings.
Since a bi-gram is much cheaper to evaluate than a neural LM, LODR is usually much faster.
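In code, LODR amounts to a per-token interpolation of log-probabilities inside beam search. A minimal sketch (variable names are illustrative, not icefall's actual implementation):

.. code-block:: python

   import torch

   def lodr_score(logp_rnnt: torch.Tensor,
                  logp_target_lm: torch.Tensor,
                  logp_bigram: torch.Tensor,
                  lm_scale: float = 0.42,
                  lodr_scale: float = -0.24) -> torch.Tensor:
       """Combine per-token log-probs over the vocabulary for one beam step.

       ``lodr_scale`` is negative: the bi-gram score, which approximates
       the internal LM, is subtracted.
       """
       return logp_rnnt + lm_scale * logp_target_lm + lodr_scale * logp_bigram

The default values above match the ``lm_scale`` and ``LODR_scale`` used in the decoding command later in this tutorial.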
Now, we will show you how to use LODR in ``icefall``.
For illustration purposes, we will use a pre-trained ASR model from this `link <https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29>`_.
If you want to train your model from scratch, please have a look at :ref:`non_streaming_librispeech_pruned_transducer_stateless`.
The testing scenario here is intra-domain (we decode a model trained on `LibriSpeech`_ on the `LibriSpeech`_ test sets).
As the initial step, let's download the pre-trained model.
.. code-block:: bash
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29
$ cd icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$ git lfs pull --include "pretrained.pt"
$ ln -s pretrained.pt epoch-99.pt # create a symbolic link so that the checkpoint can be loaded
$ cd ../data/lang_bpe_500
$ git lfs pull --include bpe.model
$ cd ../../..
To test the model, let's have a look at the decoding results **without** using an LM. This can be done via the following command:
.. code-block:: bash
$ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/
$ ./pruned_transducer_stateless7_streaming/decode.py \
--epoch 99 \
--avg 1 \
--use-averaged-model False \
--exp-dir $exp_dir \
--bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
--max-duration 600 \
--decode-chunk-len 32 \
--decoding-method modified_beam_search
The following WERs are achieved on test-clean and test-other:
.. code-block:: text
$ For test-clean, WER of different settings are:
$ beam_size_4 3.11 best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4 7.93 best for test-other
Then, we download the external language model and the bi-gram LM that are necessary for LODR.
Note that the bi-gram is estimated on the LibriSpeech 960-hour text.
.. code-block:: bash
$ # download the external LM
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm
$ # create a symbolic link so that the checkpoint can be loaded
$ pushd icefall-librispeech-rnn-lm/exp
$ git lfs pull --include "pretrained.pt"
$ ln -s pretrained.pt epoch-99.pt
$ popd
$
$ # download the bi-gram
$ git lfs install
$ git clone https://huggingface.co/marcoyang/librispeech_bigram
$ pushd data/lang_bpe_500
$ ln -s ../../librispeech_bigram/2gram.fst.txt .
$ popd
Then, we perform LODR decoding by setting ``--decoding-method`` to ``modified_beam_search_LODR``:
.. code-block:: bash
$ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$ lm_dir=./icefall-librispeech-rnn-lm/exp
$ lm_scale=0.42
$ LODR_scale=-0.24
$ ./pruned_transducer_stateless7_streaming/decode.py \
--epoch 99 \
--avg 1 \
--use-averaged-model False \
--beam-size 4 \
--exp-dir $exp_dir \
--max-duration 600 \
--decode-chunk-len 32 \
--decoding-method modified_beam_search_LODR \
--bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
--use-shallow-fusion 1 \
--lm-type rnn \
--lm-exp-dir $lm_dir \
--lm-epoch 99 \
--lm-scale $lm_scale \
--lm-avg 1 \
--rnn-lm-embedding-dim 2048 \
--rnn-lm-hidden-dim 2048 \
--rnn-lm-num-layers 3 \
--lm-vocab-size 500 \
--tokens-ngram 2 \
--ngram-lm-scale $LODR_scale
There are two extra arguments that need to be given when doing LODR. ``--tokens-ngram`` specifies the order of the n-gram; as we
are using a bi-gram, we set it to 2. ``--ngram-lm-scale`` is the scale of the bi-gram; it should be a negative number,
as we are subtracting the bi-gram's score during decoding.
The decoding results obtained with the above command are shown below:
.. code-block:: text
$ For test-clean, WER of different settings are:
$ beam_size_4 2.61 best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4 6.74 best for test-other
Recall that the lowest WER we obtained in :ref:`shallow_fusion` with a beam size of 4 is ``2.77/7.08``; LODR
indeed **further improves** the WER. We can do even better if we increase ``--beam-size``:
.. list-table:: WER of LODR with different beam sizes
:widths: 25 25 50
:header-rows: 1
* - Beam size
- test-clean
- test-other
* - 4
- 2.61
- 6.74
* - 8
- 2.45
- 6.38
* - 12
- 2.4
- 6.23

34
docs/source/decoding-with-langugage-models/index.rst
View File

@@ -1,34 +0,0 @@
Decoding with language models
=============================
This section describes how to use external language models
during decoding to improve the WER of transducer models. To train an external language model,
please refer to this tutorial: :ref:`train_nnlm`.
The following decoding methods with external language models are available (the per-token score formulas
for the two fusion-style methods are sketched after the table):
.. list-table::
:widths: 25 50
:header-rows: 1
* - Decoding method
- Description
* - ``modified_beam_search``
- Beam search (i.e., truly n-best decoding; the "beam" is the value of n), similar to the original RNN-T paper. Note that this method does not use a language model.
* - ``modified_beam_search_lm_shallow_fusion``
- As ``modified_beam_search``, but interpolates RNN-T scores with language model scores; this is also known as shallow fusion.
* - ``modified_beam_search_LODR``
- As ``modified_beam_search_lm_shallow_fusion``, but subtracts the score of a (BPE-symbol-level) bigram backoff language model, used as an approximation to the internal language model of RNN-T.
* - ``modified_beam_search_lm_rescore``
- As ``modified_beam_search``, but rescores the n-best hypotheses with an external language model (e.g., an RNN LM) and re-ranks them.
* - ``modified_beam_search_lm_rescore_LODR``
- As ``modified_beam_search_lm_rescore``, but also subtracts the score of a (BPE-symbol-level) bigram backoff language model during re-ranking.
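For the two fusion-style methods, the per-token scores roughly follow the formulas below (a sketch in the notation of the LODR tutorial; the :math:`\lambda` values correspond to the ``--lm-scale`` and ``--ngram-lm-scale`` options):

.. math::

   \text{shallow fusion:}\quad
   \log p_{rnnt}\left(y_u|\mathit{x},y_{1:u-1}\right) +
   \lambda_1 \log p_{\text{LM}}\left(y_u|y_{1:u-1}\right)

   \text{LODR:}\quad
   \log p_{rnnt}\left(y_u|\mathit{x},y_{1:u-1}\right) +
   \lambda_1 \log p_{\text{LM}}\left(y_u|y_{1:u-1}\right) -
   \lambda_2 \log p_{\text{bi-gram}}\left(y_u|y_{1:u-1}\right)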
.. toctree::
:maxdepth: 2
shallow-fusion
LODR
rescoring

255
docs/source/decoding-with-langugage-models/rescoring.rst
View File

@@ -1,255 +0,0 @@
.. _rescoring:
LM rescoring for Transducer
=================================
LM rescoring is a commonly used approach to incorporate external LM information. Unlike shallow-fusion-based
methods (see :ref:`shallow_fusion`, :ref:`LODR`), rescoring is usually performed to re-rank the n-best hypotheses after beam search.
Rescoring is usually more efficient than shallow fusion since less computation is performed on the external LM.
In this tutorial, we will show you how to use an external LM to rescore the n-best hypotheses decoded from neural transducer models in
`icefall <https://github.com/k2-fsa/icefall>`__.
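Concretely, rescoring scores each complete hypothesis in the n-best list once with the external LM and then re-ranks, instead of querying the LM at every beam-search step. A minimal sketch (``lm`` is a hypothetical callable returning the total LM log-probability of a token sequence; this is not icefall's actual API):

.. code-block:: python

   def rescore_nbest(hyps, lm, lm_scale=0.43):
       # hyps: list of (tokens, rnnt_score) pairs from beam search.
       rescored = [
           (tokens, rnnt_score + lm_scale * lm(tokens))
           for tokens, rnnt_score in hyps
       ]
       # Return the hypotheses re-ranked by the combined score.
       return sorted(rescored, key=lambda h: h[1], reverse=True)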
.. note::
This tutorial is based on the recipe
`pruned_transducer_stateless7_streaming <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming>`_,
which is a streaming transducer model trained on `LibriSpeech`_.
However, you can easily apply LM rescoring to other recipes.
If you encounter any problems, please open an issue `here <https://github.com/k2-fsa/icefall/issues>`_.
.. note::
For simplicity, the training and testing corpus in this tutorial are the same (`LibriSpeech`_). However, you can change the testing set
to any other domain (e.g., `GigaSpeech`_) and use an external LM trained on that domain.
.. HINT::
We recommend using a GPU for decoding.
For illustration purposes, we will use a pre-trained ASR model from this `link <https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29>`__.
If you want to train your model from scratch, please have a look at :ref:`non_streaming_librispeech_pruned_transducer_stateless`.
As the initial step, let's download the pre-trained model.
.. code-block:: bash
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29
$ cd icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$ git lfs pull --include "pretrained.pt"
$ ln -s pretrained.pt epoch-99.pt # create a symbolic link so that the checkpoint can be loaded
$ cd ../data/lang_bpe_500
$ git lfs pull --include bpe.model
$ cd ../../..
As usual, we first test the model's performance without an external LM. This can be done via the following command:
.. code-block:: bash
$ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/
$ ./pruned_transducer_stateless7_streaming/decode.py \
--epoch 99 \
--avg 1 \
--use-averaged-model False \
--exp-dir $exp_dir \
--bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
--max-duration 600 \
--decode-chunk-len 32 \
--decoding-method modified_beam_search
The following WERs are achieved on test-clean and test-other:
.. code-block:: text
$ For test-clean, WER of different settings are:
$ beam_size_4 3.11 best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4 7.93 best for test-other
Now, we will try to improve the above WER numbers via external LM rescoring. We will download
a pre-trained LM from this `link <https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm>`__.
.. note::
This is an RNN LM trained on the LibriSpeech text corpus, so it might not be ideal for other corpora.
You may also train an RNN LM from scratch. Please refer to this `script <https://github.com/k2-fsa/icefall/blob/master/icefall/rnn_lm/train.py>`__
for training an RNN LM and this `script <https://github.com/k2-fsa/icefall/blob/master/icefall/transformer_lm/train.py>`__ to train a transformer LM.
.. code-block:: bash
$ # download the external LM
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm
$ # create a symbolic link so that the checkpoint can be loaded
$ pushd icefall-librispeech-rnn-lm/exp
$ git lfs pull --include "pretrained.pt"
$ ln -s pretrained.pt epoch-99.pt
$ popd
With the RNN LM available, we can rescore the n-best hypotheses generated by ``modified_beam_search``. Here,
``n`` is the number of beams, i.e., the value of ``--beam-size``. The command for LM rescoring is
as follows. Note that ``--decoding-method`` is set to ``modified_beam_search_lm_rescore`` and ``--use-shallow-fusion``
is set to ``0``.
.. code-block:: bash
$ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$ lm_dir=./icefall-librispeech-rnn-lm/exp
$ lm_scale=0.43
$ ./pruned_transducer_stateless7_streaming/decode.py \
--epoch 99 \
--avg 1 \
--use-averaged-model False \
--beam-size 4 \
--exp-dir $exp_dir \
--max-duration 600 \
--decode-chunk-len 32 \
--decoding-method modified_beam_search_lm_rescore \
--bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
--use-shallow-fusion 0 \
--lm-type rnn \
--lm-exp-dir $lm_dir \
--lm-epoch 99 \
--lm-scale $lm_scale \
--lm-avg 1 \
--rnn-lm-embedding-dim 2048 \
--rnn-lm-hidden-dim 2048 \
--rnn-lm-num-layers 3 \
--lm-vocab-size 500
.. code-block:: text
$ For test-clean, WER of different settings are:
$ beam_size_4 2.93 best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4 7.6 best for test-other
Great! We made some improvements. Increasing the size of the n-best list will further boost the performance;
see the following table:
.. list-table:: WERs of LM rescoring with different beam sizes
:widths: 25 25 25
:header-rows: 1
* - Beam size
- test-clean
- test-other
* - 4
- 2.93
- 7.6
* - 8
- 2.67
- 7.11
* - 12
- 2.59
- 6.86
In fact, we can also apply LODR (see :ref:`LODR`) when doing LM rescoring. To do so, we need to
download the bi-gram required by LODR:
.. code-block:: bash
$ # download the bi-gram
$ git lfs install
$ git clone https://huggingface.co/marcoyang/librispeech_bigram
$ pushd data/lang_bpe_500
$ ln -s ../../librispeech_bigram/2gram.arpa .
$ popd
Then we can perform LM rescoring + LODR by changing the decoding method to ``modified_beam_search_lm_rescore_LODR``.
.. note::
This decoding method requires `kenlm <https://github.com/kpu/kenlm>`_. You can install it
via: ``pip install https://github.com/kpu/kenlm/archive/master.zip``.
.. code-block:: bash
$ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$ lm_dir=./icefall-librispeech-rnn-lm/exp
$ lm_scale=0.43
$ ./pruned_transducer_stateless7_streaming/decode.py \
--epoch 99 \
--avg 1 \
--use-averaged-model False \
--beam-size 4 \
--exp-dir $exp_dir \
--max-duration 600 \
--decode-chunk-len 32 \
--decoding-method modified_beam_search_lm_rescore_LODR \
--bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
--use-shallow-fusion 0 \
--lm-type rnn \
--lm-exp-dir $lm_dir \
--lm-epoch 99 \
--lm-scale $lm_scale \
--lm-avg 1 \
--rnn-lm-embedding-dim 2048 \
--rnn-lm-hidden-dim 2048 \
--rnn-lm-num-layers 3 \
--lm-vocab-size 500
You should see the following WERs after executing the commands above:
.. code-block:: text
$ For test-clean, WER of different settings are:
$ beam_size_4 2.9 best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4 7.57 best for test-other
This is slightly better than LM rescoring alone. If we further increase the beam size, we see
further improvements from LM rescoring + LODR:
.. list-table:: WERs of LM rescoring + LODR with different beam sizes
:widths: 25 25 25
:header-rows: 1
* - Beam size
- test-clean
- test-other
* - 4
- 2.9
- 7.57
* - 8
- 2.63
- 7.04
* - 12
- 2.52
- 6.73
As mentioned earlier, LM rescoring is usually faster than shallow-fusion-based methods.
Here, we benchmark their WERs and decoding speed:
.. list-table:: LM-rescoring-based methods vs. shallow-fusion-based methods (each field shows WER on test-clean / WER on test-other; decoding time on test-clean)
:widths: 25 25 25 25
:header-rows: 1
* - Decoding method
- beam=4
- beam=8
- beam=12
* - ``modified_beam_search``
- 3.11/7.93; 132s
- 3.1/7.95; 177s
- 3.1/7.96; 210s
* - ``modified_beam_search_lm_shallow_fusion``
- 2.77/7.08; 262s
- 2.62/6.65; 352s
- 2.58/6.65; 488s
* - ``modified_beam_search_LODR``
- 2.61/6.74; 400s
- 2.45/6.38; 610s
- 2.4/6.23; 870s
* - ``modified_beam_search_lm_rescore``
- 2.93/7.6; 156s
- 2.67/7.11; 203s
- 2.59/6.86; 255s
* - ``modified_beam_search_lm_rescore_LODR``
- 2.9/7.57; 160s
- 2.63/7.04; 203s
- 2.52/6.73; 263s
.. note::
Decoding is performed with a single 32G V100; we set ``--max-duration`` to 600.
Decoding times here are only for reference and may vary.

179
docs/source/decoding-with-langugage-models/shallow-fusion.rst
View File

@@ -1,179 +0,0 @@
.. _shallow_fusion:
Shallow fusion for Transducer
=================================
External language models (LMs) are commonly used to improve WERs for E2E ASR models.
This tutorial shows you how to perform ``shallow fusion`` with an external LM
to improve the word error rate of a transducer model.
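At each step of beam search, shallow fusion simply interpolates the transducer's per-token log-probabilities with those of the external LM. A minimal sketch (illustrative names, not icefall's actual code):

.. code-block:: python

   import torch

   def shallow_fusion_score(logp_rnnt: torch.Tensor,
                            logp_lm: torch.Tensor,
                            lm_scale: float = 0.29) -> torch.Tensor:
       # Both tensors hold per-token log-probs over the vocabulary for
       # the current beam step; the default lm_scale matches the value
       # used in the decoding command below.
       return logp_rnnt + lm_scale * logp_lm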
.. note::
This tutorial is based on the recipe
`pruned_transducer_stateless7_streaming <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming>`_,
which is a streaming transducer model trained on `LibriSpeech`_.
However, you can easily apply shallow fusion to other recipes.
If you encounter any problems, please open an issue at `icefall <https://github.com/k2-fsa/icefall/issues>`_.
.. note::
For simplicity, the training and testing corpus in this tutorial are the same (`LibriSpeech`_). However, you can change the testing set
to any other domain (e.g., `GigaSpeech`_) and use an external LM trained on that domain.
.. HINT::
We recommend using a GPU for decoding.
For illustration purposes, we will use a pre-trained ASR model from this `link <https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29>`__.
If you want to train your model from scratch, please have a look at :ref:`non_streaming_librispeech_pruned_transducer_stateless`.
As the initial step, let's download the pre-trained model.
.. code-block:: bash
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29
$ cd icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$ git lfs pull --include "pretrained.pt"
$ ln -s pretrained.pt epoch-99.pt # create a symbolic link so that the checkpoint can be loaded
$ cd ../data/lang_bpe_500
$ git lfs pull --include bpe.model
$ cd ../../..
To test the model, let's have a look at the decoding results without using an LM. This can be done via the following command:
.. code-block:: bash
$ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/
$ ./pruned_transducer_stateless7_streaming/decode.py \
--epoch 99 \
--avg 1 \
--use-averaged-model False \
--exp-dir $exp_dir \
--bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
--max-duration 600 \
--decode-chunk-len 32 \
--decoding-method modified_beam_search
The following WERs are achieved on test-clean and test-other:
.. code-block:: text
$ For test-clean, WER of different settings are:
$ beam_size_4 3.11 best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4 7.93 best for test-other
These are already good numbers! But we can further improve them by using shallow fusion with an external LM.
Training a language model usually takes a long time, so we can simply download a pre-trained LM from this `link <https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm>`__.
.. code-block:: bash
$ # download the external LM
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm
$ # create a symbolic link so that the checkpoint can be loaded
$ pushd icefall-librispeech-rnn-lm/exp
$ git lfs pull --include "pretrained.pt"
$ ln -s pretrained.pt epoch-99.pt
$ popd
.. note::
This is an RNN LM trained on the LibriSpeech text corpus, so it might not be ideal for other corpora.
You may also train an RNN LM from scratch. Please refer to this `script <https://github.com/k2-fsa/icefall/blob/master/icefall/rnn_lm/train.py>`__
for training an RNN LM and this `script <https://github.com/k2-fsa/icefall/blob/master/icefall/transformer_lm/train.py>`__ to train a transformer LM.
To use shallow fusion for decoding, we can execute the following command:
.. code-block:: bash
$ exp_dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
$ lm_dir=./icefall-librispeech-rnn-lm/exp
$ lm_scale=0.29
$ ./pruned_transducer_stateless7_streaming/decode.py \
--epoch 99 \
--avg 1 \
--use-averaged-model False \
--beam-size 4 \
--exp-dir $exp_dir \
--max-duration 600 \
--decode-chunk-len 32 \
--decoding-method modified_beam_search_lm_shallow_fusion \
--bpe-model ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model \
--use-shallow-fusion 1 \
--lm-type rnn \
--lm-exp-dir $lm_dir \
--lm-epoch 99 \
--lm-scale $lm_scale \
--lm-avg 1 \
--rnn-lm-embedding-dim 2048 \
--rnn-lm-hidden-dim 2048 \
--rnn-lm-num-layers 3 \
--lm-vocab-size 500
Note that we set ``--decoding-method modified_beam_search_lm_shallow_fusion`` and ``--use-shallow-fusion 1``
to use shallow fusion. ``--lm-type`` specifies the type of neural LM to use; you can choose
between ``rnn`` and ``transformer``. The following three arguments configure the RNN LM (a rough structural sketch follows the list):
- ``--rnn-lm-embedding-dim``
The embedding dimension of the RNN LM
- ``--rnn-lm-hidden-dim``
The hidden dimension of the RNN LM
- ``--rnn-lm-num-layers``
The number of RNN layers in the RNN LM.
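For reference, an RNN LM with these hyper-parameters might look roughly like the following sketch (illustrative only; see the training script linked above for the real model):

.. code-block:: python

   import torch.nn as nn

   class RnnLm(nn.Module):
       # Dimensions mirror the flags above: --rnn-lm-embedding-dim 2048,
       # --rnn-lm-hidden-dim 2048, --rnn-lm-num-layers 3, --lm-vocab-size 500.
       def __init__(self, vocab_size=500, embedding_dim=2048,
                    hidden_dim=2048, num_layers=3):
           super().__init__()
           self.embedding = nn.Embedding(vocab_size, embedding_dim)
           self.rnn = nn.LSTM(embedding_dim, hidden_dim,
                              num_layers=num_layers, batch_first=True)
           self.output = nn.Linear(hidden_dim, vocab_size)

       def forward(self, x, state=None):
           y, state = self.rnn(self.embedding(x), state)
           return self.output(y), state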
The decoding results obtained with the above command are shown below.
.. code-block:: text
$ For test-clean, WER of different settings are:
$ beam_size_4 2.77 best for test-clean
$ For test-other, WER of different settings are:
$ beam_size_4 7.08 best for test-other
The improvement from shallow fusion is very obvious! The relative WER reduction on test-other is around 10.7%.
A few parameters can be tuned to further boost the performance of shallow fusion:
- ``--lm-scale``
Controls the scale of the LM. If it is too small, the external language model may not be fully utilized; if it is too large,
the LM score may dominate during decoding, leading to a bad WER. A typical value is around 0.3.
- ``--beam-size``
The number of active paths in the search beam. It controls the trade-off between decoding efficiency and accuracy.
Here, we also show how ``--beam-size`` affects the WER and decoding time:
.. list-table:: WERs and decoding time (on test-clean) of shallow fusion with different beam sizes
:widths: 25 25 25 25
:header-rows: 1
* - Beam size
- test-clean
- test-other
- Decoding time on test-clean (s)
* - 4
- 2.77
- 7.08
- 262
* - 8
- 2.62
- 6.65
- 352
* - 12
- 2.58
- 6.65
- 488
As we can see, a larger beam size during shallow fusion improves the WER, but decoding is also slower.

Binary file not shown (before: 356 KiB)

17
docs/source/docker/index.rst
View File

@@ -1,17 +0,0 @@
.. _icefall_docker:
Docker
======
This section describes how to use pre-built docker images to run `icefall`_.
.. hint::
If you only have CPUs available, you can still use the pre-built docker
images.
.. toctree::
:maxdepth: 2
./intro.rst

223
docs/source/docker/intro.rst
View File

@@ -1,223 +0,0 @@
Introduction
=============
We have pre-built docker images hosted at the following address:
`<https://hub.docker.com/repository/docker/k2fsa/icefall/general>`_
.. figure:: img/docker-hub.png
:width: 600
:align: center
You can find the ``Dockerfile`` at `<https://github.com/k2-fsa/icefall/tree/master/docker>`_.
We describe the following items in this section:
- How to view available tags
- How to download pre-built docker images
- How to run the `yesno`_ recipe within a docker container on ``CPU``
View available tags
===================
CUDA-enabled docker images
--------------------------
You can use the following command to view available tags for CUDA-enabled
docker images:
.. code-block:: bash
curl -s 'https://registry.hub.docker.com/v2/repositories/k2fsa/icefall/tags/'|jq '."results"[]["name"]'
which will give you output like the following:
.. code-block:: bash
"torch2.4.1-cuda12.4"
"torch2.4.1-cuda12.1"
"torch2.4.1-cuda11.8"
"torch2.4.0-cuda12.4"
"torch2.4.0-cuda12.1"
"torch2.4.0-cuda11.8"
"torch2.3.1-cuda12.1"
"torch2.3.1-cuda11.8"
"torch2.2.2-cuda12.1"
"torch2.2.2-cuda11.8"
"torch2.2.1-cuda12.1"
"torch2.2.1-cuda11.8"
"torch2.2.0-cuda12.1"
"torch2.2.0-cuda11.8"
"torch2.1.0-cuda12.1"
"torch2.1.0-cuda11.8"
"torch2.0.0-cuda11.7"
"torch1.12.1-cuda11.3"
"torch1.9.0-cuda10.2"
"torch1.13.0-cuda11.6"
.. hint::
Available tags will be updated when there are new releases of `torch`_.
Please select an appropriate combination of `torch`_ and CUDA.
CPU-only docker images
----------------------
To view CPU-only docker images, please visit `<https://github.com/k2-fsa/icefall/pkgs/container/icefall>`_
for available tags.
You can select different combinations of ``Python`` and ``torch``. For instance,
to select ``Python 3.8`` and ``torch 2.1.2``, you can use the following tag
.. code-block:: bash
cpu-py3.8-torch2.1.2-v1.1
where ``v1.1`` is the current version of the docker image. You may see
``ghcr.io/k2-fsa/icefall:cpu-py3.8-torch2.1.2-v1.2`` or some other version. We recommend
that you always use the latest version.
Download a docker image (CUDA)
==============================
Suppose that you select the tag ``torch1.13.0-cuda11.6``; you can then use
the following command to download it:
.. code-block:: bash
sudo docker image pull k2fsa/icefall:torch1.13.0-cuda11.6
Download a docker image (CPU)
==============================
Suppose that you select the tag ``cpu-py3.8-torch2.1.2-v1.1``; you can then use
the following command to download it:
.. code-block:: bash
sudo docker pull ghcr.io/k2-fsa/icefall:cpu-py3.8-torch2.1.2-v1.1
Run a docker image with GPU
===========================
.. code-block:: bash
sudo docker run --gpus all --rm -it k2fsa/icefall:torch1.13.0-cuda11.6 /bin/bash
Run a docker image with CPU
===========================
.. code-block:: bash
sudo docker run --rm -it ghcr.io/k2-fsa/icefall:cpu-py3.8-torch2.1.2-v1.1 /bin/bash
Run yesno within a docker container
===================================
After starting the container, the following interface is presented:
.. code-block:: bash
# GPU-enabled docker
root@60c947eac59c:/workspace/icefall#
# CPU-only docker
root@60c947eac59c:# mkdir /workspace; git clone https://github.com/k2-fsa/icefall
root@60c947eac59c:# export PYTHONPATH=/workspace/icefall:$PYTHONPATH
It shows that the current user is ``root`` and the current working directory
is ``/workspace/icefall``.
Update the code
---------------
Please first run:
.. code-block:: bash
root@60c947eac59c:/workspace/icefall# git pull
so that your local copy contains the latest code.
Data preparation
----------------
Now we can use
.. code-block:: bash
root@60c947eac59c:/workspace/icefall# cd egs/yesno/ASR/
to switch to the ``yesno`` recipe and run
.. code-block:: bash
root@60c947eac59c:/workspace/icefall/egs/yesno/ASR# ./prepare.sh
.. hint::
If you run a GPU-enabled docker image on a machine without GPUs, it may report the following error:
.. code-block:: bash
File "/opt/conda/lib/python3.9/site-packages/k2/__init__.py", line 23, in <module>
from _k2 import DeterminizeWeightPushingType
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
We can use the following command to fix it:
.. code-block:: bash
root@60c947eac59c:/workspace/icefall/egs/yesno/ASR# ln -s /opt/conda/lib/stubs/libcuda.so /opt/conda/lib/stubs/libcuda.so.1
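After creating the symlink, you can check that ``k2`` imports successfully, for
example:

.. code-block:: bash

   root@60c947eac59c:/workspace/icefall/egs/yesno/ASR# python3 -c "import k2; print(k2.__file__)"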
The logs of running ``./prepare.sh`` are listed below:
.. literalinclude:: ./log/log-preparation.txt
Training
--------
After preparing the data, we can start training with the following command
.. code-block:: bash
root@60c947eac59c:/workspace/icefall/egs/yesno/ASR# ./tdnn/train.py
All of the training logs are given below:
.. hint::
   Training runs on CPU and takes only 16 seconds for this run.
.. literalinclude:: ./log/log-train-2023-08-01-01-55-27
Decoding
--------
After training, we can decode the trained model with
.. code-block:: bash
root@60c947eac59c:/workspace/icefall/egs/yesno/ASR# ./tdnn/decode.py
The decoding logs are given below:
.. code-block:: bash
2023-08-01 02:06:22,400 INFO [decode.py:263] Decoding started
2023-08-01 02:06:22,400 INFO [decode.py:264] {'exp_dir': PosixPath('tdnn/exp'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 23, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'epoch': 14, 'avg': 2, 'export': False, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 30.0, 'bucketing_sampler': False, 'num_buckets': 10, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': False, 'return_cuts': True, 'num_workers': 2, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '4c05309499a08454997adf500b56dcc629e35ae5', 'k2-git-date': 'Tue Jul 25 16:23:36 2023', 'lhotse-version': '1.16.0.dev+git.7640d663.clean', 'torch-version': '1.13.0', 'torch-cuda-available': False, 'torch-cuda-version': '11.6', 'python-version': '3.9', 'icefall-git-branch': 'master', 'icefall-git-sha1': '375520d-clean', 'icefall-git-date': 'Fri Jul 28 07:43:08 2023', 'icefall-path': '/workspace/icefall', 'k2-path': '/opt/conda/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/opt/conda/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': '60c947eac59c', 'IP address': '172.17.0.2'}}
2023-08-01 02:06:22,401 INFO [lexicon.py:168] Loading pre-compiled data/lang_phone/Linv.pt
2023-08-01 02:06:22,403 INFO [decode.py:273] device: cpu
2023-08-01 02:06:22,406 INFO [decode.py:291] averaging ['tdnn/exp/epoch-13.pt', 'tdnn/exp/epoch-14.pt']
2023-08-01 02:06:22,424 INFO [asr_datamodule.py:218] About to get test cuts
2023-08-01 02:06:22,425 INFO [asr_datamodule.py:252] About to get test cuts
2023-08-01 02:06:22,504 INFO [decode.py:204] batch 0/?, cuts processed until now is 4
[W NNPACK.cpp:53] Could not initialize NNPACK! Reason: Unsupported hardware.
2023-08-01 02:06:22,687 INFO [decode.py:241] The transcripts are stored in tdnn/exp/recogs-test_set.txt
2023-08-01 02:06:22,688 INFO [utils.py:564] [test_set] %WER 0.42% [1 / 240, 0 ins, 1 del, 0 sub ]
2023-08-01 02:06:22,690 INFO [decode.py:249] Wrote detailed error stats to tdnn/exp/errs-test_set.txt
2023-08-01 02:06:22,690 INFO [decode.py:316] Done!
Congratulations! You have successfully finished running `icefall`_ within a docker container.

View File

@ -1,107 +0,0 @@
Frequently Asked Questions (FAQs)
=================================
In this section, we collect issues reported by users and post the corresponding
solutions.
OSError: libtorch_hip.so: cannot open shared object file: no such file or directory
-----------------------------------------------------------------------------------
One user used the following command to install ``torch`` and ``torchaudio``:
.. code-block:: bash
pip install \
torch==1.10.0+cu111 \
torchvision==0.11.0+cu111 \
torchaudio==0.10.0 \
-f https://download.pytorch.org/whl/torch_stable.html
and it throws the following error when running ``tdnn/train.py``:
.. code-block::
OSError: libtorch_hip.so: cannot open shared object file: no such file or directory
The fix is to specify the CUDA version when installing ``torchaudio``. That
is, change ``torchaudio==0.10.0`` to ``torchaudio==0.10.0+cu111``. Therefore,
the correct command is:
.. code-block:: bash
pip install \
torch==1.10.0+cu111 \
torchvision==0.11.0+cu111 \
torchaudio==0.10.0+cu111 \
-f https://download.pytorch.org/whl/torch_stable.html
AttributeError: module 'distutils' has no attribute 'version'
-------------------------------------------------------------
The error log is:
.. code-block::
Traceback (most recent call last):
File "./tdnn/train.py", line 14, in <module>
from asr_datamodule import YesNoAsrDataModule
File "/home/xxx/code/next-gen-kaldi/icefall/egs/yesno/ASR/tdnn/asr_datamodule.py", line 34, in <module>
from icefall.dataset.datamodule import DataModule
File "/home/xxx/code/next-gen-kaldi/icefall/icefall/__init__.py", line 3, in <module>
from . import (
File "/home/xxx/code/next-gen-kaldi/icefall/icefall/decode.py", line 23, in <module>
from icefall.utils import add_eos, add_sos, get_texts
File "/home/xxx/code/next-gen-kaldi/icefall/icefall/utils.py", line 39, in <module>
from torch.utils.tensorboard import SummaryWriter
File "/home/xxx/tool/miniconda3/envs/yyy/lib/python3.8/site-packages/torch/utils/tensorboard/__init__.py", line 4, in <module>
LooseVersion = distutils.version.LooseVersion
AttributeError: module 'distutils' has no attribute 'version'
The fix is:
.. code-block:: bash
pip uninstall setuptools
pip install setuptools==58.0.4
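To verify that the downgrade took effect, you can, for example, print the
installed version:

.. code-block:: bash

   python3 -c "import setuptools; print(setuptools.__version__)"
   # expected output: 58.0.4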
ImportError: libpython3.10.so.1.0: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------------------------
If you are using ``conda`` and encounter the following issue:
.. code-block::
Traceback (most recent call last):
File "/k2-dev/yangyifan/anaconda3/envs/icefall/lib/python3.10/site-packages/k2-1.23.3.dev20230112+cuda11.6.torch1.13.1-py3.10-linux-x86_64.egg/k2/__init__.py", line 24, in <module>
from _k2 import DeterminizeWeightPushingType
ImportError: libpython3.10.so.1.0: cannot open shared object file: No such file or directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/k2-dev/yangyifan/icefall/egs/librispeech/ASR/./pruned_transducer_stateless7_ctc_bs/decode.py", line 104, in <module>
import k2
File "/k2-dev/yangyifan/anaconda3/envs/icefall/lib/python3.10/site-packages/k2-1.23.3.dev20230112+cuda11.6.torch1.13.1-py3.10-linux-x86_64.egg/k2/__init__.py", line 30, in <module>
raise ImportError(
ImportError: libpython3.10.so.1.0: cannot open shared object file: No such file or directory
Note: If you're using anaconda and importing k2 on MacOS,
you can probably fix this by setting the environment variable:
export DYLD_LIBRARY_PATH=$CONDA_PREFIX/lib/python3.10/site-packages:$DYLD_LIBRARY_PATH
Please first find out where ``libpython3.10.so.1.0`` is located.
For instance,
.. code-block:: bash
cd $CONDA_PREFIX/lib
find . -name "libpython*"
If you are able to find it inside ``$CONDA_PREFIX/lib``, please set the
following environment variable:
.. code-block:: bash
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
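You can then check that ``k2`` imports successfully, for example:

.. code-block:: bash

   python3 -c "import k2; print(k2.__file__)"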

View File

@ -1,180 +0,0 @@
.. _dummies_tutorial_data_preparation:
Data Preparation
================
After :ref:`dummies_tutorial_environment_setup`, we can start preparing the
data for training and decoding.
We have already provided
`prepare.sh <https://github.com/k2-fsa/icefall/blob/master/egs/yesno/ASR/prepare.sh>`_,
which prepares everything required for training.
.. code-block:: bash
cd /tmp/icefall
export PYTHONPATH=/tmp/icefall:$PYTHONPATH
cd egs/yesno/ASR
./prepare.sh
Note that every recipe in `icefall`_ contains a file ``prepare.sh``,
which you should run before anything else.
That is all you need for data preparation.
For the more curious
--------------------
If you are wondering how to prepare your own dataset, please refer to the following
URLs for more details:
- `<https://github.com/lhotse-speech/lhotse/tree/master/lhotse/recipes>`_
It contains recipes for a variety of datasets. If you want to add your own
dataset, please read the recipes in this folder first.
- `<https://github.com/lhotse-speech/lhotse/blob/master/lhotse/recipes/yesno.py>`_
The `yesno`_ recipe in `lhotse`_.
If you already have a `Kaldi`_ dataset directory, which contains files like
``wav.scp``, ``feats.scp``, then you can refer to `<https://lhotse.readthedocs.io/en/latest/kaldi.html#example>`_.
A quick look to the generated files
-----------------------------------
``./prepare.sh`` puts generated files into two directories:
- ``download``
- ``data``
download
^^^^^^^^
The ``download`` directory contains downloaded dataset files:
.. code-block:: bash
tree -L 1 ./download/
./download/
|-- waves_yesno
`-- waves_yesno.tar.gz
.. hint::
Please refer to `<https://github.com/lhotse-speech/lhotse/blob/master/lhotse/recipes/yesno.py#L41>`_
for how the data is downloaded and extracted.
data
^^^^
.. code-block:: bash
tree ./data/
./data/
|-- fbank
| |-- yesno_cuts_test.jsonl.gz
| |-- yesno_cuts_train.jsonl.gz
| |-- yesno_feats_test.lca
| `-- yesno_feats_train.lca
|-- lang_phone
| |-- HLG.pt
| |-- L.pt
| |-- L_disambig.pt
| |-- Linv.pt
| |-- lexicon.txt
| |-- lexicon_disambig.txt
| |-- tokens.txt
| `-- words.txt
|-- lm
| |-- G.arpa
| `-- G.fst.txt
`-- manifests
|-- yesno_recordings_test.jsonl.gz
|-- yesno_recordings_train.jsonl.gz
|-- yesno_supervisions_test.jsonl.gz
`-- yesno_supervisions_train.jsonl.gz
4 directories, 18 files
**data/manifests**:
This directory contains manifests. They are used to generate files in
``data/fbank``.
To give you an idea of what it contains, we examine the first few lines of
the manifests related to the ``train`` dataset.
.. code-block:: bash
cd data/manifests
gunzip -c yesno_recordings_train.jsonl.gz | head -n 3
The output is given below:
.. code-block:: bash
{"id": "0_0_0_0_1_1_1_1", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_0_0_1_1_1_1.wav"}], "sampling_rate": 8000, "num_samples": 50800, "duration": 6.35, "channel_ids": [0]}
{"id": "0_0_0_1_0_1_1_0", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_0_1_0_1_1_0.wav"}], "sampling_rate": 8000, "num_samples": 48880, "duration": 6.11, "channel_ids": [0]}
{"id": "0_0_1_0_0_1_1_0", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_1_0_0_1_1_0.wav"}], "sampling_rate": 8000, "num_samples": 48160, "duration": 6.02, "channel_ids": [0]}
Please refer to `<https://github.com/lhotse-speech/lhotse/blob/master/lhotse/audio.py#L300>`_
for the meaning of each field per line.
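You can also load the recordings manifest programmatically with `lhotse`_. A
minimal sketch, assuming `lhotse`_ is installed and you run it from
``egs/yesno/ASR``:

.. code-block:: python3

   from lhotse import RecordingSet

   recordings = RecordingSet.from_file("data/manifests/yesno_recordings_train.jsonl.gz")
   rec = next(iter(recordings))
   print(rec.id, rec.duration, rec.sampling_rate)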
.. code-block:: bash
gunzip -c yesno_supervisions_train.jsonl.gz | head -n 3
The output is given below:
.. code-block:: bash
{"id": "0_0_0_0_1_1_1_1", "recording_id": "0_0_0_0_1_1_1_1", "start": 0.0, "duration": 6.35, "channel": 0, "text": "NO NO NO NO YES YES YES YES", "language": "Hebrew"}
{"id": "0_0_0_1_0_1_1_0", "recording_id": "0_0_0_1_0_1_1_0", "start": 0.0, "duration": 6.11, "channel": 0, "text": "NO NO NO YES NO YES YES NO", "language": "Hebrew"}
{"id": "0_0_1_0_0_1_1_0", "recording_id": "0_0_1_0_0_1_1_0", "start": 0.0, "duration": 6.02, "channel": 0, "text": "NO NO YES NO NO YES YES NO", "language": "Hebrew"}
Please refer to `<https://github.com/lhotse-speech/lhotse/blob/master/lhotse/supervision.py#L510>`_
for the meaning of each field per line.
**data/fbank**:
This directory contains everything from ``data/manifests``. In addition, it
contains the features used for training.
``data/fbank/yesno_feats_train.lca`` contains the features for the train dataset.
Features are compressed using `lilcom`_.
``data/fbank/yesno_cuts_train.jsonl.gz`` stores the `CutSet <https://github.com/lhotse-speech/lhotse/blob/master/lhotse/cut/set.py#L72>`_,
which stores `RecordingSet <https://github.com/lhotse-speech/lhotse/blob/master/lhotse/audio.py#L928>`_,
`SupervisionSet <https://github.com/lhotse-speech/lhotse/blob/master/lhotse/supervision.py#L510>`_,
and `FeatureSet <https://github.com/lhotse-speech/lhotse/blob/master/lhotse/features/base.py#L593>`_.
To give you an idea about what it looks like, we can run the following command:
.. code-block:: bash
cd data/fbank
gunzip -c yesno_cuts_train.jsonl.gz | head -n 3
The output is given below:
.. code-block:: bash
{"id": "0_0_0_0_1_1_1_1-0", "start": 0, "duration": 6.35, "channel": 0, "supervisions": [{"id": "0_0_0_0_1_1_1_1", "recording_id": "0_0_0_0_1_1_1_1", "start": 0.0, "duration": 6.35, "channel": 0, "text": "NO NO NO NO YES YES YES YES", "language": "Hebrew"}], "features": {"type": "kaldi-fbank", "num_frames": 635, "num_features": 23, "frame_shift": 0.01, "sampling_rate": 8000, "start": 0, "duration": 6.35, "storage_type": "lilcom_chunky", "storage_path": "data/fbank/yesno_feats_train.lca", "storage_key": "0,13000,3570", "channels": 0}, "recording": {"id": "0_0_0_0_1_1_1_1", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_0_0_1_1_1_1.wav"}], "sampling_rate": 8000, "num_samples": 50800, "duration": 6.35, "channel_ids": [0]}, "type": "MonoCut"}
{"id": "0_0_0_1_0_1_1_0-1", "start": 0, "duration": 6.11, "channel": 0, "supervisions": [{"id": "0_0_0_1_0_1_1_0", "recording_id": "0_0_0_1_0_1_1_0", "start": 0.0, "duration": 6.11, "channel": 0, "text": "NO NO NO YES NO YES YES NO", "language": "Hebrew"}], "features": {"type": "kaldi-fbank", "num_frames": 611, "num_features": 23, "frame_shift": 0.01, "sampling_rate": 8000, "start": 0, "duration": 6.11, "storage_type": "lilcom_chunky", "storage_path": "data/fbank/yesno_feats_train.lca", "storage_key": "16570,12964,2929", "channels": 0}, "recording": {"id": "0_0_0_1_0_1_1_0", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_0_1_0_1_1_0.wav"}], "sampling_rate": 8000, "num_samples": 48880, "duration": 6.11, "channel_ids": [0]}, "type": "MonoCut"}
{"id": "0_0_1_0_0_1_1_0-2", "start": 0, "duration": 6.02, "channel": 0, "supervisions": [{"id": "0_0_1_0_0_1_1_0", "recording_id": "0_0_1_0_0_1_1_0", "start": 0.0, "duration": 6.02, "channel": 0, "text": "NO NO YES NO NO YES YES NO", "language": "Hebrew"}], "features": {"type": "kaldi-fbank", "num_frames": 602, "num_features": 23, "frame_shift": 0.01, "sampling_rate": 8000, "start": 0, "duration": 6.02, "storage_type": "lilcom_chunky", "storage_path": "data/fbank/yesno_feats_train.lca", "storage_key": "32463,12936,2696", "channels": 0}, "recording": {"id": "0_0_1_0_0_1_1_0", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_1_0_0_1_1_0.wav"}], "sampling_rate": 8000, "num_samples": 48160, "duration": 6.02, "channel_ids": [0]}, "type": "MonoCut"}
Note that ``yesno_cuts_train.jsonl.gz`` only stores the information about how to read the features.
The actual features are stored separately in ``data/fbank/yesno_feats_train.lca``.
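To load a cut and its features in Python, you can use something like the
following sketch (again assuming `lhotse`_ is installed and the working
directory is ``egs/yesno/ASR``):

.. code-block:: python3

   from lhotse import CutSet

   cuts = CutSet.from_file("data/fbank/yesno_cuts_train.jsonl.gz")
   cut = next(iter(cuts))
   feats = cut.load_features()  # read lazily from yesno_feats_train.lca
   print(cut.id, feats.shape)   # e.g., (635, 23): num_frames x num_features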
**data/lang_phone**:
This directory contains the lexicon and the FSTs compiled from it, e.g., ``HLG.pt``.
**data/lm**:
This directory contains language models.

View File

@ -1,39 +0,0 @@
.. _dummies_tutorial_decoding:
Decoding
========
After :ref:`dummies_tutorial_training`, we can start decoding.
The command to start the decoding is quite simple:
.. code-block:: bash
cd /tmp/icefall
export PYTHONPATH=/tmp/icefall:$PYTHONPATH
cd egs/yesno/ASR
# We use CPU for decoding by setting the following environment variable
export CUDA_VISIBLE_DEVICES=""
./tdnn/decode.py
The output logs are given below:
.. literalinclude:: ./code/decoding-yesno.txt
For the more curious
--------------------
.. code-block:: bash
./tdnn/decode.py --help
will print the usage information about ``./tdnn/decode.py``. For instance, you
can specify:

- ``--epoch`` to select which checkpoint to use for decoding
- ``--avg`` to select how many checkpoints to use for model averaging

You usually try different combinations of ``--epoch`` and ``--avg`` and select
the one that leads to the lowest WER (`Word Error Rate <https://en.wikipedia.org/wiki/Word_error_rate>`_).
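For instance, a minimal sketch of such a sweep is given below; the epoch and
avg values are hypothetical, so adjust them to the checkpoints you actually have:

.. code-block:: bash

   for epoch in 12 13 14; do
     for avg in 1 2; do
       ./tdnn/decode.py --epoch $epoch --avg $avg
     done
   done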

View File

@ -1,125 +0,0 @@
.. _dummies_tutorial_environment_setup:
Environment setup
=================
We will create an environment for `Next-gen Kaldi`_ that runs on ``CPU``
in this tutorial.
.. note::
Since the `yesno`_ dataset used in this tutorial is very tiny, training on
``CPU`` works very well for it.
If your dataset is very large, e.g., hundreds or thousands of hours of
training data, please follow :ref:`install icefall` to install `icefall`_
that works with ``GPU``.
Create a virtual environment
----------------------------
.. code-block:: bash
virtualenv -p python3 /tmp/icefall_env
The above command creates a virtual environment in the directory ``/tmp/icefall_env``.
You can select any directory you want.
The output of the above command is given below:
.. code-block:: bash
Already using interpreter /usr/bin/python3
Using base prefix '/usr'
New python executable in /tmp/icefall_env/bin/python3
Also creating executable in /tmp/icefall_env/bin/python
Installing setuptools, pkg_resources, pip, wheel...done.
Now we can activate the environment using:
.. code-block:: bash
source /tmp/icefall_env/bin/activate
Install dependencies
--------------------
.. warning::
   Remember to activate your virtual environment before you continue!
After activating the virtual environment, we can use the following command
to install dependencies of `icefall`_:
.. hint::
   Remember that we will run this tutorial on ``CPU``, so we install only the
   dependencies required for running on ``CPU``.
.. code-block:: bash
# Caution: Installation order matters!
   # We use torch 2.0.0 and torchaudio 2.0.0 in this tutorial.
# Other versions should also work.
pip install torch==2.0.0+cpu torchaudio==2.0.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
# If you are using macOS, please use the following command to install torch and torchaudio
# pip install torch==2.0.0 torchaudio==2.0.0 -f https://download.pytorch.org/whl/torch_stable.html
# Now install k2
# Please refer to https://k2-fsa.github.io/k2/installation/from_wheels.html#linux-cpu-example
pip install k2==1.24.4.dev20231220+cpu.torch2.0.0 -f https://k2-fsa.github.io/k2/cpu.html
# For users from China
# 中国国内用户,如果访问不了 huggingface, 请使用
# pip install k2==1.24.4.dev20231220+cpu.torch2.0.0 -f https://k2-fsa.github.io/k2/cpu-cn.html
# Install the latest version of lhotse
pip install git+https://github.com/lhotse-speech/lhotse
Install icefall
---------------
We will put the source code of `icefall`_ into the directory ``/tmp``.
You can select any directory you want.
.. code-block:: bash
cd /tmp
git clone https://github.com/k2-fsa/icefall
cd icefall
pip install -r ./requirements.txt
.. code-block:: bash
# Anytime we want to use icefall, we have to set the following
# environment variable
export PYTHONPATH=/tmp/icefall:$PYTHONPATH
.. hint::
If you get the following error during this tutorial:
.. code-block:: bash
ModuleNotFoundError: No module named 'icefall'
please set the above environment variable to fix it.
Congratulations! You have installed `icefall`_ successfully.
For the more curious
--------------------
`icefall`_ contains a collection of Python scripts and you don't need to
use ``python3 setup.py install`` or ``pip install icefall`` to install it.
All you need to do is download the code and set the environment variable
``PYTHONPATH``.
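For example, you can quickly verify that the setup works:

.. code-block:: bash

   # Should print /tmp/icefall/icefall/__init__.py if PYTHONPATH is set correctly
   python3 -c "import icefall; print(icefall.__file__)"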

View File

@ -1,34 +0,0 @@
Icefall for dummies tutorial
============================
This tutorial walks you step by step through creating a simple
ASR (`Automatic Speech Recognition <https://en.wikipedia.org/wiki/Speech_recognition>`_)
system with `Next-gen Kaldi`_.
We use the `yesno`_ dataset for demonstration. We select it for two reasons:
- It is quite tiny, containing only about 12 minutes of data.
- The training can be finished within 20 seconds on ``CPU``.
That also means you don't need a ``GPU`` to run this tutorial.
Let's get started!
Please follow items below **sequentially**.
.. note::
   The :ref:`dummies_tutorial_data_preparation` runs only on Linux and macOS.
All other parts run on Linux, macOS, and Windows.
Help from the community is appreciated to port the :ref:`dummies_tutorial_data_preparation`
to Windows.
.. toctree::
:maxdepth: 2
./environment-setup.rst
./data-preparation.rst
./training.rst
./decoding.rst
./model-export.rst

View File

@ -1,310 +0,0 @@
Model Export
============
There are three ways to export a pre-trained model.
- Export the model parameters via `model.state_dict() <https://pytorch.org/docs/stable/generated/torch.nn.Module.html?highlight=load_state_dict#torch.nn.Module.state_dict>`_
- Export via `torchscript <https://pytorch.org/docs/stable/jit.html>`_: either `torch.jit.script() <https://pytorch.org/docs/stable/generated/torch.jit.script.html#torch.jit.script>`_ or `torch.jit.trace() <https://pytorch.org/docs/stable/generated/torch.jit.trace.html>`_
- Export to `ONNX`_ via `torch.onnx.export() <https://pytorch.org/docs/stable/onnx.html>`_
Each method is explained below in detail.
Export the model parameters via model.state_dict()
---------------------------------------------------
The command for this kind of export is
.. code-block:: bash
cd /tmp/icefall
export PYTHONPATH=/tmp/icefall:$PYTHONPATH
cd egs/yesno/ASR
# assume that "--epoch 14 --avg 2" produces the lowest WER.
./tdnn/export.py --epoch 14 --avg 2
The output logs are given below:
.. code-block:: bash
2023-08-16 20:42:03,912 INFO [export.py:76] {'exp_dir': PosixPath('tdnn/exp'), 'lang_dir': PosixPath('data/lang_phone'), 'lr': 0.01, 'feature_dim': 23, 'weight_decay': 1e-06, 'start_epoch': 0, 'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 10, 'reset_interval': 20, 'valid_interval': 10, 'beam_size': 10, 'reduction': 'sum', 'use_double_scores': True, 'epoch': 14, 'avg': 2, 'jit': False}
2023-08-16 20:42:03,913 INFO [lexicon.py:168] Loading pre-compiled data/lang_phone/Linv.pt
2023-08-16 20:42:03,950 INFO [export.py:93] averaging ['tdnn/exp/epoch-13.pt', 'tdnn/exp/epoch-14.pt']
2023-08-16 20:42:03,971 INFO [export.py:106] Not using torch.jit.script
2023-08-16 20:42:03,974 INFO [export.py:111] Saved to tdnn/exp/pretrained.pt
We can see from the logs that the exported model is saved to the file ``tdnn/exp/pretrained.pt``.
To give you an idea of what ``tdnn/exp/pretrained.pt`` contains, we can use the following command:
.. code-block:: python3
>>> import torch
>>> m = torch.load("tdnn/exp/pretrained.pt")
>>> list(m.keys())
['model']
>>> list(m["model"].keys())
['tdnn.0.weight', 'tdnn.0.bias', 'tdnn.2.running_mean', 'tdnn.2.running_var', 'tdnn.2.num_batches_tracked', 'tdnn.3.weight', 'tdnn.3.bias', 'tdnn.5.running_mean', 'tdnn.5.running_var', 'tdnn.5.num_batches_tracked', 'tdnn.6.weight', 'tdnn.6.bias', 'tdnn.8.running_mean', 'tdnn.8.running_var', 'tdnn.8.num_batches_tracked', 'output_linear.weight', 'output_linear.bias']
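The ``model`` entry is a regular PyTorch ``state_dict``, so you can restore it
with the usual pattern. Below is a minimal sketch; it assumes ``model`` is an
instance of the same TDNN network that ``./tdnn/train.py`` builds
(``feature_dim=23``, ``num_classes=4``):

.. code-block:: python3

   import torch

   checkpoint = torch.load("tdnn/exp/pretrained.pt", map_location="cpu")
   # `model` is assumed to be the TDNN network constructed by ./tdnn/train.py
   model.load_state_dict(checkpoint["model"])
   model.eval()  # put the batchnorm layers into inference mode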
We can use ``tdnn/exp/pretrained.pt`` in the following way with ``./tdnn/decode.py``:
.. code-block:: bash
cd tdnn/exp
ln -s pretrained.pt epoch-99.pt
cd ../..
./tdnn/decode.py --epoch 99 --avg 1
The output logs of the above command are given below:
.. code-block:: bash
2023-08-16 20:45:48,089 INFO [decode.py:262] Decoding started
2023-08-16 20:45:48,090 INFO [decode.py:263] {'exp_dir': PosixPath('tdnn/exp'), 'lang_dir': PosixPath('data/lang_phone'), 'feature_dim': 23, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'epoch': 99, 'avg': 1, 'export': False, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 30.0, 'bucketing_sampler': False, 'num_buckets': 10, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': False, 'return_cuts': True, 'num_workers': 2, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': False, 'k2-git-sha1': 'ad79f1c699c684de9785ed6ca5edb805a41f78c3', 'k2-git-date': 'Wed Jul 26 09:30:42 2023', 'lhotse-version': '1.16.0.dev+git.aa073f6.clean', 'torch-version': '2.0.0', 'torch-cuda-available': False, 'torch-cuda-version': None, 'python-version': '3.1', 'icefall-git-branch': 'master', 'icefall-git-sha1': '9a47c08-clean', 'icefall-git-date': 'Mon Aug 14 22:10:50 2023', 'icefall-path': '/private/tmp/icefall', 'k2-path': '/private/tmp/icefall_env/lib/python3.11/site-packages/k2/__init__.py', 'lhotse-path': '/private/tmp/icefall_env/lib/python3.11/site-packages/lhotse/__init__.py', 'hostname': 'fangjuns-MacBook-Pro.local', 'IP address': '127.0.0.1'}}
2023-08-16 20:45:48,092 INFO [lexicon.py:168] Loading pre-compiled data/lang_phone/Linv.pt
2023-08-16 20:45:48,103 INFO [decode.py:272] device: cpu
2023-08-16 20:45:48,109 INFO [checkpoint.py:112] Loading checkpoint from tdnn/exp/epoch-99.pt
2023-08-16 20:45:48,115 INFO [asr_datamodule.py:218] About to get test cuts
2023-08-16 20:45:48,115 INFO [asr_datamodule.py:253] About to get test cuts
2023-08-16 20:45:50,386 INFO [decode.py:203] batch 0/?, cuts processed until now is 4
2023-08-16 20:45:50,556 INFO [decode.py:240] The transcripts are stored in tdnn/exp/recogs-test_set.txt
2023-08-16 20:45:50,557 INFO [utils.py:564] [test_set] %WER 0.42% [1 / 240, 0 ins, 1 del, 0 sub ]
2023-08-16 20:45:50,558 INFO [decode.py:248] Wrote detailed error stats to tdnn/exp/errs-test_set.txt
2023-08-16 20:45:50,559 INFO [decode.py:315] Done!
We can see that it produces the same WER as before.
We can also use it to decode files with the following command:
.. code-block:: bash
# ./tdnn/pretrained.py requires kaldifeat
#
# Please refer to https://csukuangfj.github.io/kaldifeat/installation/from_wheels.html
# for how to install kaldifeat
pip install kaldifeat==1.25.3.dev20231221+cpu.torch2.0.0 -f https://csukuangfj.github.io/kaldifeat/cpu.html
./tdnn/pretrained.py \
--checkpoint ./tdnn/exp/pretrained.pt \
--HLG ./data/lang_phone/HLG.pt \
--words-file ./data/lang_phone/words.txt \
download/waves_yesno/0_0_0_1_0_0_0_1.wav \
download/waves_yesno/0_0_1_0_0_0_1_0.wav
The output is given below:
.. code-block:: bash
2023-08-16 20:53:19,208 INFO [pretrained.py:136] {'feature_dim': 23, 'num_classes': 4, 'sample_rate': 8000, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'checkpoint': './tdnn/exp/pretrained.pt', 'words_file': './data/lang_phone/words.txt', 'HLG': './data/lang_phone/HLG.pt', 'sound_files': ['download/waves_yesno/0_0_0_1_0_0_0_1.wav', 'download/waves_yesno/0_0_1_0_0_0_1_0.wav']}
2023-08-16 20:53:19,208 INFO [pretrained.py:142] device: cpu
2023-08-16 20:53:19,208 INFO [pretrained.py:144] Creating model
2023-08-16 20:53:19,212 INFO [pretrained.py:156] Loading HLG from ./data/lang_phone/HLG.pt
2023-08-16 20:53:19,213 INFO [pretrained.py:160] Constructing Fbank computer
2023-08-16 20:53:19,213 INFO [pretrained.py:170] Reading sound files: ['download/waves_yesno/0_0_0_1_0_0_0_1.wav', 'download/waves_yesno/0_0_1_0_0_0_1_0.wav']
2023-08-16 20:53:19,224 INFO [pretrained.py:176] Decoding started
2023-08-16 20:53:19,304 INFO [pretrained.py:212]
download/waves_yesno/0_0_0_1_0_0_0_1.wav:
NO NO NO YES NO NO NO YES
download/waves_yesno/0_0_1_0_0_0_1_0.wav:
NO NO YES NO NO NO YES NO
2023-08-16 20:53:19,304 INFO [pretrained.py:214] Decoding Done
Export via torch.jit.script()
-----------------------------
The command for this kind of export is
.. code-block:: bash
cd /tmp/icefall
export PYTHONPATH=/tmp/icefall:$PYTHONPATH
cd egs/yesno/ASR
# assume that "--epoch 14 --avg 2" produces the lowest WER.
./tdnn/export.py --epoch 14 --avg 2 --jit true
The output logs are given below:
.. code-block:: bash
2023-08-16 20:47:44,666 INFO [export.py:76] {'exp_dir': PosixPath('tdnn/exp'), 'lang_dir': PosixPath('data/lang_phone'), 'lr': 0.01, 'feature_dim': 23, 'weight_decay': 1e-06, 'start_epoch': 0, 'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 10, 'reset_interval': 20, 'valid_interval': 10, 'beam_size': 10, 'reduction': 'sum', 'use_double_scores': True, 'epoch': 14, 'avg': 2, 'jit': True}
2023-08-16 20:47:44,667 INFO [lexicon.py:168] Loading pre-compiled data/lang_phone/Linv.pt
2023-08-16 20:47:44,670 INFO [export.py:93] averaging ['tdnn/exp/epoch-13.pt', 'tdnn/exp/epoch-14.pt']
2023-08-16 20:47:44,677 INFO [export.py:100] Using torch.jit.script
2023-08-16 20:47:44,843 INFO [export.py:104] Saved to tdnn/exp/cpu_jit.pt
From the output logs we can see that the generated file is saved to ``tdnn/exp/cpu_jit.pt``.
Don't be confused by the name ``cpu_jit.pt``. The ``cpu`` part means the model
was moved to CPU before exporting. That means, when you load it with

.. code-block:: python3

   torch.jit.load()

you don't need to specify the argument `map_location <https://pytorch.org/docs/stable/generated/torch.jit.load.html#torch.jit.load>`_
and it resides on CPU by default.
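As a quick smoke test, you can load the exported torchscript model and run a
random tensor through it. This is only a sketch; it assumes the model's forward
takes an ``(N, T, C)`` float tensor of fbank features with ``C = 23``:

.. code-block:: python3

   import torch

   model = torch.jit.load("tdnn/exp/cpu_jit.pt")
   model.eval()
   features = torch.rand(1, 100, 23)  # dummy (N, T, C) fbank features
   log_probs = model(features)
   print(log_probs.shape)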
To use ``tdnn/exp/cpu_jit.pt`` with `icefall`_ to decode files, we can use:
.. code-block:: bash
# ./tdnn/jit_pretrained.py requires kaldifeat
#
# Please refer to https://csukuangfj.github.io/kaldifeat/installation/from_wheels.html
# for how to install kaldifeat
pip install kaldifeat==1.25.3.dev20231221+cpu.torch2.0.0 -f https://csukuangfj.github.io/kaldifeat/cpu.html
./tdnn/jit_pretrained.py \
--nn-model ./tdnn/exp/cpu_jit.pt \
--HLG ./data/lang_phone/HLG.pt \
--words-file ./data/lang_phone/words.txt \
download/waves_yesno/0_0_0_1_0_0_0_1.wav \
download/waves_yesno/0_0_1_0_0_0_1_0.wav
The output is given below:
.. code-block:: bash
2023-08-16 20:56:00,603 INFO [jit_pretrained.py:121] {'feature_dim': 23, 'num_classes': 4, 'sample_rate': 8000, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'nn_model': './tdnn/exp/cpu_jit.pt', 'words_file': './data/lang_phone/words.txt', 'HLG': './data/lang_phone/HLG.pt', 'sound_files': ['download/waves_yesno/0_0_0_1_0_0_0_1.wav', 'download/waves_yesno/0_0_1_0_0_0_1_0.wav']}
2023-08-16 20:56:00,603 INFO [jit_pretrained.py:127] device: cpu
2023-08-16 20:56:00,603 INFO [jit_pretrained.py:129] Loading torchscript model
2023-08-16 20:56:00,640 INFO [jit_pretrained.py:134] Loading HLG from ./data/lang_phone/HLG.pt
2023-08-16 20:56:00,641 INFO [jit_pretrained.py:138] Constructing Fbank computer
2023-08-16 20:56:00,641 INFO [jit_pretrained.py:148] Reading sound files: ['download/waves_yesno/0_0_0_1_0_0_0_1.wav', 'download/waves_yesno/0_0_1_0_0_0_1_0.wav']
2023-08-16 20:56:00,642 INFO [jit_pretrained.py:154] Decoding started
2023-08-16 20:56:00,727 INFO [jit_pretrained.py:190]
download/waves_yesno/0_0_0_1_0_0_0_1.wav:
NO NO NO YES NO NO NO YES
download/waves_yesno/0_0_1_0_0_0_1_0.wav:
NO NO YES NO NO NO YES NO
2023-08-16 20:56:00,727 INFO [jit_pretrained.py:192] Decoding Done
.. hint::
We provide only code for ``torch.jit.script()``. You can try ``torch.jit.trace()``
if you want.
Export via torch.onnx.export()
------------------------------
The command for this kind of export is
.. code-block:: bash
cd /tmp/icefall
export PYTHONPATH=/tmp/icefall:$PYTHONPATH
cd egs/yesno/ASR
# tdnn/export_onnx.py requires onnx and onnxruntime
pip install onnx onnxruntime
# assume that "--epoch 14 --avg 2" produces the lowest WER.
./tdnn/export_onnx.py \
--epoch 14 \
--avg 2
The output logs are given below:
.. code-block:: bash
2023-08-16 20:59:20,888 INFO [export_onnx.py:83] {'exp_dir': PosixPath('tdnn/exp'), 'lang_dir': PosixPath('data/lang_phone'), 'lr': 0.01, 'feature_dim': 23, 'weight_decay': 1e-06, 'start_epoch': 0, 'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 10, 'reset_interval': 20, 'valid_interval': 10, 'beam_size': 10, 'reduction': 'sum', 'use_double_scores': True, 'epoch': 14, 'avg': 2}
2023-08-16 20:59:20,888 INFO [lexicon.py:168] Loading pre-compiled data/lang_phone/Linv.pt
2023-08-16 20:59:20,892 INFO [export_onnx.py:100] averaging ['tdnn/exp/epoch-13.pt', 'tdnn/exp/epoch-14.pt']
================ Diagnostic Run torch.onnx.export version 2.0.0 ================
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================
2023-08-16 20:59:21,047 INFO [export_onnx.py:127] Saved to tdnn/exp/model-epoch-14-avg-2.onnx
2023-08-16 20:59:21,047 INFO [export_onnx.py:136] meta_data: {'model_type': 'tdnn', 'version': '1', 'model_author': 'k2-fsa', 'comment': 'non-streaming tdnn for the yesno recipe', 'vocab_size': 4}
2023-08-16 20:59:21,049 INFO [export_onnx.py:140] Generate int8 quantization models
2023-08-16 20:59:21,075 INFO [onnx_quantizer.py:538] Quantization parameters for tensor:"/Transpose_1_output_0" not specified
2023-08-16 20:59:21,081 INFO [export_onnx.py:151] Saved to tdnn/exp/model-epoch-14-avg-2.int8.onnx
We can see from the logs that it generates two files:
- ``tdnn/exp/model-epoch-14-avg-2.onnx`` (ONNX model with ``float32`` weights)
- ``tdnn/exp/model-epoch-14-avg-2.int8.onnx`` (ONNX model with ``int8`` weights)
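Before decoding, you can inspect the exported model with `onnxruntime`_, e.g.,
to view the ``meta_data`` shown in the logs above. A minimal sketch, assuming
you have run ``pip install onnxruntime``:

.. code-block:: python3

   import onnxruntime as ort

   session = ort.InferenceSession(
       "tdnn/exp/model-epoch-14-avg-2.onnx",
       providers=["CPUExecutionProvider"],
   )
   # The custom metadata written by ./tdnn/export_onnx.py
   print(session.get_modelmeta().custom_metadata_map)
   for node in session.get_inputs():
       print(node.name, node.shape, node.type)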
To use the generated ONNX model files for decoding with `onnxruntime`_, we can use
.. code-block:: bash
# ./tdnn/onnx_pretrained.py requires kaldifeat
#
# Please refer to https://csukuangfj.github.io/kaldifeat/installation/from_wheels.html
# for how to install kaldifeat
pip install kaldifeat==1.25.3.dev20231221+cpu.torch2.0.0 -f https://csukuangfj.github.io/kaldifeat/cpu.html
./tdnn/onnx_pretrained.py \
--nn-model ./tdnn/exp/model-epoch-14-avg-2.onnx \
--HLG ./data/lang_phone/HLG.pt \
--words-file ./data/lang_phone/words.txt \
download/waves_yesno/0_0_0_1_0_0_0_1.wav \
download/waves_yesno/0_0_1_0_0_0_1_0.wav
The output is given below:
.. code-block:: bash
2023-08-16 21:03:24,260 INFO [onnx_pretrained.py:166] {'feature_dim': 23, 'sample_rate': 8000, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'nn_model': './tdnn/exp/model-epoch-14-avg-2.onnx', 'words_file': './data/lang_phone/words.txt', 'HLG': './data/lang_phone/HLG.pt', 'sound_files': ['download/waves_yesno/0_0_0_1_0_0_0_1.wav', 'download/waves_yesno/0_0_1_0_0_0_1_0.wav']}
2023-08-16 21:03:24,260 INFO [onnx_pretrained.py:171] device: cpu
2023-08-16 21:03:24,260 INFO [onnx_pretrained.py:173] Loading onnx model ./tdnn/exp/model-epoch-14-avg-2.onnx
2023-08-16 21:03:24,267 INFO [onnx_pretrained.py:176] Loading HLG from ./data/lang_phone/HLG.pt
2023-08-16 21:03:24,270 INFO [onnx_pretrained.py:180] Constructing Fbank computer
2023-08-16 21:03:24,273 INFO [onnx_pretrained.py:190] Reading sound files: ['download/waves_yesno/0_0_0_1_0_0_0_1.wav', 'download/waves_yesno/0_0_1_0_0_0_1_0.wav']
2023-08-16 21:03:24,279 INFO [onnx_pretrained.py:196] Decoding started
2023-08-16 21:03:24,318 INFO [onnx_pretrained.py:232]
download/waves_yesno/0_0_0_1_0_0_0_1.wav:
NO NO NO YES NO NO NO YES
download/waves_yesno/0_0_1_0_0_0_1_0.wav:
NO NO YES NO NO NO YES NO
2023-08-16 21:03:24,318 INFO [onnx_pretrained.py:234] Decoding Done
.. note::
To use the ``int8`` ONNX model for decoding, please use:
.. code-block:: bash
      ./tdnn/onnx_pretrained.py \
        --nn-model ./tdnn/exp/model-epoch-14-avg-2.int8.onnx \
        --HLG ./data/lang_phone/HLG.pt \
        --words-file ./data/lang_phone/words.txt \
        download/waves_yesno/0_0_0_1_0_0_0_1.wav \
        download/waves_yesno/0_0_1_0_0_0_1_0.wav
For the more curious
--------------------
If you are wondering how to deploy the model without ``torch``, please
continue reading. We will show how to use `sherpa-onnx`_ to run the
exported ONNX models, which depends only on `onnxruntime`_ and does not
depend on ``torch``.
In this tutorial, we have only demonstrated the usage of `sherpa-onnx`_ with the
pre-trained model of the `yesno`_ recipe. Two other frameworks are also
available:
- `sherpa`_. It works with torchscript models.
- `sherpa-ncnn`_. It works with models exported using :ref:`icefall_export_to_ncnn` with `ncnn`_
Please see `<https://k2-fsa.github.io/sherpa/>`_ for further details.

View File

@ -1,39 +0,0 @@
.. _dummies_tutorial_training:
Training
========
After :ref:`dummies_tutorial_data_preparation`, we can start training.
The command to start the training is quite simple:
.. code-block:: bash
cd /tmp/icefall
export PYTHONPATH=/tmp/icefall:$PYTHONPATH
cd egs/yesno/ASR
# We use CPU for training by setting the following environment variable
export CUDA_VISIBLE_DEVICES=""
./tdnn/train.py
That's it!
You can find the training logs below:
.. literalinclude:: ./code/train-yesno.txt
For the more curious
--------------------
.. code-block:: bash
./tdnn/train.py --help
will print the usage information about ``./tdnn/train.py``. For instance, you
can specify the number of epochs to train and the location to save the training
results.
The training text logs are saved in ``tdnn/exp/log`` while the tensorboard
logs are in ``tdnn/exp/tensorboard``.
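To view the tensorboard logs, you can, for example, run the following command
(assuming ``tensorboard`` is installed, e.g., via ``pip install tensorboard``)
and then open ``http://localhost:6006`` in your browser:

.. code-block:: bash

   tensorboard --logdir tdnn/exp/tensorboard --port 6006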

View File

@ -1,41 +0,0 @@
Two approaches
==============
Two approaches for FST-based forced alignment will be described:
- `Kaldi`_-based
- `k2`_-based
Note that the `Kaldi`_-based approach does not depend on `Kaldi`_ at all.
That is, you don't need to install `Kaldi`_ in order to use it. Instead,
we use `kaldi-decoder`_, which has ported the C++ decoding code from `Kaldi`_
without depending on it.
Differences between the two approaches
--------------------------------------
The following table compares the differences between the two approaches.
.. list-table::
   :header-rows: 1
* - Features
- `Kaldi`_-based
- `k2`_-based
* - Support CUDA
- No
- Yes
* - Support CPU
- Yes
- Yes
* - Support batch processing
- No
- Yes on CUDA; No on CPU
* - Support streaming models
- Yes
- No
* - Support C++ APIs
- Yes
- Yes
* - Support Python APIs
- Yes
- Yes

View File

@ -1,18 +0,0 @@
FST-based forced alignment
==========================
This section describes how to perform **FST-based** ``forced alignment`` with models
trained by `CTC`_ loss.
We use `CTC FORCED ALIGNMENT API TUTORIAL <https://pytorch.org/audio/main/tutorials/ctc_forced_alignment_api_tutorial.html>`_
from `torchaudio`_ as a reference in this section.
Unlike `torchaudio`_, we use an ``FST``-based approach.
.. toctree::
:maxdepth: 2
:caption: Contents:
diff
kaldi-based
k2-based

View File

@ -1,4 +0,0 @@
k2-based forced alignment
=========================
TODO(fangjun)

View File

@ -1,712 +0,0 @@
Kaldi-based forced alignment
============================
This section describes in detail how to use `kaldi-decoder`_
for **FST-based** ``forced alignment`` with models trained by `CTC`_ loss.
.. hint::
We have a colab notebook walking you through this section step by step.
|kaldi-based forced alignment colab notebook|
.. |kaldi-based forced alignment colab notebook| image:: https://colab.research.google.com/assets/colab-badge.svg
:target: https://github.com/k2-fsa/colab/blob/master/icefall/ctc_forced_alignment_fst_based_kaldi.ipynb
Prepare the environment
-----------------------
Before you continue, make sure you have setup `icefall`_ by following :ref:`install icefall`.
.. hint::
You don't need to install `Kaldi`_. We will ``NOT`` use `Kaldi`_ below.
Get the test data
-----------------
We use the test wave
from `CTC FORCED ALIGNMENT API TUTORIAL <https://pytorch.org/audio/main/tutorials/ctc_forced_alignment_api_tutorial.html>`_
.. code-block:: python3
import torchaudio
# Download test wave
speech_file = torchaudio.utils.download_asset("tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav")
print(speech_file)
waveform, sr = torchaudio.load(speech_file)
transcript = "i had that curiosity beside me at this moment".split()
print(waveform.shape, sr)
assert waveform.ndim == 2
assert waveform.shape[0] == 1
assert sr == 16000
The test wave is downloaded to::
$HOME/.cache/torch/hub/torchaudio/tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav
.. raw:: html
<table>
<tr>
<th>Wave filename</th>
<th>Content</th>
<th>Text</th>
</tr>
<tr>
<td>Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav</td>
<td>
<audio title="Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav" controls="controls">
<source src="/icefall/_static/kaldi-align/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav" type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
</td>
<td>
i had that curiosity beside me at this moment
</td>
</tr>
</table>
We use the test model
from `CTC FORCED ALIGNMENT API TUTORIAL <https://pytorch.org/audio/main/tutorials/ctc_forced_alignment_api_tutorial.html>`_
.. code-block:: python3
import torch
bundle = torchaudio.pipelines.MMS_FA
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = bundle.get_model(with_star=False).to(device)
The model is downloaded to::
$HOME/.cache/torch/hub/checkpoints/model.pt
Compute log_probs
-----------------
.. code-block:: python3
with torch.inference_mode():
emission, _ = model(waveform.to(device))
print(emission.shape)
It should print::
torch.Size([1, 169, 28])
Create token2id and id2token
----------------------------
.. code-block:: python3
   token2id = bundle.get_dict(star=None)
   id2token = {i: t for t, i in token2id.items()}

   # In the FST framework, id 0 is reserved for <eps>, so we replace the
   # blank symbol "-" (which has id 0 in the bundle) with <eps>
   token2id["<eps>"] = 0
   del token2id["-"]
Create word2id and id2word
--------------------------
.. code-block:: python3
words = list(set(transcript))
word2id = dict()
word2id['eps'] = 0
for i, w in enumerate(words):
word2id[w] = i + 1
id2word = {i:w for w, i in word2id.items()}
Note that we only use words from the transcript of the test wave.
Generate lexicon-related files
------------------------------
We use the code below to generate the following 4 files:
- ``lexicon.txt``
- ``tokens.txt``
- ``words.txt``
- ``lexicon_disambig.txt``
.. caution::
``words.txt`` contains only words from the transcript of the test wave.
.. code-block:: python3
from prepare_lang import add_disambig_symbols
lexicon = [(w, list(w)) for w in word2id if w != "eps"]
lexicon_disambig, max_disambig_id = add_disambig_symbols(lexicon)
with open('lexicon.txt', 'w', encoding='utf-8') as f:
for w, tokens in lexicon:
f.write(f"{w} {' '.join(tokens)}\n")
with open('lexicon_disambig.txt', 'w', encoding='utf-8') as f:
for w, tokens in lexicon_disambig:
f.write(f"{w} {' '.join(tokens)}\n")
with open('tokens.txt', 'w', encoding='utf-8') as f:
for t, i in token2id.items():
if t == '-':
t = "<eps>"
f.write(f"{t} {i}\n")
for k in range(max_disambig_id + 2):
f.write(f"#{k} {len(token2id) + k}\n")
with open('words.txt', 'w', encoding='utf-8') as f:
for w, i in word2id.items():
f.write(f"{w} {i}\n")
f.write(f'#0 {len(word2id)}\n')
To give you an idea about what the generated files look like::
head -n 50 lexicon.txt lexicon_disambig.txt tokens.txt words.txt
prints::
==> lexicon.txt <==
moment m o m e n t
beside b e s i d e
i i
this t h i s
curiosity c u r i o s i t y
had h a d
that t h a t
at a t
me m e
==> lexicon_disambig.txt <==
moment m o m e n t
beside b e s i d e
i i
this t h i s
curiosity c u r i o s i t y
had h a d
that t h a t
at a t
me m e
==> tokens.txt <==
a 1
i 2
e 3
n 4
o 5
u 6
t 7
s 8
r 9
m 10
k 11
l 12
d 13
g 14
h 15
y 16
b 17
p 18
w 19
c 20
v 21
j 22
z 23
f 24
' 25
q 26
x 27
<eps> 0
#0 28
#1 29
==> words.txt <==
eps 0
moment 1
beside 2
i 3
this 4
curiosity 5
had 6
that 7
at 8
me 9
#0 10
.. note::
   This test model uses characters as its modeling unit. If you use another type
   of modeling unit, the same code can be used without any change.
Convert transcript to an FST graph
----------------------------------
.. code-block:: bash
egs/librispeech/ASR/local/prepare_lang_fst.py --lang-dir ./
The above command should generate two files ``H.fst`` and ``HL.fst``. We will
use ``HL.fst`` below::
-rw-r--r-- 1 root root 13K Jun 12 08:28 H.fst
-rw-r--r-- 1 root root 3.7K Jun 12 08:28 HL.fst
Force aligner
-------------
Now, everything is ready. We can use the following code to get forced alignments.
.. code-block:: python3
from kaldi_decoder import DecodableCtc, FasterDecoder, FasterDecoderOptions
import kaldifst
def force_align():
HL = kaldifst.StdVectorFst.read("./HL.fst")
decodable = DecodableCtc(emission[0].contiguous().cpu().numpy())
decoder_opts = FasterDecoderOptions(max_active=3000)
decoder = FasterDecoder(HL, decoder_opts)
decoder.decode(decodable)
if not decoder.reached_final():
print(f"failed to decode xxx")
return None
ok, best_path = decoder.get_best_path()
(
ok,
isymbols_out,
osymbols_out,
total_weight,
) = kaldifst.get_linear_symbol_sequence(best_path)
if not ok:
print(f"failed to get linear symbol sequence for xxx")
return None
# We need to use i-1 here since we have incremented tokens during
# HL construction
alignment = [i-1 for i in isymbols_out]
return alignment
alignment = force_align()
for i, a in enumerate(alignment):
print(i, id2token[a])
The output should be identical to
`<https://pytorch.org/audio/main/tutorials/ctc_forced_alignment_api_tutorial.html#frame-level-alignments>`_.
For ease of reference, we list the output below::
0 -
1 -
2 -
3 -
4 -
5 -
6 -
7 -
8 -
9 -
10 -
11 -
12 -
13 -
14 -
15 -
16 -
17 -
18 -
19 -
20 -
21 -
22 -
23 -
24 -
25 -
26 -
27 -
28 -
29 -
30 -
31 -
32 i
33 -
34 -
35 h
36 h
37 a
38 -
39 -
40 -
41 d
42 -
43 -
44 t
45 h
46 -
47 a
48 -
49 -
50 t
51 -
52 -
53 -
54 c
55 -
56 -
57 -
58 u
59 u
60 -
61 -
62 -
63 r
64 -
65 i
66 -
67 -
68 -
69 -
70 -
71 -
72 o
73 -
74 -
75 -
76 -
77 -
78 -
79 s
80 -
81 -
82 -
83 i
84 -
85 t
86 -
87 -
88 y
89 -
90 -
91 -
92 -
93 b
94 -
95 e
96 -
97 -
98 -
99 -
100 -
101 s
102 -
103 -
104 -
105 -
106 -
107 -
108 -
109 -
110 i
111 -
112 -
113 d
114 e
115 -
116 m
117 -
118 -
119 e
120 -
121 -
122 -
123 -
124 a
125 -
126 -
127 t
128 -
129 t
130 h
131 -
132 i
133 -
134 -
135 -
136 s
137 -
138 -
139 -
140 -
141 m
142 -
143 -
144 o
145 -
146 -
147 -
148 m
149 -
150 -
151 e
152 -
153 n
154 -
155 t
156 -
157 -
158 -
159 -
160 -
161 -
162 -
163 -
164 -
165 -
166 -
167 -
168 -
To merge tokens, we use::
from icefall.ctc import merge_tokens
token_spans = merge_tokens(alignment)
for span in token_spans:
print(id2token[span.token], span.start, span.end)
The output is given below::
i 32 33
h 35 37
a 37 38
d 41 42
t 44 45
h 45 46
a 47 48
t 50 51
c 54 55
u 58 60
r 63 64
i 65 66
o 72 73
s 79 80
i 83 84
t 85 86
y 88 89
b 93 94
e 95 96
s 101 102
i 110 111
d 113 114
e 114 115
m 116 117
e 119 120
a 124 125
t 127 128
t 129 130
h 130 131
i 132 133
s 136 137
m 141 142
o 144 145
m 148 149
e 151 152
n 153 154
t 155 156
All of the code below is copied and modified
from `<https://pytorch.org/audio/main/tutorials/ctc_forced_alignment_api_tutorial.html>`_.
Segment each word using the computed alignments
-----------------------------------------------
.. code-block:: python3
   def unflatten(list_, lengths):
       # Split list_ into consecutive chunks whose sizes are given by lengths.
       assert len(list_) == sum(lengths)
i = 0
ret = []
for l in lengths:
ret.append(list_[i : i + l])
i += l
return ret
word_spans = unflatten(token_spans, [len(word) for word in transcript])
print(word_spans)
The output is::
[[TokenSpan(token=2, start=32, end=33)],
[TokenSpan(token=15, start=35, end=37), TokenSpan(token=1, start=37, end=38), TokenSpan(token=13, start=41, end=42)],
[TokenSpan(token=7, start=44, end=45), TokenSpan(token=15, start=45, end=46), TokenSpan(token=1, start=47, end=48), TokenSpan(token=7, start=50, end=51)],
[TokenSpan(token=20, start=54, end=55), TokenSpan(token=6, start=58, end=60), TokenSpan(token=9, start=63, end=64), TokenSpan(token=2, start=65, end=66), TokenSpan(token=5, start=72, end=73), TokenSpan(token=8, start=79, end=80), TokenSpan(token=2, start=83, end=84), TokenSpan(token=7, start=85, end=86), TokenSpan(token=16, start=88, end=89)],
[TokenSpan(token=17, start=93, end=94), TokenSpan(token=3, start=95, end=96), TokenSpan(token=8, start=101, end=102), TokenSpan(token=2, start=110, end=111), TokenSpan(token=13, start=113, end=114), TokenSpan(token=3, start=114, end=115)],
[TokenSpan(token=10, start=116, end=117), TokenSpan(token=3, start=119, end=120)],
[TokenSpan(token=1, start=124, end=125), TokenSpan(token=7, start=127, end=128)],
[TokenSpan(token=7, start=129, end=130), TokenSpan(token=15, start=130, end=131), TokenSpan(token=2, start=132, end=133), TokenSpan(token=8, start=136, end=137)],
[TokenSpan(token=10, start=141, end=142), TokenSpan(token=5, start=144, end=145), TokenSpan(token=10, start=148, end=149), TokenSpan(token=3, start=151, end=152), TokenSpan(token=4, start=153, end=154), TokenSpan(token=7, start=155, end=156)]
]
.. code-block:: python3
   import IPython.display

   def preview_word(waveform, spans, num_frames, transcript, sample_rate=bundle.sample_rate):
ratio = waveform.size(1) / num_frames
x0 = int(ratio * spans[0].start)
x1 = int(ratio * spans[-1].end)
print(f"{transcript} {x0 / sample_rate:.3f} - {x1 / sample_rate:.3f} sec")
segment = waveform[:, x0:x1]
return IPython.display.Audio(segment.numpy(), rate=sample_rate)
num_frames = emission.size(1)
.. code-block:: python3
preview_word(waveform, word_spans[0], num_frames, transcript[0])
preview_word(waveform, word_spans[1], num_frames, transcript[1])
preview_word(waveform, word_spans[2], num_frames, transcript[2])
preview_word(waveform, word_spans[3], num_frames, transcript[3])
preview_word(waveform, word_spans[4], num_frames, transcript[4])
preview_word(waveform, word_spans[5], num_frames, transcript[5])
preview_word(waveform, word_spans[6], num_frames, transcript[6])
preview_word(waveform, word_spans[7], num_frames, transcript[7])
preview_word(waveform, word_spans[8], num_frames, transcript[8])
The segmented wave of each word, along with its timestamp, is given below:
.. raw:: html
<table>
<tr>
<th>Word</th>
<th>Time</th>
<th>Wave</th>
</tr>
<tr>
<td>i</td>
<td>0.644 - 0.664 sec</td>
<td>
<audio title="i.wav" controls="controls">
<source src="/icefall/_static/kaldi-align/i.wav" type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
</td>
</tr>
<tr>
<td>had</td>
<td>0.704 - 0.845 sec</td>
<td>
<audio title="had.wav" controls="controls">
<source src="/icefall/_static/kaldi-align/had.wav" type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
</td>
</tr>
<tr>
<td>that</td>
<td>0.885 - 1.026 sec</td>
<td>
<audio title="that.wav" controls="controls">
<source src="/icefall/_static/kaldi-align/that.wav" type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
</td>
</tr>
<tr>
<td>curiosity</td>
<td>1.086 - 1.790 sec</td>
<td>
<audio title="curiosity.wav" controls="controls">
<source src="/icefall/_static/kaldi-align/curiosity.wav" type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
</td>
</tr>
<tr>
<td>beside</td>
<td>1.871 - 2.314 sec</td>
<td>
<audio title="beside.wav" controls="controls">
<source src="/icefall/_static/kaldi-align/beside.wav" type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
</td>
</tr>
<tr>
<td>me</td>
<td>2.334 - 2.414 sec</td>
<td>
<audio title="me.wav" controls="controls">
<source src="/icefall/_static/kaldi-align/me.wav" type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
</td>
</tr>
<tr>
<td>at</td>
<td>2.495 - 2.575 sec</td>
<td>
<audio title="at.wav" controls="controls">
<source src="/icefall/_static/kaldi-align/at.wav" type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
</td>
</tr>
<tr>
<td>this</td>
<td>2.595 - 2.756 sec</td>
<td>
<audio title="this.wav" controls="controls">
<source src="/icefall/_static/kaldi-align/this.wav" type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
</td>
</tr>
<tr>
<td>moment</td>
<td>2.837 - 3.138 sec</td>
<td>
<audio title="moment.wav" controls="controls">
<source src="/icefall/_static/kaldi-align/moment.wav" type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
</td>
</tr>
</table>
We repost the whole wave below for ease of reference:
.. raw:: html
<table>
<tr>
<th>Wave filename</th>
<th>Content</th>
<th>Text</th>
</tr>
<tr>
<td>Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav</td>
<td>
<audio title="Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav" controls="controls">
<source src="/icefall/_static/kaldi-align/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042.wav" type="audio/wav">
Your browser does not support the <code>audio</code> element.
</audio>
</td>
<td>
i had that curiosity beside me at this moment
</td>
</tr>
</table>
Summary
-------
Congratulations! You have succeeded in using the FST-based approach to
compute the alignment of a test wave.

View File

@ -1,13 +0,0 @@
Huggingface
===========
This section describes how to find pre-trained models.
It also demonstrates how to try them from within your browser
without installing anything by using
`Huggingface spaces <https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition>`_.
.. toctree::
:maxdepth: 2
pretrained-models
spaces

Binary file not shown.


Binary file not shown.


Binary file not shown.


View File

@ -1,17 +0,0 @@
Pre-trained models
==================
We have uploaded pre-trained models for all recipes in ``icefall``
to `<https://huggingface.co/>`_.
You can find them by visiting the following link:
`<https://huggingface.co/models?search=icefall>`_.
You can also find links to pre-trained models for a specific recipe
by looking at the corresponding ``RESULTS.md``. For instance:
- `<https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md>`_
- `<https://github.com/k2-fsa/icefall/blob/master/egs/aishell/ASR/RESULTS.md>`_
- `<https://github.com/k2-fsa/icefall/blob/master/egs/gigaspeech/ASR/RESULTS.md>`_
- `<https://github.com/k2-fsa/icefall/blob/master/egs/wenetspeech/ASR/RESULTS.md>`_

View File

@ -1,65 +0,0 @@
Huggingface spaces
==================
We have integrated the server framework
`sherpa <http://github.com/k2-fsa/sherpa>`_
with `Huggingface spaces <https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition>`_
so that you can try pre-trained models from within your browser
without the need to download or install anything.
All you need is a browser, whether you are on Windows, macOS, or Linux, or
even on your iPad or phone.
Start your browser and visit the following address:
`<https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition>`_
and you will see a page like the following screenshot:
.. image:: ./pic/hugging-face-sherpa.png
:alt: screenshot of `<https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition>`_
:target: https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition
You can:
1. Select a language for recognition. Currently, we provide pre-trained models
from ``icefall`` for the following languages: ``Chinese``, ``English``, and
``Chinese+English``.
2. After selecting the target language, you can select a pre-trained model
corresponding to the language.
3. Select the decoding method. Currently, it provides ``greedy search``
and ``modified_beam_search``.
4. If you selected ``modified_beam_search``, you can choose the number of
active paths during the search.
5. Either upload a file or record your speech for recognition.
6. Click the button ``Submit for recognition``.
7. Wait for a moment and you will get the recognition results.
The following screenshot shows an example when selecting ``Chinese+English``:
.. image:: ./pic/hugging-face-sherpa-3.png
:alt: screenshot of `<https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition>`_
:target: https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition
In the bottom part of the page, you can find a table of examples. You can click
one of them and then click ``Submit for recognition``.
.. image:: ./pic/hugging-face-sherpa-2.png
:alt: screenshot of `<https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition>`_
:target: https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition
YouTube Video
-------------
We provide the following YouTube video demonstrating how to use
`<https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition>`_.
.. note::
   To get the latest news of `next-gen Kaldi <https://github.com/k2-fsa>`_, please subscribe
   to the following YouTube channel by `Nadira Povey <https://www.youtube.com/channel/UC_VaumpkmINz1pNkFXAN9mw>`_:
`<https://www.youtube.com/channel/UC_VaumpkmINz1pNkFXAN9mw>`_
.. youtube:: ElN3r9dkKE4

View File

@ -1,44 +0,0 @@
.. icefall documentation master file, created by
sphinx-quickstart on Mon Aug 23 16:07:39 2021.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Icefall
=======
.. image:: _static/logo.png
:alt: icefall logo
:width: 168px
:align: center
:target: https://github.com/k2-fsa/icefall
Documentation for `icefall <https://github.com/k2-fsa/icefall>`_, containing
speech recognition recipes using `k2 <https://github.com/k2-fsa/k2>`_.
.. toctree::
:maxdepth: 2
:caption: Contents:
for-dummies/index.rst
installation/index
docker/index
faqs
model-export/index
fst-based-forced-alignment/index
.. toctree::
:maxdepth: 3
recipes/index
.. toctree::
:maxdepth: 2
contributing/index
huggingface/index
.. toctree::
:maxdepth: 2
decoding-with-langugage-models/index


@ -1,4 +0,0 @@
# Introduction
<https://shields.io/> is used to generate files in this directory.


@ -1 +0,0 @@
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="122" height="20" role="img" aria-label="device: CPU | CUDA"><title>device: CPU | CUDA</title><linearGradient id="s" x2="0" y2="100%"><stop offset="0" stop-color="#bbb" stop-opacity=".1"/><stop offset="1" stop-opacity=".1"/></linearGradient><clipPath id="r"><rect width="122" height="20" rx="3" fill="#fff"/></clipPath><g clip-path="url(#r)"><rect width="45" height="20" fill="#555"/><rect x="45" width="77" height="20" fill="#fe7d37"/><rect width="122" height="20" fill="url(#s)"/></g><g fill="#fff" text-anchor="middle" font-family="Verdana,Geneva,DejaVu Sans,sans-serif" text-rendering="geometricPrecision" font-size="110"><text aria-hidden="true" x="235" y="150" fill="#010101" fill-opacity=".3" transform="scale(.1)" textLength="350">device</text><text x="235" y="140" transform="scale(.1)" fill="#fff" textLength="350">device</text><text aria-hidden="true" x="825" y="150" fill="#010101" fill-opacity=".3" transform="scale(.1)" textLength="670">CPU | CUDA</text><text x="825" y="140" transform="scale(.1)" fill="#fff" textLength="670">CPU | CUDA</text></g></svg>


@ -1 +0,0 @@
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="80" height="20" role="img" aria-label="k2: &gt;= v1.9"><title>k2: &gt;= v1.9</title><linearGradient id="s" x2="0" y2="100%"><stop offset="0" stop-color="#bbb" stop-opacity=".1"/><stop offset="1" stop-opacity=".1"/></linearGradient><clipPath id="r"><rect width="80" height="20" rx="3" fill="#fff"/></clipPath><g clip-path="url(#r)"><rect width="23" height="20" fill="#555"/><rect x="23" width="57" height="20" fill="blueviolet"/><rect width="80" height="20" fill="url(#s)"/></g><g fill="#fff" text-anchor="middle" font-family="Verdana,Geneva,DejaVu Sans,sans-serif" text-rendering="geometricPrecision" font-size="110"><text aria-hidden="true" x="125" y="150" fill="#010101" fill-opacity=".3" transform="scale(.1)" textLength="130">k2</text><text x="125" y="140" transform="scale(.1)" fill="#fff" textLength="130">k2</text><text aria-hidden="true" x="505" y="150" fill="#010101" fill-opacity=".3" transform="scale(.1)" textLength="470">&gt;= v1.9</text><text x="505" y="140" transform="scale(.1)" fill="#fff" textLength="470">&gt;= v1.9</text></g></svg>


@ -1 +0,0 @@
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="114" height="20" role="img" aria-label="os: Linux | macOS"><title>os: Linux | macOS</title><linearGradient id="s" x2="0" y2="100%"><stop offset="0" stop-color="#bbb" stop-opacity=".1"/><stop offset="1" stop-opacity=".1"/></linearGradient><clipPath id="r"><rect width="114" height="20" rx="3" fill="#fff"/></clipPath><g clip-path="url(#r)"><rect width="23" height="20" fill="#555"/><rect x="23" width="91" height="20" fill="#ff69b4"/><rect width="114" height="20" fill="url(#s)"/></g><g fill="#fff" text-anchor="middle" font-family="Verdana,Geneva,DejaVu Sans,sans-serif" text-rendering="geometricPrecision" font-size="110"><text aria-hidden="true" x="125" y="150" fill="#010101" fill-opacity=".3" transform="scale(.1)" textLength="130">os</text><text x="125" y="140" transform="scale(.1)" fill="#fff" textLength="130">os</text><text aria-hidden="true" x="675" y="150" fill="#010101" fill-opacity=".3" transform="scale(.1)" textLength="810">Linux | macOS</text><text x="675" y="140" transform="scale(.1)" fill="#fff" textLength="810">Linux | macOS</text></g></svg>


@ -1 +0,0 @@
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="98" height="20" role="img" aria-label="python: &gt;= 3.6"><title>python: &gt;= 3.6</title><linearGradient id="s" x2="0" y2="100%"><stop offset="0" stop-color="#bbb" stop-opacity=".1"/><stop offset="1" stop-opacity=".1"/></linearGradient><clipPath id="r"><rect width="98" height="20" rx="3" fill="#fff"/></clipPath><g clip-path="url(#r)"><rect width="49" height="20" fill="#555"/><rect x="49" width="49" height="20" fill="#007ec6"/><rect width="98" height="20" fill="url(#s)"/></g><g fill="#fff" text-anchor="middle" font-family="Verdana,Geneva,DejaVu Sans,sans-serif" text-rendering="geometricPrecision" font-size="110"><text aria-hidden="true" x="255" y="150" fill="#010101" fill-opacity=".3" transform="scale(.1)" textLength="390">python</text><text x="255" y="140" transform="scale(.1)" fill="#fff" textLength="390">python</text><text aria-hidden="true" x="725" y="150" fill="#010101" fill-opacity=".3" transform="scale(.1)" textLength="390">&gt;= 3.6</text><text x="725" y="140" transform="scale(.1)" fill="#fff" textLength="390">&gt;= 3.6</text></g></svg>


@ -1 +0,0 @@
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="100" height="20" role="img" aria-label="torch: &gt;= 1.6.0"><title>torch: &gt;= 1.6.0</title><linearGradient id="s" x2="0" y2="100%"><stop offset="0" stop-color="#bbb" stop-opacity=".1"/><stop offset="1" stop-opacity=".1"/></linearGradient><clipPath id="r"><rect width="100" height="20" rx="3" fill="#fff"/></clipPath><g clip-path="url(#r)"><rect width="39" height="20" fill="#555"/><rect x="39" width="61" height="20" fill="#97ca00"/><rect width="100" height="20" fill="url(#s)"/></g><g fill="#fff" text-anchor="middle" font-family="Verdana,Geneva,DejaVu Sans,sans-serif" text-rendering="geometricPrecision" font-size="110"><text aria-hidden="true" x="205" y="150" fill="#010101" fill-opacity=".3" transform="scale(.1)" textLength="290">torch</text><text x="205" y="140" transform="scale(.1)" fill="#fff" textLength="290">torch</text><text aria-hidden="true" x="685" y="150" fill="#010101" fill-opacity=".3" transform="scale(.1)" textLength="510">&gt;= 1.6.0</text><text x="685" y="140" transform="scale(.1)" fill="#fff" textLength="510">&gt;= 1.6.0</text></g></svg>


@ -1,553 +0,0 @@
.. _install icefall:
Installation
============
.. hint::
We also provide :ref:`icefall_docker` support, which already has the
environment set up for you.
.. hint::
We have a Colab notebook that guides you step by step through setting up the environment.
|yesno colab notebook|
.. |yesno colab notebook| image:: https://colab.research.google.com/assets/colab-badge.svg
:target: https://colab.research.google.com/drive/1tIjjzaJc3IvGyKiMCDWO-TSnBgkcuN3B?usp=sharing
`icefall`_ depends on `k2`_ and `lhotse`_.
We recommend that you use the following steps to install the dependencies.
- (0) Install CUDA toolkit and cuDNN
- (1) Install `torch`_ and `torchaudio`_
- (2) Install `k2`_
- (3) Install `lhotse`_
.. caution::
Installation order matters.
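The following sections walk through each step in detail. As a compact,
hedged sketch of the same order for a CPU-only setup (the version pins are
illustrative; check the linked pages for the wheels that actually exist for
your platform):

.. code-block:: bash

   # (1) Install torch and torchaudio together (CPU builds in this sketch).
   pip install torch==1.13.0 torchaudio==0.13.0

   # (2) Install a pre-compiled k2 wheel matching the installed torch.
   pip install k2==1.24.3.dev20230725+cpu.torch1.13.0 -f https://k2-fsa.github.io/k2/cpu.html

   # (3) Install the latest lhotse from git.
   pip install git+https://github.com/lhotse-speech/lhotse

   # (4) Download icefall and put it on PYTHONPATH (see below).
   git clone https://github.com/k2-fsa/icefall /tmp/icefall
   pip install -r /tmp/icefall/requirements.txt
   export PYTHONPATH=/tmp/icefall:$PYTHONPATH

Step (0) is omitted in this sketch because a CPU-only setup needs neither
CUDA nor cuDNN.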
(0) Install CUDA toolkit and cuDNN
----------------------------------
Please refer to
`<https://k2-fsa.github.io/k2/installation/cuda-cudnn.html>`_
to install CUDA and cuDNN.
(1) Install torch and torchaudio
--------------------------------
Please refer `<https://pytorch.org/>`_ to install `torch`_ and `torchaudio`_.
.. caution::
Please install torch and torchaudio at the same time.
(2) Install k2
--------------
Please refer to `<https://k2-fsa.github.io/k2/installation/index.html>`_
to install `k2`_.
.. caution::
Please don't change your installed PyTorch after you have installed k2.
.. note::
We suggest that you install k2 from pre-compiled wheels by following
`<https://k2-fsa.github.io/k2/installation/from_wheels.html>`_
.. hint::
Please always install the latest version of `k2`_.
(3) Install lhotse
------------------
Please refer to `<https://lhotse.readthedocs.io/en/latest/getting-started.html#installation>`_
to install `lhotse`_.
.. hint::
We strongly recommend that you use::
pip install git+https://github.com/lhotse-speech/lhotse
to install the latest version of `lhotse`_.
(4) Download icefall
--------------------
`icefall`_ is a collection of Python scripts; all you need to do is download it
and set the environment variable ``PYTHONPATH`` to point to it.
Assume you want to place `icefall`_ in the folder ``/tmp``. The
following commands show you how to set up `icefall`_:
.. code-block:: bash
cd /tmp
git clone https://github.com/k2-fsa/icefall
cd icefall
pip install -r requirements.txt
export PYTHONPATH=/tmp/icefall:$PYTHONPATH
.. HINT::
You can put several versions of `icefall`_ in the same virtual environment.
To switch among different versions of `icefall`_, just set ``PYTHONPATH``
to point to the version you want.
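For example (a hedged sketch; the paths and the commit are hypothetical):

.. code-block:: bash

   # Two independent checkouts, e.g. master and a pinned commit.
   git clone https://github.com/k2-fsa/icefall /tmp/icefall-master
   git clone https://github.com/k2-fsa/icefall /tmp/icefall-pinned
   git -C /tmp/icefall-pinned checkout <some-commit>

   # In a given shell, point PYTHONPATH at the checkout you want to use:
   export PYTHONPATH=/tmp/icefall-master:$PYTHONPATH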
Installation example
--------------------
The following shows an example of setting up the environment.
(1) Create a virtual environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
kuangfangjun:~$ virtualenv -p python3.8 test-icefall
created virtual environment CPython3.8.0.final.0-64 in 9422ms
creator CPython3Posix(dest=/star-fj/fangjun/test-icefall, clear=False, no_vcs_ignore=False, global=False)
seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/star-fj/fangjun/.local/share/virtualenv)
added seed packages: pip==22.3.1, setuptools==65.6.3, wheel==0.38.4
activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator
kuangfangjun:~$ source test-icefall/bin/activate
(test-icefall) kuangfangjun:~$
(2) Install CUDA toolkit and cuDNN
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You need to determine the version of the CUDA toolkit to install.
.. code-block:: bash
(test-icefall) kuangfangjun:~$ nvidia-smi | head -n 4
Wed Jul 26 21:57:49 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
You can choose any CUDA version that is ``not`` greater than the version printed by ``nvidia-smi``.
In our case, we can choose any version ``<= 11.6``.
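If you want just the number, the following one-liner is a sketch that
assumes GNU ``grep`` and the usual ``CUDA Version: X.Y`` field in the
``nvidia-smi`` banner:

.. code-block:: bash

   # Prints e.g. "11.6" -- the highest CUDA version the driver supports.
   nvidia-smi | grep -oP 'CUDA Version: \K[0-9.]+'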
We will use ``CUDA 11.6`` in this example. Please follow
`<https://k2-fsa.github.io/k2/installation/cuda-cudnn.html#cuda-11-6>`_
to install CUDA toolkit and cuDNN if you have not done that before.
After installing the CUDA toolkit, you can use the following command to verify it:
.. code-block:: bash
(test-icefall) kuangfangjun:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
(3) Install torch and torchaudio
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Since we have selected CUDA toolkit ``11.6``, we have to install a version of `torch`_
that is compiled against CUDA ``11.6``. We select ``torch 1.13.0+cu116`` in this
example.
After selecting the version of `torch`_ to install, we also need to install
a compatible version of `torchaudio`_, which is ``0.13.0+cu116`` in our case.
Please refer to `<https://pytorch.org/audio/stable/installation.html#compatibility-matrix>`_
to select an appropriate version of `torchaudio`_ to install if you use a different
version of `torch`_.
.. code-block:: bash
(test-icefall) kuangfangjun:~$ pip install torch==1.13.0+cu116 torchaudio==0.13.0+cu116 -f https://download.pytorch.org/whl/torch_stable.html
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.13.0+cu116
Downloading https://download.pytorch.org/whl/cu116/torch-1.13.0%2Bcu116-cp38-cp38-linux_x86_64.whl (1983.0 MB)
________________________________________ 2.0/2.0 GB 764.4 kB/s eta 0:00:00
Collecting torchaudio==0.13.0+cu116
Downloading https://download.pytorch.org/whl/cu116/torchaudio-0.13.0%2Bcu116-cp38-cp38-linux_x86_64.whl (4.2 MB)
________________________________________ 4.2/4.2 MB 1.3 MB/s eta 0:00:00
Requirement already satisfied: typing-extensions in /star-fj/fangjun/test-icefall/lib/python3.8/site-packages (from torch==1.13.0+cu116) (4.7.1)
Installing collected packages: torch, torchaudio
Successfully installed torch-1.13.0+cu116 torchaudio-0.13.0+cu116
Verify that `torch`_ and `torchaudio`_ have been installed successfully:
.. code-block:: bash
(test-icefall) kuangfangjun:~$ python3 -c "import torch; print(torch.__version__)"
1.13.0+cu116
(test-icefall) kuangfangjun:~$ python3 -c "import torchaudio; print(torchaudio.__version__)"
0.13.0+cu116
(4) Install k2
~~~~~~~~~~~~~~
We will install `k2`_ from pre-compiled wheels by following
`<https://k2-fsa.github.io/k2/installation/from_wheels.html>`_
.. code-block:: bash
(test-icefall) kuangfangjun:~$ pip install k2==1.24.3.dev20230725+cuda11.6.torch1.13.0 -f https://k2-fsa.github.io/k2/cuda.html
# For users from China
# (Users in mainland China: if huggingface is not accessible, please use)
# pip install k2==1.24.3.dev20230725+cuda11.6.torch1.13.0 -f https://k2-fsa.github.io/k2/cuda-cn.html
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Looking in links: https://k2-fsa.github.io/k2/cuda.html
Collecting k2==1.24.3.dev20230725+cuda11.6.torch1.13.0
Downloading https://huggingface.co/csukuangfj/k2/resolve/main/ubuntu-cuda/k2-1.24.3.dev20230725%2Bcuda11.6.torch1.13.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (104.3 MB)
________________________________________ 104.3/104.3 MB 5.1 MB/s eta 0:00:00
Requirement already satisfied: torch==1.13.0 in /star-fj/fangjun/test-icefall/lib/python3.8/site-packages (from k2==1.24.3.dev20230725+cuda11.6.torch1.13.0) (1.13.0+cu116)
Collecting graphviz
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/de/5e/fcbb22c68208d39edff467809d06c9d81d7d27426460ebc598e55130c1aa/graphviz-0.20.1-py3-none-any.whl (47 kB)
Requirement already satisfied: typing-extensions in /star-fj/fangjun/test-icefall/lib/python3.8/site-packages (from torch==1.13.0->k2==1.24.3.dev20230725+cuda11.6.torch1.13.0) (4.7.1)
Installing collected packages: graphviz, k2
Successfully installed graphviz-0.20.1 k2-1.24.3.dev20230725+cuda11.6.torch1.13.0
.. hint::
Please refer to `<https://k2-fsa.github.io/k2/cuda.html>`_ for the
pre-compiled wheels available for `k2`_.
Verify that `k2`_ has been installed successfully:
.. code-block:: bash
(test-icefall) kuangfangjun:~$ python3 -m k2.version
Collecting environment information...
k2 version: 1.24.3
Build type: Release
Git SHA1: 4c05309499a08454997adf500b56dcc629e35ae5
Git date: Tue Jul 25 16:23:36 2023
Cuda used to build k2: 11.6
cuDNN used to build k2: 8.3.2
Python version used to build k2: 3.8
OS used to build k2: CentOS Linux release 7.9.2009 (Core)
CMake version: 3.27.0
GCC version: 9.3.1
CMAKE_CUDA_FLAGS: -Wno-deprecated-gpu-targets -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w --expt-extended-lambda -gencode arch=compute_35,code=sm_35 -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w --expt-extended-lambda -gencode arch=compute_50,code=sm_50 -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w --expt-extended-lambda -gencode arch=compute_60,code=sm_60 -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w --expt-extended-lambda -gencode arch=compute_61,code=sm_61 -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w --expt-extended-lambda -gencode arch=compute_70,code=sm_70 -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w --expt-extended-lambda -gencode arch=compute_75,code=sm_75 -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w --expt-extended-lambda -gencode arch=compute_80,code=sm_80 -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w --expt-extended-lambda -gencode arch=compute_86,code=sm_86 -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=integer_sign_change,--diag_suppress=useless_using_declaration,--diag_suppress=set_but_not_used,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=implicit_return_from_non_void_function,--diag_suppress=unsigned_compare_with_zero,--diag_suppress=declared_but_not_referenced,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -D_GLIBCXX_USE_CXX11_ABI=0 --compiler-options -Wall --compiler-options -Wno-strict-overflow --compiler-options -Wno-unknown-pragmas
CMAKE_CXX_FLAGS: -D_GLIBCXX_USE_CXX11_ABI=0 -Wno-unused-variable -Wno-strict-overflow
PyTorch version used to build k2: 1.13.0+cu116
PyTorch is using Cuda: 11.6
NVTX enabled: True
With CUDA: True
Disable debug: True
Sync kernels : False
Disable checks: False
Max cpu memory allocate: 214748364800 bytes (or 200.0 GB)
k2 abort: False
__file__: /star-fj/fangjun/test-icefall/lib/python3.8/site-packages/k2/version/version.py
_k2.__file__: /star-fj/fangjun/test-icefall/lib/python3.8/site-packages/_k2.cpython-38-x86_64-linux-gnu.so
(5) Install lhotse
~~~~~~~~~~~~~~~~~~
.. code-block:: bash
(test-icefall) kuangfangjun:~$ pip install git+https://github.com/lhotse-speech/lhotse
Collecting git+https://github.com/lhotse-speech/lhotse
Cloning https://github.com/lhotse-speech/lhotse to /tmp/pip-req-build-vq12fd5i
Running command git clone --filter=blob:none --quiet https://github.com/lhotse-speech/lhotse /tmp/pip-req-build-vq12fd5i
Resolved https://github.com/lhotse-speech/lhotse to commit 7640d663469b22cd0b36f3246ee9b849cd25e3b7
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Collecting cytoolz>=0.10.1
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/1e/3b/a7828d575aa17fb7acaf1ced49a3655aa36dad7e16eb7e6a2e4df0dda76f/cytoolz-0.12.2-cp38-cp38-
manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
________________________________________ 2.0/2.0 MB 33.2 MB/s eta 0:00:00
Collecting pyyaml>=5.3.1
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/c8/6b/6600ac24725c7388255b2f5add93f91e58a5d7efaf4af244fdbcc11a541b/PyYAML-6.0.1-cp38-cp38-ma
nylinux_2_17_x86_64.manylinux2014_x86_64.whl (736 kB)
________________________________________ 736.6/736.6 kB 38.6 MB/s eta 0:00:00
Collecting dataclasses
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/26/2f/1095cdc2868052dd1e64520f7c0d5c8c550ad297e944e641dbf1ffbb9a5d/dataclasses-0.6-py3-none-
any.whl (14 kB)
Requirement already satisfied: torchaudio in ./test-icefall/lib/python3.8/site-packages (from lhotse==1.16.0.dev0+git.7640d66.clean) (0.13.0+cu116)
Collecting lilcom>=1.1.0
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/a8/65/df0a69c52bd085ca1ad4e5c4c1a5c680e25f9477d8e49316c4ff1e5084a4/lilcom-1.7-cp38-cp38-many
linux_2_17_x86_64.manylinux2014_x86_64.whl (87 kB)
________________________________________ 87.1/87.1 kB 8.7 MB/s eta 0:00:00
Collecting tqdm
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/e6/02/a2cff6306177ae6bc73bc0665065de51dfb3b9db7373e122e2735faf0d97/tqdm-4.65.0-py3-none-any
.whl (77 kB)
Requirement already satisfied: numpy>=1.18.1 in ./test-icefall/lib/python3.8/site-packages (from lhotse==1.16.0.dev0+git.7640d66.clean) (1.24.4)
Collecting audioread>=2.1.9
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/5d/cb/82a002441902dccbe427406785db07af10182245ee639ea9f4d92907c923/audioread-3.0.0.tar.gz (
377 kB)
Preparing metadata (setup.py) ... done
Collecting tabulate>=0.8.1
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/40/44/4a5f08c96eb108af5cb50b41f76142f0afa346dfa99d5296fe7202a11854/tabulate-0.9.0-py3-none-
any.whl (35 kB)
Collecting click>=7.1.1
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/1a/70/e63223f8116931d365993d4a6b7ef653a4d920b41d03de7c59499962821f/click-8.1.6-py3-none-any.
whl (97 kB)
________________________________________ 97.9/97.9 kB 8.4 MB/s eta 0:00:00
Collecting packaging
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/ab/c3/57f0601a2d4fe15de7a553c00adbc901425661bf048f2a22dfc500caf121/packaging-23.1-py3-none-
any.whl (48 kB)
Collecting intervaltree>=3.1.0
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/50/fb/396d568039d21344639db96d940d40eb62befe704ef849b27949ded5c3bb/intervaltree-3.1.0.tar.gz
(32 kB)
Preparing metadata (setup.py) ... done
Requirement already satisfied: torch in ./test-icefall/lib/python3.8/site-packages (from lhotse==1.16.0.dev0+git.7640d66.clean) (1.13.0+cu116)
Collecting SoundFile>=0.10
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/ad/bd/0602167a213d9184fc688b1086dc6d374b7ae8c33eccf169f9b50ce6568c/soundfile-0.12.1-py2.py3-
none-manylinux_2_17_x86_64.whl (1.3 MB)
________________________________________ 1.3/1.3 MB 46.5 MB/s eta 0:00:00
Collecting toolz>=0.8.0
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/7f/5c/922a3508f5bda2892be3df86c74f9cf1e01217c2b1f8a0ac4841d903e3e9/toolz-0.12.0-py3-none-any.whl (55 kB)
Collecting sortedcontainers<3.0,>=2.0
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/32/46/9cb0e58b2deb7f82b84065f37f3bffeb12413f947f9388e4cac22c4621ce/sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB)
Collecting cffi>=1.0
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/b7/8b/06f30caa03b5b3ac006de4f93478dbd0239e2a16566d81a106c322dc4f79/cffi-1.15.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (442 kB)
Requirement already satisfied: typing-extensions in ./test-icefall/lib/python3.8/site-packages (from torch->lhotse==1.16.0.dev0+git.7640d66.clean) (4.7.1)
Collecting pycparser
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/62/d5/5f610ebe421e85889f2e55e33b7f9a6795bd982198517d912eb1c76e1a53/pycparser-2.21-py2.py3-none-any.whl (118 kB)
Building wheels for collected packages: lhotse, audioread, intervaltree
Building wheel for lhotse (pyproject.toml) ... done
Created wheel for lhotse: filename=lhotse-1.16.0.dev0+git.7640d66.clean-py3-none-any.whl size=687627 sha256=cbf0a4d2d0b639b33b91637a4175bc251d6a021a069644ecb1a9f2b3a83d072a
Stored in directory: /tmp/pip-ephem-wheel-cache-wwtk90_m/wheels/7f/7a/8e/a0bf241336e2e3cb573e1e21e5600952d49f5162454f2e612f
Building wheel for audioread (setup.py) ... done
Created wheel for audioread: filename=audioread-3.0.0-py3-none-any.whl size=23704 sha256=5e2d3537c96ce9cf0f645a654c671163707bf8cb8d9e358d0e2b0939a85ff4c2
Stored in directory: /star-fj/fangjun/.cache/pip/wheels/e2/c3/9c/f19ae5a03f8862d9f0776b0c0570f1fdd60a119d90954e3f39
Building wheel for intervaltree (setup.py) ... done
Created wheel for intervaltree: filename=intervaltree-3.1.0-py2.py3-none-any.whl size=26098 sha256=2604170976cfffe0d2f678cb1a6e5b525f561cd50babe53d631a186734fec9f9
Stored in directory: /star-fj/fangjun/.cache/pip/wheels/f3/ed/2b/c179ebfad4e15452d6baef59737f27beb9bfb442e0620f7271
Successfully built lhotse audioread intervaltree
Installing collected packages: sortedcontainers, dataclasses, tqdm, toolz, tabulate, pyyaml, pycparser, packaging, lilcom, intervaltree, click, audioread, cytoolz, cffi, SoundFile, lhotse
Successfully installed SoundFile-0.12.1 audioread-3.0.0 cffi-1.15.1 click-8.1.6 cytoolz-0.12.2 dataclasses-0.6 intervaltree-3.1.0 lhotse-1.16.0.dev0+git.7640d66.clean lilcom-1.7 packaging-23.1 pycparser-2.21 pyyaml-6.0.1 sortedcontainers-2.4.0 tabulate-0.9.0 toolz-0.12.0 tqdm-4.65.0
Verify that `lhotse`_ has been installed successfully:
.. code-block:: bash
(test-icefall) kuangfangjun:~$ python3 -c "import lhotse; print(lhotse.__version__)"
1.16.0.dev+git.7640d66.clean
(6) Download icefall
~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
(test-icefall) kuangfangjun:~$ cd /tmp/
(test-icefall) kuangfangjun:tmp$ git clone https://github.com/k2-fsa/icefall
Cloning into 'icefall'...
remote: Enumerating objects: 12942, done.
remote: Counting objects: 100% (67/67), done.
remote: Compressing objects: 100% (56/56), done.
remote: Total 12942 (delta 17), reused 35 (delta 6), pack-reused 12875
Receiving objects: 100% (12942/12942), 14.77 MiB | 9.29 MiB/s, done.
Resolving deltas: 100% (8835/8835), done.
(test-icefall) kuangfangjun:tmp$ cd icefall/
(test-icefall) kuangfangjun:icefall$ pip install -r ./requirements.txt
Test Your Installation
----------------------
To test that your installation is successful, let us run
the `yesno recipe <https://github.com/k2-fsa/icefall/tree/master/egs/yesno/ASR>`_
on ``CPU``.
Data preparation
~~~~~~~~~~~~~~~~
.. code-block:: bash
(test-icefall) kuangfangjun:icefall$ export PYTHONPATH=/tmp/icefall:$PYTHONPATH
(test-icefall) kuangfangjun:icefall$ cd /tmp/icefall
(test-icefall) kuangfangjun:icefall$ cd egs/yesno/ASR
(test-icefall) kuangfangjun:ASR$ ./prepare.sh
The log of running ``./prepare.sh`` is:
.. code-block::
2023-07-27 12:41:39 (prepare.sh:27:main) dl_dir: /tmp/icefall/egs/yesno/ASR/download
2023-07-27 12:41:39 (prepare.sh:30:main) Stage 0: Download data
/tmp/icefall/egs/yesno/ASR/download/waves_yesno.tar.gz: 100%|___________________________________________________| 4.70M/4.70M [00:00<00:00, 11.1MB/s]
2023-07-27 12:41:46 (prepare.sh:39:main) Stage 1: Prepare yesno manifest
2023-07-27 12:41:50 (prepare.sh:45:main) Stage 2: Compute fbank for yesno
2023-07-27 12:41:55,718 INFO [compute_fbank_yesno.py:65] Processing train
Extracting and storing features: 100%|_______________________________________________________________________________| 90/90 [00:01<00:00, 87.82it/s]
2023-07-27 12:41:56,778 INFO [compute_fbank_yesno.py:65] Processing test
Extracting and storing features: 100%|______________________________________________________________________________| 30/30 [00:00<00:00, 256.92it/s]
2023-07-27 12:41:57 (prepare.sh:51:main) Stage 3: Prepare lang
2023-07-27 12:42:02 (prepare.sh:66:main) Stage 4: Prepare G
/project/kaldilm/csrc/arpa_file_parser.cc:void kaldilm::ArpaFileParser::Read(std::istream&):79
[I] Reading \data\ section.
/project/kaldilm/csrc/arpa_file_parser.cc:void kaldilm::ArpaFileParser::Read(std::istream&):140
[I] Reading \1-grams: section.
2023-07-27 12:42:02 (prepare.sh:92:main) Stage 5: Compile HLG
2023-07-27 12:42:07,275 INFO [compile_hlg.py:124] Processing data/lang_phone
2023-07-27 12:42:07,276 INFO [lexicon.py:171] Converting L.pt to Linv.pt
2023-07-27 12:42:07,309 INFO [compile_hlg.py:48] Building ctc_topo. max_token_id: 3
2023-07-27 12:42:07,310 INFO [compile_hlg.py:52] Loading G.fst.txt
2023-07-27 12:42:07,314 INFO [compile_hlg.py:62] Intersecting L and G
2023-07-27 12:42:07,323 INFO [compile_hlg.py:64] LG shape: (4, None)
2023-07-27 12:42:07,323 INFO [compile_hlg.py:66] Connecting LG
2023-07-27 12:42:07,323 INFO [compile_hlg.py:68] LG shape after k2.connect: (4, None)
2023-07-27 12:42:07,323 INFO [compile_hlg.py:70] <class 'torch.Tensor'>
2023-07-27 12:42:07,323 INFO [compile_hlg.py:71] Determinizing LG
2023-07-27 12:42:07,341 INFO [compile_hlg.py:74] <class '_k2.ragged.RaggedTensor'>
2023-07-27 12:42:07,341 INFO [compile_hlg.py:76] Connecting LG after k2.determinize
2023-07-27 12:42:07,341 INFO [compile_hlg.py:79] Removing disambiguation symbols on LG
2023-07-27 12:42:07,354 INFO [compile_hlg.py:91] LG shape after k2.remove_epsilon: (6, None)
2023-07-27 12:42:07,445 INFO [compile_hlg.py:96] Arc sorting LG
2023-07-27 12:42:07,445 INFO [compile_hlg.py:99] Composing H and LG
2023-07-27 12:42:07,446 INFO [compile_hlg.py:106] Connecting LG
2023-07-27 12:42:07,446 INFO [compile_hlg.py:109] Arc sorting LG
2023-07-27 12:42:07,447 INFO [compile_hlg.py:111] HLG.shape: (8, None)
2023-07-27 12:42:07,447 INFO [compile_hlg.py:127] Saving HLG.pt to data/lang_phone
Training
~~~~~~~~
Now let us run the training part:
.. code-block::
(test-icefall) kuangfangjun:ASR$ export CUDA_VISIBLE_DEVICES=""
(test-icefall) kuangfangjun:ASR$ ./tdnn/train.py
.. CAUTION::
We use ``export CUDA_VISIBLE_DEVICES=""`` so that `icefall`_ uses CPU
even if there are GPUs available.
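To double-check that PyTorch indeed sees no GPU after this, you can run:

.. code-block:: bash

   # Should print "False" once CUDA_VISIBLE_DEVICES has been emptied.
   python3 -c "import torch; print(torch.cuda.is_available())"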
.. hint::
In case you get a ``Segmentation fault (core dumped)`` error, please use:
.. code-block:: bash
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
See `<https://github.com/k2-fsa/icefall/issues/674>`_ for more details if you
are interested.
The training log is given below:
.. code-block::
2023-07-27 12:50:51,936 INFO [train.py:481] Training started
2023-07-27 12:50:51,936 INFO [train.py:482] {'exp_dir': PosixPath('tdnn/exp'), 'lang_dir': PosixPath('data/lang_phone'), 'lr': 0.01, 'feature_dim': 23, 'weight_decay': 1e-06, 'start_epoch': 0, 'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 10, 'reset_interval': 20, 'valid_interval': 10, 'beam_size': 10, 'reduction': 'sum', 'use_double_scores': True, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 15, 'seed': 42, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 30.0, 'bucketing_sampler': False, 'num_buckets': 10, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': False, 'return_cuts': True, 'num_workers': 2, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '4c05309499a08454997adf500b56dcc629e35ae5', 'k2-git-date': 'Tue Jul 25 16:23:36 2023', 'lhotse-version': '1.16.0.dev+git.7640d66.clean', 'torch-version': '1.13.0+cu116', 'torch-cuda-available': False, 'torch-cuda-version': '11.6', 'python-version': '3.8', 'icefall-git-branch': 'master', 'icefall-git-sha1': '3fb0a43-clean', 'icefall-git-date': 'Thu Jul 27 12:36:05 2023', 'icefall-path': '/tmp/icefall', 'k2-path': '/star-fj/fangjun/test-icefall/lib/python3.8/site-packages/k2/__init__.py', 'lhotse-path': '/star-fj/fangjun/test-icefall/lib/python3.8/site-packages/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-1-1220091118-57c4d55446-sph26', 'IP address': '10.177.77.20'}}
2023-07-27 12:50:51,941 INFO [lexicon.py:168] Loading pre-compiled data/lang_phone/Linv.pt
2023-07-27 12:50:51,949 INFO [train.py:495] device: cpu
2023-07-27 12:50:51,965 INFO [asr_datamodule.py:146] About to get train cuts
2023-07-27 12:50:51,965 INFO [asr_datamodule.py:244] About to get train cuts
2023-07-27 12:50:51,967 INFO [asr_datamodule.py:149] About to create train dataset
2023-07-27 12:50:51,967 INFO [asr_datamodule.py:199] Using SingleCutSampler.
2023-07-27 12:50:51,967 INFO [asr_datamodule.py:205] About to create train dataloader
2023-07-27 12:50:51,968 INFO [asr_datamodule.py:218] About to get test cuts
2023-07-27 12:50:51,968 INFO [asr_datamodule.py:252] About to get test cuts
2023-07-27 12:50:52,565 INFO [train.py:422] Epoch 0, batch 0, loss[loss=1.065, over 2436.00 frames. ], tot_loss[loss=1.065, over 2436.00 frames. ], batch size: 4
2023-07-27 12:50:53,681 INFO [train.py:422] Epoch 0, batch 10, loss[loss=0.4561, over 2828.00 frames. ], tot_loss[loss=0.7076, over 22192.90 frames.], batch size: 4
2023-07-27 12:50:54,167 INFO [train.py:444] Epoch 0, validation loss=0.9002, over 18067.00 frames.
2023-07-27 12:50:55,011 INFO [train.py:422] Epoch 0, batch 20, loss[loss=0.2555, over 2695.00 frames. ], tot_loss[loss=0.484, over 34971.47 frames. ], batch size: 5
2023-07-27 12:50:55,331 INFO [train.py:444] Epoch 0, validation loss=0.4688, over 18067.00 frames.
2023-07-27 12:50:55,368 INFO [checkpoint.py:75] Saving checkpoint to tdnn/exp/epoch-0.pt
2023-07-27 12:50:55,633 INFO [train.py:422] Epoch 1, batch 0, loss[loss=0.2532, over 2436.00 frames. ], tot_loss[loss=0.2532, over 2436.00 frames. ],
batch size: 4
2023-07-27 12:50:56,242 INFO [train.py:422] Epoch 1, batch 10, loss[loss=0.1139, over 2828.00 frames. ], tot_loss[loss=0.1592, over 22192.90 frames.], batch size: 4
2023-07-27 12:50:56,522 INFO [train.py:444] Epoch 1, validation loss=0.1627, over 18067.00 frames.
2023-07-27 12:50:57,209 INFO [train.py:422] Epoch 1, batch 20, loss[loss=0.07055, over 2695.00 frames. ], tot_loss[loss=0.1175, over 34971.47 frames.], batch size: 5
2023-07-27 12:50:57,600 INFO [train.py:444] Epoch 1, validation loss=0.07091, over 18067.00 frames.
2023-07-27 12:50:57,640 INFO [checkpoint.py:75] Saving checkpoint to tdnn/exp/epoch-1.pt
2023-07-27 12:50:57,847 INFO [train.py:422] Epoch 2, batch 0, loss[loss=0.07731, over 2436.00 frames. ], tot_loss[loss=0.07731, over 2436.00 frames.], batch size: 4
2023-07-27 12:50:58,427 INFO [train.py:422] Epoch 2, batch 10, loss[loss=0.04391, over 2828.00 frames. ], tot_loss[loss=0.05341, over 22192.90 frames. ], batch size: 4
2023-07-27 12:50:58,884 INFO [train.py:444] Epoch 2, validation loss=0.04384, over 18067.00 frames.
2023-07-27 12:50:59,387 INFO [train.py:422] Epoch 2, batch 20, loss[loss=0.03458, over 2695.00 frames. ], tot_loss[loss=0.04616, over 34971.47 frames. ], batch size: 5
2023-07-27 12:50:59,707 INFO [train.py:444] Epoch 2, validation loss=0.03379, over 18067.00 frames.
2023-07-27 12:50:59,758 INFO [checkpoint.py:75] Saving checkpoint to tdnn/exp/epoch-2.pt
... ...
2023-07-27 12:51:23,433 INFO [train.py:422] Epoch 13, batch 0, loss[loss=0.01054, over 2436.00 frames. ], tot_loss[loss=0.01054, over 2436.00 frames. ], batch size: 4
2023-07-27 12:51:23,980 INFO [train.py:422] Epoch 13, batch 10, loss[loss=0.009014, over 2828.00 frames. ], tot_loss[loss=0.009974, over 22192.90 frames. ], batch size: 4
2023-07-27 12:51:24,489 INFO [train.py:444] Epoch 13, validation loss=0.01085, over 18067.00 frames.
2023-07-27 12:51:25,258 INFO [train.py:422] Epoch 13, batch 20, loss[loss=0.01172, over 2695.00 frames. ], tot_loss[loss=0.01055, over 34971.47 frames. ], batch size: 5
2023-07-27 12:51:25,621 INFO [train.py:444] Epoch 13, validation loss=0.01074, over 18067.00 frames.
2023-07-27 12:51:25,699 INFO [checkpoint.py:75] Saving checkpoint to tdnn/exp/epoch-13.pt
2023-07-27 12:51:25,866 INFO [train.py:422] Epoch 14, batch 0, loss[loss=0.01044, over 2436.00 frames. ], tot_loss[loss=0.01044, over 2436.00 frames. ], batch size: 4
2023-07-27 12:51:26,844 INFO [train.py:422] Epoch 14, batch 10, loss[loss=0.008942, over 2828.00 frames. ], tot_loss[loss=0.01, over 22192.90 frames. ], batch size: 4
2023-07-27 12:51:27,221 INFO [train.py:444] Epoch 14, validation loss=0.01082, over 18067.00 frames.
2023-07-27 12:51:27,970 INFO [train.py:422] Epoch 14, batch 20, loss[loss=0.01169, over 2695.00 frames. ], tot_loss[loss=0.01054, over 34971.47 frames. ], batch size: 5
2023-07-27 12:51:28,247 INFO [train.py:444] Epoch 14, validation loss=0.01073, over 18067.00 frames.
2023-07-27 12:51:28,323 INFO [checkpoint.py:75] Saving checkpoint to tdnn/exp/epoch-14.pt
2023-07-27 12:51:28,326 INFO [train.py:555] Done!
Decoding
~~~~~~~~
Let us use the trained model to decode the test set:
.. code-block::
(test-icefall) kuangfangjun:ASR$ ./tdnn/decode.py
2023-07-27 12:55:12,840 INFO [decode.py:263] Decoding started
2023-07-27 12:55:12,840 INFO [decode.py:264] {'exp_dir': PosixPath('tdnn/exp'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 23, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'epoch': 14, 'avg': 2, 'export': False, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 30.0, 'bucketing_sampler': False, 'num_buckets': 10, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': False, 'return_cuts': True, 'num_workers': 2, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '4c05309499a08454997adf500b56dcc629e35ae5', 'k2-git-date': 'Tue Jul 25 16:23:36 2023', 'lhotse-version': '1.16.0.dev+git.7640d66.clean', 'torch-version': '1.13.0+cu116', 'torch-cuda-available': False, 'torch-cuda-version': '11.6', 'python-version': '3.8', 'icefall-git-branch': 'master', 'icefall-git-sha1': '3fb0a43-clean', 'icefall-git-date': 'Thu Jul 27 12:36:05 2023', 'icefall-path': '/tmp/icefall', 'k2-path': '/star-fj/fangjun/test-icefall/lib/python3.8/site-packages/k2/__init__.py', 'lhotse-path': '/star-fj/fangjun/test-icefall/lib/python3.8/site-packages/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-1-1220091118-57c4d55446-sph26', 'IP address': '10.177.77.20'}}
2023-07-27 12:55:12,841 INFO [lexicon.py:168] Loading pre-compiled data/lang_phone/Linv.pt
2023-07-27 12:55:12,855 INFO [decode.py:273] device: cpu
2023-07-27 12:55:12,868 INFO [decode.py:291] averaging ['tdnn/exp/epoch-13.pt', 'tdnn/exp/epoch-14.pt']
2023-07-27 12:55:12,882 INFO [asr_datamodule.py:218] About to get test cuts
2023-07-27 12:55:12,883 INFO [asr_datamodule.py:252] About to get test cuts
2023-07-27 12:55:13,157 INFO [decode.py:204] batch 0/?, cuts processed until now is 4
2023-07-27 12:55:13,701 INFO [decode.py:241] The transcripts are stored in tdnn/exp/recogs-test_set.txt
2023-07-27 12:55:13,702 INFO [utils.py:564] [test_set] %WER 0.42% [1 / 240, 0 ins, 1 del, 0 sub ]
2023-07-27 12:55:13,704 INFO [decode.py:249] Wrote detailed error stats to tdnn/exp/errs-test_set.txt
2023-07-27 12:55:13,704 INFO [decode.py:316] Done!
**Congratulations!** You have successfully set up the environment and run your first recipe in `icefall`_.
Have fun with ``icefall``!
YouTube Video
-------------
We provide the following YouTube video showing how to install `icefall`_.
It also shows how to debug various problems that you may encounter while
using `icefall`_.
.. note::
To get the latest news about `next-gen Kaldi <https://github.com/k2-fsa>`_, please subscribe
to the following YouTube channel by `Nadira Povey <https://www.youtube.com/channel/UC_VaumpkmINz1pNkFXAN9mw>`_:
`<https://www.youtube.com/channel/UC_VaumpkmINz1pNkFXAN9mw>`_
.. youtube:: LVmrBD0tLfE


@ -1,21 +0,0 @@
2023-01-11 12:15:38,677 INFO [export-for-ncnn.py:220] device: cpu
2023-01-11 12:15:38,681 INFO [export-for-ncnn.py:229] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_v
alid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampl
ing_factor': 4, 'decoder_dim': 512, 'joiner_dim': 512, 'model_warm_step': 3000, 'env_info': {'k2-version': '1.23.2', 'k2-build-type':
'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'a34171ed85605b0926eebbd0463d059431f4f74a', 'k2-git-date': 'Wed Dec 14 00:06:38 2022',
'lhotse-version': '1.12.0.dev+missing.version.file', 'torch-version': '1.10.0+cu102', 'torch-cuda-available': False, 'torch-cuda-vers
ion': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'fix-stateless3-train-2022-12-27', 'icefall-git-sha1': '530e8a1-dirty', '
icefall-git-date': 'Tue Dec 27 13:59:18 2022', 'icefall-path': '/star-fj/fangjun/open-source/icefall', 'k2-path': '/star-fj/fangjun/op
en-source/k2/k2/python/k2/__init__.py', 'lhotse-path': '/star-fj/fangjun/open-source/lhotse/lhotse/__init__.py', 'hostname': 'de-74279
-k2-train-3-1220120619-7695ff496b-s9n4w', 'IP address': '127.0.0.1'}, 'epoch': 30, 'iter': 0, 'avg': 1, 'exp_dir': PosixPath('icefa
ll-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp'), 'bpe_model': './icefall-asr-librispeech-conv-emformer-transdu
cer-stateless2-2022-07-05//data/lang_bpe_500/bpe.model', 'jit': False, 'context_size': 2, 'use_averaged_model': False, 'encoder_dim':
512, 'nhead': 8, 'dim_feedforward': 2048, 'num_encoder_layers': 12, 'cnn_module_kernel': 31, 'left_context_length': 32, 'chunk_length'
: 32, 'right_context_length': 8, 'memory_size': 32, 'blank_id': 0, 'vocab_size': 500}
2023-01-11 12:15:38,681 INFO [export-for-ncnn.py:231] About to create model
2023-01-11 12:15:40,053 INFO [checkpoint.py:112] Loading checkpoint from icefall-asr-librispeech-conv-emformer-transducer-stateless2-2
022-07-05/exp/epoch-30.pt
2023-01-11 12:15:40,708 INFO [export-for-ncnn.py:315] Number of model parameters: 75490012
2023-01-11 12:15:41,681 INFO [export-for-ncnn.py:318] Using torch.jit.trace()
2023-01-11 12:15:41,681 INFO [export-for-ncnn.py:320] Exporting encoder
2023-01-11 12:15:41,682 INFO [export-for-ncnn.py:149] chunk_length: 32, right_context_length: 8


@ -1,18 +0,0 @@
2023-02-17 11:22:42,862 INFO [export-for-ncnn.py:222] device: cpu
2023-02-17 11:22:42,865 INFO [export-for-ncnn.py:231] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'dim_feedforward': 2048, 'decoder_dim': 512, 'joiner_dim': 512, 'is_pnnx': False, 'model_warm_step': 3000, 'env_info': {'k2-version': '1.23.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '62e404dd3f3a811d73e424199b3408e309c06e1a', 'k2-git-date': 'Mon Jan 30 10:26:16 2023', 'lhotse-version': '1.12.0.dev+missing.version.file', 'torch-version': '1.10.0+cu102', 'torch-cuda-available': False, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'master', 'icefall-git-sha1': '6d7a559-dirty', 'icefall-git-date': 'Thu Feb 16 19:47:54 2023', 'icefall-path': '/star-fj/fangjun/open-source/icefall-2', 'k2-path': '/star-fj/fangjun/open-source/k2/k2/python/k2/__init__.py', 'lhotse-path': '/star-fj/fangjun/open-source/lhotse/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-3-1220120619-7695ff496b-s9n4w', 'IP address': '10.177.6.147'}, 'epoch': 99, 'iter': 0, 'avg': 1, 'exp_dir': PosixPath('icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp'), 'bpe_model': './icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/data/lang_bpe_500/bpe.model', 'context_size': 2, 'use_averaged_model': False, 'num_encoder_layers': 12, 'encoder_dim': 512, 'rnn_hidden_size': 1024, 'aux_layer_period': 0, 'blank_id': 0, 'vocab_size': 500}
2023-02-17 11:22:42,865 INFO [export-for-ncnn.py:235] About to create model
2023-02-17 11:22:43,239 INFO [train.py:472] Disable giga
2023-02-17 11:22:43,249 INFO [checkpoint.py:112] Loading checkpoint from icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/epoch-99.pt
2023-02-17 11:22:44,595 INFO [export-for-ncnn.py:324] encoder parameters: 83137520
2023-02-17 11:22:44,596 INFO [export-for-ncnn.py:325] decoder parameters: 257024
2023-02-17 11:22:44,596 INFO [export-for-ncnn.py:326] joiner parameters: 781812
2023-02-17 11:22:44,596 INFO [export-for-ncnn.py:327] total parameters: 84176356
2023-02-17 11:22:44,596 INFO [export-for-ncnn.py:329] Using torch.jit.trace()
2023-02-17 11:22:44,596 INFO [export-for-ncnn.py:331] Exporting encoder
2023-02-17 11:22:48,182 INFO [export-for-ncnn.py:158] Saved to icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/encoder_jit_trace-pnnx.pt
2023-02-17 11:22:48,183 INFO [export-for-ncnn.py:335] Exporting decoder
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/lstm_transducer_stateless2/decoder.py:101: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
need_pad = bool(need_pad)
2023-02-17 11:22:48,259 INFO [export-for-ncnn.py:180] Saved to icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/decoder_jit_trace-pnnx.pt
2023-02-17 11:22:48,259 INFO [export-for-ncnn.py:339] Exporting joiner
2023-02-17 11:22:48,304 INFO [export-for-ncnn.py:207] Saved to icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/joiner_jit_trace-pnnx.pt


@ -1,21 +0,0 @@
2022-10-13 19:09:02,233 INFO [pretrained.py:265] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'encoder_dim': 512, 'nhead': 8, 'dim_feedforward': 2048, 'num_encoder_layers': 12, 'decoder_dim': 512, 'joiner_dim': 512, 'model_warm_step': 3000, 'env_info': {'k2-version': '1.21', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '4810e00d8738f1a21278b0156a42ff396a2d40ac', 'k2-git-date': 'Fri Oct 7 19:35:03 2022', 'lhotse-version': '1.3.0.dev+missing.version.file', 'torch-version': '1.10.0+cu102', 'torch-cuda-available': False, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'onnx-doc-1013', 'icefall-git-sha1': 'c39cba5-dirty', 'icefall-git-date': 'Thu Oct 13 15:17:20 2022', 'icefall-path': '/k2-dev/fangjun/open-source/icefall-master', 'k2-path': '/k2-dev/fangjun/open-source/k2-master/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-fj/fangjun/open-source-2/lhotse-jsonl/lhotse/__init__.py', 'hostname': 'de-74279-k2-test-4-0324160024-65bfd8b584-jjlbn', 'IP address': '10.177.74.203'}, 'checkpoint': './icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/exp/pretrained-iter-1224000-avg-14.pt', 'bpe_model': './icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/data/lang_bpe_500/bpe.model', 'method': 'greedy_search', 'sound_files': ['./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1089-134686-0001.wav', './icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1221-135766-0001.wav', './icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1221-135766-0002.wav'], 'sample_rate': 16000, 'beam_size': 4, 'beam': 4, 'max_contexts': 4, 'max_states': 8, 'context_size': 2, 'max_sym_per_frame': 1, 'simulate_streaming': False, 'decode_chunk_size': 16, 'left_context': 64, 'dynamic_chunk_training': False, 'causal_convolution': False, 'short_chunk_size': 25, 'num_left_chunks': 4, 'blank_id': 0, 'unk_id': 2, 'vocab_size': 500}
2022-10-13 19:09:02,233 INFO [pretrained.py:271] device: cpu
2022-10-13 19:09:02,233 INFO [pretrained.py:273] Creating model
2022-10-13 19:09:02,612 INFO [train.py:458] Disable giga
2022-10-13 19:09:02,623 INFO [pretrained.py:277] Number of model parameters: 78648040
2022-10-13 19:09:02,951 INFO [pretrained.py:285] Constructing Fbank computer
2022-10-13 19:09:02,952 INFO [pretrained.py:295] Reading sound files: ['./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1089-134686-0001.wav', './icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1221-135766-0001.wav', './icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1221-135766-0002.wav']
2022-10-13 19:09:02,957 INFO [pretrained.py:301] Decoding started
2022-10-13 19:09:06,700 INFO [pretrained.py:329] Using greedy_search
2022-10-13 19:09:06,912 INFO [pretrained.py:388]
./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1089-134686-0001.wav:
AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1221-135766-0001.wav:
GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1221-135766-0002.wav:
YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION
2022-10-13 19:09:06,912 INFO [pretrained.py:390] Decoding Done


@ -1,74 +0,0 @@
2023-02-27 20:23:07,473 INFO [export-for-ncnn.py:246] device: cpu
2023-02-27 20:23:07,477 INFO [export-for-ncnn.py:255] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.23.4', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '62e404dd3f3a811d73e424199b3408e309c06e1a', 'k2-git-date': 'Mon Jan 30 10:26:16 2023', 'lhotse-version': '1.12.0.dev+missing.version.file', 'torch-version': '1.10.0+cu102', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'master', 'icefall-git-sha1': '6d7a559-clean', 'icefall-git-date': 'Thu Feb 16 19:47:54 2023', 'icefall-path': '/star-fj/fangjun/open-source/icefall-2', 'k2-path': '/star-fj/fangjun/open-source/k2/k2/python/k2/__init__.py', 'lhotse-path': '/star-fj/fangjun/open-source/lhotse/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-3-1220120619-7695ff496b-s9n4w', 'IP address': '10.177.6.147'}, 'epoch': 99, 'iter': 0, 'avg': 1, 'exp_dir': PosixPath('icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp'), 'bpe_model': './icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/bpe.model', 'context_size': 2, 'use_averaged_model': False, 'num_encoder_layers': '2,4,3,2,4', 'feedforward_dims': '1024,1024,2048,2048,1024', 'nhead': '8,8,8,8,8', 'encoder_dims': '384,384,384,384,384', 'attention_dims': '192,192,192,192,192', 'encoder_unmasked_dims': '256,256,256,256,256', 'zipformer_downsampling_factors': '1,2,4,8,2', 'cnn_module_kernels': '31,31,31,31,31', 'decoder_dim': 512, 'joiner_dim': 512, 'short_chunk_size': 50, 'num_left_chunks': 4, 'decode_chunk_len': 32, 'blank_id': 0, 'vocab_size': 500}
2023-02-27 20:23:07,477 INFO [export-for-ncnn.py:257] About to create model
2023-02-27 20:23:08,023 INFO [zipformer2.py:419] At encoder stack 4, which has downsampling_factor=2, we will combine the outputs of layers 1 and 3, with downsampling_factors=2 and 8.
2023-02-27 20:23:08,037 INFO [checkpoint.py:112] Loading checkpoint from icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/epoch-99.pt
2023-02-27 20:23:08,655 INFO [export-for-ncnn.py:346] encoder parameters: 68944004
2023-02-27 20:23:08,655 INFO [export-for-ncnn.py:347] decoder parameters: 260096
2023-02-27 20:23:08,655 INFO [export-for-ncnn.py:348] joiner parameters: 716276
2023-02-27 20:23:08,656 INFO [export-for-ncnn.py:349] total parameters: 69920376
2023-02-27 20:23:08,656 INFO [export-for-ncnn.py:351] Using torch.jit.trace()
2023-02-27 20:23:08,656 INFO [export-for-ncnn.py:353] Exporting encoder
2023-02-27 20:23:08,656 INFO [export-for-ncnn.py:174] decode_chunk_len: 32
2023-02-27 20:23:08,656 INFO [export-for-ncnn.py:175] T: 39
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/zipformer2.py:1344: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert cached_len.size(0) == self.num_layers, (
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/zipformer2.py:1348: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert cached_avg.size(0) == self.num_layers, (
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/zipformer2.py:1352: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert cached_key.size(0) == self.num_layers, (
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/zipformer2.py:1356: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert cached_val.size(0) == self.num_layers, (
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/zipformer2.py:1360: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert cached_val2.size(0) == self.num_layers, (
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/zipformer2.py:1364: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert cached_conv1.size(0) == self.num_layers, (
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/zipformer2.py:1368: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert cached_conv2.size(0) == self.num_layers, (
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/zipformer2.py:1373: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert self.left_context_len == cached_key.shape[1], (
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/zipformer2.py:1884: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert self.x_size == x.size(0), (self.x_size, x.size(0))
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/zipformer2.py:2442: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert cached_key.shape[0] == self.left_context_len, (
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/zipformer2.py:2449: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert cached_key.shape[0] == cached_val.shape[0], (
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/zipformer2.py:2469: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert cached_key.shape[0] == left_context_len, (
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/zipformer2.py:2473: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert cached_val.shape[0] == left_context_len, (
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/zipformer2.py:2483: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert kv_len == k.shape[0], (kv_len, k.shape)
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/zipformer2.py:2570: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert list(attn_output.size()) == [bsz * num_heads, seq_len, head_dim // 2]
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/zipformer2.py:2926: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert cache.shape == (x.size(0), x.size(1), self.lorder), (
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/zipformer2.py:2652: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert x.shape[0] == self.x_size, (x.shape[0], self.x_size)
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/zipformer2.py:2653: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert x.shape[2] == self.embed_dim, (x.shape[2], self.embed_dim)
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/zipformer2.py:2666: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert cached_val.shape[0] == self.left_context_len, (
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/zipformer2.py:1543: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert src.shape[0] == self.in_x_size, (src.shape[0], self.in_x_size)
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/zipformer2.py:1637: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert src.shape[0] == self.in_x_size, (
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/zipformer2.py:1643: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert src.shape[2] == self.in_channels, (src.shape[2], self.in_channels)
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/zipformer2.py:1571: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
if src.shape[0] != self.in_x_size:
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/zipformer2.py:1763: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert src1.shape[:-1] == src2.shape[:-1], (src1.shape, src2.shape)
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/zipformer2.py:1779: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert src1.shape[-1] == self.dim1, (src1.shape[-1], self.dim1)
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/zipformer2.py:1780: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert src2.shape[-1] == self.dim2, (src2.shape[-1], self.dim2)
/star-fj/fangjun/py38/lib/python3.8/site-packages/torch/jit/_trace.py:958: TracerWarning: Encountering a list at the output of the tracer might cause the trace to be incorrect, this is only valid if the container structure does not change based on the module's inputs. Consider using a constant container instead (e.g. for `list`, use a `tuple` instead. for `dict`, use a `NamedTuple` instead). If you absolutely need this and know the side effects, pass strict=False to trace() to allow this behavior.
module._c._create_method_from_trace(
2023-02-27 20:23:19,640 INFO [export-for-ncnn.py:182] Saved to icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/encoder_jit_trace-pnnx.pt
2023-02-27 20:23:19,646 INFO [export-for-ncnn.py:357] Exporting decoder
/star-fj/fangjun/open-source/icefall-2/egs/librispeech/ASR/pruned_transducer_stateless7_streaming/decoder.py:102: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert embedding_out.size(-1) == self.context_size
2023-02-27 20:23:19,686 INFO [export-for-ncnn.py:204] Saved to icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/decoder_jit_trace-pnnx.pt
2023-02-27 20:23:19,686 INFO [export-for-ncnn.py:361] Exporting joiner
2023-02-27 20:23:19,735 INFO [export-for-ncnn.py:231] Saved to icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/joiner_jit_trace-pnnx.pt


@ -1,104 +0,0 @@
Don't Use GPU. has_gpu: 0, config.use_vulkan_compute: 1
num encoder conv layers: 88
num joiner conv layers: 3
num files: 3
Processing ../test_wavs/1089-134686-0001.wav
Processing ../test_wavs/1221-135766-0001.wav
Processing ../test_wavs/1221-135766-0002.wav
Processing ../test_wavs/1089-134686-0001.wav
Processing ../test_wavs/1221-135766-0001.wav
Processing ../test_wavs/1221-135766-0002.wav
----------encoder----------
conv_87 : max = 15.942385 threshold = 15.938493 scale = 7.968131
conv_88 : max = 35.442448 threshold = 15.549335 scale = 8.167552
conv_89 : max = 23.228289 threshold = 8.001738 scale = 15.871552
linear_90 : max = 3.976146 threshold = 1.101789 scale = 115.267128
linear_91 : max = 6.962030 threshold = 5.162033 scale = 24.602713
linear_92 : max = 12.323041 threshold = 3.853959 scale = 32.953129
linear_94 : max = 6.905416 threshold = 4.648006 scale = 27.323545
linear_93 : max = 6.905416 threshold = 5.474093 scale = 23.200188
linear_95 : max = 1.888012 threshold = 1.403563 scale = 90.483986
linear_96 : max = 6.856741 threshold = 5.398679 scale = 23.524273
linear_97 : max = 9.635942 threshold = 2.613655 scale = 48.590950
linear_98 : max = 6.460340 threshold = 5.670146 scale = 22.398010
linear_99 : max = 9.532276 threshold = 2.585537 scale = 49.119396
linear_101 : max = 6.585871 threshold = 5.719224 scale = 22.205809
linear_100 : max = 6.585871 threshold = 5.751382 scale = 22.081648
linear_102 : max = 1.593344 threshold = 1.450581 scale = 87.551147
linear_103 : max = 6.592681 threshold = 5.705824 scale = 22.257959
linear_104 : max = 8.752957 threshold = 1.980955 scale = 64.110489
linear_105 : max = 6.696240 threshold = 5.877193 scale = 21.608953
linear_106 : max = 9.059659 threshold = 2.643138 scale = 48.048950
linear_108 : max = 6.975461 threshold = 4.589567 scale = 27.671457
linear_107 : max = 6.975461 threshold = 6.190381 scale = 20.515701
linear_109 : max = 3.710759 threshold = 2.305635 scale = 55.082436
linear_110 : max = 7.531228 threshold = 5.731162 scale = 22.159557
linear_111 : max = 10.528083 threshold = 2.259322 scale = 56.211544
linear_112 : max = 8.148807 threshold = 5.500842 scale = 23.087374
linear_113 : max = 8.592566 threshold = 1.948851 scale = 65.166611
linear_115 : max = 8.437109 threshold = 5.608947 scale = 22.642395
linear_114 : max = 8.437109 threshold = 6.193942 scale = 20.503904
linear_116 : max = 3.966980 threshold = 3.200896 scale = 39.676392
linear_117 : max = 9.451303 threshold = 6.061664 scale = 20.951344
linear_118 : max = 12.077262 threshold = 3.965800 scale = 32.023804
linear_119 : max = 9.671615 threshold = 4.847613 scale = 26.198460
linear_120 : max = 8.625638 threshold = 3.131427 scale = 40.556595
linear_122 : max = 10.274080 threshold = 4.888716 scale = 25.978189
linear_121 : max = 10.274080 threshold = 5.420480 scale = 23.429659
linear_123 : max = 4.826197 threshold = 3.599617 scale = 35.281532
linear_124 : max = 11.396383 threshold = 7.325849 scale = 17.335875
linear_125 : max = 9.337198 threshold = 3.941410 scale = 32.221970
linear_126 : max = 9.699965 threshold = 4.842878 scale = 26.224073
linear_127 : max = 8.775370 threshold = 3.884215 scale = 32.696438
linear_129 : max = 9.872276 threshold = 4.837319 scale = 26.254213
linear_128 : max = 9.872276 threshold = 7.180057 scale = 17.687883
linear_130 : max = 4.150427 threshold = 3.454298 scale = 36.765789
linear_131 : max = 11.112692 threshold = 7.924847 scale = 16.025545
linear_132 : max = 11.852893 threshold = 3.116593 scale = 40.749626
linear_133 : max = 11.517084 threshold = 5.024665 scale = 25.275314
linear_134 : max = 10.683807 threshold = 3.878618 scale = 32.743618
linear_136 : max = 12.421055 threshold = 6.322729 scale = 20.086264
linear_135 : max = 12.421055 threshold = 5.309880 scale = 23.917679
linear_137 : max = 4.827781 threshold = 3.744595 scale = 33.915554
linear_138 : max = 14.422395 threshold = 7.742882 scale = 16.402161
linear_139 : max = 8.527538 threshold = 3.866123 scale = 32.849449
linear_140 : max = 12.128619 threshold = 4.657793 scale = 27.266134
linear_141 : max = 9.839593 threshold = 3.845993 scale = 33.021378
linear_143 : max = 12.442304 threshold = 7.099039 scale = 17.889746
linear_142 : max = 12.442304 threshold = 5.325038 scale = 23.849592
linear_144 : max = 5.929444 threshold = 5.618206 scale = 22.605080
linear_145 : max = 13.382126 threshold = 9.321095 scale = 13.625010
linear_146 : max = 9.894987 threshold = 3.867645 scale = 32.836517
linear_147 : max = 10.915313 threshold = 4.906028 scale = 25.886522
linear_148 : max = 9.614287 threshold = 3.908151 scale = 32.496181
linear_150 : max = 11.724932 threshold = 4.485588 scale = 28.312899
linear_149 : max = 11.724932 threshold = 5.161146 scale = 24.606939
linear_151 : max = 7.164453 threshold = 5.847355 scale = 21.719223
linear_152 : max = 13.086471 threshold = 5.984121 scale = 21.222834
linear_153 : max = 11.099524 threshold = 3.991601 scale = 31.816805
linear_154 : max = 10.054585 threshold = 4.489706 scale = 28.286930
linear_155 : max = 12.389185 threshold = 3.100321 scale = 40.963501
linear_157 : max = 9.982999 threshold = 5.154796 scale = 24.637253
linear_156 : max = 9.982999 threshold = 8.537706 scale = 14.875190
linear_158 : max = 8.420287 threshold = 6.502287 scale = 19.531588
linear_159 : max = 25.014746 threshold = 9.423280 scale = 13.477261
linear_160 : max = 45.633553 threshold = 5.715335 scale = 22.220921
linear_161 : max = 20.371849 threshold = 5.117830 scale = 24.815203
linear_162 : max = 12.492933 threshold = 3.126283 scale = 40.623318
linear_164 : max = 20.697504 threshold = 4.825712 scale = 26.317358
linear_163 : max = 20.697504 threshold = 5.078367 scale = 25.008038
linear_165 : max = 9.023975 threshold = 6.836278 scale = 18.577358
linear_166 : max = 34.860619 threshold = 7.259792 scale = 17.493614
linear_167 : max = 30.380934 threshold = 5.496160 scale = 23.107042
linear_168 : max = 20.691216 threshold = 4.733317 scale = 26.831076
linear_169 : max = 9.723948 threshold = 3.952728 scale = 32.129707
linear_171 : max = 21.034811 threshold = 5.366547 scale = 23.665123
linear_170 : max = 21.034811 threshold = 5.356277 scale = 23.710501
linear_172 : max = 10.556884 threshold = 5.729481 scale = 22.166058
linear_173 : max = 20.033039 threshold = 10.207264 scale = 12.442120
linear_174 : max = 11.597379 threshold = 2.658676 scale = 47.768131
----------joiner----------
linear_2 : max = 19.293503 threshold = 14.305265 scale = 8.877850
linear_1 : max = 10.812222 threshold = 8.766452 scale = 14.487047
linear_3 : max = 0.999999 threshold = 0.999755 scale = 127.031174
ncnn int8 calibration table create success, best wish for your int8 inference has a low accuracy loss...\(^0^)/...233...


@ -1,44 +0,0 @@
Don't Use GPU. has_gpu: 0, config.use_vulkan_compute: 1
num encoder conv layers: 28
num joiner conv layers: 3
num files: 3
Processing ../test_wavs/1089-134686-0001.wav
Processing ../test_wavs/1221-135766-0001.wav
Processing ../test_wavs/1221-135766-0002.wav
Processing ../test_wavs/1089-134686-0001.wav
Processing ../test_wavs/1221-135766-0001.wav
Processing ../test_wavs/1221-135766-0002.wav
----------encoder----------
conv_15 : max = 15.942385 threshold = 15.930708 scale = 7.972025
conv_16 : max = 44.978855 threshold = 17.031788 scale = 7.456645
conv_17 : max = 17.868437 threshold = 7.830528 scale = 16.218575
linear_18 : max = 3.107259 threshold = 1.194808 scale = 106.293236
linear_19 : max = 6.193777 threshold = 4.634748 scale = 27.401705
linear_20 : max = 9.259933 threshold = 2.606617 scale = 48.722160
linear_21 : max = 5.186600 threshold = 4.790260 scale = 26.512129
linear_22 : max = 9.759041 threshold = 2.265832 scale = 56.050053
linear_23 : max = 3.931209 threshold = 3.099090 scale = 40.979767
linear_24 : max = 10.324160 threshold = 2.215561 scale = 57.321835
linear_25 : max = 3.800708 threshold = 3.599352 scale = 35.284134
linear_26 : max = 10.492444 threshold = 3.153369 scale = 40.274391
linear_27 : max = 3.660161 threshold = 2.720994 scale = 46.674126
linear_28 : max = 9.415265 threshold = 3.174434 scale = 40.007133
linear_29 : max = 4.038418 threshold = 3.118534 scale = 40.724262
linear_30 : max = 10.072084 threshold = 3.936867 scale = 32.259155
linear_31 : max = 4.342712 threshold = 3.599489 scale = 35.282787
linear_32 : max = 11.340535 threshold = 3.120308 scale = 40.701103
linear_33 : max = 3.846987 threshold = 3.630030 scale = 34.985939
linear_34 : max = 10.686298 threshold = 2.204571 scale = 57.607586
linear_35 : max = 4.904821 threshold = 4.575518 scale = 27.756420
linear_36 : max = 11.806659 threshold = 2.585589 scale = 49.118401
linear_37 : max = 6.402340 threshold = 5.047157 scale = 25.162680
linear_38 : max = 11.174589 threshold = 1.923361 scale = 66.030258
linear_39 : max = 16.178576 threshold = 7.556058 scale = 16.807705
linear_40 : max = 12.901954 threshold = 5.301267 scale = 23.956539
linear_41 : max = 14.839805 threshold = 7.597429 scale = 16.716181
linear_42 : max = 10.178945 threshold = 2.651595 scale = 47.895699
----------joiner----------
linear_2 : max = 24.829245 threshold = 16.627592 scale = 7.637907
linear_1 : max = 10.746186 threshold = 5.255032 scale = 24.167313
linear_3 : max = 1.000000 threshold = 0.999756 scale = 127.031013
ncnn int8 calibration table create success, best wish for your int8 inference has a low accuracy loss...\(^0^)/...233...


@ -1,7 +0,0 @@
2023-01-11 14:02:12,216 INFO [streaming-ncnn-decode.py:320] {'tokens': './icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/data/lang_bpe_500/tokens.txt', 'encoder_param_filename': './icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/encoder_jit_trace-pnnx.ncnn.param', 'encoder_bin_filename': './icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/encoder_jit_trace-pnnx.ncnn.bin', 'decoder_param_filename': './icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/decoder_jit_trace-pnnx.ncnn.param', 'decoder_bin_filename': './icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/decoder_jit_trace-pnnx.ncnn.bin', 'joiner_param_filename': './icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/joiner_jit_trace-pnnx.ncnn.param', 'joiner_bin_filename': './icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/joiner_jit_trace-pnnx.ncnn.bin', 'sound_filename': './icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/test_wavs/1089-134686-0001.wav'}
T 51 32
2023-01-11 14:02:13,141 INFO [streaming-ncnn-decode.py:328] Constructing Fbank computer
2023-01-11 14:02:13,151 INFO [streaming-ncnn-decode.py:331] Reading sound files: ./icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/test_wavs/1089-134686-0001.wav
2023-01-11 14:02:13,176 INFO [streaming-ncnn-decode.py:336] torch.Size([106000])
2023-01-11 14:02:17,581 INFO [streaming-ncnn-decode.py:380] ./icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/test_wavs/1089-134686-0001.wav
2023-01-11 14:02:17,581 INFO [streaming-ncnn-decode.py:381] AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS


@ -1,6 +0,0 @@
2023-02-17 11:37:30,861 INFO [streaming-ncnn-decode.py:255] {'tokens': './icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/data/lang_bpe_500/tokens.txt', 'encoder_param_filename': './icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/encoder_jit_trace-pnnx.ncnn.param', 'encoder_bin_filename': './icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/encoder_jit_trace-pnnx.ncnn.bin', 'decoder_param_filename': './icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/decoder_jit_trace-pnnx.ncnn.param', 'decoder_bin_filename': './icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/decoder_jit_trace-pnnx.ncnn.bin', 'joiner_param_filename': './icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/joiner_jit_trace-pnnx.ncnn.param', 'joiner_bin_filename': './icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/joiner_jit_trace-pnnx.ncnn.bin', 'sound_filename': './icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/test_wavs/1089-134686-0001.wav'}
2023-02-17 11:37:31,425 INFO [streaming-ncnn-decode.py:263] Constructing Fbank computer
2023-02-17 11:37:31,427 INFO [streaming-ncnn-decode.py:266] Reading sound files: ./icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/test_wavs/1089-134686-0001.wav
2023-02-17 11:37:31,431 INFO [streaming-ncnn-decode.py:271] torch.Size([106000])
2023-02-17 11:37:34,115 INFO [streaming-ncnn-decode.py:342] ./icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/test_wavs/1089-134686-0001.wav
2023-02-17 11:37:34,115 INFO [streaming-ncnn-decode.py:343] AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS


@ -1,7 +0,0 @@
2023-02-27 20:43:40,283 INFO [streaming-ncnn-decode.py:349] {'tokens': './icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/tokens.txt', 'encoder_param_filename': './icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/encoder_jit_trace-pnnx.ncnn.param', 'encoder_bin_filename': './icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/encoder_jit_trace-pnnx.ncnn.bin', 'decoder_param_filename': './icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/decoder_jit_trace-pnnx.ncnn.param', 'decoder_bin_filename': './icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/decoder_jit_trace-pnnx.ncnn.bin', 'joiner_param_filename': './icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/joiner_jit_trace-pnnx.ncnn.param', 'joiner_bin_filename': './icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/joiner_jit_trace-pnnx.ncnn.bin', 'sound_filename': './icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/test_wavs/1089-134686-0001.wav'}
2023-02-27 20:43:41,260 INFO [streaming-ncnn-decode.py:357] Constructing Fbank computer
2023-02-27 20:43:41,264 INFO [streaming-ncnn-decode.py:360] Reading sound files: ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/test_wavs/1089-134686-0001.wav
2023-02-27 20:43:41,269 INFO [streaming-ncnn-decode.py:365] torch.Size([106000])
2023-02-27 20:43:41,280 INFO [streaming-ncnn-decode.py:372] number of states: 35
2023-02-27 20:43:45,026 INFO [streaming-ncnn-decode.py:410] ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/test_wavs/1089-134686-0001.wav
2023-02-27 20:43:45,026 INFO [streaming-ncnn-decode.py:411] AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS


@ -1,135 +0,0 @@
Export model.state_dict()
=========================
When to use it
--------------
During model training, we save checkpoints periodically to disk.
A checkpoint contains the following information:
- ``model.state_dict()``
- ``optimizer.state_dict()``
- and some other information related to training
When we need to resume the training process from some point, we need a checkpoint.
However, if we want to publish the model for inference, then only
``model.state_dict()`` is needed. In this case, we need to strip all other information
except ``model.state_dict()`` to reduce the file size of the published model.
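As a rough illustration (not the recipe's actual ``export.py``; the file names
are placeholders, and we assume the checkpoint stores its weights under the
``"model"`` key, as the generated ``pretrained.pt`` described below does),
stripping a checkpoint down to ``model.state_dict()`` can look like this:
.. code-block:: python
import torch
# Load a full training checkpoint: model weights, optimizer state, etc.
checkpoint = torch.load("epoch-20.pt", map_location="cpu")
# Keep only the model weights; this is what gets published.
torch.save({"model": checkpoint["model"]}, "pretrained.pt")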
How to export
-------------
Every recipe contains a file ``export.py`` that you can use to
export ``model.state_dict()`` by taking some checkpoints as inputs.
.. hint::
Each ``export.py`` contains well-documented usage information.
In the following, we use
`<https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless3/export.py>`_
as an example.
.. note::
The steps for other recipes are almost the same.
.. code-block:: bash
cd egs/librispeech/ASR
./pruned_transducer_stateless3/export.py \
--exp-dir ./pruned_transducer_stateless3/exp \
--tokens data/lang_bpe_500/tokens.txt \
--epoch 20 \
--avg 10
will generate a file ``pruned_transducer_stateless3/exp/pretrained.pt``, which
is a dict containing ``{"model": model.state_dict()}`` saved by ``torch.save()``.
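Conversely, a minimal sketch of how such a file is loaded back, where
``build_model()`` is a hypothetical stand-in for the recipe's model
constructor:
.. code-block:: python
import torch
model = build_model()  # hypothetical: construct the model as in train.py
checkpoint = torch.load("pretrained.pt", map_location="cpu")
model.load_state_dict(checkpoint["model"])
model.eval()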
How to use the exported model
-----------------------------
For each recipe, we provide pretrained models hosted on huggingface.
You can find links to pretrained models in ``RESULTS.md`` of each dataset.
In the following, we demonstrate how to use the pretrained model from
`<https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13>`_.
.. code-block:: bash
cd egs/librispeech/ASR
git lfs install
git clone https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13
After cloning the repo with ``git lfs``, you will find several files in the folder
``icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/exp``
that have a prefix ``pretrained-``. Those files contain ``model.state_dict()``
exported by the above ``export.py``.
In each recipe, there is also a file ``pretrained.py``, which can use
``pretrained-xxx.pt`` to decode waves. The following is an example:
.. code-block:: bash
cd egs/librispeech/ASR
./pruned_transducer_stateless3/pretrained.py \
--checkpoint ./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/exp/pretrained-iter-1224000-avg-14.pt \
--tokens ./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/data/lang_bpe_500/tokens.txt \
--method greedy_search \
./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1089-134686-0001.wav \
./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1221-135766-0001.wav \
./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/test_wavs/1221-135766-0002.wav
The above commands show how to use the exported model with ``pretrained.py`` to
decode multiple sound files. Its output is given as follows for reference:
.. literalinclude:: ./code/export-model-state-dict-pretrained-out.txt
Use the exported model to run decode.py
---------------------------------------
When we publish the model, we always note down its WERs on some test
dataset in ``RESULTS.md``. This section describes how to use the
pretrained model to reproduce the WER.
.. code-block:: bash
cd egs/librispeech/ASR
git lfs install
git clone https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13
cd icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/exp
ln -s pretrained-iter-1224000-avg-14.pt epoch-9999.pt
cd ../..
We create a symlink with name ``epoch-9999.pt`` to ``pretrained-iter-1224000-avg-14.pt``,
so that we can pass ``--epoch 9999 --avg 1`` to ``decode.py`` in the following
command:
.. code-block:: bash
./pruned_transducer_stateless3/decode.py \
--epoch 9999 \
--avg 1 \
--exp-dir ./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/exp \
--lang-dir ./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/data/lang_bpe_500 \
--max-duration 600 \
--decoding-method greedy_search
You will find the decoding results in
``./icefall-asr-librispeech-pruned-transducer-stateless3-2022-05-13/exp/greedy_search``.
.. caution::
For some recipes, you also need to pass ``--use-averaged-model False``
to ``decode.py``. The reason is that the exported pretrained model is already
the averaged one.
.. hint::
Before running ``decode.py``, we assume that you have already run
``prepare.sh`` to prepare the test dataset.


@ -1,752 +0,0 @@
.. _export_conv_emformer_transducer_models_to_ncnn:
Export ConvEmformer transducer models to ncnn
=============================================
We use the pre-trained model from the following repository as an example:
- `<https://huggingface.co/Zengwei/icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05>`_
We will show you step by step how to export it to `ncnn`_ and run it with `sherpa-ncnn`_.
.. hint::
We use ``Ubuntu 18.04``, ``torch 1.13``, and ``Python 3.8`` for testing.
.. caution::
``torch > 2.0`` may not work. If you get errors while building pnnx, please switch
to ``torch < 2.0``.
1. Download the pre-trained model
---------------------------------
.. hint::
You can also refer to `<https://k2-fsa.github.io/sherpa/cpp/pretrained_models/online_transducer.html#icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05>`_ to download the pre-trained model.
You have to install `git-lfs`_ before you continue.
.. code-block:: bash
cd egs/librispeech/ASR
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05
cd icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05
git lfs pull --include "exp/pretrained-epoch-30-avg-10-averaged.pt"
git lfs pull --include "data/lang_bpe_500/bpe.model"
cd ..
.. note::
We downloaded ``exp/pretrained-xxx.pt``, not ``exp/cpu-jit_xxx.pt``.
In the above code, we downloaded the pre-trained model into the directory
``egs/librispeech/ASR/icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05``.
.. _export_for_ncnn_install_ncnn_and_pnnx:
2. Install ncnn and pnnx
------------------------
.. code-block:: bash
# We put ncnn into $HOME/open-source/ncnn
# You can change it to anywhere you like
cd $HOME
mkdir -p open-source
cd open-source
git clone https://github.com/csukuangfj/ncnn
cd ncnn
git submodule update --recursive --init
# Note: We don't use "python setup.py install" or "pip install ." here
mkdir -p build-wheel
cd build-wheel
cmake \
-DCMAKE_BUILD_TYPE=Release \
-DNCNN_PYTHON=ON \
-DNCNN_BUILD_BENCHMARK=OFF \
-DNCNN_BUILD_EXAMPLES=OFF \
-DNCNN_BUILD_TOOLS=ON \
..
make -j4
cd ..
# Note: $PWD here is $HOME/open-source/ncnn
export PYTHONPATH=$PWD/python:$PYTHONPATH
export PATH=$PWD/tools/pnnx/build/src:$PATH
export PATH=$PWD/build-wheel/tools/quantize:$PATH
# Now build pnnx
cd tools/pnnx
mkdir build
cd build
cmake ..
make -j4
./src/pnnx
Congratulations! You have successfully installed the following components:
- ``pnnx``, which is an executable located in
``$HOME/open-source/ncnn/tools/pnnx/build/src``. We will use
it to convert models exported by ``torch.jit.trace()``.
- ``ncnn2int8``, which is an executable located in
``$HOME/open-source/ncnn/build-wheel/tools/quantize``. We will use
it to quantize our models to ``int8``.
- ``ncnn.cpython-38-x86_64-linux-gnu.so``, which is a Python module located
in ``$HOME/open-source/ncnn/python/ncnn``.
.. note::
I am using ``Python 3.8``, so it
is ``ncnn.cpython-38-x86_64-linux-gnu.so``. If you use a different
version, say, ``Python 3.9``, the name would be
``ncnn.cpython-39-x86_64-linux-gnu.so``.
Also, if you are not using Linux, the file name would also be different.
But that does not matter. As long as you can compile it, it should work.
We have set up ``PYTHONPATH`` so that you can use ``import ncnn`` in your
Python code. We have also set up ``PATH`` so that you can use
``pnnx`` and ``ncnn2int8`` later in your terminal.
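As a quick sanity check of the ``PYTHONPATH`` setup, you can verify that
Python picks up the ``ncnn`` module you just built rather than a
pip-installed one:
.. code-block:: python
import ncnn
# Should print a path under $HOME/open-source/ncnn/python/ncnn
print(ncnn.__file__)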
.. caution::
Please don't use `<https://github.com/tencent/ncnn>`_.
We have made some modifications to the official `ncnn`_.
We will synchronize `<https://github.com/csukuangfj/ncnn>`_ periodically
with the official one.
3. Export the model via torch.jit.trace()
-----------------------------------------
First, let us rename our pre-trained model:
.. code-block::
cd egs/librispeech/ASR
cd icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp
ln -s pretrained-epoch-30-avg-10-averaged.pt epoch-30.pt
cd ../..
Next, we use the following code to export our model:
.. code-block:: bash
dir=./icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/
./conv_emformer_transducer_stateless2/export-for-ncnn.py \
--exp-dir $dir/exp \
--tokens $dir/data/lang_bpe_500/tokens.txt \
--epoch 30 \
--avg 1 \
--use-averaged-model 0 \
--num-encoder-layers 12 \
--chunk-length 32 \
--cnn-module-kernel 31 \
--left-context-length 32 \
--right-context-length 8 \
--memory-size 32 \
--encoder-dim 512
.. caution::
If your model has different configuration parameters, please change them accordingly.
.. hint::
We have renamed our model to ``epoch-30.pt`` so that we can use ``--epoch 30``.
There is only one pre-trained model, so we use ``--avg 1 --use-averaged-model 0``.
If you have trained a model by yourself and if you have all checkpoints
available, please first use ``decode.py`` to tune ``--epoch --avg``
and select the best combination with ``--use-averaged-model 1``.
.. note::
You will see the following log output:
.. literalinclude:: ./code/export-conv-emformer-transducer-for-ncnn-output.txt
The log shows the model has ``75490012`` parameters, i.e., ``~75 M``.
.. code-block::
ls -lh icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/pretrained-epoch-30-avg-10-averaged.pt
-rw-r--r-- 1 kuangfangjun root 289M Jan 11 12:05 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/pretrained-epoch-30-avg-10-averaged.pt
You can see that the file size of the pre-trained model is ``289 MB``, which
is roughly equal to ``75490012*4/1024/1024 = 287.97 MB``.
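The arithmetic behind that estimate, spelled out (a ``float32`` parameter
occupies 4 bytes; the small remainder is non-parameter data in the file):
.. code-block:: python
num_params = 75490012
size_mb = num_params * 4 / 1024 / 1024  # 4 bytes per float32 parameter
print(f"{size_mb:.2f} MB")  # 287.97 MB, close to the 289 MB on disk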
After running ``conv_emformer_transducer_stateless2/export-for-ncnn.py``,
we will get the following files:
.. code-block:: bash
ls -lh icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/*pnnx*
-rw-r--r-- 1 kuangfangjun root 1010K Jan 11 12:15 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/decoder_jit_trace-pnnx.pt
-rw-r--r-- 1 kuangfangjun root 283M Jan 11 12:15 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/encoder_jit_trace-pnnx.pt
-rw-r--r-- 1 kuangfangjun root 3.0M Jan 11 12:15 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/joiner_jit_trace-pnnx.pt
.. _conv-emformer-step-4-export-torchscript-model-via-pnnx:
4. Export torchscript model via pnnx
------------------------------------
.. hint::
Make sure you have set up the ``PATH`` environment variable. Otherwise,
it will throw an error saying that ``pnnx`` could not be found.
Now, it's time to export our models to `ncnn`_ via ``pnnx``.
.. code-block::
cd icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/
pnnx ./encoder_jit_trace-pnnx.pt
pnnx ./decoder_jit_trace-pnnx.pt
pnnx ./joiner_jit_trace-pnnx.pt
It will generate the following files:
.. code-block:: bash
ls -lh icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/*ncnn*{bin,param}
-rw-r--r-- 1 kuangfangjun root 503K Jan 11 12:38 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/decoder_jit_trace-pnnx.ncnn.bin
-rw-r--r-- 1 kuangfangjun root 437 Jan 11 12:38 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/decoder_jit_trace-pnnx.ncnn.param
-rw-r--r-- 1 kuangfangjun root 142M Jan 11 12:36 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/encoder_jit_trace-pnnx.ncnn.bin
-rw-r--r-- 1 kuangfangjun root 79K Jan 11 12:36 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/encoder_jit_trace-pnnx.ncnn.param
-rw-r--r-- 1 kuangfangjun root 1.5M Jan 11 12:38 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/joiner_jit_trace-pnnx.ncnn.bin
-rw-r--r-- 1 kuangfangjun root 488 Jan 11 12:38 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/joiner_jit_trace-pnnx.ncnn.param
There are two types of files:
- ``param``: It is a text file containing the model architectures. You can
use a text editor to view its content.
- ``bin``: It is a binary file containing the model parameters.
Below we compare the file sizes of the models before and after conversion via ``pnnx``:
.. see https://tableconvert.com/restructuredtext-generator
+----------------------------------+------------+
| File name | File size |
+==================================+============+
| encoder_jit_trace-pnnx.pt | 283 MB |
+----------------------------------+------------+
| decoder_jit_trace-pnnx.pt | 1010 KB |
+----------------------------------+------------+
| joiner_jit_trace-pnnx.pt | 3.0 MB |
+----------------------------------+------------+
| encoder_jit_trace-pnnx.ncnn.bin | 142 MB |
+----------------------------------+------------+
| decoder_jit_trace-pnnx.ncnn.bin | 503 KB |
+----------------------------------+------------+
| joiner_jit_trace-pnnx.ncnn.bin | 1.5 MB |
+----------------------------------+------------+
You can see that the file sizes of the models after conversion are about half
of those before conversion:
- encoder: 283 MB vs 142 MB
- decoder: 1010 KB vs 503 KB
- joiner: 3.0 MB vs 1.5 MB
The reason is that by default ``pnnx`` converts ``float32`` parameters
to ``float16``. A ``float32`` parameter occupies 4 bytes, while a ``float16``
parameter occupies only 2 bytes, so the converted files are about half the size.
.. hint::
If you use ``pnnx ./encoder_jit_trace-pnnx.pt fp16=0``, then ``pnnx``
won't convert ``float32`` to ``float16``.
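The same back-of-the-envelope check confirms the halving for the encoder:
.. code-block:: python
# 283 MB of float32 parameters shrink to roughly half in float16;
# this is consistent with the 142 MB encoder_jit_trace-pnnx.ncnn.bin above.
print(283 / 2)  # 141.5 MB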
5. Test the exported models in icefall
--------------------------------------
.. note::
We assume you have set up the environment variable ``PYTHONPATH`` when
building `ncnn`_.
Now we have successfully converted our pre-trained model to `ncnn`_ format.
The 6 generated files are all we need. You can use the following code to
test the converted models:
.. code-block:: bash
./conv_emformer_transducer_stateless2/streaming-ncnn-decode.py \
--tokens ./icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/data/lang_bpe_500/tokens.txt \
--encoder-param-filename ./icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/encoder_jit_trace-pnnx.ncnn.param \
--encoder-bin-filename ./icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/encoder_jit_trace-pnnx.ncnn.bin \
--decoder-param-filename ./icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/decoder_jit_trace-pnnx.ncnn.param \
--decoder-bin-filename ./icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/decoder_jit_trace-pnnx.ncnn.bin \
--joiner-param-filename ./icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/joiner_jit_trace-pnnx.ncnn.param \
--joiner-bin-filename ./icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/joiner_jit_trace-pnnx.ncnn.bin \
./icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/test_wavs/1089-134686-0001.wav
.. hint::
`ncnn`_ supports only ``batch size == 1``, so ``streaming-ncnn-decode.py`` accepts
only 1 wave file as input.
The output is given below:
.. literalinclude:: ./code/test-streaming-ncnn-decode-conv-emformer-transducer-libri.txt
Congratulations! You have successfully exported a model from PyTorch to `ncnn`_!
.. _conv-emformer-modify-the-exported-encoder-for-sherpa-ncnn:
6. Modify the exported encoder for sherpa-ncnn
----------------------------------------------
In order to use the exported models in `sherpa-ncnn`_, we have to modify
``encoder_jit_trace-pnnx.ncnn.param``.
Let us have a look at the first few lines of ``encoder_jit_trace-pnnx.ncnn.param``:
.. code-block::
7767517
1060 1342
Input in0 0 1 in0
**Explanation** of the above three lines:
1. ``7767517``, it is a magic number and should not be changed.
2. ``1060 1342``, the first number ``1060`` specifies the number of layers
in this file, while ``1342`` specifies the number of intermediate outputs
of this file.
3. ``Input in0 0 1 in0``, ``Input`` is the layer type of this layer; ``in0``
is the layer name of this layer; ``0`` means this layer has no input;
``1`` means this layer has one output; ``in0`` is the output name of
this layer.
We need to add one extra line and also increment the number of layers.
The result looks like the following:
.. code-block:: bash
7767517
1061 1342
SherpaMetaData sherpa_meta_data1 0 0 0=1 1=12 2=32 3=31 4=8 5=32 6=8 7=512
Input in0 0 1 in0
**Explanation**
1. ``7767517``, it is still the same
2. ``1061 1342``, we have added an extra layer, so we need to update ``1060`` to ``1061``.
We don't need to change ``1342`` since the newly added layer has no inputs or outputs.
3. ``SherpaMetaData sherpa_meta_data1 0 0 0=1 1=12 2=32 3=31 4=8 5=32 6=8 7=512``
This line is newly added. Its explanation is given below:
- ``SherpaMetaData`` is the type of this layer. Must be ``SherpaMetaData``.
- ``sherpa_meta_data1`` is the name of this layer. Must be ``sherpa_meta_data1``.
- ``0 0`` means this layer has no inputs or outputs. Must be ``0 0``.
- ``0=1``, 0 is the key and 1 is the value. Must be ``0=1``.
- ``1=12``, 1 is the key and 12 is the value of the
parameter ``--num-encoder-layers`` that you provided when running
``conv_emformer_transducer_stateless2/export-for-ncnn.py``.
- ``2=32``, 2 is the key and 32 is the value of the
parameter ``--memory-size`` that you provided when running
``conv_emformer_transducer_stateless2/export-for-ncnn.py``.
- ``3=31``, 3 is the key and 31 is the value of the
parameter ``--cnn-module-kernel`` that you provided when running
``conv_emformer_transducer_stateless2/export-for-ncnn.py``.
- ``4=8``, 4 is the key and 8 is the value of the
parameter ``--left-context-length`` that you provided when running
``conv_emformer_transducer_stateless2/export-for-ncnn.py``.
- ``5=32``, 5 is the key and 32 is the value of the
parameter ``--chunk-length`` that you provided when running
``conv_emformer_transducer_stateless2/export-for-ncnn.py``.
- ``6=8``, 6 is the key and 8 is the value of the
parameter ``--right-context-length`` that you provided when running
``conv_emformer_transducer_stateless2/export-for-ncnn.py``.
- ``7=512``, 7 is the key and 512 is the value of the
parameter ``--encoder-dim`` that you provided when running
``conv_emformer_transducer_stateless2/export-for-ncnn.py``.
For ease of reference, we list the key-value pairs that you need to add
in the following table. If your model has a different setting, please
change the values for ``SherpaMetaData`` accordingly. Otherwise, you
will be ``SAD``.
+------+-----------------------------+
| key | value |
+======+=============================+
| 0 | 1 (fixed) |
+------+-----------------------------+
| 1 | ``--num-encoder-layers`` |
+------+-----------------------------+
| 2 | ``--memory-size`` |
+------+-----------------------------+
| 3 | ``--cnn-module-kernel`` |
+------+-----------------------------+
| 4 | ``--left-context-length`` |
+------+-----------------------------+
| 5 | ``--chunk-length`` |
+------+-----------------------------+
| 6 | ``--right-context-length`` |
+------+-----------------------------+
| 7 | ``--encoder-dim`` |
+------+-----------------------------+
4. ``Input in0 0 1 in0``. No need to change it.
.. caution::
When you add a new layer ``SherpaMetaData``, please remember to update the
number of layers. In our case, update ``1060`` to ``1061``. Otherwise,
you will be SAD later.
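Both edits (inserting the ``SherpaMetaData`` line and bumping the layer
count) can be scripted. The following is a hedged sketch, not an official
icefall tool; adjust the file name and the key-value pairs to your model:
.. code-block:: python
def add_sherpa_meta_data(param_file: str, meta_line: str) -> None:
    with open(param_file) as f:
        lines = f.readlines()
    assert lines[0].strip() == "7767517", "magic number must stay unchanged"
    num_layers, num_blobs = map(int, lines[1].split())
    # One extra layer; the blob count stays the same because
    # SherpaMetaData has no inputs or outputs.
    lines[1] = f"{num_layers + 1} {num_blobs}\n"
    lines.insert(2, meta_line + "\n")
    with open(param_file, "w") as f:
        f.writelines(lines)
add_sherpa_meta_data(
    "encoder_jit_trace-pnnx.ncnn.param",
    "SherpaMetaData sherpa_meta_data1 0 0 0=1 1=12 2=32 3=31 4=8 5=32 6=8 7=512",
)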
.. hint::
After adding the new layer ``SherpaMetaData``, you cannot use this model
with ``streaming-ncnn-decode.py`` anymore since ``SherpaMetaData`` is
supported only in `sherpa-ncnn`_.
.. hint::
`ncnn`_ is very flexible. You can add new layers to it just by text-editing
the ``param`` file! You don't need to change the ``bin`` file.
Now you can use this model in `sherpa-ncnn`_.
Please refer to the following documentation:
- Linux/macOS/Windows/arm/aarch64: `<https://k2-fsa.github.io/sherpa/ncnn/install/index.html>`_
- ``Android``: `<https://k2-fsa.github.io/sherpa/ncnn/android/index.html>`_
- ``iOS``: `<https://k2-fsa.github.io/sherpa/ncnn/ios/index.html>`_
- Python: `<https://k2-fsa.github.io/sherpa/ncnn/python/index.html>`_
We have a list of pre-trained models that have been exported for `sherpa-ncnn`_:
- `<https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html>`_
You can find more usages there.
7. (Optional) int8 quantization with sherpa-ncnn
------------------------------------------------
This step is optional.
In this step, we describe how to quantize our model with ``int8``.
Change :ref:`conv-emformer-step-4-export-torchscript-model-via-pnnx` to
disable ``fp16`` when using ``pnnx``:
.. code-block::
cd icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/
pnnx ./encoder_jit_trace-pnnx.pt fp16=0
pnnx ./decoder_jit_trace-pnnx.pt
pnnx ./joiner_jit_trace-pnnx.pt fp16=0
.. note::
We add ``fp16=0`` when exporting the encoder and joiner. `ncnn`_ does not
support quantizing the decoder model yet. We will update this documentation
once `ncnn`_ supports it (perhaps in 2023).
It will generate the following files
.. code-block:: bash
ls -lh icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/*_jit_trace-pnnx.ncnn.{param,bin}
-rw-r--r-- 1 kuangfangjun root 503K Jan 11 15:56 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/decoder_jit_trace-pnnx.ncnn.bin
-rw-r--r-- 1 kuangfangjun root 437 Jan 11 15:56 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/decoder_jit_trace-pnnx.ncnn.param
-rw-r--r-- 1 kuangfangjun root 283M Jan 11 15:56 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/encoder_jit_trace-pnnx.ncnn.bin
-rw-r--r-- 1 kuangfangjun root 79K Jan 11 15:56 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/encoder_jit_trace-pnnx.ncnn.param
-rw-r--r-- 1 kuangfangjun root 3.0M Jan 11 15:56 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/joiner_jit_trace-pnnx.ncnn.bin
-rw-r--r-- 1 kuangfangjun root 488 Jan 11 15:56 icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/joiner_jit_trace-pnnx.ncnn.param
Let us compare the file sizes again:
+----------------------------------------+------------+
| File name | File size |
+----------------------------------------+------------+
| encoder_jit_trace-pnnx.pt | 283 MB |
+----------------------------------------+------------+
| decoder_jit_trace-pnnx.pt | 1010 KB |
+----------------------------------------+------------+
| joiner_jit_trace-pnnx.pt | 3.0 MB |
+----------------------------------------+------------+
| encoder_jit_trace-pnnx.ncnn.bin (fp16) | 142 MB |
+----------------------------------------+------------+
| decoder_jit_trace-pnnx.ncnn.bin (fp16) | 503 KB |
+----------------------------------------+------------+
| joiner_jit_trace-pnnx.ncnn.bin (fp16) | 1.5 MB |
+----------------------------------------+------------+
| encoder_jit_trace-pnnx.ncnn.bin (fp32) | 283 MB |
+----------------------------------------+------------+
| joiner_jit_trace-pnnx.ncnn.bin (fp32) | 3.0 MB |
+----------------------------------------+------------+
You can see that the file sizes are doubled when we disable ``fp16``.
.. note::
You can again use ``streaming-ncnn-decode.py`` to test the exported models.
Next, follow :ref:`conv-emformer-modify-the-exported-encoder-for-sherpa-ncnn`
to modify ``encoder_jit_trace-pnnx.ncnn.param``.
Change
.. code-block:: bash
7767517
1060 1342
Input in0 0 1 in0
to
.. code-block:: bash
7767517
1061 1342
SherpaMetaData sherpa_meta_data1 0 0 0=1 1=12 2=32 3=31 4=8 5=32 6=8 7=512
Input in0 0 1 in0
.. caution::
Please follow :ref:`conv-emformer-modify-the-exported-encoder-for-sherpa-ncnn`
to change the values for ``SherpaMetaData`` if your model uses a different setting.
Next, let us compile `sherpa-ncnn`_ since we will quantize our models within
`sherpa-ncnn`_.
.. code-block:: bash
# We will download sherpa-ncnn to $HOME/open-source/
# You can change it to anywhere you like.
cd $HOME
mkdir -p open-source
cd open-source
git clone https://github.com/k2-fsa/sherpa-ncnn
cd sherpa-ncnn
mkdir build
cd build
cmake ..
make -j 4
./bin/generate-int8-scale-table
export PATH=$HOME/open-source/sherpa-ncnn/build/bin:$PATH
The output of the above commands is:
.. code-block:: bash
(py38) kuangfangjun:build$ generate-int8-scale-table
Please provide 10 arg. Currently given: 1
Usage:
generate-int8-scale-table encoder.param encoder.bin decoder.param decoder.bin joiner.param joiner.bin encoder-scale-table.txt joiner-scale-table.txt wave_filenames.txt
Each line in wave_filenames.txt is a path to some 16k Hz mono wave file.
We need to create a file ``wave_filenames.txt``, in which we put the paths of
some calibration wave files. For testing purposes, we use the ``test_wavs``
from the pre-trained model repository `<https://huggingface.co/Zengwei/icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05>`_:
.. code-block:: bash
cd egs/librispeech/ASR
cd icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/
cat <<EOF > wave_filenames.txt
../test_wavs/1089-134686-0001.wav
../test_wavs/1221-135766-0001.wav
../test_wavs/1221-135766-0002.wav
EOF
Now we can calculate the scales needed for quantization with the calibration data:
.. code-block:: bash
cd egs/librispeech/ASR
cd icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/
generate-int8-scale-table \
./encoder_jit_trace-pnnx.ncnn.param \
./encoder_jit_trace-pnnx.ncnn.bin \
./decoder_jit_trace-pnnx.ncnn.param \
./decoder_jit_trace-pnnx.ncnn.bin \
./joiner_jit_trace-pnnx.ncnn.param \
./joiner_jit_trace-pnnx.ncnn.bin \
./encoder-scale-table.txt \
./joiner-scale-table.txt \
./wave_filenames.txt
The output logs are in the following:
.. literalinclude:: ./code/generate-int-8-scale-table-for-conv-emformer.txt
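In the log, each layer's ``scale`` is derived from its activation
``threshold``; a minimal sketch of the relationship visible in the table
(int8's symmetric range is ``[-127, 127]``):
.. code-block:: python
threshold = 15.938493  # conv_87 in the log above
scale = 127.0 / threshold
print(f"{scale:.6f}")  # 7.968131, matching the table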
It generates the following two files:
.. code-block:: bash
$ ls -lh encoder-scale-table.txt joiner-scale-table.txt
-rw-r--r-- 1 kuangfangjun root 955K Jan 11 17:28 encoder-scale-table.txt
-rw-r--r-- 1 kuangfangjun root 18K Jan 11 17:28 joiner-scale-table.txt
.. caution::
In practice, you need more calibration data to compute a reliable scale table.
Finally, let us use the scale table to quantize our models into ``int8``.
.. code-block:: bash
ncnn2int8
usage: ncnn2int8 [inparam] [inbin] [outparam] [outbin] [calibration table]
First, we quantize the encoder model:
.. code-block:: bash
cd egs/librispeech/ASR
cd icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/
ncnn2int8 \
./encoder_jit_trace-pnnx.ncnn.param \
./encoder_jit_trace-pnnx.ncnn.bin \
./encoder_jit_trace-pnnx.ncnn.int8.param \
./encoder_jit_trace-pnnx.ncnn.int8.bin \
./encoder-scale-table.txt
Next, we quantize the joiner model:
.. code-block:: bash
ncnn2int8 \
./joiner_jit_trace-pnnx.ncnn.param \
./joiner_jit_trace-pnnx.ncnn.bin \
./joiner_jit_trace-pnnx.ncnn.int8.param \
./joiner_jit_trace-pnnx.ncnn.int8.bin \
./joiner-scale-table.txt
The above two commands generate the following 4 files:
.. code-block:: bash
-rw-r--r-- 1 kuangfangjun root 99M Jan 11 17:34 encoder_jit_trace-pnnx.ncnn.int8.bin
-rw-r--r-- 1 kuangfangjun root 78K Jan 11 17:34 encoder_jit_trace-pnnx.ncnn.int8.param
-rw-r--r-- 1 kuangfangjun root 774K Jan 11 17:35 joiner_jit_trace-pnnx.ncnn.int8.bin
-rw-r--r-- 1 kuangfangjun root 496 Jan 11 17:35 joiner_jit_trace-pnnx.ncnn.int8.param
Congratulations! You have successfully quantized your model from ``float32`` to ``int8``.
.. caution::
``ncnn.int8.param`` and ``ncnn.int8.bin`` must be used in pairs.
You can replace ``ncnn.param`` and ``ncnn.bin`` with ``ncnn.int8.param``
and ``ncnn.int8.bin`` in `sherpa-ncnn`_ if you like.
For instance, to use only the ``int8`` encoder in ``sherpa-ncnn``, you can
replace the following invocation:
.. code-block:: bash
cd egs/librispeech/ASR
cd icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/
sherpa-ncnn \
../data/lang_bpe_500/tokens.txt \
./encoder_jit_trace-pnnx.ncnn.param \
./encoder_jit_trace-pnnx.ncnn.bin \
./decoder_jit_trace-pnnx.ncnn.param \
./decoder_jit_trace-pnnx.ncnn.bin \
./joiner_jit_trace-pnnx.ncnn.param \
./joiner_jit_trace-pnnx.ncnn.bin \
../test_wavs/1089-134686-0001.wav
with
.. code-block::
cd egs/librispeech/ASR
cd icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05/exp/
sherpa-ncnn \
../data/lang_bpe_500/tokens.txt \
./encoder_jit_trace-pnnx.ncnn.int8.param \
./encoder_jit_trace-pnnx.ncnn.int8.bin \
./decoder_jit_trace-pnnx.ncnn.param \
./decoder_jit_trace-pnnx.ncnn.bin \
./joiner_jit_trace-pnnx.ncnn.param \
./joiner_jit_trace-pnnx.ncnn.bin \
../test_wavs/1089-134686-0001.wav
The following table compares the file sizes once more:
+----------------------------------------+------------+
| File name | File size |
+----------------------------------------+------------+
| encoder_jit_trace-pnnx.pt | 283 MB |
+----------------------------------------+------------+
| decoder_jit_trace-pnnx.pt | 1010 KB |
+----------------------------------------+------------+
| joiner_jit_trace-pnnx.pt | 3.0 MB |
+----------------------------------------+------------+
| encoder_jit_trace-pnnx.ncnn.bin (fp16) | 142 MB |
+----------------------------------------+------------+
| decoder_jit_trace-pnnx.ncnn.bin (fp16) | 503 KB |
+----------------------------------------+------------+
| joiner_jit_trace-pnnx.ncnn.bin (fp16) | 1.5 MB |
+----------------------------------------+------------+
| encoder_jit_trace-pnnx.ncnn.bin (fp32) | 283 MB |
+----------------------------------------+------------+
| joiner_jit_trace-pnnx.ncnn.bin (fp32) | 3.0 MB |
+----------------------------------------+------------+
| encoder_jit_trace-pnnx.ncnn.int8.bin | 99 MB |
+----------------------------------------+------------+
| joiner_jit_trace-pnnx.ncnn.int8.bin | 774 KB |
+----------------------------------------+------------+
You can see that the file sizes of the model after ``int8`` quantization
are much smaller.
.. hint::
Currently, only linear layers and convolutional layers are quantized
with ``int8``, so you don't see an exact ``4x`` reduction in file sizes.
.. note::
You need to test the recognition accuracy after ``int8`` quantization.
You can find the speed comparison at `<https://github.com/k2-fsa/sherpa-ncnn/issues/44>`_.
That's it! Have fun with `sherpa-ncnn`_!


@ -1,644 +0,0 @@
.. _export_lstm_transducer_models_to_ncnn:
Export LSTM transducer models to ncnn
-------------------------------------
We use the pre-trained model from the following repository as an example:
`<https://huggingface.co/csukuangfj/icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03>`_
We will show you step by step how to export it to `ncnn`_ and run it with `sherpa-ncnn`_.
.. hint::
We use ``Ubuntu 18.04``, ``torch 1.13``, and ``Python 3.8`` for testing.
.. caution::
``torch > 2.0`` may not work. If you get errors while building pnnx, please switch
to ``torch < 2.0``.
1. Download the pre-trained model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. hint::
You have to install `git-lfs`_ before you continue.
.. code-block:: bash
cd egs/librispeech/ASR
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/csukuangfj/icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03
cd icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03
git lfs pull --include "exp/pretrained-iter-468000-avg-16.pt"
git lfs pull --include "data/lang_bpe_500/bpe.model"
cd ..
.. note::
We downloaded ``exp/pretrained-xxx.pt``, not ``exp/cpu-jit_xxx.pt``.
In the above code, we downloaded the pre-trained model into the directory
``egs/librispeech/ASR/icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03``.
2. Install ncnn and pnnx
^^^^^^^^^^^^^^^^^^^^^^^^
Please refer to :ref:`export_for_ncnn_install_ncnn_and_pnnx` .
3. Export the model via torch.jit.trace()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
First, let us rename our pre-trained model:
.. code-block::
cd egs/librispeech/ASR
cd icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp
ln -s pretrained-iter-468000-avg-16.pt epoch-99.pt
cd ../..
Next, we use the following code to export our model:
.. code-block:: bash
dir=./icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03
./lstm_transducer_stateless2/export-for-ncnn.py \
--exp-dir $dir/exp \
--tokens $dir/data/lang_bpe_500/tokens.txt \
--epoch 99 \
--avg 1 \
--use-averaged-model 0 \
--num-encoder-layers 12 \
--encoder-dim 512 \
--rnn-hidden-size 1024
.. hint::
We have renamed our model to ``epoch-99.pt`` so that we can use ``--epoch 99``.
There is only one pre-trained model, so we use ``--avg 1 --use-averaged-model 0``.
If you have trained a model by yourself and if you have all checkpoints
available, please first use ``decode.py`` to tune ``--epoch --avg``
and select the best combination with ``--use-averaged-model 1``.
.. note::
You will see the following log output:
.. literalinclude:: ./code/export-lstm-transducer-for-ncnn-output.txt
The log shows the model has ``84176356`` parameters, i.e., ``~84 M``.
.. code-block::
ls -lh icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/pretrained-iter-468000-avg-16.pt
-rw-r--r-- 1 kuangfangjun root 324M Feb 17 10:34 icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/pretrained-iter-468000-avg-16.pt
You can see that the file size of the pre-trained model is ``324 MB``, which
is roughly equal to ``84176356*4/1024/1024 = 321.107 MB``.
After running ``lstm_transducer_stateless2/export-for-ncnn.py``,
we will get the following files:
.. code-block:: bash
ls -lh icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/*pnnx.pt
-rw-r--r-- 1 kuangfangjun root 1010K Feb 17 11:22 icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/decoder_jit_trace-pnnx.pt
-rw-r--r-- 1 kuangfangjun root 318M Feb 17 11:22 icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/encoder_jit_trace-pnnx.pt
-rw-r--r-- 1 kuangfangjun root 3.0M Feb 17 11:22 icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/joiner_jit_trace-pnnx.pt
.. _lstm-transducer-step-4-export-torchscript-model-via-pnnx:
4. Export torchscript model via pnnx
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. hint::
Make sure you have set up the ``PATH`` environment variable
in :ref:`export_for_ncnn_install_ncnn_and_pnnx`. Otherwise,
it will throw an error saying that ``pnnx`` could not be found.
Now, it's time to export our models to `ncnn`_ via ``pnnx``.
.. code-block::
cd icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/
pnnx ./encoder_jit_trace-pnnx.pt
pnnx ./decoder_jit_trace-pnnx.pt
pnnx ./joiner_jit_trace-pnnx.pt
It will generate the following files:
.. code-block:: bash
ls -lh icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/*ncnn*{bin,param}
-rw-r--r-- 1 kuangfangjun root 503K Feb 17 11:32 icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/decoder_jit_trace-pnnx.ncnn.bin
-rw-r--r-- 1 kuangfangjun root 437 Feb 17 11:32 icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/decoder_jit_trace-pnnx.ncnn.param
-rw-r--r-- 1 kuangfangjun root 159M Feb 17 11:32 icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/encoder_jit_trace-pnnx.ncnn.bin
-rw-r--r-- 1 kuangfangjun root 21K Feb 17 11:32 icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/encoder_jit_trace-pnnx.ncnn.param
-rw-r--r-- 1 kuangfangjun root 1.5M Feb 17 11:33 icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/joiner_jit_trace-pnnx.ncnn.bin
-rw-r--r-- 1 kuangfangjun root 488 Feb 17 11:33 icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/joiner_jit_trace-pnnx.ncnn.param
There are two types of files:
- ``param``: It is a text file containing the model architectures. You can
use a text editor to view its content.
- ``bin``: It is a binary file containing the model parameters.
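For instance, you can inspect both kinds of files with standard shell tools. Here is a quick sketch (run inside the ``exp`` directory from above; it assumes ``xxd`` is installed):
.. code-block:: bash
# The param file is plain text; show its header
head -n 3 encoder_jit_trace-pnnx.ncnn.param
# The bin file is binary; peek at it with a hex dump instead
xxd encoder_jit_trace-pnnx.ncnn.bin | head -n 2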
Below we compare the file sizes of the models before and after conversion via ``pnnx``:
.. see https://tableconvert.com/restructuredtext-generator
+----------------------------------+------------+
| File name                        | File size  |
+==================================+============+
| encoder_jit_trace-pnnx.pt        | 318 MB     |
+----------------------------------+------------+
| decoder_jit_trace-pnnx.pt        | 1010 KB    |
+----------------------------------+------------+
| joiner_jit_trace-pnnx.pt         | 3.0 MB     |
+----------------------------------+------------+
| encoder_jit_trace-pnnx.ncnn.bin  | 159 MB     |
+----------------------------------+------------+
| decoder_jit_trace-pnnx.ncnn.bin  | 503 KB     |
+----------------------------------+------------+
| joiner_jit_trace-pnnx.ncnn.bin   | 1.5 MB     |
+----------------------------------+------------+
You can see that after conversion the models are about half the size of
the originals:
- encoder: 318 MB vs 159 MB
- decoder: 1010 KB vs 503 KB
- joiner: 3.0 MB vs 1.5 MB
The reason is that by default ``pnnx`` converts ``float32`` parameters
to ``float16``. A ``float32`` parameter occupies 4 bytes, while it is 2 bytes
for ``float16``. Thus, the file size is roughly halved after conversion.
.. hint::
If you use ``pnnx ./encoder_jit_trace-pnnx.pt fp16=0``, then ``pnnx``
won't convert ``float32`` to ``float16``.
5. Test the exported models in icefall
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. note::
We assume you have set up the environment variable ``PYTHONPATH`` when
building `ncnn`_.
Now we have successfully converted our pre-trained model to `ncnn`_ format.
The 6 generated files are what we need. You can use the following code to
test the converted models:
.. code-block:: bash
python3 ./lstm_transducer_stateless2/streaming-ncnn-decode.py \
--tokens ./icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/data/lang_bpe_500/tokens.txt \
--encoder-param-filename ./icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/encoder_jit_trace-pnnx.ncnn.param \
--encoder-bin-filename ./icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/encoder_jit_trace-pnnx.ncnn.bin \
--decoder-param-filename ./icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/decoder_jit_trace-pnnx.ncnn.param \
--decoder-bin-filename ./icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/decoder_jit_trace-pnnx.ncnn.bin \
--joiner-param-filename ./icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/joiner_jit_trace-pnnx.ncnn.param \
--joiner-bin-filename ./icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/joiner_jit_trace-pnnx.ncnn.bin \
./icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/test_wavs/1089-134686-0001.wav
.. hint::
`ncnn`_ supports only ``batch size == 1``, so ``streaming-ncnn-decode.py`` accepts
only 1 wave file as input.
The output is given below:
.. literalinclude:: ./code/test-streaming-ncnn-decode-lstm-transducer-libri.txt
Congratulations! You have successfully exported a model from PyTorch to `ncnn`_!
.. _lstm-modify-the-exported-encoder-for-sherpa-ncnn:
6. Modify the exported encoder for sherpa-ncnn
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In order to use the exported models in `sherpa-ncnn`_, we have to modify
``encoder_jit_trace-pnnx.ncnn.param``.
Let us have a look at the first few lines of ``encoder_jit_trace-pnnx.ncnn.param``:
.. code-block::
7767517
267 379
Input in0 0 1 in0
**Explanation** of the above three lines:
1. ``7767517``, it is a magic number and should not be changed.
2. ``267 379``, the first number ``267`` specifies the number of layers
in this file, while ``379`` specifies the number of intermediate outputs
of this file.
3. ``Input in0 0 1 in0``, ``Input`` is the layer type of this layer; ``in0``
is the layer name of this layer; ``0`` means this layer has no input;
``1`` means this layer has one output; ``in0`` is the output name of
this layer.
We need to add one extra line and also increment the number of layers.
The result looks like this:
.. code-block:: bash
7767517
268 379
SherpaMetaData sherpa_meta_data1 0 0 0=3 1=12 2=512 3=1024
Input in0 0 1 in0
**Explanation**
1. ``7767517``, it stays the same.
2. ``268 379``, we have added an extra layer, so we need to update ``267`` to ``268``.
We don't need to change ``379`` since the newly added layer has no inputs or outputs.
3. ``SherpaMetaData sherpa_meta_data1 0 0 0=3 1=12 2=512 3=1024``
This line is newly added. Its explanation is given below:
- ``SherpaMetaData`` is the type of this layer. Must be ``SherpaMetaData``.
- ``sherpa_meta_data1`` is the name of this layer. Must be ``sherpa_meta_data1``.
- ``0 0`` means this layer has no inputs or outputs. Must be ``0 0``.
- ``0=3``, 0 is the key and 3 is the value. MUST be ``0=3``.
- ``1=12``, 1 is the key and 12 is the value of the
parameter ``--num-encoder-layers`` that you provided when running
``./lstm_transducer_stateless2/export-for-ncnn.py``.
- ``2=512``, 2 is the key and 512 is the value of the
parameter ``--encoder-dim`` that you provided when running
``./lstm_transducer_stateless2/export-for-ncnn.py``.
- ``3=1024``, 3 is the key and 1024 is the value of the
parameter ``--rnn-hidden-size`` that you provided when running
``./lstm_transducer_stateless2/export-for-ncnn.py``.
For ease of reference, we list the key-value pairs that you need to add
in the following table. If your model has a different setting, please
change the values for ``SherpaMetaData`` accordingly. Otherwise, you
will be ``SAD``.
+------+-----------------------------+
| key  | value                       |
+======+=============================+
| 0    | 3 (fixed)                   |
+------+-----------------------------+
| 1    | ``--num-encoder-layers``    |
+------+-----------------------------+
| 2    | ``--encoder-dim``           |
+------+-----------------------------+
| 3    | ``--rnn-hidden-size``       |
+------+-----------------------------+
4. ``Input in0 0 1 in0``. No need to change it.
.. caution::
When you add a new layer ``SherpaMetaData``, please remember to update the
number of layers. In our case, update ``267`` to ``268``. Otherwise,
you will be SAD later.
.. hint::
After adding the new layer ``SherpaMetaData``, you cannot use this model
with ``streaming-ncnn-decode.py`` anymore since ``SherpaMetaData`` is
supported only in `sherpa-ncnn`_.
.. hint::
`ncnn`_ is very flexible. You can add new layers to it just by text-editing
the ``param`` file! You don't need to change the ``bin`` file.
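If you prefer to script this edit instead of using a text editor, the following is a minimal sketch using GNU ``sed``. It hard-codes the layer count ``267`` and the ``SherpaMetaData`` values from this example; adjust both for your own model:
.. code-block:: bash
cd icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/
# Bump the layer count from 267 to 268 (a backup is kept in *.bak)
sed -i.bak 's/^267 379$/268 379/' encoder_jit_trace-pnnx.ncnn.param
# Insert the SherpaMetaData line right before the Input layer
sed -i 's/^Input in0 0 1 in0$/SherpaMetaData sherpa_meta_data1 0 0 0=3 1=12 2=512 3=1024\nInput in0 0 1 in0/' encoder_jit_trace-pnnx.ncnn.param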
Now you can use this model in `sherpa-ncnn`_.
Please refer to the following documentation:
- Linux/macOS/Windows/arm/aarch64: `<https://k2-fsa.github.io/sherpa/ncnn/install/index.html>`_
- ``Android``: `<https://k2-fsa.github.io/sherpa/ncnn/android/index.html>`_
- ``iOS``: `<https://k2-fsa.github.io/sherpa/ncnn/ios/index.html>`_
- Python: `<https://k2-fsa.github.io/sherpa/ncnn/python/index.html>`_
We have a list of pre-trained models that have been exported for `sherpa-ncnn`_:
- `<https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html>`_
You can find more usages there.
7. (Optional) int8 quantization with sherpa-ncnn
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This step is optional. It describes how to quantize our model with ``int8``.
Redo :ref:`lstm-transducer-step-4-export-torchscript-model-via-pnnx`, this time
disabling ``fp16`` when running ``pnnx``:
.. code-block::
cd icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/
pnnx ./encoder_jit_trace-pnnx.pt fp16=0
pnnx ./decoder_jit_trace-pnnx.pt
pnnx ./joiner_jit_trace-pnnx.pt fp16=0
.. note::
We add ``fp16=0`` when exporting the encoder and joiner. `ncnn`_ does not
support quantizing the decoder model yet. We will update this documentation
once `ncnn`_ supports it (maybe in 2023).
.. code-block:: bash
ls -lh icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/*_jit_trace-pnnx.ncnn.{param,bin}
-rw-r--r-- 1 kuangfangjun root 503K Feb 17 11:32 icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/decoder_jit_trace-pnnx.ncnn.bin
-rw-r--r-- 1 kuangfangjun root 437 Feb 17 11:32 icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/decoder_jit_trace-pnnx.ncnn.param
-rw-r--r-- 1 kuangfangjun root 317M Feb 17 11:54 icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/encoder_jit_trace-pnnx.ncnn.bin
-rw-r--r-- 1 kuangfangjun root 21K Feb 17 11:54 icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/encoder_jit_trace-pnnx.ncnn.param
-rw-r--r-- 1 kuangfangjun root 3.0M Feb 17 11:54 icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/joiner_jit_trace-pnnx.ncnn.bin
-rw-r--r-- 1 kuangfangjun root 488 Feb 17 11:54 icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/joiner_jit_trace-pnnx.ncnn.param
Let us compare the file sizes again:
+----------------------------------------+------------+
| File name                              | File size  |
+----------------------------------------+------------+
| encoder_jit_trace-pnnx.pt              | 318 MB     |
+----------------------------------------+------------+
| decoder_jit_trace-pnnx.pt              | 1010 KB    |
+----------------------------------------+------------+
| joiner_jit_trace-pnnx.pt               | 3.0 MB     |
+----------------------------------------+------------+
| encoder_jit_trace-pnnx.ncnn.bin (fp16) | 159 MB     |
+----------------------------------------+------------+
| decoder_jit_trace-pnnx.ncnn.bin (fp16) | 503 KB     |
+----------------------------------------+------------+
| joiner_jit_trace-pnnx.ncnn.bin (fp16)  | 1.5 MB     |
+----------------------------------------+------------+
| encoder_jit_trace-pnnx.ncnn.bin (fp32) | 317 MB     |
+----------------------------------------+------------+
| joiner_jit_trace-pnnx.ncnn.bin (fp32)  | 3.0 MB     |
+----------------------------------------+------------+
You can see that the file sizes are doubled when we disable ``fp16``.
.. note::
You can again use ``streaming-ncnn-decode.py`` to test the exported models.
Next, follow :ref:`lstm-modify-the-exported-encoder-for-sherpa-ncnn`
to modify ``encoder_jit_trace-pnnx.ncnn.param``.
Change
.. code-block:: bash
7767517
267 379
Input in0 0 1 in0
to
.. code-block:: bash
7767517
268 379
SherpaMetaData sherpa_meta_data1 0 0 0=3 1=12 2=512 3=1024
Input in0 0 1 in0
.. caution::
Please follow :ref:`lstm-modify-the-exported-encoder-for-sherpa-ncnn`
to change the values for ``SherpaMetaData`` if your model uses a different setting.
Next, let us compile `sherpa-ncnn`_ since we will quantize our models within
`sherpa-ncnn`_.
.. code-block:: bash
# We will download sherpa-ncnn to $HOME/open-source/
# You can change it to anywhere you like.
cd $HOME
mkdir -p open-source
cd open-source
git clone https://github.com/k2-fsa/sherpa-ncnn
cd sherpa-ncnn
mkdir build
cd build
cmake ..
make -j 4
./bin/generate-int8-scale-table
export PATH=$HOME/open-source/sherpa-ncnn/build/bin:$PATH
The output of the above commands is:
.. code-block:: bash
(py38) kuangfangjun:build$ generate-int8-scale-table
Please provide 10 arg. Currently given: 1
Usage:
generate-int8-scale-table encoder.param encoder.bin decoder.param decoder.bin joiner.param joiner.bin encoder-scale-table.txt joiner-scale-table.txt wave_filenames.txt
Each line in wave_filenames.txt is a path to some 16k Hz mono wave file.
We need to create a file ``wave_filenames.txt`` containing paths to
some calibration wave files. For testing purposes, we use the ``test_wavs``
from the pre-trained model repository
`<https://huggingface.co/csukuangfj/icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03>`_:
.. code-block:: bash
cd egs/librispeech/ASR
cd icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/
cat <<EOF > wave_filenames.txt
../test_wavs/1089-134686-0001.wav
../test_wavs/1221-135766-0001.wav
../test_wavs/1221-135766-0002.wav
EOF
Now we can calculate the scales needed for quantization with the calibration data:
.. code-block:: bash
cd egs/librispeech/ASR
cd icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/
generate-int8-scale-table \
./encoder_jit_trace-pnnx.ncnn.param \
./encoder_jit_trace-pnnx.ncnn.bin \
./decoder_jit_trace-pnnx.ncnn.param \
./decoder_jit_trace-pnnx.ncnn.bin \
./joiner_jit_trace-pnnx.ncnn.param \
./joiner_jit_trace-pnnx.ncnn.bin \
./encoder-scale-table.txt \
./joiner-scale-table.txt \
./wave_filenames.txt
The output logs are shown below:
.. literalinclude:: ./code/generate-int-8-scale-table-for-lstm.txt
It generates the following two files:
.. code-block:: bash
ls -lh encoder-scale-table.txt joiner-scale-table.txt
-rw-r--r-- 1 kuangfangjun root 345K Feb 17 12:13 encoder-scale-table.txt
-rw-r--r-- 1 kuangfangjun root 17K Feb 17 12:13 joiner-scale-table.txt
.. caution::
In practice, you need more calibration data to compute a reliable scale table.
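For a real deployment, you can build a larger list from your own in-domain recordings. A sketch, assuming they live under a hypothetical ``/path/to/calibration-wavs``:
.. code-block:: bash
# Collect all 16 kHz mono wave files for calibration
find /path/to/calibration-wavs -name "*.wav" > wave_filenames.txt
wc -l wave_filenames.txt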
Finally, let us use the scale table to quantize our models into ``int8``.
.. code-block:: bash
ncnn2int8
usage: ncnn2int8 [inparam] [inbin] [outparam] [outbin] [calibration table]
First, we quantize the encoder model:
.. code-block:: bash
cd egs/librispeech/ASR
cd icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/
ncnn2int8 \
./encoder_jit_trace-pnnx.ncnn.param \
./encoder_jit_trace-pnnx.ncnn.bin \
./encoder_jit_trace-pnnx.ncnn.int8.param \
./encoder_jit_trace-pnnx.ncnn.int8.bin \
./encoder-scale-table.txt
Next, we quantize the joiner model:
.. code-block:: bash
ncnn2int8 \
./joiner_jit_trace-pnnx.ncnn.param \
./joiner_jit_trace-pnnx.ncnn.bin \
./joiner_jit_trace-pnnx.ncnn.int8.param \
./joiner_jit_trace-pnnx.ncnn.int8.bin \
./joiner-scale-table.txt
The above two commands generate the following 4 files:
.. code-block::
-rw-r--r-- 1 kuangfangjun root 218M Feb 17 12:19 encoder_jit_trace-pnnx.ncnn.int8.bin
-rw-r--r-- 1 kuangfangjun root 21K Feb 17 12:19 encoder_jit_trace-pnnx.ncnn.int8.param
-rw-r--r-- 1 kuangfangjun root 774K Feb 17 12:19 joiner_jit_trace-pnnx.ncnn.int8.bin
-rw-r--r-- 1 kuangfangjun root 496 Feb 17 12:19 joiner_jit_trace-pnnx.ncnn.int8.param
Congratulations! You have successfully quantized your model from ``float32`` to ``int8``.
.. caution::
``ncnn.int8.param`` and ``ncnn.int8.bin`` must be used in pairs.
You can replace ``ncnn.param`` and ``ncnn.bin`` with ``ncnn.int8.param``
and ``ncnn.int8.bin`` in `sherpa-ncnn`_ if you like.
For instance, to use only the ``int8`` encoder in ``sherpa-ncnn``, you can
replace the following invocation:
.. code-block::
cd egs/librispeech/ASR
cd icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/
sherpa-ncnn \
../data/lang_bpe_500/tokens.txt \
./encoder_jit_trace-pnnx.ncnn.param \
./encoder_jit_trace-pnnx.ncnn.bin \
./decoder_jit_trace-pnnx.ncnn.param \
./decoder_jit_trace-pnnx.ncnn.bin \
./joiner_jit_trace-pnnx.ncnn.param \
./joiner_jit_trace-pnnx.ncnn.bin \
../test_wavs/1089-134686-0001.wav
with
.. code-block:: bash
cd egs/librispeech/ASR
cd icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/
sherpa-ncnn \
../data/lang_bpe_500/tokens.txt \
./encoder_jit_trace-pnnx.ncnn.int8.param \
./encoder_jit_trace-pnnx.ncnn.int8.bin \
./decoder_jit_trace-pnnx.ncnn.param \
./decoder_jit_trace-pnnx.ncnn.bin \
./joiner_jit_trace-pnnx.ncnn.param \
./joiner_jit_trace-pnnx.ncnn.bin \
../test_wavs/1089-134686-0001.wav
The following table again compares the file sizes:
+----------------------------------------+------------+
| File name                              | File size  |
+----------------------------------------+------------+
| encoder_jit_trace-pnnx.pt              | 318 MB     |
+----------------------------------------+------------+
| decoder_jit_trace-pnnx.pt              | 1010 KB    |
+----------------------------------------+------------+
| joiner_jit_trace-pnnx.pt               | 3.0 MB     |
+----------------------------------------+------------+
| encoder_jit_trace-pnnx.ncnn.bin (fp16) | 159 MB     |
+----------------------------------------+------------+
| decoder_jit_trace-pnnx.ncnn.bin (fp16) | 503 KB     |
+----------------------------------------+------------+
| joiner_jit_trace-pnnx.ncnn.bin (fp16)  | 1.5 MB     |
+----------------------------------------+------------+
| encoder_jit_trace-pnnx.ncnn.bin (fp32) | 317 MB     |
+----------------------------------------+------------+
| joiner_jit_trace-pnnx.ncnn.bin (fp32)  | 3.0 MB     |
+----------------------------------------+------------+
| encoder_jit_trace-pnnx.ncnn.int8.bin   | 218 MB     |
+----------------------------------------+------------+
| joiner_jit_trace-pnnx.ncnn.int8.bin    | 774 KB     |
+----------------------------------------+------------+
You can see that the file size of the joiner model after ``int8`` quantization
is much smaller. However, the size of the encoder model is even larger than
the ``fp16`` counterpart. The reason is that `ncnn`_ currently does not support
quantizing ``LSTM`` layers to ``8-bit``. Please see
`<https://github.com/Tencent/ncnn/issues/4532>`_.
.. hint::
Currently, only linear layers and convolutional layers are quantized
with ``int8``, so you don't see an exact ``4x`` reduction in file sizes.
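As a rough sanity check on the joiner, which consists mostly of linear layers, the observed compression ratio relative to ``fp32`` is indeed close to ``4x``:
.. code-block:: bash
# fp32 joiner: 3.0 MB; int8 joiner: 774 KB
awk 'BEGIN { printf "%.2fx\n", 3.0 * 1024 / 774 }'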
.. note::
You need to test the recognition accuracy after ``int8`` quantization.
That's it! Have fun with `sherpa-ncnn`_!

.. _export_streaming_zipformer_transducer_models_to_ncnn:
Export streaming Zipformer transducer models to ncnn
----------------------------------------------------
We use the pre-trained model from the following repository as an example:
`<https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29>`_
We will show you step by step how to export it to `ncnn`_ and run it with `sherpa-ncnn`_.
.. hint::
We use ``Ubuntu 18.04``, ``torch 1.13``, and ``Python 3.8`` for testing.
.. caution::
``torch > 2.0`` may not work. If you get errors while building pnnx, please switch
to ``torch < 2.0``.
1. Download the pre-trained model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. hint::
You have to install `git-lfs`_ before you continue.
.. code-block:: bash
cd egs/librispeech/ASR
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29
cd icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29
git lfs pull --include "exp/pretrained.pt"
git lfs pull --include "data/lang_bpe_500/bpe.model"
cd ..
.. note::
We downloaded ``exp/pretrained-xxx.pt``, not ``exp/cpu-jit_xxx.pt``.
In the above code, we downloaded the pre-trained model into the directory
``egs/librispeech/ASR/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29``.
2. Install ncnn and pnnx
^^^^^^^^^^^^^^^^^^^^^^^^
Please refer to :ref:`export_for_ncnn_install_ncnn_and_pnnx` .
3. Export the model via torch.jit.trace()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
First, let us rename our pre-trained model:
.. code-block::
cd egs/librispeech/ASR
cd icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp
ln -s pretrained.pt epoch-99.pt
cd ../..
Next, we use the following code to export our model:
.. code-block:: bash
dir=./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29
./pruned_transducer_stateless7_streaming/export-for-ncnn.py \
--tokens $dir/data/lang_bpe_500/tokens.txt \
--exp-dir $dir/exp \
--use-averaged-model 0 \
--epoch 99 \
--avg 1 \
--decode-chunk-len 32 \
--num-left-chunks 4 \
--num-encoder-layers "2,4,3,2,4" \
--feedforward-dims "1024,1024,2048,2048,1024" \
--nhead "8,8,8,8,8" \
--encoder-dims "384,384,384,384,384" \
--attention-dims "192,192,192,192,192" \
--encoder-unmasked-dims "256,256,256,256,256" \
--zipformer-downsampling-factors "1,2,4,8,2" \
--cnn-module-kernels "31,31,31,31,31" \
--decoder-dim 512 \
--joiner-dim 512
.. caution::
If your model has different configuration parameters, please change them accordingly.
.. hint::
We have renamed our model to ``epoch-99.pt`` so that we can use ``--epoch 99``.
There is only one pre-trained model, so we use ``--avg 1 --use-averaged-model 0``.
If you have trained a model by yourself and if you have all checkpoints
available, please first use ``decode.py`` to tune ``--epoch --avg``
and select the best combination with ``--use-averaged-model 1``.
.. note::
You will see the following log output:
.. literalinclude:: ./code/export-zipformer-transducer-for-ncnn-output.txt
The log shows the model has ``69920376`` parameters, i.e., ``~69.9 M``.
.. code-block:: bash
ls -lh icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/pretrained.pt
-rw-r--r-- 1 kuangfangjun root 269M Jan 12 12:53 icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/pretrained.pt
You can see that the file size of the pre-trained model is ``269 MB``, which
is roughly equal to ``69920376*4/1024/1024 = 266.725 MB``.
After running ``pruned_transducer_stateless7_streaming/export-for-ncnn.py``,
we will get the following files:
.. code-block:: bash
ls -lh icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/*pnnx.pt
-rw-r--r-- 1 kuangfangjun root 1022K Feb 27 20:23 icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/decoder_jit_trace-pnnx.pt
-rw-r--r-- 1 kuangfangjun root 266M Feb 27 20:23 icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/encoder_jit_trace-pnnx.pt
-rw-r--r-- 1 kuangfangjun root 2.8M Feb 27 20:23 icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/joiner_jit_trace-pnnx.pt
.. _zipformer-transducer-step-4-export-torchscript-model-via-pnnx:
4. Export torchscript model via pnnx
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. hint::
Make sure you have set up the ``PATH`` environment variable
in :ref:`export_for_ncnn_install_ncnn_and_pnnx`. Otherwise,
it will throw an error saying that ``pnnx`` could not be found.
Now, it's time to export our models to `ncnn`_ via ``pnnx``.
.. code-block::
cd icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/
pnnx ./encoder_jit_trace-pnnx.pt
pnnx ./decoder_jit_trace-pnnx.pt
pnnx ./joiner_jit_trace-pnnx.pt
It will generate the following files:
.. code-block:: bash
ls -lh icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/*ncnn*{bin,param}
-rw-r--r-- 1 kuangfangjun root 509K Feb 27 20:31 icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/decoder_jit_trace-pnnx.ncnn.bin
-rw-r--r-- 1 kuangfangjun root 437 Feb 27 20:31 icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/decoder_jit_trace-pnnx.ncnn.param
-rw-r--r-- 1 kuangfangjun root 133M Feb 27 20:30 icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/encoder_jit_trace-pnnx.ncnn.bin
-rw-r--r-- 1 kuangfangjun root 152K Feb 27 20:30 icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/encoder_jit_trace-pnnx.ncnn.param
-rw-r--r-- 1 kuangfangjun root 1.4M Feb 27 20:31 icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/joiner_jit_trace-pnnx.ncnn.bin
-rw-r--r-- 1 kuangfangjun root 488 Feb 27 20:31 icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/joiner_jit_trace-pnnx.ncnn.param
There are two types of files:
- ``param``: It is a text file containing the model architectures. You can
use a text editor to view its content.
- ``bin``: It is a binary file containing the model parameters.
Below we compare the file sizes of the models before and after conversion via ``pnnx``:
.. see https://tableconvert.com/restructuredtext-generator
+----------------------------------+------------+
| File name                        | File size  |
+==================================+============+
| encoder_jit_trace-pnnx.pt        | 266 MB     |
+----------------------------------+------------+
| decoder_jit_trace-pnnx.pt        | 1022 KB    |
+----------------------------------+------------+
| joiner_jit_trace-pnnx.pt         | 2.8 MB     |
+----------------------------------+------------+
| encoder_jit_trace-pnnx.ncnn.bin  | 133 MB     |
+----------------------------------+------------+
| decoder_jit_trace-pnnx.ncnn.bin  | 509 KB     |
+----------------------------------+------------+
| joiner_jit_trace-pnnx.ncnn.bin   | 1.4 MB     |
+----------------------------------+------------+
You can see that after conversion the models are about half the size of
the originals:
- encoder: 266 MB vs 133 MB
- decoder: 1022 KB vs 509 KB
- joiner: 2.8 MB vs 1.4 MB
The reason is that by default ``pnnx`` converts ``float32`` parameters
to ``float16``. A ``float32`` parameter occupies 4 bytes, while it is 2 bytes
for ``float16``. Thus, the file size is roughly halved after conversion.
.. hint::
If you use ``pnnx ./encoder_jit_trace-pnnx.pt fp16=0``, then ``pnnx``
won't convert ``float32`` to ``float16``.
5. Test the exported models in icefall
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. note::
We assume you have set up the environment variable ``PYTHONPATH`` when
building `ncnn`_.
Now we have successfully converted our pre-trained model to `ncnn`_ format.
The 6 generated files are what we need. You can use the following code to
test the converted models:
.. code-block:: bash
python3 ./pruned_transducer_stateless7_streaming/streaming-ncnn-decode.py \
--tokens ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/data/lang_bpe_500/tokens.txt \
--encoder-param-filename ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/encoder_jit_trace-pnnx.ncnn.param \
--encoder-bin-filename ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/encoder_jit_trace-pnnx.ncnn.bin \
--decoder-param-filename ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/decoder_jit_trace-pnnx.ncnn.param \
--decoder-bin-filename ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/decoder_jit_trace-pnnx.ncnn.bin \
--joiner-param-filename ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/joiner_jit_trace-pnnx.ncnn.param \
--joiner-bin-filename ./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/exp/joiner_jit_trace-pnnx.ncnn.bin \
./icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29/test_wavs/1089-134686-0001.wav
.. hint::
`ncnn`_ supports only ``batch size == 1``, so ``streaming-ncnn-decode.py`` accepts
only 1 wave file as input.
The output is given below:
.. literalinclude:: ./code/test-streaming-ncnn-decode-zipformer-transducer-libri.txt
Congratulations! You have successfully exported a model from PyTorch to `ncnn`_!
.. _zipformer-modify-the-exported-encoder-for-sherpa-ncnn:
6. Modify the exported encoder for sherpa-ncnn
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In order to use the exported models in `sherpa-ncnn`_, we have to modify
``encoder_jit_trace-pnnx.ncnn.param``.
Let us have a look at the first few lines of ``encoder_jit_trace-pnnx.ncnn.param``:
.. code-block::
7767517
2028 2547
Input in0 0 1 in0
**Explanation** of the above three lines:
1. ``7767517``, it is a magic number and should not be changed.
2. ``2028 2547``, the first number ``2028`` specifies the number of layers
in this file, while ``2547`` specifies the number of intermediate outputs
of this file.
3. ``Input in0 0 1 in0``, ``Input`` is the layer type of this layer; ``in0``
is the layer name of this layer; ``0`` means this layer has no input;
``1`` means this layer has one output; ``in0`` is the output name of
this layer.
We need to add one extra line and also increment the number of layers.
The result looks like this:
.. code-block:: bash
7767517
2029 2547
SherpaMetaData sherpa_meta_data1 0 0 0=2 1=32 2=4 3=7 15=1 -23316=5,2,4,3,2,4 -23317=5,384,384,384,384,384 -23318=5,192,192,192,192,192 -23319=5,1,2,4,8,2 -23320=5,31,31,31,31,31
Input in0 0 1 in0
**Explanation**
1. ``7767517``, it stays the same.
2. ``2029 2547``, we have added an extra layer, so we need to update ``2028`` to ``2029``.
We don't need to change ``2547`` since the newly added layer has no inputs or outputs.
3. ``SherpaMetaData sherpa_meta_data1 0 0 0=2 1=32 2=4 3=7 15=1 -23316=5,2,4,3,2,4 -23317=5,384,384,384,384,384 -23318=5,192,192,192,192,192 -23319=5,1,2,4,8,2 -23320=5,31,31,31,31,31``
This line is newly added. Its explanation is given below:
- ``SherpaMetaData`` is the type of this layer. Must be ``SherpaMetaData``.
- ``sherpa_meta_data1`` is the name of this layer. Must be ``sherpa_meta_data1``.
- ``0 0`` means this layer has no inputs or outputs. Must be ``0 0``.
- ``0=2``, 0 is the key and 2 is the value. MUST be ``0=2``.
- ``1=32``, 1 is the key and 32 is the value of the
parameter ``--decode-chunk-len`` that you provided when running
``./pruned_transducer_stateless7_streaming/export-for-ncnn.py``.
- ``2=4``, 2 is the key and 4 is the value of the
parameter ``--num-left-chunks`` that you provided when running
``./pruned_transducer_stateless7_streaming/export-for-ncnn.py``.
- ``3=7``, 3 is the key and 7 is the value for the amount of padding
used in the ``Conv2DSubsampling`` layer. It should be 7 for Zipformer
if you don't change ``zipformer.py``.
- ``15=1``, attribute 15, this is the model version. Starting from
`sherpa-ncnn`_ v2.0, we require the model version to be >= 1.
- ``-23316=5,2,4,3,2,4``, attribute 16, this is an array attribute.
It is attribute 16 since -23300 - (-23316) = 16.
The first element of the array is the length of the array, which is 5 in our case.
``2,4,3,2,4`` is the value of ``--num-encoder-layers`` that you provided
when running ``./pruned_transducer_stateless7_streaming/export-for-ncnn.py``.
- ``-23317=5,384,384,384,384,384``, attribute 17.
The first element of the array is the length of the array, which is 5 in our case.
``384,384,384,384,384`` is the value of ``--encoder-dims`` that you provided
when running ``./pruned_transducer_stateless7_streaming/export-for-ncnn.py``.
- ``-23318=5,192,192,192,192,192``, attribute 18.
The first element of the array is the length of the array, which is 5 in our case.
``192,192,192,192,192`` is the value of ``--attention-dims`` that you provided
when running ``./pruned_transducer_stateless7_streaming/export-for-ncnn.py``.
- ``-23319=5,1,2,4,8,2``, attribute 19.
The first element of the array is the length of the array, which is 5 in our case.
``1,2,4,8,2`` is the value of ``--zipformer-downsampling-factors`` that you provided
when running ``./pruned_transducer_stateless7_streaming/export-for-ncnn.py``.
- ``-23320=5,31,31,31,31,31``, attribute 20.
The first element of the array is the length of the array, which is 5 in our case.
``31,31,31,31,31`` is the value of ``--cnn-module-kernels`` that you provided
when running ``./pruned_transducer_stateless7_streaming/export-for-ncnn.py``.
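All of these array attributes follow the same encoding rule: attribute ``k`` uses the key ``-23300 - k``. A quick sketch to verify the keys used above:
.. code-block:: bash
# ncnn array attributes: key = -23300 - attribute_index
awk 'BEGIN { for (k = 16; k <= 20; k++) print "attribute", k, "-> key", -23300 - k }'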
For ease of reference, we list the key-value pairs that you need to add
in the following table. If your model has a different setting, please
change the values for ``SherpaMetaData`` accordingly. Otherwise, you
will be ``SAD``.
+----------+--------------------------------------------+
| key      | value                                      |
+==========+============================================+
| 0        | 2 (fixed)                                  |
+----------+--------------------------------------------+
| 1        | ``--decode-chunk-len``                     |
+----------+--------------------------------------------+
| 2        | ``--num-left-chunks``                      |
+----------+--------------------------------------------+
| 3        | 7 (if you don't change code)               |
+----------+--------------------------------------------+
| 15       | 1 (the model version)                      |
+----------+--------------------------------------------+
| -23316   | ``--num-encoder-layers``                   |
+----------+--------------------------------------------+
| -23317   | ``--encoder-dims``                         |
+----------+--------------------------------------------+
| -23318   | ``--attention-dims``                       |
+----------+--------------------------------------------+
| -23319   | ``--zipformer-downsampling-factors``       |
+----------+--------------------------------------------+
| -23320   | ``--cnn-module-kernels``                   |
+----------+--------------------------------------------+
4. ``Input in0 0 1 in0``. No need to change it.
.. caution::
When you add a new layer ``SherpaMetaData``, please remember to update the
number of layers. In our case, update ``2028`` to ``2029``. Otherwise,
you will be SAD later.
.. hint::
After adding the new layer ``SherpaMetaData``, you cannot use this model
with ``streaming-ncnn-decode.py`` anymore since ``SherpaMetaData`` is
supported only in `sherpa-ncnn`_.
.. hint::
`ncnn`_ is very flexible. You can add new layers to it just by text-editing
the ``param`` file! You don't need to change the ``bin`` file.
Now you can use this model in `sherpa-ncnn`_.
Please refer to the following documentation:
- Linux/macOS/Windows/arm/aarch64: `<https://k2-fsa.github.io/sherpa/ncnn/install/index.html>`_
- ``Android``: `<https://k2-fsa.github.io/sherpa/ncnn/android/index.html>`_
- ``iOS``: `<https://k2-fsa.github.io/sherpa/ncnn/ios/index.html>`_
- Python: `<https://k2-fsa.github.io/sherpa/ncnn/python/index.html>`_
We have a list of pre-trained models that have been exported for `sherpa-ncnn`_:
- `<https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/index.html>`_
You can find more usages there.

.. _icefall_export_to_ncnn:
Export to ncnn
==============
We support exporting the following models
to `ncnn <https://github.com/tencent/ncnn>`_:
- `Zipformer transducer models <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless7_streaming>`_
- `LSTM transducer models <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/lstm_transducer_stateless2>`_
- `ConvEmformer transducer models <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/conv_emformer_transducer_stateless2>`_
We also provide `sherpa-ncnn`_
for performing speech recognition using `ncnn`_ with exported models.
It has been tested on the following platforms:
- Linux
- macOS
- Windows
- ``Android``
- ``iOS``
- ``Raspberry Pi``
- `爱芯派 <https://wiki.sipeed.com/hardware/zh/>`_ (`MAIX-III AXera-Pi <https://wiki.sipeed.com/hardware/en/maixIII/ax-pi/axpi.html>`_).
- `RV1126 <https://www.rock-chips.com/a/en/products/RV11_Series/2020/0427/1076.html>`_
`sherpa-ncnn`_ is self-contained and can be statically linked to produce
a binary containing everything needed. Please refer
to its documentation for details:
- `<https://k2-fsa.github.io/sherpa/ncnn/index.html>`_
.. toctree::
export-ncnn-zipformer
export-ncnn-conv-emformer
export-ncnn-lstm

Export to ONNX
==============
In this section, we describe how to export models to `ONNX`_.
.. hint::
Before you continue, please run:
.. code-block:: bash
pip install onnx
In each recipe, there is a file called ``export-onnx.py``, which is used
to export trained models to `ONNX`_.
There is also a file named ``onnx_pretrained.py``, which shows how to use
the exported `ONNX`_ model in Python with `onnxruntime`_ to decode sound files.
sherpa-onnx
-----------
We have a separate repository `sherpa-onnx`_ for deploying your exported models
on various platforms such as:
- iOS
- Android
- Raspberry Pi
- Linux/macOS/Windows
Please see the documentation of `sherpa-onnx`_ for details:
`<https://k2-fsa.github.io/sherpa/onnx/index.html>`_
Example
-------
In the following, we demonstrate how to export a streaming Zipformer pre-trained
model from
`<https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29>`_
to `ONNX`_.
Download the pre-trained model
------------------------------
.. hint::
We assume you have installed `git-lfs`_.
.. code-block:: bash
cd egs/librispeech/ASR
repo_url=https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless7-streaming-2022-12-29
GIT_LFS_SKIP_SMUDGE=1 git clone $repo_url
repo=$(basename $repo_url)
pushd $repo
git lfs pull --include "data/lang_bpe_500/bpe.model"
git lfs pull --include "exp/pretrained.pt"
cd exp
ln -s pretrained.pt epoch-99.pt
popd
Export the model to ONNX
------------------------
.. code-block:: bash
./pruned_transducer_stateless7_streaming/export-onnx.py \
--tokens $repo/data/lang_bpe_500/tokens.txt \
--use-averaged-model 0 \
--epoch 99 \
--avg 1 \
--decode-chunk-len 32 \
--exp-dir $repo/exp/
.. warning::
``export-onnx.py`` from different recipes has different options.
In the above example, ``--decode-chunk-len`` is specific for the
streaming Zipformer. Other models won't have such an option.
It will generate the following 3 files in ``$repo/exp``:
- ``encoder-epoch-99-avg-1.onnx``
- ``decoder-epoch-99-avg-1.onnx``
- ``joiner-epoch-99-avg-1.onnx``
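You can quickly confirm that the export produced the expected files (a sketch, assuming the paths above):
.. code-block:: bash
ls -lh $repo/exp/*-epoch-99-avg-1.onnx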
Decode sound files with exported ONNX models
--------------------------------------------
.. code-block:: bash
./pruned_transducer_stateless7_streaming/onnx_pretrained.py \
--encoder-model-filename $repo/exp/encoder-epoch-99-avg-1.onnx \
--decoder-model-filename $repo/exp/decoder-epoch-99-avg-1.onnx \
--joiner-model-filename $repo/exp/joiner-epoch-99-avg-1.onnx \
--tokens $repo/data/lang_bpe_500/tokens.txt \
$repo/test_wavs/1089-134686-0001.wav

.. _export-model-with-torch-jit-script:
Export model with torch.jit.script()
====================================
In this section, we describe how to export a model via
``torch.jit.script()``.
When to use it
--------------
If we want to use our trained model with torchscript,
we can use ``torch.jit.script()``.
.. hint::
See :ref:`export-model-with-torch-jit-trace`
if you want to use ``torch.jit.trace()``.
How to export
-------------
We use
`<https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless3>`_
as an example in the following.
.. code-block:: bash
cd egs/librispeech/ASR
epoch=14
avg=1
./pruned_transducer_stateless3/export.py \
--exp-dir ./pruned_transducer_stateless3/exp \
--tokens data/lang_bpe_500/tokens.txt \
--epoch $epoch \
--avg $avg \
--jit 1
It will generate a file ``cpu_jit.pt`` in ``pruned_transducer_stateless3/exp``.
.. caution::
Don't be confused by ``cpu`` in ``cpu_jit.pt``. We move all parameters
to CPU before saving it into a ``pt`` file; that's why we use ``cpu``
in the filename.
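As a quick sanity check (a sketch; run from ``egs/librispeech/ASR``), you can verify that the exported file loads as a TorchScript module:
.. code-block:: bash
python3 -c "import torch; m = torch.jit.load('pruned_transducer_stateless3/exp/cpu_jit.pt'); print(type(m))"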
How to use the exported model
-----------------------------
Please refer to the following pages for usage:
- `<https://k2-fsa.github.io/sherpa/python/streaming_asr/emformer/index.html>`_
- `<https://k2-fsa.github.io/sherpa/python/streaming_asr/conv_emformer/index.html>`_
- `<https://k2-fsa.github.io/sherpa/python/streaming_asr/conformer/index.html>`_
- `<https://k2-fsa.github.io/sherpa/python/offline_asr/conformer/index.html>`_
- `<https://k2-fsa.github.io/sherpa/cpp/offline_asr/gigaspeech.html>`_
- `<https://k2-fsa.github.io/sherpa/cpp/offline_asr/wenetspeech.html>`_

.. _export-model-with-torch-jit-trace:
Export model with torch.jit.trace()
===================================
In this section, we describe how to export a model via
``torch.jit.trace()``.
When to use it
--------------
If we want to use our trained model with torchscript,
we can use ``torch.jit.trace()``.
.. hint::
See :ref:`export-model-with-torch-jit-script`
if you want to use ``torch.jit.script()``.
How to export
-------------
We use
`<https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/lstm_transducer_stateless2>`_
as an example in the following.
.. code-block:: bash
iter=468000
avg=16
cd egs/librispeech/ASR
./lstm_transducer_stateless2/export.py \
--exp-dir ./lstm_transducer_stateless2/exp \
--tokens data/lang_bpe_500/tokens.txt \
--iter $iter \
--avg $avg \
--jit-trace 1
It will generate three files inside ``lstm_transducer_stateless2/exp``:
- ``encoder_jit_trace.pt``
- ``decoder_jit_trace.pt``
- ``joiner_jit_trace.pt``
You can use
`<https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/lstm_transducer_stateless2/jit_pretrained.py>`_
to decode sound files with the following commands:
.. code-block:: bash
cd egs/librispeech/ASR
./lstm_transducer_stateless2/jit_pretrained.py \
--bpe-model ./data/lang_bpe_500/bpe.model \
--encoder-model-filename ./lstm_transducer_stateless2/exp/encoder_jit_trace.pt \
--decoder-model-filename ./lstm_transducer_stateless2/exp/decoder_jit_trace.pt \
--joiner-model-filename ./lstm_transducer_stateless2/exp/joiner_jit_trace.pt \
/path/to/foo.wav \
/path/to/bar.wav \
/path/to/baz.wav
How to use the exported models
------------------------------
Please refer to
`sherpa <https://k2-fsa.github.io/sherpa/python/streaming_asr/lstm/index.html>`_
for its usage.
You can also find pretrained models there.

Model export
============
In this section, we describe various ways to export models.
.. toctree::
export-model-state-dict
export-with-torch-jit-trace
export-with-torch-jit-script
export-onnx
export-ncnn

Finetune from a pre-trained Zipformer model with adapters
=========================================================
This tutorial shows you how to fine-tune a pre-trained **Zipformer**
transducer model on a new dataset with adapters.
Adapters are compact and efficient modules that can be integrated into a pre-trained model
to improve the model's performance on a new domain. Adapters are injected
between different modules of the well-trained neural network. During training, only the parameters
of the adapters are updated. This achieves competitive performance
while requiring much less GPU memory than full fine-tuning. For more details about adapters,
please refer to the original `paper <https://arxiv.org/pdf/1902.00751.pdf#/>`_.
.. HINT::
We assume you have read the page :ref:`install icefall` and have setup
the environment for ``icefall``.
.. HINT::
We recommend using a GPU or several GPUs to run this recipe.
For illustration purposes, we fine-tune the Zipformer transducer model
pre-trained on `LibriSpeech`_ on the small subset of `GigaSpeech`_. You could use your
own data for fine-tuning if you create a manifest for your new dataset.
Data preparation
----------------
Please follow the instructions in the `GigaSpeech recipe <https://github.com/k2-fsa/icefall/tree/master/egs/gigaspeech/ASR>`_
to prepare the fine-tuning data used in this tutorial; we only require the small subset of GigaSpeech.
Model preparation
-----------------
We are using the Zipformer model trained on full LibriSpeech (960 hours) as the initialization. The
checkpoint of the model can be downloaded via the following command:
.. code-block:: bash
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15
$ cd icefall-asr-librispeech-zipformer-2023-05-15/exp
$ git lfs pull --include "pretrained.pt"
$ ln -s pretrained.pt epoch-99.pt
$ cd ../data/lang_bpe_500
$ git lfs pull --include bpe.model
$ cd ../../..
Before fine-tuning, let's test the model's WER on the new domain. The following command performs
decoding on the GigaSpeech test sets:
.. code-block:: bash
./zipformer/decode_gigaspeech.py \
--epoch 99 \
--avg 1 \
--exp-dir icefall-asr-librispeech-zipformer-2023-05-15/exp \
--use-averaged-model 0 \
--max-duration 1000 \
--decoding-method greedy_search
You should see the following numbers:
.. code-block::
For dev, WER of different settings are:
greedy_search 20.06 best for dev
For test, WER of different settings are:
greedy_search 19.27 best for test
Fine-tune with adapter
----------------------
We insert 4 adapters with residual connections in each ``Zipformer2EncoderLayer``.
The original model parameters remain untouched during training and only the parameters of
the adapters are updated. The following command starts a fine-tuning experiment with adapters:
.. code-block:: bash
$ do_finetune=1
$ use_adapters=1
$ adapter_dim=8
$ ./zipformer_adapter/train.py \
--world-size 2 \
--num-epochs 20 \
--start-epoch 1 \
--exp-dir zipformer_adapter/exp_giga_finetune_adapters${use_adapters}_adapter_dim${adapter_dim} \
--use-fp16 1 \
--base-lr 0.045 \
--use-adapters $use_adapters --adapter-dim $adapter_dim \
--bpe-model data/lang_bpe_500/bpe.model \
--do-finetune $do_finetune \
--master-port 13022 \
--finetune-ckpt icefall-asr-librispeech-zipformer-2023-05-15/exp/pretrained.pt \
--max-duration 1000
The following arguments are related to fine-tuning:
- ``--do-finetune``
If True, do fine-tuning by initializing the model from a pre-trained checkpoint.
**Note that if you want to resume your fine-tuning experiment from certain epochs, you
need to set this to False.**
- ``--use-adapters``
Whether adapters are used during fine-tuning.
- ``--adapter-dim``
The bottleneck dimension of the adapter module. Typically a small number.
You should notice that in the training log, the total number of trainable parameters is shown:
.. code-block::
2024-02-22 21:22:03,808 INFO [train.py:1277] A total of 761344 trainable parameters (1.148% of the whole model)
The trainable parameters make up only 1.15% of the entire model, so training will be much faster
and require less memory than full fine-tuning.
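As a sanity check on that percentage (a sketch): 761344 trainable parameters at 1.148% imply a total of roughly 66 M parameters, i.e., the 65.5 M base model plus the adapters:
.. code-block:: bash
awk 'BEGIN { printf "total: %.1f M parameters\n", 761344 / 0.01148 / 1e6 }'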
Decoding
--------
After training, let's test the WERs on the GigaSpeech test sets. You can
execute the following command:
.. code-block:: bash
$ epoch=20
$ avg=10
$ use_adapters=1
$ adapter_dim=8
$ ./zipformer/decode.py \
--epoch $epoch \
--avg $avg \
--use-averaged-model 1 \
--exp-dir zipformer_adapter/exp_giga_finetune_adapters${use_adapters}_adapter_dim${adapter_dim} \
--max-duration 600 \
--use-adapters $use_adapters \
--adapter-dim $adapter_dim \
--decoding-method greedy_search
You should see the following numbers:
.. code-block::
For dev, WER of different settings are:
greedy_search 15.44 best for dev
For test, WER of different settings are:
greedy_search 15.42 best for test
The WER on the test set improves from 19.27 to 15.42, demonstrating the effectiveness of adapters.
The same model can be used to perform decoding on the LibriSpeech test sets. You can deactivate the adapters
to retain the performance of the original model:
.. code-block:: bash
$ epoch=20
$ avg=1
$ use_adapters=0
$ adapter_dim=8
$ ./zipformer/decode.py \
--epoch $epoch \
--avg $avg \
--use-averaged-model 1 \
--exp-dir zipformer_adapter/exp_giga_finetune_adapters${use_adapters}_adapter_dim${adapter_dim} \
--max-duration 600 \
--use-adapters $use_adapters \
--adapter-dim $adapter_dim \
--decoding-method greedy_search
.. code-block::
For dev, WER of different settings are:
greedy_search 2.23 best for test-clean
For test, WER of different settings are:
greedy_search 4.96 best for test-other
The numbers are the same as reported in `icefall <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md#normal-scaled-model-number-of-model-parameters-65549011-ie-6555-m>`_. So adapter-based
fine-tuning is also very flexible, as the same model can be used for decoding on both the original and the target domain.
Export the model
----------------
After training, the model can easily be exported to ``onnx`` format using the following command:
.. code-block:: bash
$ use_adapters=1
$ adapter_dim=16
$ ./zipformer_adapter/export-onnx.py \
--tokens icefall-asr-librispeech-zipformer-2023-05-15/data/lang_bpe_500/tokens.txt \
--use-averaged-model 1 \
--epoch 20 \
--avg 10 \
--exp-dir zipformer_adapter/exp_giga_finetune_adapters${use_adapters}_adapter_dim${adapter_dim} \
--use-adapters $use_adapters \
--adapter-dim $adapter_dim \
--num-encoder-layers "2,2,3,4,3,2" \
--downsampling-factor "1,2,4,8,4,2" \
--feedforward-dim "512,768,1024,1536,1024,768" \
--num-heads "4,4,4,8,4,4" \
--encoder-dim "192,256,384,512,384,256" \
--query-head-dim 32 \
--value-head-dim 12 \
--pos-head-dim 4 \
--pos-dim 48 \
--encoder-unmasked-dim "192,192,256,256,256,192" \
--cnn-module-kernel "31,31,15,15,15,31" \
--decoder-dim 512 \
--joiner-dim 512 \
--causal False \
--chunk-size "16,32,64,-1" \
--left-context-frames "64,128,256,-1"

Finetune from a supervised pre-trained Zipformer model
======================================================
This tutorial shows you how to fine-tune a supervised pre-trained **Zipformer**
transducer model on a new dataset.
.. HINT::
We assume you have read the page :ref:`install icefall` and have setup
the environment for ``icefall``.
.. HINT::
We recommend using a GPU or several GPUs to run this recipe.
For illustration purposes, we fine-tune the Zipformer transducer model
pre-trained on `LibriSpeech`_ on the small subset of `GigaSpeech`_. You could use your
own data for fine-tuning if you create a manifest for your new dataset.
Data preparation
----------------
Please follow the instructions in the `GigaSpeech recipe <https://github.com/k2-fsa/icefall/tree/master/egs/gigaspeech/ASR>`_
to prepare the fine-tuning data used in this tutorial; we only require the small subset of GigaSpeech.
Model preparation
-----------------
We are using the Zipformer model trained on full LibriSpeech (960 hours) as the initialization. The
checkpoint of the model can be downloaded via the following command:
.. code-block:: bash
$ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15
$ cd icefall-asr-librispeech-zipformer-2023-05-15/exp
$ git lfs pull --include "pretrained.pt"
$ ln -s pretrained.pt epoch-99.pt
$ cd ../data/lang_bpe_500
$ git lfs pull --include bpe.model
$ cd ../../..
Before fine-tuning, let's test the model's WER on the new domain. The following command performs
decoding on the GigaSpeech test sets:
.. code-block:: bash
./zipformer/decode_gigaspeech.py \
--epoch 99 \
--avg 1 \
--exp-dir icefall-asr-librispeech-zipformer-2023-05-15/exp \
--use-averaged-model 0 \
--max-duration 1000 \
--decoding-method greedy_search
You should see the following numbers:
.. code-block::
For dev, WER of different settings are:
greedy_search 20.06 best for dev
For test, WER of different settings are:
greedy_search 19.27 best for test
Fine-tune
---------
Since LibriSpeech and GigaSpeech are both English datasets, we can initialize the whole
Zipformer model with the checkpoint downloaded in the previous step (otherwise we should consider
initializing the stateless decoder and joiner from scratch due to the mismatch of the output
vocabulary). The following command starts a fine-tuning experiment:
.. code-block:: bash
$ use_mux=0
$ do_finetune=1
$ ./zipformer/finetune.py \
--world-size 2 \
--num-epochs 20 \
--start-epoch 1 \
--exp-dir zipformer/exp_giga_finetune${do_finetune}_mux${use_mux} \
--use-fp16 1 \
--base-lr 0.0045 \
--bpe-model data/lang_bpe_500/bpe.model \
--do-finetune $do_finetune \
--use-mux $use_mux \
--master-port 13024 \
--finetune-ckpt icefall-asr-librispeech-zipformer-2023-05-15/exp/pretrained.pt \
--max-duration 1000
The following arguments are related to fine-tuning:
- ``--base-lr``
The learning rate used for fine-tuning. We suggest setting a **small** learning rate,
otherwise the model may forget the initialization very quickly. A reasonable value is around
1/10 of the original learning rate, i.e., 0.0045.
- ``--do-finetune``
If True, do fine-tuning by initializing the model from a pre-trained checkpoint.
**Note that if you want to resume your fine-tuning experiment from certain epochs, you
need to set this to False.** (See the sketch after this list.)
- ``--finetune-ckpt``
The path to the pre-trained checkpoint (used for initialization).
- ``--use-mux``
If True, mix the fine-tuning data with the original training data by using `CutSet.mux <https://lhotse.readthedocs.io/en/latest/api.html#lhotse.supervision.SupervisionSet.mux>`_.
This helps maintain the model's performance on the original domain if the original training
data is available. **If you don't have the original training data, please set it to False.**
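For instance, here is a sketch of resuming the fine-tuning experiment above from epoch 11. Note ``--do-finetune 0``, so that training continues from the checkpoints in ``--exp-dir`` instead of re-initializing from ``--finetune-ckpt``:
.. code-block:: bash
$ ./zipformer/finetune.py \
--world-size 2 \
--num-epochs 20 \
--start-epoch 11 \
--exp-dir zipformer/exp_giga_finetune1_mux0 \
--use-fp16 1 \
--base-lr 0.0045 \
--bpe-model data/lang_bpe_500/bpe.model \
--do-finetune 0 \
--max-duration 1000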
After fine-tuning, let's test the WERs. You can do this via the following command:
.. code-block:: bash
$ use_mux=0
$ do_finetune=1
$ ./zipformer/decode_gigaspeech.py \
--epoch 20 \
--avg 10 \
--exp-dir zipformer/exp_giga_finetune${do_finetune}_mux${use_mux} \
--use-averaged-model 1 \
--max-duration 1000 \
--decoding-method greedy_search
You should see numbers similar to the ones below:
.. code-block:: text
For dev, WER of different settings are:
greedy_search 13.47 best for dev
For test, WER of different settings are:
greedy_search 13.66 best for test
Compared to the original checkpoint, the fine-tuned model achieves much lower WERs
on the GigaSpeech test sets.

Fine-tune a pre-trained model
=============================
After pre-training on publicly available datasets, the ASR model is already capable of
performing general speech recognition with relatively high accuracy. However, the accuracy
could still be low on certain domains that are quite different from the original training
set. In this case, we can fine-tune the model with a small amount of additional labelled
data to improve the performance on new domains.
.. toctree::
:maxdepth: 2
:caption: Table of Contents
from_supervised/finetune_zipformer
adapter/finetune_adapter

Conformer CTC
=============
This tutorial shows you how to run a Conformer CTC model
with the `Aishell <https://www.openslr.org/33>`_ dataset.
.. HINT::
We assume you have read the page :ref:`install icefall` and have setup
the environment for ``icefall``.
.. HINT::
We recommend using a GPU or several GPUs to run this recipe.
In this tutorial, you will learn:
- (1) How to prepare data for training and decoding
- (2) How to start the training, either with a single GPU or multiple GPUs
- (3) How to do decoding after training, with ctc-decoding, 1best and attention decoder rescoring
- (4) How to use a pre-trained model, provided by us
Data preparation
----------------
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./prepare.sh
The script ``./prepare.sh`` handles the data preparation for you, **automagically**.
All you need to do is to run it.
The data preparation contains several stages. You can use the following two
options:
- ``--stage``
- ``--stop-stage``
to control which stage(s) should be run. By default, all stages are executed.
For example,
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./prepare.sh --stage 0 --stop-stage 0
means to run only stage 0.
To run stage 2 to stage 5, use:
.. code-block:: bash
$ ./prepare.sh --stage 2 --stop-stage 5
.. HINT::
If you have pre-downloaded the `Aishell <https://www.openslr.org/33>`_
dataset and the `musan <http://www.openslr.org/17/>`_ dataset, say,
they are saved in ``/tmp/aishell`` and ``/tmp/musan``, you can modify
the ``dl_dir`` variable in ``./prepare.sh`` to point to ``/tmp`` so that
``./prepare.sh`` won't re-download them.
.. HINT::
A 3-gram language model will be downloaded from HuggingFace; we assume you have
installed and initialized ``git-lfs``. If not, you can install ``git-lfs`` by:
.. code-block:: bash
$ sudo apt-get install git-lfs
$ git-lfs install
If you don't have ``sudo`` permission, you can download the
`git-lfs binary <https://github.com/git-lfs/git-lfs/releases>`_ and add it to your ``PATH``.
.. NOTE::
All files generated by ``./prepare.sh``, e.g., features, lexicon, etc.,
are saved in the ``./data`` directory.
Training
--------
Configurable options
~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./conformer_ctc/train.py --help
shows you the training options that can be passed from the commandline.
The following options are used quite often:
- ``--exp-dir``
The experiment folder to save logs and model checkpoints,
default ``./conformer_ctc/exp``.
- ``--num-epochs``
It is the number of epochs to train. For instance,
``./conformer_ctc/train.py --num-epochs 30`` trains for 30 epochs
and generates ``epoch-0.pt``, ``epoch-1.pt``, ..., ``epoch-29.pt``
in the folder set by ``--exp-dir``.
- ``--start-epoch``
It's used to resume training.
``./conformer_ctc/train.py --start-epoch 10`` loads the
checkpoint ``./conformer_ctc/exp/epoch-9.pt`` and starts
training from epoch 10, based on the state from epoch 9.
- ``--world-size``
It is used for multi-GPU single-machine DDP training.
- (a) If it is 1, then no DDP training is used.
- (b) If it is 2, then GPU 0 and GPU 1 are used for DDP training.
The following shows some use cases with it.
**Use case 1**: You have 4 GPUs, but you only want to use GPU 0 and
GPU 2 for training. You can do the following:
.. code-block:: bash
$ cd egs/aishell/ASR
$ export CUDA_VISIBLE_DEVICES="0,2"
$ ./conformer_ctc/train.py --world-size 2
**Use case 2**: You have 4 GPUs and you want to use all of them
for training. You can do the following:
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./conformer_ctc/train.py --world-size 4
**Use case 3**: You have 4 GPUs but you only want to use GPU 3
for training. You can do the following:
.. code-block:: bash
$ cd egs/aishell/ASR
$ export CUDA_VISIBLE_DEVICES="3"
$ ./conformer_ctc/train.py --world-size 1
.. CAUTION::
Only multi-GPU single-machine DDP training is implemented at present.
Multi-GPU multi-machine DDP training will be added later.
- ``--max-duration``
It specifies the number of seconds over all utterances in a
batch, before **padding**.
If you encounter CUDA OOM, please reduce it. For instance, if
you are using a V100 NVIDIA GPU, we recommend setting it to ``200``.
.. HINT::
Due to padding, the number of seconds of all utterances in a
batch will usually be larger than ``--max-duration``.
A larger value for ``--max-duration`` may cause OOM during training,
while a smaller value may increase the training time. You have to
tune it.
Pre-configured options
~~~~~~~~~~~~~~~~~~~~~~
There are some training options, e.g., weight decay,
number of warmup steps, etc.,
that are not passed from the commandline.
They are pre-configured by the function ``get_params()`` in
`conformer_ctc/train.py <https://github.com/k2-fsa/icefall/blob/master/egs/aishell/ASR/conformer_ctc/train.py>`_
You don't need to change these pre-configured parameters. If you really need to change
them, please modify ``./conformer_ctc/train.py`` directly.
.. CAUTION::
The training set is perturbed by speed with two factors: 0.9 and 1.1.
Each epoch actually processes ``3x150 == 450`` hours of data.
Training logs
~~~~~~~~~~~~~
Training logs and checkpoints are saved in the folder set by ``--exp-dir``
(default ``conformer_ctc/exp``). You will find the following files in that directory:
- ``epoch-0.pt``, ``epoch-1.pt``, ...
These are checkpoint files, containing model ``state_dict`` and optimizer ``state_dict``.
To resume training from some checkpoint, say ``epoch-10.pt``, you can use:
.. code-block:: bash
$ ./conformer_ctc/train.py --start-epoch 11
- ``tensorboard/``
This folder contains TensorBoard logs. Training loss, validation loss, learning
rate, etc, are recorded in these logs. You can visualize them by:
.. code-block:: bash
$ cd conformer_ctc/exp/tensorboard
$ tensorboard dev upload --logdir . --name "Aishell conformer ctc training with icefall" --description "Training with new LabelSmoothing loss, see https://github.com/k2-fsa/icefall/pull/109"
It will print something like below:
.. code-block::
TensorFlow installation not found - running with reduced feature set.
Upload started and will continue reading any new data as it's added to the logdir.
To stop uploading, press Ctrl-C.
New experiment created. View your TensorBoard at: https://tensorboard.dev/experiment/engw8KSkTZqS24zBV5dgCg/
[2021-11-22T11:09:27] Started scanning logdir.
[2021-11-22T11:10:14] Total uploaded: 116068 scalars, 0 tensors, 0 binary objects
Listening for new data in logdir...
Note there is a URL in the above output; click it and you will see
the following screenshot:
.. figure:: images/aishell-conformer-ctc-tensorboard-log.jpg
:width: 600
:alt: TensorBoard screenshot
:align: center
:target: https://tensorboard.dev/experiment/WE1DocDqRRCOSAgmGyClhg/
TensorBoard screenshot.
- ``log/log-train-xxxx``
It is the detailed training log in text format, same as the one
you saw printed to the console during training.
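As an alternative to the ``tensorboard dev upload`` command shown above, you can also inspect the logs locally; a minimal sketch (the port number is arbitrary):
.. code-block:: bash
$ tensorboard --logdir conformer_ctc/exp/tensorboard --port 6006
$ # then open http://localhost:6006 in your browser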
Usage examples
~~~~~~~~~~~~~~
The following shows typical use cases:
**Case 1**
^^^^^^^^^^
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./conformer_ctc/train.py --max-duration 200
It uses ``--max-duration`` of 200 to avoid OOM.
**Case 2**
^^^^^^^^^^
.. code-block:: bash
$ cd egs/aishell/ASR
$ export CUDA_VISIBLE_DEVICES="0,3"
$ ./conformer_ctc/train.py --world-size 2
It uses GPU 0 and GPU 3 for DDP training.
**Case 3**
^^^^^^^^^^
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./conformer_ctc/train.py --num-epochs 10 --start-epoch 3
It loads checkpoint ``./conformer_ctc/exp/epoch-2.pt`` and starts
training from epoch 3. Also, it trains for 10 epochs.
Decoding
--------
The decoding part uses checkpoints saved by the training part, so you have
to run the training part first.
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./conformer_ctc/decode.py --help
shows the options for decoding.
The commonly used options are:
- ``--method``
This specifies the decoding method.
The following command uses the attention decoder for rescoring:
.. code-block::
$ cd egs/aishell/ASR
$ ./conformer_ctc/decode.py --method attention-decoder --max-duration 30 --nbest-scale 0.5
- ``--nbest-scale``
It is used to scale down lattice scores so that there are more unique
paths for rescoring.
- ``--max-duration``
It has the same meaning as the one during training. A larger
value may cause OOM.
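For comparison, a hedged sketch of the other two decoding methods mentioned at the beginning of this tutorial, ``ctc-decoding`` and ``1best`` (the ``--max-duration`` value is illustrative):
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./conformer_ctc/decode.py --method ctc-decoding --max-duration 30
$ ./conformer_ctc/decode.py --method 1best --max-duration 30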
Pre-trained Model
-----------------
We have uploaded a pre-trained model to
`<https://huggingface.co/pkufool/icefall_asr_aishell_conformer_ctc>`_.
We describe how to use the pre-trained model to transcribe a sound file or
multiple sound files in the following.
Install kaldifeat
~~~~~~~~~~~~~~~~~
`kaldifeat <https://github.com/csukuangfj/kaldifeat>`_ is used to
extract features for a single sound file or multiple sound files
at the same time.
Please refer to `<https://github.com/csukuangfj/kaldifeat>`_ for installation.
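One common installation route is via ``pip``; a sketch, assuming a compatible PyTorch is already installed (see the kaldifeat repository for alternatives):
.. code-block:: bash
$ pip install kaldifeat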
Download the pre-trained model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The following commands describe how to download the pre-trained model:
.. code-block::
$ cd egs/aishell/ASR
$ mkdir tmp
$ cd tmp
$ git lfs install
$ git clone https://huggingface.co/pkufool/icefall_asr_aishell_conformer_ctc
.. CAUTION::
You have to use ``git lfs`` to download the pre-trained model.
.. CAUTION::
In order to use this pre-trained model, your k2 version has to be v1.7 or later.
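To check which k2 version you have installed, the following sketch should work (``k2.version`` is a module shipped with k2):
.. code-block:: bash
$ python3 -m k2.version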
After downloading, you will have the following files:
.. code-block:: bash
$ cd egs/aishell/ASR
$ tree tmp
.. code-block:: bash
tmp/
`-- icefall_asr_aishell_conformer_ctc
|-- README.md
|-- data
| `-- lang_char
| |-- HLG.pt
| |-- tokens.txt
| `-- words.txt
|-- exp
| `-- pretrained.pt
`-- test_waves
|-- BAC009S0764W0121.wav
|-- BAC009S0764W0122.wav
|-- BAC009S0764W0123.wav
`-- trans.txt
5 directories, 9 files
**File descriptions**:
- ``data/lang_char/HLG.pt``
It is the decoding graph.
- ``data/lang_char/tokens.txt``
It contains tokens and their IDs.
Provided only for convenience so that you can look up the SOS/EOS ID easily.
- ``data/lang_char/words.txt``
It contains words and their IDs.
- ``exp/pretrained.pt``
It contains pre-trained model parameters, obtained by averaging
checkpoints from ``epoch-25.pt`` to ``epoch-84.pt``.
Note: We have removed optimizer ``state_dict`` to reduce file size.
- ``test_waves/*.wav``
It contains some test sound files from Aishell ``test`` dataset.
- ``test_waves/trans.txt``
It contains the reference transcripts for the sound files in ``test_waves/``.
Information about the test sound files is listed below:
.. code-block:: bash
$ soxi tmp/icefall_asr_aishell_conformer_ctc/test_waves/*.wav
Input File : 'tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0121.wav'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:04.20 = 67263 samples ~ 315.295 CDDA sectors
File Size : 135k
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM
Input File : 'tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0122.wav'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:04.12 = 65840 samples ~ 308.625 CDDA sectors
File Size : 132k
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM
Input File : 'tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0123.wav'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:04.00 = 64000 samples ~ 300 CDDA sectors
File Size : 128k
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM
Total Duration of 3 files: 00:00:12.32
Usage
~~~~~
.. code-block::
$ cd egs/aishell/ASR
$ ./conformer_ctc/pretrained.py --help
displays the help information.
It supports three decoding methods:
- CTC decoding
- HLG decoding
- HLG + attention decoder rescoring
CTC decoding
^^^^^^^^^^^^
CTC decoding uses only the CTC topology for decoding, without a lexicon or language model.
The command to run CTC decoding is:
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./conformer_ctc/pretrained.py \
--checkpoint ./tmp/icefall_asr_aishell_conformer_ctc/exp/pretrained.pt \
--tokens-file ./tmp/icefall_asr_aishell_conformer_ctc/data/lang_char/tokens.txt \
--method ctc-decoding \
./tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0121.wav \
./tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0122.wav \
./tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0123.wav
The output is given below:
.. code-block::
2021-11-18 07:53:41,707 INFO [pretrained.py:229] {'sample_rate': 16000, 'subsampling_factor': 4, 'feature_dim': 80, 'nhead': 4, 'attention_dim': 512, 'num_decoder_layers': 6, 'vgg_frontend': False, 'use_feat_batchnorm': True, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'env_info': {'k2-version': '1.9', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'f2fd997f752ed11bbef4c306652c433e83f9cf12', 'k2-git-date': 'Sun Sep 19 09:41:46 2021', 'lhotse-version': '0.11.0.dev+git.33cfe45.clean', 'torch-cuda-available': True, 'torch-cuda-version': '10.1', 'python-version': '3.8', 'icefall-git-branch': 'aishell', 'icefall-git-sha1': 'd57a873-dirty', 'icefall-git-date': 'Wed Nov 17 19:53:25 2021', 'icefall-path': '/ceph-hw/kangwei/code/icefall_aishell3', 'k2-path': '/ceph-hw/kangwei/code/k2_release/k2/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-hw/kangwei/code/lhotse/lhotse/__init__.py'}, 'checkpoint': './tmp/icefall_asr_aishell_conformer_ctc/exp/pretrained.pt', 'tokens_file': './tmp/icefall_asr_aishell_conformer_ctc/data/lang_char/tokens.txt', 'words_file': None, 'HLG': None, 'method': 'ctc-decoding', 'num_paths': 100, 'ngram_lm_scale': 0.3, 'attention_decoder_scale': 0.9, 'nbest_scale': 0.5, 'sos_id': 1, 'eos_id': 1, 'num_classes': 4336, 'sound_files': ['./tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0121.wav', './tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0122.wav', './tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0123.wav']}
2021-11-18 07:53:41,708 INFO [pretrained.py:240] device: cuda:0
2021-11-18 07:53:41,708 INFO [pretrained.py:242] Creating model
2021-11-18 07:53:51,131 INFO [pretrained.py:259] Constructing Fbank computer
2021-11-18 07:53:51,134 INFO [pretrained.py:269] Reading sound files: ['./tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0121.wav', './tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0122.wav', './tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0123.wav']
2021-11-18 07:53:51,138 INFO [pretrained.py:275] Decoding started
2021-11-18 07:53:51,241 INFO [pretrained.py:293] Use CTC decoding
2021-11-18 07:53:51,704 INFO [pretrained.py:369]
./tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0121.wav:
甚 至 出 现 交 易 几 乎 停 止 的 情 况
./tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0122.wav:
一 二 线 城 市 虽 然 也 处 于 调 整 中
./tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0123.wav:
但 因 为 聚 集 了 过 多 公 共 资 源
2021-11-18 07:53:51,704 INFO [pretrained.py:371] Decoding Done
HLG decoding
^^^^^^^^^^^^
HLG decoding uses the best path of the decoding lattice as the decoding result.
The command to run HLG decoding is:
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./conformer_ctc/pretrained.py \
--checkpoint ./tmp/icefall_asr_aishell_conformer_ctc/exp/pretrained.pt \
--words-file ./tmp/icefall_asr_aishell_conformer_ctc/data/lang_char/words.txt \
--HLG ./tmp/icefall_asr_aishell_conformer_ctc/data/lang_char/HLG.pt \
--method 1best \
./tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0121.wav \
./tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0122.wav \
./tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0123.wav
The output is given below:
.. code-block::
2021-11-18 07:37:38,683 INFO [pretrained.py:229] {'sample_rate': 16000, 'subsampling_factor': 4, 'feature_dim': 80, 'nhead': 4, 'attention_dim': 512, 'num_decoder_layers': 6, 'vgg_frontend': False, 'use_feat_batchnorm': True, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'env_info': {'k2-version': '1.9', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'f2fd997f752ed11bbef4c306652c433e83f9cf12', 'k2-git-date': 'Sun Sep 19 09:41:46 2021', 'lhotse-version': '0.11.0.dev+git.33cfe45.clean', 'torch-cuda-available': True, 'torch-cuda-version': '10.1', 'python-version': '3.8', 'icefall-git-branch': 'aishell', 'icefall-git-sha1': 'd57a873-clean', 'icefall-git-date': 'Wed Nov 17 19:53:25 2021', 'icefall-path': '/ceph-hw/kangwei/code/icefall_aishell3', 'k2-path': '/ceph-hw/kangwei/code/k2_release/k2/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-hw/kangwei/code/lhotse/lhotse/__init__.py'}, 'checkpoint': './tmp/icefall_asr_aishell_conformer_ctc/exp/pretrained.pt', 'tokens_file': None, 'words_file': './tmp/icefall_asr_aishell_conformer_ctc/data/lang_char/words.txt', 'HLG': './tmp/icefall_asr_aishell_conformer_ctc/data/lang_char/HLG.pt', 'method': '1best', 'num_paths': 100, 'ngram_lm_scale': 0.3, 'attention_decoder_scale': 0.9, 'nbest_scale': 0.5, 'sos_id': 1, 'eos_id': 1, 'num_classes': 4336, 'sound_files': ['./tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0121.wav', './tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0122.wav', './tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0123.wav']}
2021-11-18 07:37:38,684 INFO [pretrained.py:240] device: cuda:0
2021-11-18 07:37:38,684 INFO [pretrained.py:242] Creating model
2021-11-18 07:37:47,651 INFO [pretrained.py:259] Constructing Fbank computer
2021-11-18 07:37:47,654 INFO [pretrained.py:269] Reading sound files: ['./tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0121.wav', './tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0122.wav', './tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0123.wav']
2021-11-18 07:37:47,659 INFO [pretrained.py:275] Decoding started
2021-11-18 07:37:47,752 INFO [pretrained.py:321] Loading HLG from ./tmp/icefall_asr_aishell_conformer_ctc/data/lang_char/HLG.pt
2021-11-18 07:37:51,887 INFO [pretrained.py:340] Use HLG decoding
2021-11-18 07:37:52,102 INFO [pretrained.py:370]
./tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0121.wav:
甚至 出现 交易 几乎 停止 的 情况
./tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0122.wav:
一二 线 城市 虽然 也 处于 调整 中
./tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0123.wav:
但 因为 聚集 了 过多 公共 资源
2021-11-18 07:37:52,102 INFO [pretrained.py:372] Decoding Done
HLG decoding + attention decoder rescoring
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
It extracts ``n`` paths from the lattice, rescores the extracted paths with
an attention decoder, and uses the path with the highest score as the decoding result.
The command to run HLG decoding + attention decoder rescoring is:
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./conformer_ctc/pretrained.py \
--checkpoint ./tmp/icefall_asr_aishell_conformer_ctc/exp/pretrained.pt \
--words-file ./tmp/icefall_asr_aishell_conformer_ctc/data/lang_char/words.txt \
--HLG ./tmp/icefall_asr_aishell_conformer_ctc/data/lang_char/HLG.pt \
--method attention-decoder \
./tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0121.wav \
./tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0122.wav \
./tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0123.wav
The output is below:
.. code-block::
2021-11-18 07:42:05,965 INFO [pretrained.py:229] {'sample_rate': 16000, 'subsampling_factor': 4, 'feature_dim': 80, 'nhead': 4, 'attention_dim': 512, 'num_decoder_layers': 6, 'vgg_frontend': False, 'use_feat_batchnorm': True, 'search_beam': 20, 'output_beam': 8, 'min_active_states': 30, 'max_active_states': 10000, 'use_double_scores': True, 'env_info': {'k2-version': '1.9', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'f2fd997f752ed11bbef4c306652c433e83f9cf12', 'k2-git-date': 'Sun Sep 19 09:41:46 2021', 'lhotse-version': '0.11.0.dev+git.33cfe45.clean', 'torch-cuda-available': True, 'torch-cuda-version': '10.1', 'python-version': '3.8', 'icefall-git-branch': 'aishell', 'icefall-git-sha1': 'd57a873-dirty', 'icefall-git-date': 'Wed Nov 17 19:53:25 2021', 'icefall-path': '/ceph-hw/kangwei/code/icefall_aishell3', 'k2-path': '/ceph-hw/kangwei/code/k2_release/k2/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-hw/kangwei/code/lhotse/lhotse/__init__.py'}, 'checkpoint': './tmp/icefall_asr_aishell_conformer_ctc/exp/pretrained.pt', 'tokens_file': None, 'words_file': './tmp/icefall_asr_aishell_conformer_ctc/data/lang_char/words.txt', 'HLG': './tmp/icefall_asr_aishell_conformer_ctc/data/lang_char/HLG.pt', 'method': 'attention-decoder', 'num_paths': 100, 'ngram_lm_scale': 0.3, 'attention_decoder_scale': 0.9, 'nbest_scale': 0.5, 'sos_id': 1, 'eos_id': 1, 'num_classes': 4336, 'sound_files': ['./tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0121.wav', './tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0122.wav', './tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0123.wav']}
2021-11-18 07:42:05,966 INFO [pretrained.py:240] device: cuda:0
2021-11-18 07:42:05,966 INFO [pretrained.py:242] Creating model
2021-11-18 07:42:16,821 INFO [pretrained.py:259] Constructing Fbank computer
2021-11-18 07:42:16,822 INFO [pretrained.py:269] Reading sound files: ['./tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0121.wav', './tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0122.wav', './tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0123.wav']
2021-11-18 07:42:16,826 INFO [pretrained.py:275] Decoding started
2021-11-18 07:42:16,916 INFO [pretrained.py:321] Loading HLG from ./tmp/icefall_asr_aishell_conformer_ctc/data/lang_char/HLG.pt
2021-11-18 07:42:21,115 INFO [pretrained.py:345] Use HLG + attention decoder rescoring
2021-11-18 07:42:21,888 INFO [pretrained.py:370]
./tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0121.wav:
甚至 出现 交易 几乎 停止 的 情况
./tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0122.wav:
一二 线 城市 虽然 也 处于 调整 中
./tmp/icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0123.wav:
但 因为 聚集 了 过多 公共 资源
2021-11-18 07:42:21,889 INFO [pretrained.py:372] Decoding Done
Colab notebook
--------------
We provide a Colab notebook for this recipe, showing how to use a pre-trained model.
|aishell asr conformer ctc colab notebook|
.. |aishell asr conformer ctc colab notebook| image:: https://colab.research.google.com/assets/colab-badge.svg
:target: https://colab.research.google.com/drive/1WnG17io5HEZ0Gn_cnh_VzK5QYOoiiklC
.. HINT::
Due to limited memory provided by Colab, you have to upgrade to Colab Pro to
run ``HLG decoding + attention decoder rescoring``.
Otherwise, you can only run ``HLG decoding`` with Colab.
**Congratulations!** You have finished the Aishell ASR recipe with
Conformer CTC models in ``icefall``.
If you want to deploy your trained model in C++, please read the following section.
Deployment with C++
-------------------
This section describes how to deploy the pre-trained model in C++, without
Python dependencies.
.. HINT::
At present, it does NOT support streaming decoding.
First, let us compile k2 from source:
.. code-block:: bash
$ cd $HOME
$ git clone https://github.com/k2-fsa/k2
$ cd k2
$ git checkout v2.0-pre
.. CAUTION::
You have to switch to the branch ``v2.0-pre``!
.. code-block:: bash
$ mkdir build-release
$ cd build-release
$ cmake -DCMAKE_BUILD_TYPE=Release ..
$ make -j hlg_decode
# You will find the binary ./bin/hlg_decode
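If you are building on a machine without CUDA, a hedged variant (``K2_WITH_CUDA`` is a k2 CMake option; verify it against your k2 version):
.. code-block:: bash
$ cmake -DCMAKE_BUILD_TYPE=Release -DK2_WITH_CUDA=OFF ..
$ make -j hlg_decode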
Now you are ready to go!
Assume you have run:
.. code-block:: bash
$ cd k2/build-release
$ ln -s /path/to/icefall-asr-aishell-conformer-ctc ./
To view the usage of ``./bin/hlg_decode``, run:
.. code-block::
$ ./bin/hlg_decode
It will show you the following message:
.. code-block:: bash
Please provide --nn_model
This file implements decoding with an HLG decoding graph.
Usage:
./bin/hlg_decode \
--use_gpu true \
--nn_model <path to torch scripted pt file> \
--hlg <path to HLG.pt> \
--word_table <path to words.txt> \
<path to foo.wav> \
<path to bar.wav> \
<more waves if any>
To see all possible options, use
./bin/hlg_decode --help
Caution:
- Only sound files (*.wav) with single channel are supported.
- It assumes the model is conformer_ctc/transformer.py from icefall.
If you use a different model, you have to change the code
related to `model.forward` in this file.
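If your sound files are not single-channel WAVs sampled at 16 kHz, a minimal conversion sketch with ``sox`` (file names are placeholders):
.. code-block:: bash
$ sox input.wav -r 16000 -c 1 -b 16 output.wav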
HLG decoding
~~~~~~~~~~~~
.. code-block:: bash
./bin/hlg_decode \
--use_gpu true \
--nn_model icefall_asr_aishell_conformer_ctc/exp/cpu_jit.pt \
--hlg icefall_asr_aishell_conformer_ctc/data/lang_char/HLG.pt \
--word_table icefall_asr_aishell_conformer_ctc/data/lang_char/words.txt \
icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0121.wav \
icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0122.wav \
icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0123.wav
The output is:
.. code-block::
2021-11-18 14:48:20.89 [I] k2/torch/bin/hlg_decode.cu:115:int main(int, char**) Device: cpu
2021-11-18 14:48:20.89 [I] k2/torch/bin/hlg_decode.cu:124:int main(int, char**) Load wave files
2021-11-18 14:48:20.97 [I] k2/torch/bin/hlg_decode.cu:131:int main(int, char**) Build Fbank computer
2021-11-18 14:48:20.98 [I] k2/torch/bin/hlg_decode.cu:142:int main(int, char**) Compute features
2021-11-18 14:48:20.115 [I] k2/torch/bin/hlg_decode.cu:150:int main(int, char**) Load neural network model
2021-11-18 14:48:20.693 [I] k2/torch/bin/hlg_decode.cu:165:int main(int, char**) Compute nnet_output
2021-11-18 14:48:23.182 [I] k2/torch/bin/hlg_decode.cu:180:int main(int, char**) Load icefall_asr_aishell_conformer_ctc/data/lang_char/HLG.pt
2021-11-18 14:48:33.489 [I] k2/torch/bin/hlg_decode.cu:185:int main(int, char**) Decoding
2021-11-18 14:48:45.217 [I] k2/torch/bin/hlg_decode.cu:216:int main(int, char**)
Decoding result:
icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0121.wav
甚至 出现 交易 几乎 停止 的 情况
icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0122.wav
一二 线 城市 虽然 也 处于 调整 中
icefall_asr_aishell_conformer_ctc/test_waves/BAC009S0764W0123.wav
但 因为 聚集 了 过多 公共 资源
There is a Colab notebook showing you how to run a torch scripted model in C++.
Please see |aishell asr conformer ctc torch script colab notebook|
.. |aishell asr conformer ctc torch script colab notebook| image:: https://colab.research.google.com/assets/colab-badge.svg
:target: https://colab.research.google.com/drive/1Vh7RER7saTW01DtNbvr7CY7ovNZgmfWz?usp=sharing



@ -1,21 +0,0 @@
aishell
=======
Aishell is an open-source Chinese Mandarin speech corpus published by Beijing
Shell Shell Technology Co., Ltd.
400 people from different accent areas in China were invited to participate in
the recording, which was conducted in a quiet indoor environment using
high-fidelity microphones, with audio downsampled to 16 kHz. The manual
transcription accuracy is above 95%, achieved through professional speech
annotation and strict quality inspection. The data is free for academic use.
We hope to provide a moderate amount of data for new researchers in the field
of speech recognition.
It can be downloaded from `<https://www.openslr.org/33/>`_.
.. toctree::
:maxdepth: 1
tdnn_lstm_ctc
conformer_ctc
stateless_transducer


@ -1,714 +0,0 @@
Stateless Transducer
====================
This tutorial shows you how to do transducer training in ``icefall``.
.. HINT::
Instead of calling it RNN-T or RNN transducer, we simply say transducer
here. As you will see, there are no RNNs in the model.
.. HINT::
We assume you have read the page :ref:`install icefall` and have set up
the environment for ``icefall``.
.. HINT::
We recommend using a GPU or several GPUs to run this recipe.
In this tutorial, you will learn:
- (1) What does the transducer model look like
- (2) How to prepare data for training and decoding
- (3) How to start the training, either with a single GPU or with multiple GPUs
- (4) How to do decoding after training, with greedy search, beam search, and **modified beam search**
- (5) How to use a pre-trained model provided by us to transcribe sound files
The Model
---------
The transducer model consists of 3 parts:
- **Encoder**: It is a conformer encoder with the following parameters
- Number of heads: 8
- Attention dim: 512
- Number of layers: 12
- Feedforward dim: 2048
- **Decoder**: We use a stateless model consisting of:
- An embedding layer with embedding dim 512
- A Conv1d layer with a default kernel size 2 (i.e. it sees 2
symbols of left-context by default)
- **Joiner**: It consists of an ``nn.Tanh()`` and an ``nn.Linear()``.
.. Caution::
The decoder is stateless and very simple. It is borrowed from
`<https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9054419>`_
(RNN-Transducer with Stateless Prediction Network).
We make one modification to it: Place a Conv1d layer right after
the embedding layer.
When using Chinese characters as the modelling unit, with a vocabulary size
of 4336 for this specific dataset,
the model has ``87939824`` parameters, i.e., about ``88 M``.
The Loss
--------
We are using `<https://github.com/csukuangfj/optimized_transducer>`_
to compute the transducer loss, which removes extra paddings
in loss computation to save memory.
.. Hint::
``optimized_transducer`` implements the techniques proposed
in `Improving RNN Transducer Modeling for End-to-End Speech Recognition <https://arxiv.org/abs/1909.12415>`_ to save memory.
Furthermore, it supports ``modified transducer``, limiting the maximum
number of symbols that can be emitted per frame to 1, which simplifies
the decoding process significantly. Also, the experiment results
show that it does not degrade the performance.
See `<https://github.com/csukuangfj/optimized_transducer#modified-transducer>`_
for what exactly modified transducer is.
`<https://github.com/csukuangfj/transducer-loss-benchmarking>`_ shows that
in the unpruned case ``optimized_transducer`` has an advantage in minimizing
memory usage.
.. todo::
Add tutorial about ``pruned_transducer_stateless`` that uses k2
pruned transducer loss.
.. hint::
You can use::
pip install optimized_transducer
to install ``optimized_transducer``. Refer to
`<https://github.com/csukuangfj/optimized_transducer>`_ for other
alternatives.
Data Preparation
----------------
To prepare the data for training, please use the following commands:
.. code-block:: bash
cd egs/aishell/ASR
./prepare.sh --stop-stage 4
./prepare.sh --stage 6 --stop-stage 6
.. note::
You can use ``./prepare.sh``, though it will generate FSTs that
are not used in transducer training.
When you finish running the script, you will get the following two folders:
- ``data/fbank``: It saves the pre-computed features
- ``data/lang_char``: It contains tokens that will be used in the training
Training
--------
.. code-block:: bash
cd egs/aishell/ASR
./transducer_stateless_modified/train.py --help
shows you the training options that can be passed from the commandline.
The following options are used quite often:
- ``--exp-dir``
The experiment folder to save logs and model checkpoints,
defaults to ``./transducer_stateless_modified/exp``.
- ``--num-epochs``
It is the number of epochs to train. For instance,
``./transducer_stateless_modified/train.py --num-epochs 30`` trains for 30
epochs and generates ``epoch-0.pt``, ``epoch-1.pt``, ..., ``epoch-29.pt``
in the folder set by ``--exp-dir``.
- ``--start-epoch``
It's used to resume training.
``./transducer_stateless_modified/train.py --start-epoch 10`` loads the
checkpoint from ``exp_dir/epoch-9.pt`` and starts
training from epoch 10, based on the state from epoch 9.
- ``--world-size``
It is used for single-machine multi-GPU DDP training.
- (a) If it is 1, then no DDP training is used.
- (b) If it is 2, then GPU 0 and GPU 1 are used for DDP training.
The following shows some use cases with it.
**Use case 1**: You have 4 GPUs, but you only want to use GPU 0 and
GPU 2 for training. You can do the following:
.. code-block:: bash
$ cd egs/aishell/ASR
$ export CUDA_VISIBLE_DEVICES="0,2"
$ ./transducer_stateless_modified/train.py --world-size 2
**Use case 2**: You have 4 GPUs and you want to use all of them
for training. You can do the following:
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./transducer_stateless_modified/train.py --world-size 4
**Use case 3**: You have 4 GPUs but you only want to use GPU 3
for training. You can do the following:
.. code-block:: bash
$ cd egs/aishell/ASR
$ export CUDA_VISIBLE_DEVICES="3"
$ ./transducer_stateless_modified/train.py --world-size 1
.. CAUTION::
Only single-machine multi-GPU DDP training is implemented at present.
There is an ongoing PR `<https://github.com/k2-fsa/icefall/pull/63>`_
that adds support for multi-machine multi-GPU DDP training.
- ``--max-duration``
It specifies the number of seconds over all utterances in a
batch **before padding**.
If you encounter CUDA OOM, please reduce it. For instance, if
you are using a V100 NVIDIA GPU with 32 GB of RAM, we recommend
setting it to ``300`` when the vocabulary size is 500.
.. HINT::
Due to padding, the number of seconds of all utterances in a
batch will usually be larger than ``--max-duration``.
A larger value for ``--max-duration`` may cause OOM during training,
while a smaller value may increase the training time. You have to
tune it.
- ``--lr-factor``
It controls the learning rate. If you use a single GPU for training, you
may want to use a small value for it. If you use multiple GPUs for training,
you may increase it.
- ``--context-size``
It specifies the kernel size in the decoder. The default value 2 means it
functions as a tri-gram LM.
- ``--modified-transducer-prob``
It specifies the probability of using the modified transducer loss.
If it is 0, the modified transducer is never used; if it is 1,
the modified transducer loss is used for all batches. If it is
``p``, the modified transducer is applied with probability ``p``.
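For instance, a sketch of enabling the modified transducer loss for a quarter of the batches (the value ``0.25`` is illustrative):
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./transducer_stateless_modified/train.py --modified-transducer-prob 0.25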
There are some training options, e.g.,
number of warmup steps,
that are not passed from the commandline.
They are pre-configured by the function ``get_params()`` in
`transducer_stateless_modified/train.py <https://github.com/k2-fsa/icefall/blob/master/egs/aishell/ASR/transducer_stateless_modified/train.py#L162>`_
If you need to change them, please modify ``./transducer_stateless_modified/train.py`` directly.
.. CAUTION::
The training set is perturbed by speed with two factors: 0.9 and 1.1.
Each epoch actually processes ``3x150 == 450`` hours of data.
Training logs
~~~~~~~~~~~~~
Training logs and checkpoints are saved in the folder set by ``--exp-dir``
(defaults to ``transducer_stateless_modified/exp``). You will find the following files in that directory:
- ``epoch-0.pt``, ``epoch-1.pt``, ...
These are checkpoint files, containing model ``state_dict`` and optimizer ``state_dict``.
To resume training from some checkpoint, say ``epoch-10.pt``, you can use:
.. code-block:: bash
$ ./transducer_stateless_modified/train.py --start-epoch 11
- ``tensorboard/``
This folder contains TensorBoard logs. Training loss, validation loss, learning
rate, etc, are recorded in these logs. You can visualize them by:
.. code-block:: bash
$ cd transducer_stateless_modified/exp/tensorboard
$ tensorboard dev upload --logdir . --name "Aishell transducer training with icefall" --description "Training modified transducer, see https://github.com/k2-fsa/icefall/pull/219"
It will print something like below:
.. code-block::
TensorFlow installation not found - running with reduced feature set.
Upload started and will continue reading any new data as it's added to the logdir.
To stop uploading, press Ctrl-C.
New experiment created. View your TensorBoard at: https://tensorboard.dev/experiment/laGZ6HrcQxOigbFD5E0Y3Q/
[2022-03-03T14:29:45] Started scanning logdir.
[2022-03-03T14:29:48] Total uploaded: 8477 scalars, 0 tensors, 0 binary objects
Listening for new data in logdir...
Note there is a `URL <https://tensorboard.dev/experiment/laGZ6HrcQxOigbFD5E0Y3Q/>`_ in the
above output; click it and you will see the following screenshot:
.. figure:: images/aishell-transducer_stateless_modified-tensorboard-log.png
:width: 600
:alt: TensorBoard screenshot
:align: center
:target: https://tensorboard.dev/experiment/laGZ6HrcQxOigbFD5E0Y3Q
TensorBoard screenshot.
- ``log/log-train-xxxx``
It is the detailed training log in text format, same as the one
you saw printed to the console during training.
Usage examples
~~~~~~~~~~~~~~
The following shows typical use cases:
**Case 1**
^^^^^^^^^^
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./transducer_stateless_modified/train.py --max-duration 250
It uses ``--max-duration`` of 250 to avoid OOM.
**Case 2**
^^^^^^^^^^
.. code-block:: bash
$ cd egs/aishell/ASR
$ export CUDA_VISIBLE_DEVICES="0,3"
$ ./transducer_stateless_modified/train.py --world-size 2
It uses GPU 0 and GPU 3 for DDP training.
**Case 3**
^^^^^^^^^^
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./transducer_stateless_modified/train.py --num-epochs 10 --start-epoch 3
It loads checkpoint ``./transducer_stateless_modified/exp/epoch-2.pt`` and starts
training from epoch 3. Also, it trains for 10 epochs.
Decoding
--------
The decoding part uses checkpoints saved by the training part, so you have
to run the training part first.
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./transducer_stateless_modified/decode.py --help
shows the options for decoding.
The commonly used options are:
- ``--method``
This specifies the decoding method. Currently, it supports:
- **greedy_search**. You can provide the commandline option ``--max-sym-per-frame``
to limit the maximum number of symbols that can be emitted per frame.
- **beam_search**. You can provide the commandline option ``--beam-size``.
- **modified_beam_search**. You can also provide the commandline option ``--beam-size``.
To use this method, we assume that you have trained your model with modified transducer,
i.e., used the option ``--modified-transducer-prob`` in the training.
The following command uses greedy search for decoding:
.. code-block::
$ cd egs/aishell/ASR
$ ./transducer_stateless_modified/decode.py \
--epoch 64 \
--avg 33 \
--exp-dir ./transducer_stateless_modified/exp \
--max-duration 100 \
--decoding-method greedy_search \
--max-sym-per-frame 1
The following command uses beam search for decoding:
.. code-block::
$ cd egs/aishell/ASR
$ ./transducer_stateless_modified/decode.py \
--epoch 64 \
--avg 33 \
--exp-dir ./transducer_stateless_modified/exp \
--max-duration 100 \
--decoding-method beam_search \
--beam-size 4
The following command uses modified beam search for decoding:
.. code-block::
$ cd egs/aishell/ASR
$ ./transducer_stateless_modified/decode.py \
--epoch 64 \
--avg 33 \
--exp-dir ./transducer_stateless_modified/exp \
--max-duration 100 \
--decoding-method modified_beam_search \
--beam-size 4
- ``--max-duration``
It has the same meaning as the one used in training. A larger
value may cause OOM.
- ``--epoch``
It specifies the epoch whose checkpoint should be used for decoding.
- ``--avg``
It specifies the number of models to average. For instance, if it is 3 and if
``--epoch=10``, then it averages the checkpoints ``epoch-8.pt``, ``epoch-9.pt``,
and ``epoch-10.pt`` and the averaged checkpoint is used for decoding.
After decoding, you can find the decoding logs and results in ``exp_dir/log/<decoding_method>``, e.g.,
``exp_dir/log/greedy_search``.
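For example, to list the decoding artifacts after a greedy-search run (the exact path follows from the defaults above; adjust it if you changed ``--exp-dir``):
.. code-block:: bash
$ ls transducer_stateless_modified/exp/log/greedy_search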
Pre-trained Model
-----------------
We have uploaded a pre-trained model to
`<https://huggingface.co/csukuangfj/icefall-aishell-transducer-stateless-modified-2022-03-01>`_
We describe how to use the pre-trained model to transcribe a sound file or
multiple sound files in the following.
Install kaldifeat
~~~~~~~~~~~~~~~~~
`kaldifeat <https://github.com/csukuangfj/kaldifeat>`_ is used to
extract features for a single sound file or multiple sound files
at the same time.
Please refer to `<https://github.com/csukuangfj/kaldifeat>`_ for installation.
Download the pre-trained model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The following commands describe how to download the pre-trained model:
.. code-block::
$ cd egs/aishell/ASR
$ mkdir tmp
$ cd tmp
$ git lfs install
$ git clone https://huggingface.co/csukuangfj/icefall-aishell-transducer-stateless-modified-2022-03-01
.. CAUTION::
You have to use ``git lfs`` to download the pre-trained model.
After downloading, you will have the following files:
.. code-block:: bash
$ cd egs/aishell/ASR
$ tree tmp/icefall-aishell-transducer-stateless-modified-2022-03-01
.. code-block:: bash
tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/
|-- README.md
|-- data
| `-- lang_char
| |-- L.pt
| |-- lexicon.txt
| |-- tokens.txt
| `-- words.txt
|-- exp
| `-- pretrained.pt
|-- log
| |-- errs-test-beam_4-epoch-64-avg-33-beam-4.txt
| |-- errs-test-greedy_search-epoch-64-avg-33-context-2-max-sym-per-frame-1.txt
| |-- log-decode-epoch-64-avg-33-beam-4-2022-03-02-12-05-03
| |-- log-decode-epoch-64-avg-33-context-2-max-sym-per-frame-1-2022-02-28-18-13-07
| |-- recogs-test-beam_4-epoch-64-avg-33-beam-4.txt
| `-- recogs-test-greedy_search-epoch-64-avg-33-context-2-max-sym-per-frame-1.txt
`-- test_wavs
|-- BAC009S0764W0121.wav
|-- BAC009S0764W0122.wav
|-- BAC009S0764W0123.wav
`-- transcript.txt
5 directories, 16 files
**File descriptions**:
- ``data/lang_char``
It contains language-related files. You can find the vocabulary size in ``tokens.txt`` (see the sketch after this list).
- ``exp/pretrained.pt``
It contains pre-trained model parameters, obtained by averaging
checkpoints from ``epoch-32.pt`` to ``epoch-64.pt``.
Note: We have removed optimizer ``state_dict`` to reduce file size.
- ``log``
It contains decoding logs and decoded results.
- ``test_wavs``
It contains some test sound files from Aishell ``test`` dataset.
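As referenced above, a quick sketch of checking the vocabulary size (each line of ``tokens.txt`` corresponds to one token):
.. code-block:: bash
$ wc -l tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/data/lang_char/tokens.txt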
Information about the test sound files is listed below:
.. code-block:: bash
$ soxi tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/*.wav
Input File : 'tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0121.wav'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:04.20 = 67263 samples ~ 315.295 CDDA sectors
File Size : 135k
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM
Input File : 'tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0122.wav'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:04.12 = 65840 samples ~ 308.625 CDDA sectors
File Size : 132k
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM
Input File : 'tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0123.wav'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:04.00 = 64000 samples ~ 300 CDDA sectors
File Size : 128k
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM
Total Duration of 3 files: 00:00:12.32
Usage
~~~~~
.. code-block::
$ cd egs/aishell/ASR
$ ./transducer_stateless_modified/pretrained.py --help
displays the help information.
It supports three decoding methods:
- greedy search
- beam search
- modified beam search
.. note::
In modified beam search, the maximum number of symbols that can be
emitted per frame is limited to 1. To use this method, you have to ensure that your model
has been trained with the option ``--modified-transducer-prob``. Otherwise,
it may give you poor results.
Greedy search
^^^^^^^^^^^^^
The command to run greedy search is given below:
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./transducer_stateless_modified/pretrained.py \
--checkpoint ./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/exp/pretrained.pt \
--lang-dir ./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/data/lang_char \
--method greedy_search \
./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0121.wav \
./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0122.wav \
./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0123.wav
The output is as follows:
.. code-block::
2022-03-03 15:35:26,531 INFO [pretrained.py:239] device: cuda:0
2022-03-03 15:35:26,994 INFO [lexicon.py:176] Loading pre-compiled tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/data/lang_char/Linv.pt
2022-03-03 15:35:27,027 INFO [pretrained.py:246] {'feature_dim': 80, 'encoder_out_dim': 512, 'subsampling_factor': 4, 'attention_dim': 512, 'nhead': 8, 'dim_feedforward': 2048, 'num_encoder_layers': 12, 'vgg_frontend': False, 'env_info': {'k2-version': '1.13', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'f4fefe4882bc0ae59af951da3f47335d5495ef71', 'k2-git-date': 'Thu Feb 10 15:16:02 2022', 'lhotse-version': '1.0.0.dev+missing.version.file', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'master', 'icefall-git-sha1': '50d2281-clean', 'icefall-git-date': 'Wed Mar 2 16:02:38 2022', 'icefall-path': '/ceph-fj/fangjun/open-source-2/icefall-aishell', 'k2-path': '/ceph-fj/fangjun/open-source-2/k2-multi-datasets/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-fj/fangjun/open-source-2/lhotse-aishell/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-2-0815224919-75d558775b-mmnv8', 'IP address': '10.177.72.138'}, 'sample_rate': 16000, 'checkpoint': './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/exp/pretrained.pt', 'lang_dir': PosixPath('tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/data/lang_char'), 'method': 'greedy_search', 'sound_files': ['./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0121.wav', './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0122.wav', './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0123.wav'], 'beam_size': 4, 'context_size': 2, 'max_sym_per_frame': 3, 'blank_id': 0, 'vocab_size': 4336}
2022-03-03 15:35:27,027 INFO [pretrained.py:248] About to create model
2022-03-03 15:35:36,878 INFO [pretrained.py:257] Constructing Fbank computer
2022-03-03 15:35:36,880 INFO [pretrained.py:267] Reading sound files: ['./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0121.wav', './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0122.wav', './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0123.wav']
2022-03-03 15:35:36,891 INFO [pretrained.py:273] Decoding started
/ceph-fj/fangjun/open-source-2/icefall-aishell/egs/aishell/ASR/transducer_stateless_modified/conformer.py:113: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
lengths = ((x_lens - 1) // 2 - 1) // 2
2022-03-03 15:35:37,163 INFO [pretrained.py:320]
./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0121.wav:
甚 至 出 现 交 易 几 乎 停 滞 的 情 况
./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0122.wav:
一 二 线 城 市 虽 然 也 处 于 调 整 中
./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0123.wav:
但 因 为 聚 集 了 过 多 公 共 资 源
2022-03-03 15:35:37,163 INFO [pretrained.py:322] Decoding Done
Beam search
^^^^^^^^^^^
The command to run beam search is given below:
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./transducer_stateless_modified/pretrained.py \
--checkpoint ./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/exp/pretrained.pt \
--lang-dir ./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/data/lang_char \
--method beam_search \
--beam-size 4 \
./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0121.wav \
./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0122.wav \
./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0123.wav
The output is as follows:
.. code-block::
2022-03-03 15:39:09,285 INFO [pretrained.py:239] device: cuda:0
2022-03-03 15:39:09,708 INFO [lexicon.py:176] Loading pre-compiled tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/data/lang_char/Linv.pt
2022-03-03 15:39:09,759 INFO [pretrained.py:246] {'feature_dim': 80, 'encoder_out_dim': 512, 'subsampling_factor': 4, 'attention_dim': 512, 'nhead': 8, 'dim_feedforward': 2048, 'num_encoder_layers': 12, 'vgg_frontend': False, 'env_info': {'k2-version': '1.13', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'f4fefe4882bc0ae59af951da3f47335d5495ef71', 'k2-git-date': 'Thu Feb 10 15:16:02 2022', 'lhotse-version': '1.0.0.dev+missing.version.file', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'master', 'icefall-git-sha1': '50d2281-clean', 'icefall-git-date': 'Wed Mar 2 16:02:38 2022', 'icefall-path': '/ceph-fj/fangjun/open-source-2/icefall-aishell', 'k2-path': '/ceph-fj/fangjun/open-source-2/k2-multi-datasets/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-fj/fangjun/open-source-2/lhotse-aishell/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-2-0815224919-75d558775b-mmnv8', 'IP address': '10.177.72.138'}, 'sample_rate': 16000, 'checkpoint': './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/exp/pretrained.pt', 'lang_dir': PosixPath('tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/data/lang_char'), 'method': 'beam_search', 'sound_files': ['./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0121.wav', './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0122.wav', './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0123.wav'], 'beam_size': 4, 'context_size': 2, 'max_sym_per_frame': 3, 'blank_id': 0, 'vocab_size': 4336}
2022-03-03 15:39:09,760 INFO [pretrained.py:248] About to create model
2022-03-03 15:39:18,919 INFO [pretrained.py:257] Constructing Fbank computer
2022-03-03 15:39:18,922 INFO [pretrained.py:267] Reading sound files: ['./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0121.wav', './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0122.wav', './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0123.wav']
2022-03-03 15:39:18,929 INFO [pretrained.py:273] Decoding started
/ceph-fj/fangjun/open-source-2/icefall-aishell/egs/aishell/ASR/transducer_stateless_modified/conformer.py:113: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
lengths = ((x_lens - 1) // 2 - 1) // 2
2022-03-03 15:39:21,046 INFO [pretrained.py:320]
./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0121.wav:
甚 至 出 现 交 易 几 乎 停 滞 的 情 况
./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0122.wav:
一 二 线 城 市 虽 然 也 处 于 调 整 中
./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0123.wav:
但 因 为 聚 集 了 过 多 公 共 资 源
2022-03-03 15:39:21,047 INFO [pretrained.py:322] Decoding Done
Modified Beam search
^^^^^^^^^^^^^^^^^^^^
The command to run modified beam search is given below:
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./transducer_stateless_modified/pretrained.py \
--checkpoint ./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/exp/pretrained.pt \
--lang-dir ./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/data/lang_char \
--method modified_beam_search \
--beam-size 4 \
./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0121.wav \
./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0122.wav \
./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0123.wav
The output is as follows:
.. code-block::
2022-03-03 15:41:23,319 INFO [pretrained.py:239] device: cuda:0
2022-03-03 15:41:23,798 INFO [lexicon.py:176] Loading pre-compiled tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/data/lang_char/Linv.pt
2022-03-03 15:41:23,831 INFO [pretrained.py:246] {'feature_dim': 80, 'encoder_out_dim': 512, 'subsampling_factor': 4, 'attention_dim': 512, 'nhead': 8, 'dim_feedforward': 2048, 'num_encoder_layers': 12, 'vgg_frontend': False, 'env_info': {'k2-version': '1.13', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'f4fefe4882bc0ae59af951da3f47335d5495ef71', 'k2-git-date': 'Thu Feb 10 15:16:02 2022', 'lhotse-version': '1.0.0.dev+missing.version.file', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'master', 'icefall-git-sha1': '50d2281-clean', 'icefall-git-date': 'Wed Mar 2 16:02:38 2022', 'icefall-path': '/ceph-fj/fangjun/open-source-2/icefall-aishell', 'k2-path': '/ceph-fj/fangjun/open-source-2/k2-multi-datasets/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-fj/fangjun/open-source-2/lhotse-aishell/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-2-0815224919-75d558775b-mmnv8', 'IP address': '10.177.72.138'}, 'sample_rate': 16000, 'checkpoint': './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/exp/pretrained.pt', 'lang_dir': PosixPath('tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/data/lang_char'), 'method': 'modified_beam_search', 'sound_files': ['./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0121.wav', './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0122.wav', './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0123.wav'], 'beam_size': 4, 'context_size': 2, 'max_sym_per_frame': 3, 'blank_id': 0, 'vocab_size': 4336}
2022-03-03 15:41:23,831 INFO [pretrained.py:248] About to create model
2022-03-03 15:41:32,214 INFO [pretrained.py:257] Constructing Fbank computer
2022-03-03 15:41:32,215 INFO [pretrained.py:267] Reading sound files: ['./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0121.wav', './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0122.wav', './tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0123.wav']
2022-03-03 15:41:32,220 INFO [pretrained.py:273] Decoding started
/ceph-fj/fangjun/open-source-2/icefall-aishell/egs/aishell/ASR/transducer_stateless_modified/conformer.py:113: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
lengths = ((x_lens - 1) // 2 - 1) // 2
/ceph-fj/fangjun/open-source-2/icefall-aishell/egs/aishell/ASR/transducer_stateless_modified/beam_search.py:402: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
topk_hyp_indexes = topk_indexes // logits.size(-1)
2022-03-03 15:41:32,583 INFO [pretrained.py:320]
./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0121.wav:
甚 至 出 现 交 易 几 乎 停 滞 的 情 况
./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0122.wav:
一 二 线 城 市 虽 然 也 处 于 调 整 中
./tmp/icefall-aishell-transducer-stateless-modified-2022-03-01/test_wavs/BAC009S0764W0123.wav:
但 因 为 聚 集 了 过 多 公 共 资 源
2022-03-03 15:41:32,583 INFO [pretrained.py:322] Decoding Done
Colab notebook
--------------
We provide a Colab notebook for this recipe, showing how to use a pre-trained model to
transcribe sound files.
|aishell asr stateless modified transducer colab notebook|
.. |aishell asr stateless modified transducer colab notebook| image:: https://colab.research.google.com/assets/colab-badge.svg
:target: https://colab.research.google.com/drive/12jpTxJB44vzwtcmJl2DTdznW0OawPb9H?usp=sharing


@ -1,504 +0,0 @@
TDNN-LSTM CTC
=============
This tutorial shows you how to run a TDNN-LSTM CTC model
with the `Aishell <https://www.openslr.org/33>`_ dataset.
.. HINT::
We assume you have read the page :ref:`install icefall` and have set up
the environment for ``icefall``.
.. HINT::
We recommend using a GPU or several GPUs to run this recipe.
In this tutorial, you will learn:
- (1) How to prepare data for training and decoding
- (2) How to start the training, either with a single GPU or multiple GPUs
- (3) How to do decoding after training.
- (4) How to use a pre-trained model, provided by us
Data preparation
----------------
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./prepare.sh
The script ``./prepare.sh`` handles the data preparation for you, **automagically**.
All you need to do is to run it.
The data preparation contains several stages, you can use the following two
options:
- ``--stage``
- ``--stop-stage``
to control which stage(s) should be run. By default, all stages are executed.
For example,
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./prepare.sh --stage 0 --stop-stage 0
means to run only stage 0.
To run stage 2 to stage 5, use:
.. code-block:: bash
$ ./prepare.sh --stage 2 --stop-stage 5
.. HINT::
If you have pre-downloaded the `Aishell <https://www.openslr.org/33>`_
dataset and the `musan <http://www.openslr.org/17/>`_ dataset, say,
they are saved in ``/tmp/aishell`` and ``/tmp/musan``, you can modify
the ``dl_dir`` variable in ``./prepare.sh`` to point to ``/tmp`` so that
``./prepare.sh`` won't re-download them.
.. HINT::
A 3-gram language model will be downloaded from huggingface. We assume you have
installed and initialized ``git-lfs``. If not, you can install ``git-lfs`` by
.. code-block:: bash
$ sudo apt-get install git-lfs
$ git-lfs install
If you don't have ``sudo`` permission, you can download the
`git-lfs binary <https://github.com/git-lfs/git-lfs/releases>`_ instead, then add it to your ``PATH``.
.. NOTE::
All files generated by ``./prepare.sh``, e.g., features, lexicon, etc.,
are saved in ``./data`` directory.
Training
--------
Configurable options
~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./tdnn_lstm_ctc/train.py --help
shows you the training options that can be passed from the commandline.
The following options are used quite often:
- ``--num-epochs``
It is the number of epochs to train. For instance,
``./tdnn_lstm_ctc/train.py --num-epochs 30`` trains for 30 epochs
and generates ``epoch-0.pt``, ``epoch-1.pt``, ..., ``epoch-29.pt``
in the folder ``./tdnn_lstm_ctc/exp``.
- ``--start-epoch``
It's used to resume training.
``./tdnn_lstm_ctc/train.py --start-epoch 10`` loads the
checkpoint ``./tdnn_lstm_ctc/exp/epoch-9.pt`` and starts
training from epoch 10, based on the state from epoch 9.
- ``--world-size``
It is used for multi-GPU single-machine DDP training.
- (a) If it is 1, then no DDP training is used.
- (b) If it is 2, then GPU 0 and GPU 1 are used for DDP training.
The following shows some use cases with it.
**Use case 1**: You have 4 GPUs, but you only want to use GPU 0 and
GPU 2 for training. You can do the following:
.. code-block:: bash
$ cd egs/aishell/ASR
$ export CUDA_VISIBLE_DEVICES="0,2"
$ ./tdnn_lstm_ctc/train.py --world-size 2
**Use case 2**: You have 4 GPUs and you want to use all of them
for training. You can do the following:
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./tdnn_lstm_ctc/train.py --world-size 4
**Use case 3**: You have 4 GPUs but you only want to use GPU 3
for training. You can do the following:
.. code-block:: bash
$ cd egs/aishell/ASR
$ export CUDA_VISIBLE_DEVICES="3"
$ ./tdnn_lstm_ctc/train.py --world-size 1
.. CAUTION::
Only multi-GPU single-machine DDP training is implemented at present.
Multi-GPU multi-machine DDP training will be added later.
- ``--max-duration``
It specifies the total duration in seconds, over all utterances in a
batch, before **padding**.
If you encounter CUDA OOM, please reduce it. For instance, if
you are using a V100 NVIDIA GPU, we recommend setting it to ``2000``.
.. HINT::
Due to padding, the number of seconds of all utterances in a
batch will usually be larger than ``--max-duration``.
A larger value for ``--max-duration`` may cause OOM during training,
while a smaller value may increase the training time. You have to
tune it.
Pre-configured options
~~~~~~~~~~~~~~~~~~~~~~
There are some training options, e.g., weight decay,
number of warmup steps, results dir, etc,
that are not passed from the commandline.
They are pre-configured by the function ``get_params()`` in
`tdnn_lstm_ctc/train.py <https://github.com/k2-fsa/icefall/blob/master/egs/aishell/ASR/tdnn_lstm_ctc/train.py>`_
You don't need to change these pre-configured parameters. If you really need to change
them, please modify ``./tdnn_lstm_ctc/train.py`` directly.
.. CAUTION::
The training set is perturbed by speed with two factors: 0.9 and 1.1.
Each epoch actually processes ``3x150 == 450`` hours of data.
Training logs
~~~~~~~~~~~~~
Training logs and checkpoints are saved in ``tdnn_lstm_ctc/exp``.
You will find the following files in that directory:
- ``epoch-0.pt``, ``epoch-1.pt``, ...
These are checkpoint files, containing model ``state_dict`` and optimizer ``state_dict``.
To resume training from some checkpoint, say ``epoch-10.pt``, you can use:
.. code-block:: bash
$ ./tdnn_lstm_ctc/train.py --start-epoch 11
- ``tensorboard/``
This folder contains TensorBoard logs. Training loss, validation loss, learning
rate, etc, are recorded in these logs. You can visualize them by:
.. code-block:: bash
$ cd tdnn_lstm_ctc/exp/tensorboard
$ tensorboard dev upload --logdir . --description "TDNN-LSTM CTC training for Aishell with icefall"
It will print something like below:
.. code-block::
TensorFlow installation not found - running with reduced feature set.
Upload started and will continue reading any new data as it's added to the logdir.
To stop uploading, press Ctrl-C.
New experiment created. View your TensorBoard at: https://tensorboard.dev/experiment/LJI9MWUORLOw3jkdhxwk8A/
[2021-09-13T11:59:23] Started scanning logdir.
[2021-09-13T11:59:24] Total uploaded: 4454 scalars, 0 tensors, 0 binary objects
Listening for new data in logdir...
Note there is a URL in the above output. Click it and you will see
the following screenshot:
.. figure:: images/aishell-tdnn-lstm-ctc-tensorboard-log.jpg
:width: 600
:alt: TensorBoard screenshot
:align: center
:target: https://tensorboard.dev/experiment/LJI9MWUORLOw3jkdhxwk8A/
TensorBoard screenshot.
- ``log/log-train-xxxx``
It is the detailed training log in text format, same as the one
you saw printed to the console during training.
Usage examples
~~~~~~~~~~~~~~
The following shows typical use cases:
**Case 1**
^^^^^^^^^^
.. code-block:: bash
$ cd egs/aishell/ASR
$ export CUDA_VISIBLE_DEVICES="0,3"
$ ./tdnn_lstm_ctc/train.py --world-size 2
It uses GPU 0 and GPU 3 for DDP training.
**Case 2**
^^^^^^^^^^
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./tdnn_lstm_ctc/train.py --num-epochs 10 --start-epoch 3
It loads checkpoint ``./tdnn_lstm_ctc/exp/epoch-2.pt`` and starts
training from epoch 3. Also, it trains for 10 epochs.
Decoding
--------
The decoding part uses checkpoints saved by the training part, so you have
to run the training part first.
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./tdnn_lstm_ctc/decode.py --help
shows the options for decoding.
The commonly used options are:
- ``--method``
This specifies the decoding method.
The following command uses ``1best`` decoding:
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./tdnn_lstm_ctc/decode.py --method 1best --max-duration 100
- ``--max-duration``
It has the same meaning as the one during training. A larger
value may cause OOM.
Pre-trained Model
-----------------
We have uploaded a pre-trained model to
`<https://huggingface.co/pkufool/icefall_asr_aishell_tdnn_lstm_ctc>`_.
In the following, we describe how to use the pre-trained model to transcribe
one or more sound files.
Install kaldifeat
~~~~~~~~~~~~~~~~~
`kaldifeat <https://github.com/csukuangfj/kaldifeat>`_ is used to
extract features for a single sound file or multiple sound files
at the same time.
Please refer to `<https://github.com/csukuangfj/kaldifeat>`_ for installation.
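For a quick sanity check, the following is a minimal sketch of computing fbank
features with ``kaldifeat``. The option names follow kaldifeat's Python API;
the 16 kHz sample rate matches the test waves, while the 80 mel bins are an
illustrative assumption, not necessarily this recipe's exact configuration.
.. code-block:: python

import torch
import kaldifeat

# A minimal sketch: compute fbank features for one waveform.
opts = kaldifeat.FbankOptions()
opts.frame_opts.samp_freq = 16000  # matches the 16 kHz test waves
opts.mel_opts.num_bins = 80        # illustrative choice of mel bins
fbank = kaldifeat.Fbank(opts)

wave = torch.randn(16000)  # 1 second of fake audio
features = fbank(wave)     # (num_frames, 80)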
Download the pre-trained model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The following commands describe how to download the pre-trained model:
.. code-block::
$ cd egs/aishell/ASR
$ mkdir tmp
$ cd tmp
$ git lfs install
$ git clone https://huggingface.co/pkufool/icefall_asr_aishell_tdnn_lstm_ctc
.. CAUTION::
You have to use ``git lfs`` to download the pre-trained model.
.. CAUTION::
In order to use this pre-trained model, your k2 version has to be v1.7 or later.
After downloading, you will have the following files:
.. code-block:: bash
$ cd egs/aishell/ASR
$ tree tmp
.. code-block:: bash
tmp/
`-- icefall_asr_aishell_tdnn_lstm_ctc
|-- README.md
|-- data
| `-- lang_phone
| |-- HLG.pt
| |-- tokens.txt
| `-- words.txt
|-- exp
| `-- pretrained.pt
`-- test_waves
|-- BAC009S0764W0121.wav
|-- BAC009S0764W0122.wav
|-- BAC009S0764W0123.wav
`-- trans.txt
5 directories, 9 files
**File descriptions**:
- ``data/lang_phone/HLG.pt``
It is the decoding graph.
- ``data/lang_phone/tokens.txt``
It contains tokens and their IDs.
Provided only for convenience so that you can look up the SOS/EOS ID easily.
- ``data/lang_phone/words.txt``
It contains words and their IDs.
- ``exp/pretrained.pt``
It contains pre-trained model parameters, obtained by averaging
checkpoints from ``epoch-18.pt`` to ``epoch-40.pt``.
Note: We have removed optimizer ``state_dict`` to reduce file size.
- ``test_waves/*.wav``
It contains some test sound files from Aishell ``test`` dataset.
- ``test_waves/trans.txt``
It contains the reference transcripts for the sound files in ``test_waves/``.
Information about the test sound files is listed below:
.. code-block:: bash
$ soxi tmp/icefall_asr_aishell_tdnn_lstm_ctc/test_waves/*.wav
Input File : 'tmp/icefall_asr_aishell_tdnn_lstm_ctc/test_waves/BAC009S0764W0121.wav'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:04.20 = 67263 samples ~ 315.295 CDDA sectors
File Size : 135k
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM
Input File : 'tmp/icefall_asr_aishell_tdnn_lstm_ctc/test_waves/BAC009S0764W0122.wav'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:04.12 = 65840 samples ~ 308.625 CDDA sectors
File Size : 132k
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM
Input File : 'tmp/icefall_asr_aishell_tdnn_lstm_ctc/test_waves/BAC009S0764W0123.wav'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:04.00 = 64000 samples ~ 300 CDDA sectors
File Size : 128k
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM
Total Duration of 3 files: 00:00:12.32
Usage
~~~~~
.. code-block::
$ cd egs/aishell/ASR
$ ./tdnn_lstm_ctc/pretrained.py --help
displays the help information.
HLG decoding
^^^^^^^^^^^^
HLG decoding uses the best path of the decoding lattice as the decoding result.
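Conceptually, this boils down to a single call to ``k2.shortest_path``; a
minimal sketch, assuming a ``lattice`` already obtained by intersecting the
network output with ``HLG`` (which is what ``pretrained.py`` does internally):
.. code-block:: python

import k2

def one_best(lattice: k2.Fsa) -> k2.Fsa:
    # Extract the single best path from the decoding lattice.
    return k2.shortest_path(lattice, use_double_scores=True)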
The command to run HLG decoding is:
.. code-block:: bash
$ cd egs/aishell/ASR
$ ./tdnn_lstm_ctc/pretrained.py \
--checkpoint ./tmp/icefall_asr_aishell_tdnn_lstm_ctc/exp/pretrained.pt \
--words-file ./tmp/icefall_asr_aishell_tdnn_lstm_ctc/data/lang_phone/words.txt \
--HLG ./tmp/icefall_asr_aishell_tdnn_lstm_ctc/data/lang_phone/HLG.pt \
--method 1best \
./tmp/icefall_asr_aishell_tdnn_lstm_ctc/test_waves/BAC009S0764W0121.wav \
./tmp/icefall_asr_aishell_tdnn_lstm_ctc/test_waves/BAC009S0764W0122.wav \
./tmp/icefall_asr_aishell_tdnn_lstm_ctc/test_waves/BAC009S0764W0123.wav
The output is given below:
.. code-block::
2021-09-13 15:00:55,858 INFO [pretrained.py:140] device: cuda:0
2021-09-13 15:00:55,858 INFO [pretrained.py:142] Creating model
2021-09-13 15:01:05,389 INFO [pretrained.py:154] Loading HLG from ./tmp/icefall_asr_aishell_tdnn_lstm_ctc/data/lang_phone/HLG.pt
2021-09-13 15:01:06,531 INFO [pretrained.py:161] Constructing Fbank computer
2021-09-13 15:01:06,536 INFO [pretrained.py:171] Reading sound files: ['./tmp/icefall_asr_aishell_tdnn_lstm_ctc/test_waves/BAC009S0764W0121.wav', './tmp/icefall_asr_aishell_tdnn_lstm_ctc/test_waves/BAC009S0764W0122.wav', './tmp/icefall_asr_aishell_tdnn_lstm_ctc/test_waves/BAC009S0764W0123.wav']
2021-09-13 15:01:06,539 INFO [pretrained.py:177] Decoding started
2021-09-13 15:01:06,917 INFO [pretrained.py:207] Use HLG decoding
2021-09-13 15:01:07,129 INFO [pretrained.py:220]
./tmp/icefall_asr_aishell_tdnn_lstm_ctc/test_waves/BAC009S0764W0121.wav:
甚至 出现 交易 几乎 停滞 的 情况
./tmp/icefall_asr_aishell_tdnn_lstm_ctc/test_waves/BAC009S0764W0122.wav:
一二 线 城市 虽然 也 处于 调整 中
./tmp/icefall_asr_aishell_tdnn_lstm_ctc/test_waves/BAC009S0764W0123.wav:
但 因为 聚集 了 过多 公共 资源
2021-09-13 15:01:07,129 INFO [pretrained.py:222] Decoding Done
Colab notebook
--------------
We provide a colab notebook for this recipe showing how to use a pre-trained model.
|aishell asr tdnn-lstm ctc colab notebook|
.. |aishell asr tdnn-lstm ctc colab notebook| image:: https://colab.research.google.com/assets/colab-badge.svg
:target: https://colab.research.google.com/drive/1jbyzYq3ytm6j2nlEt-diQm-6QVWyDDEa?usp=sharing
**Congratulations!** You have finished the Aishell ASR recipe with
TDNN-LSTM CTC models in ``icefall``.
@ -1,10 +0,0 @@
Non Streaming ASR
=================
.. toctree::
:maxdepth: 2
aishell/index
librispeech/index
timit/index
yesno/index
@ -1,223 +0,0 @@
Distillation with HuBERT
========================
This tutorial shows you how to perform knowledge distillation in `icefall <https://github.com/k2-fsa/icefall>`_
with the `LibriSpeech`_ dataset. The distillation method
used here is called "Multi Vector Quantization Knowledge Distillation" (MVQ-KD).
Please have a look at our paper `Predicting Multi-Codebook Vector Quantization Indexes for Knowledge Distillation <https://arxiv.org/abs/2211.00508>`_
for more details about MVQ-KD.
.. note::
This tutorial is based on recipe
`pruned_transducer_stateless4 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless4>`_.
Currently, we only implement MVQ-KD in this recipe. However, MVQ-KD is theoretically applicable to all recipes
with only minor changes needed. Feel free to try out MVQ-KD in different recipes. If you
encounter any problems, please open an issue here `icefall <https://github.com/k2-fsa/icefall/issues>`__.
.. note::
We assume you have read the page :ref:`install icefall` and have setup
the environment for `icefall`_.
.. HINT::
We recommend using one or more GPUs to run this recipe.
Data preparation
----------------
We first prepare the necessary training data for `LibriSpeech`_.
This is the same as in :ref:`non_streaming_librispeech_pruned_transducer_stateless`.
.. hint::
The data preparation is the same as in other LibriSpeech recipes;
if you have finished this step, you can skip to :ref:`codebook_index_preparation` directly.
.. code-block:: bash
$ cd egs/librispeech/ASR
$ ./prepare.sh
The script ``./prepare.sh`` handles the data preparation for you, **automagically**.
All you need to do is to run it.
The data preparation contains several stages. You can use the following two
options:
- ``--stage``
- ``--stop_stage``
to control which stage(s) should be run. By default, all stages are executed.
For example,
.. code-block:: bash
$ cd egs/librispeech/ASR
$ ./prepare.sh --stage 0 --stop_stage 0 # run only stage 0
$ ./prepare.sh --stage 2 --stop_stage 5 # run from stage 2 to stage 5
.. HINT::
If you have pre-downloaded the `LibriSpeech`_
dataset and the `musan`_ dataset, say,
they are saved in ``/tmp/LibriSpeech`` and ``/tmp/musan``, you can modify
the ``dl_dir`` variable in ``./prepare.sh`` to point to ``/tmp`` so that
``./prepare.sh`` won't re-download them.
.. NOTE::
All files generated by ``./prepare.sh``, e.g., features, lexicon, etc.,
are saved in the ``./data`` directory.
We provide the following YouTube video showing how to run ``./prepare.sh``.
.. note::
To get the latest news of `next-gen Kaldi <https://github.com/k2-fsa>`_, please subscribe to
the following YouTube channel by `Nadira Povey <https://www.youtube.com/channel/UC_VaumpkmINz1pNkFXAN9mw>`_:
`<https://www.youtube.com/channel/UC_VaumpkmINz1pNkFXAN9mw>`_
.. youtube:: ofEIoJL-mGM
.. _codebook_index_preparation:
Codebook index preparation
--------------------------
Here, we prepare the necessary data for MVQ-KD. This requires generating
codebook indexes (please read our `paper <https://arxiv.org/abs/2211.00508>`_
if you are interested in the details). In this tutorial, we use pre-computed
codebook indexes for convenience. The only thing you need to do is to
run `./distillation_with_hubert.sh <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/distillation_with_hubert.sh>`_.
.. note::
There are 5 stages in total; the first and second stages will be skipped automatically
when you choose to download the codebook indexes prepared by `icefall`_.
Of course, you can extract and compute the codebook indexes by yourself. This
requires downloading a HuBERT-XL model, and extracting the codebook indexes
can take a while.
As usual, you can control the stages you want to run by specifying the following
two options:
- ``--stage``
- ``--stop_stage``
For example,
.. code-block:: bash
$ cd egs/librispeech/ASR
$ ./distillation_with_hubert.sh --stage 0 --stop_stage 0 # run only stage 0
$ ./distillation_with_hubert.sh --stage 2 --stop_stage 4 # run from stage 2 to stage 4
Here are a few options in `./distillation_with_hubert.sh <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/distillation_with_hubert.sh>`_
you need to know before you proceed.
- ``--full_libri`` If True, use the full 960h of data. Otherwise only ``train-clean-100`` will be used.
- ``--use_extracted_codebook`` If True, the first two stages will be skipped and the codebook
indexes uploaded by us will be downloaded.
Since we are using the pre-computed codebook indexes, we set
``use_extracted_codebook=True``. If you want to do full `LibriSpeech`_
experiments, please set ``full_libri=True``.
The following command downloads the pre-computed codebook indexes
and prepares MVQ-augmented training manifests.
.. code-block:: bash
$ ./distillation_with_hubert.sh --stage 2 --stop_stage 2 # run only stage 2
Please see the
following screenshot for the output of an example execution.
.. figure:: ./images/distillation_codebook.png
:width: 800
:alt: Downloading codebook indexes and preparing training manifest.
:align: center
Downloading codebook indexes and preparing training manifest.
.. hint::
The codebook indexes we prepared for you in this tutorial
are extracted from the 36th layer of a fine-tuned HuBERT-XL model
with 8 codebooks. If you want to try other configurations, please
set ``use_extracted_codebook=False`` and set ``embedding_layer`` and
``num_codebooks`` by yourself.
Now, you should see the following files under the directory ``./data/vq_fbank_layer36_cb8``.
.. figure:: ./images/distillation_directory.png
:width: 800
:alt: MVQ-augmented training manifests
:align: center
MVQ-augmented training manifests.
Voilà! You are ready to perform knowledge distillation training now!
Training
--------
To perform training, please run stage 3 by executing the following command.
.. code-block:: bash
$ ./distillation_with_hubert.sh --stage 3 --stop_stage 3 # run MVQ training
Here is the code snippet for training:
.. code-block:: bash
WORLD_SIZE=$(echo ${CUDA_VISIBLE_DEVICES} | awk '{n=split($1, _, ","); print n}')
./pruned_transducer_stateless6/train.py \
--manifest-dir ./data/vq_fbank_layer36_cb8 \
--master-port 12359 \
--full-libri $full_libri \
--spec-aug-time-warp-factor -1 \
--max-duration 300 \
--world-size ${WORLD_SIZE} \
--num-epochs 30 \
--exp-dir $exp_dir \
--enable-distillation True \
--codebook-loss-scale 0.01
A few training arguments in the above
command deserve attention:
- ``--enable-distillation`` If True, knowledge distillation training is enabled.
- ``--codebook-loss-scale`` The scale of the knowledge distillation loss.
- ``--manifest-dir`` The path to the MVQ-augmented manifest.
Decoding
--------
After training finishes, you can test the performance using
the following command.
.. code-block:: bash
export CUDA_VISIBLE_DEVICES=0
./pruned_transducer_stateless6/decode.py \
--decoding-method "modified_beam_search" \
--epoch 30 \
--avg 10 \
--max-duration 200 \
--exp-dir $exp_dir \
--enable-distillation True
You should get similar results as `here <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS-100hours.md#distillation-with-hubert>`__.
That's all! Feel free to experiment with your own setups and report your results.
If you encounter any problems during training, please open up an issue `here <https://github.com/k2-fsa/icefall/issues>`__.
Binary file not shown.

Binary file not shown.
@ -1,12 +0,0 @@
LibriSpeech
===========
.. toctree::
:maxdepth: 1
tdnn_lstm_ctc
conformer_ctc
pruned_transducer_stateless
zipformer_mmi
zipformer_ctc_blankskip
distillation
@ -1,548 +0,0 @@
.. _non_streaming_librispeech_pruned_transducer_stateless:
Pruned transducer statelessX
============================
This tutorial shows you how to run a conformer transducer model
with the `LibriSpeech <https://www.openslr.org/12>`_ dataset.
.. Note::
The tutorial is suitable for `pruned_transducer_stateless <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless>`__,
`pruned_transducer_stateless2 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless2>`__,
`pruned_transducer_stateless4 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless4>`__,
`pruned_transducer_stateless5 <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/pruned_transducer_stateless5>`__.
We will take pruned_transducer_stateless4 as an example in this tutorial.
.. HINT::
We assume you have read the page :ref:`install icefall` and have setup
the environment for ``icefall``.
.. HINT::
We recommend using one or more GPUs to run this recipe.
.. hint::
Please scroll down to the bottom of this page to find download links
for pretrained models if you don't want to train a model from scratch.
We use pruned RNN-T to compute the loss.
.. note::
You can find the paper about pruned RNN-T at the following address:
`<https://arxiv.org/abs/2206.13236>`_
The transducer model consists of 3 parts:
- Encoder, a.k.a, the transcription network. We use a Conformer model (the reworked version by Daniel Povey)
- Decoder, a.k.a, the prediction network. We use a stateless model consisting of
``nn.Embedding`` and ``nn.Conv1d``
- Joiner, a.k.a, the joint network.
.. caution::
Contrary to conventional RNN-T models, we use a stateless decoder.
That is, it has no recurrent connections.
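To make the stateless decoder concrete, here is a minimal sketch of the idea
(an illustration only, not icefall's exact implementation), assuming a context
of the two most recently emitted symbols:
.. code-block:: python

import torch
import torch.nn as nn

class StatelessDecoder(nn.Module):
    """An embedding followed by a 1-D convolution over the last
    ``context_size`` emitted symbols; no recurrent connections."""

    def __init__(self, vocab_size: int, embed_dim: int = 512, context_size: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=context_size)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, context_size), the most recently emitted symbols.
        embed = self.embedding(y).permute(0, 2, 1)  # (batch, embed_dim, context)
        return self.conv(embed).permute(0, 2, 1)    # (batch, 1, embed_dim)

decoder = StatelessDecoder(vocab_size=500)
out = decoder(torch.tensor([[3, 7]]))  # shape: (1, 1, 512)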
Data preparation
----------------
.. hint::
The data preparation is the same as in other LibriSpeech recipes;
if you have finished this step, you can skip to ``Training`` directly.
.. code-block:: bash
$ cd egs/librispeech/ASR
$ ./prepare.sh
The script ``./prepare.sh`` handles the data preparation for you, **automagically**.
All you need to do is to run it.
The data preparation contains several stages. You can use the following two
options:
- ``--stage``
- ``--stop-stage``
to control which stage(s) should be run. By default, all stages are executed.
For example,
.. code-block:: bash
$ cd egs/librispeech/ASR
$ ./prepare.sh --stage 0 --stop-stage 0
means to run only stage 0.
To run stage 2 to stage 5, use:
.. code-block:: bash
$ ./prepare.sh --stage 2 --stop-stage 5
.. HINT::
If you have pre-downloaded the `LibriSpeech <https://www.openslr.org/12>`_
dataset and the `musan <http://www.openslr.org/17/>`_ dataset, say,
they are saved in ``/tmp/LibriSpeech`` and ``/tmp/musan``, you can modify
the ``dl_dir`` variable in ``./prepare.sh`` to point to ``/tmp`` so that
``./prepare.sh`` won't re-download them.
.. NOTE::
All files generated by ``./prepare.sh``, e.g., features, lexicon, etc.,
are saved in the ``./data`` directory.
We provide the following YouTube video showing how to run ``./prepare.sh``.
.. note::
To get the latest news of `next-gen Kaldi <https://github.com/k2-fsa>`_, please subscribe to
the following YouTube channel by `Nadira Povey <https://www.youtube.com/channel/UC_VaumpkmINz1pNkFXAN9mw>`_:
`<https://www.youtube.com/channel/UC_VaumpkmINz1pNkFXAN9mw>`_
.. youtube:: ofEIoJL-mGM
Training
--------
Configurable options
~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
$ cd egs/librispeech/ASR
$ ./pruned_transducer_stateless4/train.py --help
shows you the training options that can be passed from the commandline.
The following options are used quite often:
- ``--exp-dir``
The directory to save checkpoints, training logs and tensorboard.
- ``--full-libri``
If it's True, the training part uses all the training data, i.e.,
960 hours. Otherwise, the training part uses only the subset
``train-clean-100``, which has 100 hours of training data.
.. CAUTION::
The training set is perturbed by speed with two factors: 0.9 and 1.1.
If ``--full-libri`` is True, each epoch actually processes
``3x960 == 2880`` hours of data.
- ``--num-epochs``
It is the number of epochs to train. For instance,
``./pruned_transducer_stateless4/train.py --num-epochs 30`` trains for 30 epochs
and generates ``epoch-1.pt``, ``epoch-2.pt``, ..., ``epoch-30.pt``
in the folder ``./pruned_transducer_stateless4/exp``.
- ``--start-epoch``
It's used to resume training.
``./pruned_transducer_stateless4/train.py --start-epoch 10`` loads the
checkpoint ``./pruned_transducer_stateless4/exp/epoch-9.pt`` and starts
training from epoch 10, based on the state from epoch 9.
- ``--world-size``
It is used for multi-GPU single-machine DDP training.
- (a) If it is 1, then no DDP training is used.
- (b) If it is 2, then GPU 0 and GPU 1 are used for DDP training.
The following shows some use cases with it.
**Use case 1**: You have 4 GPUs, but you only want to use GPU 0 and
GPU 2 for training. You can do the following:
.. code-block:: bash
$ cd egs/librispeech/ASR
$ export CUDA_VISIBLE_DEVICES="0,2"
$ ./pruned_transducer_stateless4/train.py --world-size 2
**Use case 2**: You have 4 GPUs and you want to use all of them
for training. You can do the following:
.. code-block:: bash
$ cd egs/librispeech/ASR
$ ./pruned_transducer_stateless4/train.py --world-size 4
**Use case 3**: You have 4 GPUs but you only want to use GPU 3
for training. You can do the following:
.. code-block:: bash
$ cd egs/librispeech/ASR
$ export CUDA_VISIBLE_DEVICES="3"
$ ./pruned_transducer_stateless4/train.py --world-size 1
.. caution::
Only multi-GPU single-machine DDP training is implemented at present.
Multi-GPU multi-machine DDP training will be added later.
- ``--max-duration``
It specifies the total duration in seconds, over all utterances in a
batch, before **padding**.
If you encounter CUDA OOM, please reduce it.
.. HINT::
Due to padding, the number of seconds of all utterances in a
batch will usually be larger than ``--max-duration``.
A larger value for ``--max-duration`` may cause OOM during training,
while a smaller value may increase the training time. You have to
tune it.
- ``--use-fp16``
If it is True, the model is trained with half precision. From our experiments,
half precision allows a roughly two times larger ``--max-duration``,
giving an almost 2x speedup.
Pre-configured options
~~~~~~~~~~~~~~~~~~~~~~
There are some training options, e.g., number of encoder layers,
encoder dimension, decoder dimension, number of warmup steps etc,
that are not passed from the commandline.
They are pre-configured by the function ``get_params()`` in
`pruned_transducer_stateless4/train.py <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless4/train.py>`_
You don't need to change these pre-configured parameters. If you really need to change
them, please modify ``./pruned_transducer_stateless4/train.py`` directly.
.. NOTE::
The options for `pruned_transducer_stateless5 <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless5/train.py>`__ are a little different from
other recipes. It allows you to configure ``--num-encoder-layers``, ``--dim-feedforward``, ``--nhead``, ``--encoder-dim``, ``--decoder-dim``, ``--joiner-dim`` from the commandline, so that you can train models of different sizes with pruned_transducer_stateless5.
Training logs
~~~~~~~~~~~~~
Training logs and checkpoints are saved in ``--exp-dir`` (e.g., ``pruned_transducer_stateless4/exp``).
You will find the following files in that directory:
- ``epoch-1.pt``, ``epoch-2.pt``, ...
These are checkpoint files saved at the end of each epoch, containing model
``state_dict`` and optimizer ``state_dict``.
To resume training from some checkpoint, say ``epoch-10.pt``, you can use:
.. code-block:: bash
$ ./pruned_transducer_stateless4/train.py --start-epoch 11
- ``checkpoint-436000.pt``, ``checkpoint-438000.pt``, ...
These are checkpoint files saved every ``--save-every-n`` batches,
containing model ``state_dict`` and optimizer ``state_dict``.
To resume training from some checkpoint, say ``checkpoint-436000``, you can use:
.. code-block:: bash
$ ./pruned_transducer_stateless4/train.py --start-batch 436000
- ``tensorboard/``
This folder contains TensorBoard logs. Training loss, validation loss, learning
rate, etc, are recorded in these logs. You can visualize them by:
.. code-block:: bash
$ cd pruned_transducer_stateless4/exp/tensorboard
$ tensorboard dev upload --logdir . --description "pruned transducer training for LibriSpeech with icefall"
It will print something like below:
.. code-block::
TensorFlow installation not found - running with reduced feature set.
Upload started and will continue reading any new data as it's added to the logdir.
To stop uploading, press Ctrl-C.
New experiment created. View your TensorBoard at: https://tensorboard.dev/experiment/QOGSPBgsR8KzcRMmie9JGw/
[2022-11-20T15:50:50] Started scanning logdir.
Uploading 4468 scalars...
[2022-11-20T15:53:02] Total uploaded: 210171 scalars, 0 tensors, 0 binary objects
Listening for new data in logdir...
Note there is a URL in the above output. Click it and you will see
the following screenshot:
.. figure:: images/librispeech-pruned-transducer-tensorboard-log.jpg
:width: 600
:alt: TensorBoard screenshot
:align: center
:target: https://tensorboard.dev/experiment/QOGSPBgsR8KzcRMmie9JGw/
TensorBoard screenshot.
.. hint::
If you don't have access to Google, you can use the following command
to view the TensorBoard log locally:
.. code-block:: bash
cd pruned_transducer_stateless4/exp/tensorboard
tensorboard --logdir . --port 6008
It will print the following message:
.. code-block::
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.8.0 at http://localhost:6008/ (Press CTRL+C to quit)
Now start your browser and go to `<http://localhost:6008>`_ to view the tensorboard
logs.
- ``log/log-train-xxxx``
It is the detailed training log in text format, same as the one
you saw printed to the console during training.
Usage example
~~~~~~~~~~~~~
You can use the following command to start the training using 6 GPUs:
.. code-block:: bash
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5"
./pruned_transducer_stateless4/train.py \
--world-size 6 \
--num-epochs 30 \
--start-epoch 1 \
--exp-dir pruned_transducer_stateless4/exp \
--full-libri 1 \
--max-duration 300
Decoding
--------
The decoding part uses checkpoints saved by the training part, so you have
to run the training part first.
.. hint::
There are two kinds of checkpoints:
- (1) ``epoch-1.pt``, ``epoch-2.pt``, ..., which are saved at the end
of each epoch. You can pass ``--epoch`` to
``pruned_transducer_stateless4/decode.py`` to use them.
- (2) ``checkpoint-436000.pt``, ``checkpoint-438000.pt``, ..., which are saved
every ``--save-every-n`` batches. You can pass ``--iter`` to
``pruned_transducer_stateless4/decode.py`` to use them.
We suggest that you try both types of checkpoints and choose the one
that produces the lowest WERs.
.. code-block:: bash
$ cd egs/librispeech/ASR
$ ./pruned_transducer_stateless4/decode.py --help
shows the options for decoding.
The following shows two examples (for two types of checkpoints):
.. code-block:: bash
for m in greedy_search fast_beam_search modified_beam_search; do
for epoch in 25 20; do
for avg in 7 5 3 1; do
./pruned_transducer_stateless4/decode.py \
--epoch $epoch \
--avg $avg \
--exp-dir pruned_transducer_stateless4/exp \
--max-duration 600 \
--decoding-method $m
done
done
done
.. code-block:: bash
for m in greedy_search fast_beam_search modified_beam_search; do
for iter in 474000; do
for avg in 8 10 12 14 16 18; do
./pruned_transducer_stateless4/decode.py \
--iter $iter \
--avg $avg \
--exp-dir pruned_transducer_stateless4/exp \
--max-duration 600 \
--decoding-method $m
done
done
done
.. Note::
Supported decoding methods are as follows:
- ``greedy_search`` : It takes the symbol with the largest posterior probability
at each frame as the decoding result (a minimal sketch follows this note).
- ``beam_search`` : It implements Algorithm 1 in https://arxiv.org/pdf/1211.3711.pdf and
`espnet/nets/beam_search_transducer.py <https://github.com/espnet/espnet/blob/master/espnet/nets/beam_search_transducer.py#L247>`_
is used as a reference. Basically, it keeps the top-k states for each frame, and expands the kept states with their own contexts to
the next frame.
- ``modified_beam_search`` : It implements the same algorithm as ``beam_search`` above, but it
runs in batch mode with ``--max-sym-per-frame=1`` being hardcoded.
- ``fast_beam_search`` : It implements graph composition between the output ``log_probs`` and
given ``FSAs``. It is hard to describe the details in a few lines of text; you can read
our paper in https://arxiv.org/pdf/2211.00484.pdf or our `rnnt decode code in k2 <https://github.com/k2-fsa/k2/blob/master/k2/csrc/rnnt_decode.h>`_. ``fast_beam_search`` can decode with ``FSAs`` on GPU efficiently.
- ``fast_beam_search_LG`` : Similar to ``fast_beam_search`` above, except that ``fast_beam_search`` uses
a trivial graph that has only one state, while ``fast_beam_search_LG`` uses an LG graph
(with an n-gram LM).
- ``fast_beam_search_nbest`` : It produces the decoding results as follows:
- (1) Use ``fast_beam_search`` to get a lattice
- (2) Select ``num_paths`` paths from the lattice using ``k2.random_paths()``
- (3) Deduplicate the selected paths
- (4) Intersect the selected paths with the lattice and compute the
shortest path from the intersection result
- (5) The path with the largest score is used as the decoding output.
- ``fast_beam_search_nbest_LG`` : It implements the same logic as ``fast_beam_search_nbest``; the
only difference is that it uses ``fast_beam_search_LG`` to generate the lattice.
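For intuition, the sketch below implements transducer greedy search for a
single utterance, emitting at most one symbol per frame. The ``decoder`` and
``joiner`` callables and the tensor shapes are illustrative assumptions, not
icefall's exact interfaces:
.. code-block:: python

import torch

def greedy_search(encoder_out, decoder, joiner, blank_id=0, context_size=2):
    # encoder_out: (T, encoder_dim) for one utterance.
    hyp = [blank_id] * context_size  # decoder context, padded with blanks
    for t in range(encoder_out.size(0)):
        context = torch.tensor([hyp[-context_size:]])
        decoder_out = decoder(context)  # (1, 1, decoder_dim)
        logits = joiner(encoder_out[t].reshape(1, 1, -1), decoder_out)
        y = logits.argmax(dim=-1).item()
        if y != blank_id:  # emit at most one non-blank symbol per frame
            hyp.append(y)
    return hyp[context_size:]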
Export Model
------------
`pruned_transducer_stateless4/export.py <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless4/export.py>`_ supports exporting checkpoints from ``pruned_transducer_stateless4/exp`` in the following ways.
Export ``model.state_dict()``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Checkpoints saved by ``pruned_transducer_stateless4/train.py`` also include
``optimizer.state_dict()``. It is useful for resuming training. But after training,
we are interested only in ``model.state_dict()``. You can use the following
command to extract ``model.state_dict()``.
.. code-block:: bash
# Assume that --epoch 25 --avg 3 produces the smallest WER
# (You can get such information after running ./pruned_transducer_stateless4/decode.py)
epoch=25
avg=3
./pruned_transducer_stateless4/export.py \
--exp-dir ./pruned_transducer_stateless4/exp \
--bpe-model data/lang_bpe_500/bpe.model \
--epoch $epoch \
--avg $avg
It will generate a file ``./pruned_transducer_stateless4/exp/pretrained.pt``.
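For intuition, ``pretrained.pt`` essentially contains only the (averaged) model
weights. A hedged sketch of the effect, ignoring checkpoint averaging and
assuming icefall-style checkpoints that store the weights under the ``"model"`` key:
.. code-block:: python

import torch

# Keep only the model weights; drop optimizer state to shrink the file.
ckpt = torch.load("pruned_transducer_stateless4/exp/epoch-25.pt", map_location="cpu")
torch.save({"model": ckpt["model"]}, "pruned_transducer_stateless4/exp/pretrained.pt")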
.. hint::
To use the generated ``pretrained.pt`` for ``pruned_transducer_stateless4/decode.py``,
you can run:
.. code-block:: bash
cd pruned_transducer_stateless4/exp
ln -s pretrained.pt epoch-999.pt
And then pass ``--epoch 999 --avg 1 --use-averaged-model 0`` to
``./pruned_transducer_stateless4/decode.py``.
To use the exported model with ``./pruned_transducer_stateless4/pretrained.py``, you
can run:
.. code-block:: bash
./pruned_transducer_stateless4/pretrained.py \
--checkpoint ./pruned_transducer_stateless4/exp/pretrained.pt \
--bpe-model ./data/lang_bpe_500/bpe.model \
--method greedy_search \
/path/to/foo.wav \
/path/to/bar.wav
Export model using ``torch.jit.script()``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
./pruned_transducer_stateless4/export.py \
--exp-dir ./pruned_transducer_stateless4/exp \
--bpe-model data/lang_bpe_500/bpe.model \
--epoch 25 \
--avg 3 \
--jit 1
It will generate a file ``cpu_jit.pt`` in the given ``exp_dir``. You can later
load it by ``torch.jit.load("cpu_jit.pt")``.
Note ``cpu`` in the name ``cpu_jit.pt`` means the parameters when loaded into Python
are on CPU. You can use ``to("cuda")`` to move them to a CUDA device.
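For example, a minimal sketch of loading the exported model for inference:
.. code-block:: python

import torch

model = torch.jit.load("pruned_transducer_stateless4/exp/cpu_jit.pt")
model.eval()
model = model.to("cuda")  # optional: move the parameters to a CUDA device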
.. NOTE::
You will need this ``cpu_jit.pt`` when deploying with the Sherpa framework.
Download pretrained models
--------------------------
If you don't want to train from scratch, you can download the pretrained models
by visiting the following links:
- `pruned_transducer_stateless <https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless-2022-03-12>`__
- `pruned_transducer_stateless2 <https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless2-2022-04-29>`__
- `pruned_transducer_stateless4 <https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless4-2022-06-03>`__
- `pruned_transducer_stateless5 <https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless5-2022-07-07>`__
See `<https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md>`_
for the details of the above pretrained models.
Deploy with Sherpa
------------------
Please see `<https://k2-fsa.github.io/sherpa/python/offline_asr/conformer/librispeech.html#>`_
for how to deploy the models in ``sherpa``.
@ -1,404 +0,0 @@
TDNN-LSTM-CTC
=============
This tutorial shows you how to run a TDNN-LSTM-CTC model with the `LibriSpeech <https://www.openslr.org/12>`_ dataset.
.. HINT::
We assume you have read the page :ref:`install icefall` and have setup
the environment for ``icefall``.
Data preparation
----------------
.. code-block:: bash
$ cd egs/librispeech/ASR
$ ./prepare.sh
The script ``./prepare.sh`` handles the data preparation for you, **automagically**.
All you need to do is to run it.
The data preparation contains several stages. You can use the following two
options:
- ``--stage``
- ``--stop-stage``
to control which stage(s) should be run. By default, all stages are executed.
For example,
.. code-block:: bash
$ cd egs/librispeech/ASR
$ ./prepare.sh --stage 0 --stop-stage 0
means to run only stage 0.
To run stage 2 to stage 5, use:
.. code-block:: bash
$ ./prepare.sh --stage 2 --stop-stage 5
We provide the following YouTube video showing how to run ``./prepare.sh``.
.. note::
To get the latest news of `next-gen Kaldi <https://github.com/k2-fsa>`_, please subscribe to
the following YouTube channel by `Nadira Povey <https://www.youtube.com/channel/UC_VaumpkmINz1pNkFXAN9mw>`_:
`<https://www.youtube.com/channel/UC_VaumpkmINz1pNkFXAN9mw>`_
.. youtube:: ofEIoJL-mGM
Training
--------
We now describe the training of the TDNN-LSTM-CTC model, contained in
the `tdnn_lstm_ctc <https://github.com/k2-fsa/icefall/tree/master/egs/librispeech/ASR/tdnn_lstm_ctc>`_
folder.
The command to run the training part is:
.. code-block:: bash
$ cd egs/librispeech/ASR
$ export CUDA_VISIBLE_DEVICES="0,1,2,3"
$ ./tdnn_lstm_ctc/train.py --world-size 4
By default, it will run ``20`` epochs. Training logs and checkpoints are saved
in ``tdnn_lstm_ctc/exp``.
In ``tdnn_lstm_ctc/exp``, you will find the following files:
- ``epoch-0.pt``, ``epoch-1.pt``, ..., ``epoch-19.pt``
These are checkpoint files, containing model ``state_dict`` and optimizer ``state_dict``.
To resume training from some checkpoint, say ``epoch-10.pt``, you can use:
.. code-block:: bash
$ ./tdnn_lstm_ctc/train.py --start-epoch 11
- ``tensorboard/``
This folder contains TensorBoard logs. Training loss, validation loss, learning
rate, etc, are recorded in these logs. You can visualize them by:
.. code-block:: bash
$ cd tdnn_lstm_ctc/exp/tensorboard
$ tensorboard dev upload --logdir . --description "TDNN LSTM training for librispeech with icefall"
- ``log/log-train-xxxx``
It is the detailed training log in text format, same as the one
you saw printed to the console during training.
To see available training options, you can use:
.. code-block:: bash
$ ./tdnn_lstm_ctc/train.py --help
Other training options, e.g., learning rate, results dir, etc., are
pre-configured in the function ``get_params()``
in `tdnn_lstm_ctc/train.py <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/tdnn_lstm_ctc/train.py>`_.
Normally, you don't need to change them. You can change them by modifying the code, if
you want.
Decoding
--------
The decoding part uses checkpoints saved by the training part, so you have
to run the training part first.
The command for decoding is:
.. code-block:: bash
$ export CUDA_VISIBLE_DEVICES="0"
$ ./tdnn_lstm_ctc/decode.py
You will see the WER in the output log.
Decoded results are saved in ``tdnn_lstm_ctc/exp``.
.. code-block:: bash
$ ./tdnn_lstm_ctc/decode.py --help
shows you the available decoding options.
Some commonly used options are:
- ``--epoch``
You can select which checkpoint to be used for decoding.
For instance, ``./tdnn_lstm_ctc/decode.py --epoch 10`` means to use
``./tdnn_lstm_ctc/exp/epoch-10.pt`` for decoding.
- ``--avg``
It's related to model averaging. It specifies the number of checkpoints
to be averaged; the averaged model is used for decoding (a sketch of the averaging follows this list).
For example, the following command:
.. code-block:: bash
$ ./tdnn_lstm_ctc/decode.py --epoch 10 --avg 3
uses the average of ``epoch-8.pt``, ``epoch-9.pt`` and ``epoch-10.pt``
for decoding.
- ``--export``
If it is ``True``, i.e., ``./tdnn_lstm_ctc/decode.py --export 1``, the code
will save the averaged model to ``tdnn_lstm_ctc/exp/pretrained.pt``.
See :ref:`tdnn_lstm_ctc use a pre-trained model` for how to use it.
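As referenced in the ``--avg`` option above, model averaging is essentially an
element-wise mean of the parameters across several checkpoints. A minimal
sketch, assuming icefall-style checkpoints that store the weights under the
``"model"`` key:
.. code-block:: python

import torch

def average_checkpoints(filenames):
    # Element-wise mean of model parameters over several checkpoints.
    avg = torch.load(filenames[0], map_location="cpu")["model"]
    for f in filenames[1:]:
        state = torch.load(f, map_location="cpu")["model"]
        for k in avg:
            avg[k] += state[k]
    for k in avg:
        if avg[k].is_floating_point():
            avg[k] /= len(filenames)
    return avg

# --epoch 10 --avg 3 averages epoch-8.pt, epoch-9.pt and epoch-10.pt
avg = average_checkpoints([f"tdnn_lstm_ctc/exp/epoch-{i}.pt" for i in (8, 9, 10)])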
.. _tdnn_lstm_ctc use a pre-trained model:
Pre-trained Model
-----------------
We have uploaded the pre-trained model to
`<https://huggingface.co/pkufool/icefall_asr_librispeech_tdnn-lstm_ctc>`_.
The following shows you how to use the pre-trained model.
Install kaldifeat
~~~~~~~~~~~~~~~~~
`kaldifeat <https://github.com/csukuangfj/kaldifeat>`_ is used to
extract features for a single sound file or multiple sound files
at the same time.
Please refer to `<https://github.com/csukuangfj/kaldifeat>`_ for installation.
Download the pre-trained model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
$ cd egs/librispeech/ASR
$ mkdir tmp
$ cd tmp
$ git lfs install
$ git clone https://huggingface.co/pkufool/icefall_asr_librispeech_tdnn-lstm_ctc
.. CAUTION::
You have to use ``git lfs`` to download the pre-trained model.
.. CAUTION::
In order to use this pre-trained model, your k2 version has to be v1.7 or later.
After downloading, you will have the following files:
.. code-block:: bash
$ cd egs/librispeech/ASR
$ tree tmp
.. code-block:: bash
tmp/
`-- icefall_asr_librispeech_tdnn-lstm_ctc
|-- README.md
|-- data
| |-- lang_phone
| | |-- HLG.pt
| | |-- tokens.txt
| | `-- words.txt
| `-- lm
| `-- G_4_gram.pt
|-- exp
| `-- pretrained.pt
`-- test_wavs
|-- 1089-134686-0001.flac
|-- 1221-135766-0001.flac
|-- 1221-135766-0002.flac
`-- trans.txt
6 directories, 10 files
**File descriptions**:
- ``data/lang_phone/HLG.pt``
It is the decoding graph.
- ``data/lang_phone/tokens.txt``
It contains tokens and their IDs.
- ``data/lang_phone/words.txt``
It contains words and their IDs.
- ``data/lm/G_4_gram.pt``
It is a 4-gram LM, useful for LM rescoring.
- ``exp/pretrained.pt``
It contains pre-trained model parameters, obtained by averaging
checkpoints from ``epoch-14.pt`` to ``epoch-19.pt``.
Note: We have removed optimizer ``state_dict`` to reduce file size.
- ``test_wavs/*.flac``
It contains some test sound files from the LibriSpeech ``test-clean`` dataset.
- ``test_wavs/trans.txt``
It contains the reference transcripts for the sound files in ``test_wavs/``.
Information about the test sound files is listed below:
.. code-block:: bash
$ soxi tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/*.flac
Input File : 'tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1089-134686-0001.flac'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:06.62 = 106000 samples ~ 496.875 CDDA sectors
File Size : 116k
Bit Rate : 140k
Sample Encoding: 16-bit FLAC
Input File : 'tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0001.flac'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:16.71 = 267440 samples ~ 1253.62 CDDA sectors
File Size : 343k
Bit Rate : 164k
Sample Encoding: 16-bit FLAC
Input File : 'tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0002.flac'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:04.83 = 77200 samples ~ 361.875 CDDA sectors
File Size : 105k
Bit Rate : 174k
Sample Encoding: 16-bit FLAC
Total Duration of 3 files: 00:00:28.16
Inference with a pre-trained model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
$ cd egs/librispeech/ASR
$ ./tdnn_lstm_ctc/pretrained.py --help
shows the usage information of ``./tdnn_lstm_ctc/pretrained.py``.
To decode with the ``1best`` method, we can use:
.. code-block:: bash
./tdnn_lstm_ctc/pretrained.py \
--checkpoint ./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/exp/pretrained.pt \
--words-file ./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/data/lang_phone/words.txt \
--HLG ./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/data/lang_phone/HLG.pt \
./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1089-134686-0001.flac \
./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0001.flac \
./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0002.flac
The output is:
.. code-block::
2021-08-24 16:57:13,315 INFO [pretrained.py:168] device: cuda:0
2021-08-24 16:57:13,315 INFO [pretrained.py:170] Creating model
2021-08-24 16:57:18,331 INFO [pretrained.py:182] Loading HLG from ./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/data/lang_phone/HLG.pt
2021-08-24 16:57:27,581 INFO [pretrained.py:199] Constructing Fbank computer
2021-08-24 16:57:27,584 INFO [pretrained.py:209] Reading sound files: ['./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1089-134686-0001.flac', './tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0001.flac', './tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0002.flac']
2021-08-24 16:57:27,599 INFO [pretrained.py:215] Decoding started
2021-08-24 16:57:27,791 INFO [pretrained.py:245] Use HLG decoding
2021-08-24 16:57:28,098 INFO [pretrained.py:266]
./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1089-134686-0001.flac:
AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0001.flac:
GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0002.flac:
YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION
2021-08-24 16:57:28,099 INFO [pretrained.py:268] Decoding Done
To decode with the ``whole-lattice-rescoring`` method, you can use:
.. code-block:: bash
./tdnn_lstm_ctc/pretrained.py \
--checkpoint ./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/exp/pretrained.pt \
--words-file ./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/data/lang_phone/words.txt \
--HLG ./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/data/lang_phone/HLG.pt \
--method whole-lattice-rescoring \
--G ./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/data/lm/G_4_gram.pt \
--ngram-lm-scale 0.8 \
./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1089-134686-0001.flac \
./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0001.flac \
./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0002.flac
The decoding output is:
.. code-block::
2021-08-24 16:39:24,725 INFO [pretrained.py:168] device: cuda:0
2021-08-24 16:39:24,725 INFO [pretrained.py:170] Creating model
2021-08-24 16:39:29,403 INFO [pretrained.py:182] Loading HLG from ./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/data/lang_phone/HLG.pt
2021-08-24 16:39:40,631 INFO [pretrained.py:190] Loading G from ./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/data/lm/G_4_gram.pt
2021-08-24 16:39:53,098 INFO [pretrained.py:199] Constructing Fbank computer
2021-08-24 16:39:53,107 INFO [pretrained.py:209] Reading sound files: ['./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1089-134686-0001.flac', './tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0001.flac', './tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0002.flac']
2021-08-24 16:39:53,121 INFO [pretrained.py:215] Decoding started
2021-08-24 16:39:53,443 INFO [pretrained.py:250] Use HLG decoding + LM rescoring
2021-08-24 16:39:54,010 INFO [pretrained.py:266]
./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1089-134686-0001.flac:
AFTER EARLY NIGHTFALL THE YELLOW LAMPS WOULD LIGHT UP HERE AND THERE THE SQUALID QUARTER OF THE BROTHELS
./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0001.flac:
GOD AS A DIRECT CONSEQUENCE OF THE SIN WHICH MAN THUS PUNISHED HAD GIVEN HER A LOVELY CHILD WHOSE PLACE WAS ON THAT SAME DISHONORED BOSOM TO CONNECT HER PARENT FOREVER WITH THE RACE AND DESCENT OF MORTALS AND TO BE FINALLY A BLESSED SOUL IN HEAVEN
./tmp/icefall_asr_librispeech_tdnn-lstm_ctc/test_wavs/1221-135766-0002.flac:
YET THESE THOUGHTS AFFECTED HESTER PRYNNE LESS WITH HOPE THAN APPREHENSION
2021-08-24 16:39:54,010 INFO [pretrained.py:268] Decoding Done
Colab notebook
--------------
We provide a colab notebook for decoding with a pre-trained model.
|librispeech tdnn_lstm_ctc colab notebook|
.. |librispeech tdnn_lstm_ctc colab notebook| image:: https://colab.research.google.com/assets/colab-badge.svg
:target: https://colab.research.google.com/drive/1-iSfQMp2So-We_Uu49N4AAcMInB72u9z?usp=sharing
**Congratulations!** You have finished the TDNN-LSTM-CTC recipe on LibriSpeech in ``icefall``.
@ -1,454 +0,0 @@
Zipformer CTC Blank Skip
========================
.. hint::
Please scroll down to the bottom of this page to find download links
for pretrained models if you don't want to train a model from scratch.
This tutorial shows you how to train a Zipformer model based on the guidance from
a co-trained CTC model using the `blank skip method <https://arxiv.org/pdf/2210.16481.pdf>`_
with the `LibriSpeech <https://www.openslr.org/12>`_ dataset.
.. note::
We use both CTC and RNN-T loss to train. During the forward pass, the encoder output
is first used to calculate the CTC posterior probability; then for each output frame,
if its blank posterior is bigger than some threshold, it will be simply discarded
from the encoder output. To prevent information loss, we also put a convolution module
similar to the one used in conformer (referred to as “LConv”) before the frame reduction.
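A minimal sketch of the frame-reduction idea (illustrative shapes and
threshold; not icefall's exact implementation, which also applies the LConv
module before the reduction):
.. code-block:: python

import torch

def skip_blank_frames(encoder_out, ctc_log_probs, threshold=0.9):
    # encoder_out: (T, encoder_dim); ctc_log_probs: (T, vocab), blank at index 0.
    blank_posterior = ctc_log_probs[:, 0].exp()
    keep = blank_posterior < threshold  # keep frames that likely carry symbols
    return encoder_out[keep]

encoder_out = torch.randn(100, 384)
ctc_log_probs = torch.randn(100, 500).log_softmax(dim=-1)
reduced = skip_blank_frames(encoder_out, ctc_log_probs)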
Data preparation
----------------
.. code-block:: bash
$ cd egs/librispeech/ASR
$ ./prepare.sh
The script ``./prepare.sh`` handles the data preparation for you, **automagically**.
All you need to do is to run it.
.. note::
We encourage you to read ``./prepare.sh``.
The data preparation contains several stages. You can use the following two
options:
- ``--stage``
- ``--stop-stage``
to control which stage(s) should be run. By default, all stages are executed.
For example,
.. code-block:: bash
$ cd egs/librispeech/ASR
$ ./prepare.sh --stage 0 --stop-stage 0
means to run only stage 0.
To run stage 2 to stage 5, use:
.. code-block:: bash
$ ./prepare.sh --stage 2 --stop-stage 5
.. hint::
If you have pre-downloaded the `LibriSpeech <https://www.openslr.org/12>`_
dataset and the `musan <http://www.openslr.org/17/>`_ dataset, say,
they are saved in ``/tmp/LibriSpeech`` and ``/tmp/musan``, you can modify
the ``dl_dir`` variable in ``./prepare.sh`` to point to ``/tmp`` so that
``./prepare.sh`` won't re-download them.
.. note::
All files generated by ``./prepare.sh``, e.g., features, lexicon, etc.,
are saved in the ``./data`` directory.
We provide the following YouTube video showing how to run ``./prepare.sh``.
.. note::
To get the latest news of `next-gen Kaldi <https://github.com/k2-fsa>`_, please subscribe to
the following YouTube channel by `Nadira Povey <https://www.youtube.com/channel/UC_VaumpkmINz1pNkFXAN9mw>`_:
`<https://www.youtube.com/channel/UC_VaumpkmINz1pNkFXAN9mw>`_
.. youtube:: ofEIoJL-mGM
Training
--------
For stability, the blank skip method is not used until the model has warmed up.
Configurable options
~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
$ cd egs/librispeech/ASR
$ ./pruned_transducer_stateless7_ctc_bs/train.py --help
shows you the training options that can be passed from the commandline.
The following options are used quite often:
- ``--full-libri``
If it's True, the training part uses all the training data, i.e.,
960 hours. Otherwise, the training part uses only the subset
``train-clean-100``, which has 100 hours of training data.
.. CAUTION::
The training set is perturbed by speed with two factors: 0.9 and 1.1.
If ``--full-libri`` is True, each epoch actually processes
``3x960 == 2880`` hours of data.
- ``--num-epochs``
It is the number of epochs to train. For instance,
``./pruned_transducer_stateless7_ctc_bs/train.py --num-epochs 30`` trains for 30 epochs
and generates ``epoch-1.pt``, ``epoch-2.pt``, ..., ``epoch-30.pt``
in the folder ``./pruned_transducer_stateless7_ctc_bs/exp``.
- ``--start-epoch``
It's used to resume training.
``./pruned_transducer_stateless7_ctc_bs/train.py --start-epoch 10`` loads the
checkpoint ``./pruned_transducer_stateless7_ctc_bs/exp/epoch-9.pt`` and starts
training from epoch 10, based on the state from epoch 9.
- ``--world-size``
It is used for multi-GPU single-machine DDP training.
- (a) If it is 1, then no DDP training is used.
- (b) If it is 2, then GPU 0 and GPU 1 are used for DDP training.
The following shows some use cases with it.
**Use case 1**: You have 4 GPUs, but you only want to use GPU 0 and
GPU 2 for training. You can do the following:
.. code-block:: bash
$ cd egs/librispeech/ASR
$ export CUDA_VISIBLE_DEVICES="0,2"
$ ./pruned_transducer_stateless7_ctc_bs/train.py --world-size 2
**Use case 2**: You have 4 GPUs and you want to use all of them
for training. You can do the following:
.. code-block:: bash
$ cd egs/librispeech/ASR
$ ./pruned_transducer_stateless7_ctc_bs/train.py --world-size 4
**Use case 3**: You have 4 GPUs but you only want to use GPU 3
for training. You can do the following:
.. code-block:: bash
$ cd egs/librispeech/ASR
$ export CUDA_VISIBLE_DEVICES="3"
$ ./pruned_transducer_stateless7_ctc_bs/train.py --world-size 1
.. caution::
Only multi-GPU single-machine DDP training is implemented at present.
Multi-GPU multi-machine DDP training will be added later.
- ``--max-duration``
It specifies the total duration in seconds, over all utterances in a
batch, before **padding**.
If you encounter CUDA OOM, please reduce it.
.. HINT::
Due to padding, the number of seconds of all utterances in a
batch will usually be larger than ``--max-duration``.
A larger value for ``--max-duration`` may cause OOM during training,
while a smaller value may increase the training time. You have to
tune it.
Pre-configured options
~~~~~~~~~~~~~~~~~~~~~~
There are some training options, e.g., weight decay,
number of warmup steps, results dir, etc,
that are not passed from the commandline.
They are pre-configured by the function ``get_params()`` in
`pruned_transducer_stateless7_ctc_bs/train.py <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless7_ctc_bs/train.py>`_
You don't need to change these pre-configured parameters. If you really need to change
them, please modify ``./pruned_transducer_stateless7_ctc_bs/train.py`` directly.
Training logs
~~~~~~~~~~~~~
Training logs and checkpoints are saved in ``pruned_transducer_stateless7_ctc_bs/exp``.
You will find the following files in that directory:
- ``epoch-1.pt``, ``epoch-2.pt``, ...
These are checkpoint files saved at the end of each epoch, containing model
``state_dict`` and optimizer ``state_dict``.
To resume training from some checkpoint, say ``epoch-10.pt``, you can use:
.. code-block:: bash
$ ./pruned_transducer_stateless7_ctc_bs/train.py --start-epoch 11
- ``checkpoint-436000.pt``, ``checkpoint-438000.pt``, ...
These are checkpoint files saved every ``--save-every-n`` batches,
containing model ``state_dict`` and optimizer ``state_dict``.
To resume training from some checkpoint, say ``checkpoint-436000``, you can use:
.. code-block:: bash
$ ./pruned_transducer_stateless7_ctc_bs/train.py --start-batch 436000
- ``tensorboard/``
This folder contains TensorBoard logs. Training loss, validation loss, learning
rate, etc, are recorded in these logs. You can visualize them by:
.. code-block:: bash
$ cd pruned_transducer_stateless7_ctc_bs/exp/tensorboard
$ tensorboard dev upload --logdir . --description "Zipformer-CTC co-training using blank skip for LibriSpeech with icefall"
It will print something like below:
.. code-block::
TensorFlow installation not found - running with reduced feature set.
Upload started and will continue reading any new data as it's added to the logdir.
To stop uploading, press Ctrl-C.
New experiment created. View your TensorBoard at: https://tensorboard.dev/experiment/xyOZUKpEQm62HBIlUD4uPA/
Note there is a URL in the above output. Click it and you will see
the TensorBoard dashboard.
.. hint::
If you don't have access to Google, you can use the following command
to view the TensorBoard log locally:
.. code-block:: bash
cd pruned_transducer_stateless7_ctc_bs/exp/tensorboard
tensorboard --logdir . --port 6008
It will print the following message:
.. code-block::
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.8.0 at http://localhost:6008/ (Press CTRL+C to quit)
Now start your browser and go to `<http://localhost:6008>`_ to view the tensorboard
logs.
- ``log/log-train-xxxx``
It is the detailed training log in text format, same as the one
you saw printed to the console during training.
Usage example
~~~~~~~~~~~~~
You can use the following command to start the training using 4 GPUs:
.. code-block:: bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"
./pruned_transducer_stateless7_ctc_bs/train.py \
--world-size 4 \
--num-epochs 30 \
--start-epoch 1 \
--full-libri 1 \
--exp-dir pruned_transducer_stateless7_ctc_bs/exp \
--max-duration 600 \
--use-fp16 1
Decoding
--------
The decoding part uses checkpoints saved by the training part, so you have
to run the training part first.
.. hint::
There are two kinds of checkpoints:
- (1) ``epoch-1.pt``, ``epoch-2.pt``, ..., which are saved at the end
of each epoch. You can pass ``--epoch`` to
``pruned_transducer_stateless7_ctc_bs/ctc_guide_decode_bs.py`` to use them.
- (2) ``checkpoint-436000.pt``, ``checkpoint-438000.pt``, ..., which are saved
every ``--save-every-n`` batches. You can pass ``--iter`` to
``pruned_transducer_stateless7_ctc_bs/ctc_guide_decode_bs.py`` to use them.
We suggest that you try both types of checkpoints and choose the one
that produces the lowest WERs.
.. code-block:: bash
$ cd egs/librispeech/ASR
$ ./pruned_transducer_stateless7_ctc_bs/ctc_guide_decode_bs.py --help
shows the options for decoding.
The following shows the example using ``epoch-*.pt``:
.. code-block:: bash
for m in greedy_search fast_beam_search modified_beam_search; do
./pruned_transducer_stateless7_ctc_bs/ctc_guide_decode_bs.py \
--epoch 30 \
--avg 13 \
--exp-dir pruned_transducer_stateless7_ctc_bs/exp \
--max-duration 600 \
--decoding-method $m
done
To test the CTC branch, you can use the following command:
.. code-block:: bash
for m in ctc-decoding 1best; do
./pruned_transducer_stateless7_ctc_bs/ctc_guide_decode_bs.py \
--epoch 30 \
--avg 13 \
--exp-dir pruned_transducer_stateless7_ctc_bs/exp \
--max-duration 600 \
--decoding-method $m
done
Export models
-------------
`pruned_transducer_stateless7_ctc_bs/export.py <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless7_ctc_bs/export.py>`_ supports exporting checkpoints from ``pruned_transducer_stateless7_ctc_bs/exp`` in the following ways.
Export ``model.state_dict()``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Checkpoints saved by ``pruned_transducer_stateless7_ctc_bs/train.py`` also include
``optimizer.state_dict()``. It is useful for resuming training. But after training,
we are interested only in ``model.state_dict()``. You can use the following
command to extract ``model.state_dict()``.
.. code-block:: bash
./pruned_transducer_stateless7_ctc_bs/export.py \
--exp-dir ./pruned_transducer_stateless7_ctc_bs/exp \
--bpe-model data/lang_bpe_500/bpe.model \
--epoch 30 \
--avg 13 \
--jit 0
It will generate a file ``./pruned_transducer_stateless7_ctc_bs/exp/pretrained.pt``.
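In case you wonder what the extraction amounts to, here is a minimal Python
sketch; it assumes the checkpoint is a dict holding the model weights under
the key ``model`` (an assumption; the real key names are defined in
``train.py``):

.. code-block:: python

   import torch

   # Load a full training checkpoint (model + optimizer state).
   ckpt = torch.load(
       "pruned_transducer_stateless7_ctc_bs/exp/epoch-30.pt",
       map_location="cpu",
   )

   # Keep only the model weights and drop the optimizer state.
   torch.save({"model": ckpt["model"]}, "model-only.pt")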
.. hint::
To use the generated ``pretrained.pt`` for ``pruned_transducer_stateless7_ctc_bs/ctc_guide_decode_bs.py``,
you can run:
.. code-block:: bash
cd pruned_transducer_stateless7_ctc_bs/exp
ln -s pretrained.pt epoch-9999.pt
And then pass ``--epoch 9999 --avg 1 --use-averaged-model 0`` to
``./pruned_transducer_stateless7_ctc_bs/ctc_guide_decode_bs.py``.
To use the exported model with ``./pruned_transducer_stateless7_ctc_bs/pretrained.py``, you
can run:
.. code-block:: bash
./pruned_transducer_stateless7_ctc_bs/pretrained.py \
--checkpoint ./pruned_transducer_stateless7_ctc_bs/exp/pretrained.pt \
--bpe-model ./data/lang_bpe_500/bpe.model \
--method greedy_search \
/path/to/foo.wav \
/path/to/bar.wav
To test the CTC branch using the exported model with ``./pruned_transducer_stateless7_ctc_bs/pretrained_ctc.py``:
.. code-block:: bash
./pruned_transducer_stateless7_ctc_bs/pretrained_ctc.py \
--checkpoint ./pruned_transducer_stateless7_ctc_bs/exp/pretrained.pt \
--bpe-model data/lang_bpe_500/bpe.model \
--method ctc-decoding \
--sample-rate 16000 \
/path/to/foo.wav \
/path/to/bar.wav
Export model using ``torch.jit.script()``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
./pruned_transducer_stateless7_ctc_bs/export.py \
--exp-dir ./pruned_transducer_stateless7_ctc_bs/exp \
--bpe-model data/lang_bpe_500/bpe.model \
--epoch 30 \
--avg 13 \
--jit 1
It will generate a file ``cpu_jit.pt`` in the given ``exp_dir``. You can later
load it by ``torch.jit.load("cpu_jit.pt")``.
Note: ``cpu`` in the name ``cpu_jit.pt`` means that the parameters, when loaded
into Python, are on CPU. You can use ``to("cuda")`` to move them to a CUDA device.
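A minimal loading sketch (the path assumes the export command above was run):

.. code-block:: python

   import torch

   model = torch.jit.load("pruned_transducer_stateless7_ctc_bs/exp/cpu_jit.pt")
   model.eval()
   model.to("cuda")  # optional: move the parameters to a CUDA device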
To use the generated files with ``./pruned_transducer_stateless7_ctc_bs/jit_pretrained.py``:
.. code-block:: bash
./pruned_transducer_stateless7_ctc_bs/jit_pretrained.py \
--nn-model-filename ./pruned_transducer_stateless7_ctc_bs/exp/cpu_jit.pt \
/path/to/foo.wav \
/path/to/bar.wav
To test the CTC branch using the generated files with ``./pruned_transducer_stateless7_ctc_bs/jit_pretrained_ctc.py``:
.. code-block:: bash
./pruned_transducer_stateless7_ctc_bs/jit_pretrained_ctc.py \
--model-filename ./pruned_transducer_stateless7_ctc_bs/exp/cpu_jit.pt \
--bpe-model data/lang_bpe_500/bpe.model \
--method ctc-decoding \
--sample-rate 16000 \
/path/to/foo.wav \
/path/to/bar.wav
Download pretrained models
--------------------------
If you don't want to train from scratch, you can download the pretrained models
by visiting the following links:
- trained on LibriSpeech 100h: `<https://huggingface.co/yfyeung/icefall-asr-librispeech-pruned_transducer_stateless7_ctc_bs-2022-12-14>`_
- trained on LibriSpeech 960h: `<https://huggingface.co/yfyeung/icefall-asr-librispeech-pruned_transducer_stateless7_ctc_bs-2023-01-29>`_
See `<https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md>`_
for the details of the above pretrained models.
Zipformer MMI
===============
.. hint::
Please scroll down to the bottom of this page to find download links
for pretrained models if you don't want to train a model from scratch.
This tutorial shows you how to train a Zipformer MMI model
with the `LibriSpeech <https://www.openslr.org/12>`_ dataset.
We use LF-MMI to compute the loss.
.. note::
You can find the document about LF-MMI training at the following address:
`<https://github.com/k2-fsa/next-gen-kaldi-wechat/blob/master/pdf/LF-MMI-training-and-decoding-in-k2-Part-I.pdf>`_
Data preparation
----------------
.. code-block:: bash
$ cd egs/librispeech/ASR
$ ./prepare.sh
The script ``./prepare.sh`` handles the data preparation for you, **automagically**.
All you need to do is to run it.
.. note::
We encourage you to read ``./prepare.sh``.
The data preparation contains several stages. You can use the following two
options:
- ``--stage``
- ``--stop-stage``
to control which stage(s) should be run. By default, all stages are executed.
For example,
.. code-block:: bash
$ cd egs/librispeech/ASR
$ ./prepare.sh --stage 0 --stop-stage 0
means to run only stage 0.
To run stage 2 to stage 5, use:
.. code-block:: bash
$ ./prepare.sh --stage 2 --stop-stage 5
.. hint::
If you have pre-downloaded the `LibriSpeech <https://www.openslr.org/12>`_
dataset and the `musan <http://www.openslr.org/17/>`_ dataset, say,
they are saved in ``/tmp/LibriSpeech`` and ``/tmp/musan``, you can modify
the ``dl_dir`` variable in ``./prepare.sh`` to point to ``/tmp`` so that
``./prepare.sh`` won't re-download them.
.. note::
All generated files by ``./prepare.sh``, e.g., features, lexicon, etc,
are saved in ``./data`` directory.
We provide the following YouTube video showing how to run ``./prepare.sh``.
.. note::
To get the latest news of `next-gen Kaldi <https://github.com/k2-fsa>`_, please subscribe to
the following YouTube channel by `Nadira Povey <https://www.youtube.com/channel/UC_VaumpkmINz1pNkFXAN9mw>`_:
`<https://www.youtube.com/channel/UC_VaumpkmINz1pNkFXAN9mw>`_
.. youtube:: ofEIoJL-mGM
Training
--------
For stability, training uses the CTC loss for model warm-up and then switches to the MMI loss.
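The scheme can be summarized by the sketch below; the switch point and the
function signature are illustrative, not the exact logic of
``zipformer_mmi/train.py``:

.. code-block:: python

   def compute_loss(batch_idx_train: int, ctc_loss, mmi_loss,
                    warmup_batches: int = 2000):
       # Use the CTC loss for the first few thousand batches to keep
       # training stable, then switch to the LF-MMI loss.
       if batch_idx_train < warmup_batches:
           return ctc_loss
       return mmi_loss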
Configurable options
~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
$ cd egs/librispeech/ASR
$ ./zipformer_mmi/train.py --help
shows you the training options that can be passed from the commandline.
The following options are used quite often:
- ``--full-libri``
If it's True, the training part uses all the training data, i.e.,
960 hours. Otherwise, the training part uses only the subset
``train-clean-100``, which has 100 hours of training data.
.. CAUTION::
The training set is perturbed by speed with two factors: 0.9 and 1.1.
If ``--full-libri`` is True, each epoch actually processes
``3x960 == 2880`` hours of data.
- ``--num-epochs``
It is the number of epochs to train. For instance,
``./zipformer_mmi/train.py --num-epochs 30`` trains for 30 epochs
and generates ``epoch-1.pt``, ``epoch-2.pt``, ..., ``epoch-30.pt``
in the folder ``./zipformer_mmi/exp``.
- ``--start-epoch``
It's used to resume training.
``./zipformer_mmi/train.py --start-epoch 10`` loads the
checkpoint ``./zipformer_mmi/exp/epoch-9.pt`` and starts
training from epoch 10, based on the state from epoch 9.
- ``--world-size``
It is used for multi-GPU single-machine DDP training.
- (a) If it is 1, then no DDP training is used.
- (b) If it is 2, then GPU 0 and GPU 1 are used for DDP training.
The following shows some use cases with it.
**Use case 1**: You have 4 GPUs, but you only want to use GPU 0 and
GPU 2 for training. You can do the following:
.. code-block:: bash
$ cd egs/librispeech/ASR
$ export CUDA_VISIBLE_DEVICES="0,2"
$ ./zipformer_mmi/train.py --world-size 2
**Use case 2**: You have 4 GPUs and you want to use all of them
for training. You can do the following:
.. code-block:: bash
$ cd egs/librispeech/ASR
$ ./zipformer_mmi/train.py --world-size 4
**Use case 3**: You have 4 GPUs but you only want to use GPU 3
for training. You can do the following:
.. code-block:: bash
$ cd egs/librispeech/ASR
$ export CUDA_VISIBLE_DEVICES="3"
$ ./zipformer_mmi/train.py --world-size 1
.. caution::
Only multi-GPU single-machine DDP training is implemented at present.
Multi-GPU multi-machine DDP training will be added later.
- ``--max-duration``
It specifies the number of seconds over all utterances in a
batch, before **padding**.
If you encounter CUDA OOM, please reduce it.
.. HINT::
Due to padding, the number of seconds of all utterances in a
batch will usually be larger than ``--max-duration``.
A larger value for ``--max-duration`` may cause OOM during training,
while a smaller value may increase the training time. You have to
tune it.
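To make the role of ``--max-duration`` concrete, below is a minimal sketch of
how a duration-capped sampler is typically built with lhotse (the cuts path is
illustrative):

.. code-block:: python

   from lhotse import CutSet
   from lhotse.dataset import DynamicBucketingSampler

   cuts = CutSet.from_file(
       "data/fbank/librispeech_cuts_train-clean-100.jsonl.gz"
   )

   # Each mini-batch holds at most ~500 seconds of audio,
   # counted before padding.
   sampler = DynamicBucketingSampler(cuts, max_duration=500.0, shuffle=True)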
Pre-configured options
~~~~~~~~~~~~~~~~~~~~~~
There are some training options, e.g., weight decay,
number of warmup steps, results dir, etc,
that are not passed from the commandline.
They are pre-configured by the function ``get_params()`` in
`zipformer_mmi/train.py <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/zipformer_mmi/train.py>`_
You don't need to change these pre-configured parameters. If you really need to change
them, please modify ``./zipformer_mmi/train.py`` directly.
Training logs
~~~~~~~~~~~~~
Training logs and checkpoints are saved in ``zipformer_mmi/exp``.
You will find the following files in that directory:
- ``epoch-1.pt``, ``epoch-2.pt``, ...
These are checkpoint files saved at the end of each epoch, containing model
``state_dict`` and optimizer ``state_dict``.
To resume training from some checkpoint, say ``epoch-10.pt``, you can use:
.. code-block:: bash
$ ./zipformer_mmi/train.py --start-epoch 11
- ``checkpoint-436000.pt``, ``checkpoint-438000.pt``, ...
These are checkpoint files saved every ``--save-every-n`` batches,
containing model ``state_dict`` and optimizer ``state_dict``.
To resume training from some checkpoint, say ``checkpoint-436000``, you can use:
.. code-block:: bash
$ ./zipformer_mmi/train.py --start-batch 436000
- ``tensorboard/``
This folder contains TensorBoard logs. Training loss, validation loss, learning
rate, etc, are recorded in these logs. You can visualize them by:
.. code-block:: bash
$ cd zipformer_mmi/exp/tensorboard
$ tensorboard dev upload --logdir . --description "Zipformer MMI training for LibriSpeech with icefall"
It will print something like the following:
.. code-block::
TensorFlow installation not found - running with reduced feature set.
Upload started and will continue reading any new data as it's added to the logdir.
To stop uploading, press Ctrl-C.
New experiment created. View your TensorBoard at: https://tensorboard.dev/experiment/xyOZUKpEQm62HBIlUD4uPA/
Note there is a URL in the above output. Click it and you will see the
TensorBoard interface.
.. hint::
If you don't have access to Google, you can use the following command
to view the tensorboard log locally:
.. code-block:: bash
cd zipformer_mmi/exp/tensorboard
tensorboard --logdir . --port 6008
It will print the following message:
.. code-block::
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.8.0 at http://localhost:6008/ (Press CTRL+C to quit)
Now start your browser and go to `<http://localhost:6008>`_ to view the
TensorBoard logs.
- ``log/log-train-xxxx``
It is the detailed training log in text format, same as the one
you saw printed to the console during training.
Usage example
~~~~~~~~~~~~~
You can use the following command to start the training using 4 GPUs:
.. code-block:: bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"
./zipformer_mmi/train.py \
--world-size 4 \
--num-epochs 30 \
--start-epoch 1 \
--full-libri 1 \
--exp-dir zipformer_mmi/exp \
--max-duration 500 \
--use-fp16 1 \
--num-workers 2
Decoding
--------
The decoding part uses checkpoints saved by the training part, so you have
to run the training part first.
.. hint::
There are two kinds of checkpoints:
- (1) ``epoch-1.pt``, ``epoch-2.pt``, ..., which are saved at the end
of each epoch. You can pass ``--epoch`` to
``zipformer_mmi/decode.py`` to use them.
- (2) ``checkpoint-436000.pt``, ``checkpoint-438000.pt``, ..., which are saved
every ``--save-every-n`` batches. You can pass ``--iter`` to
``zipformer_mmi/decode.py`` to use them.
We suggest that you try both types of checkpoints and choose the one
that produces the lowest WERs.
.. code-block:: bash
$ cd egs/librispeech/ASR
$ ./zipformer_mmi/decode.py --help
shows the options for decoding.
The following shows the example using ``epoch-*.pt``:
.. code-block:: bash
for m in nbest nbest-rescoring-LG nbest-rescoring-3-gram nbest-rescoring-4-gram; do
./zipformer_mmi/decode.py \
--epoch 30 \
--avg 10 \
--exp-dir ./zipformer_mmi/exp/ \
--max-duration 100 \
--lang-dir data/lang_bpe_500 \
--nbest-scale 1.2 \
--hp-scale 1.0 \
--decoding-method $m
done
Export models
-------------
`zipformer_mmi/export.py <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/zipformer_mmi/export.py>`_ supports exporting checkpoints from ``zipformer_mmi/exp`` in the following ways.
Export ``model.state_dict()``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Checkpoints saved by ``zipformer_mmi/train.py`` also include
``optimizer.state_dict()``. It is useful for resuming training. But after training,
we are interested only in ``model.state_dict()``. You can use the following
command to extract ``model.state_dict()``.
.. code-block:: bash
./zipformer_mmi/export.py \
--exp-dir ./zipformer_mmi/exp \
--bpe-model data/lang_bpe_500/bpe.model \
--epoch 30 \
--avg 9 \
--jit 0
It will generate a file ``./zipformer_mmi/exp/pretrained.pt``.
.. hint::
To use the generated ``pretrained.pt`` for ``zipformer_mmi/decode.py``,
you can run:
.. code-block:: bash
cd zipformer_mmi/exp
ln -s pretrained.pt epoch-9999.pt
And then pass ``--epoch 9999 --avg 1 --use-averaged-model 0`` to
``./zipformer_mmi/decode.py``.
To use the exported model with ``./zipformer_mmi/pretrained.py``, you
can run:
.. code-block:: bash
./zipformer_mmi/pretrained.py \
--checkpoint ./zipformer_mmi/exp/pretrained.pt \
--bpe-model ./data/lang_bpe_500/bpe.model \
--method 1best \
/path/to/foo.wav \
/path/to/bar.wav
Export model using ``torch.jit.script()``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
./zipformer_mmi/export.py \
--exp-dir ./zipformer_mmi/exp \
--bpe-model data/lang_bpe_500/bpe.model \
--epoch 30 \
--avg 9 \
--jit 1
It will generate a file ``cpu_jit.pt`` in the given ``exp_dir``. You can later
load it by ``torch.jit.load("cpu_jit.pt")``.
Note: ``cpu`` in the name ``cpu_jit.pt`` means that the parameters, when loaded
into Python, are on CPU. You can use ``to("cuda")`` to move them to a CUDA device.
To use the generated files with ``./zipformer_mmi/jit_pretrained.py``:
.. code-block:: bash
./zipformer_mmi/jit_pretrained.py \
--nn-model-filename ./zipformer_mmi/exp/cpu_jit.pt \
--bpe-model ./data/lang_bpe_500/bpe.model \
--method 1best \
/path/to/foo.wav \
/path/to/bar.wav
Download pretrained models
--------------------------
If you don't want to train from scratch, you can download the pretrained models
by visiting the following links:
- `<https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-mmi-2022-12-08>`_
See `<https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md>`_
for the details of the above pretrained models.
TIMIT
=====
.. toctree::
:maxdepth: 1
tdnn_ligru_ctc
tdnn_lstm_ctc
TDNN-LiGRU-CTC
==============
This tutorial shows you how to run a TDNN-LiGRU-CTC model with the `TIMIT <https://data.deepai.org/timit.zip>`_ dataset.
.. HINT::
We assume you have read the page :ref:`install icefall` and have setup
the environment for ``icefall``.
Data preparation
----------------
.. code-block:: bash
$ cd egs/timit/ASR
$ ./prepare.sh
The script ``./prepare.sh`` handles the data preparation for you, **automagically**.
All you need to do is to run it.
The data preparation contains several stages. You can use the following two
options:
- ``--stage``
- ``--stop-stage``
to control which stage(s) should be run. By default, all stages are executed.
For example,
.. code-block:: bash
$ cd egs/timit/ASR
$ ./prepare.sh --stage 0 --stop-stage 0
means to run only stage 0.
To run stage 2 to stage 5, use:
.. code-block:: bash
$ ./prepare.sh --stage 2 --stop-stage 5
Training
--------
This section describes the training of the TDNN-LiGRU-CTC model, contained in
the `tdnn_ligru_ctc <https://github.com/k2-fsa/icefall/tree/master/egs/timit/ASR/tdnn_ligru_ctc>`_
folder.
.. HINT::
TIMIT is a very small dataset. So one GPU is enough.
The command to run the training part is:
.. code-block:: bash
$ cd egs/timit/ASR
$ export CUDA_VISIBLE_DEVICES="0"
$ ./tdnn_ligru_ctc/train.py
By default, it will run ``25`` epochs. Training logs and checkpoints are saved
in ``tdnn_ligru_ctc/exp``.
In ``tdnn_ligru_ctc/exp``, you will find the following files:
- ``epoch-0.pt``, ``epoch-1.pt``, ..., ``epoch-25.pt``
These are checkpoint files, containing model ``state_dict`` and optimizer ``state_dict``.
To resume training from some checkpoint, say ``epoch-10.pt``, you can use:
.. code-block:: bash
$ ./tdnn_ligru_ctc/train.py --start-epoch 11
- ``tensorboard/``
This folder contains TensorBoard logs. Training loss, validation loss, learning
rate, etc, are recorded in these logs. You can visualize them by:
.. code-block:: bash
$ cd tdnn_ligru_ctc/exp/tensorboard
$ tensorboard dev upload --logdir . --description "TDNN ligru training for timit with icefall"
- ``log/log-train-xxxx``
It is the detailed training log in text format, same as the one
you saw printed to the console during training.
To see available training options, you can use:
.. code-block:: bash
$ ./tdnn_ligru_ctc/train.py --help
Other training options, e.g., learning rate, results dir, etc., are
pre-configured in the function ``get_params()``
in `tdnn_ligru_ctc/train.py <https://github.com/k2-fsa/icefall/blob/master/egs/timit/ASR/tdnn_ligru_ctc/train.py>`_.
Normally, you don't need to change them. You can change them by modifying the code, if
you want.
Decoding
--------
The decoding part uses checkpoints saved by the training part, so you have
to run the training part first.
The command for decoding is:
.. code-block:: bash
$ export CUDA_VISIBLE_DEVICES="0"
$ ./tdnn_ligru_ctc/decode.py
You will see the WER in the output log.
Decoded results are saved in ``tdnn_ligru_ctc/exp``.
.. code-block:: bash
$ ./tdnn_ligru_ctc/decode.py --help
shows you the available decoding options.
Some commonly used options are:
- ``--epoch``
You can select which checkpoint to use for decoding.
For instance, ``./tdnn_ligru_ctc/decode.py --epoch 10`` means to use
``./tdnn_ligru_ctc/exp/epoch-10.pt`` for decoding.
- ``--avg``
It's related to model averaging. It specifies the number of checkpoints
to be averaged. The averaged model is used for decoding.
For example, the following command:
.. code-block:: bash
$ ./tdnn_ligru_ctc/decode.py --epoch 25 --avg 17
uses the average of ``epoch-9.pt``, ``epoch-10.pt``, ``epoch-11.pt``,
``epoch-12.pt``, ``epoch-13.pt``, ``epoch-14.pt``, ``epoch-15.pt``,
``epoch-16.pt``, ``epoch-17.pt``, ``epoch-18.pt``, ``epoch-19.pt``,
``epoch-20.pt``, ``epoch-21.pt``, ``epoch-22.pt``, ``epoch-23.pt``,
``epoch-24.pt`` and ``epoch-25.pt``
for decoding. A minimal sketch of this averaging is given after this list.
- ``--export``
If it is ``True``, i.e., ``./tdnn_ligru_ctc/decode.py --export 1``, the code
will save the averaged model to ``tdnn_ligru_ctc/exp/pretrained.pt``.
See :ref:`tdnn_ligru_ctc use a pre-trained model` for how to use it.
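Model averaging, as used by ``--avg``, is essentially an element-wise mean of
the parameters of the selected checkpoints. The sketch below assumes each
checkpoint stores its weights under the key ``model`` (an assumption; the real
helper in icefall handles more details):

.. code-block:: python

   import torch

   paths = [f"tdnn_ligru_ctc/exp/epoch-{i}.pt" for i in range(9, 26)]

   avg = None
   for p in paths:
       state = torch.load(p, map_location="cpu")["model"]
       if avg is None:
           avg = {k: v.clone().float() for k, v in state.items()}
       else:
           for k in avg:
               avg[k] += state[k].float()

   for k in avg:  # divide by the number of checkpoints
       avg[k] /= len(paths)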
.. _tdnn_ligru_ctc use a pre-trained model:
Pre-trained Model
-----------------
We have uploaded the pre-trained model to
`<https://huggingface.co/luomingshuang/icefall_asr_timit_tdnn_ligru_ctc>`_.
The following shows you how to use the pre-trained model.
Install kaldifeat
~~~~~~~~~~~~~~~~~
`kaldifeat <https://github.com/csukuangfj/kaldifeat>`_ is used to
extract features for a single sound file or multiple sound files
at the same time.
Please refer to `<https://github.com/csukuangfj/kaldifeat>`_ for installation.
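A minimal feature-extraction sketch with kaldifeat (assuming a 16 kHz mono
waveform, matching the test waves below; the random tensor is a placeholder):

.. code-block:: python

   import torch
   import kaldifeat

   opts = kaldifeat.FbankOptions()
   opts.frame_opts.samp_freq = 16000

   fbank = kaldifeat.Fbank(opts)

   wave = torch.rand(16000)  # placeholder for 1 second of audio
   features = fbank(wave)    # a 2-D tensor: (num_frames, num_mel_bins)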
Download the pre-trained model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
$ cd egs/timit/ASR
$ mkdir tmp-ligru
$ cd tmp-ligru
$ git lfs install
$ git clone https://huggingface.co/luomingshuang/icefall_asr_timit_tdnn_ligru_ctc
.. CAUTION::
You have to use ``git lfs`` to download the pre-trained model.
.. CAUTION::
In order to use this pre-trained model, your k2 version has to be v1.7 or later.
After downloading, you will have the following files:
.. code-block:: bash
$ cd egs/timit/ASR
$ tree tmp-ligru
.. code-block:: bash
tmp-ligru/
`-- icefall_asr_timit_tdnn_ligru_ctc
|-- README.md
|-- data
| |-- lang_phone
| | |-- HLG.pt
| | |-- tokens.txt
| | `-- words.txt
| `-- lm
| `-- G_4_gram.pt
|-- exp
| `-- pretrained_average_9_25.pt
`-- test_waves
|-- FDHC0_SI1559.WAV
|-- FELC0_SI756.WAV
|-- FMGD0_SI1564.WAV
`-- trans.txt
6 directories, 10 files
**File descriptions**:
- ``data/lang_phone/HLG.pt``
It is the decoding graph.
- ``data/lang_phone/tokens.txt``
It contains tokens and their IDs.
- ``data/lang_phone/words.txt``
It contains words and their IDs.
- ``data/lm/G_4_gram.pt``
It is a 4-gram LM, useful for LM rescoring.
- ``exp/pretrained_average_9_25.pt``
It contains pre-trained model parameters, obtained by averaging
checkpoints from ``epoch-9.pt`` to ``epoch-25.pt``.
Note: We have removed optimizer ``state_dict`` to reduce file size.
- ``test_waves/*.WAV``
It contains some test sound files from the TIMIT ``TEST`` dataset.
- ``test_waves/trans.txt``
It contains the reference transcripts for the sound files in ``test_waves/``.
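For reference, the decoding graph and the 4-gram LM are k2 FSAs serialized
with ``torch.save``; a minimal loading sketch (paths follow the tree above):

.. code-block:: python

   import k2
   import torch

   d = "tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc"

   # The decoding graph used for 1best decoding.
   HLG = k2.Fsa.from_dict(
       torch.load(f"{d}/data/lang_phone/HLG.pt", map_location="cpu")
   )

   # The 4-gram LM used for whole-lattice rescoring.
   G = k2.Fsa.from_dict(
       torch.load(f"{d}/data/lm/G_4_gram.pt", map_location="cpu")
   )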
The information of the test sound files is listed below:
.. code-block:: bash
$ ffprobe -show_format tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FDHC0_SI1559.WAV
Input #0, nistsphere, from 'tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FDHC0_SI1559.WAV':
Metadata:
database_id : TIMIT
database_version: 1.0
utterance_id : dhc0_si1559
sample_min : -4176
sample_max : 5984
Duration: 00:00:03.40, bitrate: 258 kb/s
Stream #0:0: Audio: pcm_s16le, 16000 Hz, 1 channels, s16, 256 kb/s
$ ffprobe -show_format tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FELC0_SI756.WAV
Input #0, nistsphere, from 'tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FELC0_SI756.WAV':
Metadata:
database_id : TIMIT
database_version: 1.0
utterance_id : elc0_si756
sample_min : -1546
sample_max : 1989
Duration: 00:00:04.19, bitrate: 257 kb/s
Stream #0:0: Audio: pcm_s16le, 16000 Hz, 1 channels, s16, 256 kb/s
$ ffprobe -show_format tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FMGD0_SI1564.WAV
Input #0, nistsphere, from 'tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FMGD0_SI1564.WAV':
Metadata:
database_id : TIMIT
database_version: 1.0
utterance_id : mgd0_si1564
sample_min : -7626
sample_max : 10573
Duration: 00:00:04.44, bitrate: 257 kb/s
Stream #0:0: Audio: pcm_s16le, 16000 Hz, 1 channels, s16, 256 kb/s
Inference with a pre-trained model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
$ cd egs/timit/ASR
$ ./tdnn_ligru_ctc/pretrained.py --help
shows the usage information of ``./tdnn_ligru_ctc/pretrained.py``.
To decode with the ``1best`` method, we can use:
.. code-block:: bash
./tdnn_ligru_ctc/pretrained.py \
--method 1best \
--checkpoint ./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/exp/pretrained_average_9_25.pt \
--words-file ./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/data/lang_phone/words.txt \
--HLG ./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/data/lang_phone/HLG.pt \
./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FDHC0_SI1559.WAV \
./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FELC0_SI756.WAV \
./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FMGD0_SI1564.WAV
The output is:
.. code-block::
2021-11-08 20:41:33,660 INFO [pretrained.py:169] device: cuda:0
2021-11-08 20:41:33,660 INFO [pretrained.py:171] Creating model
2021-11-08 20:41:38,680 INFO [pretrained.py:183] Loading HLG from ./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/data/lang_phone/HLG.pt
2021-11-08 20:41:38,695 INFO [pretrained.py:200] Constructing Fbank computer
2021-11-08 20:41:38,697 INFO [pretrained.py:210] Reading sound files: ['./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FDHC0_SI1559.WAV', './tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FELC0_SI756.WAV', './tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FMGD0_SI1564.WAV']
2021-11-08 20:41:38,704 INFO [pretrained.py:216] Decoding started
2021-11-08 20:41:39,819 INFO [pretrained.py:246] Use HLG decoding
2021-11-08 20:41:39,829 INFO [pretrained.py:267]
./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FDHC0_SI1559.WAV:
sil dh ih sh uw ah l iy v iy z ih sil p r aa sil k s ih m ey dx ih sil d w uh dx ih w ih s f iy l ih ng w ih th ih n ih m s eh l f sil jh
./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FELC0_SI756.WAV:
sil m ih sil t ih r iy s sil s er r ih m ih sil m aa l ih sil k l ey sil r eh sil d w ay sil d aa r sil b ah f sil jh
./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FMGD0_SI1564.WAV:
sil hh ah z sil b ih sil g r iy w ah z sil d aw n ih sil b ay s sil n ey sil w eh l f eh n s ih z eh n dh eh r w er sil g r ey z ih ng sil k ae dx l sil
2021-11-08 20:41:39,829 INFO [pretrained.py:269] Decoding Done
To decode with the ``whole-lattice-rescoring`` method, you can use:
.. code-block:: bash
./tdnn_ligru_ctc/pretrained.py \
--method whole-lattice-rescoring \
--checkpoint ./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/exp/pretrained_average_9_25.pt \
--words-file ./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/data/lang_phone/words.txt \
--HLG ./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/data/lang_phone/HLG.pt \
--G ./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/data/lm/G_4_gram.pt \
--ngram-lm-scale 0.1 \
./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FDHC0_SI1559.WAV \
./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FELC0_SI756.WAV \
./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FMGD0_SI1564.WAV
The decoding output is:
.. code-block::
2021-11-08 20:37:50,693 INFO [pretrained.py:169] device: cuda:0
2021-11-08 20:37:50,693 INFO [pretrained.py:171] Creating model
2021-11-08 20:37:54,693 INFO [pretrained.py:183] Loading HLG from ./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/data/lang_phone/HLG.pt
2021-11-08 20:37:54,705 INFO [pretrained.py:191] Loading G from ./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/data/lm/G_4_gram.pt
2021-11-08 20:37:54,714 INFO [pretrained.py:200] Constructing Fbank computer
2021-11-08 20:37:54,715 INFO [pretrained.py:210] Reading sound files: ['./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FDHC0_SI1559.WAV', './tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FELC0_SI756.WAV', './tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FMGD0_SI1564.WAV']
2021-11-08 20:37:54,720 INFO [pretrained.py:216] Decoding started
2021-11-08 20:37:55,808 INFO [pretrained.py:251] Use HLG decoding + LM rescoring
2021-11-08 20:37:56,348 INFO [pretrained.py:267]
./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FDHC0_SI1559.WAV:
sil dh ih sh uw ah l iy v iy z ah sil p r aa sil k s ih m ey dx ih sil d w uh dx iy w ih s f iy l iy ng w ih th ih n ih m s eh l f sil jh
./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FELC0_SI756.WAV:
sil m ih sil t ih r iy l s sil s er r eh m ih sil m aa l ih ng sil k l ey sil r eh sil d w ay sil d aa r sil b ah f sil jh ch
./tmp-ligru/icefall_asr_timit_tdnn_ligru_ctc/test_waves/FMGD0_SI1564.WAV:
sil hh ah z sil b ih n sil g r iy w ah z sil b aw n ih sil b ay s sil n ey sil w er l f eh n s ih z eh n dh eh r w er sil g r ey z ih ng sil k ae dx l sil
2021-11-08 20:37:56,348 INFO [pretrained.py:269] Decoding Done
Colab notebook
--------------
We provide a colab notebook for decoding with the pre-trained model.
|timit tdnn_ligru_ctc colab notebook|
.. |timit tdnn_ligru_ctc colab notebook| image:: https://colab.research.google.com/assets/colab-badge.svg
:target: https://colab.research.google.com/drive/11IT-k4HQIgQngXz1uvWsEYktjqQt7Tmb
**Congratulations!** You have finished the TDNN-LiGRU-CTC recipe on TIMIT in ``icefall``.
TDNN-LSTM-CTC
=============
This tutorial shows you how to run a TDNN-LSTM-CTC model with the `TIMIT <https://data.deepai.org/timit.zip>`_ dataset.
.. HINT::
We assume you have read the page :ref:`install icefall` and have setup
the environment for ``icefall``.
Data preparation
----------------
.. code-block:: bash
$ cd egs/timit/ASR
$ ./prepare.sh
The script ``./prepare.sh`` handles the data preparation for you, **automagically**.
All you need to do is to run it.
The data preparation contains several stages. You can use the following two
options:
- ``--stage``
- ``--stop-stage``
to control which stage(s) should be run. By default, all stages are executed.
For example,
.. code-block:: bash
$ cd egs/timit/ASR
$ ./prepare.sh --stage 0 --stop-stage 0
means to run only stage 0.
To run stage 2 to stage 5, use:
.. code-block:: bash
$ ./prepare.sh --stage 2 --stop-stage 5
Training
--------
This section describes the training of the TDNN-LSTM-CTC model, contained in
the `tdnn_lstm_ctc <https://github.com/k2-fsa/icefall/tree/master/egs/timit/ASR/tdnn_lstm_ctc>`_
folder.
.. HINT::
TIMIT is a very small dataset. So one GPU for training is enough.
The command to run the training part is:
.. code-block:: bash
$ cd egs/timit/ASR
$ export CUDA_VISIBLE_DEVICES="0"
$ ./tdnn_lstm_ctc/train.py
By default, it will run ``25`` epochs. Training logs and checkpoints are saved
in ``tdnn_lstm_ctc/exp``.
In ``tdnn_lstm_ctc/exp``, you will find the following files:
- ``epoch-0.pt``, ``epoch-1.pt``, ..., ``epoch-25.pt``
These are checkpoint files, containing model ``state_dict`` and optimizer ``state_dict``.
To resume training from some checkpoint, say ``epoch-10.pt``, you can use:
.. code-block:: bash
$ ./tdnn_lstm_ctc/train.py --start-epoch 11
- ``tensorboard/``
This folder contains TensorBoard logs. Training loss, validation loss, learning
rate, etc, are recorded in these logs. You can visualize them by:
.. code-block:: bash
$ cd tdnn_lstm_ctc/exp/tensorboard
$ tensorboard dev upload --logdir . --description "TDNN LSTM training for timit with icefall"
- ``log/log-train-xxxx``
It is the detailed training log in text format, same as the one
you saw printed to the console during training.
To see available training options, you can use:
.. code-block:: bash
$ ./tdnn_lstm_ctc/train.py --help
Other training options, e.g., learning rate, results dir, etc., are
pre-configured in the function ``get_params()``
in `tdnn_lstm_ctc/train.py <https://github.com/k2-fsa/icefall/blob/master/egs/timit/ASR/tdnn_lstm_ctc/train.py>`_.
Normally, you don't need to change them. You can change them by modifying the code, if
you want.
Decoding
--------
The decoding part uses checkpoints saved by the training part, so you have
to run the training part first.
The command for decoding is:
.. code-block:: bash
$ export CUDA_VISIBLE_DEVICES="0"
$ ./tdnn_lstm_ctc/decode.py
You will see the WER in the output log.
Decoded results are saved in ``tdnn_lstm_ctc/exp``.
.. code-block:: bash
$ ./tdnn_lstm_ctc/decode.py --help
shows you the available decoding options.
Some commonly used options are:
- ``--epoch``
You can select which checkpoint to use for decoding.
For instance, ``./tdnn_lstm_ctc/decode.py --epoch 10`` means to use
``./tdnn_lstm_ctc/exp/epoch-10.pt`` for decoding.
- ``--avg``
It's related to model averaging. It specifies the number of checkpoints
to be averaged. The averaged model is used for decoding.
For example, the following command:
.. code-block:: bash
$ ./tdnn_lstm_ctc/decode.py --epoch 25 --avg 10
uses the average of ``epoch-16.pt``, ``epoch-17.pt``, ``epoch-18.pt``,
``epoch-19.pt``, ``epoch-20.pt``, ``epoch-21.pt``, ``epoch-22.pt``,
``epoch-23.pt``, ``epoch-24.pt`` and ``epoch-25.pt``
for decoding.
- ``--export``
If it is ``True``, i.e., ``./tdnn_lstm_ctc/decode.py --export 1``, the code
will save the averaged model to ``tdnn_lstm_ctc/exp/pretrained.pt``.
See :ref:`tdnn_lstm_ctc use a pre-trained model` for how to use it.
.. _tdnn_lstm_ctc use a pre-trained model:
Pre-trained Model
-----------------
We have uploaded the pre-trained model to
`<https://huggingface.co/luomingshuang/icefall_asr_timit_tdnn_lstm_ctc>`_.
The following shows you how to use the pre-trained model.
Install kaldifeat
~~~~~~~~~~~~~~~~~
`kaldifeat <https://github.com/csukuangfj/kaldifeat>`_ is used to
extract features for a single sound file or multiple sound files
at the same time.
Please refer to `<https://github.com/csukuangfj/kaldifeat>`_ for installation.
Download the pre-trained model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
$ cd egs/timit/ASR
$ mkdir tmp-lstm
$ cd tmp-lstm
$ git lfs install
$ git clone https://huggingface.co/luomingshuang/icefall_asr_timit_tdnn_lstm_ctc
.. CAUTION::
You have to use ``git lfs`` to download the pre-trained model.
.. CAUTION::
In order to use this pre-trained model, your k2 version has to be v1.7 or later.
After downloading, you will have the following files:
.. code-block:: bash
$ cd egs/timit/ASR
$ tree tmp-lstm
.. code-block:: bash
tmp-lstm/
`-- icefall_asr_timit_tdnn_lstm_ctc
|-- README.md
|-- data
| |-- lang_phone
| | |-- HLG.pt
| | |-- tokens.txt
| | `-- words.txt
| `-- lm
| `-- G_4_gram.pt
|-- exp
| `-- pretrained_average_16_25.pt
`-- test_waves
|-- FDHC0_SI1559.WAV
|-- FELC0_SI756.WAV
|-- FMGD0_SI1564.WAV
`-- trans.txt
6 directories, 10 files
**File descriptions**:
- ``data/lang_phone/HLG.pt``
It is the decoding graph.
- ``data/lang_phone/tokens.txt``
It contains tokens and their IDs.
- ``data/lang_phone/words.txt``
It contains words and their IDs.
- ``data/lm/G_4_gram.pt``
It is a 4-gram LM, useful for LM rescoring.
- ``exp/pretrained_average_16_25.pt``
It contains pre-trained model parameters, obtained by averaging
checkpoints from ``epoch-16.pt`` to ``epoch-25.pt``.
Note: We have removed optimizer ``state_dict`` to reduce file size.
- ``test_waves/*.WAV``
It contains some test sound files from the TIMIT ``TEST`` dataset.
- ``test_waves/trans.txt``
It contains the reference transcripts for the sound files in ``test_waves/``.
The information of the test sound files is listed below:
.. code-block:: bash
$ ffprobe -show_format tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/test_waves/FDHC0_SI1559.WAV
Input #0, nistsphere, from 'tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/test_waves/FDHC0_SI1559.WAV':
Metadata:
database_id : TIMIT
database_version: 1.0
utterance_id : dhc0_si1559
sample_min : -4176
sample_max : 5984
Duration: 00:00:03.40, bitrate: 258 kb/s
Stream #0:0: Audio: pcm_s16le, 16000 Hz, 1 channels, s16, 256 kb/s
$ ffprobe -show_format tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/test_waves/FELC0_SI756.WAV
Input #0, nistsphere, from 'tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/test_waves/FELC0_SI756.WAV':
Metadata:
database_id : TIMIT
database_version: 1.0
utterance_id : elc0_si756
sample_min : -1546
sample_max : 1989
Duration: 00:00:04.19, bitrate: 257 kb/s
Stream #0:0: Audio: pcm_s16le, 16000 Hz, 1 channels, s16, 256 kb/s
$ ffprobe -show_format tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/test_waves/FMGD0_SI1564.WAV
Input #0, nistsphere, from 'tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/test_waves/FMGD0_SI1564.WAV':
Metadata:
database_id : TIMIT
database_version: 1.0
utterance_id : mgd0_si1564
sample_min : -7626
sample_max : 10573
Duration: 00:00:04.44, bitrate: 257 kb/s
Stream #0:0: Audio: pcm_s16le, 16000 Hz, 1 channels, s16, 256 kb/s
Inference with a pre-trained model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
$ cd egs/timit/ASR
$ ./tdnn_lstm_ctc/pretrained.py --help
shows the usage information of ``./tdnn_lstm_ctc/pretrained.py``.
To decode with the ``1best`` method, we can use:
.. code-block:: bash
./tdnn_lstm_ctc/pretrained.py \
--method 1best \
--checkpoint ./tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/exp/pretrained_average_16_25.pt \
--words-file ./tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/data/lang_phone/words.txt \
--HLG ./tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/data/lang_phone/HLG.pt \
./tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/test_waves/FDHC0_SI1559.WAV \
./tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/test_waves/FELC0_SI756.WAV \
./tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/test_waves/FMGD0_SI1564.WAV
The output is:
.. code-block::
2021-11-08 21:02:49,583 INFO [pretrained.py:169] device: cuda:0
2021-11-08 21:02:49,584 INFO [pretrained.py:171] Creating model
2021-11-08 21:02:53,816 INFO [pretrained.py:183] Loading HLG from ./tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/data/lang_phone/HLG.pt
2021-11-08 21:02:53,827 INFO [pretrained.py:200] Constructing Fbank computer
2021-11-08 21:02:53,827 INFO [pretrained.py:210] Reading sound files: ['./tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/test_waves/FDHC0_SI1559.WAV', './tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/test_waves/FELC0_SI756.WAV', './tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/test_waves/FMGD0_SI1564.WAV']
2021-11-08 21:02:53,831 INFO [pretrained.py:216] Decoding started
2021-11-08 21:02:54,380 INFO [pretrained.py:246] Use HLG decoding
2021-11-08 21:02:54,387 INFO [pretrained.py:267]
./tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/test_waves/FDHC0_SI1559.WAV:
sil dh ih sh uw ah l iy v iy z ih sil p r aa sil k s ih m ey dx ih sil d w uh dx iy w ih s f iy l iy w ih th ih n ih m s eh l f sil jh
./tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/test_waves/FELC0_SI756.WAV:
sil dh ih sil t ih r ih s sil s er r ih m ih sil m aa l ih ng sil k l ey sil r eh sil d w ay sil d aa r sil b ah f sil <UNK> jh
./tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/test_waves/FMGD0_SI1564.WAV:
sil hh ae z sil b ih n iy w ah z sil b ae n ih sil b ay s sil n ey sil k eh l f eh n s ih z eh n dh eh r w er sil g r ey z ih ng sil k ae dx l sil
2021-11-08 21:02:54,387 INFO [pretrained.py:269] Decoding Done
To decode with the ``whole-lattice-rescoring`` method, you can use:
.. code-block:: bash
./tdnn_lstm_ctc/pretrained.py \
--method whole-lattice-rescoring \
--checkpoint ./tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/exp/pretrained_average_16_25.pt \
--words-file ./tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/data/lang_phone/words.txt \
--HLG ./tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/data/lang_phone/HLG.pt \
--G ./tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/data/lm/G_4_gram.pt \
--ngram-lm-scale 0.08 \
./tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/test_waves/FDHC0_SI1559.WAV \
./tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/test_waves/FELC0_SI756.WAV \
./tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/test_waves/FMGD0_SI1564.WAV
The decoding output is:
.. code-block::
2021-11-08 20:05:22,739 INFO [pretrained.py:169] device: cuda:0
2021-11-08 20:05:22,739 INFO [pretrained.py:171] Creating model
2021-11-08 20:05:26,959 INFO [pretrained.py:183] Loading HLG from ./tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/data/lang_phone/HLG.pt
2021-11-08 20:05:26,971 INFO [pretrained.py:191] Loading G from ./tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/data/lm/G_4_gram.pt
2021-11-08 20:05:26,977 INFO [pretrained.py:200] Constructing Fbank computer
2021-11-08 20:05:26,978 INFO [pretrained.py:210] Reading sound files: ['./tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/test_waves/FDHC0_SI1559.WAV', './tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/test_waves/FELC0_SI756.WAV', './tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/test_waves/FMGD0_SI1564.WAV']
2021-11-08 20:05:26,981 INFO [pretrained.py:216] Decoding started
2021-11-08 20:05:27,519 INFO [pretrained.py:251] Use HLG decoding + LM rescoring
2021-11-08 20:05:27,878 INFO [pretrained.py:267]
./tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/test_waves/FDHC0_SI1559.WAV:
sil dh ih sh uw l iy v iy z ih sil p r aa sil k s ah m ey dx ih sil w uh dx iy w ih s f iy l ih ng w ih th ih n ih m s eh l f sil jh
./tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/test_waves/FELC0_SI756.WAV:
sil dh ih sil t ih r iy ih s sil s er r eh m ih sil n ah l ih ng sil k l ey sil r eh sil d w ay sil d aa r sil b ow f sil jh
./tmp-lstm/icefall_asr_timit_tdnn_lstm_ctc/test_waves/FMGD0_SI1564.WAV:
sil hh ah z sil b ih n iy w ah z sil b ae n ih sil b ay s sil n ey sil k ih l f eh n s ih z eh n dh eh r w er sil g r ey z ih n sil k ae dx l sil
2021-11-08 20:05:27,878 INFO [pretrained.py:269] Decoding Done
Colab notebook
--------------
We provide a colab notebook for decoding with the pre-trained model.
|timit tdnn_lstm_ctc colab notebook|
.. |timit tdnn_lstm_ctc colab notebook| image:: https://colab.research.google.com/assets/colab-badge.svg
:target: https://colab.research.google.com/drive/1Hs9DA4V96uapw_30uNp32OMJgkuR5VVd
**Congratulations!** You have finished the TDNN-LSTM-CTC recipe on TIMIT in ``icefall``.