icefall/egs/librispeech/ASR/transducer_stateless/README.md

## Introduction

The decoder, i.e., the prediction network, is from
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9054419
(Rnn-Transducer with Stateless Prediction Network)

You can use the following command to start the training:

```bash
cd egs/librispeech/ASR

export CUDA_VISIBLE_DEVICES="0,1,2,3"

./transducer_stateless/train.py \
  --world-size 4 \
  --num-epochs 30 \
  --start-epoch 0 \
  --exp-dir transducer_stateless/exp \
  --full-libri 1 \
  --max-duration 250 \
  --lr-factor 2.5
```

## How to get framewise token alignment

Assume that you already have a trained model. If not, you can either
train one by yourself or download a pre-trained model from hugging face:
<https://huggingface.co/csukuangfj/icefall-asr-librispeech-transducer-stateless-multi-datasets-bpe-500-2022-03-01>

**Caution**: If you are going to use your own trained model, remember
to set `--modified-transducer-prob` to a nonzero value since the
force alignment code assumes that `--max-sym-per-frame` is 1.


The following shows how to get framewise token alignment using the above
pre-trained model.

```bash
git clone https://github.com/k2-fsa/icefall
cd icefall/egs/librispeech/ASR
mkdir tmp
sudo apt-get install git-lfs
git lfs install
git clone https://huggingface.co/csukuangfj/icefall-asr-librispeech-transducer-stateless-multi-datasets-bpe-500-2022-03-01 ./tmp/

ln -s $PWD/tmp/exp/pretrained.pt $PWD/tmp/epoch-999.pt

./transducer_stateless/compute_ali.py \
        --exp-dir ./tmp/exp \
        --bpe-model ./tmp/data/lang_bpe_500/bpe.model \
        --epoch 999 \
        --avg 1 \
        --max-duration 100 \
        --dataset dev-clean \
        --out-dir data/ali
```

After running the above commands, you will find the following two files
in the folder `./data/ali`:

```
-rw-r--r-- 1 xxx xxx 412K Mar  7 15:45 cuts_dev-clean.json.gz
-rw-r--r-- 1 xxx xxx 2.9M Mar  7 15:45 token_ali_dev-clean.h5
```

You can find usage examples in `./test_compute_ali.py` about
extracting framewise token alignment information from the above
two files.

## How to get word starting time from framewise token alignment

Assume you have run the above commands to get framewise token alignment
using a pre-trained model from `tmp/exp/epoch-999.pt`. You can use the following
commands to obtain word starting time.

```bash
./transducer_stateless/test_compute_ali.py \
        --bpe-model ./tmp/data/lang_bpe_500/bpe.model \
        --ali-dir data/ali \
        --dataset dev-clean
```

**Caution**: Since the frame shift is 10ms and the subsampling factor
of the model is 4, the time resolution is 0.04 second.

**Note**: The script `test_compute_ali.py` is for illustration only
and it processes only one batch and then exits.

You will get the following output:

```
5694-64029-0022-1998-0
[('THE', '0.20'), ('LEADEN', '0.36'), ('HAIL', '0.72'), ('STORM', '1.00'), ('SWEPT', '1.48'), ('THEM', '1.88'), ('OFF', '2.00'), ('THE', '2.24'), ('FIELD', '2.36'), ('THEY', '3.20'), ('FELL', '3.36'), ('BACK', '3.64'), ('AND', '3.92'), ('RE', '4.04'), ('FORMED', '4.20')]

3081-166546-0040-308-0
[('IN', '0.32'), ('OLDEN', '0.60'), ('DAYS', '1.00'), ('THEY', '1.40'), ('WOULD', '1.56'), ('HAVE', '1.76'), ('SAID', '1.92'), ('STRUCK', '2.60'), ('BY', '3.16'), ('A', '3.36'), ('BOLT', '3.44'), ('FROM', '3.84'), ('HEAVEN', '4.04')]

2035-147960-0016-1283-0
[('A', '0.44'), ('SNAKE', '0.52'), ('OF', '0.84'), ('HIS', '0.96'), ('SIZE', '1.12'), ('IN', '1.60'), ('FIGHTING', '1.72'), ('TRIM', '2.12'), ('WOULD', '2.56'), ('BE', '2.76'), ('MORE', '2.88'), ('THAN', '3.08'), ('ANY', '3.28'), ('BOY', '3.56'), ('COULD', '3.88'), ('HANDLE', '4.04')]

2428-83699-0020-1734-0
[('WHEN', '0.28'), ('THE', '0.48'), ('TRAP', '0.60'), ('DID', '0.88'), ('APPEAR', '1.08'), ('IT', '1.80'), ('LOOKED', '1.96'), ('TO',
'2.24'), ('ME', '2.36'), ('UNCOMMONLY', '2.52'), ('LIKE', '3.16'), ('AN', '3.40'), ('OPEN', '3.56'), ('SPRING', '3.92'), ('CART', '4.28')]

8297-275154-0026-2108-0
[('LET', '0.44'), ('ME', '0.72'), ('REST', '0.92'), ('A', '1.32'), ('LITTLE', '1.40'), ('HE', '1.80'), ('PLEADED', '2.00'), ('IF', '3.04'), ("I'M", '3.28'), ('NOT', '3.52'), ('IN', '3.76'), ('THE', '3.88'), ('WAY', '4.00')]

652-129742-0007-1002-0
[('SURROUND', '0.28'), ('WITH', '0.80'), ('A', '0.92'), ('GARNISH', '1.00'), ('OF', '1.44'), ('COOKED', '1.56'), ('AND', '1.88'), ('DICED', '4.16'), ('CARROTS', '4.28'), ('TURNIPS', '4.44'), ('GREEN', '4.60'), ('PEAS', '4.72')]
```


For the row:
```
5694-64029-0022-1998-0
[('THE', '0.20'), ('LEADEN', '0.36'), ('HAIL', '0.72'), ('STORM', '1.00'), ('SWEPT', '1.48'),
('THEM', '1.88'), ('OFF', '2.00'), ('THE', '2.24'), ('FIELD', '2.36'), ('THEY', '3.20'), ('FELL', '3.36'),
('BACK', '3.64'), ('AND', '3.92'), ('RE', '4.04'), ('FORMED', '4.20')]
```

- `5694-64029-0022-1998-0` is the cut ID.
- `('THE', '0.20')` means the word `THE` starts at 0.20 second.
- `('LEADEN', '0.36')` means the word `LEADEN` starts at 0.36 second.


You can compare the above word starting time with the one
from <https://github.com/CorentinJ/librispeech-alignments>

```
5694-64029-0022 ",THE,LEADEN,HAIL,STORM,SWEPT,THEM,OFF,THE,FIELD,,THEY,FELL,BACK,AND,RE,FORMED," "0.230,0.360,0.670,1.010,1.440,1.860,1.990,2.230,2.350,2.870,3.230,3.390,3.660,3.960,4.060,4.160,4.850,4.9"
```

We reformat it below for readability:

```
5694-64029-0022 ",THE,LEADEN,HAIL,STORM,SWEPT,THEM,OFF,THE,FIELD,,THEY,FELL,BACK,AND,RE,FORMED,"
"0.230,0.360,0.670,1.010,1.440,1.860,1.990,2.230,2.350,2.870,3.230,3.390,3.660,3.960,4.060,4.160,4.850,4.9"
  the  leaden hail storm swept them  off   the   field  sil   they  fell  back  and   re   formed  sil
```