mirror of
https://github.com/k2-fsa/icefall.git
synced 2025-08-08 09:32:20 +00:00
SPGISpeech recipe (#334)
* initial commit for SPGISpeech recipe * add decoding * add spgispeech transducer * remove conformer ctc; minor fixes in RNN-T * add results * add tensorboard * add pretrained model to HF * remove unused scripts and soft link common scripts * remove duplicate files * pre commit hooks * remove change in librispeech * pre commit hook * add CER numbers
This commit is contained in:
parent
6f7860a0a6
commit
5aafbb970e
32
egs/spgispeech/ASR/README.md
Normal file
32
egs/spgispeech/ASR/README.md
Normal file
@ -0,0 +1,32 @@
|
|||||||
|
# SPGISpeech
|
||||||
|
|
||||||
|
SPGISpeech consists of 5,000 hours of recorded company earnings calls and their respective
|
||||||
|
transcriptions. The original calls were split into slices ranging from 5 to 15 seconds in
|
||||||
|
length to allow easy training for speech recognition systems. Calls represent a broad
|
||||||
|
cross-section of international business English; SPGISpeech contains approximately 50,000
|
||||||
|
speakers, one of the largest numbers of any speech corpus, and offers a variety of L1 and
|
||||||
|
L2 English accents. The format of each WAV file is single channel, 16kHz, 16 bit audio.
|
||||||
|
|
||||||
|
Transcription text represents the output of several stages of manual post-processing.
|
||||||
|
As such, the text contains polished English orthography following a detailed style guide,
|
||||||
|
including proper casing, punctuation, and denormalized non-standard words such as numbers
|
||||||
|
and acronyms, making SPGISpeech suited for training fully formatted end-to-end models.
|
||||||
|
|
||||||
|
Official reference:
|
||||||
|
|
||||||
|
O’Neill, P.K., Lavrukhin, V., Majumdar, S., Noroozi, V., Zhang, Y., Kuchaiev, O., Balam,
|
||||||
|
J., Dovzhenko, Y., Freyberg, K., Shulman, M.D., Ginsburg, B., Watanabe, S., & Kucsko, G.
|
||||||
|
(2021). SPGISpeech: 5, 000 hours of transcribed financial audio for fully formatted
|
||||||
|
end-to-end speech recognition. ArXiv, abs/2104.02014.
|
||||||
|
|
||||||
|
ArXiv link: https://arxiv.org/abs/2104.02014
|
||||||
|
|
||||||
|
## Performance Record
|
||||||
|
|
||||||
|
| Decoding method | val WER | val CER |
|
||||||
|
|---------------------------|------------|---------|
|
||||||
|
| greedy search | 2.40 | 0.99 |
|
||||||
|
| modified beam search | 2.24 | 0.91 |
|
||||||
|
| fast beam search | 2.35 | 0.97 |
|
||||||
|
|
||||||
|
See [RESULTS](/egs/spgispeech/ASR/RESULTS.md) for details.
|
73
egs/spgispeech/ASR/RESULTS.md
Normal file
73
egs/spgispeech/ASR/RESULTS.md
Normal file
@ -0,0 +1,73 @@
|
|||||||
|
## Results
|
||||||
|
|
||||||
|
### SPGISpeech BPE training results (Pruned Transducer)
|
||||||
|
|
||||||
|
#### 2022-05-11
|
||||||
|
|
||||||
|
#### Conformer encoder + embedding decoder
|
||||||
|
|
||||||
|
Conformer encoder + non-current decoder. The decoder
|
||||||
|
contains only an embedding layer, a Conv1d (with kernel size 2) and a linear
|
||||||
|
layer (to transform tensor dim).
|
||||||
|
|
||||||
|
The WERs are
|
||||||
|
|
||||||
|
| | dev | val | comment |
|
||||||
|
|---------------------------|------------|------------|------------------------------------------|
|
||||||
|
| greedy search | 2.46 | 2.40 | --avg-last-n 10 --max-duration 500 |
|
||||||
|
| modified beam search | 2.28 | 2.24 | --avg-last-n 10 --max-duration 500 --beam-size 4 |
|
||||||
|
| fast beam search | 2.38 | 2.35 | --avg-last-n 10 --max-duration 500 --beam-size 4 --max-contexts 4 --max-states 8 |
|
||||||
|
|
||||||
|
**NOTE:** SPGISpeech transcripts can be prepared in `ortho` or `norm` ways, which refer to whether the
|
||||||
|
transcripts are orthographic or normalized. These WERs correspond to the normalized transcription
|
||||||
|
scenario.
|
||||||
|
|
||||||
|
The training command for reproducing is given below:
|
||||||
|
|
||||||
|
```
|
||||||
|
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
|
||||||
|
|
||||||
|
./pruned_transducer_stateless2/train.py \
|
||||||
|
--world-size 8 \
|
||||||
|
--num-epochs 20 \
|
||||||
|
--start-epoch 0 \
|
||||||
|
--exp-dir pruned_transducer_stateless2/exp \
|
||||||
|
--max-duration 200 \
|
||||||
|
--prune-range 5 \
|
||||||
|
--lr-factor 5 \
|
||||||
|
--lm-scale 0.25 \
|
||||||
|
--use-fp16 True
|
||||||
|
```
|
||||||
|
|
||||||
|
The decoding command is:
|
||||||
|
```
|
||||||
|
# greedy search
|
||||||
|
./pruned_transducer_stateless2/decode.py \
|
||||||
|
--iter 696000 --avg 10 \
|
||||||
|
--exp-dir ./pruned_transducer_stateless2/exp \
|
||||||
|
--max-duration 100 \
|
||||||
|
--decoding-method greedy_search
|
||||||
|
|
||||||
|
# modified beam search
|
||||||
|
./pruned_transducer_stateless2/decode.py \
|
||||||
|
--iter 696000 --avg 10 \
|
||||||
|
--exp-dir ./pruned_transducer_stateless2/exp \
|
||||||
|
--max-duration 100 \
|
||||||
|
--decoding-method modified_beam_search \
|
||||||
|
--beam-size 4
|
||||||
|
|
||||||
|
# fast beam search
|
||||||
|
./pruned_transducer_stateless2/decode.py \
|
||||||
|
--iter 696000 --avg 10 \
|
||||||
|
--exp-dir ./pruned_transducer_stateless2/exp \
|
||||||
|
--max-duration 1500 \
|
||||||
|
--decoding-method fast_beam_search \
|
||||||
|
--beam 4 \
|
||||||
|
--max-contexts 4 \
|
||||||
|
--max-states 8
|
||||||
|
```
|
||||||
|
|
||||||
|
Pretrained model is available at <https://huggingface.co/desh2608/icefall-asr-spgispeech-pruned-transducer-stateless2>
|
||||||
|
|
||||||
|
The tensorboard training log can be found at
|
||||||
|
<https://tensorboard.dev/experiment/ExSoBmrPRx6liMTGLu0Tgw/#scalars>
|
0
egs/spgispeech/ASR/local/__init__.py
Normal file
0
egs/spgispeech/ASR/local/__init__.py
Normal file
1
egs/spgispeech/ASR/local/compile_hlg.py
Symbolic link
1
egs/spgispeech/ASR/local/compile_hlg.py
Symbolic link
@ -0,0 +1 @@
|
|||||||
|
../../../librispeech/ASR/local/compile_hlg.py
|
104
egs/spgispeech/ASR/local/compute_fbank_musan.py
Executable file
104
egs/spgispeech/ASR/local/compute_fbank_musan.py
Executable file
@ -0,0 +1,104 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
# Copyright 2021 Xiaomi Corp. (authors: Fangjun Kuang)
|
||||||
|
#
|
||||||
|
# See ../../../../LICENSE for clarification regarding multiple authors
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
|
||||||
|
|
||||||
|
"""
|
||||||
|
This file computes fbank features of the musan dataset.
|
||||||
|
It looks for manifests in the directory data/manifests.
|
||||||
|
|
||||||
|
The generated fbank features are saved in data/fbank.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import logging
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import torch
|
||||||
|
from lhotse import LilcomChunkyWriter, CutSet, combine
|
||||||
|
from lhotse.features.kaldifeat import (
|
||||||
|
KaldifeatFbank,
|
||||||
|
KaldifeatFbankConfig,
|
||||||
|
KaldifeatMelOptions,
|
||||||
|
KaldifeatFrameOptions,
|
||||||
|
)
|
||||||
|
from lhotse.recipes.utils import read_manifests_if_cached
|
||||||
|
|
||||||
|
from icefall.utils import get_executor
|
||||||
|
|
||||||
|
# Torch's multithreaded behavior needs to be disabled or
|
||||||
|
# it wastes a lot of CPU and slow things down.
|
||||||
|
# Do this outside of main() in case it needs to take effect
|
||||||
|
# even when we are not invoking the main (e.g. when spawning subprocesses).
|
||||||
|
torch.set_num_threads(1)
|
||||||
|
torch.set_num_interop_threads(1)
|
||||||
|
|
||||||
|
|
||||||
|
def compute_fbank_musan():
|
||||||
|
src_dir = Path("data/manifests")
|
||||||
|
output_dir = Path("data/fbank")
|
||||||
|
|
||||||
|
sampling_rate = 16000
|
||||||
|
num_mel_bins = 80
|
||||||
|
|
||||||
|
extractor = KaldifeatFbank(
|
||||||
|
KaldifeatFbankConfig(
|
||||||
|
frame_opts=KaldifeatFrameOptions(sampling_rate=sampling_rate),
|
||||||
|
mel_opts=KaldifeatMelOptions(num_bins=num_mel_bins),
|
||||||
|
device="cuda",
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
dataset_parts = (
|
||||||
|
"music",
|
||||||
|
"speech",
|
||||||
|
"noise",
|
||||||
|
)
|
||||||
|
manifests = read_manifests_if_cached(
|
||||||
|
dataset_parts=dataset_parts, output_dir=src_dir
|
||||||
|
)
|
||||||
|
assert manifests is not None
|
||||||
|
|
||||||
|
musan_cuts_path = src_dir / "cuts_musan.jsonl.gz"
|
||||||
|
|
||||||
|
if musan_cuts_path.is_file():
|
||||||
|
logging.info(f"{musan_cuts_path} already exists - skipping")
|
||||||
|
return
|
||||||
|
|
||||||
|
logging.info("Extracting features for Musan")
|
||||||
|
|
||||||
|
# create chunks of Musan with duration 5 - 10 seconds
|
||||||
|
musan_cuts = (
|
||||||
|
CutSet.from_manifests(
|
||||||
|
recordings=combine(part["recordings"] for part in manifests.values())
|
||||||
|
)
|
||||||
|
.cut_into_windows(10.0)
|
||||||
|
.filter(lambda c: c.duration > 5)
|
||||||
|
.compute_and_store_features_batch(
|
||||||
|
extractor=extractor,
|
||||||
|
storage_path=output_dir / f"feats_musan",
|
||||||
|
manifest_path=src_dir / f"cuts_musan.jsonl.gz",
|
||||||
|
batch_duration=500,
|
||||||
|
num_workers=4,
|
||||||
|
storage_type=LilcomChunkyWriter,
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
formatter = "%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] %(message)s"
|
||||||
|
|
||||||
|
logging.basicConfig(format=formatter, level=logging.INFO)
|
||||||
|
compute_fbank_musan()
|
145
egs/spgispeech/ASR/local/compute_fbank_spgispeech.py
Executable file
145
egs/spgispeech/ASR/local/compute_fbank_spgispeech.py
Executable file
@ -0,0 +1,145 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
# Copyright 2022 Johns Hopkins University (authors: Desh Raj)
|
||||||
|
#
|
||||||
|
# See ../../../../LICENSE for clarification regarding multiple authors
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
|
||||||
|
|
||||||
|
"""
|
||||||
|
This file computes fbank features of the SPGISpeech dataset.
|
||||||
|
It looks for manifests in the directory data/manifests.
|
||||||
|
|
||||||
|
The generated fbank features are saved in data/fbank.
|
||||||
|
"""
|
||||||
|
import argparse
|
||||||
|
import logging
|
||||||
|
from pathlib import Path
|
||||||
|
from tqdm import tqdm
|
||||||
|
|
||||||
|
import torch
|
||||||
|
from lhotse import load_manifest_lazy, LilcomChunkyWriter
|
||||||
|
from lhotse.features.kaldifeat import (
|
||||||
|
KaldifeatFbank,
|
||||||
|
KaldifeatFbankConfig,
|
||||||
|
KaldifeatMelOptions,
|
||||||
|
KaldifeatFrameOptions,
|
||||||
|
)
|
||||||
|
from lhotse.manipulation import combine
|
||||||
|
|
||||||
|
# Torch's multithreaded behavior needs to be disabled or
|
||||||
|
# it wastes a lot of CPU and slow things down.
|
||||||
|
# Do this outside of main() in case it needs to take effect
|
||||||
|
# even when we are not invoking the main (e.g. when spawning subprocesses).
|
||||||
|
torch.set_num_threads(1)
|
||||||
|
torch.set_num_interop_threads(1)
|
||||||
|
|
||||||
|
|
||||||
|
def get_args():
|
||||||
|
parser = argparse.ArgumentParser()
|
||||||
|
parser.add_argument(
|
||||||
|
"--num-splits",
|
||||||
|
type=int,
|
||||||
|
default=20,
|
||||||
|
help="Number of splits for the train set.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--start",
|
||||||
|
type=int,
|
||||||
|
default=0,
|
||||||
|
help="Start index of the train set split.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--stop",
|
||||||
|
type=int,
|
||||||
|
default=-1,
|
||||||
|
help="Stop index of the train set split.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--test",
|
||||||
|
action="store_true",
|
||||||
|
help="If set, only compute features for the dev and val set.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--train",
|
||||||
|
action="store_true",
|
||||||
|
help="If set, only compute features for the train set.",
|
||||||
|
)
|
||||||
|
|
||||||
|
return parser.parse_args()
|
||||||
|
|
||||||
|
|
||||||
|
def compute_fbank_spgispeech(args):
|
||||||
|
assert args.train or args.test, "Either train or test must be set."
|
||||||
|
|
||||||
|
src_dir = Path("data/manifests")
|
||||||
|
output_dir = Path("data/fbank")
|
||||||
|
|
||||||
|
sampling_rate = 16000
|
||||||
|
num_mel_bins = 80
|
||||||
|
|
||||||
|
extractor = KaldifeatFbank(
|
||||||
|
KaldifeatFbankConfig(
|
||||||
|
frame_opts=KaldifeatFrameOptions(sampling_rate=sampling_rate),
|
||||||
|
mel_opts=KaldifeatMelOptions(num_bins=num_mel_bins),
|
||||||
|
device="cuda",
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
if args.train:
|
||||||
|
logging.info(f"Processing train")
|
||||||
|
cut_set = load_manifest_lazy(src_dir / f"cuts_train_raw.jsonl.gz")
|
||||||
|
chunk_size = len(cut_set) // args.num_splits
|
||||||
|
cut_sets = cut_set.split_lazy(
|
||||||
|
output_dir=src_dir / f"cuts_train_raw_split{args.num_splits}",
|
||||||
|
chunk_size=chunk_size,
|
||||||
|
)
|
||||||
|
start = args.start
|
||||||
|
stop = min(args.stop, args.num_splits) if args.stop > 0 else args.num_splits
|
||||||
|
num_digits = len(str(args.num_splits))
|
||||||
|
for i in range(start, stop):
|
||||||
|
idx = f"{i + 1}".zfill(num_digits)
|
||||||
|
logging.info(f"Processing train split {i}")
|
||||||
|
cs = cut_sets[i].compute_and_store_features_batch(
|
||||||
|
extractor=extractor,
|
||||||
|
storage_path=output_dir / f"feats_train_{idx}",
|
||||||
|
manifest_path=src_dir / f"cuts_train_{idx}.jsonl.gz",
|
||||||
|
batch_duration=500,
|
||||||
|
num_workers=4,
|
||||||
|
storage_type=LilcomChunkyWriter,
|
||||||
|
)
|
||||||
|
|
||||||
|
if args.test:
|
||||||
|
for partition in ["dev", "val"]:
|
||||||
|
if (output_dir / f"cuts_{partition}.jsonl.gz").is_file():
|
||||||
|
logging.info(f"{partition} already exists - skipping.")
|
||||||
|
continue
|
||||||
|
logging.info(f"Processing {partition}")
|
||||||
|
cut_set = load_manifest_lazy(src_dir / f"cuts_{partition}_raw.jsonl.gz")
|
||||||
|
cut_set = cut_set.compute_and_store_features_batch(
|
||||||
|
extractor=extractor,
|
||||||
|
storage_path=output_dir / f"feats_{partition}",
|
||||||
|
manifest_path=src_dir / f"cuts_{partition}.jsonl.gz",
|
||||||
|
batch_duration=500,
|
||||||
|
num_workers=4,
|
||||||
|
storage_type=LilcomChunkyWriter,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
formatter = "%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] %(message)s"
|
||||||
|
|
||||||
|
logging.basicConfig(format=formatter, level=logging.INFO)
|
||||||
|
|
||||||
|
args = get_args()
|
||||||
|
compute_fbank_spgispeech(args)
|
1
egs/spgispeech/ASR/local/prepare_lang.py
Symbolic link
1
egs/spgispeech/ASR/local/prepare_lang.py
Symbolic link
@ -0,0 +1 @@
|
|||||||
|
../../../librispeech/ASR/local/prepare_lang.py
|
1
egs/spgispeech/ASR/local/prepare_lang_bpe.py
Symbolic link
1
egs/spgispeech/ASR/local/prepare_lang_bpe.py
Symbolic link
@ -0,0 +1 @@
|
|||||||
|
../../../librispeech/ASR/local/prepare_lang_bpe.py
|
79
egs/spgispeech/ASR/local/prepare_splits.py
Executable file
79
egs/spgispeech/ASR/local/prepare_splits.py
Executable file
@ -0,0 +1,79 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
# Copyright 2022 Johns Hopkins University (authors: Desh Raj)
|
||||||
|
#
|
||||||
|
# See ../../../../LICENSE for clarification regarding multiple authors
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
|
||||||
|
|
||||||
|
"""
|
||||||
|
This file splits the training set into train and dev sets.
|
||||||
|
"""
|
||||||
|
import logging
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import torch
|
||||||
|
from lhotse import CutSet
|
||||||
|
|
||||||
|
from lhotse.recipes.utils import read_manifests_if_cached
|
||||||
|
|
||||||
|
# Torch's multithreaded behavior needs to be disabled or
|
||||||
|
# it wastes a lot of CPU and slow things down.
|
||||||
|
# Do this outside of main() in case it needs to take effect
|
||||||
|
# even when we are not invoking the main (e.g. when spawning subprocesses).
|
||||||
|
torch.set_num_threads(1)
|
||||||
|
torch.set_num_interop_threads(1)
|
||||||
|
|
||||||
|
|
||||||
|
def split_spgispeech_train():
|
||||||
|
src_dir = Path("data/manifests")
|
||||||
|
|
||||||
|
manifests = read_manifests_if_cached(
|
||||||
|
dataset_parts=["train", "val"],
|
||||||
|
output_dir=src_dir,
|
||||||
|
prefix="spgispeech",
|
||||||
|
suffix="jsonl.gz",
|
||||||
|
lazy=True,
|
||||||
|
)
|
||||||
|
assert manifests is not None
|
||||||
|
|
||||||
|
train_dev_cuts = CutSet.from_manifests(
|
||||||
|
recordings=manifests["train"]["recordings"],
|
||||||
|
supervisions=manifests["train"]["supervisions"],
|
||||||
|
)
|
||||||
|
dev_cuts = train_dev_cuts.subset(first=4000)
|
||||||
|
train_cuts = train_dev_cuts.filter(lambda c: c not in dev_cuts)
|
||||||
|
|
||||||
|
# Add speed perturbation
|
||||||
|
train_cuts = (
|
||||||
|
train_cuts + train_cuts.perturb_speed(0.9) + train_cuts.perturb_speed(1.1)
|
||||||
|
)
|
||||||
|
|
||||||
|
# Write the manifests to disk.
|
||||||
|
train_cuts.to_file(src_dir / "cuts_train_raw.jsonl.gz")
|
||||||
|
dev_cuts.to_file(src_dir / "cuts_dev_raw.jsonl.gz")
|
||||||
|
|
||||||
|
# Also write the val set to disk.
|
||||||
|
val_cuts = CutSet.from_manifests(
|
||||||
|
recordings=manifests["val"]["recordings"],
|
||||||
|
supervisions=manifests["val"]["supervisions"],
|
||||||
|
)
|
||||||
|
val_cuts.to_file(src_dir / "cuts_val_raw.jsonl.gz")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
formatter = "%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] %(message)s"
|
||||||
|
|
||||||
|
logging.basicConfig(format=formatter, level=logging.INFO)
|
||||||
|
|
||||||
|
split_spgispeech_train()
|
1
egs/spgispeech/ASR/local/train_bpe_model.py
Symbolic link
1
egs/spgispeech/ASR/local/train_bpe_model.py
Symbolic link
@ -0,0 +1 @@
|
|||||||
|
../../../librispeech/ASR/local/train_bpe_model.py
|
196
egs/spgispeech/ASR/prepare.sh
Executable file
196
egs/spgispeech/ASR/prepare.sh
Executable file
@ -0,0 +1,196 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
|
||||||
|
set -eou pipefail
|
||||||
|
|
||||||
|
nj=20
|
||||||
|
stage=-1
|
||||||
|
stop_stage=100
|
||||||
|
|
||||||
|
# We assume dl_dir (download dir) contains the following
|
||||||
|
# directories and files. If not, they will be downloaded
|
||||||
|
# by this script automatically.
|
||||||
|
#
|
||||||
|
# - $dl_dir/spgispeech
|
||||||
|
# You can find train.csv, val.csv, train, and val in this directory, which belong
|
||||||
|
# to the SPGISpeech dataset.
|
||||||
|
#
|
||||||
|
# - $dl_dir/musan
|
||||||
|
# This directory contains the following directories downloaded from
|
||||||
|
# http://www.openslr.org/17/
|
||||||
|
#
|
||||||
|
# - music
|
||||||
|
# - noise
|
||||||
|
# - speech
|
||||||
|
dl_dir=$PWD/download
|
||||||
|
|
||||||
|
. shared/parse_options.sh || exit 1
|
||||||
|
|
||||||
|
# vocab size for sentence piece models.
|
||||||
|
# It will generate data/lang_bpe_xxx,
|
||||||
|
# data/lang_bpe_yyy if the array contains xxx, yyy
|
||||||
|
vocab_sizes=(
|
||||||
|
500
|
||||||
|
)
|
||||||
|
|
||||||
|
# All files generated by this script are saved in "data".
|
||||||
|
# You can safely remove "data" and rerun this script to regenerate it.
|
||||||
|
mkdir -p data
|
||||||
|
|
||||||
|
log() {
|
||||||
|
# This function is from espnet
|
||||||
|
local fname=${BASH_SOURCE[1]##*/}
|
||||||
|
echo -e "$(date '+%Y-%m-%d %H:%M:%S') (${fname}:${BASH_LINENO[0]}:${FUNCNAME[1]}) $*"
|
||||||
|
}
|
||||||
|
|
||||||
|
log "dl_dir: $dl_dir"
|
||||||
|
|
||||||
|
if [ $stage -le 0 ] && [ $stop_stage -ge 0 ]; then
|
||||||
|
log "Stage 0: Download data"
|
||||||
|
|
||||||
|
# If you have pre-downloaded it to /path/to/spgispeech,
|
||||||
|
# you can create a symlink
|
||||||
|
#
|
||||||
|
# ln -sfv /path/to/spgispeech $dl_dir/spgispeech
|
||||||
|
#
|
||||||
|
if [ ! -d $dl_dir/spgispeech/train.csv ]; then
|
||||||
|
lhotse download spgispeech $dl_dir
|
||||||
|
fi
|
||||||
|
|
||||||
|
# If you have pre-downloaded it to /path/to/musan,
|
||||||
|
# you can create a symlink
|
||||||
|
#
|
||||||
|
# ln -sfv /path/to/musan $dl_dir/
|
||||||
|
#
|
||||||
|
if [ ! -d $dl_dir/musan ]; then
|
||||||
|
lhotse download musan $dl_dir
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ $stage -le 1 ] && [ $stop_stage -ge 1 ]; then
|
||||||
|
log "Stage 1: Prepare SPGISpeech manifest (may take ~1h)"
|
||||||
|
# We assume that you have downloaded the SPGISpeech corpus
|
||||||
|
# to $dl_dir/spgispeech. We perform text normalization for the transcripts.
|
||||||
|
mkdir -p data/manifests
|
||||||
|
lhotse prepare spgispeech -j $nj --normalize-text $dl_dir/spgispeech data/manifests
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ $stage -le 2 ] && [ $stop_stage -ge 2 ]; then
|
||||||
|
log "Stage 2: Prepare musan manifest"
|
||||||
|
# We assume that you have downloaded the musan corpus
|
||||||
|
# to data/musan
|
||||||
|
mkdir -p data/manifests
|
||||||
|
lhotse prepare musan $dl_dir/musan data/manifests
|
||||||
|
lhotse combine data/manifests/recordings_{music,speech,noise}.json data/manifests/recordings_musan.jsonl.gz
|
||||||
|
lhotse cut simple -r data/manifests/recordings_musan.jsonl.gz data/manifests/cuts_musan_raw.jsonl.gz
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ $stage -le 3 ] && [ $stop_stage -ge 3 ]; then
|
||||||
|
log "Stage 3: Split train into train and dev and create cut sets."
|
||||||
|
python local/prepare_splits.py
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ $stage -le 4 ] && [ $stop_stage -ge 4 ]; then
|
||||||
|
log "Stage 4: Compute fbank features for spgispeech dev and val"
|
||||||
|
mkdir -p data/fbank
|
||||||
|
python local/compute_fbank_spgispeech.py --test
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ $stage -le 5 ] && [ $stop_stage -ge 5 ]; then
|
||||||
|
log "Stage 5: Compute fbank features for train"
|
||||||
|
mkdir -p data/fbank
|
||||||
|
python local/compute_fbank_spgispeech.py --train --num-splits 20
|
||||||
|
|
||||||
|
log "Combine features from train splits (may take ~1h)"
|
||||||
|
if [ ! -f data/manifests/cuts_train.jsonl.gz ]; then
|
||||||
|
pieces=$(find data/manifests -name "cuts_train_[0-9]*.jsonl.gz")
|
||||||
|
lhotse combine $pieces data/manifests/cuts_train.jsonl.gz
|
||||||
|
fi
|
||||||
|
gunzip -c data/manifests/train_cuts.jsonl.gz | shuf | gzip -c > data/manifests/train_cuts_shuf.jsonl.gz
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ $stage -le 6 ] && [ $stop_stage -ge 6 ]; then
|
||||||
|
log "Stage 6: Compute fbank features for musan"
|
||||||
|
mkdir -p data/fbank
|
||||||
|
python local/compute_fbank_musan.py
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ $stage -le 7 ] && [ $stop_stage -ge 7 ]; then
|
||||||
|
log "Stage 7: Dump transcripts for LM training"
|
||||||
|
mkdir -p data/lm
|
||||||
|
gunzip -c data/manifests/cuts_train_raw.jsonl.gz \
|
||||||
|
| jq '.supervisions[0].text' \
|
||||||
|
| sed 's:"::g' \
|
||||||
|
> data/lm/transcript_words.txt
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ $stage -le 8 ] && [ $stop_stage -ge 8 ]; then
|
||||||
|
log "Stage 8: Prepare BPE based lang"
|
||||||
|
|
||||||
|
for vocab_size in ${vocab_sizes[@]}; do
|
||||||
|
lang_dir=data/lang_bpe_${vocab_size}
|
||||||
|
mkdir -p $lang_dir
|
||||||
|
|
||||||
|
# Add special words to words.txt
|
||||||
|
echo "<eps> 0" > $lang_dir/words.txt
|
||||||
|
echo "!SIL 1" >> $lang_dir/words.txt
|
||||||
|
echo "[UNK] 2" >> $lang_dir/words.txt
|
||||||
|
|
||||||
|
# Add regular words to words.txt
|
||||||
|
gunzip -c data/manifests/cuts_train_raw.jsonl.gz \
|
||||||
|
| jq '.supervisions[0].text' \
|
||||||
|
| sed 's:"::g' \
|
||||||
|
| sed 's: :\n:g' \
|
||||||
|
| sort \
|
||||||
|
| uniq \
|
||||||
|
| sed '/^$/d' \
|
||||||
|
| awk '{print $0,NR+2}' \
|
||||||
|
>> $lang_dir/words.txt
|
||||||
|
|
||||||
|
# Add remaining special word symbols expected by LM scripts.
|
||||||
|
num_words=$(cat $lang_dir/words.txt | wc -l)
|
||||||
|
echo "<s> ${num_words}" >> $lang_dir/words.txt
|
||||||
|
num_words=$(cat $lang_dir/words.txt | wc -l)
|
||||||
|
echo "</s> ${num_words}" >> $lang_dir/words.txt
|
||||||
|
num_words=$(cat $lang_dir/words.txt | wc -l)
|
||||||
|
echo "#0 ${num_words}" >> $lang_dir/words.txt
|
||||||
|
|
||||||
|
./local/train_bpe_model.py \
|
||||||
|
--lang-dir $lang_dir \
|
||||||
|
--vocab-size $vocab_size \
|
||||||
|
--transcript data/lm/transcript_words.txt
|
||||||
|
|
||||||
|
if [ ! -f $lang_dir/L_disambig.pt ]; then
|
||||||
|
./local/prepare_lang_bpe.py --lang-dir $lang_dir
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ $stage -le 9 ] && [ $stop_stage -ge 9 ]; then
|
||||||
|
log "Stage 9: Train LM"
|
||||||
|
lm_dir=data/lm
|
||||||
|
|
||||||
|
if [ ! -f $lm_dir/G.arpa ]; then
|
||||||
|
./shared/make_kn_lm.py \
|
||||||
|
-ngram-order 3 \
|
||||||
|
-text $lm_dir/transcript_words.txt \
|
||||||
|
-lm $lm_dir/G.arpa
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ ! -f $lm_dir/G_3_gram.fst.txt ]; then
|
||||||
|
python3 -m kaldilm \
|
||||||
|
--read-symbol-table="data/lang_phone/words.txt" \
|
||||||
|
--disambig-symbol='#0' \
|
||||||
|
--max-order=3 \
|
||||||
|
$lm_dir/G.arpa > $lm_dir/G_3_gram.fst.txt
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ $stage -le 10 ] && [ $stop_stage -ge 10 ]; then
|
||||||
|
log "Stage 10: Compile HLG"
|
||||||
|
./local/compile_hlg.py --lang-dir data/lang_phone
|
||||||
|
|
||||||
|
for vocab_size in ${vocab_sizes[@]}; do
|
||||||
|
lang_dir=data/lang_bpe_${vocab_size}
|
||||||
|
./local/compile_hlg.py --lang-dir $lang_dir
|
||||||
|
done
|
||||||
|
fi
|
@ -0,0 +1,366 @@
|
|||||||
|
# Copyright 2021 Piotr Żelasko
|
||||||
|
#
|
||||||
|
# See ../../../../LICENSE for clarification regarding multiple authors
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import logging
|
||||||
|
from functools import lru_cache
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Any, Dict, Optional
|
||||||
|
|
||||||
|
import torch
|
||||||
|
from lhotse import CutSet, Fbank, FbankConfig, load_manifest, load_manifest_lazy
|
||||||
|
from lhotse.dataset import (
|
||||||
|
CutConcatenate,
|
||||||
|
CutMix,
|
||||||
|
DynamicBucketingSampler,
|
||||||
|
K2SpeechRecognitionDataset,
|
||||||
|
PrecomputedFeatures,
|
||||||
|
SpecAugment,
|
||||||
|
)
|
||||||
|
from lhotse.dataset.input_strategies import OnTheFlyFeatures
|
||||||
|
from lhotse.utils import fix_random_seed
|
||||||
|
from torch.utils.data import DataLoader
|
||||||
|
from tqdm import tqdm
|
||||||
|
|
||||||
|
from icefall.utils import str2bool
|
||||||
|
|
||||||
|
|
||||||
|
class _SeedWorkers:
|
||||||
|
def __init__(self, seed: int):
|
||||||
|
self.seed = seed
|
||||||
|
|
||||||
|
def __call__(self, worker_id: int):
|
||||||
|
fix_random_seed(self.seed + worker_id)
|
||||||
|
|
||||||
|
|
||||||
|
class SPGISpeechAsrDataModule:
|
||||||
|
"""
|
||||||
|
DataModule for k2 ASR experiments.
|
||||||
|
It assumes there is always one train and valid dataloader,
|
||||||
|
but there can be multiple test dataloaders (e.g. LibriSpeech test-clean
|
||||||
|
and test-other).
|
||||||
|
It contains all the common data pipeline modules used in ASR
|
||||||
|
experiments, e.g.:
|
||||||
|
- dynamic batch size,
|
||||||
|
- bucketing samplers,
|
||||||
|
- cut concatenation,
|
||||||
|
- augmentation,
|
||||||
|
- on-the-fly feature extraction
|
||||||
|
This class should be derived for specific corpora used in ASR tasks.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, args: argparse.Namespace):
|
||||||
|
self.args = args
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def add_arguments(cls, parser: argparse.ArgumentParser):
|
||||||
|
group = parser.add_argument_group(
|
||||||
|
title="ASR data related options",
|
||||||
|
description="These options are used for the preparation of "
|
||||||
|
"PyTorch DataLoaders from Lhotse CutSet's -- they control the "
|
||||||
|
"effective batch sizes, sampling strategies, applied data "
|
||||||
|
"augmentations, etc.",
|
||||||
|
)
|
||||||
|
group.add_argument(
|
||||||
|
"--manifest-dir",
|
||||||
|
type=Path,
|
||||||
|
default=Path("data/manifests"),
|
||||||
|
help="Path to directory with train/valid/test cuts.",
|
||||||
|
)
|
||||||
|
group.add_argument(
|
||||||
|
"--enable-musan",
|
||||||
|
type=str2bool,
|
||||||
|
default=True,
|
||||||
|
help="When enabled, select noise from MUSAN and mix it "
|
||||||
|
"with training dataset. ",
|
||||||
|
)
|
||||||
|
group.add_argument(
|
||||||
|
"--concatenate-cuts",
|
||||||
|
type=str2bool,
|
||||||
|
default=False,
|
||||||
|
help="When enabled, utterances (cuts) will be concatenated "
|
||||||
|
"to minimize the amount of padding.",
|
||||||
|
)
|
||||||
|
group.add_argument(
|
||||||
|
"--duration-factor",
|
||||||
|
type=float,
|
||||||
|
default=1.0,
|
||||||
|
help="Determines the maximum duration of a concatenated cut "
|
||||||
|
"relative to the duration of the longest cut in a batch.",
|
||||||
|
)
|
||||||
|
group.add_argument(
|
||||||
|
"--gap",
|
||||||
|
type=float,
|
||||||
|
default=1.0,
|
||||||
|
help="The amount of padding (in seconds) inserted between "
|
||||||
|
"concatenated cuts. This padding is filled with noise when "
|
||||||
|
"noise augmentation is used.",
|
||||||
|
)
|
||||||
|
group.add_argument(
|
||||||
|
"--max-duration",
|
||||||
|
type=int,
|
||||||
|
default=100.0,
|
||||||
|
help="Maximum pooled recordings duration (seconds) in a "
|
||||||
|
"single batch. You can reduce it if it causes CUDA OOM.",
|
||||||
|
)
|
||||||
|
group.add_argument(
|
||||||
|
"--num-buckets",
|
||||||
|
type=int,
|
||||||
|
default=30,
|
||||||
|
help="The number of buckets for the BucketingSampler"
|
||||||
|
"(you might want to increase it for larger datasets).",
|
||||||
|
)
|
||||||
|
group.add_argument(
|
||||||
|
"--on-the-fly-feats",
|
||||||
|
type=str2bool,
|
||||||
|
default=False,
|
||||||
|
help="When enabled, use on-the-fly cut mixing and feature "
|
||||||
|
"extraction. Will drop existing precomputed feature manifests "
|
||||||
|
"if available.",
|
||||||
|
)
|
||||||
|
group.add_argument(
|
||||||
|
"--shuffle",
|
||||||
|
type=str2bool,
|
||||||
|
default=True,
|
||||||
|
help="When enabled (=default), the examples will be "
|
||||||
|
"shuffled for each epoch.",
|
||||||
|
)
|
||||||
|
|
||||||
|
group.add_argument(
|
||||||
|
"--num-workers",
|
||||||
|
type=int,
|
||||||
|
default=8,
|
||||||
|
help="The number of training dataloader workers that "
|
||||||
|
"collect the batches.",
|
||||||
|
)
|
||||||
|
group.add_argument(
|
||||||
|
"--enable-spec-aug",
|
||||||
|
type=str2bool,
|
||||||
|
default=True,
|
||||||
|
help="When enabled, use SpecAugment for training dataset.",
|
||||||
|
)
|
||||||
|
group.add_argument(
|
||||||
|
"--spec-aug-time-warp-factor",
|
||||||
|
type=int,
|
||||||
|
default=80,
|
||||||
|
help="Used only when --enable-spec-aug is True. "
|
||||||
|
"It specifies the factor for time warping in SpecAugment. "
|
||||||
|
"Larger values mean more warping. "
|
||||||
|
"A value less than 1 means to disable time warp.",
|
||||||
|
)
|
||||||
|
|
||||||
|
def train_dataloaders(
|
||||||
|
self,
|
||||||
|
cuts_train: CutSet,
|
||||||
|
sampler_state_dict: Optional[Dict[str, Any]] = None,
|
||||||
|
) -> DataLoader:
|
||||||
|
"""
|
||||||
|
Args:
|
||||||
|
cuts_train:
|
||||||
|
CutSet for training.
|
||||||
|
sampler_state_dict:
|
||||||
|
The state dict for the training sampler.
|
||||||
|
"""
|
||||||
|
logging.info("About to get Musan cuts")
|
||||||
|
cuts_musan = load_manifest(
|
||||||
|
self.args.manifest_dir / "cuts_musan.jsonl.gz"
|
||||||
|
)
|
||||||
|
|
||||||
|
transforms = []
|
||||||
|
if self.args.enable_musan:
|
||||||
|
logging.info("Enable MUSAN")
|
||||||
|
transforms.append(
|
||||||
|
CutMix(
|
||||||
|
cuts=cuts_musan, prob=0.5, snr=(10, 20), preserve_id=True
|
||||||
|
)
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
logging.info("Disable MUSAN")
|
||||||
|
|
||||||
|
if self.args.concatenate_cuts:
|
||||||
|
logging.info(
|
||||||
|
f"Using cut concatenation with duration factor "
|
||||||
|
f"{self.args.duration_factor} and gap {self.args.gap}."
|
||||||
|
)
|
||||||
|
# Cut concatenation should be the first transform in the list,
|
||||||
|
# so that if we e.g. mix noise in, it will fill the gaps between
|
||||||
|
# different utterances.
|
||||||
|
transforms = [
|
||||||
|
CutConcatenate(
|
||||||
|
duration_factor=self.args.duration_factor, gap=self.args.gap
|
||||||
|
)
|
||||||
|
] + transforms
|
||||||
|
|
||||||
|
input_transforms = []
|
||||||
|
if self.args.enable_spec_aug:
|
||||||
|
logging.info("Enable SpecAugment")
|
||||||
|
logging.info(
|
||||||
|
f"Time warp factor: {self.args.spec_aug_time_warp_factor}"
|
||||||
|
)
|
||||||
|
input_transforms.append(
|
||||||
|
SpecAugment(
|
||||||
|
time_warp_factor=self.args.spec_aug_time_warp_factor,
|
||||||
|
num_frame_masks=2,
|
||||||
|
features_mask_size=27,
|
||||||
|
num_feature_masks=2,
|
||||||
|
frames_mask_size=100,
|
||||||
|
)
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
logging.info("Disable SpecAugment")
|
||||||
|
|
||||||
|
logging.info("About to create train dataset")
|
||||||
|
if self.args.on_the_fly_feats:
|
||||||
|
train = K2SpeechRecognitionDataset(
|
||||||
|
cut_transforms=transforms,
|
||||||
|
input_strategy=OnTheFlyFeatures(
|
||||||
|
Fbank(FbankConfig(num_mel_bins=80))
|
||||||
|
),
|
||||||
|
input_transforms=input_transforms,
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
train = K2SpeechRecognitionDataset(
|
||||||
|
cut_transforms=transforms,
|
||||||
|
input_transforms=input_transforms,
|
||||||
|
)
|
||||||
|
|
||||||
|
logging.info("Using DynamicBucketingSampler.")
|
||||||
|
train_sampler = DynamicBucketingSampler(
|
||||||
|
cuts_train,
|
||||||
|
max_duration=self.args.max_duration,
|
||||||
|
shuffle=False,
|
||||||
|
num_buckets=self.args.num_buckets,
|
||||||
|
drop_last=True,
|
||||||
|
)
|
||||||
|
logging.info("About to create train dataloader")
|
||||||
|
|
||||||
|
if sampler_state_dict is not None:
|
||||||
|
logging.info("Loading sampler state dict")
|
||||||
|
train_sampler.load_state_dict(sampler_state_dict)
|
||||||
|
|
||||||
|
# 'seed' is derived from the current random state, which will have
|
||||||
|
# previously been set in the main process.
|
||||||
|
seed = torch.randint(0, 100000, ()).item()
|
||||||
|
worker_init_fn = _SeedWorkers(seed)
|
||||||
|
|
||||||
|
train_dl = DataLoader(
|
||||||
|
train,
|
||||||
|
sampler=train_sampler,
|
||||||
|
batch_size=None,
|
||||||
|
num_workers=self.args.num_workers,
|
||||||
|
persistent_workers=False,
|
||||||
|
worker_init_fn=worker_init_fn,
|
||||||
|
)
|
||||||
|
|
||||||
|
return train_dl
|
||||||
|
|
||||||
|
def valid_dataloaders(self, cuts_valid: CutSet) -> DataLoader:
|
||||||
|
|
||||||
|
transforms = []
|
||||||
|
if self.args.concatenate_cuts:
|
||||||
|
transforms = [
|
||||||
|
CutConcatenate(
|
||||||
|
duration_factor=self.args.duration_factor, gap=self.args.gap
|
||||||
|
)
|
||||||
|
] + transforms
|
||||||
|
|
||||||
|
logging.info("About to create dev dataset")
|
||||||
|
if self.args.on_the_fly_feats:
|
||||||
|
validate = K2SpeechRecognitionDataset(
|
||||||
|
cut_transforms=transforms,
|
||||||
|
input_strategy=OnTheFlyFeatures(
|
||||||
|
Fbank(FbankConfig(num_mel_bins=80))
|
||||||
|
),
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
validate = K2SpeechRecognitionDataset(
|
||||||
|
cut_transforms=transforms,
|
||||||
|
)
|
||||||
|
valid_sampler = DynamicBucketingSampler(
|
||||||
|
cuts_valid,
|
||||||
|
max_duration=self.args.max_duration,
|
||||||
|
shuffle=False,
|
||||||
|
)
|
||||||
|
logging.info("About to create dev dataloader")
|
||||||
|
valid_dl = DataLoader(
|
||||||
|
validate,
|
||||||
|
sampler=valid_sampler,
|
||||||
|
batch_size=None,
|
||||||
|
num_workers=2,
|
||||||
|
persistent_workers=False,
|
||||||
|
)
|
||||||
|
|
||||||
|
return valid_dl
|
||||||
|
|
||||||
|
def test_dataloaders(self, cuts: CutSet) -> DataLoader:
|
||||||
|
logging.debug("About to create test dataset")
|
||||||
|
test = K2SpeechRecognitionDataset(
|
||||||
|
input_strategy=OnTheFlyFeatures(Fbank(FbankConfig(num_mel_bins=80)))
|
||||||
|
if self.args.on_the_fly_feats
|
||||||
|
else PrecomputedFeatures(),
|
||||||
|
)
|
||||||
|
sampler = DynamicBucketingSampler(
|
||||||
|
cuts, max_duration=self.args.max_duration, shuffle=False
|
||||||
|
)
|
||||||
|
logging.debug("About to create test dataloader")
|
||||||
|
test_dl = DataLoader(
|
||||||
|
test,
|
||||||
|
batch_size=None,
|
||||||
|
sampler=sampler,
|
||||||
|
num_workers=self.args.num_workers,
|
||||||
|
)
|
||||||
|
return test_dl
|
||||||
|
|
||||||
|
@lru_cache()
|
||||||
|
def train_cuts(self) -> CutSet:
|
||||||
|
logging.info("About to get SPGISpeech train cuts")
|
||||||
|
return load_manifest_lazy(
|
||||||
|
self.args.manifest_dir / "cuts_train_shuf.jsonl.gz"
|
||||||
|
)
|
||||||
|
|
||||||
|
@lru_cache()
|
||||||
|
def dev_cuts(self) -> CutSet:
|
||||||
|
logging.info("About to get SPGISpeech dev cuts")
|
||||||
|
return load_manifest_lazy(self.args.manifest_dir / "cuts_dev.jsonl.gz")
|
||||||
|
|
||||||
|
@lru_cache()
|
||||||
|
def val_cuts(self) -> CutSet:
|
||||||
|
logging.info("About to get SPGISpeech val cuts")
|
||||||
|
return load_manifest_lazy(self.args.manifest_dir / "cuts_val.jsonl.gz")
|
||||||
|
|
||||||
|
|
||||||
|
def test():
|
||||||
|
parser = argparse.ArgumentParser()
|
||||||
|
SPGISpeechAsrDataModule.add_arguments(parser)
|
||||||
|
args = parser.parse_args()
|
||||||
|
adm = SPGISpeechAsrDataModule(args)
|
||||||
|
|
||||||
|
cuts = adm.train_cuts()
|
||||||
|
dl = adm.train_dataloaders(cuts)
|
||||||
|
for i, batch in tqdm(enumerate(dl)):
|
||||||
|
if i == 100:
|
||||||
|
break
|
||||||
|
|
||||||
|
cuts = adm.dev_cuts()
|
||||||
|
dl = adm.valid_dataloaders(cuts)
|
||||||
|
for i, batch in tqdm(enumerate(dl)):
|
||||||
|
if i == 100:
|
||||||
|
break
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
test()
|
1
egs/spgispeech/ASR/pruned_transducer_stateless2/beam_search.py
Symbolic link
1
egs/spgispeech/ASR/pruned_transducer_stateless2/beam_search.py
Symbolic link
@ -0,0 +1 @@
|
|||||||
|
../../../librispeech/ASR/pruned_transducer_stateless2/beam_search.py
|
1
egs/spgispeech/ASR/pruned_transducer_stateless2/conformer.py
Symbolic link
1
egs/spgispeech/ASR/pruned_transducer_stateless2/conformer.py
Symbolic link
@ -0,0 +1 @@
|
|||||||
|
../../../librispeech/ASR/pruned_transducer_stateless2/conformer.py
|
594
egs/spgispeech/ASR/pruned_transducer_stateless2/decode.py
Executable file
594
egs/spgispeech/ASR/pruned_transducer_stateless2/decode.py
Executable file
@ -0,0 +1,594 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
#
|
||||||
|
# Copyright 2021 Xiaomi Corporation (Author: Fangjun Kuang)
|
||||||
|
#
|
||||||
|
# See ../../../../LICENSE for clarification regarding multiple authors
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
"""
|
||||||
|
Usage:
|
||||||
|
(1) greedy search
|
||||||
|
./pruned_transducer_stateless2/decode.py \
|
||||||
|
--iter 696000 \
|
||||||
|
--avg 10 \
|
||||||
|
--exp-dir ./pruned_transducer_stateless2/exp \
|
||||||
|
--max-duration 100 \
|
||||||
|
--decoding-method greedy_search
|
||||||
|
|
||||||
|
(2) beam search
|
||||||
|
./pruned_transducer_stateless2/decode.py \
|
||||||
|
--iter 696000 \
|
||||||
|
--avg 10 \
|
||||||
|
--exp-dir ./pruned_transducer_stateless2/exp \
|
||||||
|
--max-duration 100 \
|
||||||
|
--decoding-method beam_search \
|
||||||
|
--beam-size 4
|
||||||
|
|
||||||
|
(3) modified beam search
|
||||||
|
./pruned_transducer_stateless2/decode.py \
|
||||||
|
--iter 696000 \
|
||||||
|
--avg 10 \
|
||||||
|
--exp-dir ./pruned_transducer_stateless2/exp \
|
||||||
|
--max-duration 100 \
|
||||||
|
--decoding-method modified_beam_search \
|
||||||
|
--beam-size 4
|
||||||
|
|
||||||
|
(4) fast beam search
|
||||||
|
./pruned_transducer_stateless2/decode.py \
|
||||||
|
--iter 696000 \
|
||||||
|
--avg 10 \
|
||||||
|
--exp-dir ./pruned_transducer_stateless2/exp \
|
||||||
|
--max-duration 1500 \
|
||||||
|
--decoding-method fast_beam_search \
|
||||||
|
--beam 4 \
|
||||||
|
--max-contexts 4 \
|
||||||
|
--max-states 8
|
||||||
|
"""
|
||||||
|
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import logging
|
||||||
|
from collections import defaultdict
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Dict, List, Optional, Tuple
|
||||||
|
|
||||||
|
import k2
|
||||||
|
import sentencepiece as spm
|
||||||
|
import torch
|
||||||
|
import torch.nn as nn
|
||||||
|
from asr_datamodule import SPGISpeechAsrDataModule
|
||||||
|
from beam_search import (
|
||||||
|
beam_search,
|
||||||
|
fast_beam_search_one_best,
|
||||||
|
greedy_search,
|
||||||
|
greedy_search_batch,
|
||||||
|
modified_beam_search,
|
||||||
|
)
|
||||||
|
from train import get_params, get_transducer_model
|
||||||
|
|
||||||
|
from icefall.checkpoint import (
|
||||||
|
average_checkpoints,
|
||||||
|
find_checkpoints,
|
||||||
|
load_checkpoint,
|
||||||
|
)
|
||||||
|
from icefall.utils import (
|
||||||
|
AttributeDict,
|
||||||
|
setup_logger,
|
||||||
|
store_transcripts,
|
||||||
|
write_error_stats,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def get_parser():
|
||||||
|
parser = argparse.ArgumentParser(
|
||||||
|
formatter_class=argparse.ArgumentDefaultsHelpFormatter
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
"--epoch",
|
||||||
|
type=int,
|
||||||
|
default=20,
|
||||||
|
help="""It specifies the checkpoint to use for decoding.
|
||||||
|
Note: Epoch counts from 0.
|
||||||
|
You can specify --avg to use more checkpoints for model averaging.""",
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
"--iter",
|
||||||
|
type=int,
|
||||||
|
default=0,
|
||||||
|
help="""If positive, --epoch is ignored and it
|
||||||
|
will use the checkpoint exp_dir/checkpoint-iter.pt.
|
||||||
|
You can specify --avg to use more checkpoints for model averaging.
|
||||||
|
""",
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
"--avg",
|
||||||
|
type=int,
|
||||||
|
default=10,
|
||||||
|
help="Number of checkpoints to average. Automatically select "
|
||||||
|
"consecutive checkpoints before the checkpoint specified by "
|
||||||
|
"'--epoch' and '--iter'",
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
"--exp-dir",
|
||||||
|
type=str,
|
||||||
|
default="pruned_transducer_stateless2/exp",
|
||||||
|
help="The experiment dir",
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
"--bpe-model",
|
||||||
|
type=str,
|
||||||
|
default="data/lang_bpe_500/bpe.model",
|
||||||
|
help="Path to the BPE model",
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
"--decoding-method",
|
||||||
|
type=str,
|
||||||
|
default="greedy_search",
|
||||||
|
help="""Possible values are:
|
||||||
|
- greedy_search
|
||||||
|
- beam_search
|
||||||
|
- modified_beam_search
|
||||||
|
- fast_beam_search
|
||||||
|
""",
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
"--beam-size",
|
||||||
|
type=int,
|
||||||
|
default=4,
|
||||||
|
help="""An interger indicating how many candidates we will keep for each
|
||||||
|
frame. Used only when --decoding-method is beam_search or
|
||||||
|
modified_beam_search.""",
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
"--beam",
|
||||||
|
type=float,
|
||||||
|
default=4,
|
||||||
|
help="""A floating point value to calculate the cutoff score during beam
|
||||||
|
search (i.e., `cutoff = max-score - beam`), which is the same as the
|
||||||
|
`beam` in Kaldi.
|
||||||
|
Used only when --decoding-method is fast_beam_search""",
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
"--max-contexts",
|
||||||
|
type=int,
|
||||||
|
default=4,
|
||||||
|
help="""Used only when --decoding-method is
|
||||||
|
fast_beam_search""",
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
"--max-states",
|
||||||
|
type=int,
|
||||||
|
default=8,
|
||||||
|
help="""Used only when --decoding-method is
|
||||||
|
fast_beam_search""",
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
"--context-size",
|
||||||
|
type=int,
|
||||||
|
default=2,
|
||||||
|
help="The context size in the decoder. 1 means bigram; "
|
||||||
|
"2 means tri-gram",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--max-sym-per-frame",
|
||||||
|
type=int,
|
||||||
|
default=1,
|
||||||
|
help="""Maximum number of symbols per frame.
|
||||||
|
Used only when --decoding_method is greedy_search""",
|
||||||
|
)
|
||||||
|
|
||||||
|
return parser
|
||||||
|
|
||||||
|
|
||||||
|
def decode_one_batch(
|
||||||
|
params: AttributeDict,
|
||||||
|
model: nn.Module,
|
||||||
|
sp: spm.SentencePieceProcessor,
|
||||||
|
batch: dict,
|
||||||
|
decoding_graph: Optional[k2.Fsa] = None,
|
||||||
|
) -> Dict[str, List[List[str]]]:
|
||||||
|
"""Decode one batch and return the result in a dict. The dict has the
|
||||||
|
following format:
|
||||||
|
|
||||||
|
- key: It indicates the setting used for decoding. For example,
|
||||||
|
if greedy_search is used, it would be "greedy_search"
|
||||||
|
If beam search with a beam size of 7 is used, it would be
|
||||||
|
"beam_7"
|
||||||
|
- value: It contains the decoding result. `len(value)` equals to
|
||||||
|
batch size. `value[i]` is the decoding result for the i-th
|
||||||
|
utterance in the given batch.
|
||||||
|
Args:
|
||||||
|
params:
|
||||||
|
It's the return value of :func:`get_params`.
|
||||||
|
model:
|
||||||
|
The neural model.
|
||||||
|
sp:
|
||||||
|
The BPE model.
|
||||||
|
batch:
|
||||||
|
It is the return value from iterating
|
||||||
|
`lhotse.dataset.K2SpeechRecognitionDataset`. See its documentation
|
||||||
|
for the format of the `batch`.
|
||||||
|
decoding_graph:
|
||||||
|
The decoding graph. Can be either a `k2.trivial_graph` or HLG, Used
|
||||||
|
only when --decoding_method is fast_beam_search.
|
||||||
|
Returns:
|
||||||
|
Return the decoding result. See above description for the format of
|
||||||
|
the returned dict.
|
||||||
|
"""
|
||||||
|
device = model.device
|
||||||
|
feature = batch["inputs"]
|
||||||
|
assert feature.ndim == 3
|
||||||
|
|
||||||
|
feature = feature.to(device)
|
||||||
|
# at entry, feature is (N, T, C)
|
||||||
|
|
||||||
|
supervisions = batch["supervisions"]
|
||||||
|
feature_lens = supervisions["num_frames"].to(device)
|
||||||
|
|
||||||
|
encoder_out, encoder_out_lens = model.encoder(
|
||||||
|
x=feature, x_lens=feature_lens
|
||||||
|
)
|
||||||
|
hyps = []
|
||||||
|
|
||||||
|
if params.decoding_method == "fast_beam_search":
|
||||||
|
hyp_tokens = fast_beam_search_one_best(
|
||||||
|
model=model,
|
||||||
|
decoding_graph=decoding_graph,
|
||||||
|
encoder_out=encoder_out,
|
||||||
|
encoder_out_lens=encoder_out_lens,
|
||||||
|
beam=params.beam,
|
||||||
|
max_contexts=params.max_contexts,
|
||||||
|
max_states=params.max_states,
|
||||||
|
)
|
||||||
|
for hyp in sp.decode(hyp_tokens):
|
||||||
|
hyps.append(hyp.split())
|
||||||
|
elif (
|
||||||
|
params.decoding_method == "greedy_search"
|
||||||
|
and params.max_sym_per_frame == 1
|
||||||
|
):
|
||||||
|
hyp_tokens = greedy_search_batch(
|
||||||
|
model=model,
|
||||||
|
encoder_out=encoder_out,
|
||||||
|
encoder_out_lens=encoder_out_lens,
|
||||||
|
)
|
||||||
|
for hyp in sp.decode(hyp_tokens):
|
||||||
|
hyps.append(hyp.split())
|
||||||
|
elif params.decoding_method == "modified_beam_search":
|
||||||
|
hyp_tokens = modified_beam_search(
|
||||||
|
model=model,
|
||||||
|
encoder_out=encoder_out,
|
||||||
|
encoder_out_lens=encoder_out_lens,
|
||||||
|
beam=params.beam_size,
|
||||||
|
)
|
||||||
|
for hyp in sp.decode(hyp_tokens):
|
||||||
|
hyps.append(hyp.split())
|
||||||
|
else:
|
||||||
|
batch_size = encoder_out.size(0)
|
||||||
|
|
||||||
|
for i in range(batch_size):
|
||||||
|
# fmt: off
|
||||||
|
encoder_out_i = encoder_out[i:i+1, :encoder_out_lens[i]]
|
||||||
|
# fmt: on
|
||||||
|
if params.decoding_method == "greedy_search":
|
||||||
|
hyp = greedy_search(
|
||||||
|
model=model,
|
||||||
|
encoder_out=encoder_out_i,
|
||||||
|
max_sym_per_frame=params.max_sym_per_frame,
|
||||||
|
)
|
||||||
|
elif params.decoding_method == "beam_search":
|
||||||
|
hyp = beam_search(
|
||||||
|
model=model,
|
||||||
|
encoder_out=encoder_out_i,
|
||||||
|
beam=params.beam_size,
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
raise ValueError(
|
||||||
|
f"Unsupported decoding method: {params.decoding_method}"
|
||||||
|
)
|
||||||
|
hyps.append(sp.decode(hyp).split())
|
||||||
|
|
||||||
|
if params.decoding_method == "greedy_search":
|
||||||
|
return {"greedy_search": hyps}
|
||||||
|
elif params.decoding_method == "fast_beam_search":
|
||||||
|
return {
|
||||||
|
(
|
||||||
|
f"beam_{params.beam}_"
|
||||||
|
f"max_contexts_{params.max_contexts}_"
|
||||||
|
f"max_states_{params.max_states}"
|
||||||
|
): hyps
|
||||||
|
}
|
||||||
|
else:
|
||||||
|
return {f"beam_size_{params.beam_size}": hyps}
|
||||||
|
|
||||||
|
|
||||||
|
def decode_dataset(
|
||||||
|
dl: torch.utils.data.DataLoader,
|
||||||
|
params: AttributeDict,
|
||||||
|
model: nn.Module,
|
||||||
|
sp: spm.SentencePieceProcessor,
|
||||||
|
decoding_graph: Optional[k2.Fsa] = None,
|
||||||
|
) -> Dict[str, List[Tuple[List[str], List[str]]]]:
|
||||||
|
"""Decode dataset.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
dl:
|
||||||
|
PyTorch's dataloader containing the dataset to decode.
|
||||||
|
params:
|
||||||
|
It is returned by :func:`get_params`.
|
||||||
|
model:
|
||||||
|
The neural model.
|
||||||
|
sp:
|
||||||
|
The BPE model.
|
||||||
|
decoding_graph:
|
||||||
|
The decoding graph. Can be either a `k2.trivial_graph` or HLG, Used
|
||||||
|
only when --decoding_method is fast_beam_search.
|
||||||
|
Returns:
|
||||||
|
Return a dict, whose key may be "greedy_search" if greedy search
|
||||||
|
is used, or it may be "beam_7" if beam size of 7 is used.
|
||||||
|
Its value is a list of tuples. Each tuple contains two elements:
|
||||||
|
The first is the reference transcript, and the second is the
|
||||||
|
predicted result.
|
||||||
|
"""
|
||||||
|
num_cuts = 0
|
||||||
|
|
||||||
|
try:
|
||||||
|
num_batches = len(dl)
|
||||||
|
except TypeError:
|
||||||
|
num_batches = "?"
|
||||||
|
|
||||||
|
if params.decoding_method == "greedy_search":
|
||||||
|
log_interval = 100
|
||||||
|
else:
|
||||||
|
log_interval = 2
|
||||||
|
|
||||||
|
results = defaultdict(list)
|
||||||
|
for batch_idx, batch in enumerate(dl):
|
||||||
|
texts = batch["supervisions"]["text"]
|
||||||
|
|
||||||
|
hyps_dict = decode_one_batch(
|
||||||
|
params=params,
|
||||||
|
model=model,
|
||||||
|
sp=sp,
|
||||||
|
decoding_graph=decoding_graph,
|
||||||
|
batch=batch,
|
||||||
|
)
|
||||||
|
|
||||||
|
for name, hyps in hyps_dict.items():
|
||||||
|
this_batch = []
|
||||||
|
assert len(hyps) == len(texts)
|
||||||
|
for hyp_words, ref_text in zip(hyps, texts):
|
||||||
|
ref_words = ref_text.split()
|
||||||
|
this_batch.append((ref_words, hyp_words))
|
||||||
|
|
||||||
|
results[name].extend(this_batch)
|
||||||
|
|
||||||
|
num_cuts += len(texts)
|
||||||
|
|
||||||
|
if batch_idx % log_interval == 0:
|
||||||
|
batch_str = f"{batch_idx}/{num_batches}"
|
||||||
|
|
||||||
|
logging.info(
|
||||||
|
f"batch {batch_str}, cuts processed until now is {num_cuts}"
|
||||||
|
)
|
||||||
|
return results
|
||||||
|
|
||||||
|
|
||||||
|
def save_results(
|
||||||
|
params: AttributeDict,
|
||||||
|
test_set_name: str,
|
||||||
|
results_dict: Dict[str, List[Tuple[List[int], List[int]]]],
|
||||||
|
):
|
||||||
|
test_set_wers = dict()
|
||||||
|
test_set_cers = dict()
|
||||||
|
for key, results in results_dict.items():
|
||||||
|
recog_path = (
|
||||||
|
params.res_dir / f"recogs-{test_set_name}-{key}-{params.suffix}.txt"
|
||||||
|
)
|
||||||
|
store_transcripts(filename=recog_path, texts=results)
|
||||||
|
logging.info(f"The transcripts are stored in {recog_path}")
|
||||||
|
|
||||||
|
# The following prints out WERs, per-word error statistics and aligned
|
||||||
|
# ref/hyp pairs.
|
||||||
|
wers_filename = (
|
||||||
|
params.res_dir / f"wers-{test_set_name}-{key}-{params.suffix}.txt"
|
||||||
|
)
|
||||||
|
with open(wers_filename, "w") as f:
|
||||||
|
wer = write_error_stats(
|
||||||
|
f, f"{test_set_name}-{key}", results, enable_log=True
|
||||||
|
)
|
||||||
|
test_set_wers[key] = wer
|
||||||
|
|
||||||
|
# we also compute CER for spgispeech dataset.
|
||||||
|
results_char = []
|
||||||
|
for res in results:
|
||||||
|
results_char.append((list("".join(res[0])), list("".join(res[1]))))
|
||||||
|
cers_filename = (
|
||||||
|
params.res_dir / f"cers-{test_set_name}-{key}-{params.suffix}.txt"
|
||||||
|
)
|
||||||
|
with open(cers_filename, "w") as f:
|
||||||
|
cer = write_error_stats(
|
||||||
|
f, f"{test_set_name}-{key}", results_char, enable_log=True
|
||||||
|
)
|
||||||
|
test_set_cers[key] = cer
|
||||||
|
|
||||||
|
logging.info("Wrote detailed error stats to {}".format(wers_filename))
|
||||||
|
|
||||||
|
test_set_wers = {
|
||||||
|
k: v for k, v in sorted(test_set_wers.items(), key=lambda x: x[1])
|
||||||
|
}
|
||||||
|
test_set_cers = {
|
||||||
|
k: v for k, v in sorted(test_set_cers.items(), key=lambda x: x[1])
|
||||||
|
}
|
||||||
|
errs_info = (
|
||||||
|
params.res_dir
|
||||||
|
/ f"wer-summary-{test_set_name}-{key}-{params.suffix}.txt"
|
||||||
|
)
|
||||||
|
with open(errs_info, "w") as f:
|
||||||
|
print("settings\tWER\tCER", file=f)
|
||||||
|
for key in test_set_wers:
|
||||||
|
print(
|
||||||
|
"{}\t{}\t{}".format(
|
||||||
|
key, test_set_wers[key], test_set_cers[key]
|
||||||
|
),
|
||||||
|
file=f,
|
||||||
|
)
|
||||||
|
|
||||||
|
s = "\nFor {}, WER/CER of different settings are:\n".format(test_set_name)
|
||||||
|
note = "\tbest for {}".format(test_set_name)
|
||||||
|
for key in test_set_wers:
|
||||||
|
s += "{}\t{}\t{}{}\n".format(
|
||||||
|
key, test_set_wers[key], test_set_cers[key], note
|
||||||
|
)
|
||||||
|
note = ""
|
||||||
|
logging.info(s)
|
||||||
|
|
||||||
|
|
||||||
|
@torch.no_grad()
|
||||||
|
def main():
|
||||||
|
parser = get_parser()
|
||||||
|
SPGISpeechAsrDataModule.add_arguments(parser)
|
||||||
|
args = parser.parse_args()
|
||||||
|
args.exp_dir = Path(args.exp_dir)
|
||||||
|
|
||||||
|
params = get_params()
|
||||||
|
params.update(vars(args))
|
||||||
|
|
||||||
|
assert params.decoding_method in (
|
||||||
|
"greedy_search",
|
||||||
|
"beam_search",
|
||||||
|
"fast_beam_search",
|
||||||
|
"modified_beam_search",
|
||||||
|
)
|
||||||
|
params.res_dir = params.exp_dir / params.decoding_method
|
||||||
|
|
||||||
|
if params.iter > 0:
|
||||||
|
params.suffix = f"iter-{params.iter}-avg-{params.avg}"
|
||||||
|
else:
|
||||||
|
params.suffix = f"epoch-{params.epoch}-avg-{params.avg}"
|
||||||
|
|
||||||
|
if "fast_beam_search" in params.decoding_method:
|
||||||
|
params.suffix += f"-beam-{params.beam}"
|
||||||
|
params.suffix += f"-max-contexts-{params.max_contexts}"
|
||||||
|
params.suffix += f"-max-states-{params.max_states}"
|
||||||
|
elif "beam_search" in params.decoding_method:
|
||||||
|
params.suffix += (
|
||||||
|
f"-{params.decoding_method}-beam-size-{params.beam_size}"
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
params.suffix += f"-context-{params.context_size}"
|
||||||
|
params.suffix += f"-max-sym-per-frame-{params.max_sym_per_frame}"
|
||||||
|
|
||||||
|
setup_logger(f"{params.res_dir}/log-decode-{params.suffix}")
|
||||||
|
logging.info("Decoding started")
|
||||||
|
|
||||||
|
device = torch.device("cpu")
|
||||||
|
if torch.cuda.is_available():
|
||||||
|
device = torch.device("cuda", 0)
|
||||||
|
|
||||||
|
logging.info(f"Device: {device}")
|
||||||
|
|
||||||
|
sp = spm.SentencePieceProcessor()
|
||||||
|
sp.load(params.bpe_model)
|
||||||
|
|
||||||
|
# <blk> is defined in local/train_bpe_model.py
|
||||||
|
params.blank_id = sp.piece_to_id("<blk>")
|
||||||
|
params.vocab_size = sp.get_piece_size()
|
||||||
|
|
||||||
|
logging.info(params)
|
||||||
|
|
||||||
|
logging.info("About to create model")
|
||||||
|
model = get_transducer_model(params)
|
||||||
|
|
||||||
|
if params.iter > 0:
|
||||||
|
filenames = find_checkpoints(params.exp_dir, iteration=-params.iter)[
|
||||||
|
: params.avg
|
||||||
|
]
|
||||||
|
if len(filenames) == 0:
|
||||||
|
raise ValueError(
|
||||||
|
f"No checkpoints found for"
|
||||||
|
f" --iter {params.iter}, --avg {params.avg}"
|
||||||
|
)
|
||||||
|
elif len(filenames) < params.avg:
|
||||||
|
raise ValueError(
|
||||||
|
f"Not enough checkpoints ({len(filenames)}) found for"
|
||||||
|
f" --iter {params.iter}, --avg {params.avg}"
|
||||||
|
)
|
||||||
|
logging.info(f"averaging {filenames}")
|
||||||
|
model.to(device)
|
||||||
|
model.load_state_dict(average_checkpoints(filenames, device=device))
|
||||||
|
elif params.avg == 1:
|
||||||
|
load_checkpoint(f"{params.exp_dir}/epoch-{params.epoch}.pt", model)
|
||||||
|
else:
|
||||||
|
start = params.epoch - params.avg + 1
|
||||||
|
filenames = []
|
||||||
|
for i in range(start, params.epoch + 1):
|
||||||
|
if start >= 0:
|
||||||
|
filenames.append(f"{params.exp_dir}/epoch-{i}.pt")
|
||||||
|
logging.info(f"averaging {filenames}")
|
||||||
|
model.to(device)
|
||||||
|
model.load_state_dict(average_checkpoints(filenames, device=device))
|
||||||
|
|
||||||
|
model.to(device)
|
||||||
|
model.eval()
|
||||||
|
model.device = device
|
||||||
|
|
||||||
|
if params.decoding_method == "fast_beam_search":
|
||||||
|
decoding_graph = k2.trivial_graph(params.vocab_size - 1, device=device)
|
||||||
|
else:
|
||||||
|
decoding_graph = None
|
||||||
|
|
||||||
|
num_param = sum([p.numel() for p in model.parameters()])
|
||||||
|
logging.info(f"Number of model parameters: {num_param}")
|
||||||
|
|
||||||
|
spgispeech = SPGISpeechAsrDataModule(args)
|
||||||
|
|
||||||
|
dev_cuts = spgispeech.dev_cuts()
|
||||||
|
val_cuts = spgispeech.val_cuts()
|
||||||
|
|
||||||
|
dev_dl = spgispeech.test_dataloaders(dev_cuts)
|
||||||
|
val_dl = spgispeech.test_dataloaders(val_cuts)
|
||||||
|
|
||||||
|
test_sets = ["dev", "val"]
|
||||||
|
test_dl = [dev_dl, val_dl]
|
||||||
|
|
||||||
|
for test_set, test_dl in zip(test_sets, test_dl):
|
||||||
|
results_dict = decode_dataset(
|
||||||
|
dl=test_dl,
|
||||||
|
params=params,
|
||||||
|
model=model,
|
||||||
|
sp=sp,
|
||||||
|
decoding_graph=decoding_graph,
|
||||||
|
)
|
||||||
|
|
||||||
|
save_results(
|
||||||
|
params=params,
|
||||||
|
test_set_name=test_set,
|
||||||
|
results_dict=results_dict,
|
||||||
|
)
|
||||||
|
|
||||||
|
logging.info("Done!")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
1
egs/spgispeech/ASR/pruned_transducer_stateless2/decoder.py
Symbolic link
1
egs/spgispeech/ASR/pruned_transducer_stateless2/decoder.py
Symbolic link
@ -0,0 +1 @@
|
|||||||
|
../../../librispeech/ASR/pruned_transducer_stateless2/decoder.py
|
@ -0,0 +1 @@
|
|||||||
|
../../../librispeech/ASR/pruned_transducer_stateless2/encoder_interface.py
|
201
egs/spgispeech/ASR/pruned_transducer_stateless2/export.py
Executable file
201
egs/spgispeech/ASR/pruned_transducer_stateless2/export.py
Executable file
@ -0,0 +1,201 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
#
|
||||||
|
# Copyright 2021 Xiaomi Corporation (Author: Fangjun Kuang)
|
||||||
|
#
|
||||||
|
# See ../../../../LICENSE for clarification regarding multiple authors
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
|
||||||
|
# This script converts several saved checkpoints
|
||||||
|
# to a single one using model averaging.
|
||||||
|
"""
|
||||||
|
Usage:
|
||||||
|
./pruned_transducer_stateless2/export.py \
|
||||||
|
--exp-dir ./pruned_transducer_stateless2/exp \
|
||||||
|
--bpe-model data/lang_bpe_500/bpe.model \
|
||||||
|
--avg-last-n 10
|
||||||
|
|
||||||
|
It will generate a file exp_dir/pretrained.pt
|
||||||
|
|
||||||
|
To use the generated file with `pruned_transducer_stateless2/decode.py`,
|
||||||
|
you can do:
|
||||||
|
|
||||||
|
cd /path/to/exp_dir
|
||||||
|
ln -s pretrained.pt epoch-9999.pt
|
||||||
|
|
||||||
|
cd /path/to/egs/spgispeech/ASR
|
||||||
|
./pruned_transducer_stateless2/decode.py \
|
||||||
|
--exp-dir ./pruned_transducer_stateless2/exp \
|
||||||
|
--epoch 9999 \
|
||||||
|
--avg 1 \
|
||||||
|
--max-duration 100 \
|
||||||
|
--bpe-model data/lang_bpe_500/bpe.model
|
||||||
|
"""
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import logging
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import sentencepiece as spm
|
||||||
|
import torch
|
||||||
|
from train import get_params, get_transducer_model
|
||||||
|
|
||||||
|
from icefall.checkpoint import (
|
||||||
|
average_checkpoints,
|
||||||
|
find_checkpoints,
|
||||||
|
load_checkpoint,
|
||||||
|
)
|
||||||
|
from icefall.utils import str2bool
|
||||||
|
|
||||||
|
|
||||||
|
def get_parser():
|
||||||
|
parser = argparse.ArgumentParser(
|
||||||
|
formatter_class=argparse.ArgumentDefaultsHelpFormatter
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
"--epoch",
|
||||||
|
type=int,
|
||||||
|
default=28,
|
||||||
|
help="It specifies the checkpoint to use for decoding."
|
||||||
|
"Note: Epoch counts from 0.",
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
"--avg",
|
||||||
|
type=int,
|
||||||
|
default=15,
|
||||||
|
help="Number of checkpoints to average. Automatically select "
|
||||||
|
"consecutive checkpoints before the checkpoint specified by "
|
||||||
|
"'--epoch'. ",
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
"--avg-last-n",
|
||||||
|
type=int,
|
||||||
|
default=0,
|
||||||
|
help="""If positive, --epoch and --avg are ignored and it
|
||||||
|
will use the last n checkpoints exp_dir/checkpoint-xxx.pt
|
||||||
|
where xxx is the number of processed batches while
|
||||||
|
saving that checkpoint.
|
||||||
|
""",
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
"--exp-dir",
|
||||||
|
type=str,
|
||||||
|
default="pruned_transducer_stateless2/exp",
|
||||||
|
help="""It specifies the directory where all training related
|
||||||
|
files, e.g., checkpoints, log, etc, are saved
|
||||||
|
""",
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
"--bpe-model",
|
||||||
|
type=str,
|
||||||
|
default="data/lang_bpe_500/bpe.model",
|
||||||
|
help="Path to the BPE model",
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
"--jit",
|
||||||
|
type=str2bool,
|
||||||
|
default=False,
|
||||||
|
help="""True to save a model after applying torch.jit.script.
|
||||||
|
""",
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
"--context-size",
|
||||||
|
type=int,
|
||||||
|
default=2,
|
||||||
|
help="The context size in the decoder. 1 means bigram; "
|
||||||
|
"2 means tri-gram",
|
||||||
|
)
|
||||||
|
|
||||||
|
return parser
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
args = get_parser().parse_args()
|
||||||
|
args.exp_dir = Path(args.exp_dir)
|
||||||
|
|
||||||
|
assert args.jit is False, "Support torchscript will be added later"
|
||||||
|
|
||||||
|
params = get_params()
|
||||||
|
params.update(vars(args))
|
||||||
|
|
||||||
|
device = torch.device("cpu")
|
||||||
|
if torch.cuda.is_available():
|
||||||
|
device = torch.device("cuda", 0)
|
||||||
|
|
||||||
|
logging.info(f"device: {device}")
|
||||||
|
|
||||||
|
sp = spm.SentencePieceProcessor()
|
||||||
|
sp.load(params.bpe_model)
|
||||||
|
|
||||||
|
# <blk> is defined in local/train_bpe_model.py
|
||||||
|
params.blank_id = sp.piece_to_id("<blk>")
|
||||||
|
params.vocab_size = sp.get_piece_size()
|
||||||
|
|
||||||
|
logging.info(params)
|
||||||
|
|
||||||
|
logging.info("About to create model")
|
||||||
|
model = get_transducer_model(params)
|
||||||
|
|
||||||
|
model.to(device)
|
||||||
|
|
||||||
|
if params.avg_last_n > 0:
|
||||||
|
filenames = find_checkpoints(params.exp_dir)[: params.avg_last_n]
|
||||||
|
logging.info(f"averaging {filenames}")
|
||||||
|
model.to(device)
|
||||||
|
model.load_state_dict(average_checkpoints(filenames, device=device))
|
||||||
|
elif params.avg == 1:
|
||||||
|
load_checkpoint(f"{params.exp_dir}/epoch-{params.epoch}.pt", model)
|
||||||
|
else:
|
||||||
|
start = params.epoch - params.avg + 1
|
||||||
|
filenames = []
|
||||||
|
for i in range(start, params.epoch + 1):
|
||||||
|
if start >= 0:
|
||||||
|
filenames.append(f"{params.exp_dir}/epoch-{i}.pt")
|
||||||
|
logging.info(f"averaging {filenames}")
|
||||||
|
model.to(device)
|
||||||
|
model.load_state_dict(average_checkpoints(filenames, device=device))
|
||||||
|
|
||||||
|
model.eval()
|
||||||
|
|
||||||
|
model.to("cpu")
|
||||||
|
model.eval()
|
||||||
|
|
||||||
|
if params.jit:
|
||||||
|
logging.info("Using torch.jit.script")
|
||||||
|
model = torch.jit.script(model)
|
||||||
|
filename = params.exp_dir / "cpu_jit.pt"
|
||||||
|
model.save(str(filename))
|
||||||
|
logging.info(f"Saved to {filename}")
|
||||||
|
else:
|
||||||
|
logging.info("Not using torch.jit.script")
|
||||||
|
# Save it using a format so that it can be loaded
|
||||||
|
# by :func:`load_checkpoint`
|
||||||
|
filename = params.exp_dir / "pretrained.pt"
|
||||||
|
torch.save({"model": model.state_dict()}, str(filename))
|
||||||
|
logging.info(f"Saved to {filename}")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
formatter = (
|
||||||
|
"%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] %(message)s"
|
||||||
|
)
|
||||||
|
|
||||||
|
logging.basicConfig(format=formatter, level=logging.INFO)
|
||||||
|
main()
|
1
egs/spgispeech/ASR/pruned_transducer_stateless2/joiner.py
Symbolic link
1
egs/spgispeech/ASR/pruned_transducer_stateless2/joiner.py
Symbolic link
@ -0,0 +1 @@
|
|||||||
|
../../../librispeech/ASR/pruned_transducer_stateless2/joiner.py
|
1
egs/spgispeech/ASR/pruned_transducer_stateless2/model.py
Symbolic link
1
egs/spgispeech/ASR/pruned_transducer_stateless2/model.py
Symbolic link
@ -0,0 +1 @@
|
|||||||
|
../../../librispeech/ASR/pruned_transducer_stateless2/model.py
|
1
egs/spgispeech/ASR/pruned_transducer_stateless2/optim.py
Symbolic link
1
egs/spgispeech/ASR/pruned_transducer_stateless2/optim.py
Symbolic link
@ -0,0 +1 @@
|
|||||||
|
../../../librispeech/ASR/pruned_transducer_stateless2/optim.py
|
1
egs/spgispeech/ASR/pruned_transducer_stateless2/scaling.py
Symbolic link
1
egs/spgispeech/ASR/pruned_transducer_stateless2/scaling.py
Symbolic link
@ -0,0 +1 @@
|
|||||||
|
../../../librispeech/ASR/pruned_transducer_stateless2/scaling.py
|
1031
egs/spgispeech/ASR/pruned_transducer_stateless2/train.py
Executable file
1031
egs/spgispeech/ASR/pruned_transducer_stateless2/train.py
Executable file
File diff suppressed because it is too large
Load Diff
1
egs/spgispeech/ASR/shared
Symbolic link
1
egs/spgispeech/ASR/shared
Symbolic link
@ -0,0 +1 @@
|
|||||||
|
../../../icefall/shared/
|
Loading…
x
Reference in New Issue
Block a user