Previously, it was reset after every epoch, which could cause it to
always use only the first part of the GigaSpeech dataset if you choose
a small --giga-prob (see the sketch below).
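A minimal sketch of why the per-epoch reset was problematic (names such as
`make_giga_iter` are illustrative, not the actual recipe code):

```python
import random

def make_giga_iter():
    # stands in for an iterator over the (very large) GigaSpeech cuts
    return iter(range(10_000_000))

giga_prob = 0.05            # --giga-prob
libri_epoch = range(1000)   # stands in for one LibriSpeech epoch

# Old behaviour: the GigaSpeech iterator is re-created every epoch, so only
# roughly len(libri_epoch) * giga_prob items from the *start* of GigaSpeech
# are ever consumed, epoch after epoch.
for epoch in range(3):
    giga_iter = make_giga_iter()  # reset here is the problem
    for _ in libri_epoch:
        if random.random() < giga_prob:
            next(giga_iter)

# New behaviour: create the iterator once and let it persist across epochs,
# so later epochs keep advancing through the rest of GigaSpeech.
giga_iter = make_giga_iter()
for epoch in range(3):
    for _ in libri_epoch:
        if random.random() < giga_prob:
            next(giga_iter)
```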
* zipformer/ctc_align.py
- tool for forced alignment with a CTC model
- provides a timeline and computes per-token and per-utterance acoustic confidences
- based on torchaudio `forced_align()`
- confidences are computed in several ways (see the sketch below)
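A minimal sketch of the alignment core, assuming a CTC model that outputs
per-frame log-probabilities and a tokenized transcript (function and variable
names are illustrative, not the actual ctc_align.py API):

```python
import torch
import torchaudio.functional as F


def align_utterance(log_probs: torch.Tensor, token_ids: list, blank_id: int = 0):
    """
    log_probs: (T, vocab_size) CTC log-probabilities for one utterance.
    token_ids: the transcript as a list of token ids (no blanks).
    """
    targets = torch.tensor([token_ids], dtype=torch.int32)
    # forced_align expects a batch of size 1: (1, T, vocab) and (1, L)
    frame_labels, frame_scores = F.forced_align(
        log_probs.unsqueeze(0), targets, blank=blank_id
    )
    frame_probs = frame_scores[0].exp()  # per-frame posteriors of the aligned path
    # Collapse blanks/repeats into token spans; span.score is the average frame
    # probability of the span, usable as a per-token acoustic confidence.
    spans = F.merge_tokens(frame_labels[0], frame_probs, blank=blank_id)
    per_token_conf = [span.score for span in spans]
    # One simple per-utterance confidence: the mean of per-token confidences
    # (the script offers several variants).
    utt_conf = sum(per_token_conf) / max(len(per_token_conf), 1)
    # span.start / span.end are frame indices; multiply by the model's frame
    # shift (e.g. 0.04 s for a 4x-subsampled 10 ms frontend) to get timestamps.
    return spans, per_token_conf, utt_conf
```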
other modifications:
- LibriSpeechAsrDataModule extended with `load_manifest()` to allow
passing in a cutset from the CLI (see the sketch after this list).
- update @custom_fwd @custom_bwd in scaling.py
- streaming_decode.py: update errs/recogs/log filenames ('-' <-> '_')
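A minimal sketch of the data-module hook (the name `load_manifest` comes from
the change above; the exact signature is an assumption), using lhotse's lazy
manifest loading:

```python
import argparse
from lhotse import CutSet, load_manifest_lazy


def load_manifest(cuts_path: str) -> CutSet:
    """Lazily load a CutSet from a cuts manifest given on the command line."""
    return load_manifest_lazy(cuts_path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--cuts",
        type=str,
        required=True,
        help="Path to a cuts manifest, e.g. data/fbank/my_cuts.jsonl.gz",
    )
    args = parser.parse_args()
    cuts = load_manifest(args.cuts)
    print(f"Loaded cuts from {args.cuts}: {cuts}")
```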
* putting back `custom_bwd`, `custom_fwd`
* integrating remarks from the PR review
* update of argparse help strings
* ctc_align.py, avoid shadowing a variable
* Finalizing the code:
- applying some CodeRabbit suggestions.
- removing `word_table`, `decoding_graph` from aligner API (unused)
- improved consistency of variable names (confidences)
- updated docstrings
HENT-SRT: Hierarchical Efficient Neural Transducer with Self-Distillation for Joint Speech Recognition and Translation
Paper: https://arxiv.org/abs/2506.02157
Fixes incorrect computation of `encoder_dim` when `encoder_dim` is a comma-separated list of integers by ensuring a numeric (not lexicographic) max is used.
Fixes #2018
- Replace `int(max(params.encoder_dim.split(",")))` (lexicographic max on strings) with `max(_to_int_tuple(params.encoder_dim))` (numeric max); see the illustration below.
- Apply the fix consistently across all affected training scripts.
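A short illustration of the bug and the fix (`_to_int_tuple` here is the simple
split-and-convert helper used by the training scripts):

```python
encoder_dim = "192,256,1024"

# Buggy: max() over strings compares lexicographically, so "256" > "1024".
wrong = int(max(encoder_dim.split(",")))   # -> 256


def _to_int_tuple(s: str):
    """Convert a comma-separated string of ints into a tuple of ints."""
    return tuple(map(int, s.split(",")))


# Fixed: convert to integers first, then take the numeric max.
right = max(_to_int_tuple(encoder_dim))    # -> 1024
```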
- Introduce unified AMP helpers (`create_grad_scaler`, `torch_autocast`) to handle
deprecations in PyTorch ≥2.3.0 (a sketch follows this list)
- Replace direct uses of `torch.cuda.amp.GradScaler` and `torch.cuda.amp.autocast`
with the new utilities across all training and inference scripts
- Update all `torch.load` calls to include `weights_only=False` for compatibility with
newer PyTorch versions
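A minimal sketch of such helpers (the actual icefall implementation may differ
in details), picking the non-deprecated `torch.amp` API when it is available:

```python
import torch


def create_grad_scaler(device: str = "cuda", **kwargs):
    """GradScaler factory that avoids the torch.cuda.amp deprecation warning."""
    if hasattr(torch.amp, "GradScaler"):
        return torch.amp.GradScaler(device, **kwargs)  # PyTorch >= 2.3
    return torch.cuda.amp.GradScaler(**kwargs)         # older PyTorch


def torch_autocast(device_type: str = "cuda", **kwargs):
    """Autocast context manager via the unified torch.amp.autocast API."""
    return torch.amp.autocast(device_type=device_type, **kwargs)


# Typical use together with weights_only=False checkpoint loading
# (paths and variables below are illustrative):
#
#   checkpoint = torch.load("exp/epoch-1.pt", map_location="cpu", weights_only=False)
#   scaler = create_grad_scaler(enabled=True)
#   with torch_autocast(dtype=torch.float16):
#       loss = compute_loss(model, batch)
#   scaler.scale(loss).backward()
#   scaler.step(optimizer)
#   scaler.update()
```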