* zipformer/ctc_align.py
- a tool for forced alignment with a CTC model
- provides a timeline and computes per-token and per-utterance acoustic confidences
- based on torchaudio `forced_align()`
- the confidences are computed in several ways (a sketch follows below)
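For context, a minimal sketch of forced alignment and confidence extraction on top of torchaudio's `forced_align()` and `merge_tokens()`; the helper below is illustrative, and the actual ctc_align.py computes the confidences in more than one way:

```python
import torch
import torchaudio.functional as F


def align_one_utterance(log_probs: torch.Tensor, token_ids: list):
    """log_probs: (1, T, C) CTC log-probs of one utterance (batch size 1)."""
    targets = torch.tensor([token_ids], dtype=torch.int32, device=log_probs.device)
    alignments, scores = F.forced_align(log_probs, targets, blank=0)
    frame_probs = scores[0].exp()  # per-frame posterior of the aligned path

    # timeline + per-token confidence: average posterior over each token span
    token_spans = F.merge_tokens(alignments[0], frame_probs, blank=0)
    token_conf = [span.score for span in token_spans]

    # one possible per-utterance confidence: mean of the token confidences
    utt_conf = sum(token_conf) / max(len(token_conf), 1)
    return token_spans, token_conf, utt_conf
```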
other modifications:
- LibriSpeechAsrDataModule extended with a `load_manifest()` method to allow
  passing in a cutset from the CLI
- update the `@custom_fwd` / `@custom_bwd` decorators in scaling.py (a compat
  shim is sketched after this list)
- streaming_decode.py: swap '-' <-> '_' in the errs/recogs/log filenames
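The decorator update presumably bridges the PyTorch deprecation of `torch.cuda.amp.custom_fwd/custom_bwd` in favor of `torch.amp.custom_fwd/custom_bwd(..., device_type=...)`. A minimal sketch of such a shim, assuming the actual scaling.py code may differ:

```python
import torch

if hasattr(torch.amp, "custom_fwd"):  # newer PyTorch (>= 2.4)
    def custom_fwd(func):
        return torch.amp.custom_fwd(func, device_type="cuda")

    def custom_bwd(func):
        return torch.amp.custom_bwd(func, device_type="cuda")
else:  # older PyTorch
    from torch.cuda.amp import custom_fwd, custom_bwd  # noqa: F401
```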
* putting back `custom_bwd`, `custom_fwd`
* integrating remarks from PR
* update of argparse help strings
* ctc_align.py: avoid shadowing a variable
* Finalizing the code:
- applying some CodeRabbit suggestions
- removing `word_table`, `decoding_graph` from aligner API (unused)
- improved consistency of variable names (confidences)
- updated docstrings
- Introduce unified AMP helpers (create_grad_scaler, torch_autocast) to handle
  deprecations in PyTorch ≥2.3.0 (see the sketch after this list)
- Replace direct uses of torch.cuda.amp.GradScaler and torch.cuda.amp.autocast
  with the new helpers across all training and inference scripts
- Update all torch.load calls to include weights_only=False for compatibility
  with newer PyTorch versions
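A minimal sketch of what such helpers might look like; the actual implementations may differ:

```python
import torch


def create_grad_scaler(device: str = "cuda", **kwargs):
    # torch.cuda.amp.GradScaler is deprecated since PyTorch 2.3.0;
    # torch.amp.GradScaler takes the device as its first argument.
    if hasattr(torch.amp, "GradScaler"):
        return torch.amp.GradScaler(device, **kwargs)
    return torch.cuda.amp.GradScaler(**kwargs)


def torch_autocast(device_type: str = "cuda", **kwargs):
    # torch.autocast is the device-agnostic API (PyTorch >= 1.10) that
    # replaces the deprecated torch.cuda.amp.autocast.
    return torch.autocast(device_type=device_type, **kwargs)


# and for checkpoints that store full Python objects, not just tensors:
# checkpoint = torch.load(path, map_location="cpu", weights_only=False)
```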
1. Attach the inf-check hooks if the grad scale gets too small (sketched below).
2. Add try/except to avoid OOM inside the inf-check hooks.
3. Set warmup_start=0.1 to reduce the chance of divergence.
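A rough sketch of points 1 and 2; the hook body and the grad-scale threshold below are illustrative, not the exact icefall code:

```python
import logging
import torch


def register_inf_check_hooks(model: torch.nn.Module) -> None:
    def make_hook(name):
        def hook(module, _inputs, output):
            if not isinstance(output, torch.Tensor):
                return
            try:
                if not torch.isfinite(output.detach()).all():
                    logging.warning(f"inf/nan in the output of module {name}")
            except torch.cuda.OutOfMemoryError as e:
                # isfinite() allocates a temporary of the same size, which
                # can itself run out of memory; skip the check instead
                logging.warning(f"inf-check skipped (OOM): {e}")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))


# in the training loop (names are illustrative):
# if scaler.get_scale() < 0.01 and not inf_check_attached:
#     register_inf_check_hooks(model)
#     inf_check_attached = True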
* support consistency-regularized CTC (CR-CTC; a loss sketch follows below)
* update the arguments of CR-CTC
* set the default value of cr_loss_masked_scale to 1.0
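For orientation, a rough sketch of a CR-CTC-style loss: plain CTC on two differently augmented views of the input, plus a consistency term (bidirectional KL between the two frame-level distributions). The `cr_loss_scale` weight here is illustrative, and the `cr_loss_masked_scale` mentioned above, which would additionally weight the consistency term at time-masked frames, is omitted:

```python
import torch
import torch.nn.functional as F


def cr_ctc_loss(
    lp_a: torch.Tensor,  # (T, N, C) log-probs from augmented view A
    lp_b: torch.Tensor,  # (T, N, C) log-probs from augmented view B
    targets: torch.Tensor,
    input_lengths: torch.Tensor,
    target_lengths: torch.Tensor,
    cr_loss_scale: float = 0.2,
    blank: int = 0,
) -> torch.Tensor:
    # CTC loss on both views
    ctc = F.ctc_loss(lp_a, targets, input_lengths, target_lengths, blank=blank) \
        + F.ctc_loss(lp_b, targets, input_lengths, target_lengths, blank=blank)
    # consistency regularization: each view is pulled towards the (detached)
    # frame-level distribution of the other view
    cr = F.kl_div(lp_a, lp_b.detach(), log_target=True, reduction="batchmean") \
       + F.kl_div(lp_b, lp_a.detach(), log_target=True, reduction="batchmean")
    return ctc + cr_loss_scale * cr
```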
* minor fix
* refactor code
* update RESULTS.md
- the idea is to support a `--skip-scoring` argument passed to a decoding
  script
- created for Transducer decoding (non-streaming and streaming)
- it could also be done for CTC decoding (not done yet)
- also added `--label` for an extra label in `streaming_decode.py`
- and also added `set_caching_enabled(True)`, which has no effect on
  LibriSpeech, but leads to a faster runtime on databases with long
  recordings (assuming the `librispeech/zipformer` scripts are the
  example scripts for other setups); see the sketch after this list
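A sketch of how these pieces could fit together in a decoding script; the flag types and the scoring entry point are illustrative (icefall's actual flags may be typed differently), while `set_caching_enabled` is lhotse's:

```python
import argparse

from lhotse import set_caching_enabled

parser = argparse.ArgumentParser()
parser.add_argument(
    "--skip-scoring",
    action="store_true",
    help="Only write recognition results; skip WER scoring.",
)
parser.add_argument(
    "--label",
    type=str,
    default="",
    help="Extra label inserted into the errs/recogs/log filenames.",
)
args = parser.parse_args()

# Caches e.g. repeated audio reads. No effect on LibriSpeech (short
# recordings), but speeds up databases with long recordings whose cuts
# repeatedly open the same file.
set_caching_enabled(True)

# ... run decoding, write recogs ...
if not args.skip_scoring:
    pass  # score_results(...)  # hypothetical scoring call
```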
- some AudioTransform classes produce audio signals outside the range [-1,+1]
- Resample produced 1.0079
- the range [-10,+10] was chosen so that such signals can still be reliably
  distinguished from int16-scaled [-32k,+32k] signals (a sketch of the check
  follows)
- this is related to: https://github.com/lhotse-speech/lhotse/issues/1254
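A sketch of the kind of sanity check this tolerance enables (function name hypothetical):

```python
import numpy as np


def check_audio_range(samples: np.ndarray) -> None:
    # Normalized float audio should sit roughly in [-1, +1], but some lhotse
    # AudioTransform classes overshoot slightly (Resample produced 1.0079).
    # A limit of 10.0 tolerates the overshoot while still catching
    # int16-scaled data, whose peaks are near +/-32768.
    peak = float(np.abs(samples).max())
    if peak > 10.0:
        raise ValueError(
            f"audio peak {peak:.4f} > 10.0: looks like int16-scaled samples"
        )
```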