# HENT-SRT

This repository contains a speech-to-text translation (ST) recipe accompanying our IWSLT 2025 paper:

HENT-SRT: Hierarchical Efficient Neural Transducer with Self-Distillation for Joint Speech Recognition and Translation
Paper: https://arxiv.org/abs/2506.02157

## Datasets

The recipe combines three conversational, 3-way parallel ST corpora:

- IWSLT Tunisian Arabic (`iwslt_ta`): Tunisian Arabic speech with Arabic transcripts and English translations
- HKUST (`hkust`): Mandarin telephone speech with Mandarin transcripts and English translations
- Fisher Spanish (`fisher_sp`): Spanish telephone speech with Spanish transcripts and English translations

Data access: Fisher Spanish and HKUST require an institutional LDC subscription.
Recipe status: Lhotse recipes for Fisher Spanish and HKUST are in progress and will be finalized soon.

## Zipformer multi-joiner ST

This model is similar to the one in https://www.isca-archive.org/interspeech_2023/wang23oa_interspeech.pdf, but our system uses a Zipformer encoder with a pruned transducer loss and a stateless decoder.

| Dataset | Decoding method | Test WER | Test BLEU | Comment |
|---|---|---|---|---|
| iwslt_ta | modified beam search | 41.6 | 16.3 | --epoch 20, --avg 13, beam size 20 |
| hkust | modified beam search | 23.8 | 10.4 | --epoch 20, --avg 13, beam size 20 |
| fisher_sp | modified beam search | 18.0 | 31.0 | --epoch 20, --avg 13, beam size 20 |
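The WER columns above are standard word error rates: the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal illustrative sketch (plain Python, not part of the recipe's scoring pipeline):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[j] holds the edit distance between ref[:i] and hyp[:j].
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                                    # deletion
                dp[j - 1] + 1,                                # insertion
                prev_diag + (ref[i - 1] != hyp[j - 1]),       # substitution / match
            )
            prev_diag = cur
    return dp[-1] / len(ref)
```

For example, `wer("a b c", "a x c")` gives 1/3 (one substitution over three reference words).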

## HENT-SRT offline

| Dataset | Decoding method | Test WER | Test BLEU | Comment |
|---|---|---|---|---|
| iwslt_ta | modified beam search | 41.4 | 20.6 | --epoch 20, --avg 13, beam size 20, BP 1 |
| hkust | modified beam search | 22.8 | 14.7 | --epoch 20, --avg 13, beam size 20, BP 1 |
| fisher_sp | modified beam search | 17.8 | 33.7 | --epoch 20, --avg 13, beam size 20, BP 1 |
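All rows decode with --epoch 20 and --avg 13, i.e. the model weights are the element-wise average of the last 13 epoch checkpoints. A minimal sketch of that kind of averaging over plain parameter dicts (a stand-in for averaging real state dicts; the recipe's actual helper differs):

```python
def average_checkpoints(states):
    """Element-wise average of parameter dicts mapping name -> list of floats.

    Illustrative only: real checkpoints hold tensors, and the averaging
    helper in the training framework handles them directly.
    """
    n = len(states)
    return {
        name: [sum(vals) / n for vals in zip(*(s[name] for s in states))]
        for name in states[0]
    }
```

Averaging the final checkpoints tends to smooth out epoch-to-epoch noise in the weights before decoding.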

## HENT-SRT streaming

| Dataset | Decoding method | Test WER | Test BLEU | Comment |
|---|---|---|---|---|
| iwslt_ta | greedy search | 46.2 | 17.3 | --epoch 20, --avg 13, BP 2, chunk-size 64, left-context-frames 128, max-sym-per-frame 20 |
| hkust | greedy search | 27.3 | 11.2 | --epoch 20, --avg 13, BP 2, chunk-size 64, left-context-frames 128, max-sym-per-frame 20 |
| fisher_sp | greedy search | 22.7 | 30.8 | --epoch 20, --avg 13, BP 2, chunk-size 64, left-context-frames 128, max-sym-per-frame 20 |
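In the streaming configuration, the encoder processes the input in fixed-size chunks (chunk-size 64) and each chunk may additionally attend to a bounded window of earlier frames (left-context-frames 128). A rough sketch of how such chunk boundaries with left context can be derived, assuming both values are counted in encoder frames (illustrative only; the actual streaming attention masking is more involved):

```python
def chunk_windows(num_frames, chunk_size=64, left_context=128):
    """Return (attn_start, chunk_start, chunk_end) for each streaming chunk.

    Each chunk [chunk_start, chunk_end) may attend back to at most
    `left_context` earlier frames, i.e. to frames from `attn_start` on.
    """
    windows = []
    for start in range(0, num_frames, chunk_size):
        end = min(start + chunk_size, num_frames)
        windows.append((max(0, start - left_context), start, end))
    return windows
```

Bounding both the chunk size and the left context keeps latency and per-step compute constant regardless of utterance length.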

See RESULTS for details.