This recipe is mostly based on egs/csj, but tweaked to the point that
it can be run with the ReazonSpeech corpus.
That being said, there are some big caveats:
* Currently the model quality is not very good. Actually, it is very
bad. I trained a model on a 1000h corpus, and it resulted in >80%
CER on JSUT.
* The core issue seems to be that Zipformer is prone to treating
utterances as silent segments. It often produces an empty hypothesis
even though the audio actually contains human voice.
* This issue has already been reported upstream and is not fully
resolved as of Dec 2023.
Signed-off-by: Fujimoto Seiji <fujimoto@ceptord.net>