Karel Vesely 59c943878f
add the voxpopuli recipe (#1374)
* add the `voxpopuli` recipe

- this is the data preparation
- there is no ASR training and no results

* update the PR#1374 (feedback from @csukuangfj)

- fixing .py headers and docstrings
- removing BUT specific parts of `prepare.sh`
- adding assert `num_jobs >= num_workers` to `compute_fbank.py`
- narrowing list of languages
  (let's limit to ASR sets with transcripts for now)
- added links to `README.md`
- extending `text_from_manifest.py`
2023-11-16 14:38:31 +08:00

Readme

This recipe contains the data preparation for the VoxPopuli dataset. At the moment, it does not include model training.

audio per language

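| Language | Size | Untranscribed (hours) | Transcribed (hours) |
|----------|------|-----------------------|---------------------|
| bg | 295G | 17.6K | - |
| cs | 308G | 18.7K | 62 |
| da | 233G | 13.6K | - |
| de | 379G | 23.2K | 282 |
| el | 305G | 17.7K | - |
| en | 382G | 24.1K | 543 |
| es | 362G | 21.4K | 166 |
| et | 179G | 10.6K | 3 |
| fi | 236G | 14.2K | 27 |
| fr | 376G | 22.8K | 211 |
| hr | 132G | 8.1K | 43 |
| hu | 297G | 17.7K | 63 |
| it | 361G | 21.9K | 91 |
| lt | 243G | 14.4K | 2 |
| lv | 217G | 13.1K | - |
| mt | 147G | 9.1K | - |
| nl | 322G | 19.0K | 53 |
| pl | 348G | 21.2K | 111 |
| pt | 300G | 17.5K | - |
| ro | 296G | 17.9K | 89 |
| sk | 201G | 12.1K | 35 |
| sl | 190G | 11.3K | 10 |
| sv | 272G | 16.3K | - |
| total | 6.3T | 384K | 1791 |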
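
The commit message says the language list was narrowed to the ASR sets with transcripts. As a rough illustration of that selection, the sketch below copies the transcribed-hours column from the table and keeps only the languages that have a value. The names `TRANSCRIBED_HOURS` and `languages_with_transcripts` are hypothetical and are not taken from the recipe's scripts; the actual list used in `prepare.sh` may differ.

```python
# Transcribed hours per language, copied from the table above.
# None marks the languages listed with "-" (no transcribed audio).
# NOTE: this dictionary and the helper below are illustrative only;
# they are not part of the recipe.
TRANSCRIBED_HOURS = {
    "bg": None, "cs": 62, "da": None, "de": 282, "el": None, "en": 543,
    "es": 166, "et": 3, "fi": 27, "fr": 211, "hr": 43, "hu": 63,
    "it": 91, "lt": 2, "lv": None, "mt": None, "nl": 53, "pl": 111,
    "pt": None, "ro": 89, "sk": 35, "sl": 10, "sv": None,
}


def languages_with_transcripts(hours):
    """Return the language codes that have transcribed audio, sorted."""
    return sorted(lang for lang, h in hours.items() if h is not None)


asr_languages = languages_with_transcripts(TRANSCRIBED_HOURS)
total_hours = sum(TRANSCRIBED_HOURS[lang] for lang in asr_languages)

print(len(asr_languages), total_hours)  # prints "16 1791"
```

The sum agrees with the `total` row of the table (1791 transcribed hours), spread over 16 of the 23 languages.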