mirror of https://github.com/k2-fsa/icefall.git synced 2025-08-08 09:32:20 +00:00

add the voxpopuli recipe (#1374)

* add the `voxpopuli` recipe

- this is the data preparation
- there is no ASR training and no results

* update the PR#1374 (feedback from @csukuangfj)

- fixing .py headers and docstrings
- removing BUT specific parts of `prepare.sh`
- adding assert `num_jobs >= num_workers` to `compute_fbank.py`
- narrowing list of languages
  (let's limit to ASR sets with transcripts for now)
- added links to `README.md`
- extending `text_from_manifest.py`
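
The `num_jobs >= num_workers` check mentioned in the list above can be sketched as follows. This is a minimal illustration and not the code from `compute_fbank.py`: the helper name `check_job_counts` and the error message are hypothetical; only the two parameter names come from the commit message.

```python
def check_job_counts(num_jobs: int, num_workers: int) -> None:
    """Fail early when fewer jobs than workers are requested.

    Hypothetical helper: only the names `num_jobs` and `num_workers`
    come from the commit message; everything else is illustrative.
    """
    assert num_jobs >= num_workers, (
        f"num_jobs ({num_jobs}) must be >= num_workers ({num_workers})"
    )


# A valid combination passes silently.
check_job_counts(num_jobs=8, num_workers=4)
```

Placing such a check before any feature extraction starts means a misconfigured run stops immediately instead of partway through the data.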

2023-11-16 14:38:31 +08:00

Readme

This recipe contains the data preparation for the VoxPopuli dataset (pdf). It does not include model training yet.

audio per language

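| Language | Size | Hours untranscribed | Hours transcribed |
|----------|------|---------------------|-------------------|
| bg | 295G | 17.6K | - |
| cs | 308G | 18.7K | 62 |
| da | 233G | 13.6K | - |
| de | 379G | 23.2K | 282 |
| el | 305G | 17.7K | - |
| en | 382G | 24.1K | 543 |
| es | 362G | 21.4K | 166 |
| et | 179G | 10.6K | 3 |
| fi | 236G | 14.2K | 27 |
| fr | 376G | 22.8K | 211 |
| hr | 132G | 8.1K | 43 |
| hu | 297G | 17.7K | 63 |
| it | 361G | 21.9K | 91 |
| lt | 243G | 14.4K | 2 |
| lv | 217G | 13.1K | - |
| mt | 147G | 9.1K | - |
| nl | 322G | 19.0K | 53 |
| pl | 348G | 21.2K | 111 |
| pt | 300G | 17.5K | - |
| ro | 296G | 17.9K | 89 |
| sk | 201G | 12.1K | 35 |
| sl | 190G | 11.3K | 10 |
| sv | 272G | 16.3K | - |
| total | 6.3T | 384K | 1791 |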
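
As a quick cross-check of the table above, the following sketch lists the languages that have transcribed audio (the ASR sets with transcripts) and sums their hours. The dictionary is copied by hand from the table; the names `TRANSCRIBED_HOURS`, `languages_with_transcripts`, and `total_transcribed` are illustrative and are not part of the recipe.

```python
# Transcribed hours per language, copied from the table above.
# None marks languages listed with "-" (no transcribed audio).
TRANSCRIBED_HOURS = {
    "bg": None, "cs": 62, "da": None, "de": 282, "el": None, "en": 543,
    "es": 166, "et": 3, "fi": 27, "fr": 211, "hr": 43, "hu": 63,
    "it": 91, "lt": 2, "lv": None, "mt": None, "nl": 53, "pl": 111,
    "pt": None, "ro": 89, "sk": 35, "sl": 10, "sv": None,
}


def languages_with_transcripts(hours):
    """Return the sorted language codes that have transcribed audio."""
    return sorted(lang for lang, h in hours.items() if h is not None)


def total_transcribed(hours):
    """Sum the transcribed hours, skipping languages without transcripts."""
    return sum(h for h in hours.values() if h is not None)


asr_langs = languages_with_transcripts(TRANSCRIBED_HOURS)
print(len(asr_langs), "languages with transcripts:", " ".join(asr_langs))
print("total transcribed hours:", total_transcribed(TRANSCRIBED_HOURS))
```

Run as is, this reports 16 languages with transcripts, and the summed hours match the 1791 given in the total row.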
