Archived

This repository has been archived on 2026-03-23. You can view files and clone it, but cannot push or open issues or pull requests.

Karel Vesely 59c943878f

add the voxpopuli recipe (#1374)

* add the `voxpopuli` recipe

- this is the data preparation
- there is no ASR training and no results

* update the PR#1374 (feedback from @csukuangfj)

- fixing .py headers and docstrings
- removing BUT specific parts of `prepare.sh`
- adding assert `num_jobs >= num_workers` to `compute_fbank.py`
- narrowing list of languages
  (let's limit to ASR sets with transcripts for now)
- added links to `README.md`
- extending `text_from_manifest.py`

2023-11-16 14:38:31 +08:00

local

prepare.sh

README.md

shared

README.md

Readme

This recipe contains data preparation for the VoxPopuli dataset (pdf). At the moment, it does not include ASR model training or results.

audio per language

Language	Size	Untranscribed (hours)	Transcribed (hours)
bg	295G	17.6K	-
cs	308G	18.7K	62
da	233G	13.6K	-
de	379G	23.2K	282
el	305G	17.7K	-
en	382G	24.1K	543
es	362G	21.4K	166
et	179G	10.6K	3
fi	236G	14.2K	27
fr	376G	22.8K	211
hr	132G	8.1K	43
hu	297G	17.7K	63
it	361G	21.9K	91
lt	243G	14.4K	2
lv	217G	13.1K	-
mt	147G	9.1K	-
nl	322G	19.0K	53
pl	348G	21.2K	111
pt	300G	17.5K	-
ro	296G	17.9K	89
sk	201G	12.1K	35
sl	190G	11.3K	10
sv	272G	16.3K	-

total	6.3T	384K	1791
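
The last column of the table shows which languages have transcribed audio. As a minimal sketch, the snippet below picks out those languages and checks their sum against the `total` row. The names `TRANSCRIBED_HOURS` and `languages_with_transcripts` are hypothetical and not part of the recipe, and the list actually used by `prepare.sh` may be narrower.

```python
# Hours of transcribed audio per language, copied from the table above.
# None stands for the "-" entries (no transcribed hours listed).
# NOTE: this dictionary and the helper below are illustrative only;
# they are not taken from the recipe's scripts.
TRANSCRIBED_HOURS = {
    "bg": None, "cs": 62, "da": None, "de": 282, "el": None, "en": 543,
    "es": 166, "et": 3, "fi": 27, "fr": 211, "hr": 43, "hu": 63,
    "it": 91, "lt": 2, "lv": None, "mt": None, "nl": 53, "pl": 111,
    "pt": None, "ro": 89, "sk": 35, "sl": 10, "sv": None,
}


def languages_with_transcripts(hours):
    """Return the sorted language codes that have transcribed audio."""
    return sorted(lang for lang, h in hours.items() if h is not None)


asr_languages = languages_with_transcripts(TRANSCRIBED_HOURS)
total_hours = sum(TRANSCRIBED_HOURS[lang] for lang in asr_languages)

print(f"{len(asr_languages)} languages with transcripts: {' '.join(asr_languages)}")
print(f"total transcribed hours: {total_hours}")
```

The computed total matches the 1791 hours in the `total` row, which is a quick consistency check on the per-language figures.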