# Readme

This recipe contains data preparation for the
[VoxPopuli](https://github.com/facebookresearch/voxpopuli) dataset
[(pdf)](https://aclanthology.org/2021.acl-long.80.pdf).
At the moment, it does not include model training.

## Audio per language

| language | Size   | Hrs. untranscribed | Hrs. transcribed |
|----------|--------|--------------------|------------------|
| bg       | 295G   | 17.6K              | -                |
| cs       | 308G   | 18.7K              | 62               |
| da       | 233G   | 13.6K              | -                |
| de       | 379G   | 23.2K              | 282              |
| el       | 305G   | 17.7K              | -                |
| en       | 382G   | 24.1K              | 543              |
| es       | 362G   | 21.4K              | 166              |
| et       | 179G   | 10.6K              | 3                |
| fi       | 236G   | 14.2K              | 27               |
| fr       | 376G   | 22.8K              | 211              |
| hr       | 132G   | 8.1K               | 43               |
| hu       | 297G   | 17.7K              | 63               |
| it       | 361G   | 21.9K              | 91               |
| lt       | 243G   | 14.4K              | 2                |
| lv       | 217G   | 13.1K              | -                |
| mt       | 147G   | 9.1K               | -                |
| nl       | 322G   | 19.0K              | 53               |
| pl       | 348G   | 21.2K              | 111              |
| pt       | 300G   | 17.5K              | -                |
| ro       | 296G   | 17.9K              | 89               |
| sk       | 201G   | 12.1K              | 35               |
| sl       | 190G   | 11.3K              | 10               |
| sv       | 272G   | 16.3K              | -                |
|          |        |                    |                  |
| total    | 6.3T   | 384K               | 1791             |
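
As a quick sanity check of the table above, the following sketch lists the languages that have transcribed audio and sums their hours. The dictionary is copied by hand from the table, and the names `TRANSCRIBED_HOURS`, `asr_languages` and `total_transcribed` are illustrative only; they are not part of the recipe's scripts.

```python
# Transcribed hours per language, copied by hand from the table above.
# None marks languages listed with "-" (no transcribed audio).
TRANSCRIBED_HOURS = {
    "bg": None, "cs": 62, "da": None, "de": 282, "el": None, "en": 543,
    "es": 166, "et": 3, "fi": 27, "fr": 211, "hr": 43, "hu": 63,
    "it": 91, "lt": 2, "lv": None, "mt": None, "nl": 53, "pl": 111,
    "pt": None, "ro": 89, "sk": 35, "sl": 10, "sv": None,
}

# Languages that have at least some transcribed audio.
asr_languages = sorted(
    lang for lang, hrs in TRANSCRIBED_HOURS.items() if hrs is not None
)

# Sum of the "Hrs. transcribed" column, skipping the "-" entries.
total_transcribed = sum(
    hrs for hrs in TRANSCRIBED_HOURS.values() if hrs is not None
)

print(len(asr_languages), "languages with transcripts:", " ".join(asr_languages))
print("total transcribed hours:", total_transcribed)
```

The sum agrees with the `total` row of the table (1791 hours), spread over 16 of the 23 languages.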