.. _dummies_tutorial_data_preparation:

Data Preparation
================

After :ref:`dummies_tutorial_environment_setup`, we can start preparing the
data for training and decoding.

The first step is to prepare the data for training. We have already provided
`prepare.sh `_
that prepares everything required for training.

.. code-block:: bash

   cd /tmp/icefall
   export PYTHONPATH=/tmp/icefall:$PYTHONPATH
   cd egs/yesno/ASR

   ./prepare.sh

Note that each recipe in `icefall`_ contains a file ``prepare.sh``,
which you should run before you run anything else.

That is all you need for data preparation.

For the more curious
--------------------

If you are wondering how to prepare your own dataset, please refer to the
following URLs for more details:

  - ``_

    It contains recipes for a variety of datasets. If you want to add your own
    dataset, please read the recipes in this folder first.

  - ``_

    The `yesno`_ recipe in `lhotse`_.

If you already have a `Kaldi`_ dataset directory, which contains files like
``wav.scp`` and ``feats.scp``, then you can refer to
``_.

A quick look at the generated files
-----------------------------------

``./prepare.sh`` puts generated files into two directories:

  - ``download``
  - ``data``

download
^^^^^^^^

The ``download`` directory contains downloaded dataset files:

.. code-block:: bash

   tree -L 1 ./download/

   ./download/
   |-- waves_yesno
   `-- waves_yesno.tar.gz

.. hint::

   Please refer to
   ``_
   for how the data is downloaded and extracted.

data
^^^^

.. code-block:: bash

   tree ./data/

   ./data/
   |-- fbank
   |   |-- yesno_cuts_test.jsonl.gz
   |   |-- yesno_cuts_train.jsonl.gz
   |   |-- yesno_feats_test.lca
   |   `-- yesno_feats_train.lca
   |-- lang_phone
   |   |-- HLG.pt
   |   |-- L.pt
   |   |-- L_disambig.pt
   |   |-- Linv.pt
   |   |-- lexicon.txt
   |   |-- lexicon_disambig.txt
   |   |-- tokens.txt
   |   `-- words.txt
   |-- lm
   |   |-- G.arpa
   |   `-- G.fst.txt
   `-- manifests
       |-- yesno_recordings_test.jsonl.gz
       |-- yesno_recordings_train.jsonl.gz
       |-- yesno_supervisions_test.jsonl.gz
       `-- yesno_supervisions_train.jsonl.gz

   4 directories, 18 files

**data/manifests**:

  This directory contains manifests. They are used to generate the files in
  ``data/fbank``.

  To give you an idea of what it contains, we examine the first few lines of
  the manifests related to the ``train`` dataset.

  .. code-block:: bash

     cd data/manifests
     gunzip -c yesno_recordings_train.jsonl.gz | head -n 3

  The output is given below:

  .. code-block:: bash

     {"id": "0_0_0_0_1_1_1_1", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_0_0_1_1_1_1.wav"}], "sampling_rate": 8000, "num_samples": 50800, "duration": 6.35, "channel_ids": [0]}
     {"id": "0_0_0_1_0_1_1_0", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_0_1_0_1_1_0.wav"}], "sampling_rate": 8000, "num_samples": 48880, "duration": 6.11, "channel_ids": [0]}
     {"id": "0_0_1_0_0_1_1_0", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_1_0_0_1_1_0.wav"}], "sampling_rate": 8000, "num_samples": 48160, "duration": 6.02, "channel_ids": [0]}

  Please refer to
  ``_
  for the meaning of each field per line.

  .. code-block:: bash

     gunzip -c yesno_supervisions_train.jsonl.gz | head -n 3

  The output is given below:

  .. code-block:: bash

     {"id": "0_0_0_0_1_1_1_1", "recording_id": "0_0_0_0_1_1_1_1", "start": 0.0, "duration": 6.35, "channel": 0, "text": "NO NO NO NO YES YES YES YES", "language": "Hebrew"}
     {"id": "0_0_0_1_0_1_1_0", "recording_id": "0_0_0_1_0_1_1_0", "start": 0.0, "duration": 6.11, "channel": 0, "text": "NO NO NO YES NO YES YES NO", "language": "Hebrew"}
     {"id": "0_0_1_0_0_1_1_0", "recording_id": "0_0_1_0_0_1_1_0", "start": 0.0, "duration": 6.02, "channel": 0, "text": "NO NO YES NO NO YES YES NO", "language": "Hebrew"}

  Please refer to
  ``_
  for the meaning of each field per line.

**data/fbank**:

  This directory contains everything from ``data/manifests``. In addition,
  it contains the features for training.

  ``data/fbank/yesno_feats_train.lca`` contains the features for the train
  dataset. Features are compressed using `lilcom`_.

  ``data/fbank/yesno_cuts_train.jsonl.gz`` stores the
  `CutSet `_,
  which stores
  `RecordingSet `_,
  `SupervisionSet `_,
  and `FeatureSet `_.

  To give you an idea of what it looks like, we can run the following command:

  .. code-block:: bash

     cd data/fbank

     gunzip -c yesno_cuts_train.jsonl.gz | head -n 3

  The output is given below:

  .. code-block:: bash

     {"id": "0_0_0_0_1_1_1_1-0", "start": 0, "duration": 6.35, "channel": 0, "supervisions": [{"id": "0_0_0_0_1_1_1_1", "recording_id": "0_0_0_0_1_1_1_1", "start": 0.0, "duration": 6.35, "channel": 0, "text": "NO NO NO NO YES YES YES YES", "language": "Hebrew"}], "features": {"type": "kaldi-fbank", "num_frames": 635, "num_features": 23, "frame_shift": 0.01, "sampling_rate": 8000, "start": 0, "duration": 6.35, "storage_type": "lilcom_chunky", "storage_path": "data/fbank/yesno_feats_train.lca", "storage_key": "0,13000,3570", "channels": 0}, "recording": {"id": "0_0_0_0_1_1_1_1", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_0_0_1_1_1_1.wav"}], "sampling_rate": 8000, "num_samples": 50800, "duration": 6.35, "channel_ids": [0]}, "type": "MonoCut"}
     {"id": "0_0_0_1_0_1_1_0-1", "start": 0, "duration": 6.11, "channel": 0, "supervisions": [{"id": "0_0_0_1_0_1_1_0", "recording_id": "0_0_0_1_0_1_1_0", "start": 0.0, "duration": 6.11, "channel": 0, "text": "NO NO NO YES NO YES YES NO", "language": "Hebrew"}], "features": {"type": "kaldi-fbank", "num_frames": 611, "num_features": 23, "frame_shift": 0.01, "sampling_rate": 8000, "start": 0, "duration": 6.11, "storage_type": "lilcom_chunky", "storage_path": "data/fbank/yesno_feats_train.lca", "storage_key": "16570,12964,2929", "channels": 0}, "recording": {"id": "0_0_0_1_0_1_1_0", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_0_1_0_1_1_0.wav"}], "sampling_rate": 8000, "num_samples": 48880, "duration": 6.11, "channel_ids": [0]}, "type": "MonoCut"}
     {"id": "0_0_1_0_0_1_1_0-2", "start": 0, "duration": 6.02, "channel": 0, "supervisions": [{"id": "0_0_1_0_0_1_1_0", "recording_id": "0_0_1_0_0_1_1_0", "start": 0.0, "duration": 6.02, "channel": 0, "text": "NO NO YES NO NO YES YES NO", "language": "Hebrew"}], "features": {"type": "kaldi-fbank", "num_frames": 602, "num_features": 23, "frame_shift": 0.01, "sampling_rate": 8000, "start": 0, "duration": 6.02, "storage_type": "lilcom_chunky", "storage_path": "data/fbank/yesno_feats_train.lca", "storage_key": "32463,12936,2696", "channels": 0}, "recording": {"id": "0_0_1_0_0_1_1_0", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_1_0_0_1_1_0.wav"}], "sampling_rate": 8000, "num_samples": 48160, "duration": 6.02, "channel_ids": [0]}, "type": "MonoCut"}

  Note that ``yesno_cuts_train.jsonl.gz`` only stores the information about
  how to read the features. The actual features are stored separately in
  ``data/fbank/yesno_feats_train.lca``.

**data/lang_phone**:

  This directory contains the lexicon.

**data/lm**:

  This directory contains language models.