.. _dummies_tutorial_data_preparation:

Data Preparation
================

After :ref:`dummies_tutorial_environment_setup`, we can start preparing the
data for training and decoding.

The first step is to prepare the data for training. We provide
`prepare.sh <https://github.com/k2-fsa/icefall/blob/master/egs/yesno/ASR/prepare.sh>`_,
which prepares everything required for training.

.. code-block:: bash

   cd /tmp/icefall
   export PYTHONPATH=/tmp/icefall:$PYTHONPATH
   cd egs/yesno/ASR

   ./prepare.sh

Note that each recipe in `icefall`_ has a ``prepare.sh`` script,
which you should run before anything else.

That is all you need for data preparation.

For the more curious
--------------------

If you are wondering how to prepare your own dataset, please refer to the following
URLs for more details:

  - `<https://github.com/lhotse-speech/lhotse/tree/master/lhotse/recipes>`_

    It contains recipes for a variety of datasets. If you want to add your own
    dataset, please read the recipes in this folder first.

  - `<https://github.com/lhotse-speech/lhotse/blob/master/lhotse/recipes/yesno.py>`_

    The `yesno`_ recipe in `lhotse`_.

If you already have a `Kaldi`_ dataset directory containing files such as
``wav.scp`` and ``feats.scp``, you can refer to `<https://lhotse.readthedocs.io/en/latest/kaldi.html#example>`_.

A quick look at the generated files
-----------------------------------

``./prepare.sh`` puts generated files into two directories:

  - ``download``
  - ``data``

download
^^^^^^^^

The ``download`` directory contains downloaded dataset files:

.. code-block:: bash

    tree -L 1 ./download/

    ./download/
    |-- waves_yesno
    `-- waves_yesno.tar.gz

.. hint::

   Please refer to `<https://github.com/lhotse-speech/lhotse/blob/master/lhotse/recipes/yesno.py#L41>`_
   for how the data is downloaded and extracted.

data
^^^^

.. code-block:: bash

    tree ./data/

    ./data/
    |-- fbank
    |   |-- yesno_cuts_test.jsonl.gz
    |   |-- yesno_cuts_train.jsonl.gz
    |   |-- yesno_feats_test.lca
    |   `-- yesno_feats_train.lca
    |-- lang_phone
    |   |-- HLG.pt
    |   |-- L.pt
    |   |-- L_disambig.pt
    |   |-- Linv.pt
    |   |-- lexicon.txt
    |   |-- lexicon_disambig.txt
    |   |-- tokens.txt
    |   `-- words.txt
    |-- lm
    |   |-- G.arpa
    |   `-- G.fst.txt
    `-- manifests
        |-- yesno_recordings_test.jsonl.gz
        |-- yesno_recordings_train.jsonl.gz
        |-- yesno_supervisions_test.jsonl.gz
        `-- yesno_supervisions_train.jsonl.gz

    4 directories, 18 files

**data/manifests**:

  This directory contains manifests. They are used to generate files in
  ``data/fbank``.

  To give you an idea of what it contains, we examine the first few lines of
  the manifests related to the ``train`` dataset.

  .. code-block:: bash

      cd data/manifests
      gunzip -c  yesno_recordings_train.jsonl.gz  | head -n 3

  The output is given below:

  .. code-block:: bash

      {"id": "0_0_0_0_1_1_1_1", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_0_0_1_1_1_1.wav"}], "sampling_rate": 8000, "num_samples": 50800, "duration": 6.35, "channel_ids": [0]}
      {"id": "0_0_0_1_0_1_1_0", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_0_1_0_1_1_0.wav"}], "sampling_rate": 8000, "num_samples": 48880, "duration": 6.11, "channel_ids": [0]}
      {"id": "0_0_1_0_0_1_1_0", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_1_0_0_1_1_0.wav"}], "sampling_rate": 8000, "num_samples": 48160, "duration": 6.02, "channel_ids": [0]}

  Please refer to `<https://github.com/lhotse-speech/lhotse/blob/master/lhotse/audio.py#L300>`_
  for the meaning of the fields in each line.
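
  If you prefer to inspect a manifest from Python rather than with ``gunzip``,
  the following is a minimal sketch that uses only the standard library. The
  helper ``read_jsonl_gz`` is a hypothetical name introduced here, not part of
  `lhotse`_, and the temporary file stands in for
  ``yesno_recordings_train.jsonl.gz`` with a shortened copy of the first record
  shown above. It also checks that ``duration`` equals ``num_samples`` divided
  by ``sampling_rate``, which holds for the records printed above.

  .. code-block:: python

      import gzip
      import json
      import tempfile
      from pathlib import Path


      def read_jsonl_gz(path):
          """Yield one dict per non-empty line of a gzip-compressed JSON Lines file."""
          with gzip.open(path, "rt", encoding="utf-8") as f:
              for line in f:
                  line = line.strip()
                  if line:
                      yield json.loads(line)


      # Stand-in for data/manifests/yesno_recordings_train.jsonl.gz:
      # a shortened copy of the first record shown above.
      sample = {
          "id": "0_0_0_0_1_1_1_1",
          "sampling_rate": 8000,
          "num_samples": 50800,
          "duration": 6.35,
          "channel_ids": [0],
      }

      with tempfile.TemporaryDirectory() as tmp:
          path = Path(tmp) / "recordings.jsonl.gz"
          with gzip.open(path, "wt", encoding="utf-8") as f:
              f.write(json.dumps(sample) + "\n")
          recordings = list(read_jsonl_gz(path))

      for rec in recordings:
          # Duration in seconds = number of samples / sampling rate.
          seconds = rec["num_samples"] / rec["sampling_rate"]
          assert abs(seconds - rec["duration"]) < 1e-6
          print(rec["id"], rec["duration"])

  To run it on the real manifest, pass the path of
  ``yesno_recordings_train.jsonl.gz`` to ``read_jsonl_gz`` instead of the
  temporary file.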

  .. code-block:: bash

      gunzip -c  yesno_supervisions_train.jsonl.gz  | head -n 3

  The output is given below:

  .. code-block:: bash

      {"id": "0_0_0_0_1_1_1_1", "recording_id": "0_0_0_0_1_1_1_1", "start": 0.0, "duration": 6.35, "channel": 0, "text": "NO NO NO NO YES YES YES YES", "language": "Hebrew"}
      {"id": "0_0_0_1_0_1_1_0", "recording_id": "0_0_0_1_0_1_1_0", "start": 0.0, "duration": 6.11, "channel": 0, "text": "NO NO NO YES NO YES YES NO", "language": "Hebrew"}
      {"id": "0_0_1_0_0_1_1_0", "recording_id": "0_0_1_0_0_1_1_0", "start": 0.0, "duration": 6.02, "channel": 0, "text": "NO NO YES NO NO YES YES NO", "language": "Hebrew"}

  Please refer to `<https://github.com/lhotse-speech/lhotse/blob/master/lhotse/supervision.py#L510>`_
  for the meaning of the fields in each line.
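
  In the three supervisions shown above, the ``text`` field can be read off the
  ``id``: each ``0`` corresponds to ``NO`` and each ``1`` to ``YES``. The sketch
  below illustrates that mapping. It is inferred only from the lines printed
  above, and ``transcript_from_id`` is a hypothetical helper, not a function
  from `lhotse`_ or `icefall`_.

  .. code-block:: python

      # Mapping inferred from the sample output above: 0 -> NO, 1 -> YES.
      LABELS = {"0": "NO", "1": "YES"}


      def transcript_from_id(utt_id):
          """Turn an id such as '0_0_1_0_0_1_1_0' into its transcript."""
          return " ".join(LABELS[digit] for digit in utt_id.split("_"))


      # The (id, text) pairs copied from the supervisions shown above.
      examples = {
          "0_0_0_0_1_1_1_1": "NO NO NO NO YES YES YES YES",
          "0_0_0_1_0_1_1_0": "NO NO NO YES NO YES YES NO",
          "0_0_1_0_0_1_1_0": "NO NO YES NO NO YES YES NO",
      }

      for utt_id, text in examples.items():
          assert transcript_from_id(utt_id) == text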

**data/fbank**:

  This directory contains all the information from ``data/manifests``,
  together with the features used for training.

  ``data/fbank/yesno_feats_train.lca`` contains the features for the train dataset.
  Features are compressed using `lilcom`_.

  ``data/fbank/yesno_cuts_train.jsonl.gz`` stores the `CutSet <https://github.com/lhotse-speech/lhotse/blob/master/lhotse/cut/set.py#L72>`_,
  which combines information from the `RecordingSet <https://github.com/lhotse-speech/lhotse/blob/master/lhotse/audio.py#L928>`_,
  `SupervisionSet <https://github.com/lhotse-speech/lhotse/blob/master/lhotse/supervision.py#L510>`_,
  and `FeatureSet <https://github.com/lhotse-speech/lhotse/blob/master/lhotse/features/base.py#L593>`_.

  To give you an idea about what it looks like, we can run the following command:

  .. code-block:: bash

        cd data/fbank

        gunzip -c yesno_cuts_train.jsonl.gz | head -n 3

  The output is given below:

  .. code-block:: bash

      {"id": "0_0_0_0_1_1_1_1-0", "start": 0, "duration": 6.35, "channel": 0, "supervisions": [{"id": "0_0_0_0_1_1_1_1", "recording_id": "0_0_0_0_1_1_1_1", "start": 0.0, "duration": 6.35, "channel": 0, "text": "NO NO NO NO YES YES YES YES", "language": "Hebrew"}], "features": {"type": "kaldi-fbank", "num_frames": 635, "num_features": 23, "frame_shift": 0.01, "sampling_rate": 8000, "start": 0, "duration": 6.35, "storage_type": "lilcom_chunky", "storage_path": "data/fbank/yesno_feats_train.lca", "storage_key": "0,13000,3570", "channels": 0}, "recording": {"id": "0_0_0_0_1_1_1_1", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_0_0_1_1_1_1.wav"}], "sampling_rate": 8000, "num_samples": 50800, "duration": 6.35, "channel_ids": [0]}, "type": "MonoCut"}
      {"id": "0_0_0_1_0_1_1_0-1", "start": 0, "duration": 6.11, "channel": 0, "supervisions": [{"id": "0_0_0_1_0_1_1_0", "recording_id": "0_0_0_1_0_1_1_0", "start": 0.0, "duration": 6.11, "channel": 0, "text": "NO NO NO YES NO YES YES NO", "language": "Hebrew"}], "features": {"type": "kaldi-fbank", "num_frames": 611, "num_features": 23, "frame_shift": 0.01, "sampling_rate": 8000, "start": 0, "duration": 6.11, "storage_type": "lilcom_chunky", "storage_path": "data/fbank/yesno_feats_train.lca", "storage_key": "16570,12964,2929", "channels": 0}, "recording": {"id": "0_0_0_1_0_1_1_0", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_0_1_0_1_1_0.wav"}], "sampling_rate": 8000, "num_samples": 48880, "duration": 6.11, "channel_ids": [0]}, "type": "MonoCut"}
      {"id": "0_0_1_0_0_1_1_0-2", "start": 0, "duration": 6.02, "channel": 0, "supervisions": [{"id": "0_0_1_0_0_1_1_0", "recording_id": "0_0_1_0_0_1_1_0", "start": 0.0, "duration": 6.02, "channel": 0, "text": "NO NO YES NO NO YES YES NO", "language": "Hebrew"}], "features": {"type": "kaldi-fbank", "num_frames": 602, "num_features": 23, "frame_shift": 0.01, "sampling_rate": 8000, "start": 0, "duration": 6.02, "storage_type": "lilcom_chunky", "storage_path": "data/fbank/yesno_feats_train.lca", "storage_key": "32463,12936,2696", "channels": 0}, "recording": {"id": "0_0_1_0_0_1_1_0", "sources": [{"type": "file", "channels": [0], "source": "/tmp/icefall/egs/yesno/ASR/download/waves_yesno/0_0_1_0_0_1_1_0.wav"}], "sampling_rate": 8000, "num_samples": 48160, "duration": 6.02, "channel_ids": [0]}, "type": "MonoCut"}

  Note that ``yesno_cuts_train.jsonl.gz`` only stores the information about how to read the features.
  The actual features are stored separately in ``data/fbank/yesno_feats_train.lca``.
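
  To make the ``features`` entry of a cut more concrete, the sketch below
  parses an abbreviated copy of the first cut shown above, using only the
  standard library. It checks that ``num_frames`` matches ``duration`` divided
  by ``frame_shift`` and derives the expected shape of the feature matrix. The
  rounding used here is an assumption that happens to match the three cuts
  printed above. It is not taken from the `lhotse`_ source.

  .. code-block:: python

      import json

      # Abbreviated copy of the first cut shown above (only the fields used here).
      cut_line = (
          '{"id": "0_0_0_0_1_1_1_1-0", "duration": 6.35, '
          '"features": {"type": "kaldi-fbank", "num_frames": 635, '
          '"num_features": 23, "frame_shift": 0.01, '
          '"storage_path": "data/fbank/yesno_feats_train.lca"}}'
      )

      cut = json.loads(cut_line)
      feats = cut["features"]

      # Assumption: one frame per frame_shift seconds, rounded to an integer.
      expected_frames = round(cut["duration"] / feats["frame_shift"])
      assert expected_frames == feats["num_frames"]

      # The manifest only describes the features; the values themselves
      # live in the file named by storage_path.
      shape = (feats["num_frames"], feats["num_features"])
      print(feats["storage_path"], shape)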

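**data/lang_phone**:

  This directory contains the lexicon (``lexicon.txt``), the symbol tables
  ``tokens.txt`` and ``words.txt``, and the graphs built from them, such as
  ``L.pt`` and ``HLG.pt``.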

**data/lm**:

  This directory contains the language model: ``G.arpa`` in ARPA format and
  ``G.fst.txt``, the same model as an FST in text form.