Usage
=====

Let us first see the help message of kaldi's ``compute-fbank-feats``:

.. code-block:: bash

  $ compute-fbank-feats

  Create Mel-filter bank (FBANK) feature files.
  Usage: compute-fbank-feats [options...] <wav-rspecifier> <feats-wspecifier>

  Options:
    --allow-downsample : If true, allow the input waveform to have a higher frequency than the specified --sample-frequency (and we'll downsample). (bool, default = false)
    --allow-upsample : If true, allow the input waveform to have a lower frequency than the specified --sample-frequency (and we'll upsample). (bool, default = false)
    --blackman-coeff : Constant coefficient for generalized Blackman window. (float, default = 0.42)
    --channel : Channel to extract (-1 -> expect mono, 0 -> left, 1 -> right) (int, default = -1)
    --debug-mel : Print out debugging information for mel bin computation (bool, default = false)
    --dither : Dithering constant (0.0 means no dither). If you turn this off, you should set the --energy-floor option, e.g. to 1.0 or 0.1 (float, default = 1)
    --energy-floor : Floor on energy (absolute, not relative) in FBANK computation. Only makes a difference if --use-energy=true; only necessary if --dither=0.0. Suggested values: 0.1 or 1.0 (float, default = 0)
    --frame-length : Frame length in milliseconds (float, default = 25)
    --frame-shift : Frame shift in milliseconds (float, default = 10)
    --high-freq : High cutoff frequency for mel bins (if <= 0, offset from Nyquist) (float, default = 0)
    --htk-compat : If true, put energy last. Warning: not sufficient to get HTK compatible features (need to change other parameters). (bool, default = false)
    --low-freq : Low cutoff frequency for mel bins (float, default = 20)
    --max-feature-vectors : Memory optimization. If larger than 0, periodically remove feature vectors so that only this number of the latest feature vectors is retained. (int, default = -1)
    --min-duration : Minimum duration of segments to process (in seconds). (float, default = 0)
    --num-mel-bins : Number of triangular mel-frequency bins (int, default = 23)
    --output-format : Format of the output files [kaldi, htk] (string, default = "kaldi")
    --preemphasis-coefficient : Coefficient for use in signal preemphasis (float, default = 0.97)
    --raw-energy : If true, compute energy before preemphasis and windowing (bool, default = true)
    --remove-dc-offset : Subtract mean from waveform on each frame (bool, default = true)
    --round-to-power-of-two : If true, round window size to power of two by zero-padding input to FFT. (bool, default = true)
    --sample-frequency : Waveform data sample frequency (must match the waveform file, if specified there) (float, default = 16000)
    --snip-edges : If true, end effects will be handled by outputting only frames that completely fit in the file, and the number of frames depends on the frame-length. If false, the number of frames depends only on the frame-shift, and we reflect the data at the ends. (bool, default = true)
    --subtract-mean : Subtract mean of each feature file [CMS]; not recommended to do it this way. (bool, default = false)
    --use-energy : Add an extra dimension with energy to the FBANK output. (bool, default = false)
    --use-log-fbank : If true, produce log-filterbank, else produce linear. (bool, default = true)
    --use-power : If true, use power, else use magnitude. (bool, default = true)
    --utt2spk : Utterance to speaker-id map (if doing VTLN and you have warps per speaker) (string, default = "")
    --vtln-high : High inflection point in piecewise linear VTLN warping function (if negative, offset from high-mel-freq (float, default = -500)
    --vtln-low : Low inflection point in piecewise linear VTLN warping function (float, default = 100)
    --vtln-map : Map from utterance or speaker-id to vtln warp factor (rspecifier) (string, default = "")
    --vtln-warp : Vtln warp factor (only applicable if vtln-map not specified) (float, default = 1)
    --window-type : Type of window ("hamming"|"hanning"|"povey"|"rectangular"|"sine"|"blackmann") (string, default = "povey")
    --write-utt2dur : Wspecifier to write duration of each utterance in seconds, e.g. 'ark,t:utt2dur'. (string, default = "")

  Standard options:
    --config : Configuration file to read (this option may be repeated) (string, default = "")
    --help : Print out usage message (bool, default = false)
    --print-args : Print the command line arguments (to stderr) (bool, default = true)
    --verbose : Verbose level (higher->more logging) (int, default = 0)

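
The help message lists ``povey`` as the default ``--window-type``. As a rough
illustration only (a standalone sketch, not code taken from ``kaldifeat``), the
Povey window is usually described as a Hann window raised to the power 0.85.
The function name ``povey_window`` is made up for this example:

.. code-block:: python

  import math


  def povey_window(num_samples):
      """Return a Hann window raised to the power 0.85 (hypothetical helper)."""
      a = 2.0 * math.pi / (num_samples - 1)
      return [(0.5 - 0.5 * math.cos(a * i)) ** 0.85 for i in range(num_samples)]


  # A 25 ms frame at 16 kHz contains 400 samples.
  window = povey_window(400)
  print(len(window), window[0], max(window))

Unlike a Hamming window, this window goes to zero at both ends of the frame.
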
FbankOptions
------------

``kaldifeat`` reuses the options of kaldi's ``compute-fbank-feats``.

The following shows the default values of ``kaldifeat.FbankOptions``:

.. code-block:: python

  >>> import kaldifeat
  >>> fbank_opts = kaldifeat.FbankOptions()
  >>> print(fbank_opts)
  frame_opts:
  samp_freq: 16000
  frame_shift_ms: 10
  frame_length_ms: 25
  dither: 1
  preemph_coeff: 0.97
  remove_dc_offset: 1
  window_type: povey
  round_to_power_of_two: 1
  blackman_coeff: 0.42
  snip_edges: 1


  mel_opts:
  num_bins: 23
  low_freq: 20
  high_freq: 0
  vtln_low: 100
  vtln_high: -500
  debug_mel: 0
  htk_mode: 0

  use_energy: 0
  energy_floor: 0
  raw_energy: 1
  htk_compat: 0
  use_log_fbank: 1
  use_power: 1
  device: cpu

It consists of three parts:

- ``frame_opts``

  Options in this part are accessed by ``frame_opts.xxx``. For example, to access
  the sample rate, you use:

  .. code-block:: python

    >>> fbank_opts = kaldifeat.FbankOptions()
    >>> print(fbank_opts.frame_opts.samp_freq)
    16000.0

- ``mel_opts``

  Options in this part are accessed by ``mel_opts.xxx``. For example, to access
  the number of mel bins, you use:

  .. code-block:: python

    >>> fbank_opts = kaldifeat.FbankOptions()
    >>> print(fbank_opts.mel_opts.num_bins)
    23

- fbank related

  Options in this part are accessed directly. For example, to access the device
  field, you use:

  .. code-block:: python

    >>> print(fbank_opts.device)
    cpu
    >>> fbank_opts.device = 'cuda:0'
    >>> print(fbank_opts.device)
    cuda:0
    >>> import torch
    >>> fbank_opts.device = torch.device('cuda', 0)
    >>> print(fbank_opts.device)
    cuda:0

To change the sample rate to 8000, you can use:

.. code-block:: python

  >>> fbank_opts = kaldifeat.FbankOptions()
  >>> print(fbank_opts.frame_opts.samp_freq)
  16000.0
  >>> fbank_opts.frame_opts.samp_freq = 8000
  >>> print(fbank_opts.frame_opts.samp_freq)
  8000.0

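
Changing the sample rate also changes how many samples each frame covers,
because ``frame_length_ms`` and ``frame_shift_ms`` are given in milliseconds.
The following standalone sketch works through the arithmetic. The helper
``frame_sizes`` is hypothetical and not part of ``kaldifeat``; the padding step
assumes ``round_to_power_of_two`` is enabled, as it is by default:

.. code-block:: python

  def frame_sizes(samp_freq, frame_length_ms=25.0, frame_shift_ms=10.0):
      """Return (window size, window shift, padded window size) in samples."""
      window_size = int(samp_freq * frame_length_ms / 1000)
      window_shift = int(samp_freq * frame_shift_ms / 1000)
      # Round the window size up to the next power of two for the FFT.
      padded_size = 1
      while padded_size < window_size:
          padded_size *= 2
      return window_size, window_shift, padded_size


  print(frame_sizes(16000))  # (400, 160, 512)
  print(frame_sizes(8000))   # (200, 80, 256)
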
To change ``snip_edges`` to ``False``, you can use:

.. code-block:: python

  >>> fbank_opts.frame_opts.snip_edges = False
  >>> print(fbank_opts.frame_opts.snip_edges)
  False

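
The effect of ``snip_edges`` on the number of output frames can be estimated
with a few lines of arithmetic. This is a minimal sketch that assumes the
frame-counting rule described in the help message of ``compute-fbank-feats``;
the helper ``num_frames`` is hypothetical and not part of ``kaldifeat``:

.. code-block:: python

  def num_frames(num_samples, window_size=400, window_shift=160, snip_edges=True):
      """Estimate the number of frames for a waveform with num_samples samples."""
      if snip_edges:
          # Only frames that fit completely inside the waveform are kept.
          if num_samples < window_size:
              return 0
          return 1 + (num_samples - window_size) // window_shift
      # Otherwise the count depends only on the frame shift.
      return (num_samples + window_shift // 2) // window_shift


  # 1.2 seconds of audio at 16 kHz is 19200 samples.
  print(num_frames(19200, snip_edges=True))   # 118
  print(num_frames(19200, snip_edges=False))  # 120

The defaults correspond to a 25 ms frame and a 10 ms shift at 16 kHz.
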
To change the number of mel bins to 80, you can use:

.. code-block:: python

  >>> print(fbank_opts.mel_opts.num_bins)
  23
  >>> fbank_opts.mel_opts.num_bins = 80
  >>> print(fbank_opts.mel_opts.num_bins)
  80

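
To get a feeling for what ``num_bins``, ``low_freq`` and ``high_freq`` control,
the sketch below places the centers of the triangular filters evenly on the
mel scale. It is an illustration only and assumes the common mel formula
``1127 * ln(1 + f / 700)``; the function names are made up for this example
and are not part of ``kaldifeat``:

.. code-block:: python

  import math


  def mel_scale(freq):
      return 1127.0 * math.log(1.0 + freq / 700.0)


  def inverse_mel_scale(mel):
      return 700.0 * (math.exp(mel / 1127.0) - 1.0)


  def mel_center_freqs(num_bins=23, low_freq=20.0, high_freq=0.0, samp_freq=16000.0):
      """Return the center frequency in Hz of each mel bin (hypothetical helper)."""
      nyquist = 0.5 * samp_freq
      if high_freq <= 0.0:
          # A non-positive value is an offset from the Nyquist frequency.
          high_freq += nyquist
      mel_low = mel_scale(low_freq)
      mel_high = mel_scale(high_freq)
      delta = (mel_high - mel_low) / (num_bins + 1)
      return [inverse_mel_scale(mel_low + (b + 1) * delta) for b in range(num_bins)]


  centers = mel_center_freqs(num_bins=80)
  print(len(centers), round(centers[0], 1), round(centers[-1], 1))

With more bins, the centers are packed more densely between ``low_freq`` and
``high_freq``, so each filter covers a narrower band.
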
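To change the device to ``cuda``, assign a device string or a ``torch.device``
to ``fbank_opts.device``, for example ``fbank_opts.device = 'cuda:0'``, as in
the ``device`` example above.
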
Fbank
-----

The following shows how to use ``kaldifeat.Fbank`` to compute
the fbank features of sound files.

First, let us generate two sound files using ``sox``:

.. code-block:: bash

  # generate a wav of 1.2 seconds, containing a sine-wave
  # swept from 300 Hz to 3300 Hz
  sox -n -r 16000 -b 16 test.wav synth 1.2 sine 300-3300

  # another sound file with 0.5 seconds
  sox -n -r 16000 -b 16 test2.wav synth 0.5 sine 300-3300

.. hint::

  You can find the above two files by visiting the following two links:

  - `test.wav <https://github.com/csukuangfj/kaldifeat/blob/master/kaldifeat/python/tests/test_data/test.wav>`_
  - `test2.wav <https://github.com/csukuangfj/kaldifeat/blob/master/kaldifeat/python/tests/test_data/test2.wav>`_

The `following code <https://github.com/csukuangfj/kaldifeat/blob/master/kaldifeat/python/tests/test_fbank.py>`_
shows the usage of ``kaldifeat.Fbank``.

It shows:

- How to read a sound file. Note that audio samples are scaled to the 16-bit integer range [-32768, 32767].
  The intention is to produce the same output as kaldi. You don't need to scale them if
  you don't care about compatibility with kaldi.

- ``kaldifeat.Fbank`` supports CUDA as well as CPU.

- ``kaldifeat.Fbank`` supports processing sound files in a batch as well as accepting
  a single sound file.

.. literalinclude:: ./code/test_fbank.py
  :caption: Demo of ``kaldifeat.Fbank``
  :language: python
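
The sample scaling mentioned in the list above can also be illustrated without
any third-party package. The following sketch uses only the Python standard
library and assumes a 16-bit mono PCM file; the helper names are made up for
this example. It returns plain Python floats, so converting them to the tensor
type expected by ``kaldifeat.Fbank`` is left out here:

.. code-block:: python

  import array
  import math
  import os
  import sys
  import tempfile
  import wave


  def write_sine(filename, sample_rate=16000, duration=0.5, freq=300.0):
      """Write a 16-bit mono sine wave so that the example has some input."""
      n = int(sample_rate * duration)
      samples = array.array(
          "h",
          (int(16000 * math.sin(2 * math.pi * freq * i / sample_rate)) for i in range(n)),
      )
      if sys.byteorder == "big":
          samples.byteswap()  # wav files store little-endian samples
      with wave.open(filename, "wb") as f:
          f.setnchannels(1)
          f.setsampwidth(2)
          f.setframerate(sample_rate)
          f.writeframes(samples.tobytes())


  def read_wave(filename):
      """Return (samples, sample_rate); samples are floats in [-32768, 32767]."""
      with wave.open(filename, "rb") as f:
          if f.getsampwidth() != 2 or f.getnchannels() != 1:
              raise ValueError("expected a 16-bit mono wav file")
          sample_rate = f.getframerate()
          samples = array.array("h", f.readframes(f.getnframes()))
      if sys.byteorder == "big":
          samples.byteswap()
      # No division by 32768: the values stay in the 16-bit integer range.
      return [float(s) for s in samples], sample_rate


  with tempfile.TemporaryDirectory() as tmp_dir:
      path = os.path.join(tmp_dir, "sine.wav")
      write_sine(path)
      samples, sample_rate = read_wave(path)
      print(len(samples), sample_rate, max(samples))
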