PyTorch implementation of Learning Filterbanks from Raw Speech for Phone Recognition (ICASSP 2018).

Time-Domain Filterbanks (TD-filterbanks) are neural network layers that operate directly on the raw audio waveform. At initialization, they approximate standard mel-filterbanks by computing first-order scattering coefficients. They can then be fine-tuned jointly with the rest of the architecture. The usual mel-filterbank options are also available: a pre-emphasis layer, log compression of the coefficients, and mean-variance normalization.
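As a rough illustration of the three options, here is a minimal sketch written with plain PyTorch functions. The function names, the 0.97 pre-emphasis coefficient, and the choice to normalize each channel over time are assumptions made for this example; they are not taken from this repository's code.

```python
import torch
import torch.nn.functional as F


def preemphasis(waveform, coeff=0.97):
    """Hypothetical pre-emphasis: y[t] = x[t] - coeff * x[t-1].

    waveform has shape (batch, 1, samples); coeff=0.97 is an assumed value.
    """
    kernel = torch.tensor([[[-coeff, 1.0]]], dtype=waveform.dtype)
    # Left-pad with one zero so the output keeps the input length.
    padded = F.pad(waveform, (1, 0))
    return F.conv1d(padded, kernel)


def log_compression(coeffs):
    """Hypothetical log compression: log(1 + |x|), safe for zero inputs."""
    return torch.log1p(coeffs.abs())


def mean_var_norm(coeffs, eps=1e-5):
    """Hypothetical mean-variance normalization of each channel over time.

    coeffs has shape (batch, channels, frames).
    """
    mean = coeffs.mean(dim=-1, keepdim=True)
    std = coeffs.std(dim=-1, keepdim=True)
    return (coeffs - mean) / (std + eps)


# Example: pre-emphasize a tiny waveform, then compress and normalize
# some random non-negative "filterbank" coefficients.
x = torch.tensor([[[1.0, 2.0, 3.0]]])
y = preemphasis(x)
z = mean_var_norm(log_compression(torch.rand(2, 40, 50)))
```

Pre-emphasis acts on the waveform before the filterbank, while log compression and normalization act on the filterbank output.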

There are four different modes for TD-filterbanks:

  • Fixed: Initialize the layers to match mel-filterbanks and keep their parameters fixed while training the model
  • Learn-all: Initialize the layers to match mel-filterbanks, then learn the filterbank and the averaging jointly with the model
  • Learn-filterbank: Start from the same initialization and learn only the filterbank with the model, keeping the averaging fixed to a squared Hanning window
  • Randinit: Initialize the layers randomly and learn them with the network
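The four modes differ only in how the layers are initialized and in which parameters receive gradients. The sketch below summarizes this as a lookup table. `ModeFlags`, `MODE_FLAGS`, and `apply_mode` are hypothetical names introduced for this example and are not part of this repository. The helper only sets `requires_grad`, so it should work with PyTorch parameters as well as with the stand-in objects used here.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass(frozen=True)
class ModeFlags:
    mel_init: bool          # initialize to approximate mel-filterbanks
    learn_filterbank: bool  # update the filterbank (first convolution)
    learn_averaging: bool   # update the averaging (second convolution)


# Hypothetical table mirroring the four modes described above.
MODE_FLAGS = {
    "Fixed": ModeFlags(True, False, False),
    "Learn-all": ModeFlags(True, True, True),
    "Learn-filterbank": ModeFlags(True, True, False),
    "Randinit": ModeFlags(False, True, True),
}


def apply_mode(filterbank_params, averaging_params, mode):
    """Freeze or unfreeze each parameter group according to the mode."""
    flags = MODE_FLAGS[mode]
    for p in filterbank_params:
        p.requires_grad = flags.learn_filterbank
    for p in averaging_params:
        p.requires_grad = flags.learn_averaging
    return flags


# Stand-ins for parameters; with PyTorch, pass e.g. layer.parameters().
fb = [SimpleNamespace(requires_grad=True)]
avg = [SimpleNamespace(requires_grad=True)]
flags = apply_mode(fb, avg, "Learn-filterbank")
```

In the Learn-filterbank case, the filterbank stays trainable while the averaging window is frozen.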

 TD-filterbanks

Time-Domain Filterbanks are a neural architecture composed of a complex-valued convolution, a modulus operator, and a grouped real-valued convolution. This structure follows the computation of first-order scattering coefficients. A TD-filterbank layer is created by instantiating the class TDFbanks.
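To make the three-stage structure concrete, here is a minimal sketch in PyTorch. It is not the TDFbanks class from this repository. The class name `TDFbanksSketch`, its constructor arguments, and the default sizes (40 filters, a 400-sample window, a 160-sample stride) are assumptions chosen for illustration. The complex convolution is emulated with a real convolution that has twice as many output channels, and the first layer is left at PyTorch's default random initialization rather than a mel-like one.

```python
import torch
import torch.nn as nn


class TDFbanksSketch(nn.Module):
    """Illustrative stand-in: complex conv -> modulus -> grouped real conv."""

    def __init__(self, n_filters=40, window_size=400, window_stride=160):
        super().__init__()
        self.n_filters = n_filters
        # Complex-valued convolution emulated with 2 * n_filters real
        # channels: the first half holds real parts, the second half
        # holds imaginary parts.
        self.complex_conv = nn.Conv1d(
            1, 2 * n_filters, window_size,
            padding=window_size // 2, bias=False,
        )
        # Grouped real-valued convolution: one averaging filter per channel.
        self.lowpass = nn.Conv1d(
            n_filters, n_filters, window_size,
            stride=window_stride, groups=n_filters, bias=False,
        )
        # Initialize the averaging filters to a squared Hann window.
        window = torch.hann_window(window_size, periodic=False) ** 2
        with torch.no_grad():
            self.lowpass.weight.copy_(
                window.expand(n_filters, 1, window_size)
            )

    def forward(self, waveform):
        # waveform: (batch, 1, samples)
        z = self.complex_conv(waveform)
        real, imag = z[:, :self.n_filters], z[:, self.n_filters:]
        modulus = torch.sqrt(real ** 2 + imag ** 2)
        # Output: (batch, n_filters, frames)
        return self.lowpass(modulus)


# One second of random audio at an assumed 16 kHz sampling rate.
layer = TDFbanksSketch()
out = layer(torch.randn(2, 1, 16000))
```

Setting `groups=n_filters` in the second convolution means each channel is averaged independently with its own window, which is what makes it a per-filter low-pass step rather than a mixing layer.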