Python bindings

Featurization

The featurization module provides a set of classes for standard feature extraction from audio data: Ceplifter, Dct, Derivatives, Dither, Mfcc, Mfsc, PowerSpectrum, PreEmphasis, TriFilterbank, and Windowing.

Each of them has an apply method that transforms the input data. For example:

# imports
from wav2letter.feature import FeatureParams, Mfcc
import itertools as it

# read the wave
# (this example assumes the file contains whitespace-separated float samples,
#  not binary WAV data)
with open("path/to/file.wav") as f:
    wavinput = [float(x) for x in it.chain.from_iterable(line.split() for line in f)]

# create params struct
params = FeatureParams()
params.sampling_freq = 16000
params.low_freq_filterbank = 0
params.high_freq_filterbank = 8000
params.num_filterbank_chans = 20
params.num_cepstral_coeffs = 13
params.use_energy = False
params.zero_mean_frame = False
params.use_power = False

# define transformation and apply to the wave
mfcc = Mfcc(params)
features = mfcc.apply(wavinput)
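
All featurization classes follow the same pattern: construct with a FeatureParams struct, then call apply. As a hedged sketch (assuming Mfsc accepts the same FeatureParams as Mfcc), switching to log mel-filterbank features looks like this:

from wav2letter.feature import Mfsc

# Mfsc computes log mel-filterbank features; same apply interface as Mfcc
mfsc = Mfsc(params)
mfsc_features = mfsc.apply(wavinput)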

ASG Loss

The ASG loss is a PyTorch module (nn.Module) supporting CPU and CUDA backends. It can be created as:

from wav2letter.criterion import ASGLoss
asg_loss = ASGLoss(ntokens, scale_mode).to(device)

where ntokens is the number of tokens predicted at each frame (the number of classes) and scale_mode is the scaling applied to the loss, one of:

NONE = 0, # no scaling
INPUT_SZ = 1, # scale to the input size
INPUT_SZ_SQRT = 2, # scale to the sqrt of the input size
TARGET_SZ = 3, # scale to the target size
TARGET_SZ_SQRT = 4, # scale to the sqrt of the target size
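
Once created, the loss is used like any other PyTorch criterion. Below is a hedged sketch of a forward/backward pass; the shapes ([batch, time, ntokens] inputs, integer target sequences) and passing the scale mode as a plain integer are assumptions based on the listing above, so check the bindings for the exact contract:

import torch
from wav2letter.criterion import ASGLoss

B, T, N = 2, 50, 30  # example batch size, frame count, token count
asg_loss = ASGLoss(N, 0)  # 0 == NONE (no scaling); an enum value may also be accepted

inputs = torch.randn(B, T, N, requires_grad=True)  # frame-wise token scores from the acoustic model
targets = torch.randint(0, N, (B, 10))             # target token index sequences
loss = asg_loss(inputs, targets)
loss.sum().backward()  # sum in case the loss is returned per batch element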

Beam-search decoder

Currently, only a lexicon-based beam-search decoder is supported, and only an n-gram (KenLM) language model is available through the Python bindings. However, one can define a custom language model in Python and use it for decoding; see the details below. For a better understanding of how the beam-search decoder works, see the Beam-search decoder section.

To run the decoder, first define its options:

from wav2letter.decoder import DecoderOptions


options = DecoderOptions(
    beam_size, # number of top hypotheses to preserve at each decoding step
    token_beam_size, # restrict the number of tokens considered, by top AM scores (useful for huge token sets)
    beam_threshold, # preserve a hypothesis only if its score is within this threshold of the current best score
    lm_weight, # language model weight for the LM score
    word_score, # score for a word appearing in the transcription
    unk_score, # score for an unknown word appearing in the transcription
    sil_score, # score for silence appearing in the transcription
    eos_score, # score for eos appearing in the transcription (Seq2Seq decoder only)
    log_add, # how to combine scores when merging hypotheses (log-add vs. max)
    criterion_type # CriterionType.ASG or CriterionType.CTC only
    )
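
As a concrete illustration, here is one possible instantiation; the numbers are placeholders rather than recommended settings, and the import location of CriterionType is an assumption:

from wav2letter.decoder import CriterionType

options = DecoderOptions(
    2500,           # beam_size
    25,             # token_beam_size
    100.0,          # beam_threshold
    2.0,            # lm_weight
    2.0,            # word_score
    -float("inf"),  # unk_score: effectively forbid unknown words
    -1.0,           # sil_score
    0.0,            # eos_score (unused for CTC)
    False,          # log_add: merge hypotheses with max
    CriterionType.CTC
)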

Next, prepare the token dictionary (the tokens for which the acoustic model returns a probability at each frame) and the lexicon (a mapping from words to their spellings in the token set). For details on the token and lexicon file formats, see Data Preparation.

from wav2letter.common import Dictionary, load_words, create_word_dict, tkn_to_idx


token_dict = Dictionary("path/tokens.txt")
# for ASG add the repetition symbols used, for example:
# token_dict.add_entry("1")
# token_dict.add_entry("2")

lexicon = load_words("path/lexicon.txt") # returns LexiconMap
word_dict = create_word_dict(lexicon) # returns Dictionary

To create a KenLM language model, use:

from wav2letter.decoder import KenLM


lm = KenLM("path/lm.arpa", word_dict) # or "path/lm.bin"
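
The LM object can also be queried directly. A hedged sketch: score takes a state and a word index and returns a (new state, score) pair, as used in the trie-building loop below; the word "the" is assumed to be present in the word dictionary:

state = lm.start(False)
new_state, score = lm.score(state, word_dict.get_index("the"))
print(score)  # LM score for "the" from the start state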

Get the silence index from the token dictionary and the unknown-word index from the word dictionary to pass into the decoder:

sil_idx = token_dict.get_index("|")
unk_idx = word_dict.get_index("<unk>")

Now build the lexicon Trie that restricts the beam search:

from wav2letter.decoder import Trie, SmearingMode


trie = Trie(token_dict.index_size(), sil_idx)
start_state = lm.start(False)

for word, spellings in lexicon.items():
    usr_idx = word_dict.get_index(word)
    _, score = lm.score(start_state, usr_idx)
    for spelling in spellings:
        spelling_idxs = tkn_to_idx(spelling, token_dict, 1) # convert spelling into a vector of token indices
        trie.insert(spelling_idxs, usr_idx, score)

trie.smear(SmearingMode.MAX) # propagate word scores to each spelling node as an LM proxy score

Now we can run the lexicon-based decoder:

import numpy
from wav2letter.decoder import LexiconDecoder


blank_idx = token_dict.get_index("#") # for CTC
transitions = numpy.zeros((token_dict.index_size(), token_dict.index_size())) # for ASG, fill with the trained transition matrix
is_token_lm = False # we use a word-level LM
decoder = LexiconDecoder(options, trie, lm, sil_idx, blank_idx, unk_idx, transitions, is_token_lm)
# emissions is a numpy array of acoustic model predictions with shape [T, N],
# where T is the number of frames and N is the number of tokens
T, N = emissions.shape
results = decoder.decode(emissions.ctypes.data, T, N)
# results[i].tokens contains the token sequence (of length T)
# results[i].score contains the hypothesis score
# results is sorted, with the best hypothesis at index 0
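
To turn the best hypothesis back into text, map the token indices through the dictionary and collapse the frame-wise path. This is a hedged sketch for the CTC case (blank "#" and word separator "|" as above; get_entry is assumed to be the inverse of get_index):

best = results[0]
raw = [token_dict.get_entry(i) for i in best.tokens]

# collapse repeated frames, then drop CTC blanks; "|" marks word boundaries
collapsed = [t for i, t in enumerate(raw) if i == 0 or t != raw[i - 1]]
transcript = "".join(t for t in collapsed if t != "#").replace("|", " ")
print(best.score, transcript)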

Define your own language model for beam-search decoding

One can define a custom language model in Python and use it for beam-search decoding.

Language model state is handled via the base class LMState; one can attach additional info to each state by keeping a dict mapping LMState to that info inside the language model class:

import numpy
from wav2letter.decoder import LM, LMState


class MyPyLM(LM):
    mapping_states = dict() # store simple additional int for each state
 
    def __init__(self):
        LM.__init__(self)

    def start(self, start_with_nothing):
        state = LMState()
        self.mapping_states[state] = 0
        return state
 
    def score(self, state: LMState, token_index: int):
        """
        Evaluate the language model given the current LM state and a new word.

        Parameters:
        -----------
        state: current LM state
        token_index: index of the word
                     (this may be a lexicon index, in which case the LM should
                      store a mapping between lexicon and LM indices, or the
                      LM index of a word directly)

        Returns:
        --------
        (LMState, float): pair of (new state, score for the current word)
        """
        outstate = state.child(token_index)
        if outstate not in self.mapping_states:
            self.mapping_states[outstate] = self.mapping_states[state] + 1
        return (outstate, -numpy.random.random())
 
    def finish(self, state: LMState):
        """
        Evaluate the eos score for the language model given the current LM state.

        Returns:
        --------
        (LMState, float): pair of (new state, eos score)
        """
        outstate = state.child(-1)
        if outstate not in self.mapping_states:
            self.mapping_states[outstate] = self.mapping_states[state] + 1
        return (outstate, -1)

LMState is a C++ base class for the language model state. Its compare method (comparing one state with another) is used inside the beam-search decoder. It also has a method LMState child(int index) that returns the state reached by following the token with this index from the current state; thus all states are organized as a trie. We use the child method in Python to build this trie correctly (the decoder relies on it when comparing states), and we can then store additional per-state info in mapping_states.
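
To make the trie behavior concrete, here is a small hedged illustration: following the same index from the same state yields the same trie node, which is exactly what lets a dict keyed by LMState (like mapping_states above) work:

from wav2letter.decoder import LMState

root = LMState()
a = root.child(3)
b = root.child(3)

# a and b refer to the same trie node, so info stored under one is visible via the other
info = {a: "node-3"}
print(info[b])  # prints "node-3"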

This language model can then be used as follows (also printing each state and the additional info stored in mapping_states):

custom_lm = MyPyLM()

state = custom_lm.start(True)
print(state, custom_lm.mapping_states[state])

for i in range(5):
    state, score = custom_lm.score(state, i)
    print(state, custom_lm.mapping_states[state], score)

state, score = custom_lm.finish(state)
print(state, custom_lm.mapping_states[state], score)

and for the decoder:

decoder = LexiconDecoder(options, trie, custom_lm, sil_idx, blank_idx, unk_idx, transitions, False)