Audio-Visual Speech Recognition using SciPy

In audio-visual automatic speech recognition (AVASR) both acoustic and visual modalities of speech are used to identify what a person is saying. In this paper we propose a basic AVASR system implemented using SciPy, an open source Python library for scientific computing. AVASR research draws from the fields of signal processing, computer vision and machine learning, all of which are active fields of development in the SciPy community. AVASR researchers using SciPy are therefore able to draw on a wide range of readily available tools. The performance of the system is tested using the Clemson University audio-visual experiments (CUAVE) database. We find that visual speech information is in itself not sufficient for automatic speech recognition. However, by integrating visual and acoustic speech information we are able to obtain better performance than what is possible with audio-only ASR.


Introduction
Motivated by the multi-modal manner in which humans perceive their environment, research in Audio-Visual Automatic Speech Recognition (AVASR) focuses on the integration of acoustic and visual speech information with the purpose of improving the accuracy and robustness of speech recognition systems. AVASR is in general expected to perform better than audio-only automatic speech recognition (ASR), especially in noisy environments, as the visual channel is not affected by acoustic noise.
Functional requirements for an AVASR system include acoustic and visual feature extraction, probabilistic model learning and classification.
In this paper we propose a basic AVASR system implemented using SciPy. In the proposed system, mel-frequency cepstrum coefficients (MFCCs) and active appearance model (AAM) parameters are used as acoustic and visual features, respectively. Gaussian mixture models (GMMs) are used to learn the distributions of the feature vectors given a particular class such as a word or a phoneme. We present two alternatives for learning the GMMs: expectation maximization (EM) and variational Bayesian (VB) inference.
The performance of the system is tested using the CUAVE database. The performance is evaluated by calculating the misclassification rate of the system on separate test data. We find that visual speech information is in itself not sufficient for automatic speech recognition. However, by integrating visual and acoustic speech information we are able to obtain better performance than what is possible with audio-only ASR.


Acoustic speech
MFCCs are the standard acoustic features used in most modern speech recognition systems. In [Dav80] MFCCs are shown experimentally to give better recognition accuracy than alternative parametric representations.
MFCCs are calculated as the cosine transform of the logarithm of the short-term energy spectrum of the signal expressed on the mel-frequency scale. The result is a set of coefficients that approximates the way the human auditory system perceives sound.
MFCCs may be used directly as acoustic features in an AVASR system. In this case the dimensionality of the feature vectors equals the number of MFCCs computed. Alternatively, velocity and acceleration information may be included by appending first and second order temporal differences to the feature vectors.
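As a sketch, appending such temporal differences can be done with NumPy; np.gradient is used here as a simple symmetric difference, whereas a real system might use a regression window instead:

```python
import numpy as np

def add_deltas(features):
    """Append first and second order temporal differences to feature vectors.

    features: array of shape (num_frames, num_coeffs).
    """
    delta = np.gradient(features, axis=0)    # first-order differences
    delta2 = np.gradient(delta, axis=0)      # second-order differences
    return np.hstack([features, delta, delta2])

frames = np.arange(20, dtype=float).reshape(5, 4)   # 5 frames, 4 coefficients each
augmented = add_deltas(frames)
print(augmented.shape)  # (5, 12): the dimensionality triples
```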
The total number of feature vectors obtained from an audio sample depends on the duration and sample rate of the original sample and the size of the window that is used in calculating the cepstrum (a windowed Fourier transform). Figure 1 shows the original audio sample and mel-frequency cepstrum for the word "zero".
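As an illustrative sketch (not the implementation used in the system), the MFCC pipeline described above can be written with NumPy and SciPy. The frame sizes, filter count, and the simple triangular mel filterbank construction below are assumptions chosen for illustration:

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters with centers spaced evenly on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for j in range(l, c):
            fb[i - 1, j] = (j - l) / max(c - l, 1)
        for j in range(c, r):
            fb[i - 1, j] = (r - j) / max(r - c, 1)
    return fb

def mfcc(signal, fs, frame_len=400, hop=160, n_fft=512, n_filters=26, n_ceps=13):
    window = np.hamming(frame_len)
    frames = np.array([signal[s:s + frame_len] * window
                       for s in range(0, len(signal) - frame_len + 1, hop)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2   # short-term power spectrum
    fb = mel_filterbank(n_filters, n_fft, fs)
    log_energies = np.log(power @ fb.T + 1e-10)       # log mel-band energies
    # Cosine transform of the log mel spectrum gives the cepstral coefficients.
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, :n_ceps]

fs = 16000
t = np.arange(fs) / fs
coeffs = mfcc(np.sin(2 * np.pi * 440 * t), fs)  # one second of a 440 Hz tone
print(coeffs.shape)  # (num_frames, 13)
```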

Visual speech
While acoustic speech features can be extracted through a sequence of transformations applied to the input audio signal, extracting visual speech features is in general more complicated. The visual information relevant to speech is mostly contained in the motion of visible articulators such as lips, tongue and jaw. In order to extract this information from a sequence of video frames it is advantageous to track the complete motion of the face and facial features. AAM [Coo98] fitting is an efficient and robust method for tracking the motion of deformable objects in a video sequence. AAMs model variations in shape and texture of the object of interest. To build an AAM it is necessary to provide sample images with the shape of the object annotated. Hence, in contrast to MFCCs, AAMs require prior training before being used for tracking and feature extraction.
The shape of an appearance model is given by a set of n (x, y) coordinates represented in the form of a column vector

s = (x_1, y_1, x_2, y_2, \ldots, x_n, y_n)^T.    (1)

The coordinates are relative to the coordinate frame of the image. Shape variation is restricted to a base shape s_0 plus a linear combination of a set of N shape vectors s_i,

s = s_0 + \sum_{i=1}^{N} p_i s_i,    (2)

where the p_i are called the shape parameters of the AAM. The base shape and shape vectors are normally generated by applying principal component analysis (PCA) to a set of manually annotated training images. The base shape s_0 is the mean of the object annotations in the training set, and the shape vectors are the N singular vectors corresponding to the N largest singular values of the data matrix (constructed from the training shapes). Figure 2 shows an example of a base mesh and the first three shape vectors corresponding to the three largest singular values of the data matrix.
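A minimal sketch of this PCA step, using synthetic annotations in place of real training shapes (the square and the noise level below are invented for illustration):

```python
import numpy as np

# Synthetic "annotated" training shapes: rows are flattened (x1, y1, ..., xn, yn).
rng = np.random.default_rng(0)
base = np.array([0., 0., 1., 0., 1., 1., 0., 1.])    # a unit square, 4 points
shapes = base + 0.05 * rng.standard_normal((50, 8))  # 50 noisy annotations

s0 = shapes.mean(axis=0)                 # base shape: mean of the annotations
X = shapes - s0                          # centered data matrix
_, _, Vt = np.linalg.svd(X, full_matrices=False)

N = 3
shape_vectors = Vt[:N]                   # singular vectors with largest singular values

# A training shape is approximated as s0 plus a linear combination of the
# shape vectors; the coefficients p are the shape parameters.
p = shape_vectors @ (shapes[0] - s0)
reconstruction = s0 + shape_vectors.T @ p
```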
The appearance of an AAM is defined with respect to the base shape s_0. As with shape, appearance variation is restricted to a base appearance A_0 plus a linear combination of M appearance vectors A_i,

A(x) = A_0(x) + \sum_{i=1}^{M} \lambda_i A_i(x),    (3)

where the λ_i are called the appearance parameters. To generate an appearance model, the training images are first shape-normalized by warping each image onto the base mesh using a piecewise affine transformation. Recall that two sets of three corresponding points are sufficient for determining an affine transformation. The shape mesh vertices are first triangulated. The collection of corresponding triangles in two shape meshes then defines a piecewise affine transformation between the two shapes. The pixel values within each triangle in the training shape s are warped onto the corresponding triangle in the base shape s_0 using the affine transformation defined by the two triangles. The appearance model is generated from the shape-normalized images using PCA. Figure 3 shows the base appearance and the first three appearance images.
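The affine transformation for one pair of corresponding triangles can be recovered from the three vertex correspondences alone. A small sketch with hypothetical vertex coordinates:

```python
import numpy as np

def affine_from_triangles(src, dst):
    """Affine transform A (2x3) mapping the src triangle onto the dst triangle.

    src, dst: arrays of shape (3, 2) holding corresponding vertices.
    Solves dst = A @ [x, y, 1]^T for the six affine coefficients.
    """
    M = np.hstack([src, np.ones((3, 1))])   # (3, 3): rows [x, y, 1]
    # Solve M @ A.T = dst for A.T (exact for non-degenerate triangles).
    return np.linalg.solve(M, dst).T        # (2, 3)

src = np.array([[0., 0.], [1., 0.], [0., 1.]])
dst = np.array([[2., 1.], [3., 1.], [2., 3.]])
A = affine_from_triangles(src, dst)

# The transform maps each src vertex exactly onto its dst counterpart.
mapped = (A @ np.hstack([src, np.ones((3, 1))]).T).T
```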
Tracking of an appearance in a sequence of images is performed by minimizing the difference between the base model appearance and the input image warped onto the coordinate frame of the AAM. For a given image I we minimize

\sum_{x \in s_0} \left[ A_0(x) + \sum_{i=1}^{M} \lambda_i A_i(x) - I(W(x; p)) \right]^2,    (4)

where W(x; p) denotes the piecewise affine warp defined by the shape parameters p. For the rest of the discussion of AAMs we assume that the variable x takes on the image coordinates contained within the base mesh s_0, as in (4).
In (4) we are looking for the optimal alignment of the input image, warped backwards onto the frame of the base appearance A_0(x).
For simplicity we will limit the discussion to shape variation and ignore any variation in texture. The derivation for the case including texture variation is available in [Mat03]. Consequently (4) now reduces to

\sum_x \left[ A_0(x) - I(W(x; p)) \right]^2.    (5)

Solving (5) for p is a non-linear optimization problem. This is the case even if W(x; p) is linear in p, since the pixel values I(x) are in general non-linear in x.
The quantity that is minimized in (5) is the same as in the classic Lucas-Kanade image alignment algorithm [Luc81]. In the Lucas-Kanade algorithm the problem is first reformulated as

\sum_x \left[ A_0(x) - I(W(x; p + \Delta p)) \right]^2.    (6)

This differs from (5) in that we are now optimizing with respect to ∆p while assuming p is known. Given an initial estimate of p, we update it with the value of ∆p that minimizes (6) to give

p_{new} = p + \Delta p.    (7)

This will necessarily decrease the value of (5) for the new value of p. Replacing p with the updated value p_{new}, this procedure is iterated until convergence, at which point p yields the (locally) optimal shape parameters for the input image I.
To solve (6) a first-order Taylor expansion is used [Bak01], which gives

\sum_x \left[ A_0(x) - I(W(x; p)) - \nabla I \frac{\partial W}{\partial p} \Delta p \right]^2,    (8)

where ∇I is the gradient of the input image and ∂W/∂p is the Jacobian of the warp evaluated at p. The optimal solution to (8) is found by setting the partial derivative with respect to ∆p equal to zero and solving for ∆p, which gives

\Delta p = H^{-1} \sum_x \left[ \nabla I \frac{\partial W}{\partial p} \right]^T \left[ A_0(x) - I(W(x; p)) \right],    (9)

where H is the Gauss-Newton approximation to the Hessian matrix, given by

H = \sum_x \left[ \nabla I \frac{\partial W}{\partial p} \right]^T \left[ \nabla I \frac{\partial W}{\partial p} \right].    (10)

For a motivation for the backwards warp and further details on how to compute the piecewise affine warp and the Jacobian, see [Mat03].
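A toy sketch of this Gauss-Newton update for a one-dimensional signal under a pure translation warp W(x; p) = x + p, where the Jacobian ∂W/∂p is simply 1; the signals are synthetic:

```python
import numpy as np

# Template A0 and an input "image" I that is A0 shifted by 2.0 samples.
x = np.arange(100, dtype=float)
A0 = np.exp(-0.5 * ((x - 50.0) / 5.0) ** 2)
true_shift = 2.0
I = np.exp(-0.5 * ((x - 50.0 - true_shift) / 5.0) ** 2)

p = 0.0                              # initial shift estimate
for _ in range(50):
    Iw = np.interp(x + p, x, I)      # input warped back by the current p
    grad = np.gradient(Iw)           # image gradient; Jacobian dW/dp = 1 here
    error = A0 - Iw                  # residual appearance error
    H = np.sum(grad * grad)          # Gauss-Newton Hessian approximation
    dp = np.sum(grad * error) / H    # additive update from the normal equations
    p += dp
    if abs(dp) < 1e-8:
        break
```

At convergence p recovers the true shift, mirroring how the shape parameters converge to the best fit of the AAM.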
A proper initialization of the shape parameters p is essential for the first frame. For subsequent frames p may be initialized as the optimal parameters from the previous frame.
The Lucas-Kanade algorithm is a Gauss-Newton gradient descent algorithm. Gauss-Newton gradient descent is available in scipy.optimize.fmin_ncg.
Figure 4 shows an AAM fitted to an input image. When tracking motion in a video sequence an AAM is fitted to each frame using the previous optimal fit as a starting point.
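A hedged sketch of calling fmin_ncg, using the standard Rosenbrock test function in place of the AAM objective (setting up the warp and appearance error is beyond a short snippet):

```python
import numpy as np
from scipy.optimize import fmin_ncg, rosen, rosen_der

# Minimize the Rosenbrock test function from a rough initial guess. For AAM
# fitting, f and fprime would instead evaluate the appearance error (5) and
# its gradient with respect to the shape parameters p.
p0 = np.array([0.5, 0.5])
p_opt = fmin_ncg(rosen, p0, fprime=rosen_der, disp=False)
print(p_opt)  # approximately [1. 1.], the Rosenbrock minimum
```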
In [Bak01] the AAM fitting method described above is referred to as "forwards-additive".
As can be seen in Figure 2 the first two shape vectors mainly correspond to the movement in the up-down and left-right directions, respectively. As these components do not contain any speech related information we can ignore the corresponding shape parameters p 1 and p 2 when extracting visual speech features. The remaining shape parameters, p 3 , . . . , p N , are used as visual features in the AVASR system.

Models for audio-visual speech recognition
Once acoustic and visual speech features have been extracted from the respective modalities, we learn probabilistic models for each of the classes we need to discriminate between (e.g. words or phonemes). The models are learned from manually labeled training data. We require these models to generalize well; i.e. the models must be able to correctly classify novel samples that were not present in the training data.

Gaussian Mixture Models
Gaussian Mixture Models (GMMs) provide a powerful method for modeling data distributions under the assumption that the data are independent and identically distributed (i.i.d.). A GMM is defined as a weighted sum of Gaussian probability distributions,

p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k),    (11)

where π_k is the weight, μ_k the mean, and Σ_k the covariance matrix of the kth mixture component.
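A direct sketch of evaluating such a weighted sum of Gaussians with NumPy; the weights, means and covariances below are arbitrary illustrative values:

```python
import numpy as np

def gaussian(x, mu, cov):
    """Multivariate normal density N(x | mu, cov)."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

def gmm_density(x, weights, means, covs):
    """Weighted sum of Gaussian densities, as defined above."""
    return sum(w * gaussian(x, m, c) for w, m, c in zip(weights, means, covs))

weights = [0.3, 0.7]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]
density = gmm_density(np.array([0.0, 0.0]), weights, means, covs)
```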

Maximum likelihood
The log likelihood function of the GMM parameters π, μ and Σ given a set of D-dimensional observations X = {x_1, ..., x_N} is given by

\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}.    (12)

Note that the log likelihood is a function of the GMM parameters π, μ and Σ. In order to fit a GMM to the observed data we maximize this likelihood with respect to the model parameters.

Expectation maximization
The Expectation Maximization (EM) algorithm [Bis07] is an efficient iterative technique for optimizing the log likelihood function. As its name suggests, EM is a two-stage algorithm. The first (E, or expectation) step calculates the expectation for each data point to belong to each of the mixture components. This expectation is often expressed as the responsibility that the kth mixture component takes for "explaining" the nth data point, and is given by

\gamma_{nk} = \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}.    (13)

Note that this is a "soft" assignment, where each data point is assigned to a given mixture component with a certain probability. Once the responsibilities are available, the model parameters are updated (the M, or maximization, step). The quantities

N_k = \sum_{n=1}^{N} \gamma_{nk}

are first calculated. Finally the model parameters are updated as

\mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} x_n, \quad
\Sigma_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} (x_n - \mu_k^{new})(x_n - \mu_k^{new})^T, \quad
\pi_k^{new} = \frac{N_k}{N}.

Note that in practice more than two shape parameters are used, which usually also requires an increase in the number of mixture components necessary to sufficiently capture the distribution of the data.
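The E- and M-steps can be sketched in a few lines of NumPy; the two-cluster data and the simple initialization below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two well-separated clusters of 2-D feature vectors.
X = np.vstack([rng.normal(0.0, 0.5, (200, 2)),
               rng.normal(4.0, 0.5, (200, 2))])
N, D, K = X.shape[0], X.shape[1], 2

# Simple hand-picked initialization: one sample from each cluster.
pi = np.full(K, 1.0 / K)
mu = np.array([X[0], X[-1]])
cov = np.array([np.eye(D) for _ in range(K)])

def normal_pdf(X, mu, cov):
    """Row-wise multivariate normal density N(x_n | mu, cov)."""
    diff = X - mu
    expo = -0.5 * np.einsum('ni,ij,nj->n', diff, np.linalg.inv(cov), diff)
    return np.exp(expo) / np.sqrt((2 * np.pi) ** D * np.linalg.det(cov))

for _ in range(50):
    # E-step: responsibility gamma[n, k] of component k for point n.
    dens = np.column_stack([pi[k] * normal_pdf(X, mu[k], cov[k])
                            for k in range(K)])
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means and covariances.
    Nk = gamma.sum(axis=0)
    pi = Nk / N
    mu = (gamma.T @ X) / Nk[:, None]
    for k in range(K):
        diff = X - mu[k]
        cov[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
```

After convergence the estimated means sit near the true cluster centers and the weights near 0.5 each.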

Variational Bayes
An important question that we have not yet answered is how to choose the number of mixture components. Too many components lead to redundant computation, while too few may not be sufficient to represent the structure of the data.
Additionally, too many components easily lead to overfitting. Overfitting occurs when the complexity of the model is not in proportion to the amount of available training data. In this case the data is not sufficient for accurately estimating the GMM parameters.
The maximum likelihood criterion is unsuitable for estimating the number of mixture components, since the likelihood increases monotonically with the number of components. Variational Bayesian (VB) inference is an alternative learning method that is less sensitive than ML-EM to overfitting and singular solutions, while at the same time leading to automatic model complexity selection [Bis07].
As it simplifies calculation we work with the precision matrix Λ Λ Λ = Σ Σ Σ −1 instead of the covariance matrix.
VB differs from EM in that the parameters are modeled as random variables. Suitable conjugate prior distributions are the Dirichlet distribution

p(\pi) = C(\alpha_0) \prod_{k=1}^{K} \pi_k^{\alpha_0 - 1}

for the mixture component weights, where C(α_0) is a normalization constant, and the Gaussian-Wishart distribution

p(\mu, \Lambda) = \prod_{k=1}^{K} \mathcal{N}(\mu_k \mid m_0, (\beta_0 \Lambda_k)^{-1}) \, \mathcal{W}(\Lambda_k \mid W_0, \nu_0)

for the means and precisions of the mixture components.
In the VB framework, learning the GMM is performed by finding the posterior distribution over the model parameters given the observed data. This posterior distribution can be found using VB inference as described in [Bis07].
VB is an iterative algorithm with steps analogous to the EM algorithm. Responsibilities are calculated as

r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K} \rho_{nj}}.

The quantities ρ_nk are given in the log domain by

\ln \rho_{nk} = \mathbb{E}[\ln \pi_k] + \tfrac{1}{2} \mathbb{E}[\ln |\Lambda_k|] - \tfrac{D}{2} \ln 2\pi - \tfrac{1}{2} \mathbb{E}_{\mu_k, \Lambda_k}\left[ (x_n - \mu_k)^T \Lambda_k (x_n - \mu_k) \right],

where

\mathbb{E}[\ln \pi_k] = \psi(\alpha_k) - \psi(\hat{\alpha})

and

\mathbb{E}[\ln |\Lambda_k|] = \sum_{i=1}^{D} \psi\!\left( \frac{\nu_k + 1 - i}{2} \right) + D \ln 2 + \ln |W_k|.

Here \hat{\alpha} = \sum_k \alpha_k and ψ is the derivative of the logarithm of the gamma function, also called the digamma function. The digamma function is available in SciPy as scipy.special.psi.
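For example, scipy.special.psi can be used to evaluate the expected log mixture weights under a Dirichlet posterior; the parameter values are arbitrary:

```python
import numpy as np
from scipy.special import psi

# Expected log mixture weights under a Dirichlet posterior with parameters
# alpha: E[ln pi_k] = psi(alpha_k) - psi(sum(alpha)).
alpha = np.array([5.0, 2.0, 1.0])
expected_log_pi = psi(alpha) - psi(alpha.sum())
```

By Jensen's inequality each exp(E[ln π_k]) is strictly below the corresponding posterior mean weight α_k / Σ_j α_j.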
The analogous M-step is performed using a set of equations similar to those found in EM. First the quantities

N_k = \sum_n r_{nk}, \quad \bar{x}_k = \frac{1}{N_k} \sum_n r_{nk} x_n, \quad S_k = \frac{1}{N_k} \sum_n r_{nk} (x_n - \bar{x}_k)(x_n - \bar{x}_k)^T

are calculated; these are then used to update the parameters of the posterior Dirichlet and Gaussian-Wishart distributions [Bis07]. Figure 6 shows a GMM learned using VB on the same data as in Figure 5. The initial number of components is again 16. Compared to Figure 5 we observe that VB results in a much sparser model while still capturing the structure of the data. In fact, the redundant components have all converged to their prior distributions and have been assigned a weight of 0, indicating that these components do not contribute towards "explaining" the data and can be pruned from the model. We also observe that outliers in the data (which are likely to be noise) are to a large extent ignored.
We have recently developed a Python VB class for scikits.learn. The class conforms to an interface similar to that of the EM class and will soon be available in the development version of scikits.learn.
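As a hedged aside, scikits.learn has since evolved into scikit-learn, where VB learning of GMMs is available as BayesianGaussianMixture (an assumption that it matches the class described here). A sketch of the component-pruning behavior described above, on synthetic two-cluster data with 16 initial components:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Two true clusters, but the model is given 16 components as in the text.
X = np.vstack([rng.normal(0.0, 0.5, (300, 2)),
               rng.normal(5.0, 0.5, (300, 2))])

vb = BayesianGaussianMixture(n_components=16, max_iter=500, random_state=0)
vb.fit(X)

# VB drives the weights of redundant components toward zero.
active = int(np.sum(vb.weights_ > 0.01))
print(active)  # far fewer than 16 components carry significant weight
```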

Experimental results
A basic AVASR system was implemented using SciPy as outlined in the previous sections.
In order to test the system we use the CUAVE database [Pat02]. The CUAVE database consists of 36 speakers, 19 male and 17 female, uttering isolated and continuous digits. The speakers are recorded facing the camera, in profile, and while moving. We only use the portion of the database where the speakers are stationary and facing the camera while uttering isolated digits. We use data from 24 speakers for training and the remaining 12 for testing; data from the speakers in the test set are thus not used for training. This allows us to evaluate how well the models generalize to speakers other than those used for training. A sample frame from each speaker in the dataset is shown in Figure 7.
In the experiment we build an individual AAM for each speaker by manually annotating every 50th frame. The visual features are then extracted by fitting the AAM to each frame in the video of the speaker.
Training the speech recognition system consists of learning acoustic and visual GMMs for each digit using samples from the training data. Learning is performed using VB inference. Testing is performed by classifying the test data. To evaluate the performance of the system we use the misclassification rate, i.e. the number of wrongly classified samples divided by the total number of samples.
We train acoustic and visual GMMs separately for each digit. The resulting class-conditional distributions (of the form (11)) for a sample (x_A, x_V) and digit class c are denoted by p_A(x_A | c) and p_V(x_V | c) for the acoustic and visual streams, respectively.
As we wish to test the effect of noise in the audio channel, acoustic noise ranging from -5 dB to 25 dB signal-to-noise ratio (SNR) in steps of 5 dB is added to the test data. We use additive white Gaussian noise with zero mean and variance

\sigma_n^2 = \sigma_s^2 \, 10^{-\mathrm{SNR}/10},

where σ_s^2 is the average power of the clean speech signal. The acoustic and visual GMMs are combined into a single classifier by exponentially weighting each GMM in proportion to an estimate of the information content in each stream. As the result no longer represents a probability we use the term score. For a given digit class c we get the combined audio-visual model

s(x_A, x_V \mid c) = p_A(x_A \mid c)^{\lambda_A} \, p_V(x_V \mid c)^{\lambda_V},    (36)

where the stream exponents are constrained by

0 \le \lambda_A, \lambda_V \le 1, \quad \lambda_A + \lambda_V = 1.    (39)

Note that the logarithm of (36) is a linear combination of log likelihoods. The stream exponents cannot be determined through maximum likelihood estimation, as this will always result in a solution in which the modality with the largest likelihood is assigned a weight of 1 and the other a weight of 0. Instead, we estimate the stream exponents discriminatively. As the number of classes in our experiment is relatively small, we perform this optimization using a brute-force grid search, directly minimizing the misclassification rate. Due to the constraint (39) it is only necessary to vary λ_A from 0 to 1; the corresponding λ_V is then given by 1 − λ_A. We vary λ_A from 0 to 1 in steps of 0.1. The values of λ_A and λ_V that result in the lowest misclassification rate are chosen as the optimal parameters.
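The discriminative grid search over the stream exponent can be sketched as follows; the per-class log likelihoods here are synthetic stand-ins for the GMM scores:

```python
import numpy as np

rng = np.random.default_rng(0)
n, C = 200, 10                       # test samples and digit classes
labels = rng.integers(0, C, n)

# Hypothetical per-class log likelihoods for both streams: the correct class
# gets a boost, and the audio stream is noisier than the visual one here.
ll_audio = rng.normal(0.0, 3.0, (n, C))
ll_video = rng.normal(0.0, 1.0, (n, C))
ll_audio[np.arange(n), labels] += 2.0
ll_video[np.arange(n), labels] += 2.0

def error_rate(lam_a):
    # Combined score: a linear combination of log likelihoods, i.e. an
    # exponential weighting of the stream probabilities.
    score = lam_a * ll_audio + (1.0 - lam_a) * ll_video
    return np.mean(score.argmax(axis=1) != labels)

# Brute-force grid search over lambda_A; lambda_V = 1 - lambda_A.
grid = np.arange(0.0, 1.01, 0.1)
errors = [error_rate(l) for l in grid]
best = grid[int(np.argmin(errors))]
```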
In the experiment we perform classification for each of the SNR levels using (36) and calculate the average misclassification rate. We compare audio-only, visual-only, and audio-visual classifiers. For the audio-only classifier the stream weights are λ_A = 1 and λ_V = 0, and for the visual-only classifier λ_A = 0 and λ_V = 1. For the audio-visual classifier the discriminatively trained stream weights are used. Figure 8 shows the average misclassification rates for the different models and noise levels.
From the results we observe that the visual channel does contain information relevant to speech, but that visual speech is not in itself sufficient for speech recognition. However, by combining acoustic and visual speech we are able to increase recognition performance above that of audio-only speech recognition, especially in the presence of acoustic noise.

Conclusion
In this paper we propose a basic AVASR system that uses MFCCs as acoustic features, AAM parameters as visual features, and GMMs for modeling the distribution of audio-visual speech feature data. We present the EM and VB algorithms as two alternatives for learning the audio-visual speech GMMs and demonstrate how VB is less affected than EM by overfitting while leading to automatic model complexity selection.
The AVASR system is implemented in Python using SciPy and tested using the CUAVE database. Based on the results we conclude that the visual channel does contain relevant speech information, but is not in itself sufficient for speech recognition. However, by combining features of visual speech with audio features, we find that AVASR gives better performance than audio-only speech recognition, especially in noisy environments.