Because I was dealing with noisy data, the features needed to be robust to the ambient noise in the signal. For this reason I used 13 mel frequency cepstral coefficients and 13 delta cepstral coefficients. In addition, one major step taken to improve the final recognition was the implementation of spectral subtraction and endpoint detection. The various feature extraction techniques are discussed below:
Blocking into frames:
A section of N (256) consecutive speech samples is used as a single frame. Consecutive frames are spaced M (128) samples apart, giving a 50% overlap between adjacent frames.
x_l(n) = \tilde{s}(M l + n), \quad 0 \le n \le N-1, \quad 0 \le l \le L-1
N = number of samples in a frame.
M = number of samples between the starts of consecutive frames (a measure of overlap).
L = total number of frames.
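A minimal Python sketch of the blocking step is given here for illustration. The function name block_into_frames and the choice to drop trailing samples that do not fill a whole frame are assumptions, not taken from the project code.

```python
def block_into_frames(samples, frame_len=256, frame_shift=128):
    """Split samples into overlapping frames.

    Frame l holds samples[M*l : M*l + N], with N = frame_len and
    M = frame_shift. Trailing samples that cannot fill a whole frame
    are dropped (an assumption made for this sketch).
    """
    if len(samples) < frame_len:
        return []
    num_frames = 1 + (len(samples) - frame_len) // frame_shift  # L
    return [samples[frame_shift * l: frame_shift * l + frame_len]
            for l in range(num_frames)]


# Usage: 1024 dummy samples give 7 frames of 256 samples each.
signal = list(range(1024))
frames = block_into_frames(signal)
print(len(frames), len(frames[0]))  # 7 256
```

With N = 256 and M = 128, the second half of each frame is identical to the first half of the next one, which is the 50% overlap described above.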
Spectral subtraction and endpoint detection:
The whole speech signal was first passed to this routine, which performed endpoint detection by checking whether the frame energy was more than 2.5 times a noise threshold. This threshold was the average energy in the first four and last four frames of the signal, which are assumed to be void of the command word. The threshold is applied from the beginning of the signal and from the end of the signal, giving two frame numbers between which the whole utterance lies.
We first calculate the energy per frame, i.e.
P[i] = \sum_{k=1}^{j} s[k]^2, \quad i = l, \dots, m
where s[k] are the speech samples in frame i and j is the number of samples in the frame. P is calculated in the same way for all the frames, and an average is taken to give the final noise value E:
E = \frac{1}{m} \sum_{k=l}^{m} P[k]^2
The detection threshold is set at (constant * E).
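The endpoint search can be sketched in Python as follows. This is an illustration under stated assumptions rather than the project's routine: the names frame_energy and detect_endpoints are hypothetical, the noise value is taken as the plain mean of the frame energies in the first four and last four frames, and the constant is taken to be 2.5.

```python
def frame_energy(frame):
    """Energy of one frame: the sum of its squared samples."""
    return sum(x * x for x in frame)


def detect_endpoints(frames, noise_frames=4, factor=2.5):
    """Return (first_frame, last_frame, noise_energy).

    The noise energy is the mean energy of the first and last
    `noise_frames` frames (assumed to contain no speech). A frame counts
    as speech when its energy exceeds factor * noise_energy.
    """
    energies = [frame_energy(f) for f in frames]
    noise = energies[:noise_frames] + energies[-noise_frames:]
    noise_energy = sum(noise) / len(noise)
    threshold = factor * noise_energy
    above = [i for i, e in enumerate(energies) if e > threshold]
    if not above:
        return None, None, noise_energy  # no frame exceeded the threshold
    # First hit scanning forward, first hit scanning backward.
    return above[0], above[-1], noise_energy


# Usage: four quiet frames, three loud frames, four quiet frames.
quiet, loud = [0.1] * 4, [1.0] * 4
start, end, noise_energy = detect_endpoints([quiet] * 4 + [loud] * 3 + [quiet] * 4)
print(start, end)  # 4 6
```

The two returned frame numbers bracket the utterance, and the noise energy is kept so that it can be handed on to the next stage.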
After endpoint detection, the command word is cut out of the whole signal using the two frame numbers above. The average noise energy and this truncated signal are passed to the spectral subtraction routine, which calculates a 512-point FFT for each 256-sample frame and performs spectral subtraction using the noise FFT directly in the frequency domain.
X(f) = Y(f) \cdot H_{ss}(f)
H_{ss}(f) = \sqrt{1 - \frac{1}{SNR(f)}}
SNR(f) = \frac{|Y(f)|^2}{|N(f)|^2}
where Y(f) = 512-point FFT of a signal frame, and N(f) = 512-point FFT of the noise frame passed in.
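The gain computation can be sketched in Python as below. The function name spectral_subtraction is hypothetical, and two details are assumptions that the description above does not specify: bins where the noise power exceeds the signal power (SNR below 1) have their gain clamped to zero, and bins with zero signal power are passed through as zero. The FFT itself is left out; the inputs are taken to be the already computed complex bins.

```python
import math


def spectral_subtraction(Y, N):
    """Apply X(f) = Y(f) * Hss(f) bin by bin.

    Y and N are equal-length sequences of complex FFT bins for one
    signal frame and for the noise frame.
    """
    X = []
    for y, n in zip(Y, N):
        signal_power = abs(y) ** 2
        noise_power = abs(n) ** 2
        if signal_power == 0.0:
            X.append(0j)  # nothing to scale in an empty bin
            continue
        gain_sq = 1.0 - noise_power / signal_power  # 1 - 1/SNR(f)
        gain = math.sqrt(max(gain_sq, 0.0))  # assumed clamp when SNR < 1
        X.append(y * gain)
    return X


# Usage: one bin above the noise, one below it, one empty.
X = spectral_subtraction([2 + 0j, 1 + 0j, 0j], [1 + 0j, 2 + 0j, 1 + 0j])
```

Because the gain is real and non-negative, only the magnitude of each bin is reduced and the phase of Y(f) is left unchanged.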
Mel frequency cepstral coefficients:
The X(f) of each frame is used to calculate the mel frequency cepstral coefficients. The cepstral coefficients, which are the coefficients of the Fourier transform representation of the log magnitude spectrum, have been shown to be a more robust and reliable feature set for speech recognition than the LPC coefficients. Because the low-order cepstral coefficients are sensitive to overall spectral slope and the high-order cepstral coefficients are sensitive to noise, it has become a standard technique to weight the cepstral coefficients by a tapered window so as to minimize these sensitivities.
For this project, 13 mel frequency cepstral coefficients are generated per frame and used as the feature vector. After the cepstral coefficients were generated, cepstral mean normalisation (CMN) was applied to get rid of the bias present across the coefficients. The details of the calculations are outside the scope of this project and are not covered here. Please feel free to contact me if you need more information.
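As a rough illustration of CMN only (not of the MFCC computation itself), here is a short Python sketch. It assumes the usual form of the technique, in which the mean of each coefficient over all frames of the utterance is subtracted from that coefficient; the function name cepstral_mean_normalise is hypothetical.

```python
def cepstral_mean_normalise(cepstra):
    """Subtract the per-coefficient mean, taken over all frames.

    cepstra is a non-empty list of frames, each a list of cepstral
    coefficients of the same length (13 in this project).
    """
    num_frames = len(cepstra)
    num_coeffs = len(cepstra[0])
    means = [sum(frame[c] for frame in cepstra) / num_frames
             for c in range(num_coeffs)]
    return [[frame[c] - means[c] for c in range(num_coeffs)]
            for frame in cepstra]


# Usage: two frames with two coefficients each.
normalised = cepstral_mean_normalise([[1.0, 2.0], [3.0, 6.0]])
print(normalised)  # [[-1.0, -2.0], [1.0, 2.0]]
```

After this step every coefficient has zero mean across the utterance, so a constant offset added to all frames no longer affects the features.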
Delta mel frequency cepstral coefficients:
For even more robustness to noise, I take the delta cepstral coefficients of the CMN cepstral coefficients above. I use a window of four coefficients about the central coefficient to calculate the new feature.
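One common way to compute delta coefficients is sketched below in Python. It is an assumption that the window of four means two frames on either side of the central frame, and that the widely used regression formula applies: the delta at frame t is the sum over k of k * (c[t+k] - c[t-k]), divided by 2 * sum of k^2. Edge frames are handled by repeating the first and last frame, which is also an assumption. The function name delta_coefficients is hypothetical.

```python
def delta_coefficients(cepstra, half_window=2):
    """Regression-based deltas over frames t-half_window .. t+half_window."""
    num_frames = len(cepstra)
    num_coeffs = len(cepstra[0])
    denom = 2 * sum(k * k for k in range(1, half_window + 1))
    deltas = []
    for t in range(num_frames):
        row = []
        for c in range(num_coeffs):
            acc = 0.0
            for k in range(1, half_window + 1):
                # Clamp indices so edge frames reuse the first/last frame.
                ahead = cepstra[min(t + k, num_frames - 1)][c]
                behind = cepstra[max(t - k, 0)][c]
                acc += k * (ahead - behind)
            row.append(acc / denom)
        deltas.append(row)
    return deltas


# Usage: a coefficient that rises by 1 per frame has a delta of 1
# away from the edges.
ramp = [[float(t)] for t in range(6)]
deltas = delta_coefficients(ramp)
print(deltas[2], deltas[3])  # [1.0] [1.0]
```

Appending each frame's 13 deltas to its 13 normalised cepstral coefficients gives one combined vector per frame.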
Thus, in the end, a feature vector of 26 coefficients is used for the final recognition, i.e. 13 mel frequency cepstral coefficients and 13 delta cepstral coefficients.