Hidden Markov Model

12-state, left-to-right Hidden Markov Model
Description:
The HMM for this project was taken from the H2M Matlab Toolbox by Olivier Cappé and modified to suit my needs. Since my vocabulary consisted of four command words, four HMMs were used, one per command. Each utterance had silence at the beginning and at the end, and the utterance itself was made up of two separate parts, i.e. "Lights" or "Fan" followed by "On" or "Off". A 12-state model was used to accommodate the different timing scenarios across these four major sections of the speech. In addition, the longest utterance, "Lights On", needs about 7-8 states when split up, so 12 was a reasonable compromise. A Viterbi forward pass was used to calculate the maximum likelihood of each utterance under each model, and the best-scoring model was chosen. As this HMM was taken off the shelf, I will not go into the details of how it works.
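As a rough illustration of the scoring step described above (this is not the H2M code), the sketch below builds a 12-state left-to-right transition matrix and picks the model with the highest Viterbi log score. Several things here are assumptions made only for the example: the 0.5 self-transition probability, diagonal Gaussian emissions, the two-dimensional toy features, the command labels, and all function names.

```python
import math

NEG_INF = float("-inf")

def left_to_right_transitions(n_states=12, stay_prob=0.5):
    """Each state may only stay put or advance to the next state (assumed topology)."""
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states - 1):
        A[i][i] = stay_prob
        A[i][i + 1] = 1.0 - stay_prob
    A[-1][-1] = 1.0  # the last state can only loop on itself
    return A

def log_gaussian_diag(frame, mean, var):
    """Log density of one feature frame under a diagonal-covariance Gaussian."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (x - m) ** 2 / v
        for x, m, v in zip(frame, mean, var)
    )

def safe_log(p):
    """log(p), with log(0) mapped to -inf for forbidden transitions."""
    return math.log(p) if p > 0.0 else NEG_INF

def viterbi_log_score(frames, A, means, variances):
    """Log probability of the single best state path, starting in state 0."""
    n = len(A)
    log_A = [[safe_log(p) for p in row] for row in A]
    delta = [NEG_INF] * n
    delta[0] = log_gaussian_diag(frames[0], means[0], variances[0])
    for frame in frames[1:]:
        delta = [
            max(delta[i] + log_A[i][j] for i in range(n))
            + log_gaussian_diag(frame, means[j], variances[j])
            for j in range(n)
        ]
    return max(delta)

def classify(frames, models):
    """Score the utterance under every model and return the best label."""
    scores = {
        word: viterbi_log_score(frames, *params) for word, params in models.items()
    }
    return max(scores, key=scores.get), scores

# Toy models: one per command, differing only in their (made-up) state means.
words = ["Lights On", "Lights Off", "Fan On", "Fan Off"]  # assumed labels
A = left_to_right_transitions()
models = {
    w: (A, [[float(k), float(k)]] * 12, [[1.0, 1.0]] * 12)
    for k, w in enumerate(words)
}
frames = [[2.0, 2.0]] * 20  # toy "utterance" that matches the third model
best_word, scores = classify(frames, models)
print(best_word)  # "Fan On" for this toy input
```

Working in the log domain avoids numerical underflow on long utterances, and the zero entries of the transition matrix are what enforce the left-to-right structure.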
Training:
For training, I generated the training set in two ways to compare the accuracy in each case:
- Training set 1: 3 utterances of each of the four command words were recorded from 7 different directions, i.e. 0°, 30°, 60°, 90°, -30°, -60°, -90°, in a quiet environment directly via Matlab.
- Training set 2: 12 utterances of each of the four command words were recorded in a broadside position, with babble noise in the background coming from different directions but from a fixed location. The SNR was purposely kept very low to test the performance of the adaptive beamforming algorithm. Of these 12 utterances per class, 8 were used for training the HMMs. The recording was done using Windows Media Player to capture the signal from both microphones.
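To make the "very low SNR" condition concrete, here is a minimal sketch of how the signal-to-noise ratio of a recording could be quantified when the speech and noise components are available separately. The actual SNR value used in the recordings is not stated, so the -5 dB target, the synthetic sine signals, and the function names are all placeholders chosen for illustration.

```python
import math

def mean_power(signal):
    """Average power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)

def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_speech / P_noise)."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))

def scale_noise_to_snr(speech, noise, target_snr_db):
    """Return a scaled copy of the noise that yields the requested SNR."""
    gain = 10.0 ** ((snr_db(speech, noise) - target_snr_db) / 20.0)
    return [gain * x for x in noise]

# Synthetic stand-ins for a speech segment and a noise segment.
speech = [math.sin(0.05 * n) for n in range(8000)]
noise = [0.1 * math.sin(1.7 * n + 0.3) for n in range(8000)]

low_snr_noise = scale_noise_to_snr(speech, noise, target_snr_db=-5.0)
print(round(snr_db(speech, low_snr_noise), 3))
```

A negative SNR in dB means the noise carries more power than the speech, which is the kind of condition under which a beamformer's benefit is easiest to observe.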
Babble noise was chosen as the background noise because it is very similar to the actual ambient noise found in a cocktail-party-like environment; moreover, the algorithm performed better with pink noise as the background noise. The experimental setup therefore mimics a real-life situation as closely as it can. Using the two training sets described above, the parameters A, mu, and Sigma were calculated for all four models and stored separately.
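The partition of the second training set (12 utterances per class, 8 for training and the remaining 4 held out for testing) could be sketched as follows. How the 8 training utterances were actually selected is not stated, so taking the first 8 is an assumption, and the file names and command labels are hypothetical.

```python
def split_utterances(utterances_by_word, n_train=8):
    """Split each class into a training part and a held-out test part.

    Assumes the first n_train utterances of each class are used for training.
    """
    train, test = {}, {}
    for word, utterances in utterances_by_word.items():
        train[word] = utterances[:n_train]
        test[word] = utterances[n_train:]
    return train, test

# Hypothetical recording names: 12 per command.
words = ["lights_on", "lights_off", "fan_on", "fan_off"]
recordings = {w: [f"{w}_{i:02d}.wav" for i in range(1, 13)] for w in words}

train_set, test_set = split_utterances(recordings)
for w in words:
    print(w, len(train_set[w]), len(test_set[w]))
```

Keeping the split per class ensures that every model is trained and tested on the same number of utterances.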
Testing:
For testing, the following two tests were considered:
- Corresponding to training set 1, the test data was obtained by using part of training set 2.
  - This data was tested without cleaning the noise, and the accuracy was noted.
  - The same test data was then cleaned using the beamforming algorithm and tested again to obtain the new recognition accuracy. This was done because, in practice, the HMM has to be trained beforehand on some data and then used under different noise and environmental conditions.
- Corresponding to training set 2, the test data consisted of the remaining 4 utterances per class.
  - The test data was not cleaned with the beamforming algorithm, and the recognition accuracy was calculated.
  - The test data was cleaned with the beamforming algorithm, and the recognition accuracy was noted.
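The recognition accuracy noted in each of the cases above is simply the fraction of test utterances assigned to the correct command. A minimal sketch of that bookkeeping is shown below; the labels in the example are placeholders chosen to exercise the functions and are not results from the experiments.

```python
def recognition_accuracy(true_labels, predicted_labels):
    """Fraction of utterances whose predicted command matches the true one."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one utterance is required")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

def confusion_counts(true_labels, predicted_labels):
    """Count how often each true command was recognised as each command."""
    counts = {}
    for t, p in zip(true_labels, predicted_labels):
        counts[(t, p)] = counts.get((t, p), 0) + 1
    return counts

# Placeholder labels only, NOT measured data.
truth = ["lights_on", "lights_off", "fan_on", "fan_off"]
guess = ["lights_on", "lights_on", "fan_on", "fan_off"]

print(recognition_accuracy(truth, guess))  # 0.75 for these placeholder labels
print(confusion_counts(truth, guess))
```

Running the same function on the uncleaned and the beamformed versions of a test set gives the pair of accuracies that each test compares.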
This covers all the ways in which the training and testing data sets were generated. The final results, speech samples, and graphs are presented in the results and conclusions page.