Tuesday, August 11, 2009

Technology behind myMusik.us

Google's 411 service, 1-800-GOOG-411, is a very interesting deployment of speech recognition: it is Google's entry into speech transcription and speech-driven search. http://research.google.com/archive/goog411.pdf gives a good description of its internals. The myMusik.us architecture shares many similarities, and I would like to compare and contrast the key attributes:
  • This is the entire description of the core pieces of the Google framework from the above paper: "The speech recognition engine is a standard, large-vocabulary recognizer, with PLP features and LDA, GMM-based triphone HMMs, decision trees, STC [11] and an FST-based search [12]."
[11] M.J.F. Gales, "Semi-tied covariance matrices for hidden Markov models," IEEE Trans. Speech and Audio Processing, May 2000.
[12] "OpenFst Library," http://www.openfst.org.

myMusik.us uses a standard small-vocabulary recognizer, with MFCC features, GMM-based triphone HMMs, 1000 tied states and a normal Viterbi-based search.
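To make the "Viterbi-based search" concrete, here is a minimal textbook Viterbi decoder over a discrete HMM. This is my own illustrative sketch, not code from myMusik.us or Google; the function and parameter names are invented, and a real recognizer would decode over Gaussian-mixture emission densities rather than a lookup table.

```python
import math

def viterbi(observations, states, log_init, log_trans, log_emit):
    """Find the most likely state sequence for an observation sequence.

    log_init[s]     : log P(state s at t=0)
    log_trans[p][s] : log P(p -> s)
    log_emit[s][o]  : log P(observation o | state s)
    (Illustrative names; a real acoustic model scores frames with GMMs.)
    """
    # Initialise the first column of the trellis with the first observation.
    V = [{s: log_init[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        col, ptr = {}, {}
        for s in states:
            # Best predecessor state for s at this time step.
            best_prev = max(states, key=lambda p: V[-1][p] + log_trans[p][s])
            col[s] = V[-1][best_prev] + log_trans[best_prev][s] + log_emit[s][obs]
            ptr[s] = best_prev
        V.append(col)
        back.append(ptr)
    # Trace the best path back from the highest-scoring final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    return list(reversed(path)), V[-1][last]
```

With a small vocabulary and ~1000 tied states, this kind of exact trellis search stays cheap, which is exactly why a plain Viterbi pass suffices where Google's large-vocabulary system needs an FST-based search.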

  • Google uses a training set that has > 1 million utterances
myMusik.us has invested in speech training algorithms to reduce training times. We currently have 10,000 utterances for our trainer, growing at a rate of about 10,000 every 6 months. Our models converge faster than Google's models.

  • Google's focus is a large vocabulary.
myMusik.us recognizes that the fundamental limitations of 2009 speech recognition techniques prevent any deployment from recognizing the sentence "Sanjay wants a cuppa chai at the Barrista on M.G. Road" :-) That is not going to happen anytime soon. Our focus is a small vocabulary indicated by the user, or a small vocabulary indicated by the domain space. We are not interested in large-vocabulary speech recognition problems; switching between vocabularies to adjust to context is our forte.
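The vocabulary-switching idea above can be sketched as follows. This is a hypothetical illustration of the approach, not the myMusik.us implementation; the class, contexts, and word lists are all invented for the example.

```python
# Hypothetical sketch: the recognizer keeps several small word lists and
# re-points its decoder at a different one when the dialogue context
# changes, instead of attempting large-vocabulary recognition.
VOCABULARIES = {
    "playback": ["play", "pause", "stop", "next", "previous"],
    "search":   ["artist", "album", "song", "genre"],
}

class ContextualRecognizer:
    def __init__(self, vocabularies, initial_context):
        self.vocabularies = vocabularies
        self.context = initial_context

    @property
    def active_vocabulary(self):
        # Only the words for the current context are in the search space.
        return self.vocabularies[self.context]

    def switch(self, context):
        # Swap the active small vocabulary to match the new context.
        if context not in self.vocabularies:
            raise KeyError("unknown context: %s" % context)
        self.context = context

    def in_grammar(self, word):
        return word in self.active_vocabulary
```

Keeping each active vocabulary tiny is what makes the recognition problem tractable: the decoder only ever discriminates among a handful of words at a time.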
  • http://research.google.com/roundtable/ has a nice video giving a manager's view of Google's speech recognizer. A common theme in the video is that Google uses web query data in developing its language models. The relationship between words and the context in which a word occurs influences Google's language models. The larger the corpus, the better the models.
myMusik.us has instead invested in algorithms for good endpointing (i.e., silence detection), so our language models are just long lists of individual words with no relationships. Relationships matter in spoken transcription; they do not matter when the intent is to get the message across and retrieve the information as quickly as possible.
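A classic way to do endpointing is short-time energy thresholding. The sketch below is my own minimal illustration of that general technique, assuming framed audio samples; it is not the myMusik.us algorithm, and the thresholds are invented.

```python
# Minimal energy-based endpointing sketch: a frame counts as "speech"
# when its average energy exceeds a threshold, and the utterance is
# considered ended after a long enough run of silent frames.
def endpoint(frames, energy_threshold=0.01, max_silent_frames=5):
    """Return (start, end) frame indices of the detected utterance,
    or None if no speech is found. `frames` is a list of sample lists."""
    def energy(frame):
        return sum(s * s for s in frame) / len(frame)

    start = end = None
    silent_run = 0
    for i, frame in enumerate(frames):
        if energy(frame) > energy_threshold:
            if start is None:
                start = i          # first speech frame: utterance begins
            end = i                # extend the utterance
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run > max_silent_frames:
                break              # long enough silence: endpoint reached
    return (start, end) if start is not None else None
```

Once the utterance is cleanly endpointed, each word can be matched against a flat word list in isolation, which is why no word-to-word relationship modeling is needed in this design.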
