Wednesday, December 16, 2009

Competitors in the speech to text area

Nuance just released a dictation tool (Dragon Dictation v1.0.1) for the iPhone. http://www.mobilecrunch.com/2009/12/16/nuance-updates-dragon-dictation-app-to-let-you-keep-your-contacts-secret/

Generic transcription of speech to text
Google Mobile App for the iPhone and Nuance are the only tools with favorable ratings (i.e., mostly 4s and 5s). You would expect Microsoft/Tellme to have an offering too, right? Strangely, as of Dec. 2009, they still don't have an app for the iPhone.
VoiceOnTheGo (http://www.voiceonthego.com) and Dial2Do (http://dial2do.com) are smaller players with similar offerings.

Keeping in touch with your friends and family (i.e., contacts)
Infinear http://www.infinear.com focuses on proper nouns. Names (your father's, your mother's, your children's, your friends', restaurants...) are EVEN more important to an average user than generic words. If I can put you in direct contact with a person whose name you can say, then there is no need for transcription. We store the entire audio on our servers. Every client (including your phone) can play back these messages for offline access. Hooking all this up via Yahoo Mail or GMail opens up the largest set of collaborative users on this planet: 267 million Yahoo users and 100 million GMail users.

Thursday, December 10, 2009

www.infinear.com An EC2 hosted speech driven solution

www.infinear.com is the official site of Sanjay Research's speech recognizer. It reads out websites, blogs and now Yahoo Mail too. You can listen to unread emails, reply to specific ones and even compose emails addressing recipients by name. All hands-free and driven by your voice. Check out http://www.infinear.com for more details.

In future blogs, we will discuss the technology powering infinear.com and the process of hosting an Asterisk driven solution on Amazon EC2. Enjoy...

Saturday, August 22, 2009

August 2009 update on US handset sales

http://news.cnet.com/8301-1035_3-10309605-94.html?tag=mncol;mlt_related
The latest numbers support my claim from a few weeks back: the handset market in the US is bipolar too. RIM and Apple dominate the high-end smartphone segment. The midrange market is ceding to the high end. The low end matters to price-conscious consumers. Together, the article claims, these 2 categories will account for 80% of the market by 2014.

The billion-dollar unanswered question is: of that 80%, how much is accounted for by the high end? I will take a shot at this. AT&T just announced that data plans are mandatory for all smartphones. Clearly, AT&T is pushing iPhone apps tied to data plans and is building out its network to accommodate future growth in this data space. But this segment (in spite of huge profits for Apple and AT&T) will not grow beyond 5% of the handset market. If AT&T unlocks iPhones and introduces competition for voice+data plans, this will grow to 25-40%. This will happen in 2012.

Between 2009 and 2012, the handset market belongs to the low end. Margins are high in the apps space, but there are so few of them. If user experience improves by orders of magnitude, the low end can join the party!! And party on till 2014 and beyond....

Sanjay Research will focus on the low end with its voice offerings. And take a slice of the apps space hopefully.

Tuesday, August 11, 2009

Technology behind myMusik.us

Google's 411 service 1-800-goog-411 is a very interesting deployment of speech recognition. It is Google's entry into speech transcription and speech driven search. http://research.google.com/archive/goog411.pdf is a good description of its internals. The myMusik.us architecture shares a lot of similarities and I would like to compare and contrast the key attributes:
  • This is the entire description of the core pieces of the Google framework from the above paper: "The speech recognition engine is a standard, large-vocabulary recognizer, with PLP features and LDA, GMM-based triphone HMMs, decision trees, STC [11] and an FST-based search [12]."
[11] M.J.F. Gales, “Semi-tied covariance matrices for hidden markov models,” Proc. IEEE Trans. SAP, May 2000.
[12] “OpenFst Library,” http://www.openfst.org.

myMusik.us uses a standard small-vocabulary recognizer, with MFCC features, GMM-based triphone HMMs, 1000 tied states and a normal Viterbi-based search.

  • Google uses a training set that has > 1 million utterances.
myMusik.us has invested in speech training algorithms to reduce training times. We currently have 10,000 utterances for our trainer, growing at the rate of about 10,000 every 6 months. Our models converge faster than Google's models.

  • Google's focus is a large vocabulary.
myMusik.us recognizes that the fundamental limitations of 2009 speech recognition techniques prevent any deployment from recognizing the sentence "Sanjay wants a cuppa chai at the Barrista on M.G. Road" :-) This is not going to happen anytime soon. Our focus is a small vocabulary indicated by the user, or a small vocabulary indicated by the domain space. We are not interested in large vocabulary speech recognition problems. Switching between vocabularies to adjust to context is our forte.
  • http://research.google.com/roundtable/ has a nice video on a Manager's view of Google's speech recognizer. A common theme in the video is the fact that Google uses web query data in developing its language models. The relationship between words and the context in which a word occurs influences Google's language models. The larger the corpus, the better the models.
myMusik.us has invested in algorithms for good endpointing (i.e., silence detection). So, our language models are just long lists of individual words with no relationships between them. Relationships matter in spoken transcriptions. They do not matter when the intent is to get the message across and get the information back as quickly as possible.
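
To make the endpointing idea concrete, here is a minimal sketch of energy-based silence detection in Python. This is NOT the myMusik.us algorithm (the details of that are not described here). It is just the simplest textbook approach: cut the audio into short frames, measure the energy of each frame, and close a speech segment once enough quiet frames go by in a row. The function names, the 160-sample frame (20 ms at an assumed 8 kHz telephone sampling rate), the energy threshold and the 3-frame silence window are all made-up values for illustration.

```python
import math


def tone(n_samples, amplitude=0.5, freq=440.0, rate=8000):
    """Hypothetical helper: a sine tone standing in for speech."""
    return [amplitude * math.sin(2 * math.pi * freq * n / rate)
            for n in range(n_samples)]


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(s * s for s in frame) / frame_len)
    return energies


def find_endpoints(samples, frame_len=160, threshold=0.01,
                   min_silence_frames=3):
    """Return (start_sample, end_sample) for each detected segment.

    A segment opens at the first frame whose energy reaches the
    threshold and closes once min_silence_frames quiet frames in a
    row have been seen. All defaults are illustrative assumptions.
    """
    energies = frame_energies(samples, frame_len)
    segments = []
    start = None       # frame index where the current segment began
    silence_run = 0    # consecutive quiet frames inside a segment
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_silence_frames:
                end = i - silence_run + 1  # first quiet frame
                segments.append((start * frame_len, end * frame_len))
                start = None
                silence_run = 0
    if start is not None:
        # Audio ended mid-segment: trim any trailing quiet frames.
        end = len(energies) - silence_run
        segments.append((start * frame_len, end * frame_len))
    return segments


if __name__ == "__main__":
    frame = 160
    signal = ([0.0] * (5 * frame) + tone(10 * frame) +
              [0.0] * (5 * frame) + tone(4 * frame) + [0.0] * frame)
    print(find_endpoints(signal))
```

With endpoints like these, each segment can be treated as one isolated word and matched against a flat word list, which is why word-to-word relationships are not needed in this setup. A real deployment would need something sturdier than a fixed threshold (background noise on phone lines varies a lot), so treat the numbers above as placeholders.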

Saturday, August 8, 2009

Hey, it's mid-2009 and the mobile world looks like it's finally entering the information age!! About time, eh?? You have read about the world going gaga over the Apple iPhone, the Palm Pre or the RIM BlackBerry. That's 3% of the worldwide shipment of phones, accounting for 35% of profit. Deutsche Bank's Brian Modoff has analyzed these trends to profitability quite well: http://www.wikio.com/themes/Brian+Modoff

These figures focus on the manufacturers. What about the consumers? Where are they? What do they want? Where are they headed? What's in it for Sanjay Research? I see a definite bipolar worldwide market: a high-end (>$100) smartphone market mostly adopted by the US, Europe, Japan, S. Korea and pockets of the MidEast. The rest of the world (especially the fastest-growing market, India) will stay in the sub-$50 phone range http://trendsniff.com/2009/02/22/mobile-subscribers-china-india-2009/