HTML5 Speech API

Project Proposal

The HTML Speech Incubator group has proposed the implementation of speech technology in browsers in the form of uniform, cross-platform APIs that can be used to build rich web applications. The aim of this project is to implement these APIs as part of the Mozilla Firefox browser. I would like to split the project into two parts or phases:


1. Speech Input API

The Speech Input API aims to provide an alternative input method for web applications, without using a keyboard or other physical device. This API can be used to input commands, fill input elements, give directions, etc. There have been three different proposals for the implementation of this API. After discussing with my mentor, Olli, I learned that the SpeechRequest API proposal would be the most flexible (link: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Feb/att-0023/speechrequest.xml.html), and I would like to base my implementation on this proposal.

The API would consist of two main components:

A Media Capture API to capture a stream of raw audio. Its implementation is platform dependent. The plan is to first support the three major desktop platforms (Mac OS X, Linux and Windows) and eventually add support for mobile platforms like Android. This should be possible using the libsydneyaudio library, which is part of Gecko, or using some code from the Mozilla Rainbow project.

A streaming API to asynchronously stream microphone data to a speech recognition server and get the results back, similar to how XMLHttpRequest is implemented (see the sketch below). For the prototype, I will be using Google's speech server for recognition. Eventually, the API should be able to support local engines, remote engines, or a combination of both depending on the connection available. I have considered using CMU Sphinx (http://cmusphinx.sourceforge.net/) to implement a local speech engine. All of this will be implemented as a new XPCOM component.
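To make the intended shape of the streaming component concrete, here is a minimal TypeScript sketch of an XMLHttpRequest-style request object. The SpeechRequest name comes from the proposal, but the method and event names used here (open, sendAudio, end, onresult, onerror) are placeholders for illustration, not the proposal's finalized interface, and the body is only a stub that buffers audio locally instead of streaming it to a server.

```typescript
// Hypothetical sketch of the XMLHttpRequest-like streaming component.
// Method and event names are assumptions used only to illustrate the flow.

type RecognitionResult = { transcript: string; confidence: number };

class SpeechRequest {
  onresult: ((result: RecognitionResult) => void) | null = null;
  onerror: ((message: string) => void) | null = null;
  private chunks: Uint8Array[] = [];
  private serviceUri = "";

  // Point the request at a recognition service (remote or local engine).
  open(serviceUri: string): void {
    this.serviceUri = serviceUri;
  }

  // Stream a chunk of raw audio captured by the media capture component.
  sendAudio(chunk: Uint8Array): void {
    this.chunks.push(chunk);
    // A real implementation would forward the chunk to the service
    // asynchronously instead of buffering it locally.
  }

  // Signal end of speech; here we just deliver a mock recognition result.
  end(): void {
    if (this.onresult) {
      this.onresult({ transcript: "(recognized text)", confidence: 0.9 });
    }
  }
}

// Usage: open a session, stream captured audio, handle the result.
const request = new SpeechRequest();
request.open("https://example.org/recognize"); // placeholder endpoint
request.onresult = (r) => console.log(r.transcript, r.confidence);
request.sendAudio(new Uint8Array(0)); // raw audio bytes from the capture API
request.end();
```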


Security/Privacy issues

A speech input session should only be allowed with the user's consent, which could be obtained using a doorhanger notification. The user should be notified whenever audio is being recorded, possibly with a recording indicator in the browser UI itself, such as the URL bar or status bar.


API Design -

The API will look like the interface described in the proposal.

The developer should be able to specify a grammar (using SRGS) for the speech input, which is useful when the set of possible commands is limited. The recognition response would be in the form of an EMMA document. The developer should also be able to set a threshold for accuracy and sensitivity to improve performance, choose which speech engine to use, start and stop recognition, handle errors, and manage multiple requests as required (see the sketch below).
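As a purely illustrative sketch of these options, a page script might configure a request roughly as follows. The property and function names (createSpeechInput, grammarUri, confidenceThreshold, engine, and the grammar file path) are hypothetical placeholders, not names taken from the proposal.

```typescript
// Hypothetical developer-facing configuration for a speech input request.
// All property, function and file names are placeholders for illustration.

interface SpeechInputOptions {
  grammarUri?: string;          // SRGS grammar restricting recognizable input
  confidenceThreshold?: number; // reject results below this confidence
  sensitivity?: number;         // endpointing / microphone sensitivity
  engine?: string;              // chosen recognition engine (local or remote)
}

interface SpeechInputRequest {
  start(): void;                            // begin capturing and recognizing
  stop(): void;                             // end the input session
  onresult: (emmaDocument: string) => void; // recognition result as an EMMA document
  onerror: (error: string) => void;
}

// Assumed factory function exposed by the implementation; hypothetical.
declare function createSpeechInput(options: SpeechInputOptions): SpeechInputRequest;

function fillSearchBox(): void {
  const request = createSpeechInput({
    grammarUri: "grammars/search-commands.grxml", // hypothetical SRGS grammar
    confidenceThreshold: 0.5,
    engine: "default",
  });
  request.onresult = (emma) => {
    console.log("EMMA result document:", emma);
    request.stop();
  };
  request.onerror = (error) => console.error("speech input failed:", error);
  request.start();
}
```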


2. Text to Speech API

The Text to Speech API will be based on Google's proposal (http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Feb/att-0022/htmltts-draft.html). This API can be used for speech translation, turn-by-turn navigation, dialog systems, etc.

API Design -

The API will introduce a new element, <tts>, that extends HTMLMediaElement. It will be similar to how the <audio> and <video> tags are implemented.
A playback UI should allow the user to start, stop and disable text to speech. The current spoken word can be highlighted.
The developer should be able to specify the language and playback position, start and stop playback, and handle errors programmatically.
The API itself should be independent of the underlying speech synthesizer. If speech synthesis is not supported, appropriate fallback text should be displayed (see the sketch below).
For speech synthesis on Mac OS X, I can use the NSSpeechSynthesizer class in the Cocoa API. Windows could use its native speech SDK, and Linux has some open source speech engines. This hasn't been decided yet, and I have to discuss it further with my mentor.
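As a rough sketch of how a page might drive such an element from script, assuming <tts> inherits the usual HTMLMediaElement controls: the lang attribute, the error event and the fallback-text behaviour shown here are my reading of the draft's intent, not a confirmed API.

```typescript
// Hypothetical use of a <tts> element from script. Because <tts> would
// extend HTMLMediaElement, the standard media methods (play, pause) are
// assumed to apply; the attribute and event handling is illustrative only.

function speak(text: string, lang: string): void {
  const tts = document.createElement("tts") as HTMLMediaElement;
  tts.setAttribute("lang", lang); // language of the synthesized speech
  tts.textContent = text;         // fallback text if synthesis is unsupported
  tts.addEventListener("error", () => {
    console.warn("speech synthesis unavailable; showing fallback text instead");
  });
  document.body.appendChild(tts);
  void tts.play();                // start playback programmatically
  // tts.pause() and tts.currentTime would stop or reposition playback,
  // as with <audio> and <video>.
}

speak("Turn left in 200 meters", "en-US");
```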