SpeechAPI
Introduction
This project is an extension of the GSoC Speech project. It adds support for voice commands inside the Firefox browser and has led to an extension of the SpeechRecognition API as well as the text-to-speech API.
Initial contributors:
Roshan Vidyashankar and Anant Narayanan are the initial contributors to the Speech Project.
Contributors for this extended SpeechAPI project are:
- Rohan Dalvi
- Harshank Vengurlekar
- Jagannath Ramesh
Technical Stuff
Speech Input API
The speech input API aims to provide an alternative input method for web applications, without using a keyboard or other physical device. This API can be used to input commands, fill input elements, give directions, and so on. It is based on SpeechRequest[[1]].
The API consists of two main components:
- A Media Capture API to capture a stream of raw audio. Its implementation is platform dependent. Mac, Windows and Linux are to be supported first, with Android support to follow.
- A streaming API to asynchronously stream microphone data to a speech recognition server and get the results back. This will be similar to how XMLHttpRequest is implemented. The API should support local engines, remote engines, or a combination of both, depending on the network connection available.
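The choice between local and remote engines could be sketched roughly as follows. This is only an illustration: the names `EngineKind`, `EngineAvailability` and `chooseEngines` are hypothetical and do not come from the SpeechRequest proposal, and preferring the remote engine when online is an assumption.

```typescript
// Hypothetical sketch: all names here are made up for illustration.
type EngineKind = "local" | "remote";

interface EngineAvailability {
  online: boolean;         // is a network connection available?
  localInstalled: boolean; // is a local recognizer installed?
}

// Return the engines to try, in order of preference.
// Assumption: a remote engine is preferred whenever the network is up,
// with the local engine kept as a fallback.
function chooseEngines(a: EngineAvailability): EngineKind[] {
  const engines: EngineKind[] = [];
  if (a.online) {
    engines.push("remote");
  }
  if (a.localInstalled) {
    engines.push("local");
  }
  return engines;
}
```

An empty result means no recognition is possible, which a caller would report as an error rather than silently ignoring.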
Security/Privacy issues
- A speech input session should be allowed only with the user's consent. This could be provided using a doorhanger notification.
- The user should be notified when audio is being recorded, possibly with a record symbol in the browser UI itself, such as in the URL bar or status bar.
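The two requirements above can be modelled as a small state machine. The sketch below is an assumption about how such a gate might look, not the actual implementation; `SpeechSession` and its methods are hypothetical names.

```typescript
// Hypothetical sketch of consent gating; not part of any real browser API.
type Consent = "unknown" | "granted" | "denied";

class SpeechSession {
  private consent: Consent = "unknown";
  private recording = false;

  // Called with the user's answer to the permission prompt
  // (for example, a doorhanger notification).
  setConsent(granted: boolean): void {
    this.consent = granted ? "granted" : "denied";
    if (!granted) {
      this.recording = false; // revoking consent also stops recording
    }
  }

  // Start recording only if the user has agreed; returns whether it started.
  start(): boolean {
    if (this.consent !== "granted") {
      return false;
    }
    this.recording = true;
    return true;
  }

  stop(): void {
    this.recording = false;
  }

  // The browser UI would show the record symbol while this returns true.
  isIndicatorVisible(): boolean {
    return this.recording;
  }
}
```

Tying the indicator to the recording flag itself means the symbol cannot be shown or hidden independently of the actual recording state.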
API Design -
The API will look like the interface described in the SpeechRequest proposal.
- The developer should be able to specify a grammar (using SRGS) for the speech, which is useful when the set of possible commands is limited. The recognition response would be in the form of an EMMA document.
- The developer should be allowed to set a threshold for accuracy and sensitivity to improve performance.
- The developer should be able to choose what speech engine to use.
- The developer should be able to start and stop recognition, handle errors, and manage multiple requests as required.
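To make the design points above concrete, here is a minimal sketch of what per-request options and an accuracy threshold might look like. The interfaces `RecognitionOptions` and `Hypothesis` and the function `acceptHypotheses` are hypothetical; the real interface is the one described in the SpeechRequest proposal, and the 0 to 1 confidence scale is an assumption.

```typescript
// Hypothetical sketch; field names are illustrative, not from SpeechRequest.
interface RecognitionOptions {
  grammarUrl?: string;         // location of an SRGS grammar, if any
  engine?: string;             // identifier of the preferred speech engine
  confidenceThreshold: number; // assumed scale: 0 (accept all) to 1
}

interface Hypothesis {
  utterance: string;
  confidence: number; // assumed scale: 0 to 1
}

// Keep only hypotheses at or above the threshold, best first.
// The input array is left unmodified.
function acceptHypotheses(
  results: Hypothesis[],
  options: RecognitionOptions
): Hypothesis[] {
  return results
    .filter((h) => h.confidence >= options.confidenceThreshold)
    .sort((a, b) => b.confidence - a.confidence);
}
```

A higher threshold trades recall for precision, which suits a small command grammar where a wrong match is worse than asking the user to repeat.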
Text To Speech API
The text to speech API will be based on Google's proposal([2]). This API can be used for speech translation, turn-by-turn navigation, dialog systems, etc.
API Design -
- The API will introduce a new element <tts> that extends HTMLMediaElement. It will be similar to how the <audio> and <video> tags are implemented.
- A playback UI should allow the user to start, stop and disable text to speech. The current spoken word can be highlighted.
- The developer should be able to specify the language and position, start and stop playback, and handle errors programmatically.
- The API should itself be independent of the underlying speech synthesizer. If speech synthesis is not supported, appropriate text should be displayed.
- Which speech engines to use is yet to be decided.
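Keeping the API independent of the synthesizer, with a text fallback when synthesis is unavailable, could look something like the sketch below. It is an illustration under assumptions: `Synthesizer`, `TtsOutcome` and `speakOrFallback` are hypothetical names and say nothing about how the <tts> element would actually be implemented.

```typescript
// Hypothetical sketch; any real engine would sit behind this interface.
interface Synthesizer {
  supports(lang: string): boolean;
  speak(text: string, lang: string): void;
}

type TtsOutcome =
  | { kind: "spoken" }
  | { kind: "fallback"; text: string };

// Speak the text if a synthesizer is available and supports the language;
// otherwise report the fallback text that should be displayed instead.
function speakOrFallback(
  text: string,
  lang: string,
  synth: Synthesizer | null,
  fallbackText: string
): TtsOutcome {
  if (synth !== null && synth.supports(lang)) {
    synth.speak(text, lang);
    return { kind: "spoken" };
  }
  return { kind: "fallback", text: fallbackText };
}
```

Because callers only see the `Synthesizer` interface, the open question of which engine to use can be settled later without changing the calling code.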
Hacking Firefox UI
To implement speech-to-text and text-to-speech in practice, we had to modify the UI of the development version of Mozilla Firefox (Nightly), specifically the files browser.xul, browser.css and browser.js. To address security and privacy concerns, we added two separate buttons to the UI: one for initiating speech input from the user and the other for converting the selected text to speech. We also added functions to the JavaScript file (browser.js) to implement the behavior of those two buttons.
Tentative Schedule
- 28th January - 4th February - SpeechRequest + endpointer code compiled.
- 5th February - 11th February - Fixes to microphone handling on Linux, some other small fixes, getting familiar with the code.
- 12th February - 18th February - Continue with small fixes (for example, simplify thread handling).
- 19th February - 25th February - Christmas etc, not much progress, but continue with fixing SpeechRequest API, possibly adding some new features.
- 26th February - 1st March - Holiday Season, not much progress, but continue with SpeechRequest Implementation.
- 2nd March - 8th March - Get TTS working.
- 9th March - 15th March - Enhancements to the TTS Implementation.
- 16th March - 22nd March - First speech commands, for example browser go back and go forward.
- 23rd March - 29th March - More speech commands, maybe read entire text.
Updates
You can check out the first update here
We have recently released the second update at the same GitHub link. You can clone the repository, make changes and send a pull request.