SpeechRTC - Speech enabling the open web
- 1 SpeechRTC and Mozilla
- 1.1 Starting point
- 1.2 The Client
- 1.3 The Server
- 1.4 The speech decoder
- 1.5 Web Speech API
- 1.6 GSOC Progress
- 1.7 Demos, Links and references
SpeechRTC and Mozilla
Speech recognition is almost a standard feature on modern handsets. SpeechRTC aims to bring it to Firefox OS and other Mozilla products by creating a scalable and flexible platform focused on delivering a great experience to users and on empowering developers, ultimately offering full support for the Web Speech API and other tools.
SpeechRTC is already used in two published Firefox OS apps and has proved to run both online and offline on any device running 1.3, even Unagi devices. The fast track is therefore to first integrate it with FxOS as OS-level integrated apps, which builds the foundations, and then to release the Web Speech API to developers.
Online Mode: In online mode, audio is captured, encoded in Opus through MediaRecorder, and streamed over WebSockets to a Node.js application on the server that handles the connection with the decoder. There are also methods to change the language model when necessary.
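The client side of the online path can be sketched as follows. This is only an illustration under stated assumptions: the message names ("set-language-model", "end"), the SocketLike interface and the SpeechStreamClient class are hypothetical and are not taken from the SpeechRTC sources. The sketch assumes control messages travel as JSON text frames and encoded audio as binary frames.

```typescript
// Hypothetical minimal socket shape; a browser WebSocket offers a compatible send().
interface SocketLike {
  send(data: string | ArrayBuffer): void;
}

// Hypothetical client wrapper: the frame layout below is an assumed convention.
class SpeechStreamClient {
  private readonly socket: SocketLike;

  constructor(socket: SocketLike) {
    this.socket = socket;
  }

  // Ask the server to switch the decoder to another language model.
  setLanguageModel(name: string): void {
    this.socket.send(JSON.stringify({ type: "set-language-model", name }));
  }

  // Forward one encoded audio chunk as a binary frame; empty chunks are skipped.
  // MediaRecorder delivers Blob chunks, so a caller would first convert each
  // one with blob.arrayBuffer().
  sendAudioChunk(chunk: ArrayBuffer): void {
    if (chunk.byteLength > 0) {
      this.socket.send(chunk);
    }
  }

  // Tell the server that the utterance is complete.
  endOfUtterance(): void {
    this.socket.send(JSON.stringify({ type: "end" }));
  }
}
```

Keeping the socket behind a small interface lets the same client be exercised with a fake socket in tests and with a real WebSocket in the browser.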
Offline Mode: In offline mode, audio is captured as PCM from getUserMedia (gUM) and streamed to a web worker "thread", which preprocesses it and feeds it to the decoder API, ported to JavaScript with Emscripten. Like online mode, offline mode supports switching language models. Although this has proved to work, the ideal approach is to run the decoder in a separate C++ process and communicate with it through IPC, so that it runs even on phones with constrained CPUs.
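The worker in the offline path has to hand the decoder samples in a format it accepts. A minimal sketch of that step, assuming the captured audio arrives as Float32 samples in the range [-1, 1] (as Web Audio delivers them) and that the decoder consumes 16-bit signed PCM, as pocketsphinx does. The function name floatToPcm16 is hypothetical, and resampling to the acoustic model's sample rate is left out.

```typescript
// Convert Web Audio style float samples to 16-bit signed PCM.
function floatToPcm16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot wrap around.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Negative and positive halves have different magnitudes in int16.
    out[i] = s < 0 ? Math.round(s * 32768) : Math.round(s * 32767);
  }
  return out;
}
```

In a worker, the resulting Int16Array could be passed on to the decoder in fixed-size blocks, which keeps the conversion cost off the main thread.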
The speech decoder
Third-party licensing is extremely costly (the usual unit is millions) and leads to an unwanted dependency. Writing a decoder from scratch is tough and requires highly specialized, hard-to-find engineers.
The good news is that there are great open source toolkits that we can use and enhance. I am a long-time supporter of and contributor to CMU Sphinx, which has a number of quality models for different languages openly available. In addition, pocketsphinx can run very fast and accurately when well tuned, for both FSG (finite-state grammar) and LVCSR (large-vocabulary continuous speech recognition) language models.
For LVCSR we can also consider Julius and benchmark it, since it has great proven results.
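To make the FSG case concrete: pocketsphinx can load grammars written in JSGF, where a small command set is just one public rule listing the alternatives. The sketch below builds such a grammar from a list of phrases. The function name buildCommandGrammar and the rule name <command> are hypothetical, and lowercasing the phrases is an assumption that the pronunciation dictionary uses lowercase entries.

```typescript
// Build a small JSGF grammar whose single public rule accepts any one phrase.
function buildCommandGrammar(grammarName: string, phrases: string[]): string {
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(grammarName)) {
    throw new Error("invalid grammar name: " + grammarName);
  }
  if (phrases.length === 0) {
    throw new Error("at least one phrase is required");
  }
  // Normalize case and whitespace (assumes a lowercase dictionary).
  const cleaned = phrases.map((p) => p.trim().toLowerCase().replace(/\s+/g, " "));
  for (const phrase of cleaned) {
    // Allow only plain words so that no JSGF operator can leak into the rule.
    if (!/^[a-z']+( [a-z']+)*$/.test(phrase)) {
      throw new Error("unsupported phrase: " + phrase);
    }
  }
  return [
    "#JSGF V1.0;",
    "grammar " + grammarName + ";",
    "public <command> = " + cleaned.join(" | ") + ";",
    "",
  ].join("\n");
}
```

Because the decoder only has to choose among the listed phrases, a grammar like this is the kind of workload that suits offline recognition on a handset.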
We should also build scripts that automatically adapt the acoustic model to each user's own voice, so that the service keeps improving both for that user individually and overall.
Some have raised privacy concerns about online services. In the ideal scenario, online recognition is actually required only for LVCSR, while FSG can be handled offline if architected correctly. I think letting users choose whether or not we may use their voice to improve the models is how other OSes handle this issue.
Offline and online
The same speech server can be designed to run both online and offline, leaving the responsibility for handling transmission to the middleware that manages the connections with the front end.
Web Speech API
After we build both the online and offline backends in a scalable way, we connect them to the Web Speech API already present in Gecko and release the API to developers. This automatically brings support for every web app already developed against the Web Speech API, which currently runs only on Chrome.
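For reference, this is roughly how a web app drives the Web Speech API. The property and method names used on the recognizer (lang, onresult, onerror, start, and results[0][0].transcript) come from the Web Speech API itself. The interfaces and the recognizeOnce helper are hypothetical and exist only to keep the sketch self-contained and testable outside a browser.

```typescript
// Minimal structural types mirroring the parts of the Web Speech API used here.
interface AlternativeLike {
  transcript: string;
  confidence: number;
}
interface ResultEventLike {
  results: ArrayLike<ArrayLike<AlternativeLike>>;
}
interface RecognitionLike {
  lang: string;
  onresult: ((event: ResultEventLike) => void) | null;
  onerror: ((event: { error: string }) => void) | null;
  start(): void;
}

// Run one recognition and report the top alternative of the first result.
function recognizeOnce(
  rec: RecognitionLike,
  lang: string,
  onText: (transcript: string, confidence: number) => void,
  onError: (reason: string) => void
): void {
  rec.lang = lang;
  rec.onresult = (event) => {
    const best = event.results[0][0];
    onText(best.transcript, best.confidence);
  };
  rec.onerror = (event) => onError(event.error);
  rec.start();
}

// In a browser the recognizer would come from the platform, for example
// new SpeechRecognition() (exposed as webkitSpeechRecognition in Chrome).
```

An app written this way does not need to know whether the result came from an online or an offline backend.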
Bug tree on Bugzilla
- Get the builds on Bugzilla
- Week 1
- Bonding and discussion with the mentor about the adopted architecture, and introduction to Gecko
- Week 2
- Set up the environment for Firefox compilation and debug tools
- Start of pocketsphinx integration with Gecko
- Week 3
- Integrating pocketsphinx with Gecko and Web Speech API layer
- Week 4
- Pocketsphinx already integrated with Gecko. Coding the integration with the Web Speech API C++ layer
- Week 5
- Pocketsphinx integrated and first decodes already happening. Still working to finish the full integration, grammar generation, profiling, etc.
- Week 6
- Pocketsphinx integrated and decoding in-file with language model switch on Linux and Mac
- Week 7
- Patching build and packaging system to generate builds for B2G and Desktop.
- Week 8
- Patch pocketsphinx to load grammars in memory, and test on Desktop and Flame.
- Week 9
- Tests on different devices and accents, and update models and pocketsphinx sources.
- Week 10
- Patch pocketsphinxrecognition service to switch grammars and decode speech entirely in-memory on a thread
- Week 11
- Conclusion of pocketsphinxrecognition service to switch grammars and decode speech entirely in-memory on a thread.
- Rebase with m-c and generate builds for all platforms (B2G, Fennec and desktop).
- Put all bugs on bugzilla (https://bugzilla.mozilla.org/showdependencytree.cgi?id=1032964&hide_resolved=1)
- Squash commits and generate patches for review
- Week 12
- Work with the reviewers to approve the patch for landing
- Write the mochitests
- Change SpeechGrammarList to use Promises
We currently have a board on Trello with a live task status: https://trello.com/b/UWXblmKb/webspeech-api
Follow the repo: https://github.com/andrenatal/gecko-dev
Demos, Links and references
- The crab
- Emscripten Offline recognition on Peak
- SpeechRTC Github
- ChatterThing - Telefonica Hackathon Campus Party BR Winner
- https://www.youtube.com/watch?v=mTlcjPG7ogM (Portuguese)
- Andre Natal: http://www.linkedin.com/in/andrenatal
- Product Management Contact
- Sandip Kamat: http://www.linkedin.com/in/sandip, twitter: @sankam