Changes

SpeechRTC - Speech enabling the open web

4,001 bytes added, 18:01, 21 April 2014

no edit summary

== SpeechRTC - Speech enabling the open web ==

Speech recognition is almost a standard feature on modern handsets, and the goal of SpeechRTC is to bring it to Firefox OS and other Mozilla products by creating a scalable and flexible services platform, with a focus on delivering a great experience to users and empowering developers, finally offering full support for the Web Speech API and other tools.

== How-to ==

1. Starting point

SpeechRTC is already used in two published Firefox OS apps[1] and has proved to run both online and offline on any device with 1.3, even unagi. So the fast track is to first integrate it with FxOS as OS-level integrated apps to build the foundations, and then release the Web Speech API to developers afterwards.


1. The Client

In online mode, audio is captured and encoded as Opus through MediaRecorder, then streamed over WebSockets to a Node.js application on the server that handles the connection with the decoder. The client also has methods to change the language model when necessary.
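The online client flow above can be sketched as a small session object. This is only an illustration under stated assumptions: the wire format (JSON text frames for control messages, binary frames for Opus chunks), the message names, and the `SpeechClientSession` class are hypothetical and are not taken from the SpeechRTC code. The transport is injected as a plain `send` function so the sketch runs without a browser.

```typescript
// Hypothetical wire format (an assumption, not SpeechRTC's actual protocol):
// text frames carry JSON control messages, binary frames carry Opus audio.
type Frame = string | Uint8Array;

interface ControlMessage {
  type: "set-language-model" | "end-of-utterance";
  model?: string;
}

class SpeechClientSession {
  private send: (frame: Frame) => void;
  private sentBytes = 0;

  // `send` stands in for the WebSocket's send method.
  constructor(send: (frame: Frame) => void) {
    this.send = send;
  }

  // Ask the server to switch the active language model.
  setLanguageModel(model: string): void {
    const msg: ControlMessage = { type: "set-language-model", model };
    this.send(JSON.stringify(msg));
  }

  // Forward one encoded audio chunk; empty chunks are skipped.
  sendAudio(chunk: Uint8Array): void {
    if (chunk.byteLength === 0) return;
    this.sentBytes += chunk.byteLength;
    this.send(chunk);
  }

  // Tell the server that the current utterance is complete.
  endUtterance(): void {
    const msg: ControlMessage = { type: "end-of-utterance" };
    this.send(JSON.stringify(msg));
  }

  get bytesSent(): number {
    return this.sentBytes;
  }
}
```

In a browser, `send` would typically wrap an open `WebSocket`, and the chunks would come from the `Blob` delivered by each MediaRecorder `dataavailable` event after converting it to bytes.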

In offline mode, audio is captured as PCM from getUserMedia (gUM) and streamed to a web worker "thread", which processes it using the decoder API ported to JavaScript with Emscripten. As in online mode, it also supports switching the language model. Although this has proved to work, the ideal approach is to run the decoder in a separate C++ process and communicate with it through IPC, so that it runs even on phones with a constrained CPU.
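One step the worker plausibly needs before handing audio to the decoder is a sample-format conversion: the Web Audio API delivers 32-bit float samples in the range [-1, 1], while pocketsphinx consumes 16-bit signed PCM. The function below is a minimal sketch of that conversion; the name `floatTo16BitPcm` is hypothetical, and resampling to the decoder's expected rate is deliberately left out.

```typescript
// Convert Web Audio float samples ([-1, 1]) to 16-bit signed PCM.
// Out-of-range samples are clamped rather than allowed to wrap around.
function floatTo16BitPcm(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]));
    // Negative values scale to -32768, positive values to 32767.
    out[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  }
  return out;
}
```

Keeping this conversion in the worker keeps the main thread free, and the resulting `Int16Array` can be passed on to whichever decoder binding is in use.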

2. The Server

In online mode, the Node.js application that receives audio and grammars from peers is responsible for handling the connection with the voice server, which then decodes Opus to PCM and passes it to the decoder for recognition, or switches the language model when requested. Some have argued with me in the past for also running the decoder in Node, but I decided it would be better for the project to decouple it, since we may need to use different decoders and voice servers that may not run on JavaScript.
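The decoupling argued for above can be illustrated with a small dispatcher that knows nothing about the decoder itself. This is a sketch under assumptions: the `VoiceServer` interface, the `routeFrame` function, and the JSON control message shape (`{"type": "set-language-model", "model": ...}`) are all hypothetical, chosen only to show how the Node.js layer could forward audio and language-model switches to any backend.

```typescript
// Hypothetical abstraction over whatever voice server sits behind Node.js.
// Any decoder (pocketsphinx, Julius, ...) could be wrapped to fit it.
interface VoiceServer {
  feedAudio(opusChunk: Uint8Array): void;
  switchLanguageModel(model: string): void;
}

type RouteResult = "audio" | "control" | "rejected";

// Route one incoming frame: binary frames are audio, text frames are
// JSON control messages. Anything malformed is rejected, not forwarded.
function routeFrame(frame: string | Uint8Array, voice: VoiceServer): RouteResult {
  if (typeof frame !== "string") {
    voice.feedAudio(frame);
    return "audio";
  }
  let msg: unknown;
  try {
    msg = JSON.parse(frame);
  } catch {
    return "rejected";
  }
  if (typeof msg === "object" && msg !== null) {
    const m = msg as { type?: unknown; model?: unknown };
    if (m.type === "set-language-model" && typeof m.model === "string") {
      voice.switchLanguageModel(m.model);
      return "control";
    }
  }
  return "rejected";
}
```

Because the dispatcher depends only on the interface, swapping in a different decoder or voice server would not require changes to the connection-handling code.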

In offline mode, decoding is currently done in a web worker, but the ideal is to run it in a separate standalone process and communicate through IPC. This can dramatically reduce the overhead of running in JavaScript and enable even the $25 phone to have offline speech recognition.

3. The speech decoder

Third-party licensing is extremely costly (the usual unit is millions) and leads to an unwanted dependency. Writing a decoder from scratch is tough and requires highly specialized, hard-to-find engineers. The good news is that there are great open source toolkits that we can use and enhance. I am a long-time supporter of and contributor to CMU Sphinx, which has a number of quality models for different languages openly available. In addition, pocketsphinx can run very fast and accurately when well tuned, for both FSG and LVCSR language models. For LVCSR we can also consider Julius and benchmark it, since it has strong proven results.

3.1 AM Automatic retrain

We should also build scripts to automatically adapt the acoustic model for each user with their own voice, to continuously improve the service for that user individually and also for the service overall.

3.2 LM Privacy

Some have raised concerns with me about privacy in online services. In the ideal scenario, online recognition is actually required only for LVCSR, while FSG can be handled offline if architected correctly. I think letting users choose whether or not to let us use their voice to improve the models is how other OSes handle this issue.

3.3 Offline and online

The same speech server can be designed to run both online and offline, leaving the responsibility for handling transmission to the middleware that handles the connections with the front end.

4. Web Speech API

After we build both the online and offline backends in a scalable way, we connect them to the Web Speech API implementation already available in Gecko and release the API to developers. This automatically adds support for every web app already developed against the Web Speech API, which currently runs only on Chrome.
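The split described in 3.2 and 3.3 (FSG handled offline, LVCSR requiring the online service, and voice data used for model improvement only when the user opts in) could be expressed as a small routing decision. The sketch below is illustrative only: the type names, the `planRecognition` function, and the rule that offline audio is never retained are assumptions, not part of SpeechRTC.

```typescript
// Kind of language model the request needs.
type ModelKind = "fsg" | "lvcsr";

interface RecognitionRequest {
  modelKind: ModelKind;
  networkAvailable: boolean;
  // True only if the user agreed to share voice data for retraining.
  shareAudioForTraining: boolean;
}

interface RecognitionPlan {
  backend: "offline" | "online" | "unavailable";
  retainAudio: boolean;
}

// Decide where a request is decoded and whether its audio may be kept.
function planRecognition(req: RecognitionRequest): RecognitionPlan {
  // Grammar-based (FSG) recognition stays on the device (assumed policy:
  // audio that never leaves the device is never retained).
  if (req.modelKind === "fsg") {
    return { backend: "offline", retainAudio: false };
  }
  // LVCSR needs the online service in this sketch.
  if (!req.networkAvailable) {
    return { backend: "unavailable", retainAudio: false };
  }
  return { backend: "online", retainAudio: req.shareAudioForTraining };
}
```

For example, a voice-command app using a fixed grammar would always be planned as `offline`, while dictation without connectivity would be reported as `unavailable` rather than silently degraded.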
8. Demos, Links and references

8.1 The crab
Video: https://www.youtube.com/watch?v=pnCRH-Iznrc
App: https://marketplace.firefox.com/app/the-crab

8.2 Voicity
Video: https://www.youtube.com/watch?v=cjjFvyH3kdc
App: https://marketplace.firefox.com/app/voicity

8.3 Emscripten Offline recognition on Peak
Video: https://www.youtube.com/watch?v=FXKXhrRDEb8

8.4 SpeechRTC Github
https://github.com/andrenatal/speechrtc

8.5 ChatterThing - Telefonica Hackathon Campus Party BR Winner
https://www.youtube.com/watch?v=mTlcjPG7ogM (Portuguese)

Andrenatal

