Web Speech API - Speech Recognition

Chrome, Edge, Safari and Opera currently support a form of this API for speech-to-text, which means sites that rely on it work in those browsers, but not in Firefox. As speech input becomes more prevalent, it helps developers to have a consistent way to implement it on the web. It helps users because they will be able to take advantage of speech-enabled web experiences on any browser they choose. We can also offer a more private speech experience, as we do not keep identifiable information along with users' audio recordings.
If nothing else, our lack of support for voice experiences is a web compatibility issue that will only become more of a handicap as voice becomes more prevalent on the web. We've therefore included the work needed to start closing this gap among our 2019 OKRs for Firefox, beginning with providing Web Speech API support in Firefox Nightly.
===== What does it do? =====
<ol>
<li>The Web Speech API code in the browser is responsible for prompting the user for permission to record from the microphone, determining when speech has ended, and submitting the data to our speech proxy server. There are four headers that can be used by the client to alter the proxy's behavior (see the request sketch after this list) [https://github.com/mozilla/speech-proxy/blob/master/Makefile#L13]:</li>
* Accept-Language-STT: determines the language to be decoded by the STT service
* Store-Sample: determines if the user allows '''''Mozilla''''' to store the '''audio''' sample on our own servers for further use (training our own models, for example)
* Store-Transcription: determines if the user allows '''''Mozilla''''' to store the '''transcription''' on our own servers for further use (training our own models, for example)
* Product-Tag: determines which product is making use of the API. It can be: vf for Voicefill, fxr for Firefox Reality, wsa for Web Speech API, and so on.
<li>Once the proxy receives the request with the audio sample, it looks for the headers that were set. Nothing other than what was requested by the user, plus a timestamp and the user-agent, is saved. You can check it here: [https://github.com/mozilla/speech-proxy/blob/master/server.js#L324] </li>
<li>The proxy then detects the format of the audio file and decodes it to raw PCM. </li>
<li>A request containing '''just the audio file''' is made to the STT provider set in the proxy's configuration file. </li>
<li>Once the STT provider returns a response containing a transcription and a confidence score, it is forwarded to the client, which is then responsible for taking an action according to the user's request.</li>
</ol>
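For illustration, here is a minimal sketch of what the client-to-proxy request from step 1 could look like. The endpoint URL, the audio encoding, and the exact response shape are assumptions made for this example; only the four headers come from the list above.
<syntaxhighlight lang="typescript">
// Minimal sketch of a client request to the speech proxy.
// The endpoint URL and the response shape are assumptions for this example;
// only the four headers come from the list above.
async function transcribe(audio: Blob): Promise<{ text: string; confidence: number }> {
  const response = await fetch("https://speech-proxy.example.mozilla.org/", {
    method: "POST",
    headers: {
      "Accept-Language-STT": "en-US",   // language the STT service should decode
      "Store-Sample": "0",              // do not let Mozilla keep the audio sample
      "Store-Transcription": "0",       // do not let Mozilla keep the transcription
      "Product-Tag": "wsa",             // which product is calling (wsa = Web Speech API)
      "Content-Type": "audio/opus",     // assumed codec for the recorded audio
    },
    body: audio,
  });

  if (!response.ok) {
    throw new Error(`Speech proxy returned ${response.status}`);
  }

  // Assumed response shape: the steps above mention a transcription plus a
  // confidence score being returned to the client.
  const result = await response.json();
  return { text: result.transcription, confidence: result.confidence };
}
</syntaxhighlight>
In Firefox's case it is the browser code, not the web page, that issues this request; pages only see the standard SpeechRecognition interface.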
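The proxy-side flow from steps 2 to 5 can be sketched the same way. This is not the actual server.js implementation; the types and helper signatures are assumptions made purely to show the decision logic.
<syntaxhighlight lang="typescript">
// Sketch of the proxy-side flow from steps 2-5 above (not the real server.js).
interface ProxyRequest {
  headers: Record<string, string>;
  audio: Uint8Array;            // encoded audio as received from the client
  userAgent: string;
}

interface SttResult {
  transcription: string;
  confidence: number;
}

interface Dependencies {
  decodeToPcm(audio: Uint8Array): Uint8Array;                              // step 3
  callSttProvider(pcm: Uint8Array, language: string): Promise<SttResult>;  // step 4
  store(record: { timestamp: number; userAgent: string; audio?: Uint8Array; transcription?: string }): void;
}

async function handleSpeechRequest(req: ProxyRequest, deps: Dependencies): Promise<SttResult> {
  const language = req.headers["accept-language-stt"] ?? "en-US";
  const storeSample = req.headers["store-sample"] === "1";
  const storeTranscription = req.headers["store-transcription"] === "1";

  // Step 3: decode the incoming audio to raw PCM.
  const pcm = deps.decodeToPcm(req.audio);

  // Step 4: only the audio is sent to the STT provider, no user information.
  const result = await deps.callSttProvider(pcm, language);

  // Step 2: persist only what the user opted into, plus timestamp and user-agent.
  deps.store({
    timestamp: Date.now(),
    userAgent: req.userAgent,
    audio: storeSample ? req.audio : undefined,
    transcription: storeTranscription ? result.transcription : undefined,
  });

  // Step 5: the transcription and confidence score go back to the client.
  return result;
}
</syntaxhighlight>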
===== How does your proxy server work? Why do we have it? =====
There are both technical and practical reasons to have a proxy server. We wanted the flexibility to abstract the redirection of the user's requests to different STT services without changing the client code, and also to have a single protocol to be used across all projects at Mozilla. But the most beneficial reason was to keep users anonymous when we need to use a 3rd party provider. In this case, the requests to the provider are made from our own server and only the audio sample is submitted to them to get a transcription. Some benefits of routing the data through our speech proxy:
# Before sending the data to any 3rd party STT provider, we have the chance to strip all identifying information and make an anonymous request to the provider, preserving the user's identity.
# When we need to use 3rd party, paid STT services, we don't need to ship the service's key along with the client's code.
# Centralizing and funneling the requests through our servers decreases the chance of abuse from clients, and allows us to implement mechanisms like throttling, blacklists, etc.
# We can switch between STT services in real time as needed, and redirect requests to any service we choose, without changing any code in the client. For example: send English requests to provider A and pt-BR requests to provider B without shipping any update to the client (see the routing sketch after this list).
# We can support STT services both on premises and off premises without any extra logic in the client.
# We can centralize all requests coming from different products into a single speech endpoint, making it easier to measure the engines both quantitatively and qualitatively.
# We can support different audio formats without adding extra logic to the clients, regardless of the format supported by the STT provider, for example by adding compression or streaming between the client and the proxy.
# If users desire to contribute to Mozilla's mission and let us save their audio samples, we can do so without sending them to 3rd party providers.
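As an illustration of the real-time switching described above, a per-language routing table could live entirely in the proxy's configuration. The configuration shape and provider names below are invented for this sketch, not taken from the real proxy.
<syntaxhighlight lang="typescript">
// Illustrative only: a per-language routing table that could live in the
// proxy's configuration file. Provider names and the config shape are
// invented for this sketch.
type SttProvider = "providerA" | "providerB";

const routingConfig: { default: SttProvider; byLanguage: Record<string, SttProvider> } = {
  default: "providerA",
  byLanguage: {
    "en-US": "providerA",  // English goes to provider A
    "pt-BR": "providerB",  // Brazilian Portuguese goes to provider B
  },
};

// Picking a provider is a pure lookup on the proxy, so switching providers
// never requires shipping an update to any client.
function pickProvider(language: string): SttProvider {
  return routingConfig.byLanguage[language] ?? routingConfig.default;
}
</syntaxhighlight>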