Web Speech API - Speech Recognition

Firefox can specify which server receives the audio data that users input. Currently we send audio to Google’s Cloud Speech-to-Text. Google leads the industry in this space and offers speech recognition in 120 languages.
Prior to sending the data to Google, however, Mozilla routes it through our own proxy server first[https://github.com/mozilla/speech-proxy], in part to strip it of user identity information. This is intended to make it impractical for Google to associate such requests with a user account based only on the data Mozilla provides in transcription requests. (Google provides an FAQ on how they handle transcription data [https://cloud.google.com/speech-to-text/docs/data-logging-terms].) We opt out of allowing Google to store our voice requests. This means that, unlike when a user inputs speech using Chrome, their recordings are not saved and cannot be attached to their profile and retained indefinitely.
For Firefox, we can choose whether to retain users’ data to train our own speech services. Currently audio collection defaults to off, but eventually we would like to allow users to opt in if they choose.
 
===== How does your proxy server work? Why do we have it? =====
There are both technical and practical reasons for having it. We wanted the flexibility to redirect users' requests to different STT services without changing the client code, as well as a single protocol that can be used across all projects at Mozilla. But the most important reason is keeping our users anonymous when we need to use a 3rd-party provider: in that case, the requests to the provider are made from our own server and only the audio sample is submitted to them, and the provider then returns the transcription. Below is a list of some benefits of routing the data through our speech proxy:
# Before sending the data to any 3rd-party STT provider, we have the chance to strip all user information and make an anonymous request to the provider, thus protecting the user's identity.
# In case we need to use 3rd-party, paid STT services, we don't need to ship the service's key along with the client code.
# Centralizing and funneling the requests through our servers prevents abuse from the clients and allows us to implement mechanisms such as throttling and blacklists.
# We can switch between STT services in real time as we need/want, and redirect the request to any service we decide on without changing any code in the client. For example: send English requests to provider A and pt-BR requests to provider B without shipping any update to the client (see the sketch after this list).
# We can support both on-premises and off-premises STT services without any extra logic in the client.
# We can centralize all requests coming from different products into a single speech endpoint, making it easier to measure the engines both quantitatively and qualitatively.
# We can support different audio formats without adding extra logic to the clients, regardless of the format supported by the STT provider, for example by adding compression or streaming between the client and the proxy.
# If the user desires to contribute to Mozilla's mission and lets us save the samples, we can do so without sending them to 3rd-party providers.
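To make items 1 and 4 above concrete, here is a minimal sketch, in TypeScript, of how a proxy could pick an STT provider per request language and forward only the audio sample. This is an illustration only: the provider names, endpoint URLs, and response shape are hypothetical and are not taken from the speech-proxy code; the real routing is driven by the proxy's configuration file.

<syntaxhighlight lang="typescript">
// Hypothetical per-language routing table (benefit #4 above).
// Keys are kept server-side and never shipped with the client (benefit #2).
interface SttProvider {
  name: string;
  endpoint: string;
  apiKey: string;
}

const providersByLanguage: Record<string, SttProvider> = {
  "en-US": { name: "providerA", endpoint: "https://stt.provider-a.example/v1/recognize", apiKey: process.env.PROVIDER_A_KEY ?? "" },
  "pt-BR": { name: "providerB", endpoint: "https://stt.provider-b.example/v1/recognize", apiKey: process.env.PROVIDER_B_KEY ?? "" },
};

// Forward only the raw audio to the chosen provider: no cookies, no IP address,
// no account identifiers from the original client request (benefit #1).
async function transcribe(language: string, pcmAudio: Uint8Array): Promise<{ transcript: string; confidence: number }> {
  const provider = providersByLanguage[language] ?? providersByLanguage["en-US"];
  const response = await fetch(provider.endpoint, {
    method: "POST",
    headers: {
      "Content-Type": "audio/l16",              // raw PCM, after the proxy decodes the client's format
      "Authorization": `Bearer ${provider.apiKey}`,
    },
    body: pcmAudio,                             // just the audio sample
  });
  return (await response.json()) as { transcript: string; confidence: number };
}
</syntaxhighlight>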
===== Where are our servers and who manages them? =====
[[File:Wsa architecture.png|1000px|]]
 
A typical request follows the steps below:
 
<ol>
<li>The Web Speech API code in the browser is responsible for prompting the user for permission to record from the microphone, determining when the user has stopped speaking, and submitting the data to our speech proxy server. There are four headers that the client can use to alter the proxy's behavior [https://github.com/mozilla/speech-proxy/blob/master/Makefile#L13] (see the sketch after this list):</li>
* Accept-Language-STT: determines the language the STT service should decode
* Store-Sample: determines whether the user allows '''''Mozilla''''' to store the '''audio''' sample on our own servers for further use (training our own models, for example)
* Store-Transcription: determines whether the user allows '''''Mozilla''''' to store the '''transcription''' on our own servers for further use (training our own models, for example)
* Product-Tag: determines which product is making use of the API. It can be: vf for Voice Fill, fxr for Firefox Reality, wsa for Web Speech API, and so on.
<li>Once the proxy receives the request with the audio sample, it looks at the headers that were set, and nothing besides what was requested by the user, plus a timestamp and the user-agent, is saved. You can check it here: [https://github.com/mozilla/speech-proxy/blob/master/server.js#L324] </li>
<li>The proxy then looks at the format of the file and decodes it to raw PCM. </li>
<li>A request is made to the STT provider set in the proxy's configuration file containing '''just the audio file'''. </li>
<li>Once the STT provider returns a response containing a transcription and a confidence score, it is forwarded to the client, which is then responsible for acting according to the user's request.</li>
</ol>
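To make the headers above concrete, here is a minimal sketch of the kind of request the client sends to the proxy. The endpoint URL, audio encoding, and response field names are assumptions for illustration; only the four header names come from the speech-proxy Makefile linked above.

<syntaxhighlight lang="typescript">
// Illustrative only: the browser's Web Speech API code performs this request
// internally. The endpoint URL and response field names below are hypothetical.
async function submitUtterance(audio: Blob): Promise<{ transcript: string; confidence: number }> {
  const response = await fetch("https://speech-proxy.example.mozilla.org/", {
    method: "POST",
    headers: {
      "Accept-Language-STT": "en-US",   // language the STT service should decode
      "Store-Sample": "0",              // do not keep the audio sample on Mozilla's servers
      "Store-Transcription": "0",       // do not keep the transcription either
      "Product-Tag": "wsa",             // which product is calling: vf, fxr, wsa, ...
      "Content-Type": "audio/webm",     // the proxy decodes this to raw PCM (step 3)
    },
    body: audio,
  });
  // The proxy forwards the provider's transcription and confidence score back (step 5).
  return (await response.json()) as { transcript: string; confidence: number };
}
</syntaxhighlight>

Note that in this flow the client only ever talks to the speech proxy; the STT provider is contacted by the proxy alone, as described in step 4.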
===== There are three parts to this process - the website, the browser and the server. Which part does the current WebSpeech work cover? =====