Gecko:MediaStreamLatency

From MozillaWiki
Jump to navigation Jump to search

Minimizing the end-to-end latency of WebRTC is very important. Minimizing the latency of Web Audio, for example in the case where a script wants to start playing an AudioBuffer immediately, is also very important. What do we need to do with MediaStreamGraph to support this?

On this page I'm trying to work out estimates of the current latency of our MediaStreams setup, and how we can improve the situation.

Current State

The MediaStreamGraph runs every 10ms (modulo scheduling issues); let's call this T_graph. Let's assume the graph processing time is T_proc per tick --- hopefully just a few ms, so let's say 3ms. On my laptop the cubeb callback in nsBufferedAudioStream runs every 25ms, let's call this T_cubeb.

On the WebRTC "send" side, we have a microphone producing callbacks periodically (T_mike, 4ms in current code, I think). (I'll ignore driver processing delay, which we can't control.) That data is immediately appended to the input queue of a SourceMediaStream. At the next MediaStreamGraph tick, that data is moved to the output side of the SourceMediaStream or, probably soon, a wrapping TrackUnionStream. At that moment it can be observed by a MediaStreamListener. Let's define the overall send latency T_send as the longest delay between a sample being recorded and reaching the network stack via a MediaStreamListener. Assuming pessimal timing of the MediaStreamGraph tick, that's (for the first sample in a chunk)

 T_send = T_mike + T_graph + T_proc

In my example, this is 4 + 10 + 3 = 17ms. The ideal latency would be just T_mike = 4ms.

On the "receive" side, at each MediaStreamGraph tick we pull the latest data from the network stack (via NotifyPull), and copy it to the end of the output nsAudioStream. The worst-case latency before data is picked up by the libcubeb callback (ignoring scheduling problems) happens when a sample has to wait T_graph for the graph to tick, then wait for graph processing, and then we append the sample to the nsAudioStream buffer and have to wait for at most T_cubeb for it to be consumed and played. (I'll ignore driver processing time, which we can't control.)

 T_recv = T_graph + T_proc + T_cubeb

In my example, this is 38ms. The ideal latency would be to pull the latest audio from the network stack in each libcubeb callback, i.e. T_cubeb = 25ms.

Playing an AudioBuffer has the same latency analysis.

Desirable State

I propose running MediaStreamGraph processing every 10ms normally but *also* running it during the libcubeb callback to "catch up" to produce the rest of the data needed when the libcubeb callback fires. On Windows 7, putting a sleep(10ms) in the libcubeb callback doesn't seem to hurt. On the receive side, this will mean that the data returned to the libcubeb callback will always include the latest audio available from the network stack, reducing the latency to T_cubeb + T_proc = 28ms.

In fact we can do a little better by moving our NotifyPull calls to run as late as possible during MediaStreamGraph processing. If the network audio is not being processed, the NotifyPull(s) can happen after any WebAudio processing has been done. This effectively reduces T_proc to almost nothing. Let's call it 1ms, so we've reduced our latency to 26ms, almost the minimum allowed by libcubeb.

Script playing AudioBuffers won't be so amenable to that optimization since we need to process all messages from the DOM, so we'll take the full T_proc and have a latency of 28ms.

On the send side, to minimize latency we really should change the input API to be pull-based. The current code for Windows actually pulls from Win32 every 4ms on a dedicated thread. If we let the MediaStreamGraph do the pull instead, we can reduce latency to T_graph and T_proc. As above we can minimize T_proc for streams that aren't involved in other processing, so we can get a latency of T_graph + T_proc = 11ms on the receive side. To do better, we'd have to lower T_graph or identify cases where we can provide a "fast path" that gives the network stack consumer access to samples as soon as they're queued for the SourceMediaStream. However, I think we should focus on trying to achieve the goals already laid out here.

Latency targets per platform

Linux

We have two backends in cubeb:

ALSA
Pretty rudimentary. Most distros are not using ALSA directly anymore. However, some people choose to do so
PulseAudio
Advertised as good as ALSA (I could confirm that on my setup, but read below). This gives us automatic latency adjustments in case of underrun. We can get actual latency (as opposed to the requested latency), so we can react.

Both have been tested on a thinkpad w530 and on a Thinkpad t420, the latency is great (sub 10ms). I just got a 300 euros netbook I can test with, but I need a 32bits build to check. I am sure that minimal latency relies heavily on good interaction between: - The soundcard kernel driver - ALSA - Pulse

and I believe my setup happens to have a good driver (plus fast CPU and such), and this will be completely different with different hardware and software environment.

I know that some Pulse releases in the while are buggy and make the latency grow, we should make sure to be able to detect that.

Other browsers achieve 40ms, we want to be able to be at least as good.

Windows

We have only one backend in cubeb, that uses winmm, which is not intended at all to be low-latency. We need a new backend, that would be using the WASAPI, so we can achieve good performances.

WASAPI is in the 30ms range, and can go lower if: - we are running on decent audio hardware - we use the windows API to increase the audio thread priority

It is possible to go lower if we request exclusive access to the hardware, which is something WASAPI exposes in its API.

On my Thinkpad t420, I could bring the latency down to 512 frames at 48kHz (around 10ms) (using a DAW that has a WASAPI backend). I tried to increase the CPU load, and I could make it underrun a bit, but nothing terrible (it would be unnoticed when doing a WebRTC call). At 1024 frames, I could not detect underrun by ear.

WASAPI is available on Windows Vista, Windows 7, Windows 8. People using a stock Windows XP with a normal sound card won't get great latencies. The only way around it is to a super low level API that takes exclusive access of the hardware (basically bypassing the system's mixer and talking directly to the kernel).


MacOS

No problems at all, we can bring the latency super low (12.5ms is what we do at the moment, lower is doable). No glitches noticeable, latency is stable and everything works fine under high CPU load without underruns.

Android

For Android < 4.1 won't be able to do anything great, see <http://code.google.com/p/android/issues/detail?id=3434>. 100+ms are expected, nothing we can do to lower that.

For Android > 4.1, on some devices (currently, Galaxy Nexus, Nexus 4, Nexus 10, that is, high-end devices where specs are controlled by Google), you can achieve around 8ms, using what is called FastMixer. Basically, you do a trade of between using resampling (that is, you are bound to use the hardware's preffered samplerate), putting effects, and other goodies. You also have to use a certain buffersize. We don't care about that, because we just want a PCM interface, so it is all great.

Good thing is, we can detect at runtime if the device can run at low latency, so we can do it right now.

B2G

No idea.

Plan of action

-

Bugs to file

TODO