Gecko:NewPerfHarness
We need a new perf harness, initially for testing fennec but expected to apply more generally. The purpose of this page is to gather desiderata and arrive at one or more tentative infrastructure designs. Low-level details concerning specific metrics and implementation of infrastructure can be hashed out elsewhere.
Background
Our performance tests currently rely on metrics accessible to web content (Date.now, events, etc.). This approach is incompatible with measuring the performance users perceive, because we explicitly lie to web content for perf reasons. Implementing this approach sanely is also made next-to-impossible by process separation, GPU rendering, async scrolling and animation, and async rerendering (fennec).
There are additional issues with the implementation of our current test harnesses that are exacerbated when they're run on mobile devices. The harnesses are mostly written to run on the same system as the one being tested, meaning that the harness and the tests can compete for system resources.
Therefore, I claim that a new testing harness is in order.
Non-goals
- Replace any existing testing infrastructure. This new harness is intended to complement talos et al.
- Measure javascript, DOM, etc. performance in isolation. Only whole-system, black-box testing here.
Goals
Approximately in order of importance
- Measure what users experience. E.g., measure pixels appearing on screen, not dispatch of MozAfterPaint.
- Test infrastructure doesn't compete with tests for system resources. E.g., don't serve pages through http.js running on the test system.
- Take the software paths used in the wild. E.g., load pages through ns*Http*Channel, not through data/file channel.
- Go through the hardware used in the wild. E.g., load pages through NIC/WNIC, not from disk.
- Tests can be run locally by all developers.
- Test results are repeatable.
- Test results are reported in a statistically sound way (one possible scheme is sketched below).
- Test data can be reported at arbitrary granularity. E.g., data available down to the level of times of individual runs of a test within a trial.
- Tests run as part of the existing automation framework. E.g., run on each checkin to m-c, with changes in results reported to tree-management.
Many of these goals conflict. Finding suitable trade-offs is a main topic of discussion.
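To make the statistical-soundness and granularity goals concrete, here is a minimal sketch. It assumes, purely as an illustration and not as a decided design, that every raw per-run time is kept (so data is available down to individual runs) and that a trial is summarized by its median plus a bootstrap confidence interval:

```python
# A sketch, not a decided design: keep every raw run time and derive the
# summary (median + bootstrap 95% CI) from it. The number of bootstrap
# resamples and the alpha level are arbitrary.
import random
import statistics

def summarize(run_times_ms, n_boot=1000, alpha=0.05):
    """run_times_ms: raw times of individual runs within one trial."""
    boot_medians = sorted(
        statistics.median(random.choices(run_times_ms, k=len(run_times_ms)))
        for _ in range(n_boot))
    lo = boot_medians[int(n_boot * (alpha / 2))]
    hi = boot_medians[int(n_boot * (1 - alpha / 2)) - 1]
    return {
        "runs": run_times_ms,  # raw data: granularity down to single runs
        "median": statistics.median(run_times_ms),
        "ci95": (lo, hi),
    }

# Example: a trial of ten runs, one of them an outlier.
print(summarize([103.0, 98.5, 101.2, 97.8, 150.3, 99.9, 102.4, 98.1, 100.7, 99.2]))
```

Median/bootstrap is just one defensible choice; the point is that the harness stores raw runs and derives summaries from them, rather than storing only aggregates.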
Types of measurements to be made
- Responsiveness: ping the browser in various ways, measure pong in the form of pixels appearing on screen
- Perceived load time: not just how fast pixels appear on screen, but which pixels and according to what pattern.
- Panning/zooming (for fennec): how fast can content be moved on screen, how long does it take for "checkerboard" regions to be filled in
- Scrolling (non-fennec): similar to above
- Framerate of animations (actual on-screen framerate! one way to estimate it is sketched after this list)
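For the framerate item, one hedged way to estimate actual on-screen framerate from captured frames (whatever the capture mechanism ends up being) is to count how many successive captures differ. The function names here are illustrative:

```python
# Illustrative helper: estimate on-screen framerate by counting how many
# successive captured frames actually differ. This is a lower bound when
# frames are captured more slowly than the screen updates.
import hashlib

def frame_hash(path):
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).digest()

def actual_fps(frame_paths, timestamps):
    """frame_paths[i] was captured at timestamps[i] (seconds)."""
    changes = sum(1 for a, b in zip(frame_paths, frame_paths[1:])
                  if frame_hash(a) != frame_hash(b))
    return changes / (timestamps[-1] - timestamps[0])
```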
Ideal infrastructure (WIP)
Tests are driven by a robot that has
- Control of all input devices on test system
- A camera that records the value of each individual pixel drawn on every monitor vsync. (NB: this implies the quantum of measurable time is 1/vsync-rate.)
Test pages are served over a network that has
- Perfectly deterministic (jitter-free) latency and bandwidth
- Arbitrarily configurable latency and bandwidth
Approximating the ideal
No one has time to build such a robot, and perfect networks don't exist, so we will need to approximate them, probably in platform-specific ways.
Network
A cross-platform approximation to the ideal network is to run DNS and HTTP servers on a (quiet) host machine over a dedicated (quiet) ethernet or wireless network (preferably wireless, for the sake of fennec). The minimum configurable latency is bounded below by the intrinsic network latency, but bandwidth and latency could be throttled arbitrarily (approximately) by the HTTP server itself, as sketched below. Pages would need to be served from a ramfs.
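A minimal sketch of server-side throttling, assuming a plain Python HTTP server; the latency/bandwidth knobs, the port, and the in-memory page table standing in for the ramfs are all illustrative:

```python
# A minimal sketch, not the proposed implementation: a Python HTTP server
# that injects latency and paces writes to cap bandwidth. LATENCY_S,
# BANDWIDTH_BPS, CHUNK, the port, and the PAGES dict (standing in for the
# ramfs) are all illustrative.
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

LATENCY_S = 0.050        # artificial delay added to each response
BANDWIDTH_BPS = 125_000  # ~1 Mbit/s
CHUNK = 4096             # bytes written per pacing interval

PAGES = {"/index.html": b"<html><body>hello fennec</body></html>"}

class ThrottlingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGES.get(self.path)
        if body is None:
            self.send_error(404)
            return
        time.sleep(LATENCY_S)  # crude latency injection
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        # Pace the body out in CHUNK-sized pieces so the average rate
        # never exceeds BANDWIDTH_BPS.
        for i in range(0, len(body), CHUNK):
            self.wfile.write(body[i:i + CHUNK])
            self.wfile.flush()
            time.sleep(CHUNK / BANDWIDTH_BPS)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), ThrottlingHandler).serve_forever()
```

Sleeping in the handler and pacing writes are only approximations (the kernel still buffers, and latency is added per response rather than per packet), which matches the "approximately" hedge above.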
Android (top priority)
Android has a debugging interface ("adb") that allows a host system to designate TCP ports for which sockets opened on the device should be proxied through the host. The host can also use the interface to screenshot the device. There is additionally a "UI monkey" program that we could copy and theoretically control through adb. Adb theoretically works equally well on Linux, Mac, and Windows hosts.
So the basic idea would be to write a driver program that runs on the host, sends commands to the device using adb, and takes screenshots as frequently as possible without skewing test results. Additionally, an HTTP server would run on the host, configured to listen on the proxied TCP port and serve pages from a ramfs. (DNS?)
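A hypothetical host-side driver along those lines, shelling out to adb. It assumes an adb recent enough to have `reverse`, `input`, and `screencap`; the port, tap coordinates, and frame count are made up:

```python
# Hypothetical driver sketch: proxy a TCP port, dispatch a phony UI
# event, and grab screenshots via adb. All numbers are illustrative.
import subprocess
import time

def adb(*args, capture=False):
    return subprocess.run(["adb", *args], check=True,
                          capture_output=capture).stdout

def setup_proxy(port=8000):
    # Connections the device makes to localhost:<port> are relayed to
    # the host, where the (throttling) HTTP server listens.
    adb("reverse", f"tcp:{port}", f"tcp:{port}")

def tap(x, y):
    adb("shell", "input", "tap", str(x), str(y))

def screenshot(path):
    png = adb("exec-out", "screencap", "-p", capture=True)
    with open(path, "wb") as f:
        f.write(png)

if __name__ == "__main__":
    setup_proxy()
    tap(540, 960)  # phony UI event, e.g. trigger a page load
    t0 = time.time()
    for i in range(30):  # capture as fast as adb allows
        screenshot(f"frame-{i:03d}.png")
    print("capture rate (frames/sec):", 30 / (time.time() - t0))
```

The capture rate it prints is itself a first answer to the "unknown cost of taking screenshot" question in the trade-offs below.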
Trade-offs
- Using USB interface instead of WNIC (though adb can apparently instead use WNIC)
- adb-client competes for resources with other programs on device
- Unknown latency, unknown cost of taking a screenshot, unknown cost of dispatching phony UI events (all measurable up front; see the calibration sketch after this list)
- Requires a quiet host system, a la "Network" section
- adb is flaky on Windows machines
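The "unknown cost/latency" bullet need not stay unknown: a throwaway calibration pass can time these operations on the actual device before any test numbers are trusted. A sketch, with arbitrary iteration counts:

```python
# Calibration sketch: time the mean cost of a screenshot and of a phony
# input event over the actual adb connection. Iteration counts and tap
# coordinates are arbitrary.
import subprocess
import time

def time_cmd(args, n=20):
    t0 = time.time()
    for _ in range(n):
        subprocess.run(args, check=True, capture_output=True)
    return (time.time() - t0) / n

if __name__ == "__main__":
    print("screenshot cost (s):",
          time_cmd(["adb", "exec-out", "screencap", "-p"]))
    print("input-event cost (s):",
          time_cmd(["adb", "shell", "input", "tap", "10", "10"]))
```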
Windows
(bsmedberg tells me that jimm has been working on using Windows performance counters for repaint data.)
???