Gecko:NewPerfHarness

We need a new perf harness to be used initially for testing fennec, but expected to be used more generally. The purpose of this page is to gather desiderata and come out with (a) tentative design(s) for infrastructure. Low-level details concerning specific metrics and implementation of infrastructure can be hashed out elsewhere.

Background

Our performance tests currently rely on metrics accessible by web content (Date.now, events, etc.). This approach is incompatible with testing performance perceived by users, because we explicitly lie to web content for perf reasons. Implementing this approach sanely is also made next-to-impossible by process separation, GPU rendering, async scrolling and animation, and async rerendering (fennec).

There are additional issues with the implementation of our current test harnesses that are exacerbated when they're run on mobile devices. The harnesses are mostly written to run on the same system as the one being tested, meaning that the tests and the harness can compete for system resources.

Therefore, I claim that a new testing harness is in order.

Non-goals

  • Replace any existing testing infrastructure. This new harness is intended to complement talos et al.
  • Measure javascript, DOM, etc. performance in isolation. Only whole-system, black-box testing here.

Goals

Approximately in order of importance

  • Measure what users experience. E.g., measure pixels appearing on screen, not dispatch of MozAfterPaint.
  • Test infrastructure doesn't compete with tests for system resources. E.g., don't serve pages through http.js running on the test system.
  • Take the software paths used in the wild. E.g., load pages through ns*Http*Channel, not through data/file channel.
  • Go through the hardware used in the wild. E.g., load pages through NIC/WNIC, not from disk.
  • Tests can be run locally by all developers.
  • Test results are repeatable.
  • Test results are reported in a statistically sound way.
  • Test data can be reported at arbitrary granularity. E.g., data available down to the level of times of individual runs of a test within a trial.
  • Tests run as part of existing automation framework. E.g., run on each checkin to m-c, with changes in results reported to tree-management.

Many of these goals conflict. Finding suitable trade-offs is a main topic of discussion.

Types of measurements to be made

  • Responsiveness: ping the browser in various ways, measure pong in the form of pixels appearing on screen
  • Perceived load time: not just how fast pixels appear on screen, but which pixels and according to what pattern.
  • Panning/zooming (for fennec): how fast can content be moved on screen, how long does it take for "checkerboard" regions to be filled in
  • Scrolling (non-fennec): similar to above
  • Framerate of animations (actual framerate!)

Ideal infrastructure (WIP)

Tests are driven by a robot that has

  • Control of all input devices on test system
  • A camera that records the value of each individual pixel drawn on every monitor vsync. (NB: this implies the quantum of measurable time is 1/vsync-rate.)

Test pages are served over a network that has

  • Precisely known, deterministic latency and bandwidth
  • Arbitrarily configurable latency and bandwidth

Approximating the ideal

No one has time to build such a robot, and perfect networks don't exist, so we will need to approximate them, probably in platform-specific ways.

Network

A cross-platform approximation to the ideal network is to run DNS and HTTP servers on a (quiet) host machine over a dedicated (quiet) ethernet or wireless network (preferably wireless, for the sake of fennec). The minimum configurable latency is the intrinsic network latency, but bandwidth and latency could be arbitrarily throttled (approximately) by the HTTP server itself. Pages would need to be served from a ramfs.
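
As a concreteness check, here is a minimal Python sketch of such a host-side throttling server. Nothing below is prescribed by the plan above; the port, the parameter values, the chunk-based pacing, and the use of Python's stock http.server are all illustrative assumptions.

  import http.server
  import socketserver
  import time

  LATENCY_S = 0.050          # artificial latency added before the first byte
  BANDWIDTH_BPS = 1_000_000  # simulated link bandwidth, in bytes per second
  CHUNK = 4096               # pace the response in small chunks

  class ThrottledHandler(http.server.SimpleHTTPRequestHandler):
      # Serves files from the working directory; mount a ramfs there so
      # host disk I/O never enters the measurement.
      def copyfile(self, source, outputfile):
          time.sleep(LATENCY_S)  # crude latency injection
          while True:
              chunk = source.read(CHUNK)
              if not chunk:
                  break
              outputfile.write(chunk)
              # Sleeping per chunk approximately caps sustained
              # throughput at BANDWIDTH_BPS.
              time.sleep(len(chunk) / BANDWIDTH_BPS)

  if __name__ == "__main__":
      with socketserver.TCPServer(("", 8000), ThrottledHandler) as httpd:
          httpd.serve_forever()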

Android (top priority)

Android has a debugging interface ("adb") that allows a host system to designate TCP ports for which sockets opened on the device should be proxied through the host. The host can also use the interface to take screenshots of the device. There is additionally a "UI monkey" program that we could copy and theoretically control through adb. Adb theoretically works equally well on Linux, Mac, and Windows hosts.

So the basic idea would be to write a driver program that runs on the host, sends commands to the device over adb, and takes screenshots as frequently as possible without skewing test results. Additionally, an HTTP server would run on the host, configured to listen on the proxied TCP port and serve pages from a ramfs. (DNS?)
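
A rough sketch of that driver loop, again in Python. The adb subcommands shown ("reverse" for device-to-host proxying, "exec-out screencap" for screenshots) exist in current adb releases, but treat the exact invocations as assumptions of this sketch; monkey control and DNS handling are omitted.

  import subprocess
  import time

  def adb(*args):
      return subprocess.run(["adb", *args], check=True, capture_output=True)

  def main():
      # Proxy device connections to port 8000 back to the host's HTTP server.
      adb("reverse", "tcp:8000", "tcp:8000")

      # Kick off the test, e.g. by firing a VIEW intent at the browser.
      adb("shell", "am", "start", "-a", "android.intent.action.VIEW",
          "-d", "http://localhost:8000/test.html")

      # Grab screenshots as fast as the link allows; timestamps are taken
      # on the host, so device clock skew doesn't matter.
      for i in range(100):
          t = time.monotonic()
          png = subprocess.run(["adb", "exec-out", "screencap", "-p"],
                               check=True, capture_output=True).stdout
          with open(f"frame-{i:04d}-{t:.3f}.png", "wb") as f:
              f.write(png)

  if __name__ == "__main__":
      main()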

Trade-offs

  • Using the USB interface instead of the WNIC (though adb can apparently use the WNIC instead)
  • adb-client competes for resources with other programs on device
  • For an 800x480 phone (Galaxy S), a 60fps screen capture would require 5.76MB/s; max effective USB 2.0 bandwidth seems to be in the range of 40MB/s.
    • The transfer can be optimized to not send successive duplicate runs of pixels, at the cost of metadata and comparisons. A simple (non-optimal) row-deduplication scheme bumps the transfer requirement up to 5.7672MB/s in the worst case (all rows new every frame); see the sketch after this list.
    • It's unclear how much the data transfer alone would disturb the device. It would likely significantly impact adb-over-USB if it's also proxying TCP for tests, but tests could use the idle WNIC instead of adb.
    • The built-in screencap utilities are totally unoptimized and not designed to be fast (or to sync to vblank). We would need to write this screen-video program from scratch (and it would need root on the device). Strain on the CPU/memory bus/GPU (if any) is also unclear, although it's relatively easy to write an optimal implementation and measure.
  • Unknown latency, unknown cost of dispatching phony UI events
  • Requires a quiet host system, a la "Network" section
  • Adb is flaky on Windows machines
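
For concreteness, here is one way the simple row-deduplication scheme mentioned in the list above could be encoded, as a Python sketch. The per-row index is an assumption of this sketch; the exact metadata cost depends on the encoding chosen, so this need not reproduce the 5.7672MB/s figure exactly.

  # Send only rows that changed since the previous frame, each tagged
  # with its row index. Frames are lists of per-row byte strings.
  def dedup_frame(prev_rows, cur_rows):
      """Yield (row_index, row_bytes) for rows that differ.

      Worst case (every row differs) this sends the full frame plus
      per-row index metadata; best case (static screen) it sends nothing.
      """
      for i, (old, new) in enumerate(zip(prev_rows, cur_rows)):
          if old != new:
              yield i, new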

Windows

(bsmedberg tells me that jimm has been working on using windows performance counters for repaint data.)

Meeting notes

  • approximation methods:
    • measure framerate by only sampling representative pixel(s) (see the sketch at the end of these notes)
    • use GPU "tracer" commands/events (?) to be notified when the queue in front of the tracer finishes
    • record invalidated regions in invalidate events
    • guess at document rendering by examining document state: pending images, stylesheets, etc.
  • propagate reliable, useful info from compositor->content, expose through extension to window.performance
  • problems regarding deciding when page is "done loading" from outside black box
    • quiescent state might have changing visible area, e.g. marquee [ed. ignoring for now]
    • infinitely stable visible area hiding infinite load latency [ed. will probably need to cap load times for now. or something]
  • grabbing frames faster from within test machine itself
    • run on a multi-core desktop, taking snapshots on a different core
    • run once, recording only vblanks for timing; then run again, stopping ff at each vblank until a frame is grabbed, then resuming (possibly after the next blank) for frame contents; match up the two runs
    • virtualbox has a headless mode that can record frames
  • network
    • Bradley Gates (former Netscape intern) worked on perf under various network parameters
    • netem (?), a SW traffic shaper for Linux
    • possibly relevant work from Chromium folks
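
For concreteness, a sketch of the "sample representative pixel(s)" framerate trick from the notes above, in Python. grab_pixel is a hypothetical capture primitive (it could be backed by the screencap path described earlier), and the pixel coordinates are test-specific; both are assumptions of this sketch.

  import time

  # Estimate frames per second by watching one pixel the animation is
  # known to change every frame and counting transitions.
  def measure_fps(grab_pixel, x, y, duration_s=5.0):
      start = time.monotonic()
      last = grab_pixel(x, y)
      changes = 0
      while time.monotonic() - start < duration_s:
          cur = grab_pixel(x, y)
          if cur != last:
              changes += 1
              last = cur
      return changes / duration_s  # undercounts if sampling is slower than the framerate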