This page is no longer relevant. Performance benchmarks have moved elsewhere.


Benchmarks require some porting effort before they can run in the automated framework. A list of benchmarks and their status is available here.

Benchmarking Framework

The framework source is hosted on github.

(NOTE: as of 17 December the tests are running 30 replicates each)

Test results are averaged over 30 iterations. The browser is restarted between each iteration. For test runs on mobile devices, the device is rebooted between iterations.

Tests for Windows and Linux run on Dell Vostro 470 machines (Intel i5-3450, 4 GB memory, GeForce GT 620) running Windows 7 and Ubuntu 12.04.

All Android and FirefoxOS tests are done on Nexus 4 phones. This ensures a common baseline and provides a device with enough memory to run all the benchmarks, many of which would not function on more resource-constrained devices.

Current Results

View Perfy-Overview.html (requires LDAP) and click to drill down to details.

The charts show the average score over the 30 test runs performed. We believe the average is a fair aggregate for the sub-test results given the characteristics we see:

Many of the sub-tests have bimodal behavior (example). We strongly suspect that garbage collection is triggered during some of the test runs and is the cause of the two modes. The plan is to add more tooling to count the number and duration of GC events to confirm this suspicion.

Each individual mode has low variance; the weight given to each mode is equal to the number of samples we observe in it; and the average provides us with a number that best reflects that balance.

Caveat: the time series (example) does NOT use the average; it uses the median. The purpose of the time series is to detect regressions in performance, which means we are only interested in one of the two modes and whether it changes over time. The median naturally chooses the most popular mode, giving us consistent results over time. There are dangers to using the median; for one, it is not sensitive to changes in the balance of modes over time. Yet the median is much better than the average given the small number of tests we perform. The average is sensitive to random fluctuations between batteries of tests: the fluctuations are not large when a single battery is viewed on its own, but they are distracting when comparing batteries. We do not want to be chasing regressions when GC happened to hit thrice this round and only twice last time.
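The caveat above can be illustrated with a small sketch. The scores here are made up for illustration (a fast mode at 100 and a slow mode at 80, standing in for runs where GC is suspected to have fired); they are not measurements from the framework. The point is only that two batteries differing by one slow run give different averages but the same median.

```python
import statistics

def summarize(scores):
    """Return (mean, median) for one battery of replicate scores."""
    return statistics.mean(scores), statistics.median(scores)

# Hypothetical scores, not real data: a fast mode and a slow mode.
FAST, SLOW = 100.0, 80.0
last_week = [FAST] * 28 + [SLOW] * 2   # slow mode hit twice
this_week = [FAST] * 27 + [SLOW] * 3   # slow mode hit thrice

mean_a, median_a = summarize(last_week)
mean_b, median_b = summarize(this_week)

# The mean shifts with the balance of modes; the median stays on the
# most popular mode in both batteries.
print(f"last week: mean={mean_a:.2f} median={median_a:.2f}")
print(f"this week: mean={mean_b:.2f} median={median_b:.2f}")
```

The same sketch also shows the danger noted above: the median would stay unchanged until the slow mode accounted for half of the samples, so a gradual shift in the balance of modes would go unnoticed.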


  • Phase I - establish weekly measurements of the minimum viable product:
    • Tests we want to run.
    • Approved tests that are compatible with the framework.
    • List of phones already on hand.
    • List of phones/OS/browser combinations we need to add support to.
  • Phase II:
    • Full automation
    • Remaining platforms (OSX)
    • Benchmark porting guide

Automation/Scalability Issues

Individual test machines need to be started by hand each week

Test machines should be simple/dumb and test manager should drive them. Test machines should connect to manager to request work units and report results back to the manager.

Installed browsers must be updated manually

For Firefox, test machines can download/install from and manage their own install area. For Chrome, the update process will vary by platform: Windows stays up to date on its own, Linux requires apt-get update, and Darwin will be similar to Windows (I think).

Tests must be manually started each week

Test manager can run test plan on a schedule.

Results must be pushed to ES by hand

Test manager can upload finished test runs. Will require test manager to have VPN access to ES cluster.

Mobile devices can be unstable

Test manager can time out a task if results are not received and reassign it.
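A minimal sketch of the pull model described in this section, in which workers request work units, report results, and have their tasks reassigned on time-out. This is not the framework's code: the class name `WorkManager`, its methods, and the task and worker names are all hypothetical, and a real manager would sit behind a network API rather than direct method calls.

```python
import time
from collections import deque

class WorkManager:
    """Hypothetical manager: hands out tasks and reassigns them on time-out."""

    def __init__(self, tasks, timeout_s, clock=time.monotonic):
        self.pending = deque(tasks)   # tasks waiting for a worker
        self.in_flight = {}           # task -> (worker, deadline)
        self.results = {}             # task -> reported result
        self.timeout_s = timeout_s
        self.clock = clock            # injectable so the logic can be tested

    def request_work(self, worker):
        """Called by a worker asking for its next task; None if nothing is pending."""
        self._requeue_expired()
        if not self.pending:
            return None
        task = self.pending.popleft()
        self.in_flight[task] = (worker, self.clock() + self.timeout_s)
        return task

    def report_result(self, worker, task, result):
        """Accept a result only from the worker that currently owns the task."""
        owner = self.in_flight.get(task)
        if owner is None or owner[0] != worker:
            return False              # stale report from a timed-out worker
        del self.in_flight[task]
        self.results[task] = result
        return True

    def _requeue_expired(self):
        """Put tasks whose deadline has passed back on the pending queue."""
        now = self.clock()
        expired = [t for t, (_, deadline) in self.in_flight.items() if deadline <= now]
        for task in expired:
            del self.in_flight[task]
            self.pending.append(task)
```

Rejecting stale reports is one possible policy; a manager could instead accept a late result and cancel the reassigned copy.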

Some tests require files to be served from a proper web server instead of file URI

Run all tests from the test manager instead of pushing them to the devices.

Device discovery is not possible; devices must be configured manually

Use zeroconf to allow workers to discover manager and ask for work.

30 replicates may be too many/too few

Continue running tests until noise reaches acceptable level rather than for a fixed number of cycles.
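One way to express "run until the noise is acceptable" is to stop once the standard error of the mean falls below a fraction of the mean. The sketch below assumes that stopping rule; the function name `run_until_stable`, the `rel_tol` threshold, and the run limits are illustrative choices, not values from the framework.

```python
import math
import statistics

def run_until_stable(run_once, rel_tol=0.01, min_runs=5, max_runs=100):
    """Call run_once() until the relative standard error is <= rel_tol.

    run_once is any callable returning one numeric score. Sampling stops
    early once at least min_runs samples exist and the standard error of
    the mean, divided by the mean, is within rel_tol. max_runs caps the
    loop so a noisy test cannot run forever.
    """
    samples = []
    while len(samples) < max_runs:
        samples.append(run_once())
        if len(samples) >= min_runs:
            mean = statistics.mean(samples)
            sem = statistics.stdev(samples) / math.sqrt(len(samples))
            if mean != 0 and sem / abs(mean) <= rel_tol:
                break
    return samples
```

For bimodal sub-tests, a mean-based rule like this may need many runs to settle, so a rule based on the stability of the median could be a better fit; that choice is left open here.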


  • Martin Best - Sr. Engineering Program Manager
  • Chris Peterson - Engineering Program Manager
  • Robert Clary - Automation Developer
  • Alan Kligman - Platform Engineer
  • Joel Maher - Automation Developer
  • Deb Richardson - Product Manager
  • Milan Sreckovic - Engineering Manager, GFx
  • Clint Talbert - Automation & Tools Engineering Lead
  • Kannan Vijayan - Developer
  • Vladimir Vukicevic - Engineering Director
  • Robert Wood - Automation Developer


Project Team Meeting Thursdays at 1:30 PT for 30 mins
  • Vidyo Room: Vladimir Vukicevic's Vidyo Room
  • Physical Meeting Room: TOR-5E Finch
  • Invitation: Contact to get added to the meeting invite list.
  • Meeting Notes: Meeting Notes Etherpad
  • Server:
  • Channel: #games