Platform/Games/GameFocusedBenchmarking
Contents
- 1 Benchmarks
- 2 Benchmarking Framework
- 3 Current Results
- 4 Scope
- 5 Automation/Scalability Issues
- 5.1 Individual test machines need to be started by hand each week
- 5.2 Installed browsers must be updated manually
- 5.3 Tests must be manually started each week
- 5.4 Results must be pushed to ES by hand
- 5.5 Mobile devices can be unstable
- 5.6 Some tests require files to be served from a proper web server instead of file URI
- 5.7 Device discovery is not possible, must be configured manually
- 5.8 30 replicates may be too many/too few
- 6 People
- 7 Communication
Benchmarks
Benchmarks require some porting effort before they can run in the automated framework. A list of benchmarks and their status is available here.
Benchmarking Framework
The framework source is hosted on GitHub.
(NOTE: as of 17 December the tests are running 30 replicates each)
Test results are averaged over 30 iterations. The browser is restarted between each iteration. For test runs on mobile devices, the device is rebooted between iterations.
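A minimal sketch of that replicate loop (the constant and callable names here are illustrative, not the framework's actual code):
 # Sketch of the per-benchmark replicate loop. The reset and run steps are
 # passed in as callables; their names are illustrative only.
 REPLICATES = 30
 
 def run_replicates(run_benchmark, reset_environment):
     """Run one benchmark REPLICATES times, resetting the environment
     (browser restart on desktop, device reboot on mobile) before each run."""
     scores = []
     for _ in range(REPLICATES):
         reset_environment()
         scores.append(run_benchmark())
     return sum(scores) / len(scores)  # results are averaged over the iterations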
Tests for Windows and Linux are done on Dell Vostro 470 machines (Intel i5-3450, 4GB memory, Geforce GT620), running Windows 7 and Ubuntu 12.04.
All Android and FirefoxOS tests are done on Nexus 4 phones. This is to ensure a common baseline and have a device with enough memory to run all the benchmarks, many of which would not function on more resource-constrained devices.
Current Results
View Perfy-Overview.html (requires LDAP) and click to drill down to details.
The charts show the average score over the 30 test runs performed. We believe the average is a fair aggregate for the sub-test results given the characteristics we see:
- Many of the sub-tests have bimodal behavior (example). We strongly suspect garbage collection is being triggered during some of the test runs and is the cause of the two modes. The plan is to add more tooling to count the number and duration of GC events to confirm this suspicion.
- Each individual mode has low variance; the weight given to each mode corresponds to the number of samples we observe in it; and the average provides us with a number that best reflects that balance.
Caveat - The time series (example) does NOT use the average; it uses the median! The purpose of the time series is to detect regressions in performance, which means we are only interested in one of the two modes and whether it changes over time. The median naturally picks the most popular mode, giving us consistent results over time. There are dangers to using the median; for one, it is not sensitive to changes in the balance between the modes over time. Still, the median is much better than the average given the small number of tests we perform. The average is sensitive to random fluctuations between batteries of tests; these are not large enough to matter when a single battery is viewed on its own, but they are distracting when looking at variation between batteries. We do not want to be chasing regressions just because GC happened to hit three times this round and only twice last time.
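A rough sketch of the aggregation described above, assuming the replicate scores for one battery are already collected (the function and field names are ours, not the framework's):
 import statistics
 
 def aggregate(scores):
     """Aggregate one battery of replicate scores.
 
     The drill-down charts use the mean, which weights the two modes by how
     often each occurred; the regression time series uses the median, which
     snaps to the most common mode and stays stable run-to-run.
     """
     return {
         "chart_value": statistics.mean(scores),
         "timeseries_value": statistics.median(scores),
     }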
Scope
- Phase I - establish weekly measurements of the minimum viable product:
- Tests we want to run.
- Approved tests that are compatible with the framework.
- List of phones already on hand.
- List of phone/OS/browser combinations we need to add support for.
- Phase II:
- Full automation
- Remaining platforms (OSX)
- Benchmark porting guide
Automation/Scalability Issues
Individual test machines need to be started by hand each week
Test machines should be simple/dumb, and the test manager should drive them. Test machines should connect to the manager to request work units and report results back to it, as sketched below.
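A minimal sketch of that pull model, assuming a hypothetical HTTP API on the manager (the /work and /results endpoints and the manager hostname are made up):
 import json
 import time
 import urllib.request
 
 MANAGER = "http://test-manager.local:8080"  # hypothetical manager address
 
 def run_task(task):
     # Placeholder for actually running the requested benchmark locally.
     return {"task_id": task["id"], "scores": []}
 
 def worker_loop():
     """Dumb worker: poll the manager for a work unit, run it, post the result."""
     while True:
         with urllib.request.urlopen(MANAGER + "/work") as resp:
             task = json.loads(resp.read().decode("utf-8"))
         if not task:
             time.sleep(60)  # nothing queued yet; try again later
             continue
         result = run_task(task)
         req = urllib.request.Request(
             MANAGER + "/results",
             data=json.dumps(result).encode("utf-8"),
             headers={"Content-Type": "application/json"},
         )
         urllib.request.urlopen(req)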
Installed browsers must be updated manually
For Firefox, test machines can download/install builds from ftp.mozilla.org and manage their own install area. For Chrome, the process will vary by platform: Windows stays up to date on its own, Linux requires an apt-get update, and Darwin is likely similar to Windows.
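A sketch of how a worker might keep its browsers current; the Firefox download path, install directory, and Chrome package name are assumptions that would need to be confirmed:
 import platform
 import subprocess
 import urllib.request
 
 # Hypothetical download location; the real path under ftp.mozilla.org
 # (release vs. nightly, locale, architecture) would need to be confirmed.
 FIREFOX_URL = "https://ftp.mozilla.org/pub/firefox/..."
 
 def unpack_firefox(archive, install_dir):
     # Placeholder: extract the downloaded build into install_dir
     # (tar.bz2 on Linux, installer/zip on Windows, dmg on OS X).
     pass
 
 def update_browsers(install_dir="/opt/perfy/firefox"):
     """Refresh the worker's browsers before a test run."""
     # Firefox: fetch the build and manage our own install area.
     archive, _ = urllib.request.urlretrieve(FIREFOX_URL)
     unpack_firefox(archive, install_dir)
     # Chrome: Windows (and likely OS X) self-updates; Linux goes through apt.
     if platform.system() == "Linux":
         subprocess.check_call(["sudo", "apt-get", "update"])
         subprocess.check_call(["sudo", "apt-get", "install", "-y", "google-chrome-stable"])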
Tests must be manually started each week
The test manager can run the test plan on a schedule.
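A minimal standard-library sketch of such a schedule; in practice a cron job or OS task scheduler entry would work just as well (the Sunday 02:00 slot is arbitrary):
 import time
 from datetime import datetime, timedelta
 
 def seconds_until_next_weekly_run(weekday=6, hour=2):
     """Seconds until the next run, e.g. Sunday (weekday=6) at 02:00 local time."""
     now = datetime.now()
     days_ahead = (weekday - now.weekday()) % 7
     target = (now + timedelta(days=days_ahead)).replace(
         hour=hour, minute=0, second=0, microsecond=0)
     if target <= now:
         target += timedelta(days=7)
     return (target - now).total_seconds()
 
 def schedule_weekly(run_test_plan):
     while True:
         time.sleep(seconds_until_next_weekly_run())
         run_test_plan()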
Results must be pushed to ES by hand
The test manager can upload finished test runs. This will require the test manager to have VPN access to the ES cluster.
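A sketch of the upload step, assuming a plain HTTP connection to Elasticsearch over the VPN; the host, index name, and document type are illustrative:
 import json
 import urllib.request
 
 ES_URL = "https://es-cluster.example.internal:9200"  # hypothetical, reachable over VPN
 INDEX = "perfy-results"                               # hypothetical index name
 
 def push_result(doc):
     """Index one finished test run into Elasticsearch."""
     req = urllib.request.Request(
         "%s/%s/testrun" % (ES_URL, INDEX),
         data=json.dumps(doc).encode("utf-8"),
         headers={"Content-Type": "application/json"},
         method="POST",
     )
     with urllib.request.urlopen(req) as resp:
         return json.loads(resp.read().decode("utf-8"))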
Mobile devices can be unstable
The test manager can time out a task if its results are not received and reassign it to another worker.
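A sketch of how the manager could reassign stale tasks, assuming it records when each task was handed out (the one-hour timeout is arbitrary):
 import time
 
 TIMEOUT_SECONDS = 60 * 60  # hypothetical: give up on a device after an hour
 
 def reassign_stale_tasks(assignments, pending_queue):
     """assignments maps task_id -> (task, assigned_at timestamp).
     Tasks whose results never arrived go back on the pending queue."""
     now = time.time()
     for task_id, (task, assigned_at) in list(assignments.items()):
         if now - assigned_at > TIMEOUT_SECONDS:
             del assignments[task_id]
             pending_queue.append(task)  # another worker can pick it up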
Some tests require files to be served from a proper web server instead of file URI
Serve all tests from the test manager's web server instead of pushing them to the devices as local files.
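Serving the files could be as simple as the Python standard-library HTTP server (the directory argument requires Python 3.7+); a sketch, assuming the ported benchmarks live in a local directory:
 import functools
 from http.server import HTTPServer, SimpleHTTPRequestHandler
 
 def serve_benchmarks(directory="benchmarks", port=8000):
     """Serve benchmark pages over HTTP so tests that break under file:// URIs work."""
     handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
     HTTPServer(("0.0.0.0", port), handler).serve_forever()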
Device discovery is not possible, must be configured manually
Use zeroconf to allow workers to discover the manager and ask for work.
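A sketch of the manager advertising itself, assuming the python-zeroconf package; the service type, name, and address are made up. Workers would run a ServiceBrowser for the same service type to find the manager's address.
 import socket
 from zeroconf import ServiceInfo, Zeroconf
 
 def advertise_manager(port=8080):
     """Advertise the test manager on the local network so workers can find it."""
     info = ServiceInfo(
         "_perfy-manager._tcp.local.",                   # hypothetical service type
         "Perfy Test Manager._perfy-manager._tcp.local.",
         addresses=[socket.inet_aton("192.168.1.10")],   # manager's LAN address (example)
         port=port,
     )
     zc = Zeroconf()
     zc.register_service(info)
     return zc  # keep the Zeroconf instance alive while the manager runs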
30 replicates may be too many/too few
Continue running tests until the noise reaches an acceptable level, rather than for a fixed number of cycles.
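One possible stopping rule, sketched below: keep collecting replicates until the standard error of the mean falls below a chosen fraction of the mean (the 2% threshold and 100-run cap are arbitrary). For strongly bimodal sub-tests the error may never drop below the threshold, which is why a hard cap on runs is still needed.
 import statistics
 
 def run_until_stable(run_once, rel_error=0.02, min_runs=5, max_runs=100):
     """Collect replicates until the relative standard error of the mean
     falls below rel_error, or until max_runs is reached."""
     scores = []
     while len(scores) < max_runs:
         scores.append(run_once())
         if len(scores) >= min_runs:
             mean = statistics.mean(scores)
             sem = statistics.stdev(scores) / len(scores) ** 0.5
             if mean and sem / mean < rel_error:
                 break
     return scores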
People
- Martin Best - Sr. Engineering Program Manager - mbest@mozilla.com
- Chris Peterson - Engineering Program Manager - cpeterson@mozilla.com
- Robert Clary - Automation Developer - bclary@mozilla.com
- Alan Kligman - Platform Engineer - akligman@mozilla.com
- Joel Maher - Automation Developer - jmaher@mozilla.com
- Deb Richardson - Product Manager - deb@mozilla.com
- Milan Sreckovic - Engineering Manager, GFx - msreckovic@mozilla.com
- Clint Talbert - Automation & Tools Engineering Lead - ctalbert@mozilla.com
- Kannan Vijayan - Developer - kvijayan@mozilla.com
- Vladimir Vukicevic - Engineering Director - vladimir@mozilla.com
- Robert Wood - Automation Developer - rwood@mozilla.com
Communication
Project Team Meeting | Thursdays at 1:30 PT for 30 mins
IRC |