The goal of this project is to develop a web dashboard for identifying and tracking intermittent oranges in our tinderbox unit tests. It should help developers identify which oranges are most 'interesting', and should give people a sense of the overall state of oranges over time.
Since the dashboard requires that tinderbox failures be stored in a database, that database could also back the tinderbox+pushlog UI, which could then query a (fast) database rather than parsing buildbot logs, as it sometimes currently does.
These projects are deprecated and replaced by the new War on Orange/OrangeFactor application.
Topfails was the first database-driven orange tracker developed in our team. It shows failures in terms of overall occurrences. It suffers from a buggy log parser, and a UI with relatively few views.
Old Orange Factor
Orange Factor is a newer dashboard by jmaher. It calculates the average number of oranges per push (the 'orange factor'), and tracks that number over time. We're currently using it as a base to explore the usefulness of other statistics.
The system has several moving parts:
- modifications to TBPL that write orange comments to a database
- a Mozilla Pulse consumer that listens for buildbot messages that are generated when unit tests are finished
- a unittest logparser that parses buildbot logs and feeds the resulting data into ElasticSearch
- an instance of ElasticSearch, which is hosted by the Metrics team, that stores the parsed log data and the TBPL bug data
- a web dashboard that pulls data from the database and displays various interesting statistics about it
Development & Deployment
The OrangeFactor web app can be run locally. See the instructions at:
Making Oranges Interesting
Currently, our intermittent oranges are not very interesting. After they've been identified, they are usually more-or-less ignored. This has caused us to accumulate oranges to the point where we have to deal with several of them for every commit (and by 'deal with', I mean 'log it and forget it'), which is time-consuming for the sheriffs and for anyone who pushes a commit. At the same time, it discourages any effort to actually fix them.
We'd like to help change that. We think we can help by creating a dashboard to analyze oranges in the following ways:
- identify the oranges that occur most frequently; these are the oranges that would produce the greatest improvement in our orange factor if fixed
- identify significant changes in the frequency of a given orange; if a known intermittent orange suddenly begins to occur more frequently, it may be related to a recent code change, and this might give developers more information about when/why it occurred, which would hopefully help in fixing it
- identify interesting patterns in failures; some failures may occur more frequently on certain OSes, build types, or architectures, or vary with other factors; by providing views that track oranges across a range of factors, we might be able to give developers data that helps them reproduce failures or understand their cause
- identify overall trends in orange occurrences, already part of the legacy Orange Factor app; this can help track the 'orangeness' of a product over time, and can help measure the helpfulness of orange-fixing activities
The following dashboard views may be interesting. We're currently using OrangeFactor as a platform to experiment with views.
- [DONE] display of overall orange factor over time
- [DONE] display of failures/day, for a given failure
- [DONE] display of failures/commit/day, for a given failure
- [DONE] display of moving averages of the above
- display of failure frequencies which exceed certain limits (probably based on standard deviation)
- [DONE] display of most common failures, in aggregate, and separated by various factors: platform, OS version, architecture, build type, etc
The amount of information yielded by the parsed logs is vast. The raw data will be noisy, and trends will not be easy to discern, so statistical analysis should be used to process the data and bring those trends out.
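As a rough illustration of the kind of analysis meant here, the sketch below computes a trailing moving average over a failures/day series and flags days that exceed the mean by a multiple of the standard deviation. The function names, the threshold of two standard deviations, and the sample numbers are all assumptions made for illustration; they are not taken from the OrangeFactor code.

```python
from statistics import mean, pstdev


def moving_average(values, window):
    """Trailing moving average; early entries average whatever is available."""
    averaged = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


def flag_spikes(values, threshold=2.0):
    """Return the indices whose value exceeds mean + threshold * std deviation."""
    if len(values) < 2:
        return []
    limit = mean(values) + threshold * pstdev(values)
    return [i for i, value in enumerate(values) if value > limit]


# Hypothetical failures/day counts for a single orange.
daily_failures = [2, 3, 2, 4, 3, 2, 15, 3]
print(moving_average(daily_failures, 3))
print(flag_spikes(daily_failures))
```

A fixed global threshold like this is the simplest option; comparing each day against a rolling window instead would adapt better to oranges whose baseline frequency drifts over time.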
How Tinderbox Stores Its Data
Tinderbox stores logs in the format
- xxx is approximately the time that buildbot picked up the test to run
- yyy is the time the log was e-mailed to tinderbox
- zzz is the pid of the perl process that processed the log (no really)
Tinderbox maintains a list of bug->log associations at http://tinderbox.mozilla.org/Firefox/notes.txt. The format used therein is:
1291056044|WINNT 5.2 mozilla-central debug test firstname.lastname@example.org|1291059392|Bug%20614474
- yyy is the same as yyy above
- mmm is a string representing the testrun, in a format which isn't in the raw buildbot log
- ttt is the time that the bug was starred
None of this data can be found in the raw buildbot logs themselves, although yyy is approximately the same as the timestamp of the logfile on stage.mozilla.org. The two are not exact, though: there is usually a difference of a few seconds between the time the log was e-mailed to tinderbox (yyy) and the time the log was copied to stage.
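A notes.txt entry like the one above could be split into its fields roughly as follows. This is a minimal sketch: it assumes the fields are separated by `|`, that the testrun string itself never contains a `|`, and that the final field is URL-encoded (as `Bug%20614474` suggests). The `Note` type, `parse_note` function, and the sample line are hypothetical.

```python
from typing import NamedTuple
from urllib.parse import unquote


class Note(NamedTuple):
    log_time: int      # yyy: time the log was e-mailed to tinderbox
    testrun: str       # mmm: string representing the testrun
    starred_time: int  # ttt: time the bug was starred
    comment: str       # URL-decoded star comment, e.g. "Bug 614474"


def parse_note(line):
    """Split one notes.txt line into its four pipe-separated fields."""
    yyy, mmm, ttt, comment = line.rstrip("\n").split("|", 3)
    return Note(int(yyy), mmm, int(ttt), unquote(comment))


# Illustrative line in the same shape as the example above.
line = "1291056044|WINNT 5.2 mozilla-central debug test|1291059392|Bug%20614474"
print(parse_note(line))
```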
The log metadata is all stored in ElasticSearch; see the ElasticSearch page for details on querying this database.
The War on Orange site pulls its data from a REST API. Other applications can hook into this to get the raw orange data.
The API root is at http://brasstacks.mozilla.com/orangefactor/api/. Parameters are passed via the query string, e.g. ?key1=value1&key2=value2. Example: http://brasstacks.mozilla.com/orangefactor/api/count?startday=2011-05-21&endday=2011-05-27&tree=mozilla-central
All returned data is in JSON format.
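A client might assemble request URLs along these lines. The helper names `build_url` and `fetch` are made up for this sketch, and `fetch` is defined but not called here, since it needs network access to the server.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ROOT = "http://brasstacks.mozilla.com/orangefactor/api/"


def build_url(endpoint, **params):
    """Join the API root, an endpoint name, and a query string."""
    # Parameters left as None are treated as "not specified" and omitted.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return API_ROOT + endpoint + ("?" + query if query else "")


def fetch(endpoint, **params):
    """Request an endpoint and decode the JSON body (requires network access)."""
    with urlopen(build_url(endpoint, **params)) as response:
        return json.load(response)


print(build_url("count", startday="2011-05-21", endday="2011-05-27",
                tree="mozilla-central"))
```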
Provides a date-indexed list of oranges, with bug numbers, along with minimal details of each bug.
- startday: Mandatory. In ISO format, e.g. 2011-05-27.
- endday: Mandatory. Also in ISO format.
- bugid: Optional. Return orange data for this bug only.
- tree: Optional. Return information about this tree only. Defaults to mozilla-central. Pass "all" for orange data on all trees.
- type: Optional. Return information for this build type only. Must be "opt" or "debug". Defaults to none (both build types).
Returns an object with two properties:
- oranges: An object with dates as properties, e.g. data['oranges']['2011-05-27']. Each property is another object with orange data for the day, with the following properties:
  - orangecount: total number of oranges for that day, e.g. 54.
  - testruns: number of test runs that day, e.g. 24. The "Orange Factor" is orangecount/testruns.
  - oranges: details of the oranges that occurred that day. It is an array of objects, each one having these simple properties:
- bugs: An object with bug ids as properties. Each bug in the above list of oranges is represented here. The information is gathered via pulse and thus is quicker to access than querying Bugzilla. Only a few basic properties are available; for more detailed info, you will have to consult Bugzilla:
Returns a date-indexed summary of orange data.
The parameters are the same as for bybug. The returned data is also the same, except that the 'bugs' property is not returned, and the array of orange details (data['oranges'][<date>]['oranges']) is empty. This is a faster way to get just the summarized numerical data for, e.g., Orange Factor calculations.
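Given a response shaped as described above, the Orange Factor per day follows directly from orangecount/testruns. The sketch below works on an already-decoded JSON object. The sample data is invented, and computing a factor for a whole date range as total oranges over total test runs is an assumption for illustration, not necessarily how the dashboard aggregates.

```python
def orange_factor_by_day(data):
    """Map each date to orangecount/testruns (None when there were no runs)."""
    factors = {}
    for day, stats in sorted(data["oranges"].items()):
        runs = stats["testruns"]
        factors[day] = stats["orangecount"] / runs if runs else None
    return factors


def overall_orange_factor(data):
    """Total oranges divided by total test runs across all returned days."""
    days = data["oranges"].values()
    total_runs = sum(day["testruns"] for day in days)
    total_oranges = sum(day["orangecount"] for day in days)
    return total_oranges / total_runs if total_runs else None


# Invented sample in the documented shape (the per-day 'oranges' arrays are empty).
sample = {
    "oranges": {
        "2011-05-27": {"orangecount": 54, "testruns": 24, "oranges": []},
        "2011-05-26": {"orangecount": 30, "testruns": 20, "oranges": []},
    }
}
print(orange_factor_by_day(sample))
print(overall_orange_factor(sample))
```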
Returns minimal details for one or more bugs.
The only parameter is "bugid", which takes a bug id or a comma-separated list of bug ids.
Returns the same data as the "bugs" property of the bybug returned data.
Returns information on one or more test runs.
The request can be made in one of two fashions. To get information about only one test run,
- starttime: Unix timestamp of the run's start time.
- machine: Hostname of the test machine.
Or to get information on several,
- runs: a comma-separated list of timestamps and machines, in the form <timestamp>|<machine>,<timestamp>|<machine>, e.g. testrun?runs=1297070365|talos-r3-leopard-012,1295484366|talos-r3-xp-037
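The value of the runs parameter can be assembled from (timestamp, machine) pairs as in this small sketch. The helper name `runs_param` is hypothetical, and the sketch leaves the `|` separator unescaped, matching the example above.

```python
def runs_param(runs):
    """Format (timestamp, machine) pairs as '<timestamp>|<machine>,...'."""
    return ",".join(f"{timestamp}|{machine}" for timestamp, machine in runs)


pairs = [(1297070365, "talos-r3-leopard-012"), (1295484366, "talos-r3-xp-037")]
print("testrun?runs=" + runs_param(pairs))
```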
Returns an object with properties in the form '<timestamp>|<machine>' (regardless of which parameter format was used). Each property has an array of matching test runs, with these properties:
- passed/failed/todo: the number of tests in each category; may be missing for testruns that never completed due to crashes, etc
- elapsedtime: number of seconds it took for the testrun to complete
- builder: the buildbot builder string
- machine: the machine name
- cmdline: the command line used to invoke the test
- buildtype: opt or debug
- platform: buildbot platform string
- date: date the testrun was run, YYYY-MM-DD
- buildid: the buildbot buildid used to run the test
- revision: the hg revision used to run the test
- testfailure_count: the number of failing tests in this testrun.
- testrunerrors: an array of errors that could not be pinned to a specific test; these are usually memory leaks or crashes that occur at the end of a testrun. These errors are not included in 'testfailure_count'
- testfailures: an array of testfailures which is testfailure_count in length; each member of this array has two keys:
  - test: the name of the test that failed
  - failures: a list of failures that occurred
- logurl: URL to the complete test log.
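A consumer of this response might flatten it into one summary row per test run, as sketched below. The sample response is invented: only the property names come from the list above, while the test name, log URL, and the contents of the individual failure and error entries are placeholders. Because passed/failed/todo may be missing, the sketch reads optional properties with `dict.get`.

```python
def summarize_runs(response):
    """Build one summary row per test run from a decoded testrun response."""
    rows = []
    for key, runs in response.items():
        for run in runs:
            rows.append({
                "run": key,  # '<timestamp>|<machine>'
                "failed_tests": [f["test"] for f in run.get("testfailures", [])],
                "unattributed_errors": len(run.get("testrunerrors", [])),
                "passed": run.get("passed"),  # None if the run never completed
                "logurl": run.get("logurl"),
            })
    return rows


# Invented sample data; entry contents are placeholders.
sample = {
    "1297070365|talos-r3-leopard-012": [{
        "passed": 1200, "failed": 1, "todo": 3,
        "testfailure_count": 1,
        "testfailures": [{"test": "test_example.html", "failures": [{}]}],
        "testrunerrors": [],
        "logurl": "http://example.invalid/log.txt",
    }],
    "1295484366|talos-r3-xp-037": [{
        "testfailure_count": 0,
        "testfailures": [],
        "testrunerrors": [{}],
    }],
}
for row in summarize_runs(sample):
    print(row)
```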
Returns information on test failures.
This is different from information on oranges, in that (a) these failures may not have been starred and (b) there may be more than one failure per test.
- startday: Mandatory. In ISO format, e.g. 2011-05-27.
- endday: Mandatory. Also in ISO format.
- tree: Optional, defaults to all.
- type: Build type, opt or debug, defaults to all.
Returns an object with properties named by test run ID and containing an array of failures. Each failure is an object with the following properties:
- errors: an array of objects with 'status' and 'text' properties describing the error.
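As an example of working with this structure, the sketch below tallies error statuses across all test runs in a decoded response. The run IDs, status strings, and error text in the sample are invented for illustration; only the 'errors', 'status', and 'text' property names come from the description above.

```python
from collections import Counter


def count_statuses(response):
    """Count how often each error status appears across all test runs."""
    counts = Counter()
    for failures in response.values():
        for failure in failures:
            for error in failure.get("errors", []):
                counts[error.get("status")] += 1
    return counts


# Invented sample: two test runs keyed by hypothetical run IDs.
sample = {
    "run-a": [
        {"errors": [{"status": "TEST-UNEXPECTED-FAIL", "text": "timed out"}]},
        {"errors": [{"status": "TEST-UNEXPECTED-FAIL", "text": "assertion"},
                    {"status": "PROCESS-CRASH", "text": "crash"}]},
    ],
    "run-b": [{"errors": []}],
}
print(count_statuses(sample).most_common())
```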