Auto-tools/Projects/Futurama/2012-10-30

The Top of the List

Capacity

Short Term

  • Analyze VMs - *can* we run tests on AWS? Speaking to some people: no! AWS can be slow for Windows (sometimes a 20 min startup)
    • jmaher says only 50 failures running unit tests on AWS (which suites?)
    • failures are also consistent across multiple runs
  • Get more machines (releng) (iX machines)
    • Windows and linux test slave replacements
    • Are we going to use these for both talos and unittests? Follow up with Armenzg
    • 1044 minis in production
  • Identify problematic machines - reflash them to fix them. Can we auto-detect dead machines? (reduces intermittency)
  • For VMs, we would probably want a one-off script to compare between VMs and desktop environments.
  • Monitor sdcard burnout on mobile

Long Term

  • For branch-to-branch comparisons of intermittency, we want this in Orange Factor
  • Tools to monitor failure trends on mobile devices w.r.t. burnout, dead sdcards, etc.

Turn around time

Short Term

  • Develop a short cycle test - would need to be scrupulously maintained
    • Smoke test to start/stop the browser - adding more tests beyond that doesn't really seem to add much
    • If Windows builds fail, but Linux, OSX, and Android are fine - would we skip scheduling the full suite of tests on all platforms because one platform is dead?
      • On try you still want all the results. Whereas on inbound, you don't want to waste the cycles on something that will be backed out. A sheriff could retrigger if for some reason we still want those tests
  • Low hanging fruit in some harnesses - like parallelizing xpcshell on multi-core
    • xpcshell could also be improved by having fast disks (for tests that use SQLite data, use a ramdisk for tmp/ files) - could we do a ramdisk on Windows?
  • Be smarter about when we run various tests
    • Run tests based on where the changes were checked in - don't run robocop if mobile code doesn't change, etc. (see the sketch after this list)
    • don't run js tests if you're not actually changing js stuff
    • don't run desktop tests if you only change b2g
    • Run JS tests as part of make check - could pull these out and optimize them
  • priority levels for try e.g. I don't need my results until tomorrow
  • get lots of data about build times before optimizing (revive buildfaster dashboard)
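
A minimal sketch of what path-based test selection could look like; the path-to-suite mapping and function names below are made up for illustration, and a real version would have to live in the tree where all the schedulers can see it:

    # Hypothetical mapping from changed-path prefixes to the suites that need
    # to run; unrecognized paths fall back to running everything.
    SUITE_MAP = {
        "mobile/": {"robocop"},
        "js/src/": {"jsreftest", "jittest"},
        "b2g/": {"b2g-mochitest"},
    }
    DEFAULT_SUITES = {"mochitest", "reftest", "xpcshell"}
    ALL_SUITES = DEFAULT_SUITES | set().union(*SUITE_MAP.values())

    def suites_for_changes(changed_files):
        """Return the set of suites to schedule for a list of changed paths."""
        suites = set()
        for path in changed_files:
            matched = [m for prefix, m in SUITE_MAP.items() if path.startswith(prefix)]
            if not matched:
                return ALL_SUITES   # unrecognized path: run everything
            for m in matched:
                suites |= m
        return suites or ALL_SUITES

    # e.g. suites_for_changes(["mobile/android/base/Tab.java"]) -> {"robocop"}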

Long Term

  • makefile optimizations - mozbuild files (build faster)
  • make the tests faster?
    • only run a portion of tests and at some interval run all (but need bisect in the cloud)
      • run the longer tests periodically
      • run the high frequency orange tests periodically
      • parallelize tests more between machines
      • run the tests that have never failed in the last year periodically (e.g. dom-levelX)
      • statistically choose a sampling of tests for each changeset (see the sketch after this list)
  • What is the feasibility of an audit of tests to measure what is in there, what is duplicated, etc.? (test measurement)
    • even more awesome would be some sort of test ownership that emerges from this audit
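
A rough sketch of the sampling idea above, assuming we always keep a "hot" set of tests (recent failures, recently touched) and pick a deterministic random fraction of the rest per changeset; all names here are hypothetical:

    import hashlib
    import random

    def sample_tests(all_tests, hot_tests, changeset, fraction=0.2):
        """Run hot_tests plus a deterministic random fraction of the rest."""
        seed = int(hashlib.sha1(changeset.encode()).hexdigest(), 16)
        rng = random.Random(seed)        # seeded so reruns pick the same tests
        rest = [t for t in all_tests if t not in hot_tests]
        k = min(len(rest), max(1, int(len(rest) * fraction)))
        return sorted(set(hot_tests) | set(rng.sample(rest, k)))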

Reproducibility

Short Term

  • An environment you can get that matches production - complete with the sendchange scripts to fire off the exact test you're interested in.
  • Might be replaced by bisect in the cloud?
  • Add more ability to get runtime debugging output of the product for failures/random oranges
    • Run a failing test again with more debugging output? Do we have debugging settings we can actually toggle?
    • Could potentially re-use the existing NSPR logging and enable it at run time through an environment variable, but it's not clear that it would get us enough logging to be useful. TODO: investigate. (see the sketch after this list)
  • Investigate differences in failure between debug and opt builds
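
A small sketch of the "rerun with more logging" idea using the standard NSPR environment variables (NSPR_LOG_MODULES / NSPR_LOG_FILE); the harness invocation here is only a placeholder:

    import os
    import subprocess

    def rerun_with_nspr_logging(test_path, log_path="/tmp/nspr.log"):
        env = dict(os.environ)
        env["NSPR_LOG_MODULES"] = "all:5"   # verbose NSPR logging for all modules
        env["NSPR_LOG_FILE"] = log_path
        # Placeholder command line; substitute the real harness invocation here.
        subprocess.check_call(["./mach", "mochitest", test_path], env=env)
        return log_path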

Long Term

  • Downloadable environment to run tests in that matches the buildbot environment (depends on running tests on VMs)
  • halt on error for manual investigation

More flexible buildbot scheduling

Short Term

  • Allow for more of a staging area to try out various changes to the production environment
  • Make it easier to try new harness automation
  • Put some config files into the trees now, à la talos.json (see the sketch below)
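
A sketch of what consuming an in-tree config à la talos.json could look like; the file name and schema are invented for illustration:

    import json

    def load_test_config(path="testing/config/test-suites.json"):
        """Load a hypothetical in-tree description of which suites run where."""
        with open(path) as f:
            return json.load(f)

    # A file like this could be consumed by the scheduler, try chooser, TBPL, etc.:
    # {
    #   "linux64": {"opt": ["mochitest-1", "mochitest-2", "xpcshell"],
    #               "debug": ["mochitest-1", "xpcshell"]},
    #   "win32":   {"opt": ["mochitest-1", "reftest"]}
    # }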

Long Term

  • Completely move away from buildbot scheduler
  • There are off the shelf products for desktop but not mobile
    • But there is an ability to make scheduling for mobile appear to be scheduling things on desktop (using mozharness/mozpool/lifeguard)

Bugmail/Personal, Product Bugzilla statuses

Short Term

  • Release 4.2 with the dashboards
    • See how they get used in the wild and respond to requests and input
  • annoying reminder when you have reviews that are >2 days old!
    • existing functionality in bugzilla now (time frame is 7 days)
    • might look into decreasing the timeout
  • Use X-Bugzilla-Who to filter out TBPL robot signer.

Long Term

  • Extend component watching to make the email easier to fine tune
  • Analyze the idea of product-izing Ed's workflow for escaping the TBPL robot comments

Bisect in the cloud

Short Term

  • Should solve performance and correctness issues
  • depends on buildbot scheduling
    • Depends on being able to schedule builds & tests for itself (or it builds them itself)
    • should be smart enough to use existing builds where it is possible (recent, non-coalesced)
    • What drives this? A script we maintain that works with buildbot, or buildbot itself?
    • The data display - what's the best way to present the output of this tool?
    • How is the tool going to be used?
      • Use case A): we don't run all the tests all the time - the thing could be fired off from TBPL to go back and figure out what broke on the tree
      • Use case B): we have something that has no automated test, but we create a failing one, give this monster that test, and let it go find which changeset broke it.
    • One idea is to replicate the sheriff actions - pushing a patch, building, retriggering, etc. until you find the smallest range possible - might be one way to do a first cut (see the sketch after this list).
    • Ideally go down to individual changesets, i.e. with hg bisect rather than pushes (but that requires extra builds that aren't mapped to a push); might be long term
  • Depends on having enough capacity to dedicate machines to this
  • ted has part of it - mozregression is another part of it
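
A first-cut sketch of the bisection loop described above, assuming some trigger_build_and_test() hook (a placeholder for whatever drives buildbot or reuses existing builds):

    def bisect_pushes(pushes, trigger_build_and_test):
        """pushes: revisions ordered oldest (known good) -> newest (known bad).
        trigger_build_and_test(rev) returns True if the test passes on rev."""
        lo, hi = 0, len(pushes) - 1          # pushes[lo] good, pushes[hi] bad
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if trigger_build_and_test(pushes[mid]):
                lo = mid                      # still good; regression is later
            else:
                hi = mid                      # already bad; look earlier
        return pushes[hi]                     # first bad push in the range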

Long Term

  • Something standalone but integrated (visually) with TBPL etc.
  • Good UI to showcase the regression hunt, status, and the regression range
  • Fire and forget developer use case. It emails them a link to the UI when it is finished

Orange quarantine

Short Term

  • Run them as full test suites so you can determine whether test state issues are a problem
    • When a new test is added to a tree, run it a hundred times in a row to see how often it fails (doesn't solve state issues) - see the sketch after this list
  • Tests are flaky until proven otherwise
  • One issue is that our chunks change over time as we add tests, so a given chunk is defined differently - that might be part of the cause of a random orange
  • Almost a third of our new oranges are caused by new tests.
  • What is the mechanism?
    • Perhaps have them show up as hidden by default
    • Have some kind of automated analysis to promote tests for graduation/backout
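
A sketch of the "run it a hundred times" check; run_once() is a placeholder for a single harness invocation, and as noted above this won't catch inter-test state issues:

    def estimate_flakiness(run_once, runs=100):
        """run_once() -> True on pass; returns the observed failure rate."""
        failures = sum(1 for _ in range(runs) if not run_once())
        return failures / runs

    # e.g. keep the test quarantined if estimate_flakiness(run_once) > 0.01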

Long Term

    • If we enabled TBPL to view individual buildbot steps, we could split things into their component parts: run the flaky tests within the same buildbot step as the normal runs, but have TBPL display them as a different letter and treat their state differently. (This would benefit many of our existing tests in addition to the flaky tests.)

Opening up of TBPL to allow any automation system to show data

Short Term

  • Allows better integration of full-stack automation, but we need a way to view previous runs of sporadically-run tests.
  • Better insulates TBPL from buildbot so that we use proper interfaces for data exchange rather than old-school side effects of other codebases (e.g. TinderboxPrint).

Long Term

  • Log Service piece to streamline the parsing and generation of logfiles.

Streamline Bugzilla integration with Try/Checkin/Reviews

Short Term

  • Bitrot identification
  • Push to try
  • Push to real trees (inbound)
  • Backout
  • LogService - to parse the output (from try and trees) to detail whether the push worked or not
  • automatically put the changeset id in the bug

Long Term

  • Drive process of checkin/backout/try from bugzilla
  • Schedule landings in a queue to load balance the slaves
  • Run a test run until it fails and hand you a VM to debug the test. Maybe a NEEDSREPO flag

Test Measurements of how worthwhile/useful tests are

Short Term

  • flaky test quarantine could help
  • Take a close look at the top 10% longest-duration tests to determine if we can reduce their runtime / whether they duplicate other tests
    • There are a few tests that are vastly slower than others - since those waste so many resources, they seem like a good place to start.
  • High speed test - so that people could run a test and get results back quickly
  • static analysis of tests to discover antipatterns (e.g. setTimeout() with nonzero value)
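
A deliberately dumb sketch of the static-analysis idea for the setTimeout() antipattern; a real checker would need to parse JS properly, so this regex pass is only to illustrate:

    import re
    import sys

    # setTimeout(..., <nonzero literal>) in a test is usually a smell.
    PATTERN = re.compile(r"setTimeout\s*\([^)]*,\s*[1-9]\d*\s*\)")

    def scan(path):
        hits = []
        with open(path, encoding="utf-8", errors="replace") as f:
            for lineno, line in enumerate(f, 1):
                if PATTERN.search(line):
                    hits.append((path, lineno, line.strip()))
        return hits

    if __name__ == "__main__":
        for test_file in sys.argv[1:]:
            for path, lineno, text in scan(test_file):
                print("%s:%d: %s" % (path, lineno, text))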

Long Term

  • Code coverage metrics
  • remove tests that duplicate functionality
  • could try breaking functionality in certain ways, seeing which tests fail, and then determining how many of the tests that failed are actually duplicates
  • clint's crazy idea to look at differences in code coverage to determine what is and is not duplicated.
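
A sketch of the coverage-overlap idea: if the lines one test covers are (nearly) a subset of another test's coverage, flag it as a duplication candidate. The coverage data format here is assumed, not real:

    def duplication_candidates(coverage, threshold=0.95):
        """coverage: {test_name: set of (file, line) tuples it executes}.
        Yield (test_a, test_b, overlap) where test_a is mostly covered by test_b."""
        for a, cov_a in coverage.items():
            if not cov_a:
                continue
            for b, cov_b in coverage.items():
                if a == b:
                    continue
                overlap = len(cov_a & cov_b) / len(cov_a)
                if overlap >= threshold:
                    yield a, b, overlap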

Make all test runners use the same code/methodology - mozbase, mach front ends - to make tests easy to run, use, and write.

Short Term

  • Sane Log parsing/generation across all tests/consumers etc
    • Log service that keeps logs and allows requesters to pull down a specific log or a slice of a log file (see the sketch after this list).
  • new harnesses should use mozbase, and we should make it support the required functionality
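
A minimal sketch of a log service that returns a slice of a stored log file; the URL scheme, storage layout, and port are invented for illustration:

    import os
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import urlparse, parse_qs

    LOG_DIR = "/var/logs"   # hypothetical location of stored job logs

    class LogSliceHandler(BaseHTTPRequestHandler):
        # e.g. GET /log/<job_id>?start=1200&end=1300 returns those log lines
        def do_GET(self):
            parsed = urlparse(self.path)
            job_id = parsed.path.rsplit("/", 1)[-1]
            params = parse_qs(parsed.query)
            start = int(params.get("start", ["1"])[0])
            end = int(params.get("end", [str(start + 100)])[0])
            log_path = os.path.join(LOG_DIR, job_id + ".log")
            if not os.path.isfile(log_path):
                self.send_error(404)
                return
            with open(log_path, "rb") as f:
                lines = f.readlines()[start - 1:end]
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"".join(lines))

    if __name__ == "__main__":
        HTTPServer(("", 8000), LogSliceHandler).serve_forever()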

Long Term

  • port existing harnesses to mozbase

Go through these and decide what the short and long term deliverables are on each one. Then prioritize.

  • spokesperson: ???
  • Joel will lead the releng + team mtg
  • Mcote will set up the Toronto Developer meeting
  • On Thurs. Ed M will join us.

Ed's Views on things

Piecemeal solutions

So many of our solutions are piecemeal, and having unification in that area would be useful to keep many of the tools sane - not having TBPL, buildbot, tests, etc. all doing parsing

Getting orange factor down

  • This would be useful because it opens up many avenues going forward
  • If we could solve that we could spread commits out more
    • If something is non-critical we could push directly from try to m-i (autoland)
    • So people could do checkin-needed and it would automatically happen and schedule things around the clock and not have the masses descending on the tree during US hours.

starting to reach the max number of commits per hour

Getting talos regression information into TBPL on a per-change basis

scalability is an issue

  • need to be able to run tests on VMs

stress testing of tests

  • new tests should be assumed flaky by default and separated out
  • tests are flaky until they prove themselves otherwise
  • autoland could ignore these when making its determinations on when to land.

TBPL is becoming one dashboard to rule them all

  • that's going to be unsustainable eventually
  • we are going to need to break it down into views/modules and use cases
  • at the moment even developers don't know how to use it completely
  • we can make that more accessible
  • regression hunting mode
  • sheriffing mode
  • I just want to see my stuff mode
  • No real way to showcase when a particular machine is falling over.
  • Orange filing view - want to see all unstarred oranges that haven't been filed.
    • It would be good to see the history on these so that you can tell if the machine keeps coming up as a problem
    • each machine result on TBPL is either commented on or not - it would be good to have another state - a 'to-triage' state so that sheriffs can "file" something away to be dealt with later.
    • Difficult if all you see is unstarred oranges because you need context to know if you should file it or not (i.e. it could have been backed out)

Need a way to pull data into tbpl from non-buildbot sources

  • for these kinds of tests that are only run sporadically, we would need a view to show "what happened last time" for the tests

Amount of duplication between tools - specifically TBPL & Orange Factor

  • comment data stored in both places
  • both parse the logs etc

From a-team presentation:


[jhammel: there was some discussion previously (given some particular flag) of landing directly from a successful try -> m-c/m-i. Is this still worthwhile? It would seem to help turnaround time, both for computer time (roughly 0.5x) and human time (lag in committing)

  • would require (probably) getting OF lower and/or robust starring practices

]

  • Discussion of how it's difficult to tell when a try build is actually complete
    • would require getting what is run out of buildbot-configs and into the tree. This is not easy, but would enable fixing a lot of things (try chooser, talos names, compare talos, graphserver names, etc.); if there were a consumable form of what is run, *ALL* tools could use it instead of duplicating this in many different places and having to update it in many different places (the reality of course being that updating lags and there is no clear chain of information to follow to ensure that things are up to date).
  • What about future tests? This list is all reactive, but there are probably lots of things we can do to ensure tests written from now on are less likely to cause problems
    • Better documentation, more stringent reviews, static analysis, needing to pass x times on slower machines before first checkin, etc.