Auto-tools/Projects/Futurama/2012-10-30

The Top of the List

Capacity

Short Term

  • Analyze VMs - *can* we run tests on AWS? Speaking to some people: no! AWS can be slow for Windows (sometimes a 20 min startup)
    • jmaher says only 50 failures running unit tests on AWS (which suites?)
    • failures are also consistent across multiple runs
  • Get more machines (releng) (iX machines)
    • Windows and linux test slave replacements
    • Are we going to use these for both talos and unittests? Follow up with Armenzg
    • 1044 minis in production
  • Identify problematic machines - reflash them to fix them. Can we auto-detect dead machines? (reduces intermittency)
  • For VMs, we would probably want a one-off script to compare between VMs and desktop environments.
  • Monitor sdcard burnout on mobile

Long Term

  • For branch-to-branch comparisons of intermittency, we want this in Orange Factor
  • Tools to monitor failure trends on mobile devices w.r.t. burnout, dead sdcards, etc.

Turn around time

Short Term

  • Develop a short cycle test - would need to be scrupulously maintained
    • Smoke test to start/stop the browser - adding more tests beyond that doesn't really seem to add much
    • If Windows builds fail, but Linux, OSX, and Android are fine - would we skip scheduling the full suite of tests on all platforms because one platform is dead?
      • On try you still want all the results. Whereas on inbound, you don't want to waste the cycles on something that will be backed out. A sheriff could retrigger if for some reason we still want those tests
  • Low hanging fruit in some harnesses - like parallelizing xpcshell on multi-core
    • xpcshell could also be improved by having fast disks (for tests that use SQLite data, use a ramdisk for tmp/ files) - could we do a ramdisk on Windows?
  • Be smarter about when we run various tests
    • Run tests based on where the changes were checked in - don't run robocop if mobile code doesn't change, etc. (see the sketch after this list)
    • don't run js tests if you're not actually changing js stuff
    • don't run desktop tests if you only change b2g
    • Run JS tests as part of make check - could pull these out and optimize them
  • priority levels for try e.g. I don't need my results until tomorrow
  • get lots of data about build times before optimizing (revive buildfaster dashboard)
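
A minimal sketch of what path-based test selection could look like; the path-to-suite mapping and function names below are made up for illustration, and a real version would have to live in the tree where all the schedulers can see it:

    # Hypothetical mapping from changed-path prefixes to the suites that need
    # to run; unrecognized paths fall back to running everything.
    SUITE_MAP = {
        "mobile/": {"robocop"},
        "js/src/": {"jsreftest", "jittest"},
        "b2g/": {"b2g-mochitest"},
    }
    DEFAULT_SUITES = {"mochitest", "reftest", "xpcshell"}
    ALL_SUITES = DEFAULT_SUITES | set().union(*SUITE_MAP.values())

    def suites_for_changes(changed_files):
        """Return the set of suites to schedule for a list of changed paths."""
        suites = set()
        for path in changed_files:
            matched = [m for prefix, m in SUITE_MAP.items() if path.startswith(prefix)]
            if not matched:
                return ALL_SUITES   # unrecognized path: run everything
            for m in matched:
                suites |= m
        return suites or ALL_SUITES

    # e.g. suites_for_changes(["mobile/android/base/Tab.java"]) -> {"robocop"}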

Long Term

  • makefile optimizations - mozbuild files (build faster)
  • make the tests faster?
    • only run a portion of tests and at some interval run all (but need bisect in the cloud)
      • run the longer tests periodically
      • run the high frequency orange tests periodically
      • parallelize tests more between machines
      • run the tests that have never failed in the last year periodically (e.g. dom-levelX)
      • statistically choose a sampling of tests for each changeset (see the sketch after this list)
  • What is the feasibility of an audit of tests to measure what is in there, what is duplicated, etc.? (test measurement)
    • even more awesome would be some sort of test ownership that emerges from this audit
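
A rough sketch of the sampling idea above, assuming we always keep a "hot" set of tests (recent failures, recently touched) and pick a deterministic random fraction of the rest per changeset; all names here are hypothetical:

    import hashlib
    import random

    def sample_tests(all_tests, hot_tests, changeset, fraction=0.2):
        """Run hot_tests plus a deterministic random fraction of the rest."""
        seed = int(hashlib.sha1(changeset.encode()).hexdigest(), 16)
        rng = random.Random(seed)        # seeded so reruns pick the same tests
        rest = [t for t in all_tests if t not in hot_tests]
        k = min(len(rest), max(1, int(len(rest) * fraction)))
        return sorted(set(hot_tests) | set(rng.sample(rest, k)))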

Reproducibility

Short Term

  • An environment you can get that matches production - complete with the sendchange scripts to fire off the exact test you're interested in.
  • Might be replaced by bisect in the cloud?
  • Add more ability to get runtime debugging output of the product for failures/random oranges
    • Run a failing test again with more debugging output? Do we have debugging settings we can actually toggle?
    • Could potentially re-use the existing NSPR logging and enable it at run time through an environment variable, but it's not clear that it would get us enough logging to be useful. TODO: investigate. (see the sketch after this list)
  • Investigate differences in failure between debug and opt builds
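
A small sketch of the "rerun with more logging" idea using the standard NSPR environment variables (NSPR_LOG_MODULES / NSPR_LOG_FILE); the harness invocation here is only a placeholder:

    import os
    import subprocess

    def rerun_with_nspr_logging(test_path, log_path="/tmp/nspr.log"):
        env = dict(os.environ)
        env["NSPR_LOG_MODULES"] = "all:5"   # verbose NSPR logging for all modules
        env["NSPR_LOG_FILE"] = log_path
        # Placeholder command line; substitute the real harness invocation here.
        subprocess.check_call(["./mach", "mochitest", test_path], env=env)
        return log_path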

Long Term

  • Downloadable environment to run tests in that matches the buildbot environment (depends on running tests on VMs)
  • halt on error for manual investigation

More flexible buildbot scheduling

Short Term

  • Allow for more of a staging area to try out various changes to the production environment
  • Make it easier to try new harness automation
  • Put some config files into the trees now, à la talos.json (see the sketch below)
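
A sketch of what consuming an in-tree config à la talos.json could look like; the file name and schema are invented for illustration:

    import json

    def load_test_config(path="testing/config/test-suites.json"):
        """Load a hypothetical in-tree description of which suites run where."""
        with open(path) as f:
            return json.load(f)

    # A file like this could be consumed by the scheduler, try chooser, TBPL, etc.:
    # {
    #   "linux64": {"opt": ["mochitest-1", "mochitest-2", "xpcshell"],
    #               "debug": ["mochitest-1", "xpcshell"]},
    #   "win32":   {"opt": ["mochitest-1", "reftest"]}
    # }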

Long Term

  • Completely move away from buildbot scheduler
  • There are off the shelf products for desktop but not mobile
    • But there is an ability to make scheduling for mobile appear to be scheduling things on desktop (using mozharness/mozpool/lifeguard)

Bugmail/Personal, Product Bugzilla statuses

Short Term

  • Release 4.2 with the dashboards
    • See how they get used in the wild and respond to requests and input
  • annoying reminder when you have reviews that are >2 days old!
    • existing functionality in bugzilla now (time frame is 7 days)
    • might look into decreasing the timeout
  • Use X-Bugzilla-Who to filter out TBPL robot signer.

Long Term

  • Extend component watching to make the email easier to fine tune
  • Analyze the idea of product-izing Ed's workflow for escaping the TBPL robot comments

Bisect in the cloud

Short Term

  • Should solve performance and correctness issues
  • depends on buildbot scheduling
    • Depends on being able to schedule builds & tests for itself (or it builds them itself)
    • should be smart enough to use existing builds where it is possible (recent, non-coalesced)
    • What drives this? A script we maintain that works with buildbot, or buildbot itself?
    • The data display - what's the best way to present the output of this tool?
    • How is the tool going to be used?
      • Use case A): we don't run all the tests all the time - the thing could be fired off from TBPL to go back and figure out what broke on the tree
      • Use case B): we have something that has no automated test, but we create a failing one, give this monster that test, and let it go find which changeset broke it.
    • One idea is to replicate the sheriff actions - pushing a patch, building, retriggering, etc. until you find the smallest range possible - might be one way to do a first cut (see the sketch after this list).
    • Ideally go down to individual changesets, i.e. with hg bisect rather than pushes (but that requires extra builds that aren't mapped to a push); might be long term
  • Depends on having enough capacity to dedicate machines to this
  • ted has part of it - mozregression is another part of it
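
A first-cut sketch of the bisection loop described above, assuming some trigger_build_and_test() hook (a placeholder for whatever drives buildbot or reuses existing builds):

    def bisect_pushes(pushes, trigger_build_and_test):
        """pushes: revisions ordered oldest (known good) -> newest (known bad).
        trigger_build_and_test(rev) returns True if the test passes on rev."""
        lo, hi = 0, len(pushes) - 1          # pushes[lo] good, pushes[hi] bad
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if trigger_build_and_test(pushes[mid]):
                lo = mid                      # still good; regression is later
            else:
                hi = mid                      # already bad; look earlier
        return pushes[hi]                     # first bad push in the range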

Long Term

  • Something standalone but integrated (visually) with TBPL etc.
  • Good UI to showcase the regression hunt, status, and the regression range
  • Fire and forget developer use case. It emails them a link to the UI when it is finished

Orange quarantine

Short Term

  • Run them as full test suites so you can determine whether test state issues are a problem
    • When a new test is added to a tree, run it a hundred times in a row to see how often it fails (doesn't solve state issues) - see the sketch after this list
  • Tests are flaky until proven otherwise
  • One issue is that our chunks change over time as we add tests, so a given chunk is defined differently - that might be part of the cause of a random orange
  • Almost a third of our new oranges are caused by new tests.
  • What is the mechanism?
    • Perhaps have them show up as hidden by default
    • Have some kind of automated analysis to promote tests for graduation/backout
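
A sketch of the "run it a hundred times" check; run_once() is a placeholder for a single harness invocation, and as noted above this won't catch inter-test state issues:

    def estimate_flakiness(run_once, runs=100):
        """run_once() -> True on pass; returns the observed failure rate."""
        failures = sum(1 for _ in range(runs) if not run_once())
        return failures / runs

    # e.g. keep the test quarantined if estimate_flakiness(run_once) > 0.01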

Long Term

    • If we enabled TBPL to view individual buildbot steps, we could split things into their component parts: run the flaky tests within the same buildbot step as the normal runs, but have TBPL display them as a different letter and treat their state differently. (This would benefit many of our existing tests in addition to the flaky tests.)

Opening up of TBPL to allow any automation system to show data

Short Term

  • Allows better integration of full-stack automation, but we need a way to view previous runs of sporadically-run tests.
  • Better insulates TBPL from buildbot so that we use proper interfaces for data exchange rather than old-school side effects of other codebases (e.g. TinderboxPrint).

Long Term

  • Log Service piece to streamline the parsing and generation of logfiles.

Streamline Bugzilla integration with Try/Checkin/Reviews

Short Term

  • Bitrot identification
  • Push to try
  • Push to real trees (inbound)
  • Backout
  • LogService - to parse the output (from try and trees) to detail whether the push worked or not
  • automatically put the changeset id in the bug

Long Term

  • Drive process of checkin/backout/try from bugzilla
  • Schedule landings in a queue to load balance the slaves
  • Run a test run until it fails and hand you a VM to debug the test. Maybe a NEEDSREPO flag

Test Measurements of how worthwhile/useful tests are

Short Term

  • flaky test quarantine could help
  • Take a close look at the top 10% longest-duration tests to determine if we can reduce their runtime / whether they duplicate other tests
    • There are a few tests that are vastly slower than others - since those waste so many resources, they seem like a good place to start.
  • High speed test - so that people could run a test and get results back quickly
  • static analysis of tests to discover antipatterns (e.g. setTimeout() with nonzero value)
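
A deliberately dumb sketch of the static-analysis idea for the setTimeout() antipattern; a real checker would need to parse JS properly, so this regex pass is only to illustrate:

    import re
    import sys

    # setTimeout(..., <nonzero literal>) in a test is usually a smell.
    PATTERN = re.compile(r"setTimeout\s*\([^)]*,\s*[1-9]\d*\s*\)")

    def scan(path):
        hits = []
        with open(path, encoding="utf-8", errors="replace") as f:
            for lineno, line in enumerate(f, 1):
                if PATTERN.search(line):
                    hits.append((path, lineno, line.strip()))
        return hits

    if __name__ == "__main__":
        for test_file in sys.argv[1:]:
            for path, lineno, text in scan(test_file):
                print("%s:%d: %s" % (path, lineno, text))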

Long Term

  • Code coverage metrics
  • remove tests that duplicate functionality
  • could try breaking functionality in certain ways, seeing which tests fail, and then determining how many of the tests that failed are actually duplicates
  • clint's crazy idea to look at differences in code coverage to determine what is and is not duplicated.
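
A sketch of the coverage-overlap idea: if the lines one test covers are (nearly) a subset of another test's coverage, flag it as a duplication candidate. The coverage data format here is assumed, not real:

    def duplication_candidates(coverage, threshold=0.95):
        """coverage: {test_name: set of (file, line) tuples it executes}.
        Yield (test_a, test_b, overlap) where test_a is mostly covered by test_b."""
        for a, cov_a in coverage.items():
            if not cov_a:
                continue
            for b, cov_b in coverage.items():
                if a == b:
                    continue
                overlap = len(cov_a & cov_b) / len(cov_a)
                if overlap >= threshold:
                    yield a, b, overlap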

Make all test runners use the same code/methodology - mozbase, mach front ends - to make tests easy to run, use, and write.

Short Term

  • Sane Log parsing/generation across all tests/consumers etc
    • Log service that keeps logs and allows requesters to pull down a specific log or a slice of a log file (see the sketch after this list).
  • new harnesses should use mozbase, and we should make it support the required functionality
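
A minimal sketch of a log service that returns a slice of a stored log file; the URL scheme, storage layout, and port are invented for illustration:

    import os
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import urlparse, parse_qs

    LOG_DIR = "/var/logs"   # hypothetical location of stored job logs

    class LogSliceHandler(BaseHTTPRequestHandler):
        # e.g. GET /log/<job_id>?start=1200&end=1300 returns those log lines
        def do_GET(self):
            parsed = urlparse(self.path)
            job_id = parsed.path.rsplit("/", 1)[-1]
            params = parse_qs(parsed.query)
            start = int(params.get("start", ["1"])[0])
            end = int(params.get("end", [str(start + 100)])[0])
            log_path = os.path.join(LOG_DIR, job_id + ".log")
            if not os.path.isfile(log_path):
                self.send_error(404)
                return
            with open(log_path, "rb") as f:
                lines = f.readlines()[start - 1:end]
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"".join(lines))

    if __name__ == "__main__":
        HTTPServer(("", 8000), LogSliceHandler).serve_forever()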

Long Term

  • port existing harnesses to mozbase

Go through these and decide what the short and long term deliverables are on each one. Then prioritize.

  • spokesperson: ???
  • Joel will lead the releng + team mtg
  • Mcote will set up the Toronto Developer meeting
  • On Thurs. Ed M will join us.

Ed's Views on things

Piecemeal solutions

So many of our solutions are piecemeal, and having unification in that area would be useful to keep many of the tools sane - not having TBPL, buildbot, tests, etc. all doing parsing

Getting orange factor down

  • This would be useful because it opens up many avenues going forward
  • If we could solve that we could spread commits out more
    • If something is non-critical we could push directly from try to m-i (autoland)
    • So people could do checkin-needed and it would automatically happen and schedule things around the clock and not have the masses descending on the tree during US hours.

starting to reach the max number of commits per hour

Getting talos regression information into TBPL on a per-change basis

scalability is an issue

  • need to be able to run tests on VMs

stress testing of tests

  • new tests should be assumed flaky by default and separated out
  • tests are flaky until they prove themselves otherwise
  • autoland could ignore these when making its determinations on when to land.

TBPL is becoming one dashboard to rule them all

  • that's going to be unsustainable eventually
  • we are going to need to break it down into views/modules and use cases
  • at the moment even developers don't know how to use it completely
  • we can make that more accessible
  • regression hunting mode
  • sheriffing mode
  • I just want to see my stuff mode
  • No real way to showcase when a particular machine is falling over.
  • Orange filing view - want to see all unstarred oranges that haven't been filed.
    • It would be good to see the history on these so that you can tell if the machine keeps coming up as a problem
    • each machine result on TBPL is either commented on or not - it would be good to have another state - a 'to-triage' state so that sheriffs can "file" something away to be dealt with later.
    • Difficult if all you see is unstarred oranges because you need context to know if you should file it or not (i.e. it could have been backed out)

Need a way to pull data into tbpl from non-buildbot sources

  • for these kinds of tests that are only run sporadically, we would need a view to show "what happened last time" for the tests

Amount of duplication between tools - specifically TBPL & Orange Factor

  • comment data stored in both places
  • both parse the logs etc

From a-team presentation:


[jhammel: there was some discussion previously (given some particular flag) of landing directly from a successful try -> m-c/m-i. Is this still worthwhile? It would seem to help turnaround time, both for computer time (roughly 0.5x) and human time (lag in committing)

  • would require (probably) getting OF lower and/or robust starring practices

]

  • Discussion of how it's difficult to tell when a try build is actually complete
    • would require getting what is run out of buildbot-configs and into the tree. This is not easy, but would enable fixing a lot of things (try chooser, talos names, compare talos, graphserver names, etc.); if there were a consumable form of what is run, *ALL* tools could use it instead of duplicating this in many different places and having to update it in many different places (the reality of course being that updating lags and there is no clear chain of information to follow to ensure that things are up to date).
  • What about future tests? This list is all reactive, but there are probably lots of things we can do to ensure tests written from now on are less likely to cause problems
    • Better documentation, more stringent reviews, static analysis, needing to pass x times on slower machines before first checkin, etc.