Auto-tools/Projects/Futurama/2012-10-02

Summary

First meeting. It started a little rocky because it's a new process and a bit uncomfortable. We talked about several big topics: the lack of reproducibility in the current automation setup (a major headache when trying to fix failures), the intermittent orange problem, and VMs both for capacity and for diagnosing oranges. A summary of the talking points is below, along with some questions we will try to address on Thursday:

Summary of big points

  • Review the initial conversations with some people in https://etherpad.mozilla.org/engineering-productivity-pain-points
  • Reproducibility issue (in general) - the biggest problem
  • Orange issue (specifically)
  • VMs for capacity and orange repro'ing (DBurns will speak to friends at testing services)
  • Expanding the infrastructure to be more developer- and service-friendly.
  • Measuring tests - how do we know what we're actually testing? Where are we with code coverage?
  • Making it simpler to add tests/platforms.
  • Automatic regression hunting wired into TBPL - what would we need to write to do this?
  • When a new platform is mentioned we should push back to product and say "you can have that test but we don't have the resources to track down these issues".
    • We can allocate A-Team headcount for it, but it would be good to have platform-team headcount too.

The full raw log follows.

Raw Log

Desktop Automation

  • Inability to reproduce things without a build machine (buildbot environment) when trying to recreate a problem.
    • Ideally, we could replicate a full, end-to-end environment that matches production
  • Infrastructure security is too locked down for some of the new service oriented projects - like social, sync, safe browsing etc.
  • A lot of our tests might not be relevant for a particular change - we have no way to measure what is relevant. We don't know what most of the tests do.
    • Coverage? (See the coverage-based sketch after this list.)
  • Need more lawyers (aka managers).
    • In other words, we could use people to figure out what is in those tests.
  • There are really only three useful suites: reftests, mochitest, xpcshell.
  • Need a really easy way to roll out a test, and a server for that test, that is documented and easy to follow.
  • Need realistic networks.
    • That's similar to Talos - trying to set up the Stone Ridge stuff with Talos.
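
As a rough illustration of the coverage point above, here is a minimal sketch of coverage-based test selection. The function name, data format, and toy file names are assumptions made up for illustration, not an existing tool; the point is just mapping each test to the files it exercises and selecting only the tests whose coverage overlaps a change.

 # Hypothetical sketch of coverage-based test selection (names and data
 # format are assumptions, not an existing tool).

 def select_relevant_tests(coverage_map, changed_files):
     """coverage_map: {test name: set of source files the test touches}.
     changed_files: files modified by the patch.
     Returns the tests whose coverage overlaps the change."""
     changed = set(changed_files)
     return sorted(test for test, files in coverage_map.items()
                   if files & changed)

 if __name__ == "__main__":
     # Toy data standing in for real per-test coverage output.
     coverage_map = {
         "test_cache.js": {"netwerk/cache/nsCacheService.cpp"},
         "test_places_history.js": {"toolkit/components/places/History.cpp"},
         "reftest_svg_filters.html": {"layout/svg/nsSVGFilterFrame.cpp"},
     }
     print(select_relevant_tests(coverage_map,
                                 ["netwerk/cache/nsCacheService.cpp"]))
     # -> ['test_cache.js']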

Capacity issues

  • We are not running all tests on mobile right now because we don't want to run always-passing tests.
  • This makes regression-range hunting hard.
  • NEED: automated regression-hunting capacity. Go to TBPL and mark a failure somehow so that it gets automatically bisected to figure out what caused it (a minimal sketch of the bisection idea follows this list).
    • mozregression and a couple of other tools could help with this.
    • To tie this to oranges, we could run the oranges in a separate suite or in a staging environment. Right now we have 1200 intermittent-orange bugs, and it's nearly impossible to get developers to stick with fixing them.
  • We almost need an OrangeFactor that is hooked into the result analyzer and can determine when something goes from intermittent to full failure.
    • We almost want something like the performance-regression tracking: track the frequency of failures over time. If we had a way to say "we know this is a randomly failing test", we could mark those tests so they don't turn the build orange, and we would still have a way to detect when a test moves from just flaky to perma-fail.
    • Google has something like this for flaky tests.
  • If we had some way to move these intermittent issues out, then we could find a way to automate through them to help diagnose/analyze them.
  • But the reproducibility aspect is the hardest nut to crack here.
    • Could we run tests in an automated fashion on something more like a developer desktop, to see how likely it is that a developer could run and debug them locally?
  • Could we push a failing test into Bughunter to see if it could repro it?
    • mostly to reuse the infrastructure
    • Or capture a snapshot when the test starts failing, then allow people to download and debug that.
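
To make the automated regression-hunting bullet concrete, here is a minimal sketch of the bisection idea behind tools like mozregression. The helper names and the test_passes() callback are assumptions for illustration; real tooling would have to fetch or build each changeset and run the failing suite against it.

 # Minimal sketch of the bisection idea behind tools like mozregression
 # (illustrative only; test_passes is a stand-in for "fetch/build this
 # changeset and run the failing test against it").

 def bisect_regression(changesets, test_passes):
     """Return the first changeset where the test fails, assuming the test
     passes at the start of the range and fails at the end."""
     lo, hi = 0, len(changesets) - 1        # lo is known-good, hi is known-bad
     while hi - lo > 1:
         mid = (lo + hi) // 2
         if test_passes(changesets[mid]):
             lo = mid                       # still good: regression landed later
         else:
             hi = mid                       # already bad: regression is here or earlier
     return changesets[hi]

 if __name__ == "__main__":
     # Toy example: pretend the regression landed at push 37 of pushes 0..99.
     pushes = list(range(100))
     print(bisect_regression(pushes, lambda p: p < 37))   # -> 37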

If we could set up a constrained VM environment to force random failures and timeouts, we could see whether that makes it easier to repro oranges. This would probably work because of the differences between a test slave and a developer box (most random oranges are timing issues).
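
A rough sketch of that constrained-environment idea, written purely as an assumption about how it might look: saturate the CPUs with busy-loop workers and re-run a flaky test many times, hoping the extra scheduling pressure reproduces timing-dependent oranges more often than an idle developer box would.

 # Hypothetical repro helper: run a test command repeatedly while the CPUs
 # are saturated, to coax out timing-dependent intermittent failures.
 import multiprocessing
 import subprocess
 import sys

 def burn_cpu():
     while True:          # pure busy loop to steal CPU time from the test
         pass

 def run_under_load(test_cmd, iterations=50):
     workers = [multiprocessing.Process(target=burn_cpu, daemon=True)
                for _ in range(multiprocessing.cpu_count())]
     for w in workers:
         w.start()
     failures = 0
     try:
         for _ in range(iterations):
             if subprocess.call(test_cmd) != 0:
                 failures += 1
     finally:
         for w in workers:
             w.terminate()
     return failures

 if __name__ == "__main__":
     # e.g.: python repro_orange.py python -m pytest test_flaky.py
     cmd = sys.argv[1:]
     if not cmd:
         sys.exit("usage: repro_orange.py <test command ...>")
     print("failures:", run_under_load(cmd))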

Running in VMs as a way to scale up capacity

  • Research could be done running tests on VMs to see whether they hit the same bugs we find in normal automation or not.
  • We have to do this anyway if we move tests back into VMs.
  • Proving VMs as a platform is the only way we will be able to go back to them.
  • There are test-in-the-cloud shops whose core business is exactly this (keeping VMs from being flaky), and we can probably learn something from them. Sauce Labs comes to mind, as do CloudBees, Travis CI, and Circle CI. It would be worth talking to them.
    • Travis CI is open source, so we can go see how their setup works.

Developer Headaches

  • Intermittent orange
    • The Opera idea - putting tests in a "trial environment" that they have to graduate from.
    • It's hard to ascertain the cause of an orange.
    • If it's a totally new test then that's pretty easy, but beyond that case it's hard.
    • At Opera, if a graduated test goes intermittent, it goes back into the staging area and people start investigating the issue.
    • We haven't had much success getting developers to fix these things, but some of that is due to the difficulty of getting a reproducible environment with the buildbot framework.
    • Roc did the record/replay debugging back about a year ago (when that was available).
    • IDEA: The holy grail is something that can run in automation and, when it fails, gives you an environment where you could run the flaky tests and let developers work on them from there.
    • In reftest, you can mark things as fails-if or random when they go flaky. You're still running them then, but you don't differentiate between something that fails intermittently and something that starts to fail permanently (a rough sketch of tracking that distinction follows this list). This also deserves its own topic.
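
A sketch of the frequency-tracking idea from the last bullet (the window size and threshold are invented for illustration; this is not an existing OrangeFactor feature): keep a sliding window of recent results per known-intermittent test and flag a test when its failure rate jumps from occasionally orange to effectively perma-fail.

 # Hypothetical flaky-test tracker: flags a known-intermittent test when its
 # recent failure rate suggests it has become a permanent failure.
 from collections import deque

 WINDOW = 20            # how many recent runs to consider (assumed)
 PERMAFAIL_RATE = 0.8   # failure rate above which it is no longer "intermittent"

 class FlakyTestTracker:
     def __init__(self):
         self.history = {}    # test name -> deque of booleans (True = failed)

     def record(self, test, failed):
         self.history.setdefault(test, deque(maxlen=WINDOW)).append(failed)

     def looks_permafail(self, test):
         runs = self.history.get(test, ())
         if len(runs) < WINDOW:
             return False                   # not enough data yet
         return sum(runs) / len(runs) >= PERMAFAIL_RATE

 if __name__ == "__main__":
     tracker = FlakyTestTracker()
     for failed in [False] * 15 + [True] * 20:   # test goes from green to red
         tracker.record("browser_flaky_test.js", failed)
     print(tracker.looks_permafail("browser_flaky_test.js"))   # -> True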

Platforms

  • We could break the idea that test systems need to be the same as the shipping platform for perf. If we break that, it becomes easier to bring in new platforms (for both capacity and perf).
  • Every time we bring in a new platform, we break lots of tests.
  • We never have enough engineering resources dedicated to making that work. We really shouldn't start a new platform without enough resources committed to it; otherwise the broken tests just aren't going to get fixed, and we waste a ton of time trying to track them down.

Initial conversations with some people in https://etherpad.mozilla.org/engineering-productivity-pain-points

Sheriffing

Integration

Some of the plans that people have

  • large wooden rabbit - future releng job scheduler
  • mozharness - to replace the "test runner" portion of buildbot

Questions

  • Why is it so difficult to have a proper staging setup for the buildbot environment?
    • Because our buildbot setup is complicated and not packaged for easy, replicable deployment.
    • We have extended buildbot far past what it was designed for.
    • Lots of stuff has been added haphazardly over the years.
    • It is theoretically possible to replicate a buildbot environment locally, but it is very complex and difficult.
    • Everything is packaged together, including private keys for FTP access and other things that don't need to be, or shouldn't be, shared.

Next steps

  • Review the initial conversations with some people in https://etherpad.mozilla.org/engineering-productivity-pain-points
  • Reproducibility issue (in general) - the biggest problem
  • Orange issue (specifically)
  • VMs for capacity and orange repro'ing (DBurns will speak to friends at testing services)
  • Expanding the infrastructure to be more developer- and service-friendly.
  • Measuring tests - how do we know what we're actually testing? Where are we with code coverage?
  • Making it simpler to add tests/platforms.
  • Automatic regression hunting wired into TBPL - what would we need to write to do this?
  • When a new platform is mentioned we should push back to product and say "you can have that test but we don't have the resources to track down these issues".
    • We can allocate A-Team headcount for it, but it would be good to have platform-team headcount too.