Releng Automation Plans and Pain Points

Releng has been doing a lot of reactive work, like the A*team. There are a few general ideas of what they want to do in the mid-term. Two challenging areas right now:

1. Simplify the configurations - no good idea of how to do this right now.

- what tests/builds happen on a particular branch/platform
- how to get this to be understandable and reduce the maintenance burden
- For a long time we had problems where configurations on branches would drift apart from each other: changes on one branch would not land on other branches
  - Used a lot of inheritance to fix that ^ but that made it really hard to know the effect of the changes
- Affects A*team b/c it's hard to tweak how certain tests are run on various branches and platforms.
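
One way to picture the inheritance problem: if every resolved setting also recorded which config it was inherited from, the effect of a change would be easier to see. A minimal sketch, assuming a made-up config layout (the branch names are real, but `CONFIGS`, the setting keys, `resolve` and `affected_branches` are hypothetical and not how the buildbot configs are actually structured):

```python
# Hypothetical branch configs; the layout and keys are illustrative only.
CONFIGS = {
    "base": {"parent": None,
             "settings": {"mochitest_chunks": 5, "run_talos": True}},
    "mozilla-central": {"parent": "base", "settings": {}},
    "mozilla-aurora": {"parent": "mozilla-central",
                       "settings": {"run_talos": False}},
}


def resolve(branch, configs=CONFIGS):
    """Return {key: (value, defining_config)} for the effective config."""
    # Walk from the branch up to the root of the inheritance chain.
    chain = []
    name = branch
    while name is not None:
        chain.append(name)
        name = configs[name]["parent"]
    # Apply settings from the root down so that children override parents.
    resolved = {}
    for name in reversed(chain):
        for key, value in configs[name]["settings"].items():
            resolved[key] = (value, name)
    return resolved


def affected_branches(source, key, configs=CONFIGS):
    """List the configs whose effective value for `key` comes from `source`."""
    return sorted(
        branch for branch in configs
        if resolve(branch, configs).get(key, (None, None))[1] == source
    )
```

With this, `affected_branches("base", "run_talos")` answers "which branches change if I edit this value in the base config?" before the change lands.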

2. Looking at having more flexible build/test scheduling

- Putting information about the builds into the tree has been pretty useful (i.e. in m-c)
- Have changes to the builds/tests in the tree tracked in history - talos.json clang settings is another example.
- Let other people (developers, A*team, etc.) make specific changes for various branches.
- Enables more of a "runtime" change

The latter problem can be addressed in buildbot; also looking at alternatives to bbot. The first problem might be solved by pushing all the configuration into the tree, which solves the branch problem. [jhammel++]

- that could make it much easier to test changes to these things (i.e. the try case)
- And you get the property that when you merge between branches you get the changes of the test frameworks into the new tree seamlessly.
- Merge day is a major nightmare for releng due to this.
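
A rough sketch of what reading job configuration out of the tree could look like. Everything here is an assumption for illustration: the `testing/config/jobs.json` path, its `defaults`/`platforms`/`skip`/`extra` keys and the `jobs_for` helper do not exist in the tree. The point is only that the scheduler reads the file from whichever checkout it is building, so a merge between branches carries the job configuration along with the code.

```python
import json
import os

# Hypothetical location inside the source tree; not an existing file.
IN_TREE_CONFIG = os.path.join("testing", "config", "jobs.json")


def jobs_for(tree_root, platform):
    """Return the jobs to run for `platform`, read from the checkout itself."""
    with open(os.path.join(tree_root, IN_TREE_CONFIG)) as f:
        config = json.load(f)
    defaults = config.get("defaults", [])
    overrides = config.get("platforms", {}).get(platform, {})
    # Start from the defaults, drop anything this platform skips...
    skipped = overrides.get("skip", [])
    jobs = [job for job in defaults if job not in skipped]
    # ...then append platform-specific extras, avoiding duplicates.
    for job in overrides.get("extra", []):
        if job not in jobs:
            jobs.append(job)
    return jobs
```

Because the file lives next to the code, a try push that edits it would exercise the changed job list directly.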

LWR - more of a thought experiment than anything. It's trying to look at the scheduling problem: fixing the scheduler and making it more flexible, more customizable and less opaque. Looked at 3rd party tools - mostly CI-based systems like Jenkins. Would be interested to hear about other 3rd party job schedulers.

But you don't want them tied necessarily to the remote job execution piece as well.

The reproducibility problem

One reason people don't see failures locally is that they don't run the same set of steps
- Getting that logic into scripts will go a long way to help with that (mach, mozharness etc)
- Getting VMs might not be viable to help with this - would still have issues getting repro environments
- You might be able to stop the job and wait for a developer to look at the box when it fails
- Could also keep a record of the build env for the entire job run and/or a video recording, so you have the log of CPU/memory, the screen recording of what happens on the machine, and the text log.
- Make it easier for devs to get on the machine when it is in that state - some kind of gateway web app, vnc, etc. to get onto the machine.

Turn around time

Takes a long time to get a build done (and people are working on that)
Takes a long time to run tests (and are there people working on that?)
It's a broken system if all we do is add tests and no one is looking at taking out tests that aren't working, are flaky, are slow, or what have you
We had the idea of a small sanity test suite and if that passes you then go run the full set of tests.
Doesn't help with reproducibility but it does help with turn around time for developers.
Could create a fast test suite and curate it. - 5 minutes tops
- thought experiment: what would we want in such a suite?
  - any test you push with your change should be run in the "fast suite"
  - and small pieces from various directories - one core test from several frameworks - a core set of reftest/crashtests/jsreftests
  - do the same with mochitest?
  - worker threads, javascript tests, ipc tests
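
The thought experiment above could be sketched roughly like this: tests touched by the push always go in, then curated core tests are added in order until the 5 minute budget is used up, and the full run is only scheduled if the fast suite is green. The function names, the test paths and the runtime table are all made up for illustration.

```python
def build_fast_suite(pushed_tests, core_tests, runtimes, budget_seconds=300):
    """Return (selected_tests, estimated_seconds) for the fast suite.

    `runtimes` maps a test name to its estimated runtime in seconds;
    tests with no known runtime are counted as 0.
    """
    selected = []
    total = 0.0
    # Tests included in the push always run, even if they exceed the budget.
    for test in pushed_tests:
        if test not in selected:
            selected.append(test)
            total += runtimes.get(test, 0.0)
    # Fill the remaining budget with curated core tests, in curated order.
    for test in core_tests:
        if test in selected:
            continue
        cost = runtimes.get(test, 0.0)
        if total + cost <= budget_seconds:
            selected.append(test)
            total += cost
    return selected, total


def should_run_full_suite(fast_results):
    """Gate the full run: only proceed if every fast-suite test passed."""
    return all(fast_results.values())
```

Curating `core_tests` by hand keeps the suite honest: a flaky or slow test can be dropped from the fast list without touching the full run.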

Auto-tools/Projects/Futurama/2012-10-11
