Summit2013/Sessions/Sunday/Sheriffs
Sheriffing presentation from Mozilla Summit 2013, Santa Clara
KWierso: Who has heard of the sheriffs? Who has interacted with us?
[show of hands]
I think most people here already know a fair amount about sheriffing, but I'll go ahead and explain the basics for anyone who doesn't.
Generally, sheriffing is making sure the process of landing patches at Mozilla runs as smoothly as possible. People write patches, hopefully get them reviewed, hopefully push them to Try to get them tested, and then land them. Landing triggers a bunch of test jobs, and the results appear on TBPL. Sheriffs watch TBPL to see whether the tests succeed.
Tests can break because of bad code, infrastructure failures, bad slaves, network errors, or something that needs a clobber. A lot of what we do is playing detective: figuring out why something failed, what we need to do to fix it, and whether we need to close the tree.
If developers don't have commit access, they can mark bugs checkin-needed and one of us (usually RyanVM) will check it in for them.
RyanVM: That started as something I did as a community member before I was hired as a sheriff. I've been involved with the Mozilla community for almost 10 years now; I started out by writing basic patches, nothing complicated, just anything simple I found. Early on I relied on other people to land my patches, usually people I knew. When I got commit access myself, I wanted to make it easier for new contributors, especially people without contacts they could ask, to get their patches pushed. When I became a sheriff, that just got rolled in with my other duties.
KWierso: To make things harder, tests can also fail intermittently. When a new failure shows up, we have to figure out whether it's a known failure or a new one. We use TBPL to mark failed jobs as known bugs.
Code usually lands on integration branches used by different groups of people (b2g-inbound, mozilla-inbound, fx-team, etc.). We sheriffs usually try to merge those branches with mozilla-central once or twice a day.
edmorley: Then we update bug states to say when the bug is fixed, the milestone it landed in, and so on.
KWierso: We're also a point of contact between developers and releng for issues that cause tree closures, since we usually have to know what's going on.
RyanVM: So, that's what this job has become. It's relatively new as an official position within the organization. Organizationally we're part of A-Team, the automation team, which develops and operates all our automated testing infrastructure. We're a small group within A-Team that's mainly interacting with external groups like Firefox and B2G developers, rather than working on code that's owned by A-Team. Our job includes teaching new contributors things like how to format and land and test their patches correctly.
KWierso: If someone asks me, what exactly is it you do here, it's hard to say exactly.
RyanVM: John O'Duinn has a monthly blog post with metrics showing just how mind-bogglingly many jobs we run on a regular basis: dozens of commits per hour and dozens of machine-days per push.
edmorley: Because of infrastructure and capacity shortages at peak times, we can end up with a backlog for a particular type of job, and we have to skip that job on some pushes. If something lands and breaks the tree but doesn't have a job run, that makes it hard to find the right regression range. Delays in running tests can also mean that we don't find out something broke for hours, by which time another twenty commits may have landed -- and then when you back out the first regression, you may find that things are still broken because of another bad push that landed in the meantime.
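As a rough illustration of what that regression-range detective work looks like when jobs have been coalesced, here is a small Python sketch. The push data layout and the trigger_job helper are hypothetical stand-ins, not a real TBPL or self-serve API:

```python
# Hypothetical sketch only: the push layout and trigger_job are invented.

def find_regression_range(pushes, trigger_job):
    """Return (last_good, first_bad) revisions, retriggering coalesced jobs."""
    last_good = None
    for push in pushes:                  # pushes ordered oldest to newest
        if push["result"] is None:
            # Coalescing skipped this job; retrigger it so the range can shrink.
            trigger_job(push["revision"])
        elif push["result"] == "success":
            last_good = push["revision"]
        else:
            return last_good, push["revision"]
    return last_good, None

pushes = [
    {"revision": "aaa111", "result": "success"},
    {"revision": "bbb222", "result": None},          # coalesced, never ran
    {"revision": "ccc333", "result": "testfailed"},
]
print(find_regression_range(pushes, lambda rev: print("retrigger", rev)))
# retriggers bbb222, then reports ('aaa111', 'ccc333') as the current range
```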
RyanVM: Treeherder is a new tool we're building to replace TBPL -- and not just TBPL itself, but also a whole lot of the scattered data sources that TBPL draws from.
edmorley: One problem with TBPL is that we do the data "joins" in the front-end. The data for pushes, jobs, etc. all lives in different places. That means simple questions like "have all the jobs for this push completed?" can't be answered at the TBPL level. Just to figure out how to display things, we have a bunch of regex filters that map IDs to friendly names. If we had a single source of truth in Treeherder, we wouldn't need to keep all these lists in sync across TBPL and other tools like OrangeFactor. We want to have Treeherder ready in Q1 (but in Mozilla style, that might not mean Q1 next year). :)
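To make the regex-filter point concrete, here is a hedged sketch of the kind of buildername-to-symbol mapping a TBPL-style front-end keeps. The patterns and symbols below are invented for illustration; they are not TBPL's actual tables:

```python
import re

# Illustrative only: map raw buildernames to short, friendly job symbols.
# TBPL keeps similar (much larger) lists in its front-end; a single source
# of truth in Treeherder would make them unnecessary.
BUILDER_PATTERNS = [
    (re.compile(r"mochitest-(\d+)"), lambda m: "M" + m.group(1)),
    (re.compile(r"crashtest"),       lambda m: "C"),
    (re.compile(r"xpcshell"),        lambda m: "X"),
]

def friendly_name(buildername):
    for pattern, render in BUILDER_PATTERNS:
        match = pattern.search(buildername)
        if match:
            return render(match)
    return "?"  # unknown job type -- one way these lists drift out of sync

print(friendly_name("Ubuntu VM 12.04 mozilla-inbound opt test mochitest-3"))  # M3
```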
glob: Do you have documentation on how to sheriff?
edmorley: That's a Q4 goal. Part of the reason we don't have it yet is the way this role has grown from something that was done by volunteers and learned as people went along.
mbrubeck: I wrote up a basic developer guide to using TBPL which is part of this MDN page: Committing Rules and Responsibilities
glob: With sheriffs becoming paid staff, is there also a gap on the weekend?
RyanVM: We've talked about changing our schedules to have some hours on the weekend and some time off mid-week. We still have philor (who is a volunteer) watching the tree on weekends, and most of us also peek in on weekends.
dividehex: Where does the data for TBPL come from?
edmorley: Test data comes from logs from the test slaves; there is also a JSON file of the most recent jobs that's generated and put on a server; TBPL then has a cron job that pulls this JSON and updates a database. The use of all these different crons and the TBPL UI refresh interval mean that there can be up to about 15 minutes from when a job actually fails to when someone notices it and starts fixing it.
mbrubeck: You could turn things orange even faster if the slaves notified the system about failures while the tests were still running, instead of us parsing the log after the whole job is complete. (But that would probably require major infrastructure changes.)
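A minimal sketch of that idea, assuming a simplified log format (real test logs have many more markers than the single one checked here):

```python
# Hedged sketch of the early-orange idea: watch a job's log as it streams and
# flag the push as soon as a failure line appears, instead of parsing the
# whole log after the job finishes.
def watch_log(lines, mark_orange):
    for line in lines:
        if "TEST-UNEXPECTED-FAIL" in line:
            mark_orange(line)
            break   # one failure is enough to turn the job orange early

watch_log(iter([
    "TEST-PASS | test_a.js | ok",
    "TEST-UNEXPECTED-FAIL | test_b.js | timed out",
]), mark_orange=print)
```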
RyanVM: You can help watch the tree even if you don't have commit access; you just need to ask on IRC for help when you need something backed out or the tree closed. (That's what philor does when he's at his day job.)
edmorley: Some of the problems that made our job harder in the past, like coalescing of jobs, have been improved by work to increase capacity and infrastructure reliability. Something else that's going to help a lot is autoland, where people can mark a bug for landing in Bugzilla and a script will check the status of the bug and land it automatically. It could do things like land at off-peak times when we have more free capacity. We'd also like to automate things like bisecting, and make coalescing smarter. That could use data from Treeherder, which will know which pushes are completely busted.
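A hedged sketch of the autoland idea, assuming hypothetical helpers for the Bugzilla and push steps (none of these are real APIs):

```python
from datetime import datetime, timezone

# Hedged sketch of autoland: poll bugs flagged for landing, check their
# status, and push at off-peak times when test capacity is free. The helper
# functions passed in are hypothetical stand-ins, not real Bugzilla or
# RelEng APIs.

def off_peak():
    # Very rough: treat 06:00-13:00 UTC (overnight in North America) as off-peak.
    return 6 <= datetime.now(timezone.utc).hour < 13

def autoland_once(fetch_flagged_bugs, is_reviewed, push_to_inbound):
    if not off_peak():
        return
    for bug in fetch_flagged_bugs():
        if is_reviewed(bug):          # only land patches that have review
            push_to_inbound(bug)      # sheriffs still watch the resulting jobs
```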
dividehex: Are there opportunities to contribute to Treeherder?
edmorley: Yes! There's an IRC channel and a weekly meeting if you want to find out more.
One problem we want to solve is customizing views for users in a finer-grained way than the binary "hidden" flag for jobs in TBPL.
mak: What can developers do to help your work?
KWierso: Fix intermittent failures.
RyanVM: Fixing intermittent oranges is very helpful because it increases our signal to noise ratio. We've had success doing this on certain branches by targeting the most frequent failures.
mak: The problem is that with permanent sheriffs, nobody reads dev-tree-management, where the OrangeFactor emails come in. You might want to go directly to module owners to draw attention to the top failures in their modules.
edmorley: We've started being more proactive about setting needinfo on bugs. We also want to switch off tbplbot comments in bugs so they don't spam developers who are working on them, and replace them with weekly summary reports. We still have some work to do to make that possible. Instead of abusing the comment mechanism, we can link to data and graphs from Treeherder / OrangeFactor.
RyanVM: Another problem for developers is that they may not have access to the platform or hardware where the test is occurring, so we're trying to make people more aware they can request access to a "loaner" slave from releng.
edmorley: One other idea I was discussing with ctalbert, which might become possible as we move to test manifests, is to add new tests in a sort of "kindergarten" state where we can retrigger them separately from older tests that are known to be more stable. If we stress-test new tests before moving them into the "proven" set of tests, maybe we can weed out badly-behaved tests before they become a problem.
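A hedged sketch of how such a "kindergarten" flag might be consumed. The manifest key name and values are invented for illustration, and real Mozilla manifests are read with manifestparser rather than configparser:

```python
import configparser

# Hedged sketch of the "kindergarten" idea: a hypothetical manifest key marks
# newly added tests so they can be retriggered separately and stress-tested
# before graduating to the "proven" set.
MANIFEST = """
[test_existing_feature.js]

[test_brand_new_feature.js]
stability = kindergarten
"""

def kindergarten_tests(manifest_text):
    parser = configparser.ConfigParser()
    parser.read_string(manifest_text)
    return [name for name in parser.sections()
            if parser[name].get("stability") == "kindergarten"]

def stress_test(tests, run_test, repeats=20):
    # Retrigger each new test many times; only consistently green tests
    # would graduate into the "proven" set.
    return {t: all(run_test(t) for _ in range(repeats)) for t in tests}

print(kindergarten_tests(MANIFEST))   # ['test_brand_new_feature.js']
```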
RyanVM: Intermittent tests also hurt developers because they make it harder to trust Try results or interpret them correctly, and they make things like autoland harder to implement correctly.
KWierso: Another thing developers can do to help is making sure to run the right tests when pushing to Try.
mbrubeck: It would be nice if we had automated coverage data, so Try could just look at my patch and choose a set of test suites automatically.
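For illustration, a hedged sketch of coverage-driven suite selection. The directory-to-suite mapping is invented; real coverage data would be needed to build it:

```python
# Hedged sketch of coverage-driven Try selection: map the directories a patch
# touches to the suites whose coverage includes them. The mapping below is
# invented for illustration.
COVERAGE_MAP = {
    "layout/": {"reftest", "crashtest"},
    "dom/":    {"mochitest-plain", "mochitest-chrome"},
    "js/src/": {"jsreftest", "jit-tests"},
}

def suites_for_patch(changed_files):
    suites = set()
    for path in changed_files:
        for prefix, covered in COVERAGE_MAP.items():
            if path.startswith(prefix):
                suites |= covered
    return suites or {"all"}   # no coverage data? fall back to running everything

print(suites_for_patch(["layout/base/nsPresShell.cpp"]))
# -> {'reftest', 'crashtest'} (set order may vary)
```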
RyanVM: Yeah, and we'd also like to improve our logic for not building on platforms that aren't affected by a given patch. Even if you don't want to sit and watch the tree all day, there are a lot of interesting sheriffing-related projects like this that you can work on.
edmorley: With some of these issues, we do have a lot of visibility, since we're in the middle and have contact with developers, releng, A-Team, etc. We're working on making suggestions to all these teams based on what we can see.
Mook: Will moz.build help with cutting down on platforms that aren't affected?
mbrubeck: I think that's hard because we parse moz.build at build time, and currently the scheduler doesn't do a build or even check out the tree.
RyanVM: For now we're experimenting with doing limited test subsets on certain branches like B2G-Inbound. We started that with the birch branch. We're still learning interesting things but it's going pretty well now.
On a final note, we now have sheriff coverage in the EU and US time zones, but we could really use some people to help with sheriffing during Asian work hours!