Sfink/Thought Experiment - One Minute Builds
Recently in a thread about the current build infrastructure overload and resulting fallout, someone claimed that builds should take less than a minute. A couple people responded with various reasons why that's not reasonable in actual practice, but it got me thinking: what would it take? I performed a thought experiment, and I'm documenting it here in case it reveals anything useful.
In the unlikely event that someone find this interesting and useful, please, feel free to update however you wish. Or use the discussion page if you prefer.
Update: I am updating this with the results of email discussion.
- 1 Goal
- 2 Top-Down
- 3 Bottom-Up
- 3.1 Push, Poll, Schedule
- 3.2 Build Slave Pull
- 3.3 Make
- 3.4 Compile
- 3.5 Link
- 3.6 Reporting
- 4 Conclusion
- 5 Tasks
"After a developer pushes to mozilla-central, provide useful feedback within one minute."
- After a developer pushes: I am starting the stopwatch at the moment the developer fires off the push, because what I care about is the length of the feedback loop.
- mozilla-central: I am only talking about mozilla-central. It is the big funnel point that needs to be fast to keep up with our patch rate. IMHO, there are far too many individual sources feeding into it; it would be much better to have a hierarchical system, because different levels have very different needs. The tracemonkey branch is an excellent example of this. Please pay attention to Dave Mandelin when he talks about this. The try server is also a good example; it filters out a lot of stuff that might otherwise go into the choke point.
- useful feedback: This is a good place to waffle. We have lots of feedback that is useful, but is still not required to meet the goal. Getting all useful feedback would imply that the build/test/perf runs across all platforms complete within one minute. That would be an interesting thought experiment, but it's not this one.
- within one minute: I'm going to define this as the median time being less than one minute. Obviously, there are going to be pushes that require a lot more work than others, and I'm not trying to tackle all of them. Using the median means that half of all pushes give useful feedback in a minute or less.
I will start by taking a top-down view of the overall process, to lop off portions that don't need to be tackled.
In my view, the main parts of the process are (1) build, (2) functional test, and (3) performance test. We have to be pretty harsh in deciding what counts as "useful feedback" in order to keep things under a minute.
Clearly there is useful feedback to be had without performance tests. The same can be said of functional tests. We want those results quickly too, but we have enough explosions during the build phase that the earliest useful feedback is just whether the whole tree built or not.
That leads to the question of platforms. Is it useful to have the results from just a single platform, or do you really need all of them to extract any value? In practice, I see sheriffs in METERING mode waiting just until the first one or two builds come out green before allowing the next push to proceed. So empirically, it appears that one or a small number of platforms is sufficient. (This restriction may not be needed; builds are mostly parallelizable.)
Finally, a major simplification I want to make is that because we only care about the time for a push to mozilla-central, it's fair to expect the developer to have already spent significant work somewhere "upstream" -- which probably means the try server.
So all we care about is figuring out whether the build succeeds on a single platform. The next step is to follow the push through the stages of the build pipeline to make sure that each stage is fast enough. I don't really know that much about the pipeline, but as I see it, the steps are:
- Developer pushes to m-c
- Something polls the tree every minute and detects the push (this could be done with a trigger, but my understanding is that it currently isn't. I don't know the reason.)
- A 3-minute timer is started, to allow the dev to get in any "oops" fixes before the jobs start
- A scheduler constructs a set (graph?) of jobs based on the push: a build per platform, a test dependent on each of those, and Talos runs dependent on the tests.
- The build slaves pull from m-c. We only care about one slave here, because we just want the first build result in less than a minute. The rest can take longer.
- The slave does a depend make
- The make does a bunch of compiling
- The make does a bunch of linking
- The output of the build gets reported back to the developer
I am intentionally breaking up compile from link because they require very different solutions (i.e., linking is hard).
Push, Poll, Schedule
Update: The 1-minute poll could be replaced with an hg trigger, but it would require teaching something named "pulse" (pulse.mozilla.org) about hg events. Or something.
- Not sure if you want me to comment here, but pulse already knows about hg commits and can push messages instantly (https://github.com/LegNeato/hg-broker). I rolled it out on hg.mozilla.org and ran into some deadlocks I need to clean up though. Once rolled out it should enable instant notifications via AMQP -- Christian (email@example.com)
The 3-minute timer is probably unnecessary, because it sounds like in practice the "oops" moment happens when the developer gets the first failed build result. Also, the 3 minute lag causes accidental merges where someone else pushes in the 3 minute window, and the first pusher gets the build including the second pusher's bits. (Did I get that right?)
Build Slave Pull
Update: Apparently, try server slave pulls are very different from m-c slave pulls. Try slaves have one cloned repo per build type, and tend to wipe them out frequently and start over. This is being addressed by bug 589885 and possibly by using the shared repo hg extension.
An approach suggested by Thunderbird people but never implemented is to pull onto one slave, then zip that up, upload to an FTP server, and then have all the other slaves pull that down. Alternatively, if the slaves could see each other they could rsync the source tree or the repo from the "master slave" (heh).
My original approach
The simple workaround would be to have a "chosen slave" who gets to go first. It'll be fast and everyone else will be slower. That serves our goal, but is gaining latency at the cost of throughput.
I'm going to confess that I still don't entirely understand the hg data model, and instead pretend that it's the original git data model: everything is a wad of data addressable by its SHA-1 hash value. They all live together in one big happy pile, and operations can pull from multiple piles if they want to. (Not even git uses this anymore; they're packing together multiple revisions of the same file now just like hg does.) Now, git and hg are more or less isomorphic, so if I rely on my naive git data model, it should be applicable with some tweaks. (And if that's too handwavy for you, then switch all the slaves to using the git client, still going against the hg repo with Scott Chacon's git-hg extension. Then you can still pull from the hg repo but make use of supplemental git object repos. Phhbbbbtt!)
Taking a broader view, it would be better if the m-c server broadcast any incoming rev to all slaves. That would (1) save network bandwidth (via the broadcast), (2) parallelize the data transmission with the polling and scheduling, and (3) remove contention on the m-c repo server when it's scanning through its data to figure out what to send to each slave. (The slave will still end up needing to do that work, but now it's all local; it'll have everything it discovers it needs.)
We're only dealing with the happy case of mozilla-central here, so we can (almost) always use depend makes. Often enough that the median is reliably a depend make, at least.
The depend make takes uncomfortably long by itself, probably largely because of the recursive invocations, but I'll assume for now that it doesn't take enough of a minute to deal with here.
pymake has some plans to magically derive the global dependency tree from the recursive makefiles.
This is the heart of the problem. Compiling, even when using dependencies, is going to take more than a minute.
Update: Option 2 below is new. Option 1 has some updates.
Option 1: Fast Compiles on the local box
One way to resolve this is with a magical compiler that goes really really fast, using precompiled headers and sub-file dependency tracking and pixie dust. I suspect this is technically possible, but people have been working on this for a long time and I want something that is doable today. Well, today-ish. Update: There are reports that precompiled headers on Windows gain you as much as ccache.
Option 2: Fast Compile via Distribution =
I'm not sure why I didn't mention distcc before. I use it heavily. I had some notion that the build slaves are already fairly evenly loaded, but that doesn't matter if we're ok with trading throughput for latency.
This is simple enough: make all build slaves distcc hosts. To go to the extreme of speeding up one build, give remote jobs priority over local ones and only allow the "chosen" slave to use distcc. Less extreme but more practical might be to give the chosen slave priority over others, assuming this is possible with distcc. (And if it isn't, modify it!)
Option 3: Not Compiling
The other route is to not do the compiles at all. Ccache is good for that, if you've already done exactly the compile you want on the exact system you're building on. That's not true here, so what do we need to change to make it true enough?
If the same code, or close to it, was already compiled on the try server (or another upstream system) through ccache, then we could reuse its cache. A simple and obvious way to do this would be to put the ccache on a network share. Assuming the cache is big enough, then any compile with identical inputs would be able to pull its output straight from the cache and avoid the compile.
My personal experience of putting ccache on a network share was horribly scarring. It would be especially bad if multiple readers are accessing it. You might as well do the build yourself.
So let's go back to the broadcast (multicast?) approach: when the try server slave is building, it spews the cached files out over the network and every slave grabs a copy. Logically, you can just say that you have a ccache directory shared over the network, but all of the clients of that directory aggressively prefetch it and cache it locally. Use a broadcast protocol to save bandwidth. This is a thought experiment, remember? (Don't worry; you can probably do without any broadcast/multicast messiness. See below.)
That leaves just a few problems: (a) the try server may not have done a build close enough to the desired one for its cache to be useful, (b) the ccache would be ridiculously big and burn insane amounts of bandwidth, and (c) the ccache data might take too long to get copied over to the m-c build slaves and so would not be available when needed.
(a) Prebuilt bits must be relevant
(a) is not a problem if the developer rebased to m-c before doing the try build. These days, that's not a safe assumption. So either we enforce a behavior change (eg don't allow an m-c submission if the base is too old), or we automate it by making the try server do the rebase itself. The latter is tempting -- in fact, if we really only want the build from one platform, then you could leave the try server job scheduling as-is except add in a separate "rebase build" that did the automatic rebase and built just on that platform. That has the added benefit of notifying the developer when a rebase might not work automatically and might require some manual intervention. To be fancy, you could block the m-c submission until that build was complete.
Either way, that should be enough handwaving to demonstrate that in principle it's possible to expect that a build has been done that shares most of the source bits in the eventual push.
The thought experiment that this is, let's consider the extreme: if all m-c pushes are gated through the try server, and the try server requires an absolutely up-to-date source tree, then you could just grab the build directly from the try server and never do the build (or tests) at all on m-c. But that's going around in circles; the try server would become m-c, and you'd inherit all of the same problems that you started out with.
(b) Cache size and bandwidth and (c) lateness
The size is theoretically not a problem, because the cache uses LRU eviction. But in reality there would be so much crap getting jammed into it that the useful bits would always be evicted before you needed them. Also, bandwidth is still an issue: all those useless compile results would be broadcast over the network, and many would be thrown out without being used.
One fairly big hammer would be to have the developer mark a push as an imminent m-c commit. Only those pushes would participate in the shared ccache. (And perhaps only those builds would get the automatic rebase build?)
That may not be enough. So perhaps pushing all build bits immediately is excessive; maybe we really need to send only the bits that are guaranteed (or at least likely) to be useful to a particular slave. The problem is that if we wait for the build slave to actually request each build artifact before sending it, then we're back to the networked ccache suckage. Perhaps there's an intermediate solution: wait for the build slave to request the first bits, but then start sending over everything it's going to need after that. The build will then be limited by the time required to transfer all of the data, but there will be no embedded pauses. So how long does it take to send the bits?
I have a debug obj directory that is 3.4GB. That's... big. But we only care about compiles, so looking at just the .o files, it's 1.1GB. Let's say we have a gigabit network dedicated to this craziness. I have a lot of experience streaming video files over gigabit, and if you're going one way almost all the time, you can actually achieve 97% of the theoretical bandwidth without too much pain. In our case, the traffic is always from the try build slaves to the m-c build slaves, so we really only need half duplex for this dedicated network. On the other hand, these machines are doing a bunch of stuff besides streaming bits, so let's say we only get 50% of the maximum. That puts us at 19 seconds for the data transfer. That eats up a third of our minute, but it might work.
One quick way to make this better: only play these games for optimized builds. They require far fewer bits.
I wonder how well those object files compress. That might help. Or hurt.
I wonder how similar different versions of object files are? If they are substantially self-similar, perhaps a binary diff might save a lot of time and bandwidth. We could use rsync's batch update mechanism.
So once the slave requests the first object file, how do we know what other ones to send? One option would be to modify ccache to remember. (Tell it when you're starting and stopping a build, and tag everything used in between.) Or diff the ccache repo before and after a build. (Not so good if you're accelerating the try builds with ccache too; you'd miss anything that hit in the cache. But perhaps if you had atime getting updated, you could grab anything accessed in between.) Theoretically, you could tie ccache into hg and associate build outputs with collections of hg SHA-1 inputs, but it's not direct: ccache is indexed by preprocessed source files. So you'd have to say "these SHA-1s get preprocessed into this ccache hash key".
Pretending for the moment that we've resolved the compile problems, next up is linking. ccache doesn't help with linking. It doesn't pay attention to the link, nor should it: most of the time is spent on I/O. (Linking is the process of taking a bunch of really big files, updating various pieces of them, and writing out the really really big result.)
Update: Someone mentioned that I may be totally off-base about linking being I/O bound. They may be right. They are right for PGO links, I guess.
I don't know where we are in terms of link time. I'm guessing that a minute is going to be quite a challenge though.
On Linux, at least, the first hammer to use would be to switch to the gold linker instead of ld. It's supposed to be something like 15% faster, but in a previous life I saw an order of magnitude speedup with it on a large, heavily templatized g++ code base. (We're much larger, but way less template-happy.) It also isn't 100% compatible, and just today someone reported that it doesn't build FF on linux64.
I don't know enough about linkers to really do any better than that. At least, not without going completely crazy. So I'll try going completely crazy:
For the purpose of getting rapid feedback, we don't actually need the final binaries. And yet, I'm not willing to say that simply seeing whether everything will compile is good enough. A lot of problems only crop up during the link. So we need something that can quickly tell us whether the link would succeed without actually generating the final file.
The first approach to this is trivially easy: output to /dev/null. Unfortunately, it only works for the final binary. And I don't know for sure whether it actually works. (Linkers very possibly read their output as they're generating it.) Perhaps outputting to /dev/shm and deleting the result would be better.
The second approach would be to modify the linking process to only include the interesting bits in the output file. "Interesting" in this case means "necessary to show problems with the link". So you'd want to include all of the symbols that are supposed to be there, but the contents of those symbols need not be accurate. (For example, you can throw out all static functions or anything else with internal linkage or hidden visibility.) It seems like you ought to be able to produce much, much smaller files as a result, which takes away much of the I/O burden and memory pressure from linking. It seems to me like it ought to help, but I'm just making shit up here.
Maybe it would be most practical to combine these: send all object files and link outputs to /dev/shm. Use objcopy to copy them back to their real destination, stripping out the bits that are unnecessary for checking correctness.
If you're really clever, you have these "fake link" jobs then queue up a real link that is postponed until after the results are reported to the developer. That seems hard and messy, because those results could be used in various ways, and you'd have to record everything that happened to them.
Easier would be to do the fake link, report the results, then delete all fake-linked outputs and rerun make. Then you'd report again. It means multiple reports from a single build, but you might be able to finesse that (queue up a "real build" job behind the "fake build" job, forcing it to use the same machine, and then each job has a single report output but the 2nd would reuse the 1st's compile.)
Of course, this burns throughput for latency, which may not be a great idea given that we are facing a throughput bottleneck as well.
I don't know how much delay there is in the reporting system. Right now, I think the build logs get mailed over to tinderbox, which parses through them for errors, and then a job comes by and constructs the waterfall page out of them. I believe there are plans to leave the build logs on the build slaves and expose them from there, though. So if it's slow now, there's already something coming along to fix it.
- Christian here again. Buildbot masters currently send messages into pulse, so you could get instant results pushed via that (though you may need to inline the logs). It's what is being used for the war on orange. To see some build messages flowing through, take a look at the live message stream
No particular conclusion. I maintain that fast feedback is not fundamentally impossible. I'm not sure whether any of the ideas from this thought experiment are worth pursuing, but I hope that going through the whole thing might spark ideas for improvement in the minds of people who might actually do something with them.
Update: I though I should take a stab at listing out the tasks required to implement these ideas. Some of these are mutually exclusive. Many are bad ideas. But it's probably easier to see them all in one place.
Push to Central Repo
No change needed.
- Set up an hg trigger and teach pulse how to understand it.
- https://github.com/LegNeato/hg-broker. Christian rolled it out on hg.mozilla.org and ran into some deadlocks. Once rolled out it should enable instant notifications via AMQP
- Eliminate the 3-minute stable tree timer.
Pull to Slave
- Reuse existing repos (bug 589885)
- Use the hg share extension to avoid a fresh clone
- Clone to one slave, then rsync the source tree to the rest
- Clone to one slave, then rsync the repo to the rest and update on each
- Rsync the central repo and update -r from that
- Broadcast/multicast updates to the central repo to all slaves' repos
Path 1: Faster local compile
- Precompiled headers: khuey has taken a crack at this for one subtree using cl.exe
- See "Get global dependencies" in the next section
- See "switch to ccache 3" below
Path 2: Faster distributed compile
- Deploy distcc on all slaves
- Figure out how to manage distcc priorities, or implement a way
- Identify the "fast" builders that should use distcc, or have higher priority with distcc
- Get global dependencies
- Use pymake to extract them from the current files
- Rewhack the whole makefile setup to not be recursive
Path 3: Ccache hacks/improvements
- bug 588150 - switch to ccache 3. Speeds up builds, apparently.
- Figure out how to pull ccache objects from one cache to another
- Compute which objects to pull for a given build
- Option 1: Compare the cache directory before and after, using access times
- Option 2: Modify ccache to record the needed information
- Option 3: Graft hg and ccache together so ccache can utilize hg's SHA-1 checksums
- Do nothing
- Enable incremental linking for Windows
- Get someone to implement a reusable incremental linker for Linux
- Figure out objcopy (or ?) hacks to make a check-only link
- Get someone who understands linking to add a check-only mode
Stuff to do, I guess, when the build logs move over. But not really related to this goal. Unless we add in incremental information updates (see the logs as they are happening.)