There seems to be fairly general agreement that we need a generational GC in order for Firefox to compete with Chrome. Benchmarks that create a lot of short-lived objects are the problem, for the following reasons:
- We can use a gigantic heap to avoid ever collecting during the benchmark. However, this causes our live set to be scattered across the enormous heap, so we have poor cache locality and maybe TLB misses. Also, it obviously takes a lot of memory.
- If we do collect during the benchmark, we are at an even bigger disadvantage. To collect, V8 just has to move a few live objects from the nursery to the tenured generation. We have to mark all live data (not just the stuff in the nursery) and then we have to sweep the entire heap.
So the goals of moving to a generational collector are to increase cache locality and to make collections less expensive. Most successful collectors these days tend to have the following basic outline.
- They use a small nursery.
- Objects that live long enough are copied from the nursery to a tenured generation, which is collected using mark and sweep.
- The tenured generation may be compacted occasionally to prevent fragmentation.
Collectors like Sun's G1 divide the heap into many small regions; the nursery is just the set of regions into which they're currently allocating.
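To make the nursery-to-tenured copy step concrete, here is a minimal hypothetical sketch in C++. Everything here (Cell, tenure, the forwarding-pointer field) is invented for illustration; the real engine would allocate from a proper tenured space rather than with new.

```cpp
#include <cstring>
#include <vector>

// Minimal sketch of the nursery -> tenured copy step, using invented
// types. Copying collectors leave a forwarding pointer in the old copy
// so that every later reference to the same cell can be redirected;
// that is what `forward` models here.
struct Cell {
    Cell* forward = nullptr;   // set once the cell has been evacuated
    size_t size;               // total size in bytes, including header
    // ...payload follows...
};

// Copy one live cell out of the nursery, or return the existing copy.
Cell* tenure(Cell* cell, std::vector<char*>& tenuredBlocks) {
    if (cell->forward)
        return cell->forward;          // already moved: reuse the copy
    char* mem = new char[cell->size];  // stand-in for tenured allocation
    std::memcpy(mem, cell, cell->size);
    tenuredBlocks.push_back(mem);
    Cell* copy = reinterpret_cast<Cell*>(mem);
    copy->forward = nullptr;
    cell->forward = copy;              // leave a forwarding pointer behind
    return copy;
}
```

Note that tenuring the same cell twice must yield the same copy; that is the whole point of the forwarding pointer.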
Note: generational collection will not necessarily make GC pause times any shorter. To shorten pauses, we could make parts of the collector incremental or parallel. We would also have to look at the cycle collector; I don't yet have a good understanding of that code.
The main problem with generational collection is that it requires objects to be moved out of the nursery, so all pointers to those objects must be updated. Consequently, it must be possible to identify all roots precisely. We cannot use conservative scanning without some modification.
However, there is a hybrid scheme that might work, first proposed by Andreas. We could scan the C stack conservatively. Any potential root inside the nursery would be "pinned", meaning we would not move it to the tenured generation. Other objects would be copied as usual.
We would have to watch out, when allocating into the nursery, that we don't allocate over a pinned object. We could do this in a bump-pointer allocation scheme by treating the beginning of the first pinned object as the end of the nursery. When we reach this point, rather than triggering a nursery GC, we would advance over the pinned object and set the new limit to the next pinned object.
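A hypothetical sketch of this pinned-aware bump allocation (all names are invented; a real nursery would also handle alignment and object headers):

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Sketch of bump-pointer allocation that skips pinned objects.
// `pinned` holds [start, end) address ranges of pinned objects,
// sorted and non-overlapping.
struct PinnedRange { uintptr_t start, end; };

class Nursery {
    uintptr_t cursor;                 // next free byte
    uintptr_t limit;                  // allocation stops here
    uintptr_t nurseryEnd;             // real end of the nursery
    std::vector<PinnedRange> pinned;
    size_t nextPinned = 0;            // first pinned range past `cursor`

public:
    Nursery(uintptr_t begin, uintptr_t end, std::vector<PinnedRange> p)
        : cursor(begin), nurseryEnd(end), pinned(std::move(p)) {
        // Treat the start of the first pinned object as the initial limit.
        limit = pinned.empty() ? nurseryEnd : pinned[0].start;
    }

    void* allocate(size_t bytes) {
        for (;;) {
            if (cursor + bytes <= limit) {
                void* result = reinterpret_cast<void*>(cursor);
                cursor += bytes;
                return result;
            }
            if (limit == nurseryEnd)
                return nullptr;  // nursery exhausted: trigger a minor GC
            // Hop over the pinned object and set the new limit to the
            // start of the next pinned object (or the nursery's true end).
            cursor = pinned[nextPinned].end;
            ++nextPinned;
            limit = nextPinned < pinned.size() ? pinned[nextPinned].start
                                               : nurseryEnd;
        }
    }
};
```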
One danger with conservative collection is that we may spend a lot of time scanning the stack. We want to make nursery collections very, very fast. If we have a huge C stack to scan every time, we may not be able to reach that goal. We could try to limit this problem as follows. We could identify the functions that take up the biggest proportion of the stack (ones that appear frequently on the stack and have big stack frames). We could convert these functions to accurate collection, using one of the techniques described below. Then we could skip over those stack frames during the conservative scan.
To avoid pinning, we need to know the exact set of roots. V8 does this by wrapping all pointers in handles (pointers to pointers). The set of all live handles is known; they are typically used in a stack-like fashion, so it isn't too difficult to manage them. Moving an object causes the handle to be updated. Nothing on the C stack ever needs to change.
Objective Caml uses a slightly different "shadow stack" scheme. Pointers into GC memory are stored not on the C stack but on a separate stack that contains only pointers. It's easy to scan this stack and update its pointers during GC. OCaml uses macros to make shadow stack pointers look just like normal pointers.
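A rough C++ sketch of the shadow-stack idea (all types and names are invented; real implementations register slots far more cheaply, and OCaml does this with macros rather than RAII):

```cpp
#include <cstddef>
#include <initializer_list>
#include <vector>

// Sketch of an OCaml-style shadow stack. Instead of keeping GC pointers
// only in C locals (which a moving collector cannot find or update),
// each frame registers the addresses of its pointer slots on a side
// stack that holds nothing but roots.
struct Object;  // a GC-managed object

static std::vector<Object**> shadowStack;  // addresses of live root slots

// RAII frame: registers some slots on entry, pops them on exit.
class RootedFrame {
    size_t base;
public:
    explicit RootedFrame(std::initializer_list<Object**> slots)
        : base(shadowStack.size()) {
        for (Object** s : slots) shadowStack.push_back(s);
    }
    ~RootedFrame() { shadowStack.resize(base); }
};

// When the collector moves an object, it walks the shadow stack and
// rewrites every slot that pointed at the old location.
void updateRoots(Object* from, Object* to) {
    for (Object** slot : shadowStack)
        if (*slot == from) *slot = to;
}
```

A function body would then look like `Object* obj = ...; RootedFrame frame{&obj};` and use `obj` normally; the collector can relocate it at any safe point.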
Whichever one we choose, we will have to deal with the inevitable correctness bugs caused by failing to identify all the roots. So we should modify Sixgill to help us out here. Brian says this shouldn't be too hard, in theory. However, actually implementing it, making the results understandable, and setting up the infrastructure to get it to run on Tinderbox every time would still be a significant investment.
- Obviously we need to decide whether to use conservative or accurate collection. An easy way to "decide" this is to implement pinning. If we don't meet our performance goals that way, then we can do accurate collection. Almost all of the work we would already have done would still be applicable, except maybe some tuning. Alternatively, we could try to model the behavior of conservative and accurate collectors and test them on simulated workloads. However, this would be a lot of work and it wouldn't necessarily be very predictive of real performance.
- Another big decision is whether to continue to divide up memory into arenas based on allocation class (GCKind). Most collectors that I know of use a single nursery for objects of all sizes (except huge objects, which are not nursery-allocated). This probably increases cache locality. However, our JSObjects are so big right now that I'm not sure it matters. We should probably do some experiments here to determine the impact of either scheme. We also need to try to reduce object size.
- Should the collector be incremental? Incremental collection means that we collect in small steps, allowing JS code to run in between. It's typical for the tenured generation to be collected incrementally. (To avoid read barriers, the nursery must be collected atomically. For the same reason, compaction cannot happen incrementally.) Incremental collection requires a more complex write barrier, but it's probably worth the cost.
- Should the collector be concurrent? Concurrent collection is where JS code runs at the same time as the collector, but in a different thread. To collect concurrently, we would have to make our data structures synchronized. This isn't too hard. I suspect that the hard part would be the additional debugging effort.
- Should the collection be parallel? This is where we use multiple GC threads at once. I doubt this is worth it in the near future.
I think we should try to be as fast as V8 on Earley-Boyer. We should also remain competitive on Splay. We should probably identify some additional goals, though. Since we're probably targeting pause time as well, we should pick a maximum value, like "at most 20ms pause time, 95% of the time". This has been shown to be a poor metric for hard real-time collectors, but I think it should suffice for our purposes.
- As a first research step, it would make sense to look into how other collectors work. We should post this information in Bugzilla, as was done for JM. Certainly we should understand pretty deeply how V8's GC works, as well as JSC's. OCaml has a very fast incremental, generational collector. Sun's G1 collector for Java is considered state-of-the-art, at least for big servers. It is extremely concurrent and parallel. Finally, the Mono collector is supposed to be conservative and copying. It would be good to see what they do. All of these collectors are open source.
- We need to get more data to make an informed decision between accurate and conservative scanning. A lot of this is already out there, like the average stack size and number of pointers on the stack. We should also try to measure the time to do a conservative scan in isolation, which we can do by running the scan once without doing any marking, and then doing it again normally.
- Look into the bug databases for V8 and JSC, trying to find cases where their GCs perform badly. These would make good benchmarks for us.
- We may want to do some modeling to select the best collector design. This might involve taking traces of object allocations, property updates, etc., and then feeding this data to simplified implementations of various collector designs. Implementing prototypes of a few collectors would also make the actual collector implementation a lot smoother.
- The first implementation step would probably be to add a write barrier to SpiderMonkey. The write barrier's job is to maintain a "remembered set". This is a set of objects in the tenured generation that have pointers into the nursery. These objects are used as additional roots when collecting the nursery. The write barrier works as follows. Every time JS code does "x.f = y", we would check whether x is outside the nursery and y is inside it. In that case, we would add x to the remembered set. The main difficulty here would be to identify all the places where these updates happen without forgetting any. Hopefully there are a few pinch points. Obviously, we would want to measure the cost of the write barrier.
- Then we would start scanning the JS stack accurately.
- The method JIT would need to be changed to take into account that objects on the JS stack may move during stub calls. So we would need more syncing code. I'm unsure of how the trace JIT would need to change.
- We have to make sure that all other roots are such that we can update them if an object moves. This may involve changing our rooting APIs, as well as their callers. It sounds like it's fairly common for C++ objects to point directly to JSObjects.
- Then we need to implement the generational collector. That is, we need a separate nursery and a facility to collect the nursery without collecting the tenured space. It's hard to see how to break this into independent parts. However, getting a very simple implementation working shouldn't be an enormous amount of work---perhaps one or two weeks. Given that we would already have tried a few designs (and actually implemented simplified versions of them), I'm optimistic about coding time.
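The write barrier described in the first step above might look something like this hypothetical C++ sketch. JSObjectStub, isInNursery, and the globals are invented stand-ins, not real SpiderMonkey APIs; the real fast path would be inlined into the JIT-generated store.

```cpp
#include <cstdint>
#include <unordered_set>

// Sketch of the "x.f = y" write barrier. `isInNursery` stands in for
// whatever address-range check the real heap would provide.
struct JSObjectStub { /* fields... */ };

static uintptr_t nurseryStart, nurseryEnd;

static bool isInNursery(const void* p) {
    uintptr_t a = reinterpret_cast<uintptr_t>(p);
    return a >= nurseryStart && a < nurseryEnd;
}

// Tenured objects holding pointers into the nursery. These become
// extra roots when collecting only the nursery.
static std::unordered_set<JSObjectStub*> rememberedSet;

// Called on every "x.f = y". The fast path is just two range checks;
// only tenured -> nursery stores take the slow path.
void writeBarrier(JSObjectStub* x, JSObjectStub* y) {
    if (!isInNursery(x) && isInNursery(y))
        rememberedSet.insert(x);
}
```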
Additional Steps for Conservative Collection
- If conservative stack scanning is too expensive for us to meet performance goals, we may have to start converting the larger stack frames to use accurate collection. Hopefully there will be some low-hanging fruit. If not, we will have to go to fully accurate collection.
Additional Steps for Accurate Collection
- Design and implement a new API for identifying roots. Something like handles or a shadow stack.
- Implement a static analysis in Sixgill to find rooting bugs.
- Change the entire JS engine and browser to use the API. I have no idea how hard this would be.
All of these are desirable. Some of them could be done in parallel with other work.
- Implement compaction for the tenured generation. This reduces fragmentation, and most collectors seem to do it.
- Try to eliminate as many finalizers as possible. Reduce the amount of malloc'd memory that JSObjects point to, so we get everything under our control.
- Try to eliminate recursive marking and use a more traditional iterative scheme. With some effort, it feels like this should be a performance win.
- Make our allocation path as quick as possible. Bug 606631 does this for simple constructors, but it's pretty ugly. As the JITs improve, with support for function inlining and combining SETPROP PICs, we could make this change a lot more cleanly.
- Make objects smaller. I feel like, with some effort, there are a number of fields we can eliminate. This would decrease our cache footprint.
- Incremental collection.
- Concurrent collection.
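As a sketch of the iterative marking idea above (all types are invented; `children` stands in for whatever tracing hook the real engine uses to enumerate a cell's outgoing pointers):

```cpp
#include <vector>

// Iterative marking with an explicit mark stack, replacing recursion
// through nested mark() calls. This bounds C-stack usage and lets us
// tune the stack's memory behavior.
struct Node {
    bool marked = false;
    std::vector<Node*> children;
};

void markIteratively(const std::vector<Node*>& roots) {
    std::vector<Node*> stack(roots.begin(), roots.end());
    while (!stack.empty()) {
        Node* n = stack.back();
        stack.pop_back();
        if (!n || n->marked)
            continue;                  // already visited (handles cycles)
        n->marked = true;
        for (Node* child : n->children)
            stack.push_back(child);    // defer instead of recursing
    }
}
```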