JaegerMonkey

From MozillaWiki

This is the coder's badge of glory, That he protect and tend his monkey, Code with honor, as is due, And through the bits to God is true. --damons, IRC

JaegerMonkey (or JägerMonkey) is a new method JIT for SpiderMonkey. The goal is to get reliable baseline performance on the order of other JS JIT systems. JM will be a baseline whole-method JIT that doesn't necessarily do many traditional compiler optimizations. Instead, it does dynamic-language-JIT-oriented optimizations like PICs and specialization of constant operands.

Bug 536277 is the meta bug for this project.

Planning

First Deliverable

An inline/call-threaded version of TraceMonkey where:

  • Inline-threaded part is integrated well with tracing. I.e., we can jump efficiently from inline-threaded code to the trace monitor, and from a trace exit back to inline-threaded code.
  • With tracing jit OFF, JaegerMonkey is epsilon faster than basic SM.
  • With tracing jit ON, JaegerMonkey is not slower than basic TM (on SunSpider, etc.)

Once this is in place, we can then make it faster and faster by adding more optimizations.

Major Optimizations

Here is a chart of the major optimizations that we think we need to do in order to make JM fast, defined as cutting 950ms from our current JM-only SunSpider score and 13500ms from our v8-v4 score. For each optimization, the table shows a guess as to how many ms it will cut off our JM-only SunSpider and v8-v4 scores, the size of the task in person-weeks, and a possible assignee. All numbers are just guesses except as noted in the comments. Keep in mind that benefits are not really additive, but for the purpose of this table, interaction benefits or penalties are shared out among the individual items.

Note that most of these items can be worked on independently. The main exception is that compiler fast paths need to be done after we have decided what a jsval is. Thus, those two items form the critical path. Note that those items together are 1/3-1/2 of the win we need, so we cannot be fast without them.

Even after completing and tuning these items, we may not be fast--there may be other issues, e.g., with object allocation and GC, that are not covered here.


Name                    | Est. SS Benefit (ms) | Est. V8 Benefit (ms) | Size (wks) | Candidate Assignee
------------------------+----------------------+----------------------+------------+-------------------
PIC                     | 50                   | 3500                 | 1          | dmandelin
Compiler value handling | 200                  | 2000                 | 1          | dvander
Globals                 | 100                  | 500                  | 4          | dvander
Scope chain             | 0                    | 500                  | 4          | intern
Trace monitoring        | 200                  | 2000                 | <1         | dvander
Regexps                 | 0                    | 1000                 | 6          | cdleary
New jsvals              | 200                  | 2000                 | 8          | lw (+others)
Compiler fast paths     | 200                  | 2000                 | 4          | all


Item descriptions and commentary:

  • PIC. Polymorphic inline caching for property accesses. This is basically done--all that remains is to sort out some minor correctness issues and get it running on ARM. Performance work is essentially complete, so the estimated perf benefits in this case are real, measured numbers.
  • Compiler value handling. Currently, each jsop is compiled to work exactly the way it does in the interpreter: load values from the stack or local slots; do stuff; then store values back to the stack. We are improving the compiler so that it can hold values in machine registers across jsops. It will also avoid moves and stores entirely when possible. "Register allocation" would be a reasonable name for this optimization, but doesn't capture all of it. We found a 10-20% perf improvement in early tests, which is reflected in the table estimates. This is partially done--the 1-week estimate is for the remaining work.
  • Globals. Globals work as they do in the interpreter right now. It should be possible to get them down in most cases to 1-2 loads per global, plus a shape guard if the global was undeclared. Some initial thoughts on how to do this were posted in the js newsgroup.
  • Scope chain. AKA "closure variable access". This is similar to globals. We don't know how much of this is present in the benchmarks, but it's clearly important for general web perf. It seems to require a major overhaul of the scope chain.
  • Trace monitoring. Currently, we call out to the trace monitoring function on every loop edge. With this optimization, when we blacklist a loop we would patch the method so that it no longer calls out. We should also not call out when tracing is disabled. We might also want to take loop edge counts without calling out at the start. The 200 ms benefit for SunSpider is based on the fact that our pure JM score went up 200 ms when tracing was combined with JM.
  • Regexps. I believe we don't compile all the regexps in v8. This item means getting a new regexp compiler, or upgrading our current one, so we can compile them all.
  • New jsvals. We are going to a new jsval format. Currently, we are working on a 128-bit format, with a 64-bit value payload (that can hold any int, double, or pointer without masking or compression on any 32- or 64-bit platform) and 64 bits for alignment and type tags.
    We know that we need a new format to be fast, but there is some risk about exactly what format. The pluses for the 128-bit idea are that it performed well in a pilot study and that extracting the unboxed value is simply taking the right bits, with no masking, shifting, or arithmetic. A minor risk is that it will increase memory usage too much, but measurements there suggest we will be OK. A bigger risk is that it will require more register pressure, or more memory traffic when copying values, decreasing performance. There is no way to know which format is best without implementing it and testing it for real.
    The specific benefits of a new format are (a) doubles don't have to be on the heap, making allocation much faster and reducing indirection, (b) integers can be 32 bits, allowing a larger range to be stored in an integer format and making boxing much cheaper following bit operations, and (c) potentially reducing the number of operations it takes to box and unbox, depending on the format. Benefits (a) and (b) can be achieved with any reasonable alternate boxing format (32-bit NaN-boxing, 64-bit NaN-boxing, or "fat" values like our 128-bit values). Benefit (c) is maximized if the boxing format contains the unboxed value unmodified--only fat value formats like our current 128-bit values achieve that.
  • Compiler fast paths. It's clear from our past experience and measurements that staying on fast paths and avoiding stub calls is key to method JIT performance. Good fast paths for the most common 50 or so ops should cover 99% of ops run. It doesn't make sense to start this before the new jsvals are done. The good thing about this one is that it parallelizes very well.
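To make the fat-value tradeoff concrete, here is a minimal C++ sketch of a 128-bit value along the lines described above. The names and tag set are hypothetical illustrations, not SpiderMonkey's actual layout; the point is that the payload holds any int32, double, or pointer unmodified, so unboxing is just reading bits, with no masking or shifting.

```cpp
#include <cstdint>
#include <cstring>
#include <cassert>

// Hypothetical sketch of a 128-bit "fat" jsval: a 64-bit tag word
// (type plus alignment padding) and a 64-bit payload holding the
// unboxed value verbatim.
enum class Tag : uint64_t { Int32, Double, Object };

struct FatValue {
    Tag tag;          // 64 bits of type tag
    uint64_t payload; // raw value bits, stored unmodified

    static FatValue fromInt32(int32_t i) {
        // Boxing after a bit operation is cheap: just widen and tag.
        return { Tag::Int32, static_cast<uint64_t>(static_cast<uint32_t>(i)) };
    }
    static FatValue fromDouble(double d) {
        // Benefit (a): the double lives in the value, not on the heap.
        FatValue v{ Tag::Double, 0 };
        std::memcpy(&v.payload, &d, sizeof d);
        return v;
    }
    int32_t toInt32() const {
        assert(tag == Tag::Int32);
        return static_cast<int32_t>(payload); // benefit (c): no un-masking
    }
    double toDouble() const {
        assert(tag == Tag::Double);
        double d;
        std::memcpy(&d, &payload, sizeof d);
        return d;
    }
};
```

The cost side of the tradeoff is visible too: each value is 16 bytes, which is the memory-usage and copy-traffic risk noted above.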

Ongoing Work

Completed Work

Imported the Nitro assembler and verified that it works with a basic test harness and the beginnings of the compiler.

JS stack cleanup and simplification. See Bug 536275.

Basic method JIT compiler with about 20 fast paths.

Integrated TM and JM.

PIC, in rough form.

Current Work

Upgrade the compiler to hold values in registers and avoid stack and other memory traffic.

Change the basic jsval to a 128-bit format with values unmodified (no masking or shifting) in the top 64 bits.

Finish PIC.

Next Steps

Better global variable access.

Better closure variable access.

Cheaper transitioning on and off trace.

More method compiler fast paths.

Developing JaegerMonkey

Getting the Code

The code is currently in a user repo:

http://hg.mozilla.org/users/danderson_mozilla.com/jaegermonkey/ 

Building

Currently we only support shell builds. To build with JM, pass this option to configure:

--enable-methodjit

For debug spew, you can add --enable-methodjit-spew. It is enabled by default if you also specify --enable-debug.

For more information on building, see Building SpiderMonkey.

Running

Use the -m option to run with the method jit turned on. You can learn more about debug spew with:

 JMFLAGS=help /path/to/js -m

Design Discussion

Initial Design Decisions

The general idea is to do something along the lines of dmandelin's first prototype, Sully's prototype, and Nitro. The important design specifics that we're planning to go with for now are:

1. Do everything per-thread, just like TM does with traces.

2. For the native code generator, take Nitro's cross-platform "assembly blatter".

3. To make trace transitions fast, change the interpreter and trace stack layouts so they closely match.

Discussion on point 2 (code gen):

We considered and rejected these alternatives for the code generator:

2x1. Generate code blocks ahead of time and memcpy blocks together to create native code. I tried this at the beginning of my first prototype, and it didn't work very well. One problem is that relative jump displacements need patching, so this isn't as simple as it first seems. Also, in order to get good perf, you need to bake in constants and do other specialization, which requires increasingly complicated patching.

Adobe is doing an interesting research variant on this idea, where they compile the interpreter C code to LIR, compile that, and then memcpy (and presumably patch) those chunks. But this sounds too complicated and risky for us.

2x2. Generate LIR and compile with nanojit. Sully did this. The main problem is that there is not enough control over the results to get the best code. In particular, there are tricks for calling "stub functions" (functions that implement JS ops that are not inlined) very efficiently that nanojit doesn't currently support. We think there will be other tricks with manual register allocation and such that are also not currently supported. We don't want to gate this work on nanojit development or junk nanojit up with features that will be non-useful for its current applications. Also, compilation takes much longer going through LIR than using an assembler directly.

2x3. Roll our own assembler. This just sounds like extra unnecessary work if we can just use Nitro's.

More detail on point 3 (stack layouts):

Ideally, the interpreter stack layout would be identical to the on-trace stack layout, so that no importation or conversions are necessary. Of course, the interpreter requires type tagging but tracing must not have type tagging, so we have to compromise a little bit.

Luke's current idea is to have the interpreter use two chunks of stack memory. One will have unboxed values. The other will have type tags, and any other metadata the tracer doesn't care about. Allocating stack slots or frames will be just two pointer bumps and a bounds check. In inline-threaded code, 2 registers can be reserved to point to a known position (e.g., start of active frame), so that stack accesses are just a machine load or two (for the tag). Values will be boxed in the current SM style when they are stored to object slots.

The layout of the unboxed stack will be the same in the interpreter or on trace. To get this, we mostly have to delete or move out of band the extra fields in JSStackFrame. We will need to reorder a bit too. Once we have that, to enter trace, we do no work, and to leave trace, we just memcpy typemaps into the interpreter type tags stack.
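The two-chunk stack idea can be sketched in C++ as follows (hypothetical names, and a simplification of the real design): the unboxed payloads and the type tags live in parallel arrays indexed together, so frame allocation is just index bumps plus one bounds check, and the payload chunk can have the same layout in the interpreter and on trace.

```cpp
#include <cstdint>
#include <cstddef>

// Sketch of the two-chunk interpreter stack: one contiguous array of
// unboxed 64-bit payloads and a parallel array of type tags (plus any
// other metadata the tracer doesn't care about).
struct TwoChunkStack {
    static const size_t N = 4096;
    uint64_t payloads[N]; // unboxed values; layout shared with trace code
    uint8_t  tags[N];     // type tags, ignored on trace
    size_t   top = 0;     // both chunks advance together, so one index suffices

    // Allocate a frame: a bounds check and a bump of the shared index
    // (conceptually the "two pointer bumps" -- one per chunk).
    // Returns the base index of the new frame, or SIZE_MAX on overflow.
    size_t pushFrame(size_t slots) {
        if (top + slots > N)
            return SIZE_MAX;
        size_t base = top;
        top += slots;
        return base;
    }
    void popFrame(size_t base) { top = base; }
};
```

On this layout, entering trace needs no copying of values at all, and leaving trace is just writing the exit typemap into the `tags` chunk, as described above.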

Planned Optimizations

  1. Fast calls to stub functions. This is based on a trick that Nitro uses. The idea is that stub functions logically have an array parameter or several parameters, which include input jsvals and also interpreter stuff like the sp, fp, cx, etc. Much of this is constant so the call can be made fast by setting up an area in the C stack with all the arguments filled in. To make a call, we just have to store the input jsvals and do a call instruction.
  2. Fast paths for all common ops. For all common JSOPs, we need to inline the common cases as a fast path, and only call stub functions for slow or rare cases. This can be done incrementally, op by op.
  3. PIC. This is really a subset of item 2. In fact, "PIC" is a bit wrong, because as Andreas pointed out, we can start by inlining fast paths that access/guard against the property cache.
  4. Fast transition to and from traces. There are three parts to this goal. First, a simpler stack layout will allow traces, Jäger, and interpreter to execute on the same stack, instead of using separate stacks with copying back and forth. Second, we can use PIC-like techniques to make trace lookup fast. Third, we can separate the type tags from values on the VM stack so that, when we enter trace, the required type checking reduces to a memcmp and, when we leave trace, the type restoration reduces to a memcpy.
  5. Eliminate PC update. In an inline-threaded interpreter, we don't need to update the PC, because EIP encodes that. To enable this, we have to make sure no ops snoop the PC. We also need to help the GC/decompiler by making sure we have some way to provide them a PC (using a mapping or something) on demand.
  6. Eliminate SP update. Inside basic blocks of JSOPs, we shouldn't need to keep a proper stack. Instead, we can teach the compiler to track which logical stack element is in which register and generate faster code.
  7. Fast closures. This is important for advanced web apps as well as Dromaeo and the V8 benchmarks. See bug 517164.
  8. Fast global variable access. We should get globals closer to being a slot access. See bug 548844
  9. Faster arguments access. fp->argv[0] should be a compile-time static offset from fp. See bug 539144
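The stub-call trick in item 1 can be illustrated with a small C++ sketch (hypothetical names, not the actual Nitro or JM code): the pieces of interpreter state that are constant across calls (cx, fp, sp) are packed into one argument area that is set up once per method invocation, so each stub call site only has to store its jsval operands and issue a call instruction.

```cpp
#include <cstdint>
#include <cassert>

struct JSContext;  // opaque here; stands in for the real interpreter state
struct StackFrame;

// Argument area living at a fixed spot on the C stack. The first three
// fields are filled in once on method entry; only the operands are
// rewritten before each stub call.
struct StubArgs {
    JSContext* cx;
    StackFrame* fp;
    uint64_t* sp;
    uint64_t operands[2]; // input jsvals for the current stub call
};

// A stub function receives the whole argument area by pointer.
typedef uint64_t (*StubFn)(StubArgs*);

// Example stub: an add on raw integer operands.
static uint64_t stub_add(StubArgs* args) {
    return args->operands[0] + args->operands[1];
}

// What a JIT-generated call site amounts to: two stores and a call.
static uint64_t callStub(StubArgs* args, StubFn fn, uint64_t a, uint64_t b) {
    args->operands[0] = a;
    args->operands[1] = b;
    return fn(args);
}
```

In real generated code the stores and the call would be emitted directly by the assembler, but the shape is the same: the constant state is never re-marshalled per call.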