JaegerMonkey

Revision as of 23:48, 16 April 2010
Once this is in place, we can then make it faster and faster by adding more optimizations.
== Major Optimizations ==
Here is a chart of the major optimizations that we think we need to do in order to make JM fast, defined as cutting 950ms from our current JM-only SunSpider score and 13500ms from our v8-v4 score. For each optimization, the table shows a guess as to how many ms it will cut off our JM-only SunSpider and v8-v4 scores, the size of the task in person-weeks, and a possible assignee. All numbers are just guesses except as noted in the comments. Keep in mind that benefits are not really additive, but for the purpose of this table, interaction benefits or penalties are shared out among the individual items.
Note that most of these items can be worked on independently. The main exception is that compiler fast paths need to be done after we have decided what a jsval is. Thus, those two items form the critical path. Note that those items together are 1/3-1/2 of the win we need, so we cannot be fast without them. It's not known that we will be fast after completing and tuning these items--there may be other issues, e.g., with object allocation and GC, that are not covered here.
<br>
{| width="80%" cellspacing="1" cellpadding="1" border="0"
|-
! scope="col" | Name
! scope="col" colspan="2" | Est. SS Benefit (ms)
! scope="col" | Est. V8 Benefit (ms)
! scope="col" | Size (wks)
! scope="col" | Candidate Assignee
|-
| PIC
| align="right" colspan="2" | 50
| align="right" | 3500
| align="right" | 1
| dmandelin
|-
| Compiler value handling
| align="right" colspan="2" | 200
| align="right" | 2000
| align="right" | 1
| dvander
|-
| Globals
| align="right" colspan="2" | 100
| align="right" | 500
| align="right" | 4
| dvander
|-
| Scope chain
| align="right" colspan="2" | 0
| align="right" | 500
| align="right" | 4
| intern
|-
| Trace monitoring
| align="right" colspan="2" | 200
| align="right" | 2000
| align="right" | &lt;1
| dvander
|-
| Regexps
| align="right" colspan="2" | 0
| align="right" | 1000
| align="right" | 6
| cdleary
|-
| New jsvals
| align="right" colspan="2" | 200
| align="right" | 2000
| align="right" | 8
| lw (+others)
|-
| Compiler fast paths
| align="right" colspan="2" | 200
| align="right" | 2000
| align="right" | 4
| all
|}
<br>
Item descriptions and commentary:
*PIC. Polymorphic inline caching for property accesses. This is basically done--all that remains is to sort out some minor correctness issues and get it running on ARM. Performance work is essentially complete, so the estimated perf benefits here are real measured perf benefits.
*Compiler value handling. Currently, each jsop is compiled to work exactly the way it does in the interpreter: load values from the stack or local slots, do stuff, then store values back to the stack. We are improving the compiler so that it can hold values in machine registers across jsops. It will also avoid moves and stores entirely when possible. "Register allocation" would be a reasonable name for this optimization, but doesn't capture all of it. We found a 10-20% perf improvement in early tests, which is reflected in the table estimates. This is partially done--the 1-week size is to finish.
*Globals. Globals work as they do in the interpreter right now. It should be possible to get them down in most cases to 1-2 loads per global, plus a shape guard if the global was undeclared. Some initial thoughts on how to do this were posted in the js newsgroup.
*Scope chain. AKA "closure variable access". This is similar to globals. We don't know how much of this is present in the benchmarks, but it's clearly important for general web perf. It seems to require a major overhaul of the scope chain.
*Trace monitoring. Currently, we call out to the trace monitoring function on every loop edge. This optimization means that when we blacklist, we would patch the method so that it doesn't call out any more. We should also not call out when tracing is not enabled. We also might want to do loop edge counts without calling out at the beginning. The 200 ms benefit for SunSpider is based on the fact that our pure JM score went up 200 ms when tracing was combined with JM.
*Regexps. I believe we don't compile all the regexps in v8. This item means getting a new regexp compiler, or upgrading our current one, so we can compile them all.
*New jsvals. We are going to a new jsval format. Currently, we are working on a 128-bit format, with a 64-bit value payload (that can hold any int, double, or pointer without masking or compression on any 32- or 64-bit platform) and 64 bits for alignment and type tags.<br>We know that we need a new format to be fast, but there is some risk about exactly which format. The pluses for the 128-bit idea are that it performed well in a pilot study and that extracting the unboxed value is simply taking the right bits, with no masking, shifting, or arithmetic. A minor risk is that it will increase memory usage too much, but measurements there suggest we will be OK. A bigger risk is that it will increase register pressure, or memory traffic when copying values, decreasing performance. There is no way to know which format is best without implementing it and testing it for real.<br>The specific benefits of a new format are (a) doubles don't have to be on the heap, making allocation much faster and reducing indirection, (b) integers can be 32 bits, allowing a larger range to be stored in an integer format and making boxing much cheaper following bit operations, and (c) potentially reducing the number of operations it takes to box and unbox, depending on the format. Benefits (a) and (b) can be achieved with any reasonable alternate boxing format (32-bit NaN-boxing, 64-bit NaN-boxing, or "fat" values like our 128-bit values). Benefit (c) is maximized if the boxing format contains the unboxed value unmodified--only fat value formats like our 128-bit values achieve that.
*Compiler fast paths. It's clear from our past experience and measurements that staying on fast paths and avoiding stub calls is key to method JIT performance. Good fast paths for the most common 50 or so ops should cover 99% of ops run. It doesn't make sense to start this before the new jsvals are done. The good thing about this one is that it parallelizes very well.
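To make the jsval discussion concrete, here is a minimal C++ sketch of a 128-bit "fat" value along the lines described above. This is an illustration only, not SpiderMonkey's actual jsval layout: the `FatValue` name, the `Tag` enumerators, and the field order are invented for the example. It shows the key property claimed for the format--unboxing is just reading the 64-bit payload word, with no masking, shifting, or arithmetic.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical type tags; the real tag assignments may differ.
enum class Tag : uint64_t { Int32, Double, Object, Boolean, Undefined, Null };

// 128-bit "fat" value: one 64-bit word for the type tag (and alignment),
// one 64-bit word that holds any int32, double, or pointer unmodified.
struct FatValue {
    Tag tag;           // 64-bit type word
    uint64_t payload;  // 64-bit value payload, stored without compression

    static FatValue fromInt32(int32_t i) {
        return {Tag::Int32, static_cast<uint64_t>(static_cast<uint32_t>(i))};
    }
    static FatValue fromDouble(double d) {
        // The double lives inline in the value -- no heap box needed,
        // which is benefit (a) in the list above.
        FatValue v{Tag::Double, 0};
        std::memcpy(&v.payload, &d, sizeof d);
        return v;
    }
    int32_t toInt32() const {
        assert(tag == Tag::Int32);
        return static_cast<int32_t>(payload);  // no mask, no shift
    }
    double toDouble() const {
        assert(tag == Tag::Double);
        double d;
        std::memcpy(&d, &payload, sizeof d);   // just read the bits back
        return d;
    }
};

static_assert(sizeof(FatValue) == 16, "128-bit value");
```

By contrast, a 64-bit NaN-boxing scheme would pack the tag into the unused bits of a double NaN, halving the size of each value but making integer and pointer extraction a mask or shift rather than a plain load--exactly the trade-off between memory traffic and unboxing cost discussed above.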