Javascript:SpiderMonkey:OdinMonkey: Difference between revisions

From MozillaWiki
Revision as of 21:35, 27 February 2013

Goal

Provide an optimized implementation of the (co-evolving) asm.js spec which achieves near-native performance (within 2x of -O2) on JS generated from C/C++ (with the pilot code generator being Emscripten).

Tasks before initial landing

The code is currently on https://hg.mozilla.org/users/lwagner_mozilla.com/odinmonkey.

  • See bug 840282.
  • Also:
    • Unbreak IonSpew
    • Final polish on error messages (name the types/numbers involved in failure)

Further work (after initial landing)

Fit in better with the browser:

  • Per-function profile information
  • Make asm.js calls show up in backtraces (Debugger, StackIter)
  • Better about:memory reporting

Optimizations:

  • Optimize asm.js-to-Ion transition with custom-generated exit stub
  • Investigate why we spend so much time (15%) in ion::ThunkToInterpreter and ion::Bail under FFI calls; these should be trivial functions.
  • Avoid stack overflow checks with signal handler support
  • Optimize idiv (using signal handler for FP exception)
  • Optimize double-to-int conversion (using signal handler for FP exception)
  • Full GVN/range analysis support for all the new asm.js IM MIR nodes
  • ToInt shouldn't generate so much code
  • Ensure we're emitting the smallest mod/rm encoding for lea/loads/stores on x86 (viz., if displacement is 0).
  • ARM: use hardfp internally
  • Consider re-enabling effective-address folding on x64.
  • Don't spill non-volatile registers at calls out to C++
  • Create automatic instrumentation so we can compare, at the basic-block level, how many instructions are executed in both GCC/LLVM-compiled C++ and Odin-compiled asm.js. This should point us directly to our worst codegen pain points.
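
To make the idiv and double-to-int items concrete, here is a minimal Python sketch of the results those operations must produce. It assumes the usual JavaScript semantics for `(a / b) | 0` and for ToInt32. The function names are hypothetical and this is not OdinMonkey code. The special cases shown (division by zero, INT32_MIN / -1, NaN and out-of-range doubles) are the ones where x86 `idiv` faults or `cvttsd2si` returns its 0x80000000 sentinel, so they are presumably what a signal handler would catch in place of inline checks.

```python
import math

INT32_MIN = -2**31
INT32_MAX = 2**31 - 1

def asmjs_idiv(a: int, b: int) -> int:
    """Model of (a / b) | 0 for int32 operands (hypothetical helper)."""
    if b == 0:
        return 0                      # x/0 is +-Infinity or NaN, and |0 gives 0
    if a == INT32_MIN and b == -1:
        return INT32_MIN              # 2**31 wraps back to INT32_MIN
    q = abs(a) // abs(b)              # truncate toward zero, like hardware idiv
    return -q if (a < 0) != (b < 0) else q

def to_int32(d: float) -> int:
    """Model of ECMAScript ToInt32 (hypothetical helper)."""
    # Fast path: the value is already in int32 range, so truncation suffices.
    if not math.isnan(d) and INT32_MIN <= d <= INT32_MAX:
        return math.trunc(d)
    # Slow path: NaN and infinities map to 0; everything else wraps mod 2**32.
    if math.isnan(d) or math.isinf(d):
        return 0
    n = math.trunc(d) % 2**32         # Python's % is non-negative here
    return n - 2**32 if n >= 2**31 else n
```

In both functions the common case is a single machine instruction; only the rare cases need the extra logic, which is why moving them out of line is attractive.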

Extensions to DOM/JavaScript that would help asm.js:

  • FunctionBlob (TODO: link to proposal)
    • Efficient transfer between workers
    • Efficient IndexedDB serialization/deserialization
    • Browser-wide code caching via postMessage
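  • Add float32 and uint64 (see the value objects proposal: http://wiki.ecmascript.org/doku.php?id=strawman:value_objects)
  • Add SIMD support using BinaryData objects (http://wiki.ecmascript.org/doku.php?id=harmony:binary_data) with value semantics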
  • Add ArrayBuffer.swap to allow a single linked asm.js module to work with many hunks of data over time.
  • Add ArrayBuffer.resize to allow growable heap (sbrk).
  • To allow asm.js generation from pthread code, allow a single ArrayBuffer to be shared by two or more Workers and add necessary synchronization primitives. Because threads+locks are such a great paradigm for concurrency.
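
As an illustration of the growable-heap item, the sketch below models how an sbrk-style allocator could sit on top of a resizable buffer. It uses a Python `bytearray` as a stand-in for the proposed `ArrayBuffer.resize`. The `Heap` class, its doubling policy, and its method names are assumptions made for illustration only, not part of any proposal.

```python
class Heap:
    """Toy model of an asm.js heap that grows on demand (hypothetical)."""

    def __init__(self, initial_size: int = 4096):
        self.buf = bytearray(max(1, initial_size))  # stand-in for the ArrayBuffer
        self.brk = 0                                # current program break

    def sbrk(self, increment: int) -> int:
        """Move the break by `increment` bytes and return the old break."""
        old = self.brk
        new = old + increment
        if new < 0:
            raise ValueError("break cannot move below the start of the heap")
        if new > len(self.buf):
            # Grow by doubling until the new break fits; existing contents
            # are preserved, as a resize operation would need to guarantee.
            size = len(self.buf)
            while size < new:
                size *= 2
            self.buf.extend(bytes(size - len(self.buf)))
        self.brk = new
        return old
```

Without a resize operation the module has to be linked against a fixed-size buffer up front, so a program that calls sbrk past that size simply fails.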

General IonMonkey optimizations

  • Optimize control flow to minimize jumping (we tend to do a lot worse than GCC here since we don't even try to optimize this)
    • Rearrange loops to put the condition at the end
    • Fold jump-to-jump into single jump
    • Reorder blocks to replace jumps with fall-through.
  • Align loop headers on natural boundaries (I can see GCC doing this; it's also a well-known suggestion)
  • Align ExecutableAllocator allocations to a 16-byte boundary.
  • Use 32-bit register encoding on x64 when the MIRType is int32
  • Use pc-relative constant double loads instead of trying to use immediates (GCC does, need to measure perf)
  • Soup up EffectiveAddressAnalysis to handle (see TODO)
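
The "fold jump-to-jump into single jump" item can be sketched on a toy control-flow graph. The representation below (a dict mapping a block label to its instruction list and successor labels) and the function names are assumptions for illustration and are unrelated to IonMonkey's actual MIR/LIR data structures.

```python
def resolve(target, trivial):
    """Follow chains of jump-only blocks to their final destination."""
    seen = set()
    while target in trivial and target not in seen:  # `seen` guards against cycles
        seen.add(target)
        target = trivial[target]
    return target

def fold_jumps(blocks):
    """Retarget every edge that points at a block containing only a jump.

    blocks: dict of label -> (instructions, successor labels).
    Returns a new dict; the input is left unmodified.
    """
    # A block is trivial if it has no instructions and one unconditional successor.
    trivial = {
        label: succs[0]
        for label, (instrs, succs) in blocks.items()
        if not instrs and len(succs) == 1
    }
    return {
        label: (instrs, [resolve(s, trivial) for s in succs])
        for label, (instrs, succs) in blocks.items()
    }
```

After folding, the jump-only blocks become unreachable and a later pass can drop them; block reordering can then turn some of the remaining jumps into fall-through.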