Sfink/JS Stack Walking

Walking the JS stack

I'm writing this for a fairly general audience, including people who are not on the JS team, so I'm going to lay out some basic information first.

Stack Layout

SpiderMonkey currently maintains its stack on the heap, and this JS stack is mostly independent of the native C/C++ stack. The full details on how the JS stack is managed can be found in a comment at the top of js/src/vm/Stack.h. This is due to change in IonMonkey, which will switch to using the native stack for JS.

Some quick terminology: in SpiderMonkey, we talk about "scripts", which are either JS functions, the toplevel code outside of any functions, or the code compiled by an eval(). The data structure representing a script is JSScript.

A particular JS script may be executed in a variety of ways: in the interpreter, JITted, or JITted but inlined into another "outer" script. (Until recently, it was way more complicated: portions of a script would get JITted by the trace JIT, possibly dynamically inlining all or portions of called functions, and sometimes bailing out partway through to finish executing in the interpreter. That craziness is gone now with the removal of the tracing JIT, though it may return in some form for IonMonkey.) Regardless, the JS stack contains a single frame for each invocation of a non-inlined script.

Native stack frames for JITted code also store a VMFrame struct, which represents the execution of some number of JITted scripts. Initially, the VMFrame is created for just the script you're invoking, but that script may call other JITted scripts, and no new native frame is pushed for those. The VMFrame's fields are updated to reflect the currently-executing state, including a pair of JS StackFrame pointers <entryfp,fp> that give the range of JS stack frames the JIT native frame is executing. Note that VMFrames are auxiliary data, and are not needed just to walk the JS stack.
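
As a rough sketch of that bookkeeping (field names are illustrative only; this is not the real js::mjit::VMFrame definition):

 // Stand-ins so the sketch is self-contained; the real types live in SpiderMonkey.
 namespace js { class StackFrame; }
 typedef unsigned char jsbytecode;
 
 struct VMFrameSketch {
     js::StackFrame *entryfp;   // oldest JS frame this native JIT frame is executing
     js::StackFrame *fp;        // currently-executing JS frame (may be stale; see below)
     jsbytecode *pc;            // bytecode PC, synced before calling out of JIT code
     // ... saved machine registers, the JSContext, etc. ...
 };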

Each JSContext has its own JS stack, but all of those context-specific stacks are interleaved together into a per-thread StackSpace data structure.

In the simplest case, you can iterate through the entire StackSpace using an AllFramesIter iterator. For a given StackFrame, you can grab out the JSScript it is executing, and from there get the URL and starting line number of the script. Well, if you ignore inlined scripts, anyway.

Ridiculously oversimplified pseudocode:

 foreach jsframe in AllFramesIter(threadData):
   script = jsframe.script
   print(script.filename, script.lineno)

Exact Line Number

Getting the exact line number is trickier. In general, you want to get the pointer into the JS bytecode instruction being executed. We'll refer to that as the PC. Then you can use the src notes -- metadata stored in the JSScript -- to map the PC to the actual line number.

If the frame in question is running JITted and is the leaf (youngest/current) frame, you may first need to map the native instruction pointer (IP) to the JS bytecode pointer (PC) using a 'native map' stored in the JITScript data structure. If the frame is JITted but not a leaf, it will have synced its PC into the VMFrame struct before invoking the next JS function, so no native map games are necessary.

Note that by "leaf", I'm only referring to the JS stack. If you have a frame for a JITted script on the native stack underneath some younger native frames, you are not guaranteed that the PC in the VMFrame will be up to date. (Many times it will; the terminology for these cases is "fallible VM calls". The "infallible VM calls" are the tricky ones.)

StackFrames also store the PC of the previous (calling) frame. Think of it as the JS bytecode return address. If you are using a StackIter, it will give you access to the PC "belonging to" a given stack frame, so you do not need to worry about the off-by-one. (iter.pc() returns the PC within the StackFrame given by iter.fp(), not the PC of the caller that was saved in that frame.)

Pseudocode:

  for jsframe in StackIter(cx):
    if jsframe.isScript():
      pc = jsframe.pc()
      lineno = JS_PCToLineNumber(cx, jsframe.fp().script(), pc)
    else:
      # jsframe.isNativeCall(): a native (non-script) call, no pc to report

Unfortunately, for general stack unwinding you cannot use StackIter, because it modifies the stack (e.g., instantiating inline frames for simplicity) and it requires a JSContext. (FIXME: We can change things to be able to hand it a JSContext, though we don't want to, since we want frames for all contexts.) If you are not using StackIter but instead using AllFramesIter, then things are a lot hairier. There is no easy pc() accessor. Inlined frames will not be expanded for you, so a given frame may actually represent an outer script plus a set of inlined (nested, even) scripts. The PC will point into the innermost (currently active) script, not to the outer script. (FIXME: I haven't fully implemented handling of inlined frames yet.)

Intermediate State

Accessing the JS stack during a signal handler (or while stopped by another thread, as in OSX and Windows) is considerably more complicated. The basic problem is that data structures referring to the JS stack may be in a state of flux, so it is unsafe to even dereference arbitrary StackFrame* pointers. The state stored within StackFrames, VMFrames, and StackSpaces could also be in flux.

This difficulty is not just for getting exact line numbers. The straightforward way of getting the stack of scripts relies on unsafe accesses. (In other words: it's harder to get a stack of call sites than just a stack of functions, but the concurrent mutation problem makes even the simpler stack-of-functions case difficult.)

Flagging Mutation

One approach to handle this is for the stackwalking code to know when things are in flux and drop samples (or return partial information) in those cases. Options: (1) atomically increment a global variable around mutation (it acts as a flag shared across threads); (2) set a per-thread flag; and/or (3) do nothing during normal operation, but when walking the native stack, detect when the IP is in the machine code address range of code that modifies the needed state.
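
As a sketch of what options (1) and (2) might look like (names invented; SpiderMonkey has no such class), each mutation site would be bracketed by an RAII guard, and the sampler would drop any sample taken while the counter is nonzero:

 #include <atomic>
 #include <cstdint>
 
 // Hypothetical flag: nonzero means "JS stack bookkeeping is in flux", so any
 // sample taken right now should be dropped. A global atomic gives option (1);
 // making it thread_local gives option (2).
 static std::atomic<uint32_t> stackInFlux(0);
 
 class AutoStackMutation {
   public:
     AutoStackMutation()  { stackInFlux.fetch_add(1, std::memory_order_release); }
     ~AutoStackMutation() { stackInFlux.fetch_sub(1, std::memory_order_release); }
 };
 
 // Usage at a mutation site such as popping an invoke frame:
 //   AutoStackMutation guard;
 //   ... pop the StackFrame and repoint the regs ...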

Two problems with this approach are complexity and speed. Example sites of mutation:

  • JITted code that updates the VMFrame (the "in flux" state can even cross call sites of "infallible" C++ functions)
  • ContextStack::popInvokeArgs (simple case)
  • PopActiveVMFrame() called by JIT trampolines, unsafe until repointRegs is called from inside the regular C++ method EnterMethodJIT()

There are probably others, and identifying all of these cases requires detailed knowledge of the engine and is fairly brittle.

There are some other odd cases, like where js::Interpret maintains the set of registers (including the PC) in a local variable and repoints JSContext->regs there instead of pointing somewhere within the JS stack. Then the PC is mutated within that register struct ("mostly" safely; you might get a pointer to a previous bytecode.)

Previously, imacros also introduced headaches, but they're dead now.

Validating Data

Another approach would be to withstand mutation by validating all data before accessing/dereferencing it, and if anything doesn't look right, bailing out and dropping that sample.

From the ThreadData we can find the youngest stack frame. (Well, the youngest or its immediate caller, at least.) This can be validated against the address range corresponding to the currently live stack; if it is inside it, then at least it is safe to dereference. (You might have a stale fp that was popped and then another frame was pushed over the top of it, so the data it points to may be garbage, but at least you can dereference it.)

The JSScript* can be pulled out of the StackFrame and checked against a global table of valid scripts. Anything in there is guaranteed to still be live, so it can be used freely. It could possibly be the *wrong* script (if the fp was bogus), but for the expected users (sampling profilers) it's acceptable to get the wrong answer occasionally, as long as it happens rarely.

The valid stack region can be determined from the StackSpace in the ThreadData.
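
A sketch of that dereferenceability check, assuming the live region is available as a [base, limit) address pair (names hypothetical):

 #include <cstdint>
 #include <cstddef>
 
 // An fp is safe to dereference only if the whole frame lies inside the
 // currently live region of the thread's StackSpace. (It may still be stale
 // garbage, as noted above, but at least reading it won't crash.)
 bool safeToDeref(const void *fp, size_t frameSize,
                  uintptr_t liveBase, uintptr_t liveLimit) {
     uintptr_t addr = reinterpret_cast<uintptr_t>(fp);
     return addr >= liveBase && addr < liveLimit && liveLimit - addr >= frameSize;
 }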

To find the PC for a JITted frame, you need to get the VMFrame. This requires first knowing that an IP corresponds to JITted code rather than C/C++ code. Then you need the address of the VMFrame itself. The "correct" solution to this is to walk the stack and get out the CFA ("Canonical Frame Address" in DWARF terminology; it's roughly the value of the stack pointer just before a function is called, and is used to find the saved return address as well as all other values on the native stack.) The VMFrame is a local variable in JITted code, and so will always be found at a fixed offset from the CFA.

This requires a stack unwinding tool that gives both return addresses and CFA info. (For example, glibc's backtrace() function only provides the return addresses.) libunwind can do this on Linux and OSX, but only for frames for which it has CFI (call frame information). It has a dynamic CFI registration mechanism, but it was only ever implemented for ia64. It appears that it would be relatively easy to modify libunwind to accept standard DWARF debug info, and I have a patch to generate such info (used for providing gdb with the JIT code information.)

I'm not sure what the performance cost is of generating that info. I do know that generating the full DWARF info plus embedding it into an ELF file and queueing it up for GDB to consume is very slow, but I don't know how much of that is the DWARF generation itself.

The easier solution is to look one frame deeper/older, because JaegerMonkey arranges things so that the VMFrame will be passed as the first parameter to any called functions. So assuming the deeper frame is not itself JITted, libunwind can give you the stack pointer saved for a given frame and the VMFrame pointer can be read off the top of the stack. libunwind will give *some* value for the stack pointer even if it gets confused by JIT frames, so with some validation against the native stack and full validation on anything pulled out of the VMFrame, it can still be used optimistically even without implementing dynamic JIT registration for libunwind.
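
For reference, the shape of the libunwind walk involved is roughly the following; reading the VMFrame pointer out of a frame is left as a comment, since its exact location is an assumption here:

 #define UNW_LOCAL_ONLY
 #include <libunwind.h>
 
 // Walk the current thread's native stack, getting the IP and the restored
 // stack pointer for each frame. The SP of a frame is (roughly) the CFA of
 // the frame it called, which is what the VMFrame-finding trick above needs.
 void walkNativeStack() {
     unw_context_t context;
     unw_cursor_t cursor;
     unw_getcontext(&context);
     unw_init_local(&cursor, &context);
 
     while (unw_step(&cursor) > 0) {
         unw_word_t ip = 0, sp = 0;
         unw_get_reg(&cursor, UNW_REG_IP, &ip);
         unw_get_reg(&cursor, UNW_REG_SP, &sp);
         // (The real code records ip/sp into the sample buffer here.)
 
         // If 'ip' is C++ code that was called directly from JIT code, the
         // VMFrame* sits at a known spot relative to 'sp' (hypothetical
         // offset; validate the address before dereferencing):
         //   void *vmframe = *reinterpret_cast<void **>(sp + kVMFrameOffset);
     }
 }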

But to get the PC itself, you need to call maybeScript() on each JS frame, which could potentially access fp->fun()->isInterpreted() -- which means the fp->fun() must point to a valid memory region. This is detectable using the conservative GC machinery, with careful handling to avoid seeing intermediate state of the bookkeeping.

The PC can be validated against the script's bytecode range.

To retrieve the PCs for the JS frames, the validation steps above combine into pseudocode along these lines:

 cx = threadData.currentContext()
 foreach jsframe in AllFramesIter(threadData):
   if ! validStackPointer(cx, jsframe):
     bail                          # fp is outside the live stack range
   script, pc = jsframe.maybeScript(), pcForFrame(jsframe)
   if ! validScript(script) or ! script.containsPC(pc):
     bail                          # unknown script, or pc outside its bytecode range
   record(script, pc)

Delayed Analysis

Another source of pain is that stack snapshotting should be fast. You don't want to interfere with regular timing any more than necessary. That means that during a signal handler you mainly just want to walk the C and JS stacks, record everything you need into fixed-size buffers (no memory allocation, remember!), and continue. A separate thread can walk along and do any necessary detailed analysis.
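
To make that concrete, a per-sample record could look something like the following (field names invented for illustration; the real buffers record more):

 #include <cstdint>
 #include <cstddef>
 
 // Hypothetical fixed-size sample record: the signal handler fills one of these
 // without allocating, and a separate profiler thread decodes it later.
 struct StackSample {
     static const size_t MaxFrames = 64;    // deeper stacks get truncated, not malloc'd
 
     uint64_t sampleId;                     // position in the global sample counter
     uint32_t frameCount;
     struct Frame {
         const void *scriptOrIP;            // JSScript* (interpreted) or native IP (JIT)
         const void *pc;                    // bytecode pc if it was safe to read, else NULL
         bool fromJIT;
     } frames[MaxFrames];
 };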

When saving out the JS stack, you want to copy out as little information as possible, because of the cost in both time and space. Unfortunately, by the time the profiler thread gets to a sample, any pointers to data structures that you captured when taking the sample may have expired.

So for each type of data involved, there are a couple of choices: (1) copy anything needed; (2) detect that the data has expired and drop the sample; (3) don't copy initially, but when the datum expires check whether it might be needed by a sample and copy out whatever is needed at that time.

Data structures in question:

  • JSScript - used to get filename and lineno, and map a PC to a line number
  • Script filename (they have a different lifetime than JSScript)
  • JITScript - used to map IPs to PCs

JSScript

During the signal handler, all scripts are checked against the global table of valid scripts, so the script is known to be fully valid and all fields can be used freely.

During the delayed sample processing, scripts may have died. A callback on script destruction queues up a "PendingDeadScript", which keeps a copy of the script's filename, line number range, source notes, and other data that will be needed later. Then the script is released to be destroyed. (Note that script destruction happens during GC, and by the time the profiler knows that a script is to be destroyed, it's too late to re-root it to save it.) Periodically, the pending dead script queue will be updated to discard script corpses that are old enough to not be needed.
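
Roughly, the corpse record keeps something like this (types and names illustrative, not the actual implementation):

 #include <string>
 #include <vector>
 #include <cstdint>
 
 // Hypothetical record for a dying script. This code runs on the main thread
 // during GC (not in the signal handler), so heap allocation is fine here.
 struct PendingDeadScript {
     std::string filename;
     unsigned startLine, endLine;           // the script's line number range
     std::vector<unsigned char> srcNotes;   // copied source notes, for pc -> line mapping
     uint64_t diedAtSample;                 // lets the queue discard corpses no sample can need
 };
 
 static std::vector<PendingDeadScript> pendingDeadScripts;  // pruned periodically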

This is not a great solution, because it means saving info for *all* dying scripts, not just ones that were active or hit by a sample.

Perhaps a better approach would be to make the global table of live scripts keep a counter of how many times each script was hit by a sample. I haven't thought too hard about this, but I think it might require taking a lock from the signal handler, which is always messy.

Another alternative would be to discard samples that include dead scripts, because scripts only die during GC anyway. That's not great from the point of view of a sampling profiler, because "heavy" scripts both execute frequently and allocate lots of memory, and so they would be under-represented in the profile. But perhaps the percentage loss is low enough.

Finally, the signal handler could construct a set of all scripts that were sampled, and root them for the GC. This feels somewhat hairy from the locking perspective (the signal handler would need to lock against reads from the profiler thread and the profiled thread), but it might be reasonable.

Script Filenames

Many scripts share the same filename, so the lifetimes of script filenames are independent of the scripts'. Right now, I'm keeping a separate table of filenames, but I never delete from it, so let's just say that I haven't thought too deeply about this one.

Native Address to JIT Code Mappings

When handling an IP taken at time T0, the table only tells me if the script is still alive at time T1 (when the sample is being processed.) So I also maintain a "history log" which records every script creation/destruction event and when it happens (with respect to the sample counter). Conceptually, you could first look up a script in the global table, then walk through the history log in reverse time order to modify whatever was found. In practice, you scan the log in forward time order: skip over all entries that happened before T0. Then the next entry found is the first event relevant to that IP that happened *after* that IP was sampled. If it is a creation, then the IP could not have been in the table before that point. If it is a deletion, then the IP *was* in the table, but is no longer valid. If the IP is not found in the history log, then the global table is valid for both T0 and T1.

If the IP was found in the global table, then we can use it freely.

If the IP was not found at all, or a creation event for it was found in the history log, then the IP does not belong to JIT code.

If the IP was found in a deletion event in the history log, then it belongs to a JITScript that has been deleted. To handle these, a callback on JITScript deletion scans through the queue of pending samples looking for matching IPs. If any are found, the IP -> PC mapping is done and written into a side table, then the JITScript is allowed to die. When the sample is processed, it checks in that table for IP -> PC mappings of the appropriate age.
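
Putting the classification together (hypothetical types and names; the real log records more per entry):

 #include <cstdint>
 #include <cstddef>
 
 // Hypothetical history-log entry: a JIT code range was created or destroyed
 // at a given point in the global sample counter.
 struct CodeEvent {
     enum Kind { Created, Destroyed } kind;
     uintptr_t start, end;     // native address range of the JIT code
     uint64_t when;            // sample-counter timestamp of the event
 };
 
 enum IPClass { LiveJITCode, DeadJITCode, NotJITCode };
 
 // Classify an IP sampled at time t0, given whether the global table contains
 // it *now* (at processing time) and the history log in forward time order.
 IPClass classifyIP(uintptr_t ip, uint64_t t0, bool inGlobalTableNow,
                    const CodeEvent *log, size_t logLength) {
     for (size_t i = 0; i < logLength; i++) {
         const CodeEvent &ev = log[i];
         if (ev.when <= t0 || ip < ev.start || ip >= ev.end)
             continue;                     // before the sample, or unrelated code
         // First event touching this IP after the sample was taken.
         return ev.kind == CodeEvent::Created ? NotJITCode : DeadJITCode;
     }
     return inGlobalTableNow ? LiveJITCode : NotJITCode;
 }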

The history log is regularly pruned to discard events that are old enough that no sample could possibly use them. Note that this must consider *all* active profilers, since the history log is global.

Interleaved Stacks

The native stack and the JS stack are somewhat independent, but it is useful to treat them as interleaved: a JIT frame on the native stack might correspond to a contiguous subsequence of the JS stack, as would js::Interpret frames.

 (C)  main() calls
 (C)  js::RunScript() calls
 (C)  js::Interpret() executes
 (JS) http://www.ducklips.com/defenestrate.js:17 calls
 (JS) http://www.ducklips.com/defenestrate.js:234 calls
 (JS) http://cdn.net/jquery-7.2.js:33 calls
 (C)  nsXPConnect::doStuff() calls
 (C)  nsDOMWhack::match() calls
 (C)  nsXPConnect::invokeGetter() calls
 (C)  js::mjit::InvokeMethodJIT() executes
 (JS) http://cdn.net/jquery-7.2.js:182 calls
 (JS) http://cdn.net/jquery-7.2.js:79 calls
 (C) js::math::Cosine()

[any resemblance to actual C++ or JS function names is purely coincidental]

Interleaving the stacks is (relatively) straightforward: JIT frames contain the VMFrame data structure, which contains both a (possibly stale) current fp and the fp at which the JIT was first invoked (the 'entryfp'). The truly current fp is kept in a known register, so we can know the exact start and end fps.

It's almost the same with js::Interpret: the current fp is maintained in a FrameRegs struct stored in a local variable. The entryfp is passed in as a parameter. So if js::Interpret is somewhere on the stack, we can use the CFI (Call Frame Information) from the stack unwinding to find the address corresponding to js::Interpret's activation frame, and from that look up the relevant local variables. (If the innermost frame of our native stack *is* js::Interpret, then this won't necessarily work, since the most up-to-date value might be in a register. But the JS stack frame pointer is only modified when JS scripts are invoked, which requires nontrivial C function calls, which means it will be saved to memory across those calls. So we'll usually get the right answer just by looking at the saved locals, and we can accept a little bit of staleness in our stack reconstruction procedure.)

What is required for this is the native address range of js::Interpret (so we can recognize native stack frames corresponding to js::Interpret), and the offsets of those local variables. To figure out the address range, I insert registration calls at the beginning and end of js::Interpret. Within those calls, I use libunwind to walk up the stack one frame, and record those return addresses. The resulting address range doesn't cover the whole of js::Interpret, but it doesn't really need to either. (Anything outside of that region is before or after interpreting any JS code, so no JS frames really "belong" to that invocation of js::Interpret anyway.)

To get the saved locations of the locals, I pass their addresses into one of the registration routines. libunwind tells me the CFA (Canonical Frame Address, the stack pointer corresponding to the js::Interpret activation), so I just compute the offset from the CFA to each of the locals' addresses.
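
A sketch of what such a registration call could look like (function and table names are made up; the actual patch differs in its details):

 #define UNW_LOCAL_ONLY
 #include <libunwind.h>
 #include <cstdint>
 
 // Hypothetical registration, called from the top (and bottom) of js::Interpret.
 // It records an address inside js::Interpret (to learn its code range) and the
 // offsets of the interesting locals from the frame's base address, so the
 // sampler can recover those locals from any js::Interpret activation it finds.
 void registerInterpreterFrame(void *regsLocal, void *entryFpLocal) {
     unw_context_t context;
     unw_cursor_t cursor;
     unw_getcontext(&context);
     unw_init_local(&cursor, &context);
 
     if (unw_step(&cursor) <= 0)               // step up to our caller: js::Interpret
         return;
     unw_word_t ip = 0, base = 0;
     unw_get_reg(&cursor, UNW_REG_IP, &ip);    // an address inside js::Interpret
     unw_get_reg(&cursor, UNW_REG_SP, &base);  // serves as the CFA-like base described above
 
     intptr_t regsOffset = (intptr_t)regsLocal - (intptr_t)base;
     intptr_t entryFpOffset = (intptr_t)entryFpLocal - (intptr_t)base;
 
     // recordInterpretBoundary(ip);                       // hypothetical side tables
     // recordInterpretOffsets(regsOffset, entryFpOffset);
     (void)regsOffset; (void)entryFpOffset;
 }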

The rest is just data twiddling. Using the full JS stack, and the set of native stack frames that "own" JS stack frame subsequences, weave them together into one data structure based on the <fp,entryfp> ranges for JIT and js::Interpret frames.
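
A sketch of that weaving step (structures invented for illustration; it assumes the JS stack grows toward higher addresses, since it lives on the heap):

 #include <vector>
 #include <cstdint>
 #include <cstddef>
 
 // A native frame that "owns" a slice of the JS stack carries its <entryfp, fp>
 // pair; JS frames carry their fp (plus script/pc, elided here).
 struct NativeFrameInfo { const void *ip; const void *entryfp; const void *fp; bool ownsJS; };
 struct JSFrameInfo     { const void *fp; /* script, pc, ... */ };
 struct MixedFrame      { bool isJS; const void *info; };
 
 // Walk the native frames (oldest to youngest, say); whenever one owns a
 // <entryfp, fp> range, splice in the JS frames whose fp falls inside it.
 std::vector<MixedFrame> weaveStacks(const std::vector<NativeFrameInfo> &natives,
                                     const std::vector<JSFrameInfo> &jsFrames) {
     std::vector<MixedFrame> out;
     for (size_t i = 0; i < natives.size(); i++) {
         MixedFrame nf = { false, &natives[i] };
         out.push_back(nf);
         if (!natives[i].ownsJS)
             continue;
         for (size_t j = 0; j < jsFrames.size(); j++) {
             uintptr_t jfp = (uintptr_t)jsFrames[j].fp;
             if (jfp >= (uintptr_t)natives[i].entryfp && jfp <= (uintptr_t)natives[i].fp) {
                 MixedFrame jf = { true, &jsFrames[j] };
                 out.push_back(jf);
             }
         }
     }
     return out;
 }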

Native Stack Walking

Note that all of this depends on being able to walk the native stack. So far, I have relied on having debug symbols available, and have only used 64-bit Linux. I've found libunwind to be surprisingly robust, even managing to see through JIT stack frames to the C++ frames underneath them, which is more than I can say for GDB. To make this generally useful, we'd want to be able to walk stacks in production builds on all OSes and architectures. Here's my handwavey thinking about how to achieve this:

First, consider Linux. I think libunwind is a good solution here. The quick fix is to enable the frame pointer (and make sure that libunwind relies on it; given the amd64 ABI, an unwinder could legitimately ignore it.) That kind of sucks, so the next fallback would be compiling with .eh_frame sections that give the necessary unwind info. This shouldn't cost anything in runtime performance (assuming it doesn't get read off of disk unnecessarily, but I'll ignore that for now), though it does cost space. I'm hoping that just the .eh_frame sections are small enough to be tolerable, but if not, we could compress them by discarding the portions of the unwind tables that give unneeded registers. I think we could probably just keep the CFA, return address, stack pointer, whatever register maintains the current JS StackFrame for JIT code, and whatever registers are needed to compute those values (if any; an example might be the frame pointer when it's being used for that purpose.) And technically, we really only need the JS frame register for functions that might be directly called from JIT code.

The main idea is that we don't need debug info; we just need the same info used to unwind the stack for C++ exception handling. (Namely, the .eh_frame section, though if we don't want it to be mapped into memory we could use .debug_frame instead.)

I'm not very familiar with OSX, so I'll pretend that it's identical to Linux. Though apparently it isn't; you need some "dSYM" thing for regular debug info.

I'm totally unfamiliar with Windows. Full debug info is in PDBs, I guess. But it seems like (1) the general idea that we need way, way less than full debug info holds; and (2) Windows still needs to be able to unwind stacks for exception handling, so there should be some analog of .eh_frame.

IonMonkey

IonMonkey is going to switch to a unified stack, so the native stack will contain one frame for each JS frame, embedded at the right spot, using the same conventions as the native C/C++ code. This should remove much of the complexity, because only a single stackwalk is required, and it doesn't depend on all kinds of data that might be in flux at the time of observation. You would still need to know what type of frame each native frame is, in order to know how to unwind it and look up the local variables that would give the source location info.

Status and Future

Tests

This is brittle code that relies on unstable code. Without tests it will break frequently.

Platform support

I have only been developing on amd64 Linux with frame pointers and debug symbols. The JS stack walker is mostly useless without support for OSX, Windows, and Android.

amd64: libunwind should work fine on x86 and ARM, though I think I have some minor register naming stuff to fix to make it properly portable.

Linux: libunwind should work on OSX, but I haven't tested it. (Remember that I am using more from libunwind than just the return address stack.) libunwind does not and never will work on Windows. The Windows API has StackWalk64, which reports a stack pointer for each frame; that should be good enough for our purposes.

Frame pointers and debug symbols: this is the tough part. It's no use relying on frame pointers, because (1) our release builds won't have them and (2) the amd64 ABI doesn't promise they'll work anyway.

For libunwind, the next step would be to look at the size of the minimum debug information required to walk the stack and access the locals we care about, and implement a way to download just that much when profiling is turned on. Alternatively, we could look at the cost of compiling in .eh_frame sections to see if it would be feasible to just have the information there all of the time.

For Windows, Ehsan is looking into this. His current plan is to use a tool to strip down the PDB files to produce much smaller ones. WinAPI's StackWalk64 accepts function pointers for accessing debug info, so we just need to write a version that handles our variant. Sounds great to me! An alternative would be to use (cached) symbol server lookups.

SPS Integration

SPS (Simple Profiler System) is a separate profiler that currently only has native stack walking, but which would like to incorporate JS stacks. Probably the easiest version of JS stacks to integrate would be the interleaved C++/JS stacks, to avoid pushing the complexity of two independent stacks into all profile info consumers. (Though much of that complexity is already present in anything that handles multiple threads with independent stacks.)

Stackwalking support

JS stack walking unfortunately depends on native stack walking. I have been using libunwind on x64 Linux with debug symbols. To make this usable in a production browser, every one of those conditionals needs to be relaxed.

x64: libunwind should handle x86 and ARM

Linux: libunwind *should* work on OSX, though I haven't tested it.

JSFunction validity checking

To check for the validity of JSFunctions, I am currently doing a generic test for an addressable address range. This is incorrect, since it could trigger side effects. I have already discussed the options above, but some way to make use of the conservative GC address tester seems necessary.

Inlined Frames

The support for inlined frames is partial. As mentioned above, StackIter materializes inline frames before doing its iteration, but that's unsuitable for signal handler use because it writes to memory. A profiler would probably rather reflect the unmodified state anyway. AllFramesIter does not materialize frames, but the support for describing and traversing them is incomplete. (It may even be broken with inlined frames currently, e.g., throwing out any stack with an inlined frame. I haven't tested.)

Thread safety and multiple runtimes

I have tried hard to get the thread safety right, but there is approximately a 0% chance that it is completely correct -- *especially* if you want to use this on more than one thread at a time.

One particularly shaky area is that I recently changed which threads access which CodeMap-related data structures, and I need to audit the locking/flag-checking code. (Whenever the signal handler starts accessing things that might be locked by the main thread, it gets messy.)

Performance

The one truly horrible performance problem I currently know of is that I'm using a simple unsorted vector for the native address lookup table. This really ought to be a tree lookup.
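
For example, keying an ordered map by the end address of each (non-overlapping) JIT code range makes the lookup a single search (sketch only; names invented):

 #include <map>
 #include <cstdint>
 #include <cstddef>
 
 // Illustrative entry; the real table would hold the JITScript / native map.
 struct CodeRange { uintptr_t start; /* JITScript*, etc. */ };
 
 // Key is one past the end of each code range.
 typedef std::map<uintptr_t, CodeRange> CodeRangeMap;
 
 const CodeRange *lookupIP(const CodeRangeMap &ranges, uintptr_t ip) {
     CodeRangeMap::const_iterator it = ranges.upper_bound(ip);  // first range ending past ip
     if (it == ranges.end() || ip < it->second.start)
         return NULL;
     return &it->second;
 }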

I also suspect that the validity checking needs to be optimized.

The current implementation also has too much overhead when the profiler is off. A specific example: since the table of valid JSScripts must be 100% accurate at all times, I maintain the table even before the profiler is enabled for any thread. This overhead could be eliminated by scanning through all scripts in all compartments in all runtimes when turning on profiling for the first time (and discarding the table when profiling is turned off.) There are a few other sources of overhead that need to be dealt with as well. Some of them will degrade the functionality a bit, but we can live with those losses.

I haven't measured the overall overhead yet, so there are undoubtedly other areas that need to be fixed to use this in a real production setting.

The profiler uses too much memory for script filenames, and leaks that memory.

Output

So far I'm not doing a whole lot with the profile data being collected. I need to look at SPS to see if it has the frontend functionality we want, and whether it's usable in a way that doesn't require the full browser.

I'd also like to produce basic gprof-style textual reports and kcachegrind input.

Landing

To get a minimally useful version running on a single platform and configuration:

  • Augment the current JIT registration API
  • Fix the script filename memory leak
  • Fix the perf overhead when profiling is off
  • Land the deque implementation I am using (or figure out how to eliminate its use)
  • Land a small patch to add a line numbering src note to the beginning of all scripts (it is currently omitted because it is redundant with script->lineno, but the stack walker sometimes needs it when the script is dead.) Or eliminate the necessity for it.
  • Land js_MapPCToLineNumber, an API that is usable when the original script is dead but a copy of the src notes is still available.
  • Land configure.in changes to link in libunwind
  • Fix the compile on currently-unsupported platforms
  • Land shell integration (implemented, but it may be excessive -- it includes yet another embedded options parser)
  • Profile the overhead of the RAII-style "activity" markers that I've added to most JM stub calls to see if it's acceptable to land with them in.