Sfink/Memory Ideas

Problem A: System too unusable to diagnose

When a bad memory leak kicks in, the system can be too unusable to get useful data out.

Solution: Make it easier to get information out when the system is suffering

#A1 Periodically log memory-related information (existing bug, I think? also telemetry)

#A2 Maintain a rotating database of detailed memory-related information (cf atop)

#A3 Make about:memory capable of outputting to a file, for use with a command-line invocation 'firefox about:memory?verbose=1&outfile=...'

Solution: Prevent the system from getting into such a bad state

#A4 Make a per-compartment (or per-?) cap on memory usage

#A5 When sufferingMode==true, disable GC/CC on big tabs. Probably need to deactivate them too.

#A6 Early warning when memory usage is getting too high

#A7 Crash reporter-like UI for reporting memory problems (without requiring an actual crash to trigger it)

Problem B: Regular users can't generate useful reports

Hard for regular users to generate a useful memory problem report

(all solutions from problem A are relevant here)

#B1 Provide a way to dump and submit a reachability graph

#B2 Documentation on how best to help with a memory problem, with concrete steps to follow.

#B3 Track memory to individual page/tab/compartment/principals.

#B4 Tools for generating profiles with subsets of addons installed (or for running with different subsets of addons within one profile)

#B5 Tools for blaming memory usage on addons (e.g. detecting "safe" addons to remove from consideration, or cross-referencing other users' addons and memory usage, similar to the crash correlation reports -- requires telemetry)

Problem C: Knowledgeable users can't generate useful reports

Hard for developers or knowledgeable and motivated users to generate a useful memory problem report

Problem B above crosses over into this one, so everything there is relevant here as well.

#C1 Rationalize and document all of our various leak-detection tools.

#C2 Automation and Windows equivalents of my /proc/<pid>/maps hacks

#C3 Dumpers that give full heap, full graph, pruned graph. Visualizers, analyzers, etc. of the dumps.

#C4 Collect the age of various memory objects (how many CCs or GCs they have lived through).

Problem D: Uncollected garbage

Garbage is not collected

Solution: Report cycles that CC misses

#D1 Conservative scanner to find cycles involving things not marked as CC-participants and report them as suspicious.

Solution: Report resources that leak over time but are still referenced (so they are cleaned up before shutdown)

#D2 Register "expected lifetime" at acquisition time. Report things that live longer than expected, filtered by diagnostics. ("lifetime assertions"? Not quite.)

#D3 Detect subgraphs that grow (at a constant rate?) while a page is open.

#D4 Detect subgraphs that are never accessed

Problem E: Unleaked but excessive memory usage

High memory usage, not leaked

(aside from current work like generational GC)

#E1 "Simulator" that runs over logs and estimates peak memory usage if CC/GC ran at optimal times.

#E2 Use reproducible test runs to evaluate the performance/memory tradeoff for various things (e.g. JIT code, structure sizes)
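
For E1, a minimal sketch of the estimate under one assumed log format (one record per object: size, allocation time, time it became unreachable, time the real GC/CC actually freed it -- nothing we log today), in C++:

 // Sketch of the "what if CC/GC had run at the ideal times" estimate.
 // The ObjectRecord log format is an assumption, not an existing log.
 #include <algorithm>
 #include <cstdint>
 #include <map>
 #include <vector>

 struct ObjectRecord {
   uint64_t bytes;
   double allocTime;
   double becameGarbageTime;  // last time the object was reachable
   double actuallyFreedTime;  // when the real GC/CC reclaimed it
 };

 // Peak live bytes if every object were reclaimed either at its ideal time
 // (the moment it became garbage) or at the time it was actually freed.
 uint64_t PeakLiveBytes(const std::vector<ObjectRecord>& log,
                        bool useIdealFreeTime) {
   // Build a timeline of +size at allocation and -size at reclamation.
   std::map<double, int64_t> deltas;
   for (const ObjectRecord& rec : log) {
     double freeTime =
         useIdealFreeTime ? rec.becameGarbageTime : rec.actuallyFreedTime;
     deltas[rec.allocTime] += static_cast<int64_t>(rec.bytes);
     deltas[freeTime] -= static_cast<int64_t>(rec.bytes);
   }
   int64_t live = 0, peak = 0;
   for (const auto& entry : deltas) {
     live += entry.second;
     peak = std::max(peak, live);
   }
   return static_cast<uint64_t>(peak);
 }

Comparing the result with useIdealFreeTime=true against useIdealFreeTime=false gives a rough bound on how much of the observed peak is due to collection timing rather than genuinely live data.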

Problem F: Hard to track down problems

Hard to navigate through a memory dump or the current state to track down a specific problem

#F1 Dump all roots of a compartment, and trace roots back to the XPCOM/DOM/whatever thing that is holding onto that root (when available)

#F2 Go from JS object to things keeping it alive (dump out GC edges) -- see jimb's findReferences (currently JS shell only)

#F3 Record addr,size,stack at every allocation (kgadd's heap visualizer)

#F4


Details:

A2. atop records a ton of statistics about memory, disk, network, CPU, and other things at a 10-minute sampling interval. Stats are collected at both global and per-process granularity. It monitors every process that starts and stops, even if the process appeared and disappeared entirely between two samples. It dumps all this into a somewhat-compressed binary log.

The visual UI has a good set of heuristics for detecting "large" values and coloring the output accordingly. If your disk is busy for >90% of the sampling interval, it turns red; if your network traffic is a high percentage of the expected maximum bandwidth, it turns red; and so on.

It can be run in a 'top-like' mode, where it displays the current state of things, as well as in a historical mode where it reads from a log file. (Switching between the two is decidedly *not* seamless, but it should be.)

It also allows dumping historical data to text files. I've used that for generating graphs of various values.

For the browser, many of the same metrics are applicable, but I'd also like an equivalent of the per-process info. The idea is to be able to answer "what was going on at XXX?" So it should record user and browser actions, which tab was active, network requests, significant events firing, etc.
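
As a rough sketch of the browser-side sampler (none of this is an existing Gecko interface; the fields, names, and fixed-size-record format are all made up for illustration), in C++:

 // Hypothetical rotating on-disk sample log, in the spirit of atop's
 // binary logs. Illustrative only.
 #include <cstddef>
 #include <cstdint>
 #include <cstdio>

 struct MemorySample {
   uint64_t timestampSec;     // wall-clock time of the sample
   uint64_t residentBytes;    // OS-reported resident set size
   uint64_t heapAllocated;    // allocator-reported live heap
   uint32_t activeTabId;      // which tab was focused when sampled
   uint32_t pendingRequests;  // in-flight network requests
 };

 class RotatingSampleLog {
  public:
   RotatingSampleLog(const char* path, size_t maxSamples)
       : mPath(path), mMaxSamples(maxSamples), mNextSlot(0) {}

   // Append one sample, overwriting the oldest slot once the file is full.
   bool Record(const MemorySample& aSample) {
     FILE* f = fopen(mPath, "r+b");
     if (!f) f = fopen(mPath, "w+b");
     if (!f) return false;
     fseek(f, static_cast<long>(mNextSlot * sizeof(MemorySample)), SEEK_SET);
     size_t written = fwrite(&aSample, sizeof(MemorySample), 1, f);
     fclose(f);
     mNextSlot = (mNextSlot + 1) % mMaxSamples;
     return written == 1;
   }

  private:
   const char* mPath;
   size_t mMaxSamples;
   size_t mNextSlot;  // slot the next sample overwrites
 };

A timer in the browser would call Record() once per sampling interval; a separate command-line reader (or an about:memory-style page) would walk the slots to reconstruct the recent history.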


A3. The idea is that rather than waiting for the screen to redraw after every action on the way to about:memory, you just run firefox 'about:memory...' and go have a cup of tea while it thinks about it.

A5. This is based on pure speculation, but I don't understand why the browser is so incredibly unusable when memory usage is going nuts. Why is all that memory being touched? Why isn't it just swapped out and forgotten? Under the assumption that it's the GC scanning it over and over again, it seems like it would be nice to suppress GC in this situation. Generational GC could eliminate this problem in a nicer and much more principled way.

B2. I have the impression that we have many, many memory-related problem reports that end up being useless. I think that's really our fault; it's too hard for users to file useful bug reports. Experienced Mozilla devs don't even know what to do.

B5. e.g.: collect all the API calls that an addon makes (or record them, or whatever). Maintain a whitelist of APIs. (If you pass in a string, assume it may be duplicated a thousand times and stored in a SQLite DB forever; but if you're just setting existing booleans or reading state, you're blameless.)
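
A toy illustration of the whitelist idea (the API names and the per-addon call recorder are invented for this sketch; a real version would hook whatever layer addon API calls actually go through):

 // Illustrative only: count calls to APIs that can retain memory, per addon.
 #include <cstdint>
 #include <map>
 #include <set>
 #include <string>

 // APIs considered "blameless": they only read or toggle existing state.
 static const std::set<std::string> kWhitelistedApis = {
     "getBoolPref", "setBoolPref", "getSelectedTab"};

 // Count of potentially memory-retaining calls, keyed by addon id.
 static std::map<std::string, uint64_t> gSuspiciousCalls;

 void RecordApiCall(const std::string& addonId, const std::string& api) {
   if (kWhitelistedApis.count(api)) {
     return;  // reading state or setting an existing boolean: blameless
   }
   // Anything else (e.g. an API taking a string that might be copied into
   // a database and kept forever) counts against the addon.
   ++gSuspiciousCalls[addonId];
 }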

C2. When looking at a memory leak, I took several snapshots of /proc/<pid>/maps, diffed them to find a memory region that appeared and did not disappear, and then dumped out the raw memory to a file. Then I ran strings on it.
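
A small sketch of automating the first step, i.e. diffing two /proc/<pid>/maps snapshots to find regions that appeared between them (dumping the region's raw memory and running strings over it would still be separate steps):

 // Sketch: report address ranges present in an "after" snapshot of
 // /proc/<pid>/maps but not in a "before" snapshot.
 // Hypothetical usage: maps_diff before.maps after.maps
 #include <fstream>
 #include <iostream>
 #include <set>
 #include <sstream>
 #include <string>

 // Collect the "start-end" address range at the start of each maps line.
 static std::set<std::string> ReadRanges(const char* path) {
   std::set<std::string> ranges;
   std::ifstream in(path);
   std::string line;
   while (std::getline(in, line)) {
     std::istringstream fields(line);
     std::string range;
     fields >> range;  // first field, e.g. "7f2a40000000-7f2a40021000"
     if (!range.empty()) ranges.insert(range);
   }
   return ranges;
 }

 int main(int argc, char** argv) {
   if (argc != 3) {
     std::cerr << "usage: maps_diff <before> <after>\n";
     return 1;
   }
   std::set<std::string> before = ReadRanges(argv[1]);
   for (const std::string& range : ReadRanges(argv[2])) {
     if (!before.count(range)) std::cout << "new region: " << range << "\n";
   }
   return 0;
 }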

D2. I don't really know enough about the system to flesh this out properly, but it seems like when you have a bunch of memory lingering around when it really ought to be dead, many of the objects comprising that memory should be able to "know" that they *probably* shouldn't live past the current page, or for more than a few seconds, or whatever. Assuming this is possible, it should be possible to walk up a dominator graph and give a fairly directed answer to "why has this outlived what it thought its lifespan would be?"

Not every memory allocation needs to be marked for this to work. You just need one object within the "leaked" memory to be marked.

It could also walk the graph "en masse" to ignore individual objects that are reachable longer than expected and focus on the clusters of objects that are kept alive by the same thing. (I'm thinking that the expected lifetime is a guess, and may be inaccurate.)
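
To make the registration side concrete, here is a toy version (the registry, the lifetime categories, and the reporting threshold are all assumptions; the interesting part -- walking up the dominator graph to whatever is keeping an overstaying object alive -- is not shown):

 // Sketch of "expected lifetime" registration; not an existing Gecko API.
 #include <ctime>
 #include <map>
 #include <string>
 #include <vector>

 enum class ExpectedLifetime {
   CurrentPage,    // should die when its page goes away
   FewSeconds,     // short-lived scratch data
   BrowserSession  // allowed to live until shutdown
 };

 struct Registration {
   ExpectedLifetime expected;
   std::time_t registeredAt;
   std::string description;  // diagnostic label supplied at acquisition time
 };

 static std::map<const void*, Registration> gRegistry;

 void RegisterExpectedLifetime(const void* obj, ExpectedLifetime expected,
                               const std::string& description) {
   gRegistry[obj] = {expected, std::time(nullptr), description};
 }

 void UnregisterExpectedLifetime(const void* obj) { gRegistry.erase(obj); }

 // Called periodically (or at CC/GC time): report short-lived things that
 // are still registered well past a few seconds.
 std::vector<std::string> ReportOverstays(std::time_t now) {
   std::vector<std::string> overstays;
   for (const auto& entry : gRegistry) {
     const Registration& reg = entry.second;
     if (reg.expected == ExpectedLifetime::FewSeconds &&
         now - reg.registeredAt > 10) {
       overstays.push_back(reg.description);
     }
   }
   return overstays;
 }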


D4. e.g. use mprotect on a random subset of the heap to find pages (or smaller regions, but that's harder) that are never accessed after some point. Accesses made by the GC/CC themselves should be removed from consideration.
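
A minimal sketch of the mprotect trick on a single page (a real tool would protect a random subset of heap pages, record which ones fault, and filter out the GC/CC's own accesses; none of that is shown):

 // Sketch: mark one page PROT_NONE and detect whether it is ever touched.
 #include <csignal>
 #include <cstdio>
 #include <sys/mman.h>
 #include <unistd.h>

 static volatile sig_atomic_t gPageWasTouched = 0;
 static void* gWatchedPage = nullptr;
 static long gPageSize = 0;

 static void HandleSegv(int) {
   // The watched page was accessed: note that, then make it usable again
   // so the faulting instruction can be restarted.
   gPageWasTouched = 1;
   mprotect(gWatchedPage, gPageSize, PROT_READ | PROT_WRITE);
 }

 int main() {
   gPageSize = sysconf(_SC_PAGESIZE);
   gWatchedPage = mmap(nullptr, gPageSize, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

   struct sigaction sa = {};
   sa.sa_handler = HandleSegv;
   sigemptyset(&sa.sa_mask);
   sigaction(SIGSEGV, &sa, nullptr);

   // Protect the page; any later access faults into HandleSegv.
   mprotect(gWatchedPage, gPageSize, PROT_NONE);

   // Simulated workload: touching the page marks it as "accessed".
   static_cast<char*>(gWatchedPage)[0] = 1;

   printf("page %s accessed after protection\n",
          gPageWasTouched ? "was" : "was never");
   return 0;
 }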