Crash reporting overhaul: Difference between revisions

Added information on the minidump_stackwalker project, fixed a few typos and added new project entries
(Added information on dump_syms and completed the list of client-side tools)
(Added information on the minidump_stackwalker project, fixed a few typos and added new project entries)
Line 12: Line 12:


== crash monitor ==
== crash monitor ==
== in-minidump crash annotations ==


== crash reporter client ==
== crash reporter client ==
Line 30: Line 32:


=== Description ===
=== Description ===
The [https://github.com/mozilla-services/socorro/ Socorro] service we use for
ingesting and processing crashes relies on a Breakpad-based
[https://github.com/mozilla-services/socorro/ stackwalker] originally written
by Ted Mielczarek to extract stack traces from minidumps. This tool takes a
minidump as input, fetches the appropriate symbol files from our symbol server,
and ultimately emits the stack traces and additional information in JSON format.
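
As an illustration of the symbol-fetching step, here is a minimal sketch of how the relative path of a symbol file can be derived from a module's debug file name and debug identifier. It assumes the usual Breakpad-style layout (<code>debug_file/debug_id/name.sym</code>); the function name <code>symbol_file_path</code> is hypothetical and is not taken from the stackwalker's sources.

```rust
/// Build the relative path of a Breakpad symbol file on a symbol server.
///
/// Assumption: the server uses the common Breakpad layout
/// `<debug_file>/<debug_id>/<sym_name>`, where `<sym_name>` is the debug
/// file name with a trailing `.pdb` replaced by `.sym`, or with `.sym`
/// appended when there is no `.pdb` extension.
fn symbol_file_path(debug_file: &str, debug_id: &str) -> String {
    let sym_name = match debug_file.strip_suffix(".pdb") {
        // Windows modules: xul.pdb -> xul.sym
        Some(stem) => format!("{}.sym", stem),
        // Other platforms: libxul.so -> libxul.so.sym
        None => format!("{}.sym", debug_file),
    };
    format!("{}/{}/{}", debug_file, debug_id, sym_name)
}
```

The resulting path would be appended to the base URL of the symbol server. This sketch matches the <code>.pdb</code> extension case-sensitively, which a real implementation may not want.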


=== Rationale ===
=== Rationale ===
Like our other server-side tools, the stackwalker is based on a forked version
of Breakpad which differs from the one we have in mozilla-central. This
caused divergence in the past between the stack traces seen on developers'
machines or try and those seen on Socorro. The extra work needed to manually
keep the two in sync slows down development, and Breakpad's slow-moving
upstream doesn't help. In addition, this tool is not fully robust in the face
of malformed or unexpected inputs. We've often spent time tracking issues that
showed up on Socorro months after the fact because we had no useful output to
work on, nor was it easy to tell whether the failure was happening within the
tool or was caused by the input. Last but not least, we don't have proper
automated tests for this tool, relying instead on manual testing of every
release, which is time-consuming and exposes us to regressions.


=== Plan ===
=== Plan ===
We plan on rewriting the stackwalker tool by extending the
[https://github.com/luser/rust-minidump rust-minidump] crate. The crate will
need several changes, including:
* Bringing the minidump layout structures up to date
* Implementing a CFI- and FP-based stackwalker for the x86, x86-64, ARM and AArch64 architectures
* Teaching the stackwalker to talk to a symbol server to fetch the required symbol files
* Implementing parsers for unsupported minidump streams
* Updating all the human-readable mappings of various values and constants to bring them on par with Breakpad
* Teaching the stackwalker to read crash annotations from the Mozilla-specific .extra file and output Socorro-compatible JSON
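
To make the FP-based (frame pointer) part of the plan more concrete, here is a minimal sketch of frame-pointer unwinding over a captured stack. It assumes the conventional x86-64 frame layout, where the saved caller frame pointer sits at <code>[fp]</code> and the return address at <code>[fp + 8]</code>. The names <code>StackMemory</code> and <code>walk_frame_pointers</code> are hypothetical and do not reflect the rust-minidump API; a real stackwalker would try CFI first and validate the recovered addresses against the loaded modules.

```rust
use std::convert::{TryFrom, TryInto};

/// A contiguous snapshot of a thread's stack memory (hypothetical type).
struct StackMemory<'a> {
    /// Address of `bytes[0]` in the crashed process.
    base: u64,
    bytes: &'a [u8],
}

impl<'a> StackMemory<'a> {
    /// Read a little-endian u64 at `addr`, or `None` if it falls outside
    /// the captured region.
    fn read_u64(&self, addr: u64) -> Option<u64> {
        let offset = usize::try_from(addr.checked_sub(self.base)?).ok()?;
        let end = offset.checked_add(8)?;
        let slice = self.bytes.get(offset..end)?;
        Some(u64::from_le_bytes(slice.try_into().ok()?))
    }
}

/// Walk the frame-pointer chain starting from the crashing context.
/// Returns the instruction address of each recovered frame, innermost first.
fn walk_frame_pointers(
    stack: &StackMemory,
    pc: u64,
    fp: u64,
    max_frames: usize,
) -> Vec<u64> {
    let mut frames = vec![pc];
    let mut fp = fp;
    while frames.len() < max_frames {
        // Saved caller frame pointer lives at [fp].
        let caller_fp = match stack.read_u64(fp) {
            Some(value) => value,
            None => break,
        };
        // Return address lives at [fp + 8].
        let return_addr = match fp.checked_add(8).and_then(|a| stack.read_u64(a)) {
            Some(value) => value,
            None => break,
        };
        if return_addr == 0 {
            break;
        }
        frames.push(return_addr);
        // The stack grows downwards, so caller frames must be at higher
        // addresses; anything else means the chain is corrupt or finished.
        if caller_fp <= fp {
            break;
        }
        fp = caller_fp;
    }
    frames
}
```

Every read is bounds-checked and the walk stops as soon as the chain stops making sense, which is the kind of robustness against malformed input that the rationale above asks for.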


=== Results ===
=== Results ===
The new stackwalker tool was deployed on Socorro in mid-December 2021 and was
described by many as the
[https://twitter.com/Gankra_/status/1470805017280004098 perfect deployment]:
* The new tool proved to be twice as fast as the old one while consuming less memory
* We had no regressions save for {{bug|1757890}}, which was caught later on
* The new tool is covered by an extensive test suite
* Results from the new tool were better than those from the old one
** Problematic minidumps that crashed the old tool were now being handled correctly
** Stack traces were better overall, being much better on macOS (more on this later)
** Issues were now logged in detail in the debug output, which is also accessible on Socorro, making it much simpler to solve them
In addition to replacing the old tool, the new one brought along a very useful
new feature: support for
[https://gankra.github.io/blah/compact-unwinding/ Apple compact unwinding format]
which :gankra reverse-engineered from LLVM's sources. This turned the quality
of our macOS stack traces from mediocre to exact overnight.
Last but not least, the project has been picked up by [https://sentry.io/ Sentry]
for use in their software as a replacement for their own Breakpad-based
stackwalker. Sentry developers have been contributing changes to the crate at a
steady pace and will likely take over its maintenance.


== dump_syms ==
== dump_syms ==
Line 80: Line 134:


* Consolidate all the tools into a single portable and retargetable executable
* Consolidate all the tools into a single portable and retargetable executable
* Leverage Rust's existing ecosytem of crates to read debug information instead of rolling our own.
* Leverage Rust's existing ecosystem of crates to read debug information instead of rolling our own.
* Significantly improve the performance and reduce the resource usage of this tool. This is especially important considering that dumping symbol files is in the critical path of all our builds on automation and takes an appreciable amount of time and resources.
* Significantly improve the performance and reduce the resource usage of this tool. This is especially important considering that dumping symbol files is in the critical path of all our builds on automation and takes an appreciable amount of time and resources.


Line 108: Line 162:
tend to be smaller thanks to significantly reduced redundancy in the output.
tend to be smaller thanks to significantly reduced redundancy in the output.


During the coures of the project we contribute changes to the crates we used
During the course of the project we contribute changes to the crates we used
and Sentry in particular accomodated for a number of changes that we needed to
and Sentry in particular accommodated for a number of changes that we needed to
implement the new tool.
implement the new tool.


Line 172: Line 226:
[https://blog.mozilla.org/nnethercote/2020/04/15/better-stack-fixing-for-firefox/]
[https://blog.mozilla.org/nnethercote/2020/04/15/better-stack-fixing-for-firefox/]
describing his approach and results.
describing his approach and results.
== Telemetry-based dashboards ==