(Added information on dump_syms and completed the list of client-side tools) |
(Added information on the minidump_stackwalker project, fixed a few typos and added new project entries) |
Line 12: | Line 12: | ||
== crash monitor == | == crash monitor == | ||
== in-minidump crash annotations == | |||
== crash reporter client == | == crash reporter client == | ||
Line 30: | Line 32: | ||
=== Description === | === Description === | ||
The [https://github.com/mozilla-services/socorro/ Socorro] service we use for | |||
ingesting and processing crashes relies on a Breakpad-based | |||
[https://github.com/mozilla-services/socorro/ stackwalker] originally written | |||
by Ted Mielczarek to extract stack traces from minidumps. This tool takes a | |||
minidump as input, fetches the appropriate symbol files from our symbol server | |||
and ultimately emits the stack traces and additional information in JSON format. | |||
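
As an illustration of the symbol-fetching step, the sketch below builds the relative path under which a Breakpad-style symbol store conventionally keeps a module's <code>.sym</code> file. The function name <code>symbol_path</code> and the sample debug ID are hypothetical and are not taken from the actual stackwalker; the sketch also assumes a lowercase <code>.pdb</code> extension.

```rust
/// Builds the relative path of a symbol file in a Breakpad-style symbol
/// store: `<debug file>/<debug id>/<symbol file>`.
/// Hypothetical helper for illustration, not the stackwalker's real API.
fn symbol_path(debug_file: &str, debug_id: &str) -> String {
    // By convention a trailing ".pdb" is replaced with ".sym", while any
    // other file name simply gets ".sym" appended.
    let sym_name = match debug_file.strip_suffix(".pdb") {
        Some(stem) => format!("{}.sym", stem),
        None => format!("{}.sym", debug_file),
    };
    format!("{}/{}/{}", debug_file, debug_id, sym_name)
}

fn main() {
    // The debug ID below is a made-up placeholder.
    let id = "0123456789ABCDEF0123456789ABCDEF1";
    println!("{}", symbol_path("xul.pdb", id));
    println!("{}", symbol_path("libxul.so", id));
}
```

A client would typically append this path to the base URL of the symbol server and download the result.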
=== Rationale === | === Rationale === | ||
Like our other server-side tools, the stackwalker is based on a forked version | |||
of Breakpad which is different from the one we have in mozilla-central. In the | |||
past this caused the stack traces seen on developers' machines or on try to | |||
diverge from those on Socorro. The extra work needed to manually keep the two in sync slows down | |||
development and Breakpad's slow-moving upstream doesn't help. Moreover, | |||
this tool is not fully robust in the face of malformed or unexpected | |||
inputs. We've often spent time tracking issues that showed up on Socorro months | |||
after the fact because we had no useful output to work on, nor was it easy to | |||
tell whether the failure was happening within the tool or was caused by the | |||
input. Last but not least, we don't have proper automated tests for this tool, | |||
relying instead on manual testing of every release, which is time-consuming and exposes | |||
us to regressions. | |||
=== Plan === | === Plan === | ||
We plan on rewriting the stackwalker tool by extending the | |||
[https://github.com/luser/rust-minidump rust-minidump] crate. Several changes | |||
will be needed to the crate including: | |||
* Bringing the minidump layout structures up to date | |||
* Implementing a CFI & FP-based stackwalker for the x86, x86-64, ARM and AArch64 architectures | |||
* Teaching the stackwalker to talk to a symbol server to fetch the required symbol files | |||
* Implementing parsers for the currently unsupported minidump streams | |||
* Updating all the human-readable mappings of various values and constants to bring them on par with Breakpad | |||
* Teaching the stackwalker to read crash annotations from the Mozilla-specific .extra file and output Socorro-compatible JSON | |||
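
To make the FP-based part of the plan more concrete, here is a minimal sketch of frame-pointer unwinding on x86-64 over a captured block of stack memory. It assumes every function keeps the conventional <code>rbp</code> chain, where the saved caller <code>rbp</code> sits at <code>[rbp]</code> and the return address at <code>[rbp + 8]</code>. The <code>StackMemory</code> type and <code>walk_frame_pointers</code> function are hypothetical names for illustration and are not part of the rust-minidump API; a production stackwalker would prefer CFI when it is available and validate each return address against the loaded modules.

```rust
/// A read-only view of captured stack memory (hypothetical type).
struct StackMemory<'a> {
    /// Address of the first byte in `bytes`.
    base: u64,
    bytes: &'a [u8],
}

impl StackMemory<'_> {
    /// Reads a little-endian u64 at `addr`, or returns None if any part of
    /// the value falls outside the captured range.
    fn read_u64(&self, addr: u64) -> Option<u64> {
        let offset = addr.checked_sub(self.base)? as usize;
        let end = offset.checked_add(8)?;
        let chunk = self.bytes.get(offset..end)?;
        let mut buf = [0u8; 8];
        buf.copy_from_slice(chunk);
        Some(u64::from_le_bytes(buf))
    }
}

/// Follows the x86-64 frame-pointer chain starting at `rbp` and collects the
/// return addresses it finds, stopping after `max_frames` entries.
fn walk_frame_pointers(stack: &StackMemory<'_>, mut rbp: u64, max_frames: usize) -> Vec<u64> {
    let mut return_addresses = Vec::new();
    while return_addresses.len() < max_frames {
        // [rbp] holds the caller's saved rbp, [rbp + 8] the return address.
        let saved_rbp = match stack.read_u64(rbp) {
            Some(value) => value,
            None => break,
        };
        let ret = match rbp.checked_add(8).and_then(|addr| stack.read_u64(addr)) {
            Some(value) => value,
            None => break,
        };
        if ret == 0 {
            break;
        }
        return_addresses.push(ret);
        // The stack grows down, so a caller's frame must live at a higher
        // address; anything else means the chain is broken or has ended.
        if saved_rbp <= rbp {
            break;
        }
        rbp = saved_rbp;
    }
    return_addresses
}

fn main() {
    // Three fake frames laid out as (saved rbp, return address) pairs.
    let words: [u64; 6] = [0x1010, 0x400100, 0x1020, 0x400200, 0, 0x400300];
    let bytes: Vec<u8> = words.iter().flat_map(|w| w.to_le_bytes().to_vec()).collect();
    let stack = StackMemory { base: 0x1000, bytes: &bytes };
    for ret in walk_frame_pointers(&stack, 0x1000, 16) {
        println!("{:#x}", ret);
    }
}
```

The bounds-checked reads and the requirement that each saved <code>rbp</code> be higher than the current one are what keep the walk from looping or reading garbage when the input is malformed.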
=== Results === | === Results === | ||
The new stackwalker tool was deployed on Socorro in mid-December 2021 and was | |||
described by many as the | |||
[https://twitter.com/Gankra_/status/1470805017280004098 perfect deployment]: | |||
* The new tool proved to be twice as fast as the old one while consuming less memory | |||
* We had no regressions save for {{bug|1757890}}, which was caught later on | |||
* The new tool is covered by an extensive test suite | |||
* Results from the new tool were better than those from the old one | |||
** Problematic minidumps that crashed the old tool were now being handled correctly | |||
** Stack traces were better overall, and much better on macOS (more on this later) | |||
** Issues were now logged in detail in the debug output, which is also accessible on Socorro, making them much simpler to solve | |||
In addition to replacing the old tool, the new one brought along a very useful | |||
new feature: support for the | |||
[https://gankra.github.io/blah/compact-unwinding/ Apple compact unwinding format], | |||
which :gankra reverse-engineered from LLVM's sources. This turned the quality | |||
of our macOS stack traces from mediocre to exact overnight. | |||
Last but not least, the project has been picked up by [https://sentry.io/ Sentry] | |||
for use in their software as a replacement for their own Breakpad-based | |||
stackwalker. Sentry developers have been contributing changes to the crate at a | |||
steady pace and will likely take over its maintenance. | |||
== dump_syms == | == dump_syms == | ||
Line 80: | Line 134: | ||
* Consolidate all the tools into a single portable and retargetable executable | * Consolidate all the tools into a single portable and retargetable executable | ||
* Leverage Rust's existing | * Leverage Rust's existing ecosystem of crates to read debug information instead of rolling our own. | ||
* Significantly improve the performance and reduce the resource usage of this tool. This is especially important considering that dumping symbol files is in the critical path of all our builds on automation and takes an appreciable amount of time and resources. | * Significantly improve the performance and reduce the resource usage of this tool. This is especially important considering that dumping symbol files is in the critical path of all our builds on automation and takes an appreciable amount of time and resources. | ||
Line 108: | Line 162: | ||
tend to be smaller thanks to significantly reduced redundancy in the output. | tend to be smaller thanks to significantly reduced redundancy in the output. | ||
During the | During the course of the project we contributed changes to the crates we used | ||
and Sentry in particular | and Sentry in particular accommodated a number of changes that we needed to | ||
implement the new tool. | implement the new tool. | ||
Line 172: | Line 226: | ||
[https://blog.mozilla.org/nnethercote/2020/04/15/better-stack-fixing-for-firefox/] | [https://blog.mozilla.org/nnethercote/2020/04/15/better-stack-fixing-for-firefox/] | ||
describing his approach and results. | describing his approach and results. | ||
== Telemetry-based dashboards == |