Crash reporting overhaul

From MozillaWiki

Introduction

This page describes the various components involved in the rewrite of our crash reporting machinery, the rationale behind each rewrite, the goals we set for each component as well as the plan and progress information for each of them.

Client-side tools and components

Exception handlers

Status: not started
Developer(s): gsvelto
Source code:

Original source code:

Bugs:

  • bug 1620989 - Rewrite the Linux exception-handler in Rust
  • bug 1620990 - Rewrite the Windows exception-handler in Rust
  • bug 1620991 - Rewrite the macOS exception-handler in Rust

Description

Exception handlers and signal handlers are used to intercept abnormal conditions within Firefox and respond to them. Depending on the affected code and the type of exception, the handlers will either ignore it, conditionally process it or, if it is fatal, commence crash reporting.

Rationale

We have several exception and signal handlers scattered through the code. The main one is provided by Breakpad and is used to catch fatal exceptions. In addition to it, the JIT sets several handlers to process benign ones, and we have a few more whose purposes range from swallowing exceptions in certain processes to detecting and working around Windows bugs. This proliferation of handlers has several downsides:

  • There are no abstractions whatsoever. Every piece of code that needs to install a handler does so by using low-level platform-specific code (sometimes doing bare syscalls!)
  • Several handlers must be called in a specific sequence in order to work, with every step deciding if the exception should go further or not. On Windows this happens naturally as exception handling is structured; however, the order is implicit and depends on the startup sequence. On macOS, where the handling is non-hierarchical, not only do we have an implicit ordering but it also relies on tricks to set Mach message handlers at different levels (process vs. threads) so that they are called in the desired sequence. Finally, on Linux/Android there is no built-in ordering because only one handler can be active for a given signal at any time. This means that every handler has to take care of those that were installed before it and manually forward signals when necessary.
  • The combination of the lack of abstraction and implicit ordering means that their use is brittle. Coders are wary of touching them and sometimes scenarios like early crashes yield unpredictable results due to not all the handlers being in place yet.
  • Some handlers are covered by tests, but not all of them, and the sequence in which some of them must be called is not tested either.

Plan

We should rewrite the exception handlers starting with a crate that would provide proper abstractions and explicit ordering. The code relying on existing handlers would then need to be updated to use the crate instead, registering a callback, the conditions for it to be called and the order in which it needs to appear with regard to the other callbacks. The overall goal is to remove platform-specific code from the existing handlers and shrink them down to just their core functionality, moving all the platform code into the crate. With the ordering explicitly set we'd also remove all sorts of ambiguity. Once all the handlers are migrated we should hook the functions used to install handlers (such as signal(), sigaction(), SetUnhandledExceptionFilter(), task_set_exception_ports(), etc...) to prevent library code from injecting handlers under our nose. The hooks will either disregard the handlers or insert them in the right places in the hierarchy on a case-by-case basis.
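The registration model described above can be sketched in a few lines of Rust. This is only an illustration under stated assumptions: the names HandlerChain, ExceptionKind and Outcome are hypothetical, no such crate exists yet, and all the platform-specific plumbing (signals, Mach ports, structured exception handling) is left out. What it shows is the idea of registering a callback together with its condition and an explicit position in the chain.

```rust
use std::collections::BTreeMap;

/// What a callback decides after inspecting an exception.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    /// The exception was dealt with; stop walking the chain.
    Handled,
    /// Not ours; let the next callback look at it.
    Forward,
}

/// Simplified, platform-neutral view of an exception (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExceptionKind {
    SegmentationFault,
    IllegalInstruction,
    Abort,
}

type Callback = Box<dyn Fn(ExceptionKind) -> Outcome + Send + Sync>;

struct Entry {
    name: &'static str,
    /// Condition under which the callback is consulted at all.
    applies_to: fn(ExceptionKind) -> bool,
    callback: Callback,
}

/// Callbacks keyed by an explicit priority: lower values run first.
#[derive(Default)]
pub struct HandlerChain {
    entries: BTreeMap<u32, Entry>,
}

impl HandlerChain {
    /// Registers a callback. Fails if the priority is already taken,
    /// so the ordering can never be ambiguous.
    pub fn register(
        &mut self,
        priority: u32,
        name: &'static str,
        applies_to: fn(ExceptionKind) -> bool,
        callback: Callback,
    ) -> Result<(), String> {
        if let Some(existing) = self.entries.get(&priority) {
            return Err(format!(
                "priority {} already used by {}",
                priority, existing.name
            ));
        }
        self.entries.insert(priority, Entry { name, applies_to, callback });
        Ok(())
    }

    /// Walks the chain in priority order and returns the name of the
    /// callback that handled the exception, if any did.
    pub fn dispatch(&self, kind: ExceptionKind) -> Option<&'static str> {
        for entry in self.entries.values() {
            if (entry.applies_to)(kind) && (entry.callback)(kind) == Outcome::Handled {
                return Some(entry.name);
            }
        }
        None
    }
}
```

In this sketch a JIT handler registered at priority 10 for segmentation faults would always be consulted before a crash reporting handler registered at priority 20, regardless of which one was registered first.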

Minidump writers

Status: in progress
Developer(s): gsvelto, msirringhaus (contributor) and other external contributors
Source code:

Original source code:

Bugs:

  • bug 1620993 - Rewrite the Linux-specific minidump writer code in Rust
  • bug 1689358 - Add ARM/AArch64 support to the oxidized minidump rust writer
  • bug 1620995 - Rewrite the macOS-specific minidump writer code in Rust
  • bug 1620994 - Rewrite the Windows-specific minidump writer code in Rust

Description

Minidump writers are at the core of the crash reporting infrastructure. They extract information about a crashed process - such as register contents, thread states & stacks, interesting memory areas, list of memory mappings, etc... - and store it in a file following Microsoft's minidump format plus a few commonly used extensions. We currently leverage Breakpad's minidump writers for generating minidumps across all platforms.

Rationale

The minidump writers manipulate the memory of a crashed process and have several complex failure modes. Breakpad's writers have several issues that are difficult to address:

  • They are not very robust, sometimes failing to generate a minidump at all in cases where a partial one could be made
  • Their error-reporting is almost non-existent. This makes it impossible to diagnose issues when they're encountered in the wild
  • The existing code has limited test coverage and extending it is complex
  • They are designed to support both out-of-process and in-process generation, which adds a significant amount of complexity we don't need
  • The requirement to write in-process minidumps forces the code to do bare syscalls and avoid memory allocations, which heavily contributes to the existing code's complexity
  • They are tightly bound to Breakpad's architecture making it impossible to re-use them in other contexts
  • The way they communicate with other processes when doing out-of-process crash generation is highly platform-specific, not hidden behind an abstraction and brittle (some code relies on timeouts not to cause deadlocks!)

Because of the above we'd like to rewrite the writers in Rust using a more modular architecture, extensive error-reporting and full test coverage.
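As an illustration of what "partial minidumps plus extensive error-reporting" could look like, here is a minimal sketch. The Section and DumpResult types and the write_sections function are hypothetical and are not taken from the actual writer. The idea is that a failure in one section is recorded with its reason instead of aborting the whole dump.

```rust
/// One section of a minidump that the writer tries to produce (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Section {
    ThreadList,
    MemoryMappings,
    SystemInfo,
}

/// Outcome of a dump attempt: what was written and what went wrong.
#[derive(Debug, Default)]
pub struct DumpResult {
    pub written: Vec<Section>,
    pub soft_errors: Vec<(Section, String)>,
}

/// Tries every section in turn. A failing section is logged as a soft
/// error and the remaining sections are still attempted.
pub fn write_sections<F>(sections: &[Section], mut write_one: F) -> DumpResult
where
    F: FnMut(Section) -> Result<(), String>,
{
    let mut result = DumpResult::default();
    for &section in sections {
        match write_one(section) {
            Ok(()) => result.written.push(section),
            Err(reason) => result.soft_errors.push((section, reason)),
        }
    }
    result
}
```

A caller can then decide whether a dump with a non-empty soft_errors list is still worth submitting, and can attach the recorded reasons to the crash report for later diagnosis.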

Plan

The rewrite calls for the following to be done:

  • We will prepare a crate whose sole purpose is to offer minidump writing
  • We will re-implement the writers using Breakpad's code as a starting point, one platform at a time: first Linux, then Windows, then macOS
  • We will only implement out-of-process crash generation, cutting out the need for brittle in-process generation code
  • Functionality to synthesize minidumps will be provided to facilitate testing
  • Non-Microsoft extensions will be documented and provided with test coverage, possibly integrating them with rust-minidump so that we'd have a single source of truth for them
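To give an idea of what synthesizing a minidump for tests involves, the sketch below builds the 32-byte header defined by Microsoft's MINIDUMP_HEADER structure with an empty stream directory. The function name synthesize_empty_minidump is hypothetical and is not the crate's API; whether a given parser accepts a file with zero streams is an assumption that would need checking.

```rust
/// The bytes "MDMP" read as a little-endian u32.
pub const MINIDUMP_SIGNATURE: u32 = 0x504d_444d;
/// MINIDUMP_VERSION (42899) as stored in the low 16 bits of the version field.
pub const MINIDUMP_VERSION: u32 = 0xa793;
/// Size in bytes of the fixed header.
pub const HEADER_SIZE: u32 = 32;

/// Builds a header-only minidump with zero streams (test helper sketch).
pub fn synthesize_empty_minidump(time_date_stamp: u32) -> Vec<u8> {
    let mut out = Vec::with_capacity(HEADER_SIZE as usize);
    out.extend_from_slice(&MINIDUMP_SIGNATURE.to_le_bytes());
    out.extend_from_slice(&MINIDUMP_VERSION.to_le_bytes());
    // Number of streams: none in this synthetic file.
    out.extend_from_slice(&0u32.to_le_bytes());
    // Offset of the stream directory: placed right after the header.
    out.extend_from_slice(&HEADER_SIZE.to_le_bytes());
    // Checksum: left at zero.
    out.extend_from_slice(&0u32.to_le_bytes());
    out.extend_from_slice(&time_date_stamp.to_le_bytes());
    // Flags (64-bit): zero corresponds to a normal minidump.
    out.extend_from_slice(&0u64.to_le_bytes());
    out
}
```

A real test helper would go on to append streams (thread list, module list, and so on) and patch the stream count accordingly.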

Crash monitor

Status: not started
Developer(s): gsvelto
Source code:
Original source code: N/A
Bugs:

  • bug 1620998 - Write a crash monitor program to handle annotations and minidump writing

Description

Minidump generation is the step that occurs right after we have intercepted an exception and involves reading data from the crashed process and writing it out to disk. Additionally we need to write out the crash annotations to the .extra file that makes up a crash report together with the minidump. Currently this involves two distinct code paths in Firefox depending on the affected process. For child process crashes the minidump writing step is done by the main process in a background thread. The same thread is responsible for receiving the crash annotations from the child process and writing out the .extra file. On the other hand, if the main process crashes then the entire writing phase happens within the main process' exception handler as no separate process is available to accomplish this task.

Rationale

The current system suffers from significant complexity, is fragile, hard to test and often unreliable:

  • When the main process crashes all crash generation happens within an exception handler. This means that no memory allocations are possible and we can only do bare syscalls on Linux/macOS (that is we cannot use libc functions). This has led to frequent issues such as deadlocks (when some code accidentally did an allocation or syscall), stack overflows (because the signal handler stack is small) and failures to generate the minidump. In particular on Windows we call MiniDumpWriteDump() in the crashed process, which Microsoft documentation explicitly warns against.
  • Crash generation for child processes is generally more reliable but also suffers from a major issue: the presence of two distinct IPC channels and the complexity of the child process' exception handlers frequently lead to deadlocks in the affected code. The worst ones we encountered blocked Firefox entirely but they were rare. The more common ones would lead only the crash generation thread to get stuck, preventing crash reporting alone.
  • The presence of two distinct code paths (two exception handlers, two ways of streaming out crash annotations, etc...) adds significant complexity to the codebase and was often a source of bugs. In some cases a change done to one of the two paths was not replicated in the other causing a lack of functionality. In other cases shared code was written but failed to work properly because of the different constraints imposed on it by the context within which it was run.
  • Reporting errors during crash generation is complex. For child processes we have some machinery that can describe why we failed to write a minidump, however we have no such thing for main process crashes. This makes it hard to diagnose why we failed to generate a crash report.
  • The new Rust minidump writers only support out-of-process crash generation. This means we currently have to use two different minidump writers for in-process generation (Breakpad) and out-of-process generation (Rust).

Plan

To address the issues of the current system we plan to move all crash report generation to an external process (aka the "crash monitor"). The crash monitor will be responsible for generating both the minidump and the .extra file that make up a crash report. It should be able to detect crashes that are currently not detectable (such as OOM crashes on Linux) and will hand over the fully generated crash report to the main process. If generation fails for some reason, the monitor will communicate the reason to the main process. Because crash generation is currently done in Breakpad this needs to happen in several steps:

  1. We first have to create the crash monitor executable and equip it with the ability to communicate with the existing Breakpad infrastructure. This is necessary because we still rely on Breakpad's exception handler.
  2. We will then move the Breakpad minidump writer into the crash monitor and use it to generate minidumps for platforms where we still use Breakpad.
  3. For platforms where we already have a Rust-based writer we will wire it up so that it will be used instead of Breakpad code.
  4. We will use the current out-of-process exception handler for all processes, removing the main process' dedicated exception handler. We will modify it to not assume that it is talking with the main process but rather with the crash monitor.
  5. To extract crash annotations we will use the mechanism that is currently used by child processes, removing the in-process writing code.
  6. We will modify the Windows Error Reporting interceptor to pass the exception over to the crash monitor instead of doing crash generation by itself.
  7. We will enable the crash monitor to launch the crash reporter client in case of main process crashes.
  8. Finally we will add a mechanism to launch the crash monitor as soon as possible during startup. This should happen before any exception handlers are registered, or possibly lazily by the exception handlers themselves.
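The hand-over between a crashing process and the crash monitor can be sketched as a small request/reply exchange. The sketch below is only a model: it uses a thread and std::sync::mpsc channels as a stand-in for a separate process and a real IPC channel, and the CrashNotification and MonitorReply types are hypothetical. It shows the two outcomes described above: a fully generated report, or a failure with its reason.

```rust
use std::path::PathBuf;
use std::sync::mpsc;
use std::thread;

/// Sent to the monitor when a process has crashed (hypothetical message).
pub struct CrashNotification {
    pub pid: u32,
    /// Where the monitor should send its answer.
    pub reply: mpsc::Sender<MonitorReply>,
}

/// What the monitor hands back to the main process.
#[derive(Debug, PartialEq, Eq)]
pub enum MonitorReply {
    ReportReady { minidump: PathBuf, extra: PathBuf },
    Failed { reason: String },
}

/// Starts the monitor loop. `generate` stands in for the code that writes
/// the minidump and the .extra file for the given process.
pub fn spawn_monitor<G>(
    generate: G,
) -> (mpsc::Sender<CrashNotification>, thread::JoinHandle<()>)
where
    G: Fn(u32) -> Result<(PathBuf, PathBuf), String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel::<CrashNotification>();
    let handle = thread::spawn(move || {
        // The loop ends once every sender has been dropped.
        for note in rx {
            let reply = match generate(note.pid) {
                Ok((minidump, extra)) => MonitorReply::ReportReady { minidump, extra },
                Err(reason) => MonitorReply::Failed { reason },
            };
            // The requester may already be gone; ignore send errors.
            let _ = note.reply.send(reply);
        }
    });
    (tx, handle)
}
```

Because every failure travels back as a MonitorReply::Failed value, the main process always learns why a report could not be produced, for main process and child process crashes alike.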

Crash reporter client

Status: in progress
Developer(s): afranchuk
Source code:
Original source code:

Bugs:

  • bug 1759175 - Rewrite the crash reporter client in Rust

Description

The crash reporter client is the tool we use to submit crash reports when the browser crashes entirely. Its role is to gather the minidump and the crash annotations file, add missing annotations, send a crash ping and prompt the user to submit the crash (possibly with a comment). Once the user interacts with it, the crash reporter client will submit the crash, record its submission and restart Firefox.

Rationale

The crash reporter client is a particularly rigid piece of code due to its platform-specific nature, the fact that it cannot use any of the libraries we use in Firefox and that it's usually launched from within an exception handler. Because of the above it suffers from a multitude of issues:

  • We have platform-specific code for the UI, file management and network operations, making maintenance a nightmare
  • This is mostly platform-specific C/C++ code, but macOS also has some Objective-C thrown into the mix
  • The UI of the macOS version cannot be changed because it uses a binary description generated from a tool that was obsoleted ages ago (the last version of it ran on PowerPC macs only)
  • The Windows version has poor handling of paths with non-ASCII characters in them
  • The Windows version has poor high-DPI screen support
  • There are no tests covering its functionality
  • It relies heavily on environment variables to communicate with Firefox. This causes problems on Linux (see bug 1752703) and makes it hard to run it manually for testing
  • The entire processing flow is synchronous and blocks the UI at every step (parsing the annotations, creating the local stack trace, submitting it, etc...)
  • Localization is done via an INI file and cannot use Fluent
  • Given the hard-coded nature of the UI it's unclear how it behaves with RTL languages, probably very poorly
  • It does not support Glean-based telemetry

Plan

We should rewrite the client in Rust leveraging the standard library and common crates as much as possible to remove the platform-specific code:

  • The UI still needs to be done using platform-specific code, but we have a chance to modernize it, especially on Windows and macOS
  • We must use Fluent for localization and ensure proper support of RTL languages
  • Networking code poses a problem. Ideally we'd like to use reqwest but it's not vendored in mozilla-central yet. It might still require platform-specific code though we hope to avoid it
  • We should leverage Rust's asynchronous facility to make processing non-blocking and the UI responsive
  • We should add platform-independent tests for the parts of the code that do not require UI interaction
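The non-blocking processing flow can be illustrated with the standard library alone. The sketch below uses a worker thread and a channel rather than async/await, so it is a stand-in for whatever asynchronous facility the client ends up using. The Progress enum and the function names are hypothetical, and the parsing and stack trace steps are placeholders that only emit events.

```rust
use std::sync::mpsc;
use std::thread;

/// Events the background work reports to the UI (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub enum Progress {
    AnnotationsParsed,
    StackTraceReady,
    Submitted,
    Failed(String),
}

/// Runs the processing steps off the UI thread. `submit` stands in for
/// the network submission of the crash report.
pub fn start_processing<S>(submit: S) -> mpsc::Receiver<Progress>
where
    S: FnOnce() -> Result<(), String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Placeholder steps: a real client would do the work here.
        let _ = tx.send(Progress::AnnotationsParsed);
        let _ = tx.send(Progress::StackTraceReady);
        let last = match submit() {
            Ok(()) => Progress::Submitted,
            Err(reason) => Progress::Failed(reason),
        };
        let _ = tx.send(last);
    });
    rx
}

/// Called once per UI loop iteration: returns whatever events are
/// pending without ever blocking.
pub fn poll_events(rx: &mpsc::Receiver<Progress>) -> Vec<Progress> {
    rx.try_iter().collect()
}
```

Since none of this depends on a toolkit, the same flow can be exercised by platform-independent tests that replace submit with a closure returning a canned result.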

Glean-based crash pings

Status: in progress
Developer(s): afranchuk
Source code: https://github.com/mozilla/glean/
Original source code:

Bugs:

Description

For every crash we detect Firefox Desktop sends a crash ping holding information that would help us detect issues and prioritize which ones should be fixed. This information is highly structured, resembles the JSON output of Socorro's stackwalker, and unfortunately has seen relatively little use in the last few years mostly because it's hard to process.

Rationale

The crash ping uses legacy telemetry. This is problematic for a number of reasons: legacy telemetry is exclusively available in Firefox Desktop (not in GeckoView), we don't have good interfaces to process these pings, the tools we use to extract information can be complicated and mobile products are using a more modern data collection system, Glean.

Firefox for Android (aka Fenix) does not submit the crash ping that Firefox Desktop submits. It instead records a crash_count metric that's submitted in the metrics ping (via Android Components' lib-crash). Fenix does collect crash-related information through Sentry. The old Firefox for Android (Fennec) had full-featured crash pings but Fenix doesn't have any at all, leading to a rather large blind spot in our telemetry.

The best way to address all of the above is to migrate the crash ping to Glean: once this is implemented, all Mozilla products using Glean will be able to benefit from this improvement (Firefox Desktop, Firefox Android, Focus, etc.)

Plan

This migration requires several steps with changes happening in different parts of the codebase:

  • Prepare the design of a minimal crash ping that can be implemented using existing client- and server-side machinery.
  • Add support for this minimal Glean-based crash ping to Firefox desktop (inside the CrashManager) and Fenix where it needs to be done from scratch.
  • Once the new ping's functionality has been validated, broaden the design to include parts that might require adding a new metric type (such as stack traces) and include all of the legacy crash ping information.
  • Modify the code previously introduced to fully populate the Glean-based ping and make its payload match the legacy one. This might need extra work on the Fenix side, especially to capture stack traces.
  • Last but not least the crash reporter client needs to be instructed to send Glean crash pings in addition to legacy telemetry pings. Currently Glean doesn't support C++ so this work will need to happen after we rewrite the crash reporter client.
  • Decommission the legacy telemetry crash ping and remove the relevant code from Firefox desktop.

minidump-analyzer

Status: not started
Developer(s):
Source code: https://github.com/rust-minidump/rust-minidump/
Original source code:

Bugs:

Description

The minidump-analyzer tool is similar to the minidump_stackwalker in that it processes minidumps and emits stack traces in Socorro-compatible JSON format. The main differences between the tools are that minidump-analyzer runs on client machines rather than on our servers, it uses native debug information where possible to unwind stacks instead of Breakpad symbol files and it doesn't symbolicate its output. minidump-analyzer is run on a client machine for every crash that generated a valid minidump and its output is used to populate the contents of the crash ping.

Rationale

The minidump-analyzer suffers from many of the same problems as the minidump_stackwalker with regard to stability, maintenance burden and lack of automated testing. It is yet another Breakpad-based stack walker that produces slightly different results than the others. Additionally support for using native unwinding information was never fully implemented. We only ever implemented support for the Windows x86-64 platform and we would have to implement support for other platforms from scratch in order to make it fully functional.

We'd like to replace this tool with one which re-uses the same code as Socorro's minidump_stackwalker in order to reduce maintenance and keep results consistent between Socorro and crash telemetry. Additionally we'd like to add support for all missing platforms (Linux and macOS, as well as Windows/AArch64).

Plan

We plan on reusing the stackwalker developed as part of the minidump_stackwalker project, using Sentry's symbolic crate to parse the native debug information. This will require a few changes to the rust-minidump crate:

  • We need to add support for fetching unwinding directives from native debug information via the symbolic crate
  • We need to add machinery to find the appropriate files on the client machine instead of fetching them from a symbol server
  • We don't need to wire up the symbolicator to the native debug information as the stack traces we emit are raw and will be symbolicated later
  • We'll have to vendor rust-minidump and its dependencies into mozilla-central
  • We'll have to build and package the tool like we do with the existing one
  • Finally we should remove Breakpad's processor sources from the build as they won't be needed anymore

Sentry is already working on integrating symbolic with rust-minidump so we are currently waiting for that work to land. This might require very little work on our side in the end.

Server-side tools and components

minidump_stackwalker

Status: completed
Developer(s): gankra, gsvelto
Source code: https://github.com/rust-minidump/rust-minidump/
Original source code:

Description

The Socorro service we use for ingesting and processing crashes relies on a Breakpad-based stackwalker originally written by Ted Mielczarek to extract stack traces from minidumps. This tool takes a minidump as input, fetches the appropriate symbol files from our symbol-server and ultimately emits the stack traces and additional information in JSON format.

Rationale

Like our other server-side tools the stackwalker is based on a forked version of Breakpad which is different from the one we have in mozilla-central. This caused divergence in the past between stack traces seen on developers' machines or try and those seen on Socorro. The extra work needed to manually keep the two in sync slows down development and Breakpad's slow-moving upstream doesn't help. Furthermore this tool is not fully robust in the face of malformed or unexpected inputs. We've often spent time tracking issues that showed up on Socorro months after the fact because we had no useful output to work on, nor was it easy to tell whether the failure was happening within the tool or was caused by an issue with the input. Last but not least we don't have proper automated tests for this tool, relying on manual testing of every release which is time-consuming and exposes us to regressions.

Plan

We plan on rewriting the stackwalker tool by extending the rust-minidump crate. Several changes will be needed to the crate including:

  • Bringing the minidump layout structures up-to-date
  • Implementing a CFI & FP-based stackwalker for the x86, x86-64, ARM and AArch64 architectures
  • Teaching the stackwalker to talk to a symbol server to fetch the required symbol files
  • Implementing parsers of unsupported minidump streams
  • Updating all the human-readable mappings of various values and constants to bring them on par with Breakpad
  • Teaching the stackwalker to read crash annotations from the Mozilla-specific .extra file and output Socorro-compatible JSON
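To make the frame pointer (FP) part of the stackwalker item more concrete, here is a minimal sketch of FP-based unwinding on x86-64 over a simulated stack. It is not rust-minidump's implementation: StackMemory and walk_frame_pointers are hypothetical names, and a real walker prefers CFI when available and validates candidate frames far more carefully. The sketch relies on the usual x86-64 convention that the word at the frame pointer holds the caller's saved frame pointer and the word 8 bytes above it holds the return address.

```rust
use std::collections::BTreeMap;

/// Simulated stack contents: address -> 64-bit word stored there.
pub struct StackMemory {
    words: BTreeMap<u64, u64>,
}

impl StackMemory {
    pub fn new(words: &[(u64, u64)]) -> Self {
        Self { words: words.iter().copied().collect() }
    }

    fn read(&self, addr: u64) -> Option<u64> {
        self.words.get(&addr).copied()
    }
}

/// Returns the instruction addresses of the frames found by following
/// the chain of saved frame pointers, starting from the crashing frame.
pub fn walk_frame_pointers(mem: &StackMemory, pc: u64, fp: u64, max_frames: usize) -> Vec<u64> {
    let mut frames = vec![pc];
    let mut fp = fp;
    while frames.len() < max_frames {
        // [fp] holds the caller's frame pointer, [fp + 8] the return address.
        let saved_fp = match mem.read(fp) {
            Some(value) => value,
            None => break,
        };
        let return_address = match fp.checked_add(8).and_then(|addr| mem.read(addr)) {
            Some(value) => value,
            None => break,
        };
        if return_address == 0 {
            break;
        }
        frames.push(return_address);
        // The stack grows downwards, so a caller's frame must live at a
        // strictly higher address; anything else is garbage or a loop.
        if saved_fp <= fp {
            break;
        }
        fp = saved_fp;
    }
    frames
}
```

The walk stops as soon as memory cannot be read or the chain stops moving towards higher addresses, which is what keeps it from looping on a corrupted stack.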

Results

The new stackwalker tool was deployed on Socorro in mid-December 2021 and was described by many as the perfect deployment:

  • The new tool proved to be twice as fast as the old one while consuming less memory
  • We had no regressions save for bug 1757890 which was caught later on
  • The new tool is covered by an extensive test suite
  • Results from the new tool were better than the old one
    • Problematic minidumps that crashed the old tool were now being handled correctly
    • Stack traces were better overall, being much better on macOS (more on this later)
    • Issues were now logged in detail in the debug output which is also accessible on Socorro making solving issues much simpler

In addition to replacing the old tool the new one brought along a very useful new feature: support for Apple compact unwinding format which :gankra reverse-engineered from LLVM's sources. This turned the quality of our macOS stack traces from mediocre to exact overnight.

Last but not least the project has been picked up by Sentry for use in their software as a replacement for their own Breakpad-based stackwalker. Sentry developers have been contributing changes to the crate at a steady pace and will likely take over its maintenance.

dump_syms

Overview

Status: completed
Developer(s): calixte, gsvelto
Source code: https://github.com/mozilla/dump_syms
Original source code:

Bugs:

  • bug 1588538 - Use the new Windows dump_syms in Firefox local builds
  • bug 1588534 - Use the new Windows dump_syms to dump Microsoft libraries
  • bug 1588739 - Rewrite the Linux-specific implementation of dump_syms in Rust
  • bug 1588740 - Rewrite the macOS-specific implementation of dump_syms in Rust

Description

The dump_syms tool is used to extract symbol files (.sym) from binaries and libraries. It generates both symbols and stack unwinding information and stores them in the Breakpad symbol file format [1].

We use this tool both to extract symbol files from Firefox builds and from system libraries across all supported platforms.
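As an illustration of the Breakpad symbol file format, the sketch below parses three of its record types: MODULE, FUNC and PUBLIC. It is a simplified reader written for this page, not code from dump_syms; the Record enum and parse_record function are hypothetical, and the other record types (FILE, line records, STACK and so on) are ignored. Numbers in these records are hexadecimal without a 0x prefix, and the name is the last field and may contain spaces.

```rust
/// A subset of the records found in a Breakpad .sym file.
#[derive(Debug, PartialEq, Eq)]
pub enum Record {
    Module { os: String, arch: String, id: String, name: String },
    Func { address: u64, size: u64, param_size: u64, name: String },
    Public { address: u64, param_size: u64, name: String },
}

fn parse_hex(field: &str) -> Option<u64> {
    u64::from_str_radix(field, 16).ok()
}

/// Parses one line; returns None for malformed or unsupported records.
pub fn parse_record(line: &str) -> Option<Record> {
    let (kind, rest) = line.trim_end().split_once(' ')?;
    match kind {
        "MODULE" => {
            let mut fields = rest.splitn(4, ' ');
            Some(Record::Module {
                os: fields.next()?.to_string(),
                arch: fields.next()?.to_string(),
                id: fields.next()?.to_string(),
                name: fields.next()?.to_string(),
            })
        }
        "FUNC" => {
            // An optional "m" marks a symbol with multiple definitions.
            let rest = rest.strip_prefix("m ").unwrap_or(rest);
            let mut fields = rest.splitn(4, ' ');
            Some(Record::Func {
                address: parse_hex(fields.next()?)?,
                size: parse_hex(fields.next()?)?,
                param_size: parse_hex(fields.next()?)?,
                name: fields.next()?.to_string(),
            })
        }
        "PUBLIC" => {
            let rest = rest.strip_prefix("m ").unwrap_or(rest);
            let mut fields = rest.splitn(3, ' ');
            Some(Record::Public {
                address: parse_hex(fields.next()?)?,
                param_size: parse_hex(fields.next()?)?,
                name: fields.next()?.to_string(),
            })
        }
        _ => None,
    }
}
```

For example, the line "FUNC 1130 28 0 main" describes a function named main that starts at offset 0x1130 in the module and is 0x28 bytes long.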

Rationale

The Breakpad-based tools suffer from a number of different issues:

  • They lack support for recent additions to the native debugging formats, particularly DWARF5. Upstream isn't in a hurry to add them so we had to roll our own changes but they're incomplete.
  • Each platform has its own tool and each tool cannot be cross-compiled. So we have three distinct implementations of dump_syms: one for Windows, one for Linux and one for macOS.
  • The Windows implementation relies on Microsoft's closed-source DLLs from the DIA SDK to access PDB files. Besides making it impossible to run the tool under non-Windows platforms this exposes us to bugs that we cannot fix.
  • Function name demangling is platform-dependent, as such the same function yields different symbols on different platforms (e.g. the anonymous namespace being presented as (anonymous namespace) on Linux and as `anonymous namespace` on macOS).
  • The Windows dump suffers from bugs in Microsoft's demangler implementation.
  • We have to use ugly tricks to fix up certain symbols that are synthesized by LLVM and which the Microsoft demangler does not understand.
  • The implementation is slow and consumes large amounts of memory. Dumping a debug build of libXUL can take several minutes and consume over 4 GiB of RAM.
  • The Linux implementation is incapable of dealing with compressed debug information.

Plan

The goals for this rewrite are the following:

  • Consolidate all the tools into a single portable and retargetable executable
  • Leverage Rust's existing ecosystem of crates to read debug information instead of rolling our own.
  • Significantly improve the performance and reduce the resource usage of this tool. This is especially important considering that dumping symbol files is in the critical path of all our builds on automation and takes an appreciable amount of time and resources.

To achieve this goal we would like to use a mix of Sentry's Symbolic Rust crates - to access debug information and to demangle the symbols - and crates that allow direct access to the debug information such as goblin and pdb.

All these crates are well maintained, have responsive upstream communities and support more functionality than Breakpad. Additionally they support Rust as a tier 1 language when it comes to handling and demangling symbols which is a nice touch given the nature of our codebase.

Results

The new dump_syms tool has been rolled out across all of Mozilla infrastructure and has been in use since the summer of 2020. It is significantly faster than the old tool (we've seen reductions of an order of magnitude in the time needed to dump libxul) and consumes an order of magnitude less memory. It has broad support for modern debug information (including parts that were reverse-engineered specifically for the new tool such as Apple compact unwinding information).

The symbols it emits are of higher quality than those of the old tool, uniform across different platforms and have much better coverage. Additionally the symbol files tend to be smaller thanks to significantly reduced redundancy in the output.

During the course of the project we contributed changes to the crates we used, and Sentry in particular accommodated a number of changes that we needed to implement the new tool.

fix-stacks

Overview

Status: completed
Developer(s): njn, glandium
Source code: https://github.com/mozilla/fix-stacks/
Original source code:

Bugs:

  • bug 1596292 - Replace stack-fixing scripts with a Rust-based one

Description

The fix-stacks tool looks for raw stack traces within the output of our test runs and replaces the raw memory addresses with function names so that the output is readable.
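The core operation, mapping a raw address to the function that contains it, can be sketched as a binary search over a sorted symbol table. This is an illustration only: SymbolTable and fix_line are hypothetical, the line format handled here (space-separated tokens, with addresses written as 0x followed by hex digits) is a simplification rather than the format of real test logs, and the actual tool reads its symbols from native debug information.

```rust
/// Function start addresses and names, kept sorted by address.
pub struct SymbolTable {
    symbols: Vec<(u64, String)>,
}

impl SymbolTable {
    pub fn new(mut symbols: Vec<(u64, String)>) -> Self {
        symbols.sort_by_key(|(start, _)| *start);
        Self { symbols }
    }

    /// Finds the last symbol starting at or before `addr`. Symbol sizes
    /// are not tracked in this sketch, so an address past the end of the
    /// last function still resolves to that function.
    pub fn lookup(&self, addr: u64) -> Option<&str> {
        let idx = self.symbols.partition_point(|(start, _)| *start <= addr);
        idx.checked_sub(1).map(|i| self.symbols[i].1.as_str())
    }
}

/// Replaces every `0x...` token that resolves to a symbol with its name
/// and leaves all other tokens untouched.
pub fn fix_line(table: &SymbolTable, line: &str) -> String {
    line.split(' ')
        .map(|token| {
            token
                .strip_prefix("0x")
                .and_then(|hex| u64::from_str_radix(hex, 16).ok())
                .and_then(|addr| table.lookup(addr))
                .map(|name| name.to_string())
                .unwrap_or_else(|| token.to_string())
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

Building the table once and answering every lookup with a binary search avoids spawning an external tool per address.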

Rationale

The legacy implementation of fix-stacks is split into three different Python scripts, each one being platform dependent. The Linux and macOS scripts rely on calling platform-specific tools such as addr2line or otool. These tools are called several times and take a significant amount of time to process large debug information (such as that produced by a debug build of libxul). The macOS version is so slow that it's disabled by default in certain tasks because it would cause the tasks to time out. The version relying on Breakpad symbols is platform independent but requires an additional step (generating the symbols) and consumes enormous amounts of memory (see bug 1493365). We don't have a version that uses native debug information on Windows.

Plan

The goals for this rewrite are the following:

  • Consolidate all the scripts into a single platform-agnostic executable
  • Use native debug information so we don't need an extra processing step
  • Significantly improve the performance and reduce the resource usage of this tool given it affects the runtime of tests both on automation and locally

To achieve this goal we would like to use Sentry's Symbolic Rust crates (https://crates.io/crates/symbolic). These crates provide a platform-agnostic interface to read debug information, making them a perfect fit for our use-case.

Results

The project was deemed complete in April 2020, with the old scripts removed and the new tool used across all tasks and all platforms. The resulting tool is significantly smaller than the original scripts, provides better output and is anywhere from 2x to 100x (!) faster than the scripts while using less memory. The performance improvements shorten the execution of tasks with failures both on the try server and locally, and enabled us to have stack-fixing in tasks that previously couldn't afford it.

njn wrote a detailed blog post [2] describing his approach and results.