Crash reporting improvements

Introduction

This page lists the various improvements that we want to introduce after having finished overhauling the existing crash reporting machinery (see the Crash reporting overhaul page for more information). Many of the tasks described here were features that had been requested years ago but could not be implemented in a reasonable amount of time using the old Breakpad-based tooling.

List of projects

Minidump storage for crash annotations

Status: not started
Developer(s):
Source code:
Original source code:

Bugs:

Description

Crash annotations are pieces of information that accompany a minidump to form a complete crash report. They contain critical information such as the Firefox version and build ID, but also ancillary information such as how much memory a process was using, or a user-provided string associated with a failed assertion that crashed the process.

Currently, crash annotations are stored in a JSON file (with an .extra suffix) that is sent along with the minidump to Socorro. Depending on the type of crash, this file is either written out by the exception handler (if the main process crashed), or the annotations are forwarded to the main process, which then writes them out (in the case of a child process crash).

Rationale

There are several issues with the current system:

  • Having a separate file adds significant complexity both when submitting and when processing crash reports, and introduces additional failure modes (like only one of the files being present in the report)
  • The file needs to be written out after the minidump has been written out, adding complexity to the exception handler
  • For child processes an extra IPC channel is needed to send the annotations
  • Setting annotations is a relatively expensive process
  • Some annotations are synthesized at crash time and handled by ad-hoc code; there is no unified mechanism to handle them together with the others

Given the above, storing the annotations within the minidump would simplify the crash reporting flow, eliminate an additional IPC channel, and make it much cheaper and simpler for user code to set annotations.

Plan

Annotations should be stored within the minidump and read directly from the crashed process. This requires several steps:

  • The crash annotations interface in Gecko needs to be modified so that a process can flag where its annotations are stored
  • The crash-time annotations need to be removed and replaced with regular ones
  • We need to add a mechanism to distinguish between a process's own annotations and global ones that must be included in every crash
  • Minidump writers need to be modified to identify where the annotations are stored in a process's memory, read them and write them out within the minidump
  • Finally, the stackwalker tool needs to be taught to look for the annotations in the minidump and print them out
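
As an illustration of the steps above, here is a minimal sketch of a self-describing annotation table that a process could expose and that a minidump writer could decode after copying it out of the crashed process. The layout (the "ANNO" magic, the field widths, the function names) is entirely hypothetical and is not Gecko's actual format. The point it tries to show is that the reader must validate every length, because the memory of a crashed process cannot be trusted.

```python
import struct

# Hypothetical layout, NOT Gecko's real format:
#   header: 4-byte magic b"ANNO", u32 entry count (little-endian)
#   entry:  u16 key length, u32 value length, key bytes, value bytes
MAGIC = b"ANNO"
HEADER = struct.Struct("<4sI")
ENTRY = struct.Struct("<HI")


def encode_annotations(annotations):
    """Serialize a dict of str -> str into the hypothetical table layout."""
    out = bytearray(HEADER.pack(MAGIC, len(annotations)))
    for key, value in annotations.items():
        k, v = key.encode("utf-8"), value.encode("utf-8")
        out += ENTRY.pack(len(k), len(v)) + k + v
    return bytes(out)


def decode_annotations(blob):
    """Decode the table, raising ValueError on malformed or truncated input."""
    if len(blob) < HEADER.size:
        raise ValueError("truncated header")
    magic, count = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("bad magic")
    offset, result = HEADER.size, {}
    for _ in range(count):
        if offset + ENTRY.size > len(blob):
            raise ValueError("truncated entry header")
        klen, vlen = ENTRY.unpack_from(blob, offset)
        offset += ENTRY.size
        end = offset + klen + vlen
        if end > len(blob):
            raise ValueError("truncated entry body")
        # Replace invalid bytes instead of failing: a partially corrupted
        # annotation is still more useful than none.
        key = blob[offset:offset + klen].decode("utf-8", errors="replace")
        value = blob[offset + klen:end].decode("utf-8", errors="replace")
        result[key] = value
        offset = end
    return result
```

With a layout along these lines, setting an annotation is a plain memory write in the process that owns it, and no file or IPC message is involved until a crash actually happens.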

Additionally some changes will be required to Socorro on the ingestion side. Socorro currently relies on the .extra file contents for filtering. For example annotations containing the product version are used to decide if a crash is coming from a version of Firefox that is very old and thus should be dropped. If we store the annotations within the minidump we need to provide a way for Socorro to extract them without processing the full minidump, so that it can still apply its filtering rules. To this end we need to write a streamlined minidump pre-processor that only extracts this information and provides it in JSON format. This might prove useful for other types of filtering we don't currently do (such as rejecting reports caused by hardware faults or unconditionally accepting those that might indicate security-sensitive issues). The rust-minidump crate provides all the necessary functionality to write this tool.
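
The filtering step could look roughly like the sketch below. It assumes the pre-processor has already emitted the annotations as a JSON object and that the product version is available under a `Version` key. The cutoff value and the function names are hypothetical and do not reflect Socorro's real rules.

```python
import json

MIN_SUPPORTED_MAJOR = 100  # hypothetical cutoff, not Socorro's actual rule


def major_version(version):
    """Return the leading integer of a version string such as '115.0.2' or
    '120.0a1', or None if the string does not start with a number."""
    head = version.split(".", 1)[0]
    digits = ""
    for ch in head:
        if ch not in "0123456789":
            break
        digits += ch
    return int(digits) if digits else None


def should_accept(annotations_json):
    """Decide whether to keep a report using only its annotations."""
    try:
        annotations = json.loads(annotations_json)
    except json.JSONDecodeError:
        return False
    if not isinstance(annotations, dict):
        return False
    version = annotations.get("Version")
    if not isinstance(version, str):
        return False
    major = major_version(version)
    return major is not None and major >= MIN_SUPPORTED_MAJOR
```

Because the decision depends only on the extracted JSON, the full minidump never has to be processed for reports that end up being dropped.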

Telemetry-based dashboards

Overview

Status: not started
Developer(s):
Source code:
Original source code:

Description

Rationale

Plan

Disassembly in the stack walker

Overview

Status: completed
Developer(s): cmartin
Source code: https://github.com/rust-minidump/rust-minidump
Original source code: N/A

Description

Sometimes critical information is missing when inspecting a crash. For example, crashes hitting non-canonical addresses on x86-64 don't have the real crashing address but a useless placeholder instead (see bug 1493342). In other cases, such as when executing an illegal instruction, we only have the address of the instruction but no idea what it was.

Rationale

To fill in the missing information in crash reports it would be useful to disassemble the crashing instruction and be able to inspect it:

  • For non-canonical addresses we could reconstruct the real crashing address from the registers and immediate values in the instructions
  • For misaligned vector accesses we could reconstruct the real crashing address from the registers and immediate values in the instructions
  • For invalid instructions we could tell if the instruction is valid but unsupported on that CPU, or downright invalid (in the case of a bit-flip or corrupted executable for example)
  • For privileged or unsupported instructions we'd be able to tell if it's our fault or if the machine configuration is not adequate
  • For null pointer accesses we'd be able to remove the fixed offset often applied to the pointer and make the crash more obvious (or tell it apart from bit-flips in the lower bits)
  • Hardware bugs often result in impossible crashes where the crash reason simply could not have been triggered by the faulting instruction. For example the crash reason is an invalid access but the faulting instruction is a branch, or an arithmetic operation that does not access memory. With the disassembled instruction in hand we could detect those cases and flag the crash report as suspicious.
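
To make the non-canonical address case concrete, here is a minimal sketch. It assumes 48-bit virtual addressing, where an address is canonical only if bits 63 down to 47 are all identical. The function names and the example register value are made up for illustration, and the operands are assumed to come from the disassembled instruction and the register state saved in the minidump.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF


def is_canonical(address, virtual_bits=48):
    """True if all bits from (virtual_bits - 1) up to bit 63 are equal."""
    if not 0 <= address <= MASK_64:
        raise ValueError("address must fit in 64 bits")
    upper = address >> (virtual_bits - 1)
    all_ones = (1 << (64 - virtual_bits + 1)) - 1
    return upper == 0 or upper == all_ones


def effective_address(base, index=0, scale=1, displacement=0):
    """Recompute base + index * scale + displacement with 64-bit wraparound."""
    return (base + index * scale + displacement) & MASK_64


# Example: an instruction such as `mov rax, [rdi + 0x10]` where rdi holds a
# repeated byte pattern (hypothetical register value).
address = effective_address(0xE5E5E5E5E5E5E5E5, displacement=0x10)
print(hex(address), is_canonical(address))
```

When the recomputed address is non-canonical, it can replace the placeholder reported by the operating system, which makes patterns in the register contents visible in the crash report.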

Plan

We could integrate a disassembler such as iced in the rust-minidump stackwalker and use it to disassemble the crashing instruction. Our minidump writers usually include the memory area around the crashing instruction, so we could also disassemble the entire area. We'd then use the raw result to verify and adjust the crashing address, and add a human-readable version of the disassembly to the JSON output so that it can be surfaced.
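
As a rough sketch of how the decoded instruction could be used to flag "impossible" crashes: the `DecodedInstruction` type below is a hypothetical summary of what a disassembler reports and is not iced's API. The reason strings are illustrative only, since their exact spelling depends on the platform and on the stack walker's output.

```python
from dataclasses import dataclass

# Illustrative crash reasons that imply a faulting memory access.
MEMORY_FAULT_REASONS = {
    "EXCEPTION_ACCESS_VIOLATION_READ",
    "EXCEPTION_ACCESS_VIOLATION_WRITE",
    "SIGSEGV / SEGV_MAPERR",
    "SIGSEGV / SEGV_ACCERR",
}


@dataclass
class DecodedInstruction:
    """Hypothetical summary of a disassembled instruction."""
    mnemonic: str
    accesses_memory: bool


def is_suspicious(crash_reason, crash_address, instruction_pointer, instruction):
    """Flag a report whose crash reason the faulting instruction could not
    have triggered, for example a memory fault on a register-only add."""
    if crash_reason not in MEMORY_FAULT_REASONS:
        return False
    if crash_address == instruction_pointer:
        # The fault happened while fetching the instruction itself, so the
        # instruction's operands tell us nothing.
        return False
    return not instruction.accesses_memory
```

A check like this only marks a report for closer inspection. Note that instructions with implicit memory operands (such as call, ret or push, which touch the stack) need to be reported as accessing memory for the result to be meaningful.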

Support inlined functions in crash stacks

Overview

Status: completed
Developer(s): mstange
Source code:

Original source code: N/A
Bugs:

Description

For a long time Breakpad symbol files only included names and information for non-inlined functions. This was recently changed, and symbol files can now include the names of inlined functions as well as the regions of memory where they were inlined, complete with indexes to discern at what level of the stack they appeared.
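
The sketch below shows how such records could be read and queried. It assumes the layout described in Breakpad's symbol file documentation, `INLINE_ORIGIN origin_id name` and `INLINE nest_level call_site_line call_site_file_id origin_id [address size]+`, with addresses and sizes in hexadecimal and the other numeric fields in decimal. The sample records and function names are made up for illustration.

```python
def parse_inline_records(lines):
    """Collect INLINE_ORIGIN and INLINE records from .sym file lines.

    Other record types and malformed records are ignored.
    """
    origins, inlines = {}, []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "INLINE_ORIGIN" and len(fields) >= 3:
            # Function names may contain spaces, so split at most twice.
            _, origin_id, name = line.split(None, 2)
            origins[int(origin_id)] = name.strip()
        elif fields[0] == "INLINE" and len(fields) >= 7 and len(fields) % 2 == 1:
            depth, call_line, call_file, origin_id = (int(f) for f in fields[1:5])
            ranges = [(int(a, 16), int(s, 16))
                      for a, s in zip(fields[5::2], fields[6::2])]
            inlines.append({"depth": depth, "call_line": call_line,
                            "call_file": call_file, "origin_id": origin_id,
                            "ranges": ranges})
    return origins, inlines


def inlined_frames_at(address, origins, inlines):
    """Return the names of the functions inlined at a module-relative
    address, outermost first."""
    hits = [rec for rec in inlines
            if any(start <= address < start + size
                   for start, size in rec["ranges"])]
    hits.sort(key=lambda rec: rec["depth"])
    return [origins.get(rec["origin_id"], "<unknown>") for rec in hits]


SAMPLE = [
    "INLINE_ORIGIN 0 inner()",
    "INLINE_ORIGIN 1 middle(int)",
    "INLINE 0 12 3 1 1000 40",
    "INLINE 1 27 3 0 1010 8",
]
origins, inlines = parse_inline_records(SAMPLE)
print(inlined_frames_at(0x1012, origins, inlines))
```

The nesting level is what lets a stack walker reconstruct the order of the inlined frames when several of them cover the same address.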

Rationale

Firefox code relies heavily on inlining, especially in layered Rust and C++ code. The lack of inline information has hampered us, often making crashes hard to interpret. Adding support for inlined functions would make it easier to diagnose bugs and would significantly simplify triage of certain modules.

Plan

  • The first step is to introduce support in Symbolic to correctly parse these fields while reading .sym files. This will later be used to add support in the stack walker
  • Once Symbolic support is ready, dump_syms needs to be modified to emit these directives. Symbolic already supports reading inlining information from native debuginfo, so it's a matter of leveraging that information
  • Finally the stack walker needs to be modified to take into account the new directives and emit inlined frames in the output

Improved stack overflow detection & analysis

Overview

Status: in progress
Developer(s): gsvelto
Bugs:

Description

For years we assumed that stack overflows would be captured by the Breakpad exception handlers. This assumption was based on the presence of crash reports involving stack overflows on Windows, the use of an alternate signal stack on Linux, and macOS' exception handler architecture, which delegates exceptions to a separate thread. Real-world testing and bugs proved that we were actually missing a significant number of stack overflows:

  • On Linux the alternate signal stack was only available on the main thread, so stack overflows in other threads wouldn't be caught
  • When we did catch a stack overflow on Linux the minidump writer might mistake the guard page for the stack, thus storing an empty stack in the generated minidump
  • On Windows only some stack overflow crashes were caught, others would be silently forwarded to Windows Error Reporting
  • On macOS the exception handler seems capable of catching the overflow but the minidump writer produces a malformed minidump which is completely unusable
  • Crash reports caused by stack overflows are obvious on Windows which has a specific exception for them, but on macOS/Linux they're indistinguishable from other crashes

Plan

This project requires tackling several issues:

  • On Linux we need to ensure all threads have an alternate signal stack installed when they're launched and we need to modify the minidump writer to properly identify where the stack is
  • On macOS we need to investigate the issues with minidump writing, possibly integrating the required changes in the oxidized minidump writer
  • On Windows we need to ensure that the Windows Error Reporting interceptor catches stack overflows
  • We need to introduce a test that specifically checks for stack overflows and ensures that they're being caught properly, then enable it one platform at a time
  • Last but not least, we need to flag macOS/Linux stack overflows so that they're easy to tell apart from other types of crashes
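
One possible way to flag stack overflows on macOS/Linux is sketched below. This is a hypothetical heuristic rather than an existing implementation. It assumes the stack grows downwards, that a guard region sits directly below the lowest mapped stack address, and that the fault address, the stack pointer and the stack bounds of the crashing thread can be recovered from the minidump.

```python
PAGE_SIZE = 4096  # assumed page size; some platforms use larger pages


def looks_like_stack_overflow(fault_address, stack_pointer, stack_low,
                              guard_size=PAGE_SIZE):
    """Heuristically classify a memory fault as a stack overflow.

    stack_low is the lowest mapped address of the crashing thread's stack.
    """
    # The faulting access landed in the guard region just below the stack.
    in_guard = stack_low - guard_size <= fault_address < stack_low
    # The stack pointer is at, or already past, the bottom of the stack.
    sp_near_bottom = stack_pointer < stack_low + PAGE_SIZE
    return in_guard and sp_near_bottom
```

Requiring both conditions is a deliberate choice in this sketch: a wild pointer can hit a guard page by accident, but it is unlikely to do so while the stack pointer is also at the very bottom of the stack. A function with a very large frame can skip past the guard region entirely, so a heuristic like this would miss some overflows.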