Crash reporting overhaul

From MozillaWiki
Revision as of 14:28, 18 February 2022 by Gsvelto (talk | contribs) (Overall structure & information about fix-stacks)

Introduction

This page describes the components involved in the rewrite of our crash reporting machinery. For each component it covers the rationale behind the rewrite, the goals we set, the plan, and progress information.

Client-side tools and components

crash report generation

crash reporter client

minidump-analyzer

Server-side tools and components

minidump_stackwalker

Status: completed
Developer(s): gankra, gsvelto
Source code: https://github.com/luser/rust-minidump/
Original source code:

Description

Rationale

Plan

Results

dump_syms

Overview

Status: completed
Developer(s): calixte, gsvelto
Source code: https://github.com/mozilla/dump_syms
Original source code:

Description

Rationale

Plan

Results

fix-stacks

Overview

Status: completed
Developer(s): njn, glandium
Source code: https://github.com/mozilla/fix-stacks/
Original source code:

Description

The fix-stacks tool looks for raw stack traces within the output of our test runs and replaces the raw memory addresses with function names so that the output is readable.
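As a rough illustration of the idea (not the actual implementation, which is written in Rust), the following Python sketch rewrites lines that look like stack frames. The frame shape `#NN: function[file +0xOFFSET]`, the `SYMBOLS` table, and the helper names `lookup` and `fix_line` are assumptions made for this example; the real tool reads function names from debug information rather than from a hard-coded table.

```python
import bisect
import re

# Assumed frame shape: "#NN: function[file +0xOFFSET]". This pattern is an
# assumption for the sketch; the real tool's pattern may differ.
FRAME_RE = re.compile(r"^(.*#\d+: )(.+)\[(.+) \+0x([0-9A-Fa-f]+)\](.*)$")

# Hypothetical symbol table: file name -> list of (start offset, function name),
# sorted by start offset. A real tool would build this from debug information.
SYMBOLS = {
    "libexample.so": [
        (0x1000, "main"),
        (0x1400, "parse_args"),
        (0x2000, "run_test"),
    ],
}


def lookup(file_name, offset):
    """Return the function with the greatest start offset <= offset, or None."""
    table = SYMBOLS.get(file_name)
    if not table:
        return None
    starts = [start for start, _ in table]
    index = bisect.bisect_right(starts, offset) - 1
    return table[index][1] if index >= 0 else None


def fix_line(line):
    """Replace the function field of a frame line; leave other lines untouched.

    Expects a single line without a trailing newline.
    """
    match = FRAME_RE.match(line)
    if match is None:
        return line  # not a stack frame
    prefix, _old_name, file_name, hex_offset, suffix = match.groups()
    name = lookup(file_name, int(hex_offset, 16))
    if name is None:
        return line  # no symbol information available for this frame
    return f"{prefix}{name}[{file_name} +0x{hex_offset}]{suffix}"
```

For instance, `fix_line("#01: ???[libexample.so +0x1410]")` yields `#01: parse_args[libexample.so +0x1410]`, while ordinary log lines and frames from unknown files pass through unchanged.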

Rationale

The legacy implementation of fix-stacks is split across three Python scripts, each tied to a particular platform or symbol format. The Linux and macOS scripts rely on platform-specific tools such as addr2line or otool. These tools are called repeatedly and take a significant amount of time to process large debug information (such as that produced by a debug build of libxul). The macOS version is so slow that it is disabled by default in certain tasks because it would cause them to time out. The version relying on Breakpad symbols is platform independent but requires an additional step (generating the symbols) and consumes enormous amounts of memory (see bug 1493365). We don't have a version that uses native debug information on Windows.

Plan

The goals for this rewrite are the following:

  • Consolidate all the scripts into a single platform-agnostic executable
  • Use native debug information so we don't need an extra processing step
  • Significantly improve the performance and reduce the resource usage of this tool, given it affects the runtime of tests both on automation and locally

To achieve these goals we would like to use Sentry's Symbolic Rust crates (https://crates.io/crates/symbolic). These crates provide a platform-agnostic interface for reading debug information, which makes them a perfect fit for our use case.

Work on this is tracked under bug 1596292 and its blockers.

Results

The project was deemed complete in April 2020, with the old scripts removed and the new tool used across all tasks and all platforms. The resulting tool is significantly smaller than the original scripts, provides better output, and is anywhere from 2x to 100x (!) faster than the scripts while using less memory. The performance improvements shorten the execution of tasks with failures both on the try server and locally, and they enabled us to have stack-fixing in tasks that previously couldn't afford it.

njn wrote a detailed blog post [1] describing his approach and results.