Crash reporting overhaul: Difference between revisions

Crash reporting overhaul (view source)

Revision as of 11:14, 4 April 2022

5,137 bytes added , 4 April 2022

Added information about the crash monitor project

Gsvelto


@@ Line 12: / Line 12: @@
 == Crash monitor ==
+Status: not started<br>
+Developer(s): gsvelto<br>
+Source code:<br>
+Original source code: N/A<br>
+Bugs:<br>
+* {{bug|1620998}}
+=== Description ===
+Minidump generation is the step that occurs right after we have intercepted an
+exception and involves reading data from the crashed process and writing it out
+to disk. Additionally we need to write out the crash annotations to the .extra
+file that makes up a crash report together with the minidump. Currently this
+involves two distinct code paths in Firefox depending on the affected process.
+For child process crashes the minidump writing step is done by the main process
+in a background thread. The same thread is responsible for receiving the crash
+annotations from the child process and writing out the .extra file. On the other
+hand, if the main process crashes then the entire writing phase happens within
+the main process' exception handler as no separate process is available to
+accomplish this task.
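As a rough illustration of the two files described above, the following minimal sketch writes a minidump and its .extra companion side by side. The function name, the file naming scheme and the JSON serialization of the annotations are assumptions made for this example, not a description of the actual Firefox code.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Tuple


def write_crash_report(directory: Path, crash_id: str, minidump: bytes,
                       crash_annotations: Dict[str, str]) -> Tuple[Path, Path]:
    """Write the two files that make up a crash report and return their paths.

    Hypothetical helper: the naming scheme and the JSON encoding of the
    annotations are assumptions for illustration only.
    """
    dump_path = directory / f"{crash_id}.dmp"
    extra_path = directory / f"{crash_id}.extra"
    dump_path.write_bytes(minidump)
    extra_path.write_text(json.dumps(crash_annotations), encoding="utf-8")
    return dump_path, extra_path


with tempfile.TemporaryDirectory() as tmp:
    # b"MDMP" is placeholder content, not a valid minidump.
    dump, extra = write_crash_report(
        Path(tmp), "example-crash", b"MDMP", {"ProductName": "Firefox"})
    # Both files share the same stem so they can be matched up later.
    paired = dump.stem == extra.stem
    restored = json.loads(extra.read_text(encoding="utf-8"))
```

A report is only complete when both files are present, which is why the sketch returns the two paths together.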
+=== Rationale ===
+The current system suffers from significant complexity, is fragile, hard to
+test and often unreliable:
+* When the main process crashes all crash generation happens within an exception handler. This means that no memory allocations are possible and we can only do bare syscalls on Linux/macOS (that is, we cannot use libc functions). This has led to frequent issues such as deadlocks (when some code accidentally did an allocation or syscall), stack overflows (because the signal handler stack is small) and failures to generate the minidump. In particular on Windows we call [https://docs.microsoft.com/en-us/windows/win32/api/minidumpapiset/nf-minidumpapiset-minidumpwritedump MiniDumpWriteDump()] in the crashed process, which Microsoft's documentation explicitly warns against.
+* Crash generation for child processes is generally more reliable but also suffers from a major issue: the presence of two distinct IPC channels and the complexity of the child process' exception handlers frequently lead to deadlocks in the affected code. The worst ones we encountered blocked Firefox entirely but they were rare. The more common ones caused only the crash generation thread to get stuck, which prevented crash reporting but nothing else.
+* The presence of two distinct code paths (two exception handlers, two ways of streaming out crash annotations, etc.) adds significant complexity to the codebase and was often a source of bugs. In some cases a change made to one of the two paths was not replicated in the other, causing a lack of functionality. In other cases shared code was written but failed to work properly because of the different constraints imposed on it by the context within which it was run.
+* Reporting errors during crash generation is complex. For child processes we have some machinery that can describe why we failed to write a minidump; however, we have no such thing for main process crashes. This makes it hard to diagnose why we failed to generate a crash report.
+* The new Rust minidump writers only support out-of-process crash generation. This means we currently have to use two different minidump writers for in-process generation (Breakpad) and out-of-process generation (Rust).
+=== Plan ===
+To address the issues of the current system we plan to move all crash report
+generation to an external process (aka the "crash monitor"). The crash
+monitor will be responsible for generating both the minidump and the .extra
+file that make up a crash report. It should be able to detect crashes that are
+currently not detectable (such as OOM crashes on Linux) and will hand over the
+fully generated crash report to the main process. If generation fails, the
+monitor will communicate the reason to the main process. Because
+crash generation is currently done in Breakpad this needs to happen in several
+steps:
+# We first have to create the crash monitor executable and equip it with the ability to communicate with the existing Breakpad infrastructure. This is necessary because we still rely on Breakpad's exception handler.
+# We will then move the Breakpad minidump writer into the crash monitor and use it to generate minidumps for platforms where we still use Breakpad.
+# For platforms where we already have a Rust-based writer we will wire it up so that it will be used instead of Breakpad code.
+# We will use the current out-of-process exception handler for all processes, removing the main process' dedicated exception handler. We will modify it to not assume that it is talking with the main process but rather with the crash monitor.
+# To extract crash annotations we will use the mechanism that is currently used by child processes, removing the in-process writing code.
+# We will modify the Windows Error Reporting interceptor to pass the exception over to the crash monitor instead of doing crash generation by itself.
+# We will enable the crash monitor to launch the crash reporter client in case of main process crashes.
+# Finally we will add a mechanism to launch the crash monitor as soon as possible during startup. This should happen before any exception handlers are registered, or possibly lazily from the exception handlers themselves.
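The hand-over described in the plan, where the crash monitor returns either a fully generated crash report or the reason generation failed, could be modelled roughly as below. This is only a sketch: the type names, the callable-based writers and the choice of `OSError` as the failure signal are all hypothetical and do not reflect an existing interface.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class CrashReport:
    """Hypothetical result for a successfully generated report."""
    minidump_path: str
    extra_path: str


@dataclass
class GenerationFailure:
    """Hypothetical result carrying the reason generation failed."""
    reason: str


def generate_report(write_minidump: Callable[[], str],
                    write_extra: Callable[[], str]
                    ) -> Union[CrashReport, GenerationFailure]:
    """Run both writers and report either the finished files or the failure."""
    try:
        minidump_path = write_minidump()
        extra_path = write_extra()
    except OSError as error:
        # Assumption: writers signal failure by raising OSError.
        return GenerationFailure(reason=f"{type(error).__name__}: {error}")
    return CrashReport(minidump_path, extra_path)


def failing_writer() -> str:
    raise OSError("disk full")


ok = generate_report(lambda: "crash.dmp", lambda: "crash.extra")
failed = generate_report(failing_writer, lambda: "crash.extra")
```

Returning the failure as a value rather than dropping it gives the main process the same diagnostic information for every process type, which is what the rationale says is currently missing for main process crashes.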
 == Minidump storage for crash annotations ==
