Logging and Diagnostics
Contents
The Problem
A number of issues that have led us to look for a better solution in this area. First of all, most logging information in Mozilla is currently written to the system console window. In debug builds, this includes failures of NS_ENSURE_* macros, JS errors and application-specific log messages. This means that real problems can get buried and go unnoticed. Some progress has been made towards reducing the number of NS_ENSURE_* warnings that are displayed in the course of normal Firefox operation, but the proper solution is probably a more powerful and flexible logging framework.
Another problem is the lack of information associated with log messages, whether of the NS_ENSURE_*, printf or nsIConsoleService variety. This makes it very difficult to track down tricky bugs that are hard to reproduce in a debugger, especially those related to threading, UI messages and events. Having additional information like stack traces and values of local/global variables associated with log messages would this task much easier.
Finally, it's very difficult to diagnose problems at remote sites if they can't be easily reproduced. Better mechanisms for logging to disk would be extremely helpful. NSPR logging is of limited value since it can't be turned on/off programatically.
Use Cases
Catching Development-Time Errors
Honza is writing some C++ code as part of a complex extension. Part of his code makes some modifications to a database. He tests his code and it seems to work fine, so he checks in. Because the system console is not very visible and is filled with unrelated information (including NS_ENSURE_* failures inside the Mozilla core), he doesn't notice is that some of his code is failing at an NS_ENSURE_* macro, and the information written to the database is therefore incomplete and inconsistent.
Debugging Tricky Problems
Marketa is writing some code to respond to messages coming in through a network socket. She isn't getting the expected behavior, but she can't locate the problem in a debugger since she is receiving thousands of messages and only a small number of those are failing. She strews trace messages through the relevant code, but even though she can see traces related to the problem, she doesn't know exactly why the relevant code is triggered. Solving the problem requires hours (or days) or trial and error.
(Another good example of this is tracking download deadlocks, which explains the motivation for Brendan's deadlock detection code with stack traces (NS_TRACE_MALLOC_XXX)).
Remote Diagnostics
Acme Firefox Extensions releases a cool new extension for ordering pizza in all 48 contiguous U.S. States, right from inside your Firefox browser! They deliver it to 1000 beta testers, but many of the testers complain that their pizzas are arriving late and are already ice cold. The Acme developers suspect that the problem is to do with the way the extension is integrated with the web service gateway of their IPP (Internet Pizza Provider). They add a bunch of trace statements to try to narrow down the problem. The traces confirm that the gateway is giving unexpected responses, but they can't figure out why the problem is so difficult to reproduce consistently.
With great effort they add code to generate stack traces for some of the log messages. They can see that a certain function is failing, but they still don't understand why. Only when they add further code to do a complete symbol dump to the log for the relevant frame in the stack trace do they finally notice that a variable is not being initialized correctly, causing a problem if the value randomly ends up being negative.
The Requirements
- Low-level logging for code that can't access XPCOM.
- Filter messages by module.
- Write messages to system console, file, XUL window, etc.
- Stack information for each message. Ideally this would include support for "mixed mode" stacks that include, for example, C++ and JavaScript frames. If there were some way to provide stacks across PLEvent post boundaries, this would be a huge plus.
- Symbolic information for each stack frame, including the values of all local and global variables, and some way to filter this information selectively to avoid humungous logs.
- Automatic cleanup of log files after some specified period.
- Some sort of Talkback-like functionality to "phone home" with log files when specific errors (e.g. crashes) occur.
The Solution?
I think a solution might consist of the following:
- A low-level logging module similar to NSPR logging but with the ability to turn logging on and off programatically.
- More sophisticated NS_ENSURE_* macros that use this module and can provide stack traces and application-specific formatted messages -- something like NS_ENSURE_SUCCESS(rv, MY_MODULE_NAME, nsMessage("Couldn't open file %s", fileName));
- An XPCOM logging service along the lines of that described in bug 228205, also based on the low-level logging module. The service would act very much like the nsIObserverService, enabling listeners to subscribe to specific modules. Logging messages should contain a stack frame reference (presumably nsIStackFrame) and some way to get at program symbols associated with each stack frame. The most obvious way to do this would be to have an nsIStackFrameSymbols interface or something as a member of nsIStackFrame, but perhaps there is some way to store the symbols elsewhere and retrieve them when needed. In any case, generation of stack frame and symbolic information should be parametrizable since obviously this is a very processor-intensive task.
- A generic component for generating stack traces wrapped in XPCOM interfaces. This would also be useful for providing a satisfactory fix for bug 322959, getting Brendan's deadlock detection w/stack trace code working without the need for tweaking, etc.
- A datasource that implements the log service listener interface and caches log messages.
- A XUL component that displays a tree view of the datasource, so that each message can be opened up to see the associated stack trace, and each stack frame can be opened to see the associated variables/values.
- An XPCOM component that implements the log service listener interface and logs to a file. There should be parameters for determining when new log files are created (e.g. daily) and when they should be cleaned up.
- A Talkback-like component for sending log files to a server using some standard protocol (SMTP? FTP?). Ideally it would be possible to separate the client and server components from the actual transport used, so that people could write custom transports.