Exceptions

From MozillaWiki
Jump to: navigation, search

This is a discussion and planning document for refactoring Mozilla to use C++ exceptions. Exceptions will likely arrive after Mozilla 2, but we hope to improve out-of-memory (OOM) handling for Mozilla 2 and maybe take a few steps closer to exception safety.

Originally, this page suggested the goal would be to enable exceptions while preserving current behavior. But it looks like preserving current behavior exactly would make the code too ugly (e.g., needing try statements with empty catch blocks for ignored error codes), defeating the main purpose of going to exceptions. The new goal is to go more slowly, improving error handling functionality and cleaning up the code so it looks really nice once exceptions can be turned on.

The next steps are:

  • Remove all current OOM handling code
  • Create contracts for outparams
  • Rewrite call sites that ignore nsresults
  • Rewrite call sites that use NS_SUCCEEDED

These steps are explained in more detail below. And by the way, we'd love to have community help with any of these. Look for "Coding Task" below to find small but useful bites of work that you might be interested in starting on.

Improved OOM Handling

Background

As of Mozilla 1.9, the general strategy for handling OOMs is this:

  • Functions that allocate memory check for a null pointer (returned from new or malloc) and return NS_OUT_OF_MEMORY if allocation failed.
  • Other functions check for NS_FAILED return codes or NS_OUT_OF_MEMORY errors and try to respond accordingly.

This doesn't really work. In practice, if an allocation fails, Firefox crashes within a second. There are many problems with this strategy, but the important ones for exceptions purposes are: (1) no call site should ever ignore an OOM (and it's not practical to achieve that with nsresult NS_OUT_OF_MEMORY), and (2) most of the time, the right response is to activate a centralized routine that tries to free memory (which is best done with exceptions or an allocation failure handler rather than sprinkling OOM checks everywhere).

So the new OOM strategy will go something like this:

  • Allocation operations (new, malloc) will never return null pointers.
  • No method will return NS_OUT_OF_MEMORY.
  • No code will check for NS_OUT_OF_MEMORY.
  • General OOM conditions will go to an OOM handler that immediately crashes. (Later, the handler may be upgraded to try to free memory--this is a separate issue.)
  • A few allocations that "predictably" cause OOM will be able to check for and respond to OOMs using a special mechanism (e.g., calling a special allocator). The prime example of this kind of allocation is buffers for downloaded images: a huge image could easily cause OOM, and this can be relatively easily handled by the local image code (by not displaying the image).

This strategy will be more robust, more easily improved in the future, and it will reduce code size by removing all the OOM checks.

Rewriting OOMs

There are two rewriting tasks for OOMs:

First, we want to remove all allocation null checks. For example,

   Foo *p = new Foo();
   if (p == NULL) return NS_OUT_OF_MEMORY;
   use(p);

will be replaced by

   Foo *p = new Foo();
   use(p);

Second, we want to remove all explicit OOM checks. For example, in

   nsresult rv = p->foo();
   if (rv == NS_OUT_OF_MEMORY) { ... }

the if statement would be deleted.

For these tasks, we need a full automatic rewriting system, with a pattern detection component and a patch generation component. The example patterns given here can be detected by looking at ASTs, so Dehydra GCC would be a great tool to use for detecting them. Once we have the list of patterns (say, as line numbers where the if statements start), we can work on an Elsa-based patch generator.

Coding Task 1: Dehydra GCC script to detect null pointer checks for allocated memory.

Coding Task 2: Dehydra GCC script to detect explicit tests for NS_OUT_OF_MEMORY return value.

Contracts for Outparams

Background

Some methods ("getters") 'return' a value other than an nsresult to the caller. Generally, the value is returned through a pointer-typed parameter, a.k.a., outparam. As of Mozilla 1.9, it's not always clear whether the returned value can be null if there is not error, or if the value can be changed if there is an error, etc., making things more confusing for callers.

This situation needs to be cleared up for exceptions. With exceptions, we want these methods to return their value directly, which will have performance and developer sanity benefits. In that case, when there is an exception, no value is returned at all to the calling context. Thus, for all methods, when the error code is nonzero, outparams should not be modified.

Some methods have the property that when the return code is NS_OK, the outparam is non-null (equivalently, when the outparam is non-null, the return code is nonzero). We want to identify these methods, because for them, a check for null is a valid way of checking for an error, and should be treated the same as an nsresult check for the other refactorings discussed here.

Code Changes

Coding Task 3: Create an analysis to check for methods that modify outparams when the return a nonzero nsresult. Eventually this analysis should be a gcc plugin so it can be integrated with the build process.

Coding Task 4: Create an analysis to find methods that always return non-null outparams when the nsresult is NS_OK. Eventually, we want to annotate these methods with gcc attributes and enforce the annotation with a gcc plugin.

Fixing Ignored nsresults

Background

There are a fair number of call sites that ignore nsresult return values. This can be for several reasons, including:

  • The caller checks failure using some other condition (e.g., a null return value)
  • The function being called always returns NS_OK.
  • At this call site, the caller has ensured that the function will succeed.
  • The caller doesn't need to respond to errors.

These calls need checking before we can enable exceptions. In general, it won't be possible to ensure that a function doesn't throw an exception, especially if we use exceptions for OOM. Thus, call sites that now ignore nsresults need to be looked at and made exception safe.

Finding Ignored nsresults

The key need here is a tool to automatically find call sites that ignore return values. There is a script under development (by dmandelin) that does this, but it needs to be improved to handle all the special cases, such as checking for a null return value.

Coding Task 5: Finish the ignored nsresult analysis.

Once the list is in place, the calls will need manual attention.

Removing NS_SUCCEEDED

Background

Many call sites check nsresults with NS_SUCCEEDED. For example:

   nsresult rv = p->PrepareToUse();
   if (NS_SUCCEEDED(rv)) {
     p->Use();
   }
   return rv;

This doesn't make sense with exceptions, because with exceptions enabled, if PrepareToUse returns, then it succeeded.

Investigating NS_SUCCEEDED

We want to rewrite call sites that use NS_SUCCEEDED to look more like something that will work with exceptions. But it's not clear yet exactly what that looks like. The first step is to analyze some existing uses of NS_SUCCEEDED to get an idea of what patterns exist and how to rewrite them. The example above, with exceptions can look like:

   p->PrepareToUse();
   p->Use();

Before we have exceptions, we will probably want something like:

   nsresult rv = p->PrepareToUse();
   if (NS_FAILED(rv)) return rv;
   p->Use();
   return NS_OK;

This would be fairly easy to rewrite to the exceptions version, because the NS_FAILED check is easily identified as equivalent to letting the exception propagate to the caller.


Far Future Stuff

This was the original plan for implementing exceptions, now categorized as far future. Also hopefully these steps will be easier once we've done the medium-term stuff documented above.

Benefits of Exceptions

The main reason for using exceptions is improving developer productivity. Hopefully, error handling code will become cleaner and more concise, and modifying code will be easier. outparamdel will make it easier to call getter methods. The error-handling overhaul might even come close to entirely nuking code relating to OOMs and null outparam pointers.

We don't know what the performance effects will be, but it's expected they will be minimal either way. Exceptions can increase performance because error tests are no longer needed in the common success case, output values can be returned directly, and inline code for propagating errors is removed. But exceptions can reduce performance because throwing and catching them is relatively expensive and the catch tables take up space.

General Plan

This section attempts to list all the changes that need to be made to make Firefox use exceptions. As much as possible, we'd like to break this process down into small steps and schedule them in an efficient order. (TBD)

Enabling Exceptions in the Compiler

An essential step is to enable exceptions in the compiler. By itself, this shouldn't do too much, but it may change the behavior of operator new in OOM conditions to throw exceptions instead of returning a null pointer. We'll want to do that eventually, anyway.

Taras: Exceptions are already occurring the code https://bugzilla.mozilla.org/show_bug.cgi?id=353144#c70

"Generic" Exceptions

There are several FF-specific issues with exceptions, which are discussed below. In this section, we discuss how to make FF use C++ exceptions for "generic" error conditions.

nsresult return types will be changed to void. This affects the IDL header file generator, hand-written declarations, and implementation method signatures.

An exception type hierarchy needs to be created to replace error codes. There is an opportunity here to include extra information in the exception object.

Statements like return NS_ERROR_FOO will be replaced with throw statements.

Call sites of nsresult methods need to be updated to handle exceptions instead of testing error codes. The key is to preserve current behavior. Call sites fall into a few categories:

Ignore nsresult. Call sites that ignore the nsresult must be wrapped in try { call() } catch {}. If we can prove that the call will never throw an exception, which we can currently do for about 1/3 of this category, we could remove the wrapper. However, this is potentially dangerous for the future: the called method may someday throw or propagate exceptions. Thus, it is better to make the call site exception-safe before removing the wrapper.

Propagate nsresult. This is when the method containing the call site simply propagates the error code to its caller without executing any other code (except destructors). The macro NS_ENSURE_SUCCESS does this. This category is very nice: with exceptions, there is no error handling code at all. To remove the existing error handling code, we can simply remove the assignment of the error code out of the call site and remove the return statement. This may create a dead rv variable, which can be deleted as a (conceptual) separate step.

We place call sites that log the error into this category as well: logging will be needed less with exceptions.

Compensate, then propagate nsresult. This is when the method containing the call site propagates the error code, but only after running compensation code. Compensation code restores some invariant that would otherwise be lost. The prototypical example is releasing resources to avoid leaks.

The preferred (by the designers of C++) pattern for this scenario is to create a stack object whose destructor maintains the invariant. This is particularly easy to do if the nsresult compensation code executes for both the failure and success cases: we wrap the compensation code in an object destructor and define an instance just before the call site.

If the compensation code currently runs only in the error case, then a direct translation cannot be made using compensation objects. The direct translation would be try { call() } catch { compensate(); throw e; }. However, it may be that for some call sites it is possible to express the error handling code using a stack compensation object.

Compensate, then return other nsresult. This is just like the previous case, except that a different error code is returned. In this case, we must use a catch block so that we can throw a new exception--propagation won't preserve the existing behavior. But we could still use stack compensation objects for compensation, and place only the new throw in the catch block.

But we might prefer to propagate the exception--the error handling code is simpler, and the stack trace is preserved. This would require more changes in the calling code.

Compensate, then continue. This is like the previous case, except that no error code is returned--execution continues. Again, because we are not propagating the exception, we need a catch block. Again, we do have the option of using stack compensation objects and an empty catch block.

Special Case Error: OOM

In current FF, new returns a null pointer when OOM, but with exceptions enabled, it will throw std::bad_alloc.

On most platforms, this shouldn't matter too much, as true OOMs are rare anyway. But we would of course like FF not to crash in this case, which requires catching std::bad_alloc. The easy way to do this is to refactor these call sites as described for generic errors. A more ambitious solution is to remove most of the catch blocks for OOMs, make the call sites exception-safe, and catch OOMs at a high level. That way, there would be very little code devoted to OOM handling.

Code that allocates memory by means other than new (e.g., malloc) will need to be changed to throw std::bad_alloc on failure.

Special Case Error: XPCOM Infrastructure

QueryInterface has been redesigned so that it never throws an exception. It returns a null pointer on failure, and callers must check for it if there is a possibility of failure. (Many QueryInterface calls are guaranteed to succeed.) (Note: the redesign hasn't been checked in to trunk yet.) See also.

GetService may be handled the same way. (TBD)

CreateInstance might also generate OOMs only, although other failure modes are possible. (TBD)

XPConnect

XPConnect needs to be rewritten so that when JavaScript calls C++, the FFI catches C++ exceptions and generates JavaScript exceptions if necessary. Also, when C++ calls JavaScript, C++ exceptions should be generated in response to JavaScript exceptions. This is potentially a lot of work, because there are separate XPConnect implementations for every platform and compiler.

Automated Refactoring

There are on the order of 50,000 call sites affected by the refactoring to use exceptions, so we want to automate as much as possible. All of the basic patterns discussed above can be detected by static analysis and rewritten automatically, although there are always corner cases that are difficult to automate. We hope the corner cases are a small fraction that can be fixed by hand in reasonable time.

Code Text Quality

One key issue with automatic rewriting is the quality of the resulting source code text from a human reader's point of view. We don't want to automatically produce ugly exception code--one of our top-level goals is to make the code more readable.

This issue needs work. We'll probably have to discover the requirements through trial and review.

Static Analysis Infrastructure

Refactoring relies on static analysis to (a) detect patterns to be refactored, and (b) verify the safety of the refactoring. Verifying the safety of the refactoring means ensuring that the refactored program will behave identically to the original. Note that the safety verification is typically not sound--sound refactorings are very hard to create. But we hope to create "almost sound" refactorings that will work in practice. Of course, the refactored code must always be reviewed and tested.

Our static analysis infrastructure includes [Pork]-Oink/Elsa, [Dehydra] Classic, and Dehydra GCC. We can already do a decent job on many of the exceptions refactorings using this infrastructure. The only thing we really miss at this point is an abstract interpretation framework based on intraprocedural CFGs. We also lack an alias analysis, but we don't think precise alias analysis is important for this task.

Refactoring patterns

Here we discuss automatic refactoring as applied to each pattern described above.

Ignore

For the ignore pattern, we need an analysis to detect ignored nsresult return values and then a rewriting to wrap the call in a try block with empty catch. Optionally, the rewriting can also produce an annotation (comment or macro) at call sites that always succeed (assuming no JavaScript implementations) to aid in the later process of making the code exception-safe.

The ignore analysis can be formulated like this: A call site return value is ignored if along all forward code paths the value never reaches the condition of a branch statement and never escapes the method by being stored in a field, global variable, outparam, or argument of a called method.

To be accurate, the analysis really should be flow-sensitive (i.e., trace forward paths, and pay attention to statement ordering), because the same rv variable is often used for the results of many calls.

This analysis has been coded as a Dehydra script, and it works fairly well. But there is a limitation because there are too many paths to explore them all in Dehydra, which introduces unsoundness. However, the analysis could be expressed efficiently and without this unsoundness using abstract interpretation.

The rewriting is simple: the call expression is wrapped in a try block with empty catch. We need to think about whether this should be done with a macro, inline on one line, or on more than one line.

Propagate

For the propagate pattern, we need an analysis to detect call sites that on error immediately return the same error code, and a rewriting to remove the error code storage, testing, and return.

The propagate analysis can be formulated like this: A call site return value is propagated directly if along all forward code paths where the value is non-zero the value is returned before any side effects other than logging occur. Side effects include writing to a variable that escapes the method or calling another method.

Dehydra is a good fit for this analysis because in practice, this pattern can be detected or rejected by exploring very short paths forward from the call site: typically either the value is returned or a side effect is produced almost immediately.

One problem with producing this analysis in Dehydra is that Dehydra doesn't represent if conditions precisely. Dehydra is good at detecting things like if (NS_FAILED(rv)) (assuming that we have patched FF to make NS_FAILED an inline function, which we have), but doesn't detect things like if (NS_FAILED(rv) || foo). The simple case is the most common, so we might be able to get away with doing the more complex ones manually, but the complex cases are common enough that they are probably worth automating as well.

Rewriting propagate sites is potentially tricky, because the analysis property is defined over control-flow paths, but rewriting source code generally operates on ASTs. Hence, we're not prepared to say just yet how this will work in detail, only a general description: The basic idea that all code paths where the value is nonzero should be deleted (because they will never be reached when an exception is thrown) and all code paths where the value is zero should be specialized for a zero value.

Compensate and Propagate

The analysis conditions to detect the compensate and propagate pattern: (a) the call site is not an ignore or propagate, and (b) every path where rv is nonzero reaches ends by returning rv.

Rewriting: TBD

Compensate and Rethrow

The analysis conditions to detect the compensate and rethrow pattern: (a) the call site is not an ignore or propagate, and (b) every path where rv is nonzero reaches ends by returning a non-rv error code.

Rewriting: TBD

Compensate and Continue

The analysis conditions to detect the compensate and rethrow pattern: (a) the call site is not an ignore or propagate, and (b) every path where rv is nonzero eventually joins the paths where rv is zero.

Rewriting: TBD