Mozilla 2/Strings/Static Analysis

From MozillaWiki
Jump to: navigation, search

How many strings exist *only* for conversion?

Procedure:

  • find calls to NS_ConvertUTF8toUTF16 and CopyUTF8toUTF16
  • check to see whether that string is modified after the conversion takes place

TODO: define "modification" [dmandelin] Can modification be defined as any of (1) calling a non-const method, (2) passing as a non-const argument, (3) being a parameter, or (4) escaping (by having a pointer stored or a pointer or reference returned)?

repeat for utf16->ut8

If "AString" were immutable, where would we fail?

Imagine that all nsAStrings currently allocated on the stack became a different type (nsAStringBuilder or std::wstring or something). But when we pass strings around, they are immutable. Classify any cases where this wouldn't work:

Take the following methods:

nsresult GetAString(nsAString &result)
{
  result.Assign("foo"); // this is ok, it can be converted to return a new
                        // immutable string
}

nsresult AppendToAString(nsAString &result)
{
  result.Append("foo"); // this won't work... it modifies the inout param so
                        // we would have to rewrite "result" to be
                        // nsAStringBuilder&, or split it into two separate
                        // params, one in, one out.
}

Finalizer-less strings

[dmandelin] I'm writing up my thoughts on how to make strings work without requiring any MMGC finalizers, which is the first application of this analysis.

String Finalizers. Currently, nsTSubstring has a finalizer, which can (1) decrement the reference count on the nsStringBuffer (if the string is F_SHARED), (2) free the character data buffer (if the string is F_OWNED), or (3) do nothing (if the string is F_FIXED).

To get rid of (1), we need to (a) enable GC for the nsStringBuffer member (a local change) and (b) live without the reference count. The RC is used in mutation methods to know when copy-on-write must occur, so without the RC we can't do copy-on-write. Thus, F_SHARED strings need to be immutable.

To get rid of (2), we can (a) require the data buffer of an F_OWNED string to be allocated with MMGC, so that it can be GC'd, or (b) require the programmer to ensure the destructor is invoked on F_OWNED strings (e.g., by allocating the string as an automatic stack variable). Whether to support (a), (b), or both will depend on the use cases in the existing code base. (q.v. calls to Adopt.)

String Classes. Here are some tentative thoughts on what string classes can be used to support finalizer removal. Please note this is really on a conceptual level, so it wouldn't necessarily be implemented as this set of C++ classes, and I'm also leaving out discussion of dependent strings, etc., for now. This is just about nsSubstring/nsString/nsAutoString.

As explained above, we need an immutable string type. Call it ImmutableString. Its API could be just like nsSubstring except without any mutator methods. All ImmutableStrings have an immutable, sharable MMGC-controlled nsStringBuffer.

The mutable string type, MutableString, adds back the mutator methods. Because it offers a superset of behavior of ImmutableString, MutableString is a subclass of ImmutableString.

There are several kinds of MutableString, with different storage for the string data. An AutoStackString stores character data on the stack and is meant to be used for local variables (just like nsAutoString). An AutoHeapString stores character data on the heap and cleans up the heap in its destructor; this would replace stack-local F_OWNED strings. A HeapString stores character data in an MMGC-controlled heap buffer; this can replace heap-stored F_OWNED strings iff the buffer was originally allocated by MMGC.

Another interesting application of HeapString is as a builder for string values that are to be stored in ImmutableStrings: the buffer can be passed (instead of copied) from a HeapString to a new ImmutableString as long as the HeapString's buffer pointer is set to null at the same time.

Dumbest Rewriting. Here I describe the dumb way to rewrite FF to use immutable strings. It probably wouldn't perform well, but it should work, providing a baseline for more elaborate rewritings.

First, all heap strings and all string parameters become ImmutableString. Then, any points in the code where an ImmutableString is mutated are changed to use update operators. For example:

immString.Append("foo")

becomes something like

immString = immString.Concat("foo")

The Concat method of the new version should return a new ImmutableString with the result. The assignment operator then sets immString to point to the buffer of the new string. The old buffer is dropped on the floor, to be recovered later by GC. (In this version, the ImmutableStrings aren't really immutable, they just have immutable string buffers. But if you think of an ImmutableString instance as being a handle to an immutable string, sort of like a String pointer in Java, then it is immutable after all.)

nsAutoStrings will stay the way they are (called AutoStackStrings in the terminology above).

The main reason this rewriting is dumb is that when there is a sequence of mutation operations in the existing code (esp. in a loop), the dumb version does a corresponding sequence of string allocations, which is much more expensive than the current code's mutable strings.

Second Dumbest Rewriting. The logical improvement to the dumbest rewriting is to try to use mutable strings when there is a sequence of mutation operations. Thus, where the current code has nsAString, we allow ourselves the choice of using ImmutableString or AutoStackString. We would like to use AutoStackString as much as possible, because it's cheap, but we have to be sure to use ImmutableString when necessary. Here is my type-inference-based idea:

  • Current nsAutoString stays (becomes AutoStackString).
  • Current strings on the heap (i.e., members of GCable classes or contained in hashtables/arrays) must become ImmutableString.
  • Strings not on the heap that have mutators called on them must become AutoStackString.
  • Other strings (mainly parameters) can be either Immutable or AutoStack.

Now, we are left with 2 problems:

Problem 1: heap strings are Immutable, but might have mutator methods called on them. We can always revert to the dumbest solution, but we'd like to do better. So, if any method makes more than one mutator call to a heap string, we'll rewrite the method to copy the string to a mutable HeapString at the beginning, and back to the heap string at the end. So:

while (b) {
 mString.Append(stuff);
}

becomes

HeapString temp(mString); // must copy--could be shared
while (b) {
 temp.Append(stuff);
}
mString.Take(copy); // no copy--mString steals buffer

Problem 2: the more interesting problem of selecting Immutable or AutoStack when we have the choice. To do this, we create a type variable for each string where we have the choice, and then set up a constraint system. Whenever there is an "assignment" between strings "a := b" (e.g., parameter passing), we place a constraint "a >= b". We then define "Immutable >= AutoStack" and find the least solution to the system (i.e., the one with the most AutoStacks).

I need to think about that a little more. As described, the solution will always be "everything AutoStack", except where things have to be immutable. In that case, there needs to be inserted an AutoStack->Immutable conversion at the last moment. That seems OK in principle, but I'll have to look at some examples and see how it plays out in real code.

There are also more optimizations that can be applied, such as using HeapString when the result will go to the heap, to save the copy from the stack.