Fixed-width strings

From MozillaWiki
Jump to navigation Jump to search

Rationale

Currently, Tamarin-Tracing strings are UTF-8. As much as these are desirable in terms of memory consumption, index access and other functions are unbearably slow.

Design Goals

  1. Fixed-width strings with fixed widths of 8, 16 and 32 bits. Widths are either automatic (based on UTF-8 input), or manual (as created).
    • 8-bit strings contain the first Unicode 256 characters. UTF-8 is not supported in strings.
    • 16-bit strings are UTF-16, including surrogate pairs.
  2. String contents may be based upon static data.
  3. String concatenation creates a tree structure, which is flattened when needed.


String types =

Strings are immutable, but their contents may be different. String data can be 8, 16, or 32 bits wide; A string may contain a pointer to a string buffer, which may need to be deleted or not; or the string data may immediately follow the String instance in memory (by doing a raw allocate followed by an in-place constructor call); or it can hold two String references if is the result of a concat operation. The contents may change dynamically during a flatten operation. Therefore, a String instance contains a

union

with a tag.

String concatenation =

It would not be a good idea to create a new, flat string every time two strings are concatenated. Consider this loop:

var s = "";
for (var i = 32; i <= 1024; i++)
  s += String.fromCharCode (i);

If a new, flat string would be created on every iteration, this would lead to a almost 1000 copy operations, with a growing string buffer. Instead, the resulting string contains two String pointers that point to the two source strings.

The above example would create a deep tree, which is also undesirable. Therefore, a String instance contains a

treeDepth

field that contains the deepest depth of both subtrees plus one. The concat operation will contain a threshold where a string will be flattened before it is used for concatenation. This value should be determined using various benchmarks for optimal memory/performance ration. Also, the field is limited in size (10 bits?), so at some point automatic flattening is forced.