Fixed-width strings: Difference between revisions

From MozillaWiki
(Changed from concept to description of implementation)

Revision as of 14:13, 18 July 2008

Design

The design goals for the new StringObject class are:

  1. Fixed-width strings with fixed widths of 8, 16 and 32 bits. Widths are either automatic (based on constructor arguments), or manual (as requested).
    • 8-bit strings contain the first 256 Unicode characters. UTF-8 is not supported in strings.
    • 16-bit strings are UTF-16, including surrogate pairs.
  2. String contents may be based upon static data.
  3. String APIs should be as close to SpiderMonkey string APIs as possible to make ActionMonkey implementation easy.
  4. Preferred access to characters is via a charAt() method that returns 32-bit characters regardless of the underlying implementation.
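
Goal 4 can be illustrated with a minimal sketch. The names `FixedWidthView`, `Width` and the free function `charAt()` are assumptions made for this illustration only; the page does not show the real StringObject layout or signature.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical view type; not the real StringObject layout.
enum class Width : std::uint8_t { Bits8, Bits16, Bits32 };

struct FixedWidthView {
    const void* data;     // fixed-width character data
    std::size_t length;   // number of characters, not bytes
    Width       width;    // storage width of each character
};

// Returns the character at index i as a 32-bit value, whatever the
// storage width is. The caller must ensure i < s.length.
inline std::uint32_t charAt(const FixedWidthView& s, std::size_t i) {
    switch (s.width) {
        case Width::Bits8:  return static_cast<const std::uint8_t*>(s.data)[i];
        case Width::Bits16: return static_cast<const std::uint16_t*>(s.data)[i];
        case Width::Bits32: return static_cast<const std::uint32_t*>(s.data)[i];
    }
    return 0; // not reached for valid Width values
}
```

Because every character occupies the same number of bytes, index access is a single array lookup, which is the main advantage over UTF-8 storage.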

There are three types of strings:

  1. Direct strings, where the character data immediately follows the instance data. Memory is allocated via low-level calls to GC-Alloc() with an in-place constructor call.
  2. Static strings, where the character data is kept elsewhere. This data must be guaranteed to exist longer than the String instance itself. C character constants are good candidates. ABC data is also a good candidate, as long as unloading the ABC data does not invalidate the data these strings point to.
  3. Dependent strings, where a DRC'ed pointer keeps a reference to the master string, and the string contains a pointer to the start of the character data, and a length count.

Out-of-memory conditions will be handled by the allocator in a future version. The String class uses checks for NULL, and returns NULL for new strings whose allocation failed.

UTF-8, UTF-16 and UTF-32

The String class ignores all Unicode encoding issues. A string is just an array of characters. For Unicode-conformant processing, a StringUtils class exists with a number of static methods:

  • Widening and narrowing with surrogate pair processing for 16-bit and 32-bit strings
  • Creating a string out of UTF-8 data
  • Creating a UTF-8 data buffer out of a string

These layers will catch the following:

  • Invalid UTF-8 character sequences
  • Single 16-bit and 32-bit characters that have a value between 0xD800 and 0xDFFF
  • 32-bit characters with a value > 0x10FFFF
  • Any other conditions that Unicode 5 considers to be ill-formed.
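
The surrogate and range checks in this list could look roughly like the sketch below. The function names are hypothetical and are not taken from StringUtils; UTF-8 sequence validation is left out.

```cpp
#include <cstddef>
#include <cstdint>

// True for values in the surrogate range 0xD800..0xDFFF.
inline bool isSurrogate(std::uint32_t c) {
    return c >= 0xD800 && c <= 0xDFFF;
}

// A 32-bit character is acceptable if it is at most 0x10FFFF
// and is not a surrogate value.
inline bool isValidScalar(std::uint32_t c) {
    return c <= 0x10FFFF && !isSurrogate(c);
}

// A 16-bit string is well-formed if every high surrogate is followed
// directly by a low surrogate and no low surrogate stands alone.
inline bool isWellFormedUtf16(const std::uint16_t* p, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        std::uint16_t c = p[i];
        if (c >= 0xD800 && c <= 0xDBFF) {            // high surrogate
            if (i + 1 >= n) return false;
            std::uint16_t next = p[i + 1];
            if (next < 0xDC00 || next > 0xDFFF) return false;
            ++i;                                     // skip the low surrogate
        } else if (c >= 0xDC00 && c <= 0xDFFF) {     // lone low surrogate
            return false;
        }
    }
    return true;
}
```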

Creation

Strings may be created with 8-, 16-, or 32-bit data. A UTF-8 version of the StringUtils::create() method processes UTF-8 data. By default, it produces the smallest width that can hold the data. If a desired width is given, the method returns NULL when the UTF-8 string contains characters that cannot be represented in that width: character values greater than 0x00FF for 8-bit strings, greater than 0xFFFF for 16-bit strings, and greater than 0x10FFFF for 32-bit strings.
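
The width selection can be sketched as follows, assuming the UTF-8 input has already been decoded into 32-bit character values. `bestFitWidth()` is a hypothetical helper, not part of the StringUtils API.

```cpp
#include <cstddef>
#include <cstdint>

// Returns the smallest width (8, 16 or 32) that can hold every value,
// or 0 if a value exceeds 0x10FFFF and cannot be stored at all.
// Assumption: 'chars' holds already decoded character values.
inline int bestFitWidth(const std::uint32_t* chars, std::size_t n) {
    int width = 8;
    for (std::size_t i = 0; i < n; ++i) {
        std::uint32_t c = chars[i];
        if (c > 0x10FFFF) return 0;        // not representable
        if (c > 0xFFFF)   width = 32;      // needs 32 bits
        else if (c > 0xFF && width < 16) width = 16;
    }
    return width;
}
```

A caller that requests a fixed width would compare the result against that width and return NULL if the best-fit width is larger.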

Strings are never zero-terminated. Zero-characters are legal as part of a string.

Strings are created using static creator functions. This allows the implementation to use raw memory allocation and in-place constructor calls to avoid having to do two memory allocations, one for the instance, and the other for the data. Strings created that way contain the data right behind the instance data.
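
A minimal sketch of such a creator function is shown below. It uses the standard `::operator new` with `std::nothrow` as a stand-in for the GC allocator, and the `DirectString` type is a hypothetical 8-bit-only simplification.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <new>

// Hypothetical 8-bit direct string: the characters follow the instance.
struct DirectString {
    std::uint32_t length;

    // Character data starts right behind the instance data.
    char* chars() { return reinterpret_cast<char*>(this + 1); }

    // One raw allocation holds both the instance and the characters.
    // Returns nullptr if the allocation fails.
    static DirectString* create(const char* src, std::uint32_t n) {
        void* raw = ::operator new(sizeof(DirectString) + n, std::nothrow);
        if (!raw) return nullptr;
        DirectString* s = new (raw) DirectString(n);  // in-place constructor call
        std::memcpy(s->chars(), src, n);              // no zero terminator
        return s;
    }

    // Stand-in for what the garbage collector would do.
    static void destroy(DirectString* s) {
        if (s) { s->~DirectString(); ::operator delete(s); }
    }

private:
    explicit DirectString(std::uint32_t n) : length(n) {}
};
```

Making the constructor private forces all callers through `create()`, so an instance can never exist without its trailing character data.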

The maximum string width determines the way strings are created. It is an optional argument to the string constructors.

  1. 8 bits: If the source data contains 16 or 32 bit data, the return value is NULL.
  2. 16 bits: If the source data contains 32 bit values, surrogate pairs are created. If a character is > 0x10FFFF, NULL is returned.
  3. 32 bits: If a character is > 0x10FFFF, NULL is returned.
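
For the 16-bit case, creating surrogate pairs from 32-bit values follows the standard UTF-16 encoding rule. The sketch below uses a hypothetical function name and `std::vector` buffers for brevity; a `false` result corresponds to the NULL return described above.

```cpp
#include <cstdint>
#include <vector>

// Converts 32-bit values to 16-bit units, creating surrogate pairs for
// values above 0xFFFF. Returns false if a value exceeds 0x10FFFF.
inline bool narrowTo16(const std::vector<std::uint32_t>& src,
                       std::vector<std::uint16_t>& out) {
    out.clear();
    for (std::uint32_t c : src) {
        if (c > 0x10FFFF) return false;
        if (c < 0x10000) {
            out.push_back(static_cast<std::uint16_t>(c));
        } else {
            std::uint32_t v = c - 0x10000;  // 20 bits to split
            out.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return true;
}
```

For example, the value 0x1F600 becomes the pair 0xD83D, 0xDE00.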

This allows implementers to define the maximum width of strings; they can choose to use 8, 16 or 32 bits only, or they can choose to go with whichever width fits best. If they choose best-fit widths, String creation methods do not create UTF-16 surrogate pairs. Use the StringUtils class to process surrogate pairs.

In-Place Concatenation

It would not be a good idea to create a new, flat string every time two strings are concatenated. Consider this loop:

var s = "";
for (var i = 32; i <= 1024; i++)
  s += String.fromCharCode (i);

If a new, flat string were created on every iteration, this would lead to almost 1000 copy operations, each on a growing string buffer.

The concatenation operation uses a different approach called in-place concatenation. A memory allocator like MMgc usually aligns memory on fixed boundaries. MMgc uses 16-byte boundaries. On Windows, if you allocate a string of 2 characters, MMgc allocates 6 extra bytes, which is usually wasted memory.

This example deals with 8-bit characters. For other widths, divide the number of bytes by the character width to get the number of characters.

So, the string "Hi" looks as follows in memory. Note that the character "¤" is used to show unused characters.

len=2 left=6
H i ¤ ¤ ¤ ¤ ¤ ¤

The concatenation of "Hi" and " world" would not create a new string containing "Hi world", but rather fill the buffer with the second string, and then create a dependent string pointing to the result.

For the VM, the original string has not changed because its length did not change:

len=2 left=0
H i   w o r l d

The dependent string, however, keeps a reference to its master, and spans the entire string:

pointer to master
start=0 len=8

If another concatenation joins the same string "Hi" (whose buffer now actually holds "Hi world") with a space character, the buffer can even be reused, since it already contains a space at that position, and a new dependent string is created:

pointer to master
start=0 len=3

If the right-hand string were not " world" but, say, " Jonathan", the buffer would be too small for the append operation. Therefore, the result of such a concatenation is a new string containing "Hi Jonathan". To accommodate loops where strings are appended to a base string, the concatenation allocates extra characters at the end, as many as the length of the new string. The new string length is 11, plus 11 extra bytes (remember that this example uses 8-bit characters), making 22 bytes. MMgc adds two more bytes, resulting in a total character buffer length of 24 bytes:

len=11 left=13
H i   J o n a t h a n ¤ ¤ ¤ ¤ ¤ ¤ ¤ ¤ ¤ ¤ ¤ ¤ ¤

As a result, the new string can take 13 more characters in place. If the string again grows beyond the buffer capacity, a new string is allocated with even more room for extra characters, and so on. To limit growth and memory consumption, the number of extra bytes is capped at 64 KBytes. So even for large strings, a copy operation is only forced every 64 KBytes.
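
The buffer arithmetic can be written down as a small helper. The function name is hypothetical, and the rounding granularity of 8 bytes is an assumption chosen only because it reproduces the 22 to 24 byte step in the example; the real value depends on MMgc.

```cpp
#include <algorithm>
#include <cstddef>

// Cap on the extra room reserved for future appends (64 KBytes).
const std::size_t kMaxExtraBytes = 64 * 1024;

// Returns the character buffer size in bytes for a new string of
// newLenBytes bytes: the string itself, plus the same amount again as
// spare room (capped), rounded up to the allocator granularity.
// Assumption: 'granularity' of 8 matches the example, not MMgc itself.
inline std::size_t bufferCapacity(std::size_t newLenBytes,
                                  std::size_t granularity = 8) {
    std::size_t extra  = std::min(newLenBytes, kMaxExtraBytes);
    std::size_t wanted = newLenBytes + extra;
    return (wanted + granularity - 1) / granularity * granularity;
}
```

For "Hi Jonathan" this gives 11 + 11 = 22, rounded up to 24 bytes, which leaves 13 bytes for in-place appends.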

In-place concatenation is only possible if the right-hand string width is smaller than or equal to the left-hand string width. If it is wider, a new string is created that contains the widened left-hand string concatenated with the right-hand string.
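
The three cases of the walkthrough (append into free space, reuse of bytes that are already in the buffer, and a fresh allocation) can be sketched for 8-bit characters as follows. All type and function names are hypothetical, `std::shared_ptr` stands in for the DRC'ed master pointer, and the byte comparison in the reuse case is one interpretation of the "Hi" plus space example rather than confirmed behavior.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>

// Hypothetical master buffer shared by a string and its dependents.
struct Master {
    std::unique_ptr<char[]> buf;
    std::size_t capacity;   // total bytes in buf
    std::size_t used;       // bytes claimed by any string so far
};

// Hypothetical string: a window (start, len) into a master buffer.
struct Str {
    std::shared_ptr<Master> master;
    std::size_t start;
    std::size_t len;
    std::string text() const {
        return std::string(master->buf.get() + start, len);
    }
};

// Creates a string with room for 'capacity' bytes (capacity >= n).
inline Str makeString(const char* s, std::size_t n, std::size_t capacity) {
    std::shared_ptr<Master> m = std::make_shared<Master>();
    m->buf.reset(new char[capacity]);
    m->capacity = capacity;
    m->used = n;
    std::memcpy(m->buf.get(), s, n);
    return Str{m, 0, n};
}

inline Str concat(const Str& a, const char* rhs, std::size_t n) {
    Master& m = *a.master;
    std::size_t end = a.start + a.len;
    // Case 1: the bytes behind 'a' already equal rhs, so share the buffer.
    if (end + n <= m.used && std::memcmp(m.buf.get() + end, rhs, n) == 0)
        return Str{a.master, a.start, a.len + n};
    // Case 2: 'a' ends at the fill mark and rhs fits, so append in place.
    if (end == m.used && end + n <= m.capacity) {
        std::memcpy(m.buf.get() + end, rhs, n);
        m.used += n;
        return Str{a.master, a.start, a.len + n};
    }
    // Case 3: copy into a new buffer with spare room (simplified sizing).
    std::size_t newLen = a.len + n;
    Str r = makeString(m.buf.get() + a.start, a.len, 2 * newLen);
    std::memcpy(r.master->buf.get() + a.len, rhs, n);
    r.master->used = newLen;
    r.len = newLen;
    return r;
}
```

In all three cases the left-hand string keeps its own length, so its contents appear unchanged to the VM.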

A first implementation used a tree-based approach on top of in-place concatenation. Unfortunately, this resulted in large trees that held references to all right-hand-side strings, and the resulting flatten operation was highly recursive, which is not desirable for small devices. The current approach is a good compromise between low memory usage and a low number of string copy operations. Also, it reduces the number of objects that need to be marked during a GC operation.

Static string data

Strings can be created using static character data. This data must, of course, stay alive as long as the strings stay alive. C string constants are a great candidate. ABC data can be unloaded, so ABC data is currently not usable as static data. It should be, though, since ABC strings can be used directly if they are ASCII, which they most often are. We need to come up with a locking mechanism that locks ABC string data in memory as long as there are String instances that point to ABC string data. A possible approach would be to restrict these strings to interned strings. On unload, the strings in that table could either be freed, or the data could be copied to the heap, and the pointer could be replaced.

Thread safety

Since strings are immutable, they are thread safe by definition. The only unsafe operation is in-place concatenation. Currently, TT is not thread safe, so the sensitive code is clearly marked with a TODO comment until a global threading solution for TT is available.

SpiderMonkey compatibility

  • The SM API offers the registration of string finalizers. Strings can be created with custom buffers that the finalizer takes care of deallocating. The String code is prepared to handle these finalizers, but the code is currently commented out (and incomplete). All finalizers are stored in a global table with a fixed maximum size. The registration of a finalizer returns an index value into this table. Finalizer indexes correspond to string type enumerators (see last section). The size of that array is limited to 16. The table should be able to store a JSContext pointer together with the finalizer.
  • JS_GetStringChars() returns a pointer to UTF-16 characters, and JS_GetStringBytes() returns a pointer to UTF-8 characters. Both buffers are guaranteed to live as long as the string instance lives. SM maintains a separate cache for this purpose, where string buffers are garbage-collected. Other encodings may be requested as well.

StringDataUTF8

This TT helper class was used to wrap a String instance (which contained UTF-8 data) into a class providing direct access to the string buffer. With the new String code, if the String instance passed in to the constructor contains ASCII only, it can be used directly. Otherwise, the class creates a temporary String instance containing UTF-8 data and provides access to that data. The operator new is private to make sure that the instance can be created on the stack only. This permits some optimizations regarding the temporary String instance. Currently, pcre needs this buffer, causing a performance degradation and an increase in memory usage in all cases where regular expressions are used.
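
The idea can be sketched as below for an 8-bit (Latin-1) source. `Utf8View` is a hypothetical stand-in for StringDataUTF8, and `std::string` replaces the String class: ASCII-only input is borrowed without copying, anything else is converted into a temporary buffer.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stack-only helper giving UTF-8 access to 8-bit data.
class Utf8View {
public:
    // 'latin1' must outlive this object when its data is borrowed.
    explicit Utf8View(const std::string& latin1) {
        bool ascii = true;
        for (unsigned char c : latin1)
            if (c > 0x7F) { ascii = false; break; }
        if (ascii) {                       // use the source buffer directly
            ptr_ = latin1.data();
            len_ = latin1.size();
            return;
        }
        for (unsigned char c : latin1) {   // values 0x80..0xFF need 2 bytes
            if (c < 0x80) {
                owned_.push_back(static_cast<char>(c));
            } else {
                owned_.push_back(static_cast<char>(0xC0 | (c >> 6)));
                owned_.push_back(static_cast<char>(0x80 | (c & 0x3F)));
            }
        }
        ptr_ = owned_.data();
        len_ = owned_.size();
    }

    Utf8View(const Utf8View&) = delete;             // ptr_ may point into owned_
    Utf8View& operator=(const Utf8View&) = delete;

    const char* data() const { return ptr_; }
    std::size_t size() const { return len_; }

private:
    // Private and never defined: heap allocation does not compile.
    static void* operator new(std::size_t);
    static void* operator new[](std::size_t);

    std::string owned_;   // temporary UTF-8 data, empty when borrowing
    const char* ptr_;
    std::size_t len_;
};
```

Since the object lives on the stack, the temporary buffer is released as soon as the enclosing scope ends.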

StringNULLTerminatedUTF8

This TT helper class returns a pointer to a null-terminated UTF-8 string. Again, this string is a temporary String instance containing this data.