Fixed-width strings: Difference between revisions

no edit summary
No edit summary
Line 16: Line 16:
Strings are immutable, but their contents may be different. String data can be 8, 16, or 32 bits wide; A string may contain a pointer to a string buffer, which may need to be deleted or not; or the string data may immediately follow the String instance in memory (by doing a raw allocate followed by an in-place constructor call); or it can hold two String references if is the result of a concat operation. Finally, a string can be the result of a substring operation. The contents may change dynamically during a flatten operation. Therefore, a String instance contains a <tt>union</tt> with a tag.
Strings are immutable, but their contents may be different. String data can be 8, 16, or 32 bits wide; A string may contain a pointer to a string buffer, which may need to be deleted or not; or the string data may immediately follow the String instance in memory (by doing a raw allocate followed by an in-place constructor call); or it can hold two String references if is the result of a concat operation. Finally, a string can be the result of a substring operation. The contents may change dynamically during a flatten operation. Therefore, a String instance contains a <tt>union</tt> with a tag.


=== String creation ===
=== Creation ===


Strings may either be created with 8, 16, or 32 bit data. In addition, strings may be created with UTF-8 data, which results in the smallest width that can hold the data, or a desired width that may cause the creation method to return NULL if the UTF-8 string contains characters that cannot be represented in the desired width. This is the case for 8-bit strings, and for 16-bit strings, if the character exceeds the value 0x10FFFF.  
Strings may either be created with 8, 16, or 32 bit data. In addition, strings may be created with UTF-8 data, which results in the smallest width that can hold the data, or a desired width that may cause the creation method to return NULL if the UTF-8 string contains characters that cannot be represented in the desired width. This is the case for 8-bit strings, and for 16-bit strings, if the character exceeds the value 0x10FFFF.  
Line 34: Line 34:
=== Concatenation, substrings, and flattening ===
=== Concatenation, substrings, and flattening ===
It would not be a good idea to create a new, flat string every time two strings are concatenated. Consider this loop:
It would not be a good idea to create a new, flat string every time two strings are concatenated. Consider this loop:
  var s = "";
  var s = "";
  for (var i = 32; i <= 1024; i++)
  for (var i = 32; i <= 1024; i++)
   s += String.fromCharCode (i);
   s += String.fromCharCode (i);
If a new, flat string would be created on every iteration, this would lead to a almost 1000 copy operations, with a growing string buffer. Instead, the resulting string contains two String pointers that point to the two source strings.  
If a new, flat string would be created on every iteration, this would lead to a almost 1000 copy operations, with a growing string buffer. Instead, the resulting string contains two String pointers that point to the two source strings.  


The above example would create a deep tree, which is also undesirable. Therefore, a String instance contains a <tt>treeDepth</tt> field that contains the deepest depth of both subtrees plus one. The concat operation will contain a threshold where a string will be flattened before it is used for concatenation. This value should be determined using various benchmarks for optimal memory/performance ration. Also, the field is limited in size (10 bits?), so at some point automatic flattening is forced.
The above example would create a deep tree, which is also undesirable. Therefore, a String instance contains a <tt>treeDepth</tt> field that contains the deepest depth of both subtrees plus one. The concat operation will contain a threshold where a string will be flattened before it is used for concatenation. This value should be determined using various benchmarks for optimal memory/performance ration. Also, the field is limited in size (10 bits?), so at some point automatic flattening is forced.


''In SM, buffer allocations are not exact, but on certain boundaries (16 bytes?), leaving spare room at the end, so appends are possible. When appended, a new String instance is created that shares its buffer with the original instance, just longer. This technique would inhibit the use of 0x00 string terminators.''
In-place concatenation is supported. This method allocates string data buffers at multiples of bytes (16 or 32 bytes). Concatenating to such a string would fill the buffer until available space is exhausted, and create a new String instance that would point into the original buffer, but with a larger length. This way, the original string seems unchanged because its length and buffer did not change, and the new string shares the same buffer, with a larger length. This technique reduces memory allocation and provides flattened strings most of the time. This is especially desirable if the loop is an concat /access loop, where a simple concat would create a tree that is flattened in every loop iteration.
 
In the future, the tracing optimizer may detect concatenation in a loop and inform the String implementation about such a loop, which then would e.g. allocate larger data buffers to speed up these loops.


''The .abc image contains utf-8 strings that are not 0-terminated. A future abc format could deliver 8/16/32 bit fixed with strings, so it is desireable to support a string object that can point to external data. That ability is also desireable for the substring() and charAt() operations''
The .abc image contains UTF-8 strings that are not 0-terminated. A future abc format could deliver 8/16/32 bit fixed with strings, so it is desirable to support a string object that can point to external data. If the UTF-8 string contains Latin-1 characters only, the original buffer is used to create the String instance avoiding the copying of data.


Getting a substring also flattens the source string. The substring is an instance that contains a pointer to the source string, and pointer to the start of the source string buffer. The length field contains the string length. This string is already flat, although it contains a reference to another string. It may be desirable to have a separate flattening function for this case.
Getting a substring also flattens the source string. The substring is an instance that contains a pointer to the source string, and pointer to the start of the source string buffer. The length field contains the string length. This string is already flat, although it contains a reference to another string. It may be desirable to have a separate flattening function for this case, and for the the case of a "super-string" that an in-place concatenating operation created.


When a string is flattened, its two String pointers are replaced with a flat data buffer. The resulting width of the string is determined by the widths of the strings in the tree. Usually, the resulting string width is the widest of all substrings found. If desired (with an #ifdef), substrings could also be analyzed if they are wider than the containing data, if e.g. a 16-bit strings only contains 8-bit characters. This is, of course, a performance hit, but may be desired if memory footprint is important. A global setting for the maximum string width may be necessary if strings were created using UTF-8 data, because all widths may exist.
When a string is flattened, its two String pointers are replaced with a flat data buffer. The resulting width of the string is determined by the widths of the strings in the tree. Usually, the resulting string width is the widest of all substrings found. If desired (with an #ifdef), substrings could also be analyzed if they are wider than the containing data, if e.g. a 16-bit strings only contains 8-bit characters. This is, of course, a performance hit, but may be desired if memory footprint is important.


=== Thread safety ===
=== Thread safety ===


Since strings are immutable, they are by definition thread safe. The only unsafe operation if the flattening operation. Therefore, the <tt>flatten()</tt> implementation should look like this:
Since strings are immutable, they are by definition thread safe. The only unsafe operation if the flattening operation, and in-place concatenation. Currently, TT is not thread safe, so the sensible code should be clearly marked with a TODO comment until a global threading solution for TT is available.
void String::flatten() {
  if (0 != this->treeLevel) {
    ENTER_CRITICAL_SECTION;
    if (0 != this->treeLevel) {
      // do the magic
    }
    LEAVE_CRITICAL_SECTION
  }
}


=== SpiderMonkey compatibility ===
=== SpiderMonkey compatibility ===
Line 66: Line 61:
''This is just the start of this section...''
''This is just the start of this section...''


* Do we need to support external strings and their finalizers? External strings are strings created with a finalizer index, which is a 32-bit integer. Finalizers for this type are registered and deregistered. The additional index could fit into the maximum of 16 bytes for string instance data (assuming 4-byte pointers), but since the index is returned by JS_AddExternalStringFinalizer(), the implementation could be limited to, say, 8 bits, giving a maximum of 255 indexes (one index needs to be reserved for the default, built-in finalizer).
* The SM API offers the registration of string finalizers. Strings can be created with custom buffers that the finalizer takes care of deallocating. There are some predefined internal finalizers. All finalizers are stored in a global table with a fixed maximum size. The registration of a finalizer returns an index value into this table. Finalizer indexes correspond to string type enumerators (see last section). The size of that array is determined with a #define to allow ActionMonkey to compile the desired maximum size.


* How would JS_GetStringChars() and JS_GetStringBytes() work? Would they force a narrowing of the string? In that case, the corresponding method is needed (which would not be difficult to add). ''SRJ: What assumptions do these calls make about encoding and character width? Is it possible to deprecate them? IMHO code that can't deal with wide characters probably needs overhauling at this point anyway.''
* JS_GetStringChars() returns a pointer to UTF-16 characters, and JS_GetStringBytes() returns a pointer to UTF-8 characters. Both buffers are guaranteed to live as long as the string instance lives. SM maintains a separate cache for this purpose, where string buffers are garbage-collected. Other encodings may be requested as well.


* What about growable strings? They are contradictory to the immutability of Tamarin strings. They could be implemented, but there should be some safety measure so they cannot be passed in to the engine. ''SRJ: I don't know if growable strings are a requirement for SM compatibility or not, but if they aren't, I think we're better off keeping strings immutable. It fits the ES model well and allows for useful simplification of the code.''
* What about growable strings? They are contradictory to the immutability of Tamarin strings. They could be implemented, but there should be some safety measure so they cannot be passed in to the engine. ''SRJ: I don't know if growable strings are a requirement for SM compatibility or not, but if they aren't, I think we're better off keeping strings immutable. It fits the ES model well and allows for useful simplification of the code.''
* SM (and probably other integrations as well) would like to use only UTF-16. The String implementation should, if possible, use macros that remove unnecessary code if only UTF-16 is to be supported.


=== Sample instance data ===
=== Sample instance data ===
Line 76: Line 73:
This is a sample representation of instance data, just to illustrate the concept. In this sample, a String instance would occupy between 8+n (direct data) and 16 bytes, assuming 4-byte pointers. The first 64 bits would be the lengths and the bits-and-flags field, followed by aligned pointers.
This is a sample representation of instance data, just to illustrate the concept. In this sample, a String instance would occupy between 8+n (direct data) and 16 bytes, assuming 4-byte pointers. The first 64 bits would be the lengths and the bits-and-flags field, followed by aligned pointers.


We need to discuss whether strings should be 0-terminated. Processing a string buffer with C/C++ routines would be much easier if it did. Also, in debug mode, the display of a string is easier.
The listing uses bit fields, but I think that we should use masks and shifts to ensure the correct width and alignment of data, along with inline methods to access the fields.


''SRJ: IMHO we'd be better off not requiring null termination; this will give us more flexibility if in the future we want to have strings point at existing arbitrary data structures that contain string data. As far as C interop, this will only be easier with 8-bit strings, and when combined with the inherent lack of safety in the typical C string routines, I think we're better off avoiding ease of interop, lest someone decide that strcat() is a good way to add to a Tamarin string... debug modes can of course always provide wrappers that allow access to zero-terminated strings (TT has such a thing now).''
The string type is a tag that describes which union to use. In addition, it it an index into the table of finalizers. Negative values are internal finalizers, while 0 and positive values are external finalizers.
 
The listing uses bit fields, but I think that we should use masks and shifts to ensure the correct width and alignment of data, along with inline methods to access the fields.


  class String ... {
  class String ... {
Line 89: Line 84:
   struct {
   struct {
     unsigned int width:2;      // 0:1, 1:2, 2:not used, 3:4
     unsigned int width:2;      // 0:1, 1:2, 2:not used, 3:4
    unsigned int tag:2;         // data tag: 0:direct, 1:regular, 2:concat, 3:substr
      signed int type:9;       // string type, same as finalizer index
     unsigned int dynamic:1;    // if nonzero, buffer must be deleted
     unsigned int dynamic:1;    // if nonzero, buffer must be deleted
     unsigned int treeDepth:x;  // to be defined (more fields may follow)
     unsigned int treeDepth:x;  // to be defined (more fields may follow)
Line 96: Line 91:
   // variable data according to tag
   // variable data according to tag
   union {
   union {
     struct {                    // data follows directly
 
     struct {                    // data follows directly (normal case)
       union {
       union {
         unsigned char c8 [100]; // big number for debug display
         unsigned char c8 [100]; // big number for debug display
Line 103: Line 99:
       }
       }
     } direct;
     } direct;
     struct {                    // regular string (also flattened)
 
     struct {                    // string with buffer (also flattened)
       union {
       union {
         unsigned char* c8;
         unsigned char* c8;
Line 109: Line 106:
         utf32_t* c32;
         utf32_t* c32;
       }
       }
     } regular;
     } buffer;
 
     struct {                    // concatenated
     struct {                    // concatenated
       String* left, *right;
       String* left, *right;
     } concat;
     } concat;
     struct {                    // substring
 
       String* source;          // flattened substring source
     struct {                    // substring and "superstring"
       String* source;          // flattened string source
       union {
       union {
         unsigned char* c8;      // buffer into source's data
         unsigned char* c8;      // buffer into source's data
Line 120: Line 119:
         utf32_t* c32;
         utf32_t* c32;
       }
       }
     } substr;
     } dependent;
   }
   }
  }
  }
55

edits