Fixed-width strings: Difference between revisions

Replacing page with 'This page has moved to Tamarin:String_implementation.'
(Replacing page with 'This page has moved to Tamarin:String_implementation.')
 
(37 intermediate revisions by 3 users not shown)
Line 1: Line 1:
=== Rationale ===
This page has moved to [[Tamarin:String_implementation]].
 
Currently, Tamarin-Tracing strings are UTF-8. As much as these are desirable in terms of memory consumption, index access and other functions are unbearably slow.
 
=== Design Goals ===
 
# Fixed-width strings with fixed widths of 8, 16 and 32 bits. Widths are either automatic (based on UTF-8 input), or manual (as created).
#* 8-bit strings contain the first Unicode 256 characters. UTF-8 is not supported in strings.
#* 16-bit strings are UTF-16, including surrogate pairs.
# String contents may be based upon static data.
# String concatenation creates a tree structure, which is flattened when needed.
 
 
=== String types ===
 
Strings are immutable, but their contents may be different. String data can be 8, 16, or 32 bits wide; A string may contain a pointer to a string buffer, which may need to be deleted or not; or the string data may immediately follow the String instance in memory (by doing a raw allocate followed by an in-place constructor call); or it can hold two String references if is the result of a concat operation. The contents may change dynamically during a flatten operation. Therefore, a String instance contains a <tt>union</tt> with a tag.
 
=== String concatenation and flattening ===
It would not be a good idea to create a new, flat string every time two strings are concatenated. Consider this loop:
var s = "";
for (var i = 32; i <= 1024; i++)
  s += String.fromCharCode (i);
If a new, flat string would be created on every iteration, this would lead to a almost 1000 copy operations, with a growing string buffer. Instead, the resulting string contains two String pointers that point to the two source strings.
 
The above example would create a deep tree, which is also undesirable. Therefore, a String instance contains a <tt>treeDepth</tt> field that contains the deepest depth of both subtrees plus one. The concat operation will contain a threshold where a string will be flattened before it is used for concatenation. This value should be determined using various benchmarks for optimal memory/performance ration. Also, the field is limited in size (10 bits?), so at some point automatic flattening is forced.
 
When a string is flattened, its two String pointers are replaced with a flat data buffer. The resulting width of the string is determined by the widths of the strings in the tree. Usually, the resulting string width is the widest of all substrings found. If desired (with an #ifdef), substrings could also be analyzed if they are wider than the containing data, if e.g. a 16-bit strings only contains 8-bit characters. This is, of course, a performance hit, but may be desired if memory footprint is important. A global setting for the maximum string width may be necessary if strings were created using UTF-8 data, because all widths may exist.
55

edits