Fixed-width strings: Difference between revisions
m (→Design details) |
m (→String types =) |
||
| Line 12: | Line 12: | ||
=== String types | === String types === | ||
Strings are immutable, but their contents may be different. String data can be 8, 16, or 32 bits wide; A string may contain a pointer to a string buffer, which may need to be deleted or not; or the string data may immediately follow the String instance in memory (by doing a raw allocate followed by an in-place constructor call); or it can hold two String references if is the result of a concat operation. The contents may change dynamically during a flatten operation. Therefore, a String instance contains a < | Strings are immutable, but their contents may be different. String data can be 8, 16, or 32 bits wide; A string may contain a pointer to a string buffer, which may need to be deleted or not; or the string data may immediately follow the String instance in memory (by doing a raw allocate followed by an in-place constructor call); or it can hold two String references if is the result of a concat operation. The contents may change dynamically during a flatten operation. Therefore, a String instance contains a <tt>union</tt> with a tag. | ||
=== String concatenation ==== | === String concatenation ==== | ||
Revision as of 16:13, 2 May 2008
Rationale
Currently, Tamarin-Tracing strings are UTF-8. As much as these are desirable in terms of memory consumption, index access and other functions are unbearably slow.
Design Goals
- Fixed-width strings with fixed widths of 8, 16 and 32 bits. Widths are either automatic (based on UTF-8 input), or manual (as created).
- 8-bit strings contain the first Unicode 256 characters. UTF-8 is not supported in strings.
- 16-bit strings are UTF-16, including surrogate pairs.
- String contents may be based upon static data.
- String concatenation creates a tree structure, which is flattened when needed.
String types
Strings are immutable, but their contents may be different. String data can be 8, 16, or 32 bits wide; A string may contain a pointer to a string buffer, which may need to be deleted or not; or the string data may immediately follow the String instance in memory (by doing a raw allocate followed by an in-place constructor call); or it can hold two String references if is the result of a concat operation. The contents may change dynamically during a flatten operation. Therefore, a String instance contains a union with a tag.
String concatenation =
It would not be a good idea to create a new, flat string every time two strings are concatenated. Consider this loop:
var s = ""; for (var i = 32; i <= 1024; i++) s += String.fromCharCode (i);
If a new, flat string would be created on every iteration, this would lead to a almost 1000 copy operations, with a growing string buffer. Instead, the resulting string contains two String pointers that point to the two source strings.
The above example would create a deep tree, which is also undesirable. Therefore, a String instance contains a
treeDepth
field that contains the deepest depth of both subtrees plus one. The concat operation will contain a threshold where a string will be flattened before it is used for concatenation. This value should be determined using various benchmarks for optimal memory/performance ration. Also, the field is limited in size (10 bits?), so at some point automatic flattening is forced.