Fixed-width strings: Difference between revisions
(Implementation of fixed-width strings in Tamarin-Tracing) |
|||
| Line 1: | Line 1: | ||
== | === Rationale === | ||
Currently, Tamarin-Tracing strings are UTF-8. As much as these are desirable in terms of memory consumption, index access and other functions are unbearably slow. | |||
=== Design Goals === | |||
# Fixed-width strings with fixed widths of 8, 16 and 32 bits. Widths are either automatic (based on UTF-8 input), or manual (as created). | |||
#* 8-bit strings contain the first Unicode 256 characters. UTF-8 is not supported in strings. | |||
#* 16-bit strings are UTF-16, including surrogate pairs. | |||
# String contents may be based upon static data. | |||
# String concatenation creates a tree structure, which is flattened when needed. | |||
=== Design details === | |||
==== String types ==== | |||
Strings are immutable, but their contents may be different. String data can be 8, 16, or 32 bits wide; A string may contain a pointer to a string buffer, which may need to be deleted or not; or the string data may immediately follow the String instance in memory (by doing a raw allocate followed by an in-place constructor call); or it can hold two String references if is the result of a concat operation. The contents may change dynamically during a flatten operation. Therefore, a String instance contains a <pre>union</pre> with a tag. | |||
==== String concatenation ==== | |||
It would not be a good idea to create a new, flat string every time two strings are concatenated. Consider this loop: | |||
var s = ""; | |||
for (var i = 32; i <= 1024; i++) | |||
s += String.fromCharCode (i); | |||
If a new, flat string would be created on every iteration, this would lead to a almost 1000 copy operations, with a growing string buffer. Instead, the resulting string contains two String pointers that point to the two source strings. | |||
The above example would create a deep tree, which is also undesirable. Therefore, a String instance contains a <pre>treeDepth</pre> field that contains the deepest depth of both subtrees plus one. The concat operation will contain a threshold where a string will be flattened before it is used for concatenation. This value should be determined using various benchmarks for optimal memory/performance ration. Also, the field is limited in size (10 bits?), so at some point automatic flattening is forced. | |||
Revision as of 16:11, 2 May 2008
Rationale
Currently, Tamarin-Tracing strings are UTF-8. As much as these are desirable in terms of memory consumption, index access and other functions are unbearably slow.
Design Goals
- Fixed-width strings with fixed widths of 8, 16 and 32 bits. Widths are either automatic (based on UTF-8 input), or manual (as created).
- 8-bit strings contain the first Unicode 256 characters. UTF-8 is not supported in strings.
- 16-bit strings are UTF-16, including surrogate pairs.
- String contents may be based upon static data.
- String concatenation creates a tree structure, which is flattened when needed.
Design details
String types
Strings are immutable, but their contents may be different. String data can be 8, 16, or 32 bits wide; A string may contain a pointer to a string buffer, which may need to be deleted or not; or the string data may immediately follow the String instance in memory (by doing a raw allocate followed by an in-place constructor call); or it can hold two String references if is the result of a concat operation. The contents may change dynamically during a flatten operation. Therefore, a String instance contains a
union
with a tag.
String concatenation
It would not be a good idea to create a new, flat string every time two strings are concatenated. Consider this loop:
var s = ""; for (var i = 32; i <= 1024; i++) s += String.fromCharCode (i);
If a new, flat string would be created on every iteration, this would lead to a almost 1000 copy operations, with a growing string buffer. Instead, the resulting string contains two String pointers that point to the two source strings.
The above example would create a deep tree, which is also undesirable. Therefore, a String instance contains a
treeDepth
field that contains the deepest depth of both subtrees plus one. The concat operation will contain a threshold where a string will be flattened before it is used for concatenation. This value should be determined using various benchmarks for optimal memory/performance ration. Also, the field is limited in size (10 bits?), so at some point automatic flattening is forced.