Fixed-width strings: Difference between revisions
| Line 24: | Line 24: | ||
The maximum string width determines the way strings are created. It is an optional argument to the string constructors. | The maximum string width determines the way strings are created. It is an optional argument to the string constructors. | ||
# 8 bits: If the source data contains 16 or 32 bit data, the return value is null. | # 8 bits: If the source data contains 16 or 32 bit data, the return value is null. | ||
#16 bits: If the source data contains 32 bit values, surrogate pairs are created. If a | #16 bits: If the source data contains 32 bit values, surrogate pairs are created. If a character is > 0x10FFFF, null is returned. | ||
''Question: How are out-of-memory conditions handled? The current implementation often just assumes success. There should be some sort of exception, and the same mechanism should be used to report strings that cannot be created.'' | |||
=== Concatenation, substrings, and flattening === | === Concatenation, substrings, and flattening === | ||
Revision as of 16:57, 2 May 2008
Rationale
Currently, Tamarin-Tracing strings are UTF-8. As much as these are desirable in terms of memory consumption, index access and other functions are unbearably slow.
Design Goals
- Fixed-width strings with fixed widths of 8, 16 and 32 bits. Widths are either automatic (based on UTF-8 input), or manual (as created).
- 8-bit strings contain the first Unicode 256 characters. UTF-8 is not supported in strings.
- 16-bit strings are UTF-16, including surrogate pairs.
- String contents may be based upon static data.
- String concatenation creates a tree structure, which is flattened when needed.
- String APIs should be as close to SpiderMonkey string APIs as possible to make ActionMonkey implementation easy.
Tagged instance contents
Strings are immutable, but their contents may be different. String data can be 8, 16, or 32 bits wide; A string may contain a pointer to a string buffer, which may need to be deleted or not; or the string data may immediately follow the String instance in memory (by doing a raw allocate followed by an in-place constructor call); or it can hold two String references if is the result of a concat operation. Finally, a string can be the result of a substring operation. The contents may change dynamically during a flatten operation. Therefore, a String instance contains a union with a tag.
String creation
Strings may either be created with 8, 16, or 32 bit data. In addition, string may be created with UTF-8 data, which results in the smallest width that can hold the data.
String are created using static creator function. This allows the implementation to use raw memory allocation and in-place constructor calls to avoid having to do two memory allocations, one for the instance, and the other for the data. Strings created that way contain the data right behind the instance data.
The maximum string width determines the way strings are created. It is an optional argument to the string constructors.
- 8 bits: If the source data contains 16 or 32 bit data, the return value is null.
- 16 bits: If the source data contains 32 bit values, surrogate pairs are created. If a character is > 0x10FFFF, null is returned.
Question: How are out-of-memory conditions handled? The current implementation often just assumes success. There should be some sort of exception, and the same mechanism should be used to report strings that cannot be created.
Concatenation, substrings, and flattening
It would not be a good idea to create a new, flat string every time two strings are concatenated. Consider this loop:
var s = ""; for (var i = 32; i <= 1024; i++) s += String.fromCharCode (i);
If a new, flat string would be created on every iteration, this would lead to a almost 1000 copy operations, with a growing string buffer. Instead, the resulting string contains two String pointers that point to the two source strings.
The above example would create a deep tree, which is also undesirable. Therefore, a String instance contains a treeDepth field that contains the deepest depth of both subtrees plus one. The concat operation will contain a threshold where a string will be flattened before it is used for concatenation. This value should be determined using various benchmarks for optimal memory/performance ration. Also, the field is limited in size (10 bits?), so at some point automatic flattening is forced.
Getting a substring also flattens the source string. The substring is an instance that contains a pointer to the source string, and pointer to the start of the source string buffer. The length field contains the string length. This string is already flat, although it contains a reference to another string. It may be desirable to have a separate flattening function for this case.
When a string is flattened, its two String pointers are replaced with a flat data buffer. The resulting width of the string is determined by the widths of the strings in the tree. Usually, the resulting string width is the widest of all substrings found. If desired (with an #ifdef), substrings could also be analyzed if they are wider than the containing data, if e.g. a 16-bit strings only contains 8-bit characters. This is, of course, a performance hit, but may be desired if memory footprint is important. A global setting for the maximum string width may be necessary if strings were created using UTF-8 data, because all widths may exist.
Thread safety
Since strings are immutable, they are by definition thread safe. The only unsafe operation if the flattening operation. Therefore, the flatten() should look like this:
void String::flatten() {
if (0 != this.treeLevel) {
ENTER_CRITICAL_SECTION;
if (0 != treeLevel) {
// do the magic
}
LEAVE_CRITICAL_SECTION
}
}
Sample instance data
This is a sample representation of instance data, just to illustrate the concept. In this sample, a String instance would occupy between 8+n (direct data) and 16 bytes.
We need to discuss whether strings should be 0-terminated. Processing a string buffer with C/C++ routines would be much easier if it did. Also, in debug mode, the display of a string is easier.
Question: can we use bit fields, or better masks plus shift for the compacted values?
class String ... {
...
private:
// common data
uint32_t length;
struct {
unsigned int width:2; // 0:1, 1:2, 2:not used, 3:4
unsigned int tag:2; // data tag: 0:direct, 1:regular, 2:concat, 3:substr
unsigned int dynamic:1; // if nonzero, buffer must be deleted
unsigned int treeDepth:x; // to be defined (more fields may follow)
unsigned int padding:y; // padding to 32 bits
} data;
// variable data according to tag
union {
struct { // data follows directly
union {
unsigned char c8 [100]; // big number for debug display
utf16_t c16 [100]; // actual size varies
utf32_t c32 [100];
}
} direct;
struct { // regular string (also flattened)
union {
unsigned char* c8;
utf16_t* c16;
utf32_t* c32;
}
} regular;
struct { // concatenated
String* left, *right;
} concat;
struct { // substring
String* source; // flattened substring source
union {
unsigned char* c8; // buffer into source's data
utf16_t* c16;
utf32_t* c32;
}
} substr;
}
}