Fixed-width strings: Difference between revisions

# Integrators control which width is used internally. If they feed only 16-bit data, or UTF-8 data with the desired width set to 16 bits, only 16-bit data will be used. A compile-time flag will remove the code that deals with widths other than 16 bits.
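
A minimal sketch of how such a compile-time flag could gate the non-16-bit code paths. The flag name <tt>STRING_16BIT_ONLY</tt>, the <tt>Width</tt> enum and <tt>readChar</tt> are hypothetical names chosen for illustration, not names taken from the actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical flag name; build with -DSTRING_16BIT_ONLY=1 to drop other widths.
#ifndef STRING_16BIT_ONLY
#define STRING_16BIT_ONLY 0
#endif

// Width of one character in bytes.
enum Width : uint8_t { kWidth8 = 1, kWidth16 = 2, kWidth32 = 4 };

// Reads character i from a buffer holding characters of the given width.
inline uint32_t readChar(const void* data, Width width, size_t i) {
    assert(data != nullptr);
#if !STRING_16BIT_ONLY
    // These branches disappear entirely when the flag is set.
    if (width == kWidth8)  return static_cast<const uint8_t*>(data)[i];
    if (width == kWidth32) return static_cast<const uint32_t*>(data)[i];
#endif
    assert(width == kWidth16);
    return static_cast<const uint16_t*>(data)[i];
}
```

With the flag set, only the 16-bit branch is compiled, so an integrator who never feeds other widths pays nothing for them.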


=== Tagged instance contents ===
Strings are immutable, but their internal representation may differ between instances and may change in ways that are invisible to the owner of the instance. String data can be 8, 16, or 32 bits wide. A string may contain a pointer to a string buffer, which may or may not need to be deleted; the string data may immediately follow the String instance in memory (by doing a raw allocate followed by an in-place constructor call); or the string may hold two String references as the result of a concat operation. Finally, a string can be the result of a substring operation. The contents may change dynamically during a flatten operation. Therefore, a String instance contains a <tt>union</tt> with a tag.

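
The tagged layout could look roughly like the sketch below. All names (<tt>Tag</tt>, <tt>ConcatRep</tt>, <tt>SubstringRep</tt>, <tt>makeStatic</tt>, <tt>makeConcat</tt>, <tt>makeSubstring</tt>, <tt>charAt</tt>) are assumptions made for illustration. The inline-data case, buffer ownership and the flatten operation are left out for brevity.

```cpp
#include <cassert>
#include <cstdint>

struct String;

// Which member of the union is active (hypothetical tag names).
enum class Tag : uint8_t { Static, Owned, Concat, Substring };

struct ConcatRep    { const String* left; const String* right; };
struct SubstringRep { const String* master; uint32_t start; };

struct String {
    Tag      tag;
    uint8_t  width;    // bytes per character: 1, 2 or 4
    uint32_t length;   // in characters
    union {
        const void*  buffer;   // Static (not deleted) or Owned (deleted)
        ConcatRep    concat;   // result of a concat operation
        SubstringRep sub;      // result of a substring operation
    };
};

String makeStatic(const void* data, uint8_t width, uint32_t length) {
    String s; s.tag = Tag::Static; s.width = width; s.length = length;
    s.buffer = data;
    return s;
}

String makeConcat(const String& a, const String& b) {
    String s; s.tag = Tag::Concat; s.length = a.length + b.length;
    s.width = a.width > b.width ? a.width : b.width;
    s.concat = {&a, &b};
    return s;
}

String makeSubstring(const String& master, uint32_t start, uint32_t length) {
    assert(start + length <= master.length);
    String s; s.tag = Tag::Substring; s.width = master.width; s.length = length;
    s.sub = {&master, start};
    return s;
}

// Dispatches on the tag to find character i.
uint32_t charAt(const String& s, uint32_t i) {
    assert(i < s.length);
    switch (s.tag) {
    case Tag::Concat: {
        const String& l = *s.concat.left;
        return i < l.length ? charAt(l, i) : charAt(*s.concat.right, i - l.length);
    }
    case Tag::Substring:
        return charAt(*s.sub.master, s.sub.start + i);
    default:  // Static or Owned: read directly from the buffer
        if (s.width == 1) return static_cast<const uint8_t*>(s.buffer)[i];
        if (s.width == 2) return static_cast<const uint16_t*>(s.buffer)[i];
        return static_cast<const uint32_t*>(s.buffer)[i];
    }
}
```

A flatten operation would replace a <tt>Concat</tt> or <tt>Substring</tt> representation with a plain buffer and switch the tag, which is why callers never inspect the union directly.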
Out-of-memory conditions will be handled by the allocator in a future version. The String class will check for NULL and return NULL for new strings. For internal operations, the result of out-of-memory conditions still needs to be defined.
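
As an illustration of the NULL-checking convention, the sketch below copies 16-bit data and reports allocation failure by returning a null pointer. The allocator hook <tt>AllocFn</tt> and the function <tt>copyChars16</tt> are hypothetical; the real allocator interface is not described here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical allocator hook; returns nullptr when out of memory.
using AllocFn = void* (*)(size_t);

inline void* defaultAlloc(size_t bytes) { return std::malloc(bytes); }

// Copies `length` 16-bit characters into a newly allocated buffer.
// Returns nullptr if the allocation fails; the caller must check for that.
uint16_t* copyChars16(const uint16_t* src, size_t length, AllocFn alloc = defaultAlloc) {
    assert(src != nullptr || length == 0);
    // Request at least one element so that a null result always means failure.
    size_t count = length ? length : 1;
    uint16_t* dst = static_cast<uint16_t*>(alloc(count * sizeof(uint16_t)));
    if (dst == nullptr) return nullptr;
    if (length) std::memcpy(dst, src, length * sizeof(uint16_t));
    return dst;
}
```

Buffers obtained through <tt>defaultAlloc</tt> are released with <tt>std::free</tt>.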
 
=== UTF-8, UTF-16 and UTF-32 ===
 
The core String class will ignore all encoding-related issues. A string is just an array of characters. Widening a string from 16 to 32 bits will, for example, not combine a surrogate pair into a single character, just as narrowing a string will not split a 32-bit character with a value outside the Basic Multilingual Plane (0x0000-0xFFFF) into a surrogate pair. The latter limitation prohibits automatic narrowing of 32-bit strings into 16-bit strings if the 32-bit strings contain characters with a value > 0xFFFF.
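
The encoding-agnostic behavior can be sketched as follows, using <tt>std::vector</tt> as a stand-in for the string buffers. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Widening copies every 16-bit unit unchanged; surrogate pairs are NOT combined.
std::vector<uint32_t> widen16to32(const std::vector<uint16_t>& in) {
    return std::vector<uint32_t>(in.begin(), in.end());
}

// Narrowing is only possible when every character fits into 16 bits.
bool canNarrowTo16(const std::vector<uint32_t>& in) {
    for (uint32_t c : in) {
        if (c > 0xFFFF) return false;
    }
    return true;
}

// Narrows without creating surrogates; refuses input with characters > 0xFFFF.
bool narrow32to16(const std::vector<uint32_t>& in, std::vector<uint16_t>& out) {
    if (!canNarrowTo16(in)) return false;
    out.clear();
    for (uint32_t c : in) {
        assert(c <= 0xFFFF);
        out.push_back(static_cast<uint16_t>(c));
    }
    return true;
}
```

Widening the two-unit sequence 0xD83D 0xDE00 therefore yields two 32-bit characters with the same values, not the single character 0x1F600.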
 
For these conversions, and for UTF-8 encoding and decoding, separate layers will be provided that return error codes. These APIs include:
* Widening and narrowing with surrogate pair processing for 16-bit strings
* Creating a string out of UTF-8 data
* Creating a UTF-8 data buffer out of a string
These layers will catch the following:
* Invalid UTF-8 character sequences
* Single 16-bit and 32-bit characters that have a value between 0xD800 and 0xDFFF
* 32-bit characters with a value > 0x10FFFF
* Any other conditions that Unicode 5 considers to be ill-formed.
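
A minimal sketch of what the surrogate-aware widening and narrowing layer might look like, again with <tt>std::vector</tt> standing in for string buffers. The error enum and function names are assumptions, and UTF-8 handling is not covered. On error the output may hold a partial result.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical error codes for the conversion layer.
enum class UnicodeError { Ok, LoneSurrogate, OutOfRange };

inline bool isHighSurrogate(uint32_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool isLowSurrogate(uint32_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// 16 -> 32 bits, combining each surrogate pair into one character.
UnicodeError widenWithSurrogates(const std::vector<uint16_t>& in,
                                 std::vector<uint32_t>& out) {
    out.clear();
    for (size_t i = 0; i < in.size(); ++i) {
        uint32_t c = in[i];
        if (isLowSurrogate(c)) return UnicodeError::LoneSurrogate;
        if (isHighSurrogate(c)) {
            if (i + 1 >= in.size() || !isLowSurrogate(in[i + 1]))
                return UnicodeError::LoneSurrogate;
            c = 0x10000 + ((c - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            assert(c <= 0x10FFFF);
            ++i;  // the low surrogate has been consumed
        }
        out.push_back(c);
    }
    return UnicodeError::Ok;
}

// 32 -> 16 bits, splitting characters above 0xFFFF into a surrogate pair.
UnicodeError narrowWithSurrogates(const std::vector<uint32_t>& in,
                                  std::vector<uint16_t>& out) {
    out.clear();
    for (uint32_t c : in) {
        if (c > 0x10FFFF) return UnicodeError::OutOfRange;
        if (isHighSurrogate(c) || isLowSurrogate(c))
            return UnicodeError::LoneSurrogate;
        if (c > 0xFFFF) {
            uint32_t v = c - 0x10000;
            out.push_back(static_cast<uint16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<uint16_t>(0xDC00 + (v & 0x3FF)));
        } else {
            out.push_back(static_cast<uint16_t>(c));
        }
    }
    return UnicodeError::Ok;
}
```

Keeping these checks in a separate layer lets the core String class stay a plain array of characters while callers that care about Unicode well-formedness get explicit error codes.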


=== Creation ===