# Integrators control which width is used internally. If they only feed 16-bit data, or UTF-8 data with the desired width set to 16 bits, only 16-bit data will be used. There will be a compile-time flag that compiles out code that deals with widths other than 16 bits.
Strings are immutable, but their internal contents may change in ways that are invisible to the owner of the instance. String data can be 8, 16, or 32 bits wide. A string may contain a pointer to a string buffer, which may or may not need to be deleted; the string data may immediately follow the String instance in memory (by doing a raw allocate followed by an in-place constructor call); or the string may hold two String references as the result of a concat operation. Finally, a string can be the result of a substring operation. The contents may change dynamically during a flatten operation. Therefore, a String instance contains a <tt>union</tt> with a tag.
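The representations listed above can be pictured as a tagged union. The following is only a minimal sketch: every name in it (<tt>StringSketch</tt>, <tt>Kind</tt>, the payload structs, <tt>makeConcat</tt>, <tt>inlineData</tt>) is hypothetical and chosen for illustration, and the real field layout is not specified on this page.

```cpp
#include <cassert>
#include <cstdint>

struct StringSketch;  // hypothetical stand-in for the String class

// Tag telling which union member is active (names are assumptions).
enum class Kind : std::uint8_t {
    Buffer,    // pointer to a string buffer that may or may not be owned
    Inline,    // character data immediately follows the instance in memory
    Concat,    // two String references produced by a concat operation
    Substring  // a slice of another string
};

struct BufferPayload    { const void* data; bool owned; };
struct ConcatPayload    { const StringSketch* left; const StringSketch* right; };
struct SubstringPayload { const StringSketch* base; std::uint32_t start; };

struct StringSketch {
    Kind          kind;
    std::uint8_t  width;   // bytes per character: 1, 2 or 4
    std::uint32_t length;  // length in characters
    union {                // Inline needs no payload of its own
        BufferPayload    buffer;
        ConcatPayload    concat;
        SubstringPayload substring;
    };
};

// For an Inline string, the characters start at the first byte past the instance.
inline const void* inlineData(const StringSketch& s) {
    assert(s.kind == Kind::Inline);
    return reinterpret_cast<const unsigned char*>(&s) + sizeof(StringSketch);
}

// A concat node only records its two halves; no character data is copied.
inline StringSketch makeConcat(const StringSketch& l, const StringSketch& r) {
    StringSketch s;
    s.kind = Kind::Concat;
    s.width = l.width > r.width ? l.width : r.width;  // wide enough for both halves
    s.length = l.length + r.length;
    s.concat.left = &l;
    s.concat.right = &r;
    return s;
}
```

A flatten operation would, under these assumptions, rewrite a <tt>Concat</tt> or <tt>Substring</tt> instance in place into a <tt>Buffer</tt> instance, which is why the tag has to be checked on every access.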
Out-of-memory conditions will be handled by the allocator in a future version. The String class will check for NULL, and return NULL for new strings. For internal operations, the result of out-of-memory conditions still needs to be defined.
=== UTF-8, UTF-16 and UTF-32 ===
The core String class will ignore all encoding-related issues. A string is just an array of characters. Widening a string from 16 to 32 bits will, for example, not combine a surrogate pair into a new character, just as narrowing a string will not split a 32-bit character with a value outside the Basic Multilingual Plane (0x0000-0xFFFF) into a surrogate pair. The latter limitation prohibits automatic narrowing of 32-bit strings into 16-bit strings if the 32-bit strings contain characters with a value > 0xFFFF.
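As an illustration of this encoding-agnostic behavior, here is a small sketch that works on plain <tt>std::vector</tt> buffers rather than on the String class itself. The function names (<tt>widen16to32</tt>, <tt>canNarrowTo16</tt>, <tt>narrow32to16</tt>) are assumptions made for this example and are not part of the proposed API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Widening copies every 16-bit unit unchanged; surrogate pairs are NOT combined.
inline std::vector<std::uint32_t> widen16to32(const std::vector<std::uint16_t>& in) {
    return std::vector<std::uint32_t>(in.begin(), in.end());
}

// Narrowing is only possible when every character fits into 16 bits.
inline bool canNarrowTo16(const std::vector<std::uint32_t>& in) {
    for (std::uint32_t c : in) {
        if (c > 0xFFFF) return false;
    }
    return true;
}

// Narrowing truncates nothing and creates no surrogates; callers check first.
inline std::vector<std::uint16_t> narrow32to16(const std::vector<std::uint32_t>& in) {
    assert(canNarrowTo16(in));
    std::vector<std::uint16_t> out;
    out.reserve(in.size());
    for (std::uint32_t c : in) {
        out.push_back(static_cast<std::uint16_t>(c));
    }
    return out;
}
```

With this sketch, widening the 16-bit sequence 0xD83D 0xDE00 yields two 32-bit characters with the same values rather than the single character 0x1F600, and a 32-bit string containing 0x1F600 is reported as not narrowable.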
For these conversions, and for UTF-8 encoding and decoding, separate layers will be provided that return error codes. These APIs include:
* Widening and narrowing with surrogate pair processing for 16-bit strings
* Creating a string out of UTF-8 data
* Creating a UTF-8 data buffer out of a string
These layers will catch the following:
* Invalid UTF-8 character sequences
* Single 16-bit and 32-bit characters that have a value between 0xD800 and 0xDFFF
* 32-bit characters with a value > 0x10FFFF
* Any other conditions that Unicode 5 considers to be ill-formed.
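A sketch of what the surrogate-aware widening and narrowing layer could look like is given below. It operates on plain <tt>std::vector</tt> buffers, and the names <tt>ConvError</tt>, <tt>narrowWithSurrogates</tt> and <tt>widenWithSurrogates</tt> are hypothetical. It covers only the surrogate and range checks from the list above, not UTF-8 validation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical error codes for the conversion layer.
enum class ConvError { None, LoneSurrogate, OutOfRange };

// 32-bit to 16-bit: characters above 0xFFFF become a surrogate pair.
inline ConvError narrowWithSurrogates(const std::vector<std::uint32_t>& in,
                                      std::vector<std::uint16_t>& out) {
    out.clear();
    for (std::uint32_t c : in) {
        if (c >= 0xD800 && c <= 0xDFFF) return ConvError::LoneSurrogate;
        if (c > 0x10FFFF) return ConvError::OutOfRange;
        if (c <= 0xFFFF) {
            out.push_back(static_cast<std::uint16_t>(c));
        } else {
            std::uint32_t v = c - 0x10000;
            assert(v <= 0xFFFFF);  // 20 bits remain after the range checks
            out.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return ConvError::None;
}

// 16-bit to 32-bit: a high/low surrogate pair becomes one character.
inline ConvError widenWithSurrogates(const std::vector<std::uint16_t>& in,
                                     std::vector<std::uint32_t>& out) {
    out.clear();
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t c = in[i];
        if (c >= 0xD800 && c <= 0xDBFF) {           // high surrogate
            if (i + 1 >= in.size()) return ConvError::LoneSurrogate;
            std::uint32_t lo = in[i + 1];
            if (lo < 0xDC00 || lo > 0xDFFF) return ConvError::LoneSurrogate;
            c = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
            ++i;                                     // consume the low surrogate
        } else if (c >= 0xDC00 && c <= 0xDFFF) {     // low surrogate without a high one
            return ConvError::LoneSurrogate;
        }
        out.push_back(c);
    }
    return ConvError::None;
}
```

On error, the output buffer in this sketch may hold a partial result. Whether the real layer would discard partial output or report the failing position is not stated here.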
=== Creation ===