Fixed-width strings: Difference between revisions

Fixed-width strings (view source)

Revision as of 18:15, 13 May 2008

1,355 bytes added, 13 May 2008

no edit summary

Daumling

55

edits

@@ Line 13: / Line 13: @@
 # Integrators control which width is used internally. If they feed only 16-bit data, or UTF-8 data with the desired width set to 16 bits, only 16-bit data will be used. A compile-time flag will exclude the code that deals with widths other than 16 bits.
-=== Tagged instance contents ===
+Strings are immutable, but their internal representation may change, invisibly to the owner of the instance. String data can be 8, 16, or 32 bits wide. A string may contain a pointer to a string buffer, which may or may not need to be deleted; or the string data may immediately follow the String instance in memory (by doing a raw allocate followed by an in-place constructor call); or it can hold two String references as the result of a concat operation. Finally, a string can be the result of a substring operation. The representation may change dynamically during a flatten operation. Therefore, a String instance contains a <tt>union</tt> with a tag.
-Strings are immutable, but their contents may be different. String data can be 8, 16, or 32 bits wide; A string may contain a pointer to a string buffer, which may need to be deleted or not; or the string data may immediately follow the String instance in memory (by doing a raw allocate followed by an in-place constructor call); or it can hold two String references if is the result of a concat operation. Finally, a string can be the result of a substring operation. The contents may change dynamically during a flatten operation. Therefore, a String instance contains a <tt>union</tt> with a tag.
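The tagged layout described above could be sketched roughly as follows. This is only an illustration: every identifier (<tt>StringSketch</tt>, <tt>StringKind</tt>, <tt>CharWidth</tt>, <tt>ownsBuffer</tt>, <tt>byteSize</tt>) is a hypothetical name chosen for this sketch, and the actual field layout of the String class is not specified on this page.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical tag values, one per representation named in the text.
enum class StringKind : std::uint8_t {
    kBufferOwned,   // pointer to a buffer that must be deleted
    kBufferStatic,  // pointer to a buffer that must not be deleted
    kInline,        // character data immediately follows the instance
    kConcat,        // two String references (result of a concat)
    kSubstring      // a range within another String
};

// Character width in bytes: 8, 16, or 32 bits.
enum class CharWidth : std::uint8_t { k8 = 1, k16 = 2, k32 = 4 };

// Assumed layout: a tag, a width, a length, and a union whose active
// member is selected by the tag.
struct StringSketch {
    StringKind  kind;
    CharWidth   width;
    std::size_t length;  // length in characters, not bytes
    union {
        void* buffer;                                                    // kBufferOwned, kBufferStatic
        struct { const StringSketch* left; const StringSketch* right; } concat;  // kConcat
        struct { const StringSketch* base; std::size_t offset; } sub;    // kSubstring
    } u;
};

// True only when the instance is responsible for deleting its buffer.
inline bool ownsBuffer(const StringSketch& s) {
    return s.kind == StringKind::kBufferOwned;
}

// Number of bytes the character data occupies once flattened.
inline std::size_t byteSize(const StringSketch& s) {
    return s.length * static_cast<std::size_t>(s.width);
}
```

Under these assumptions, a flatten operation would replace a <tt>kConcat</tt> or <tt>kSubstring</tt> representation with one of the buffer kinds and update the tag, without the owner of the instance noticing.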
+Out-of-memory conditions will be handled by the allocator in a future version. The String class will check for NULL, and return NULL for new strings. For internal operations, the result of out-of-memory conditions still needs to be defined.
+=== UTF-8, UTF-16 and UTF-32 ===
+The core String class will ignore all encoding-related issues. A string is just an array of characters. Widening a string from 16 to 32 bits will, for example, not combine a surrogate pair into a single character, just as narrowing a string will not split a 32-bit character with a value outside the Basic Multilingual Plane (0x0000-0xFFFF) into a surrogate pair. The latter limitation prohibits automatic narrowing of 32-bit strings into 16-bit strings if the 32-bit strings contain characters with a value > 0xFFFF.
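The encoding-agnostic behaviour of the core class can be illustrated with a small sketch. The functions <tt>widenRaw</tt> and <tt>narrowRaw</tt> are hypothetical names, and <tt>std::vector</tt> stands in for the real string storage, which this page does not describe at that level.

```cpp
#include <cstdint>
#include <vector>

// Widen 16-bit characters to 32-bit one for one.
// Surrogate pairs are deliberately NOT combined.
std::vector<std::uint32_t> widenRaw(const std::vector<std::uint16_t>& in) {
    std::vector<std::uint32_t> out;
    out.reserve(in.size());
    for (std::uint16_t c : in) {
        out.push_back(c);
    }
    return out;
}

// Narrow 32-bit characters to 16-bit one for one.
// Returns false, leaving 'out' untouched, if any value exceeds 0xFFFF,
// because such a character cannot be stored without surrogate processing.
bool narrowRaw(const std::vector<std::uint32_t>& in,
               std::vector<std::uint16_t>& out) {
    for (std::uint32_t c : in) {
        if (c > 0xFFFF) {
            return false;
        }
    }
    std::vector<std::uint16_t> result;
    result.reserve(in.size());
    for (std::uint32_t c : in) {
        result.push_back(static_cast<std::uint16_t>(c));
    }
    out.swap(result);
    return true;
}
```

With this sketch, widening the two 16-bit values 0xD83D 0xDE00 yields two 32-bit values, not the single character 0x1F600, and narrowing a string containing 0x1F600 is refused.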
+For these conversions, and for UTF-8 encoding and decoding, separate layers will be provided that return error codes. These APIs include:
+* Widening and narrowing with surrogate pair processing for 16-bit strings
+* Creating a string out of UTF-8 data
+* Creating a UTF-8 data buffer out of a string
+These layers will catch the following:
+* Invalid UTF-8 character sequences
+* Lone (unpaired) 16-bit characters, and 32-bit characters, that have a value between 0xD800 and 0xDFFF
+* 32-bit characters with a value > 0x10FFFF
+* Any other conditions that Unicode 5 considers to be ill-formed.
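A separate conversion layer with surrogate pair processing and error codes might look like the following sketch. The names <tt>ConvError</tt>, <tt>widenUtf16</tt> and <tt>narrowUtf32</tt> are assumptions made for illustration, not the planned API, and <tt>std::vector</tt> again stands in for the real string storage. UTF-8 handling is omitted.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical error codes for the conversion layer.
enum class ConvError { kOk, kLoneSurrogate, kOutOfRange };

// UTF-16 to UTF-32: combines each surrogate pair into one character and
// rejects unpaired surrogates.
ConvError widenUtf16(const std::vector<std::uint16_t>& in,
                     std::vector<std::uint32_t>& out) {
    out.clear();
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t c = in[i];
        if (c >= 0xD800 && c <= 0xDBFF) {            // high surrogate
            if (i + 1 >= in.size()) return ConvError::kLoneSurrogate;
            std::uint32_t lo = in[i + 1];
            if (lo < 0xDC00 || lo > 0xDFFF) return ConvError::kLoneSurrogate;
            c = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
            ++i;                                      // consume the low surrogate
        } else if (c >= 0xDC00 && c <= 0xDFFF) {      // low surrogate without a high one
            return ConvError::kLoneSurrogate;
        }
        out.push_back(c);
    }
    return ConvError::kOk;
}

// UTF-32 to UTF-16: splits characters above 0xFFFF into a surrogate pair
// and rejects surrogate values and values above 0x10FFFF.
ConvError narrowUtf32(const std::vector<std::uint32_t>& in,
                      std::vector<std::uint16_t>& out) {
    out.clear();
    for (std::uint32_t c : in) {
        if (c > 0x10FFFF) return ConvError::kOutOfRange;
        if (c >= 0xD800 && c <= 0xDFFF) return ConvError::kLoneSurrogate;
        if (c <= 0xFFFF) {
            out.push_back(static_cast<std::uint16_t>(c));
        } else {
            c -= 0x10000;
            out.push_back(static_cast<std::uint16_t>(0xD800 + (c >> 10)));
            out.push_back(static_cast<std::uint16_t>(0xDC00 + (c & 0x3FF)));
        }
    }
    return ConvError::kOk;
}
```

In this sketch the output buffer may hold partial data when an error is returned; a real layer would have to define whether partial results are kept or discarded.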
 === Creation ===
