Tamarin:Strings: Difference between revisions

Line 7: Line 7:
=== General ===
=== General ===


The new String class contains strings of variable width. A string can either be 8, 16, or even 32 bits (if 32-bit support is enabled). 8-bit strings contain the first 256 characters of the Unicode alphabet, often referred to as Latin-1. UTF-8 is not supported directly, but there is a string creation method that accepts a UTF-8 string, and a <tt>toUTF8String()</tt> method that returns a NUL-terminated UTF-8 string.
The new String class contains strings of variable width. A string can either be 8, 16, or even 32 bits (if 32-bit support is enabled). 8-bit strings contain the first 256 characters of the Unicode alphabet, often referred to as Latin-1. A special constructor accepts a null-terminated UTF-8 string.


Strings do not have a data buffer of their own; instead, strings are created with the data directly following the instance data. Therefore, there is no ''new'' operator; instead, there are a number of static <tt>create()</tt> methods that create and return a string.
Support for string widths of 32 bits is disabled by default; a special constant enables this support.


The data inside a string can be stored in different ways. A substring contains a reference to the master string; its data pointer points into the master string, and its length fits within the master string. Strings containing static data have a pointer to that data.
There is no ''new'' operator; instead, there are a number of static creation methods that create and return a string: <tt>createUTF8(), createLatin1(), createUTF16(), createUTF32()</tt>. All of these creators accept a width constant, so strings are created with widths of 8, 16, or 32 bits. The value ''kAuto'' lets the creators determine the width that fits best. If, for example, <tt>createUTF8()</tt> is invoked with a string that decodes to the Latin-1 character set, the resulting string width is 8 bits. All creators accept a Boolean value that, if true, declares the character data to be static, meaning that the String instance can use the buffer directly without having to copy the character data. Of course, the character data must be guaranteed to outlive the string and any strings derived from it.
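The width selection described for ''kAuto'' can be pictured with a small stand-alone sketch. It is not the Tamarin implementation: the <tt>Width</tt> enum and the <tt>narrowestWidth()</tt> function are hypothetical names, and the rule (pick the narrowest width that holds every decoded code point) is an assumption based on the paragraph above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical width constants (not Tamarin's); the values are bytes per character.
enum class Width : std::uint8_t { k8 = 1, k16 = 2, k32 = 4 };

// Returns the narrowest width that can store every code point unchanged.
Width narrowestWidth(const std::vector<char32_t>& codePoints) {
    Width w = Width::k8;
    for (char32_t cp : codePoints) {
        assert(cp <= 0x10FFFF);              // values beyond Unicode are a caller error
        if (cp > 0xFFFF) return Width::k32;  // needs 32 bits; cannot get any wider
        if (cp > 0xFF) w = Width::k16;       // beyond Latin-1, so at least 16 bits
    }
    return w;
}
```

Under this assumption, text that decodes entirely to Latin-1 yields the 8-bit width, matching the <tt>createUTF8()</tt> example in the paragraph above.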


Very important: Strings are never NUL-terminated, because they may contain a 0 as a valid character. It can very well happen that e.g. a string seems to contain the string "abcd", while the length is just 2, so the actual contents are "ab".
Usually, the character data is copied into a data buffer that the String instance points to. A substring contains a reference to the master string; its data pointer points into the master string, and its length fits within the master string. Strings containing static data have a pointer to that data.
 
'''Very important: Strings are never NUL-terminated, because they may contain NUL characters as valid characters.'''
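The practical consequence is that character data must always be handled together with an explicit length. The following is a generic C++ sketch of that idea, not Tamarin code; <tt>copyCounted()</tt> is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Copies exactly `length` characters from `data`.
// Unlike strlen()-based code, an embedded NUL does not end the text,
// and bytes after `length` are ignored even if no NUL precedes them.
std::string copyCounted(const char* data, std::size_t length) {
    assert(data != nullptr || length == 0);
    return length ? std::string(data, length) : std::string();
}
```

For instance, a buffer holding "abcd" with a length of 2 represents "ab", and a three-character buffer with a NUL in the middle still has length 3.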


=== Creation ===
=== Creation ===


The only way to create a string is to use one of the <tt>create()</tt> methods:
The only way to create a string is to use one of the static creator methods:


   static Stringp String::create(const AvmCore* core, const char* buffer,
   static Stringp String::createLatin1(const AvmCore* core, const char* buffer,
     int32_t len, Width desiredWidth = kAuto, bool staticBuf = false);
     int32_t len = -1, Width desiredWidth = kAuto, bool staticBuf = false);


There is a corresponding method for ''const wchar*'' data. Also, there are three static methods <tt>createUTF8()</tt>, <tt>createUTF16()</tt>, and <tt>createUTF32()</tt> (the latter only if 32-bit support is enabled). The first creates a string out of UTF-8 data, the second out of UTF-16 data (combining surrogate pairs into UTF-32 characters if the requested width was 32 bits), and the last creates 16-bit data, creating surrogate pairs if necessary.  
There is a <tt>createUTF16()</tt> method for ''const wchar*'' data, a method <tt>createUTF8()</tt> for UTF-8 character data, and <tt>createUTF32()</tt> (the latter only if 32-bit support is enabled).  
The default argument for the desired string width is ''kAuto''. In that case, the method checks the string and creates a String instance that best fits the string data. Therefore, the <tt>create()</tt> method with a ''const wchar*'' argument may return an 8-bit string if the data buffer only contains Latin-1 characters, but it will never return a 32-bit string, because it will not consider a surrogate pair as one single 32-bit character.
The default argument for the desired string width is ''kAuto''. In that case, the method checks the string and creates a String instance that best fits the string data. If the source data is 32 bits and the desired width is 16 bits, surrogate pairs will be created. If the source data is 16 bits and the destination data is 32 bits, surrogate pairs will be combined into a single UTF-32 character. If the requested width is too small to fit the string, NULL is returned.
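The surrogate pair conversion mentioned above follows the standard UTF-16 encoding rules. The sketch below shows that arithmetic in isolation; <tt>toSurrogates()</tt> and <tt>fromSurrogates()</tt> are hypothetical helper names and are not part of the String class.

```cpp
#include <cassert>
#include <utility>

// Splits a supplementary code point (U+10000..U+10FFFF) into a UTF-16
// surrogate pair, returned as (high, low).
std::pair<char16_t, char16_t> toSurrogates(char32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;  // 20 significant bits remain
    char16_t high = static_cast<char16_t>(0xD800 + (cp >> 10));    // top 10 bits
    char16_t low  = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // bottom 10 bits
    return {high, low};
}

// Combines a high/low surrogate pair back into a single code point.
char32_t fromSurrogates(char16_t high, char16_t low) {
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000 + ((static_cast<char32_t>(high) - 0xD800) << 10)
                   + (static_cast<char32_t>(low) - 0xDC00);
}
```

Code points below U+10000 need no pair and are stored as a single 16-bit unit, which is why the sketch rejects them.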


If the ''staticBuf'' argument is ''true'', the buffer is considered to live as long as the supplied <tt>AvmCore</tt> instance, and the string data is not copied if it matches the criteria set by the requested width. For UTF-8 data, the data must be ASCII to match these criteria.
If the ''staticBuf'' argument is ''true'', the buffer is considered to live as long as the supplied <tt>AvmCore</tt> instance, and the string data is not copied if it matches the criteria set by the requested width. For UTF-8 data, the data must be ASCII to match these criteria.
Line 31: Line 33:
=== Data access ===
=== Data access ===


First, direct access to the data buffer should be avoided, since it is not guaranteed that the string data is unique, or even writable. Therefore, the <tt>c_str()</tt> method is gone and has been replaced with a <tt>const void* getData()</tt> method. To retrieve the actual data the returned pointer points to, use the <tt>getWidth()</tt> method, which returns either 1, 2, or 4. A class <tt>String::Pointers</tt> is actually a union containing pointers to 8, 16, or 32-bit string data.
Direct access to the data buffer is no longer possible, since it is not guaranteed that the string data is unique, or even writable. Therefore, the <tt>c_str()</tt> method is gone. It is still possible to access single characters via the <tt>charAt()</tt> method or the <tt>StringIndexer</tt> class. The latter class is a fast way to iterate through the string data.
 
Example:
 
  // Create a string
  Stringp s = String::createLatin1(core, "Hello world");
  // Iterate through the string
  StringIndexer indexer(s);
  for (int i = 0; i < indexer->length(); i++)
    process (indexer[i]);


Strings may be written to if they are created with a known width, and ''NULL'' as the buffer pointer.
To retrieve a character string that contains UTF-8 or UTF-16 data, use the classes <tt>StUTF8String</tt> or <tt>StUTF16String</tt>. These classes can only be created on the stack, and each contains a NUL-terminated string that can be accessed via its <tt>c_str()</tt> method. Note that creating such an instance on the stack causes a copy of the string data to be made. Another class, <tt>StIndexableUTF8String</tt>, adds conversion between UTF-8 code point indexes and byte offsets. All of these classes are data buffers only; they are not "real" String instances.


Example:
Example:


   // create a 16-bit string with 1024 characters
   // Create a string
   String* myString = String::create(core, NULL, 1024, String::k16);
   Stringp s = String::createLatin1(core, "Hello world");
   // Access the string data; this string is new and referenced nowhere,
   // Access that string as UTF-16 data
   // so this is a safe operation.
   StUTF16String s16(s);
   wchar* myChars = (wchar*) myString->getData();
   const wchar* p = s16.c_str();
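The conversion between code point indexes and byte offsets that <tt>StIndexableUTF8String</tt> is described as providing can be sketched in plain C++ as follows. This is not the Tamarin implementation; the function names are hypothetical, and the sketch assumes well-formed UTF-8 and does a simple linear scan.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
inline bool isContinuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Byte offset at which code point number `index` starts.
// Returns utf8.size() when `index` equals the number of code points.
std::size_t byteOffsetOf(const std::string& utf8, std::size_t index) {
    std::size_t offset = 0;
    while (index > 0 && offset < utf8.size()) {
        ++offset;  // step over the lead byte
        while (offset < utf8.size() &&
               isContinuation(static_cast<unsigned char>(utf8[offset])))
            ++offset;  // step over its continuation bytes
        --index;
    }
    assert(index == 0);  // an index past the end is a caller error
    return offset;
}

// Inverse direction: number of code points that start before `byteOffset`.
std::size_t codePointIndexOf(const std::string& utf8, std::size_t byteOffset) {
    assert(byteOffset <= utf8.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < byteOffset; ++i)
        if (!isContinuation(static_cast<unsigned char>(utf8[i])))
            ++count;
    return count;
}
```

Both functions are O(n) in the length of the buffer; an optimized class would presumably cache positions, but the source does not say how.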
 
To get a string of a known fixed width, use the <tt>getFixedWidthString()</tt> method. The method returns ''this'' if the string already has the requested width; otherwise, it returns a copy of the string with the given width. Note that if the requested width is too narrow because e.g. a 16-bit string contains characters >= 0x0100, and an 8-bit string is requested, the return value is ''NULL''.


To get a string of a known fixed width, use the <tt>getFixedWidthString()</tt> method. The method returns ''this'' if the string already has the requested width; otherwise, it returns a copy of the string with the given width. Note that if the requested width is too narrow because e.g. a 16-bit string contains characters >= 0x0100, and an 8-bit string is requested, the return value is ''NULL''. Also, the returned string is not NUL-terminated.
Example:


   Stringp s = getSomeString();
   Stringp s = getSomeString();
Line 49: Line 62:
   if (!s16)
   if (!s16)
     return error;
     return error;
  do16BitBufferOperation((const wchar*) s16->getData(), s16->length());


To retrieve a character string that contains UTF-8 or UTF-16 data, use the methods <tt>toUTF8String()</tt> or <tt>toUTF16String()</tt>. These methods return a character string in UTF-8 or UTF-16 that is NUL-terminated. Note that this is not a String instance, but a simple data buffer containing the character data. The <tt>UTF8String</tt> class contains optimized methods that compute a code point index out of a byte index and vice versa, and the <tt>UTF16String</tt> class converts 32-bit characters into surrogate pairs if necessary (and if 32-bit support is enabled). If you do not need NUL-terminated strings, always try to use <tt>getFixedWidthString()</tt>, since the string already may contain characters in the desired width.
=== Appending data ===


String indexing is slower than usual because an index operation first has to determine the starting point of the string data according to the string type, and then add up the correct width. Therefore, the index operator has been removed. For single character access, use the <tt>charAt()</tt> method. There is a <tt>StringIndexer</tt> class that removes the first step of the calculation, which greatly speeds up indexed access to a string:
The String class offers several <tt>appendXXX()</tt> methods to append strings to a string. These methods return either a new String instance, or the String instance itself, if in-place concatenation was possible (see [[Tamarin:String_implementation]] for details).


   Stringp s = getSomeString();
   Stringp append(const String str);       // append a String instance
   // slow, but sufficient for single characters
   Stringp appendLatin1(const char* data);  // append characters
   wchar ch = s->charAt(0);
   Stringp append16(const wchar* data);     // append UTF-16 data
  // much better for loops; note that -> also is overloaded for quick access to the string
   Stringp append32(const utf32_t* data);   // append UTF-32 data
   StringIndexer str(s);
    
   for (int32_t i = 0; i < str->length(); i++)
If the appended data is too wide for the string, the string is widened. The latter three methods have overloads that additionally take the length of the data to be appended.
    ch = str[i];


=== Appending data ===
The old static <tt>concatStrings()</tt> is still available.


The main method to append strings to a string is the <tt>append()</tt> method. This method takes string data of any width (8, 16, or 32 bits) and appends it to a string. A new String instance is always returned (except when there are 0 characters to append, of course). Specialized versions append ''char*'' and ''wchar*'' data. Example:
Example:


   // Create an XML attribute with namespace
   // Create an XML attribute with namespace
Line 72: Line 83:
   if (ns) {
   if (ns) {
     name = ns;
     name = ns;
     name = name->append(":");
     name = name->appendLatin1(":");
     if (xml->isAttr)
     if (xml->isAttr)
       name = name->append("@");
       name = name->appendLatin1("@");
     name = name->append(xml->getName());
     name = name->append(xml->getName());
   } else {
   } else {
     name = xml->getName();
     name = xml->getName();
Line 82: Line 93:
=== Additional String methods ===
=== Additional String methods ===


The String class contains most of the usual JavaScript String methods like <tt>indexOf()</tt> etc. These are highly optimized and accept integer arguments, so it is OK to use them freely. There is a special version of <tt>indexOf</tt> that accepts a ''char*'' for a quick compare with a character constant, as well as a <tt>matches()</tt> method that matches the string at a given position to a ''char*''. Example:
The String class contains most of the usual JavaScript String methods like <tt>indexOf()</tt> etc. These are highly optimized and accept integer arguments, so it is OK to use them freely. There is a special version of <tt>indexOf</tt> that accepts a ''char*'' for a quick compare with a character constant, as well as <tt>matchesXXX()</tt> methods that match the string at a given position against an argument. Finally, there are <tt>containsXXX()</tt> methods to check for the existence of a substring.  
 
Example:


   Stringp s = ...;
   Stringp s = ...;
   if (s->matches("<?xml", 5)) ...
   if (s->matchesLatin1("<?xml", 5)) ...
   else if (s->matches("<![CDATA[", 9)) ...
   else if (s->matchesLatin1("<![CDATA[", 9)) ...
   if (s->indexOf(":") < 0) ...
   if (s->indexOfLatin1(":") < 0) ...
 
The <tt>getIndependentString()</tt> method converts a substring to a normal string. This is handy if the string needs to live for a long time, but the master of the dependent substring should not be kept alive for that long.
 
Example:
 
  Stringp xml = parseVeryLargeXMLFile();
  Stringp start = xml->substr (0, 3);
  // if start were to be stored anywhere, the entire, huge XML string would remain alive
  // unless makeDynamic was called
  start = start->getIndependentString();
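Why a dependent substring keeps its master alive, and how an independent copy releases it, can be modelled with reference counting. The sketch below uses <tt>std::shared_ptr</tt> as a stand-in for Tamarin's memory management, which is an assumption; <tt>Slice</tt>, <tt>makeSlice()</tt> and <tt>makeIndependent()</tt> are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical model of a dependent substring: it owns a reference to the
// master buffer plus a start offset and a length inside it.
struct Slice {
    std::shared_ptr<const std::string> master;
    std::size_t start;
    std::size_t length;

    // Materializes the characters this slice covers.
    std::string text() const { return master->substr(start, length); }
};

// Creates a slice that shares (and therefore keeps alive) the master buffer.
Slice makeSlice(std::shared_ptr<const std::string> master,
                std::size_t start, std::size_t length) {
    assert(start <= master->size() && length <= master->size() - start);
    return Slice{std::move(master), start, length};
}

// Copies only the covered characters into a fresh buffer, so the result
// no longer holds a reference to the original master.
Slice makeIndependent(const Slice& s) {
    std::shared_ptr<const std::string> copy =
        std::make_shared<std::string>(s.text());
    return Slice{copy, 0, s.length};
}
```

In this model, replacing a stored slice with its independent copy drops the last extra reference to the large master buffer, which is the effect the example above relies on.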
 
The <tt>makeDynamic()</tt> method converts a string with a static buffer, or a string that is a substring, into a string with a dynamic buffer.