Tamarin:Strings: Difference between revisions

From MozillaWiki

Revision as of 14:37, 9 December 2008

Tamarin:Strings

The implementation of Tamarin strings has changed. This page describes the changes and explains how to adapt source code to them.

For implementation details, see Tamarin:String_implementation.

General

The new String class contains strings of variable width. The characters of a string can be 8, 16, or even 32 bits wide (if 32-bit support is enabled). 8-bit strings contain the first 256 characters of the Unicode character set, often referred to as Latin-1. UTF-8 is not supported directly, but there is a string creation method that accepts a UTF-8 string, and a toUTF8String() method that returns a 0-terminated UTF-8 string.

Strings do not have a data buffer of their own; instead, strings are created with the data directly following the instance data. Therefore, there is no new operator; instead, there are a number of static create() methods that create and return a string.

The data inside a string can be stored in different ways. A substring contains a reference to the master string; its data pointer points into the master string, and its length fits within the master string. Strings containing static data have a pointer to that data.

Very important: Strings are never 0-terminated, because they may contain a 0 as a valid character. It can very well happen that, for example, the data buffer of a string seems to contain "abcd" while the length is just 2, so the actual contents are "ab".
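The point above can be illustrated in plain C++, without any Tamarin types. The following is only a sketch: CountedChars, cStringLength() and sameContents() are hypothetical names invented for this example. It shows that a C-string function disagrees with the stored length as soon as the buffer runs past the string or contains an embedded 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a length-counted string: a pointer plus an
// explicit character count. Nothing here is part of the Tamarin API.
struct CountedChars {
    const char* data;
    std::size_t length;
};

// Length as a C-string function sees it: it stops at the first 0 byte and
// ignores the stored length entirely.
inline std::size_t cStringLength(const CountedChars& s) {
    assert(s.data != nullptr);
    return std::strlen(s.data);
}

// Compare by stored length, so embedded 0 characters are compared like any
// other character.
inline bool sameContents(const CountedChars& a, const CountedChars& b) {
    return a.length == b.length && std::memcmp(a.data, b.data, a.length) == 0;
}
```

Code that passes string data to C-style functions should therefore always pass the length along, as the examples on this page do.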

Creation

The only way to create a string is to use one of the create() methods:

 static Stringp String::create (const AvmCore* core, const char* buffer,
   int32_t len, Width desiredWidth = kAuto, bool staticBuf = false);

There is a corresponding method for const wchar* data. There are also three static methods, createUTF8(), createUTF16(), and createUTF32() (the latter only if 32-bit support is enabled). The first creates a string out of UTF-8 data; the second out of UTF-16 data, combining surrogate pairs into UTF-32 characters if the requested width is 32 bits; and the last creates 16-bit data, creating surrogate pairs if necessary. The default argument for the desired string width is kAuto. In that case, the method checks the string and creates a String instance that best fits the string data. Therefore, the create() method with a const wchar* argument may return an 8-bit string if the data buffer contains only Latin-1 characters, but it will never return a 32-bit string, because it does not treat a surrogate pair as a single 32-bit character.
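The kAuto behavior for 16-bit input can be pictured roughly as follows. This is a sketch of the idea, not the actual implementation: SketchWidth and pickAutoWidth() are hypothetical names, and the real method may apply additional criteria.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical width tags, loosely mirroring the 8/16-bit choice described
// above; they are not the real String::Width enumerators.
enum class SketchWidth { k8 = 1, k16 = 2 };

// Pick the narrowest width that can hold every 16-bit code unit in the buffer.
// Surrogate pairs are deliberately not combined, so the result is never 32-bit.
inline SketchWidth pickAutoWidth(const std::uint16_t* buffer, std::size_t len) {
    assert(buffer != nullptr || len == 0);
    for (std::size_t i = 0; i < len; ++i) {
        if (buffer[i] > 0xFF)
            return SketchWidth::k16;  // outside Latin-1: needs 16 bits
    }
    return SketchWidth::k8;  // all Latin-1 (or empty): 8 bits suffice
}
```

In this sketch a buffer holding a surrogate pair simply yields a 16-bit result, which matches the statement that a 32-bit string is never returned in this case.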

If the staticBuf argument is true, the buffer is considered to live as long as the supplied AvmCore instance, and the string data is not copied if it meets the criteria set by the requested width. For UTF-8 data, the data must be ASCII to meet these criteria.

These methods may return NULL if the source data is malformed, or does not fit into the requested string width.

Data access

First, direct access to the data buffer should be avoided, since it is not guaranteed that the string data is unique, or even writable. Therefore, the c_str() method is gone and has been replaced with a const void* getData() method. To interpret the data that the returned pointer points to, use the getWidth() method, which returns 1, 2, or 4. The String::Pointers class is actually a union containing pointers to 8, 16, or 32-bit string data.
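Reading through an untyped pointer by width might look like the sketch below. It assumes that the 1, 2, or 4 returned by getWidth() is the character size in bytes; readCharAt() is a hypothetical helper written for this example and is not a Tamarin function.

```cpp
#include <cassert>
#include <cstdint>

// Read the character at `index` from a raw buffer whose element size in bytes
// is `width` (1, 2, or 4), as a getWidth()-style call would report it.
// Hypothetical helper; the caller is responsible for bounds checking.
inline std::uint32_t readCharAt(const void* data, int width, int index) {
    assert(data != nullptr && index >= 0);
    switch (width) {
    case 1: return static_cast<const std::uint8_t*>(data)[index];
    case 2: return static_cast<const std::uint16_t*>(data)[index];
    case 4: return static_cast<const std::uint32_t*>(data)[index];
    default:
        assert(false && "unsupported width");
        return 0;
    }
}
```

Returning a 32-bit value regardless of the stored width keeps the caller independent of how the string happens to be stored.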

Strings may be written to if they are created with a known width, and NULL as the buffer pointer.

Example:

 // create a 16-bit string with 1024 characters
 String* myString = String::create (core, NULL, 1024, String::k16);
 // Access the string data; this string is new and referenced nowhere,
 // so this is a safe operation.
 wchar* myChars = (wchar*) myString->getData();

To get a string of a known fixed width, use the getFixedWidthString() method. The method returns this if the string already has the requested width; otherwise, it returns a copy of the string with the given width. Note that if the requested width is too narrow, e.g. because a 16-bit string contains characters >= 0x0100 and an 8-bit string is requested, the return value is NULL. Also, the returned string is not 0-terminated.

 Stringp s = getSomeString();
 Stringp s16 = s->getFixedWidthString(String::k16);
 if (!s16)
   return error;
 do16BitBufferOperation((const wchar*) s16->getData(), s16->length());
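Why a narrowing request can fail is easy to see in a standalone sketch. The helper narrowTo8() below is hypothetical and works on std::vector rather than on String instances; its false result plays the role of the NULL return value described above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Try to narrow 16-bit characters to 8 bits. Returns false and leaves `out`
// empty if any character is >= 0x0100, because such a character has no
// 8-bit (Latin-1) representation. Hypothetical helper, not Tamarin code.
inline bool narrowTo8(const std::vector<std::uint16_t>& in,
                      std::vector<std::uint8_t>& out) {
    out.clear();
    for (std::uint16_t ch : in) {
        if (ch >= 0x0100) {
            out.clear();
            return false;
        }
        out.push_back(static_cast<std::uint8_t>(ch));
    }
    assert(out.size() == in.size());
    return true;
}
```

Widening never fails in this way, which is why the 16-bit request in the example above only needs a check for NULL as a general precaution.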

To retrieve a character string that contains UTF-8 or UTF-16 data, use the methods toUTF8String() or toUTF16String(). These methods return a character string in UTF-8 or UTF-16 that is NUL-terminated. Note that this is not a String instance, but a simple data buffer containing the character data. The UTF8String class contains optimized methods that compute a code point index out of a byte index and vice versa, and the UTF16String class converts 32-bit characters into surrogate pairs if necessary (and if 32-bit support is enabled). If you do not need NUL-terminated strings, prefer getFixedWidthString(), since the string may already contain characters in the desired width.
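The conversion of a 32-bit character into a surrogate pair follows the standard UTF-16 rule and can be sketched as shown here. appendUtf16() is a hypothetical helper for illustration; it is not taken from the UTF16String class.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Append one code point to a UTF-16 buffer, splitting values above 0xFFFF
// into a high and a low surrogate. Hypothetical helper, not Tamarin code.
inline void appendUtf16(std::uint32_t cp, std::vector<std::uint16_t>& out) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x10000) {
        out.push_back(static_cast<std::uint16_t>(cp));
    } else {
        const std::uint32_t v = cp - 0x10000;  // 20 significant bits remain
        out.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
}
```

For instance, U+1F600 becomes the two code units 0xD83D and 0xDE00, so the UTF-16 length of a string can exceed its length in 32-bit characters.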

String indexing is slower than usual because an index operation first has to determine the starting point of the string data according to the string type, and then scale the index by the correct character width. Therefore, the index operator has been removed. For single character access, use the charAt() method. There is a StringIndexer class that removes the first step of the calculation, which greatly speeds up indexed access to a string:

 Stringp s = getSomeString();
 // slow, but sufficient for single characters
 wchar ch = s->charAt(0);
 // much better for loops; note that -> also is overloaded for quick access to the string
 StringIndexer str(s);
 for (int32_t i = 0; i < str->length(); i++)
   ch = str[i];
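The saving can be pictured with a simplified model. Everything in the sketch below is an assumption made for illustration: SketchString, slowCharAt() and SketchIndexer are hypothetical, the model is fixed to 16-bit data, and substrings are assumed to be only one level deep. The real StringIndexer may work differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical string model: either owns `chars`, or is a substring that
// refers to `master` starting at `offset`.
struct SketchString {
    const SketchString* master;  // non-null for a substring
    const std::uint16_t* chars;  // own data; unused when master is set
    int offset;                  // start within master (substring only)
    int length;
};

// Both steps on every call: locate the start of the data, then index.
inline std::uint16_t slowCharAt(const SketchString& s, int i) {
    assert(i >= 0 && i < s.length);
    const std::uint16_t* base = s.master ? s.master->chars + s.offset : s.chars;
    return base[i];
}

// Locates the start of the data once, so operator[] is a plain array read.
class SketchIndexer {
public:
    explicit SketchIndexer(const SketchString& s)
        : m_base(s.master ? s.master->chars + s.offset : s.chars),
          m_length(s.length) {}
    int length() const { return m_length; }
    std::uint16_t operator[](int i) const {
        assert(i >= 0 && i < m_length);
        return m_base[i];
    }
private:
    const std::uint16_t* m_base;
    int m_length;
};
```

Both access paths return the same characters; the indexer only moves the lookup of the data start out of the loop.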

Appending data

The main method to append strings to a string is the append() method. This method takes string data of any width (8, 16, or 32 bits) and appends it to a string. A new String instance is always returned (except when there are 0 characters to append, of course). Specialized versions append char* and wchar* data. Example:

 // Create an XML attribute with namespace
 Stringp ns = xml->getNamespace();
 Stringp name = NULL;
 if (ns) {
   name = ns;
   name = name->append (":");
   if (xml->isAttr)
     name = name->append ("@");
   name = name->append (xml->getName());
 } else {
   name = xml->getName();
 }
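Because append() returns a new instance, its result must always be assigned, as the example does. The following standalone sketch models that contract with standard C++ types only; SketchStringP and appendSketch() are hypothetical names, and the real method operates on String instances rather than on std::string.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical immutable string handle for this sketch.
using SketchStringP = std::shared_ptr<const std::string>;

// Return a new string holding base + extra; the original is never modified.
// With nothing to append, the same instance is handed back.
inline SketchStringP appendSketch(const SketchStringP& base, const std::string& extra) {
    assert(base != nullptr);
    if (extra.empty())
        return base;
    return std::make_shared<const std::string>(*base + extra);
}
```

Calling the function without using its return value leaves the original string unchanged, which is the most likely mistake when porting code that used to modify a buffer in place.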

Additional String methods

The String class contains most of the usual JavaScript String methods like indexOf() etc. These are highly optimized and accept integer arguments, so it is OK to use them freely. There is a special version of indexOf that accepts a char* for a quick compare with a character constant, as well as a matches() method that matches the string at a given position to a char*. Example:

 Stringp s = ...;
 if (s->matches ("<?xml", 5)) ...
 else if (s->matches ("<![CDATA[", 9)) ...
 if (s->indexOf (":") < 0) ...