ServerJS/Encodings: Difference between revisions

From MozillaWiki
(→‎Class: Transcoder: add sourceCharset/destinationCharset constants)
 
(7 intermediate revisions by 2 users not shown)
Line 1: Line 1:
== Rationale ==
For Streams, we need encodings support. There also should be a low-level API available for this.
For Streams, we need encodings support. There also should be a low-level API available for this.


There is some discussion on the mailing list (see <http://groups.google.com/group/serverjs/browse_thread/thread/6365b2a54615a134>) and here, there is a summary of these efforts.
= Specification =


== Encoding Names ==
== Encoding Names ==
Line 41: Line 39:
: encodingCheckerFunction takes the encoding name as a parameter and returns true-ish if the encoding should be listed. Regexes should also be supported. If the parameter is missing, returns all supported encodings.
: encodingCheckerFunction takes the encoding name as a parameter and returns true-ish if the encoding should be listed. Regexes should also be supported. If the parameter is missing, returns all supported encodings.


=== Class: Converter ===
=== Class: Transcoder ===


There also should be a class enc.Converter for more advanced conversion.
There also should be a class enc.Transcoder for general transcoding conversion (between ByteStrings or ByteArrays).


Please note that the interface is ''one way''. Despite the fact that it has two methods, it's ''one way''. It only supports ''encoding'', '''no''' decoding! Multiple people have got that wrong when glancing over it, so it's emphasised here.
; [Constructor] Transcoder(from, to)
 
; [Constructor] Converter(from, to)
: Where from and to are the encoding names.
: Where from and to are the encoding names.
; [Method] write(byteStringOrArray)
; [Constant] sourceCharset
: Convert input from a ByteString or ByteArray. The results are stored in an internal buffer, and also those parts of byteStringOrArray that could not be converted (for multi-byte encodings, in a separate buffer).
: String containing the (possibly normalised) source charset name.
: Returns nothing.
; [Constant] destinationCharset
; [Method] read([byteArray,] [maximumSize])
: String containing the (possibly normalised) destination charset name.
: Read maximumSize bytes or as many bytes as available out of the internal buffer. If byteArray is specified, the data is written into that ByteArray.
; [Method] push(byteStringOrArray[, outputByteArray])
: Returns a ByteString if byteArray is not specified, or byteArray itself otherwise.
: Convert input from a ByteString or ByteArray. Those parts of byteStringOrArray that could not be converted (for multi-byte encodings) are stored in a buffer. If outputByteArray is passed, the results are ''appended'' to outputByteArray.
; [Method] close()
: If outputByteArray was passed, returns outputByteArray, otherwise returns <u>the converted bytes as a ByteString</u>.
: <u>The result will also contain bytes accumulated in prior calls to pushAccumulate.</u>
; <u>[Method] pushAccumulate(byteStringOrArray)</u>
: <u>Convert input from a ByteString or ByteArray into an internal buffer that will be read out the next time push or close is called.</u>
; [Method] close([outputByteArray])
: Close the stream. Throws an exception if there was a conversion error (specifically, a partial multibyte character).
: Close the stream. Throws an exception if there was a conversion error (specifically, a partial multibyte character).
: Returns nothing and takes no parameters.
: <u>Writes the remaining output bytes (including those that were accumulated in pushAccumulate) into the here given outputByteArray (appended) or a new ByteString. If outputByteArray is given, it is returned, otherwise the ByteString is returned.</u>
: <u>Also adds initial shift state sequences if required by the encoding.</u>


'''TODO''': Which exception to throw on error?
'''TODO''': Which exception to throw on error?
Example usage:
  Converter = require('encodings').Converter
  converter = new Converter('iso-8859-1', 'utf-32')
  converter.write(input) // input is a ByteString
  output = converter.read() // output is a ByteString
  converter.close()
=== Alternative: Converter ===
There is another way the Converter interface could work. It should be more efficient (less memory consumption, less CPU usage) than the other Converter interface.
; [Constructor] Converter(from, to)
: Where from and to are the encoding names.
; [Method] push(byteStringOrArray[, outputByteArray])
: Convert input from a ByteString or ByteArray. Those parts of byteStringOrArray that could not be converted (for multi-byte encodings) are stored in a buffer. If outputByteArray is passed, the results are ''appended'' to outputByteArray.
: If outputByteArray was passed, returns outputByteArray, otherwise returns (as a ByteString) as much output as could be converted.
; [Method] close()
: Close the stream. Throws an exception if there was a conversion error (specifically, a partial multibyte character).
: Returns nothing and takes no parameters.


Example:
Example:


   Converter = require('encodings').Converter
   Transcoder = require('encodings').Transcoder
   converter = new Converter('iso-8859-1', 'utf-32')
   transcoder = new Transcoder('iso-8859-1', 'utf-32')
   output = converter.push(input) // input is a ByteString, and output too
   transcoder.pushAccumulate(input) // input is a ByteString
   converter.close()
   output = transcoder.close() // and output is a ByteString too


Another example:
Another example:


   converter = new Converter('utf-32', 'utf-8')
   transcoder = new Transcoder('utf-32', 'utf-8')
   output = new ByteArray()
   output = new ByteArray()
   while (input = readSomeByteFromSomewhere()) {
   while (input = readSomeByteFromSomewhere()) {
           converter.push(input, output)
           transcoder.push(input, output)
   }
   }
   converter.close()
   transcoder.close(output)
   // output is the complete conversion of all the input chunks concatenated now
   // output is the complete conversion of all the input chunks concatenated now
(See [[ServerJS/Encodings/OldClass]] for another API.)
= Implementation Recommendations =
First of all, it is recommended to implement convertToString, convertFromString and convert with Transcoder.
Secondly, make sure that initial shift state support is properly implemented. When you're using iconv, you need to call iconv(cd, 0, 0, &ob, &ol) in Transcoder.close(). An example of what an initial shift state is: in the Japanese ISO-2022-JP encoding, the default state is ASCII. However, the state can be switched to Japanese with an escape sequence. To make sure that the state is ASCII again at the end of the text, iconv emits another escape sequence that switches back. This matters if you want to concatenate ISO-2022-JP texts, and an implementation of Transcoder that doesn't properly emit these sequences is <b>broken</b>.
= Relevant Discussions =
* [http://groups.google.com/group/serverjs/browse_thread/thread/6365b2a54615a134 Proposal: Encodings API (independent from Streams)]
* [http://groups.google.com/group/serverjs/browse_thread/thread/3faf4a067973a71a How to make Encodings]

Latest revision as of 14:02, 6 June 2009

For Streams, we need encodings support. There also should be a low-level API available for this.

Specification

Encoding Names

The encoding names should be among those supported by iconv, which seem to be a superset of the names registered at http://www.iana.org/assignments/character-sets.

The following encodings are required:

  • US-ASCII
  • UTF-8
  • UTF-16
  • ISO-8859-1

Encoding names must be case-insensitive.
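
One way to get case-insensitive names is to fold every name to a single spelling before looking it up. The following is only a sketch: the helper names normaliseEncodingName and isRequiredEncoding are hypothetical and not part of the proposal, and alias handling (for example "latin1") is left out.

```javascript
// Hypothetical helper, not part of the proposal: fold an encoding name
// to one canonical spelling so that lookups ignore case.
function normaliseEncodingName(name) {
    return String(name).toUpperCase();
}

// The four encodings the specification requires, in canonical form.
var REQUIRED_ENCODINGS = ['US-ASCII', 'UTF-8', 'UTF-16', 'ISO-8859-1'];

// True if the given name, in any letter case, is a required encoding.
function isRequiredEncoding(name) {
    return REQUIRED_ENCODINGS.indexOf(normaliseEncodingName(name)) !== -1;
}
```

With this, 'utf-8', 'Utf-8' and 'UTF-8' all refer to the same encoding.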

API

OK, so probably this should be a module:

 var enc = require('encodings')

Simple methods

For convenience, there should be these easy methods for converting between encodings:

string = enc.convertToString(sourceEncoding, byteStringOrArray)
Converts a ByteString or a ByteArray to a JavaScript string.
byteString = enc.convertFromString(targetEncoding, string)
Converts a JavaScript string to a ByteString.
byteString = enc.convert(sourceEncoding, targetEncoding, byteStringOrArray)
Converts a ByteString or a ByteArray to a ByteString.

Checking for available encodings

enc.supports(encodingName)
Checks if encodingName is supported and returns true if so, false otherwise.
enc.listEncodings([encodingCheckerFunction or regex])
Returns the supported encodings, optionally filtered. encodingCheckerFunction takes the encoding name as a parameter and returns true-ish if the encoding should be listed. Regexes should also be supported as filters. If the parameter is missing, all supported encodings are returned.
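
The two checking functions could behave roughly as follows. This is a sketch under assumptions: the SUPPORTED array is a hypothetical stand-in for whatever the platform really provides, and the functions are written as free functions rather than as members of the encodings module.

```javascript
// Hypothetical registry standing in for the platform's encoding list.
var SUPPORTED = ['US-ASCII', 'UTF-8', 'UTF-16', 'ISO-8859-1'];

// Case-insensitive membership test.
function supports(encodingName) {
    return SUPPORTED.indexOf(String(encodingName).toUpperCase()) !== -1;
}

// Accepts nothing, a RegExp, or a checker function.
function listEncodings(filter) {
    if (filter === undefined) {
        return SUPPORTED.slice();            // no filter: list everything
    }
    if (filter instanceof RegExp) {
        return SUPPORTED.filter(function (name) {
            filter.lastIndex = 0;            // ignore state of /g regexes
            return filter.test(name);
        });
    }
    return SUPPORTED.filter(function (name) {
        return Boolean(filter(name));        // any true-ish result counts
    });
}

var unicodeOnly = listEncodings(/^UTF-/);
var shortNames = listEncodings(function (name) { return name.length <= 6; });
```

Here unicodeOnly holds the two UTF encodings and shortNames holds the names of at most six characters.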

Class: Transcoder

There also should be a class enc.Transcoder for general transcoding conversion (between ByteStrings or ByteArrays).

[Constructor] Transcoder(from, to)
Where from and to are the encoding names.
[Constant] sourceCharset
String containing the (possibly normalised) source charset name.
[Constant] destinationCharset
String containing the (possibly normalised) destination charset name.
[Method] push(byteStringOrArray[, outputByteArray])
Converts input from a ByteString or ByteArray. Trailing bytes of byteStringOrArray that cannot be converted yet (an incomplete character in a multi-byte encoding) are stored in a buffer. If outputByteArray is passed, the results are appended to outputByteArray.
If outputByteArray was passed, returns outputByteArray, otherwise returns the converted bytes as a ByteString.
The result will also contain bytes accumulated in prior calls to pushAccumulate.
[Method] pushAccumulate(byteStringOrArray)
Convert input from a ByteString or ByteArray into an internal buffer that will be read out the next time push or close is called.
[Method] close([outputByteArray])
Close the stream. Throws an exception if there was a conversion error (specifically, a partial multibyte character).
Writes the remaining output bytes (including those accumulated by pushAccumulate) to outputByteArray, appending to it, or to a new ByteString if no outputByteArray is given. Returns outputByteArray if it was given, otherwise the new ByteString.
Also emits the sequences needed to return to the initial shift state, if the encoding requires them.

TODO: Which exception to throw on error?

Example:

 var Transcoder = require('encodings').Transcoder
 var transcoder = new Transcoder('iso-8859-1', 'utf-32')
 transcoder.pushAccumulate(input) // input is a ByteString
 var output = transcoder.close() // and output is a ByteString too

Another example:

 var transcoder = new Transcoder('utf-32', 'utf-8')
 var output = new ByteArray()
 var input
 while (input = readSomeByteFromSomewhere()) {
         transcoder.push(input, output)
 }
 transcoder.close(output)
 // output is the complete conversion of all the input chunks concatenated now
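
To make the buffering rules concrete, here is a toy stand-in that follows the push / pushAccumulate / close contract described above. It is a sketch, not the proposed implementation: ToyTranscoder is a hypothetical name, plain arrays of byte values replace ByteString and ByteArray, only UTF-8 to ISO-8859-1 is handled, and a plain Error is thrown because the exception type is still an open TODO.

```javascript
// Toy stand-in: arrays of byte values replace ByteString/ByteArray, and
// only UTF-8 -> ISO-8859-1 (code points up to U+00FF) is supported.
function ToyTranscoder(from, to) {
    if (from.toLowerCase() !== 'utf-8' || to.toLowerCase() !== 'iso-8859-1') {
        throw new Error('unsupported conversion: ' + from + ' -> ' + to);
    }
    this.sourceCharset = 'UTF-8';
    this.destinationCharset = 'ISO-8859-1';
    this._pendingInput = [];   // incomplete multi-byte sequence, if any
    this._accumulated = [];    // output collected by pushAccumulate
}

// Convert what can be converted; keep a dangling lead byte for later.
ToyTranscoder.prototype._convert = function (bytes) {
    var input = this._pendingInput.concat(bytes);
    var out = [];
    var i = 0;
    while (i < input.length) {
        var b = input[i];
        if (b < 0x80) {                        // ASCII: one byte
            out.push(b);
            i += 1;
        } else if (b === 0xC2 || b === 0xC3) { // lead byte for U+0080..U+00FF
            if (i + 1 >= input.length) break;  // trail byte not here yet
            var trail = input[i + 1];
            if ((trail & 0xC0) !== 0x80) throw new Error('invalid trail byte');
            out.push(((b & 0x1F) << 6) | (trail & 0x3F));
            i += 2;
        } else {
            throw new Error('byte not convertible in this toy: ' + b);
        }
    }
    this._pendingInput = input.slice(i);
    return out;
};

// Returns earlier accumulated output followed by the new output.
ToyTranscoder.prototype.push = function (bytes, output) {
    var converted = this._accumulated.concat(this._convert(bytes));
    this._accumulated = [];
    if (output) {
        Array.prototype.push.apply(output, converted); // append
        return output;
    }
    return converted;
};

ToyTranscoder.prototype.pushAccumulate = function (bytes) {
    this._accumulated = this._accumulated.concat(this._convert(bytes));
};

// Flushes accumulated output; a leftover partial character is an error.
ToyTranscoder.prototype.close = function (output) {
    if (this._pendingInput.length > 0) {
        throw new Error('partial multibyte character at end of input');
    }
    var rest = this._accumulated;
    this._accumulated = [];
    if (output) {
        Array.prototype.push.apply(output, rest);
        return output;
    }
    return rest;
};

var toy = new ToyTranscoder('utf-8', 'iso-8859-1');
var first = toy.push([0x63, 0x61, 0x66, 0xC3]); // "caf" plus a lead byte
var second = toy.push([0xA9]);                  // trail byte completes U+00E9
var rest = toy.close();                         // nothing left to flush
```

The lead byte 0xC3 is held back by the first push and only produces output once its trail byte arrives in the second push. Had the input ended after the first push, close() would have thrown.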

(See ServerJS/Encodings/OldClass for another API.)

Implementation Recommendations

First of all, it is recommended to implement convertToString, convertFromString and convert with Transcoder.
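
As an illustration of the first recommendation, convert can be a thin wrapper around a Transcoder. This is a sketch: makeConvert is a hypothetical factory name, and IdentityTranscoder is a hypothetical pass-through stub used only so the example runs without a real encodings module. convertToString and convertFromString are not shown because they also depend on how the engine stores strings internally.

```javascript
// Sketch: build convert() on top of any Transcoder-like constructor.
function makeConvert(Transcoder) {
    return function convert(sourceEncoding, targetEncoding, input) {
        var transcoder = new Transcoder(sourceEncoding, targetEncoding);
        transcoder.pushAccumulate(input);
        return transcoder.close(); // flushes output, throws on partial input
    };
}

// Hypothetical stub that copies bytes unchanged (arrays stand in for
// ByteString); a real implementation would transcode here.
function IdentityTranscoder(from, to) {
    this.sourceCharset = from;
    this.destinationCharset = to;
    this._buffer = [];
}
IdentityTranscoder.prototype.pushAccumulate = function (bytes) {
    this._buffer = this._buffer.concat(bytes);
};
IdentityTranscoder.prototype.close = function () {
    var result = this._buffer;
    this._buffer = [];
    return result;
};

var convert = makeConvert(IdentityTranscoder);
var copied = convert('us-ascii', 'utf-8', [0x68, 0x69]);
```

Routing the simple methods through Transcoder means error handling and shift state handling live in one place.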

Secondly, make sure that initial shift state support is properly implemented. When you're using iconv, you need to call iconv(cd, 0, 0, &ob, &ol) in Transcoder.close(). An example of what an initial shift state is: in the Japanese ISO-2022-JP encoding, the default state is ASCII. However, the state can be switched to Japanese with an escape sequence. To make sure that the state is ASCII again at the end of the text, iconv emits another escape sequence that switches back. This matters if you want to concatenate ISO-2022-JP texts, and an implementation of Transcoder that doesn't properly emit these sequences is broken.
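
The shift state behaviour can be illustrated without iconv. The following toy is a sketch under assumptions: ToyShiftEncoder is a hypothetical name, it takes already encoded byte segments tagged with their character set instead of converting characters, and it knows only the two ISO-2022-JP escape sequences ESC ( B (ASCII) and ESC $ B (JIS X 0208).

```javascript
var ESC_TO_ASCII = [0x1B, 0x28, 0x42];   // ESC ( B
var ESC_TO_JIS = [0x1B, 0x24, 0x42];     // ESC $ B

// Toy encoder that only tracks the shift state.
function ToyShiftEncoder() {
    this.state = 'ascii';                // ISO-2022-JP starts in ASCII
}

// charset is 'ascii' or 'jis'; bytes are already encoded in that set.
ToyShiftEncoder.prototype.push = function (charset, bytes) {
    var out = [];
    if (charset !== this.state) {        // emit an escape only on a switch
        out = out.concat(charset === 'jis' ? ESC_TO_JIS : ESC_TO_ASCII);
        this.state = charset;
    }
    return out.concat(bytes);
};

// Return to the initial state so that texts can be concatenated safely.
ToyShiftEncoder.prototype.close = function () {
    var out = this.state === 'ascii' ? [] : ESC_TO_ASCII.slice();
    this.state = 'ascii';
    return out;
};

var encoder = new ToyShiftEncoder();
var body = encoder.push('jis', [0x30, 0x21]); // one two-byte JIS character
var tail = encoder.close();                   // switches back to ASCII
```

Without the bytes returned by close(), any ASCII text appended after body would be misread as JIS X 0208 data.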

Relevant Discussions