ServerJS/Binary/C

This proposal was written by Daniel Friesen as an alternative to the Binary/B proposal.

It reflects the Blob type that is used in a number of existing Server-side JavaScript implementations, as well as a Buffer type reflecting the existing StringBuffer/StringBuilder within Java.

ByteArray is excluded and Buffer proposed instead. A short note: none of the prior art actually implemented a ByteArray as Binary/B proposes. Most implementations implemented a Blob type, and any that implemented something called "ByteArray" actually implemented something closer to a stream-API-based buffer than anything resembling an array.

One of the important considerations in this proposal was interoperability between strings and blobs, i.e. the ability to write code that can abstractly extract, combine, and buffer strings and blobs through an API that is ignorant of whether the data is binary or text. Likewise, things which seemed counter-intuitive (such as putting .charAt on Blob) were avoided.

Most of this was based on APIs drafted for MonkeyScript (Blob, Buffer).

Terms

To avoid confusion and ambiguity these are the basic definitions of terms used within this document.

List: A type which groups a series of items in a specific order.
Sequence: A type of list which manages a list of fixed-unit pieces of data. These units of data are normally either bytes or characters. "Sequence" is a term which refers generically to Strings, Blobs/ByteStrings, and mutable counterparts such as Buffer.
Array: A type of list which manages a list of items. These items are not related to one another in any way other than their inclusion in the list and do not need to be of the same type. The key point is that an Array is a loose collection of items; the items do not have any sort of fixed unit.

Differences between a Sequence and an Array

While this may not be the case in lower level languages, JavaScript's API does make a clear distinction between strings and arrays.

Units

A Sequence is built up of a list of single-unit items, whilst an Array is built up of unitless items: the array does nothing but point to objects and contains nothing itself. The sequence "abc" is made up of 3 units { a, b, c }, whilst [1,"asdf",3,{}] is made up of 4 items { 1, "asdf", 3, {} } with no relation to each other and no fixed unit: it holds two separate numbers, a 4-unit sequence, and an object which could have an indefinite hierarchy.

Spillover

Depending on whether the type is a Sequence or an Array, functions such as .indexOf may "spill" or "overflow" over multiple items. Sequences spill, while Arrays do not. There is a subtle difference in the API between the two.

sequence.indexOf(sequence, [offset]);
array.indexOf(item, [offset]);

When using .indexOf on a sequence you give it another sequence. indexOf does not look for just a single item, but a sequence of items within that sequence. Contrasted to this, when using .indexOf on an array it ONLY looks for a single item and the search is unaffected by adjacent items.

This is apparent from how "foobarbaz".indexOf("bar"); returns the index of "bar" despite the fact that 'b', 'a', and 'r' are 3 separate units within this 9-unit sequence (in this case 1 unit being 1 character). In contrast, [1,2,3,4,5].indexOf([2,3,4]); does NOT return the location of the 2, 3, and 4 inside the array. The reason is that indexOf on an array is a single-item operation; it does not spill the lookup over into the following items.
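The difference can be checked directly in any JavaScript engine. This short sketch uses only the standard String and Array methods; the variable names are just for illustration.

```javascript
// A sequence search matches a run of adjacent units.
var inString = "foobarbaz".indexOf("bar");

// An array search compares each single item against the argument, so a
// freshly created [2, 3, 4] is never found inside [1, 2, 3, 4, 5].
var inArray = [1, 2, 3, 4, 5].indexOf([2, 3, 4]);

console.log(inString); // 3
console.log(inArray);  // -1
```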

Pushing and Popping

Another point, which does not get emphasised because strings are immutable in JavaScript and thus don't need methods to mutate them as Arrays do, is the semantics of .push, .pop, etc.

.pop() and .shift() remove ONE item from an array and return it.

Likewise, given one argument, .push() and .unshift() add ONE item to an array.

The key point here is that [1,2,3].push([4,5,6]); does NOT turn the array into [1,2,3,4,5,6]; it just adds [4,5,6] as a sub-array, giving [1,2,3,[4,5,6]].

You can give multiple arguments to these methods, but then you are no longer working with your lists in the same way.

There is another name which does fit this kind of operation: "append" (side note: Wrench.js adds .append to the array type). Using [1,2,3].append([4,5,6]); DOES push 4, 5, and 6 onto the array, creating the array [1,2,3,4,5,6].
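A minimal sketch of the distinction follows. The standalone append helper is hypothetical (its name and signature are assumptions made for illustration, not Wrench.js's actual implementation); standard arrays have no .append method.

```javascript
// Hypothetical helper: push every item of `items` onto `target` individually.
function append(target, items) {
    Array.prototype.push.apply(target, items);
    return target;
}

var pushed = [1, 2, 3];
pushed.push([4, 5, 6]); // adds ONE item, which happens to be an array

var appended = append([1, 2, 3], [4, 5, 6]);

console.log(pushed.length);   // 4
console.log(appended.length); // 6
```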

ByteArray?

From the discussions in the ServerJS mailing list I do not recall anyone saying "[I] want to mutate a [sequence] with the Array api". I only recall the statements "[I] want a way to mutate a [sequence]" and "mutable, like an Array". The decision to use the name "ByteArray" had nothing to do with wanting an Array type sequence. It had to do with wanting to mutate a sequence, and Arrays being the other type of list.

Looking into prior art the Array api does not fit. There is a more relevant term used in computing, the buffer.

Prior art

Java's java.lang.StringBuffer is a very good reference for prior art. It is made for Strings rather than bytes, but nonetheless it is an API designed solely for the purpose of mutating a string, not one designed for one purpose and hacked to suit another.

The StringBuffer works by using .append() and .insert() to add strings and grow the buffer. .delete() removes portions of the buffer, .indexOf() and .lastIndexOf() can search, .replace() and .reverse() are available, .length() shows the length of the data itself, .capacity() shows the current amount of memory allocated, and .substring() can grab a substring from the StringBuffer.

The API

The API for this spec defines two new classes: Blob (Flusspferd, Google, and jslibs have all used this name; it is a fairly long-standing name and normally works similarly) and Buffer.

It is up to an implementation whether to make Blob and Buffer native global objects or to seclude them inside a binary module. Whether they are made global or not, if the implementation implements require() then require('binary'); must return an object containing Blob and Buffer as keys, even if the binary module is simply a module containing exports.Blob = Blob; exports.Buffer = Buffer;.

Blob

Blob is the binary counterpart to String; it has a slightly different API but many similarities. A Blob is an immutable representation of a sequence of 8-bit bytes.

Most of the blob methods work on blobish data, rather than flat blobs. This means that the argument is treated as if it were passed through Blob(), thus .indexOf(255); is the same as if you had done .indexOf(Blob(255)), so you do not need to explicitly convert everything into a blob.

[new] Blob();: Construct an empty blob
[new] Blob(number);: Construct a single unit blob, converting the number to a byte. If the value is outside the range 0..255, is not a number, or is not an integer (has a decimal point) a TypeError should be thrown.
[new] Blob(arrayOfNumbers);: Construct a blob the same length as the array, converting numbers 0..255 into bytes. If any item is outside that range, is not a number, or is not an integer (has a decimal point) a TypeError should be thrown.
[new] Blob(blob);: Passes the blob through.
[new] Blob(string, toCharset);: Construct a new blob with the binary contents of a string. The string will be encoded from the native UTF-16 charset into the charset specified by the toCharset argument and represented in the new blob in 8bit bytes.
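The validation rule for numeric constructor arguments can be sketched as below. toByte and bytesFromArray are hypothetical helper names, not part of this proposal, and a plain array of numbers stands in for the blob's storage.

```javascript
// Hypothetical helper: accept one value as an 8-bit byte or throw a TypeError.
function toByte(value) {
    if (typeof value !== "number" || Math.floor(value) !== value ||
            value < 0 || value > 255) {
        throw new TypeError("Not an integer in the range 0..255: " + value);
    }
    return value;
}

// Hypothetical helper: validate every item, as Blob(arrayOfNumbers) is described.
function bytesFromArray(numbers) {
    var bytes = [];
    for (var i = 0; i < numbers.length; i++) {
        bytes.push(toByte(numbers[i]));
    }
    return bytes;
}
```

Under this sketch bytesFromArray([0, 255]) succeeds, while values such as 256, 1.5, NaN, or the string "1" raise a TypeError.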

blob.length;: Returns the length of the blob. This is immutable.

blob[index]; // Optional

blob.byteAt(index);

blob.valueAt(index);

@showofhands (ashb suggests .byteAt could return Number (byte) instead of a single unit blob; .valueAt would still return blob so that abstract code still works)

Extracts a single byte from the blob and returns a new blob object containing only it. Note that the blob[i] form is optional; implementations may choose to exclude support for it. Ideally this should follow support for string[i]: if the interpreter being used supports string[i], an implementation is expected to attempt to support blob[i] as well.

blob.indexOf(blob, offset=0);
blob.lastIndexOf(blob, offset=0);: Returns the index within the calling blob object of the first or last (depending on which method is used) occurrence of the specified value, or -1 if not found.

blob.concat(otherBlob, ...);: Combines the content of multiple blobs together and returns a new blob.

blob.slice(offset, length);: Extracts a section of the blob and returns a new blob containing it as the contents. (This should behave the same as string.slice and array.slice)

blob.split();
blob.split(separator);
blob.split(separator, limit);: Splits the blob based on a sequence of bytes ({0 0 0 255 0 0} split by 255 would become [{0 0 0}, {0 0}]) and returns an array of blobs. This is the same as string.split except it does not support regular expressions. Like string.split this supports sequences of more than one unit (ie: You may split {0 0 255 0 0 255 3 0} by the blob {255 0} and get [{0 0}, {0 255 3 0}])
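The two split examples above can be reproduced with a small sketch. Plain arrays of numbers stand in for blobs, indexOfBytes and splitBytes are hypothetical names, and the limit argument is left out. The proposal does not say what an empty separator should do, so the sketch simply rejects it.

```javascript
// Hypothetical helper: find `needle` inside `bytes`, starting at index `from`.
function indexOfBytes(bytes, needle, from) {
    for (var i = from; i + needle.length <= bytes.length; i++) {
        var matched = true;
        for (var j = 0; j < needle.length; j++) {
            if (bytes[i + j] !== needle[j]) { matched = false; break; }
        }
        if (matched) return i;
    }
    return -1;
}

// Hypothetical helper: split `bytes` on every occurrence of `separator`.
function splitBytes(bytes, separator) {
    if (separator.length === 0) {
        throw new Error("Empty separator is not handled by this sketch");
    }
    var parts = [];
    var start = 0;
    var at;
    while ((at = indexOfBytes(bytes, separator, start)) !== -1) {
        parts.push(bytes.slice(start, at));
        start = at + separator.length;
    }
    parts.push(bytes.slice(start)); // whatever follows the last separator
    return parts;
}
```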

blob.toBlob([fromCharset, toCharset]);: If called with no arguments, returns the same blob. If called with two charset arguments, transcodes the data from one charset to the other and returns the data as a new blob. Note that if a single argument is passed to this method it should throw a TypeError, to prevent gotchas where someone runs .toBlob(charset) on a blob instead of on a string, where it is relevant.

blob.toString();: Returns a debug representation like "[Blob length=2]", where 2 is the length of the blob. Alternative debug representations are valid too, as long as (A) this method will never fail, (B) the length is included, (C) It is not only the representation of an implicitly converted string.

blob.toString(fromCharset);: Converts the binary data in the blob from the charset specified by fromCharset to the native UTF-16 charset and returns a new string with that content.

blob.toArray();: Returns an array containing the bytes as numbers.
blob.toArray(charset);: Returns an array containing the decoded Unicode code points.

blob.integerAt(offset, size=1, signed=false, networkEndian=false);: Extracts an integer out of a blob. Arguments control the byte size extracted, whether the number is signed or unsigned, and whether or not the bytes are in network-endian order. The default is to return a single unsigned byte in the form of an integer in the range 0..255.
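One possible reading of integerAt is sketched below over a plain array of bytes. Several points are assumptions rather than part of the proposal: networkEndian=false is taken to mean little-endian, signed values are taken to be two's complement, and an out-of-range read throws a RangeError. The function name matches the method only for readability.

```javascript
// Hypothetical stand-in for blob.integerAt, reading from an array of bytes.
function integerAt(bytes, offset, size, signed, networkEndian) {
    size = size || 1;
    if (offset < 0 || offset + size > bytes.length) {
        throw new RangeError("Read past the end of the data");
    }
    var value = 0;
    for (var i = 0; i < size; i++) {
        // Visit bytes from most to least significant. Network order stores
        // the most significant byte first; little-endian stores it last.
        var index = networkEndian ? offset + i : offset + size - 1 - i;
        // Multiply rather than shift so sizes above 4 bytes do not wrap at 32 bits.
        value = value * 256 + bytes[index];
    }
    if (signed && value >= Math.pow(2, 8 * size - 1)) {
        value -= Math.pow(2, 8 * size); // two's complement
    }
    return value;
}
```

Because JavaScript numbers are doubles, this approach is only exact for sizes up to 6 bytes.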

blob.floatAt(offset, size);: Extracts a float out of a blob. This method requires a size argument of either 4 (floats) or 8 (doubles) to extract a number and should generate a TypeError if passed invalid params.
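A sketch of floatAt using the standard DataView type, again over a plain array of bytes. The proposal does not specify a byte order for floats; big-endian is assumed here purely for illustration.

```javascript
// Hypothetical stand-in for blob.floatAt, reading from an array of bytes.
function floatAt(bytes, offset, size) {
    if (size !== 4 && size !== 8) {
        throw new TypeError("size must be 4 (float) or 8 (double)");
    }
    if (offset < 0 || offset + size > bytes.length) {
        throw new TypeError("offset and size fall outside the data");
    }
    var view = new DataView(new ArrayBuffer(size));
    for (var i = 0; i < size; i++) {
        view.setUint8(i, bytes[offset + i]);
    }
    // DataView reads big-endian by default; the byte order is an assumption.
    return size === 4 ? view.getFloat32(0) : view.getFloat64(0);
}
```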

blob.stringAt(offset, size, fromCharset=UTF-16);: Extracts a string from a blob. This method is similar to calling blob.slice(offset, size).toString(fromCharset||"UTF-16");.

blob.toSource();: This method is optional; it should be included if the interpreter being used supports .toSource() on its various objects and types. Returns a representation of the blob in the format "(Blob([]))" or "(new Blob([]))". If the blob has content in it the string should contain integers 0..255 representing the blob such that if evaluated (calling the correct Blob function) would return a blob with the same content.

Buffer

Buffer is a generic buffer which may act on either binary or textual data using the same API. (Serving two purposes in one class is derived from the MonkeyScript Buffer. The .text = bool; idiom came from the .text property in MonkeyScript's Stream class; the idea is to make it as easy as possible to create a buffer of the same type without needing to use a ternary. This point is up for discussion on the list.)

(Note: Whether to allow type conversion of a buffer with no content or not could be discussed) A Buffer may initially be created as untyped without any length, or may be given a type and/or length when created. A Buffer which has a type assigned to it (binary or text) and has content within it (length > 0) may not be converted to another type and should throw an error if attempted.

A buffer gains a type when .text is assigned a true or false value, or it inherits the data type of the first piece of data added to the buffer. After it is typed a buffer should throw an error if someone attempts to insert a different type of data into the buffer (If you want to insert a string into a binary buffer, you should convert it to a blob first yourself; automatic type conversion in these cases would cause gotchas).
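The type-inheritance rule can be sketched as follows. SketchBuffer is a hypothetical stand-in, not the proposed Buffer: strings count as text, plain arrays of byte values stand in for blobs, and .text is an ordinary property, so the rule about reassigning .text on a non-empty buffer is not modelled.

```javascript
// Hypothetical stand-in for the proposed Buffer.
function SketchBuffer() {
    this.text = undefined; // undefined = untyped, true = text, false = binary
    this.units = [];
}

SketchBuffer.prototype.append = function (data) {
    var isText = typeof data === "string";
    if (this.text === undefined) {
        this.text = isText; // inherit the type of the first piece of data
    } else if (this.text !== isText) {
        throw new TypeError("Cannot mix text and binary data in one buffer");
    }
    for (var i = 0; i < data.length; i++) {
        this.units.push(isText ? data.charAt(i) : data[i]);
    }
    return this;
};

var buf = new SketchBuffer();
buf.append("abc");     // the buffer becomes a text buffer
// buf.append([1, 2]); // would throw a TypeError: binary data in a text buffer
```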

Buffers may implement smart resizing in the background (ie: padding arrays or whatnot to sizes to avoid reallocating on each insert) but information on this is not available to the JavaScript API.

new Buffer();: Construct an empty buffer with no set type
new Buffer([String or Blob], [len]);: Construct a new buffer. The constructor may accept the String function or the Blob function as a hint of what type to set on the buffer. The constructor may also accept a length argument to set the default length of the buffer.
new Buffer(sequence);: Construct a buffer based on a string or a blob. The Buffer will be the same length as the sequence, inherit the same type, and start off with the same content as the sequence.

buf.length;
buf.length = len;: Get or set the length of the buffer (for binary buffers this is the number of bytes, for text buffers the number of characters). When length is set the buffer is dynamically resized. If shrunk it is truncated to size, discarding items from the end. If grown the buffer is padded with 0 bytes for binary, and '\0' (null characters) for text.
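The resize behaviour can be illustrated on a plain array of units. resizeUnits is a hypothetical helper that returns a resized copy; a real buffer would resize itself in place.

```javascript
// Hypothetical helper: resize an array of units the way buf.length = len is described.
function resizeUnits(units, len, isText) {
    var resized = units.slice(0, len);   // shrinking discards items from the end
    while (resized.length < len) {
        resized.push(isText ? "\0" : 0); // growing pads with null units
    }
    return resized;
}
```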

buf.text;
buf.text = bool;: Get or set the type of the buffer. Binary buffers return false, text buffers return true, and buffers which have not been assigned a type yet return undefined. This will throw a TypeError if you try to set a type on a buffer that already has a type and has a length greater than 0.

buf[index];
buf.valueAt(index);: Returns a string or blob representing the unit at a specified index.

buf.append(data);: Append a chunk of data to the end of the buffer growing it by data.length.

buf.insert(data, index);: Insert a chunk of data into a buffer growing it by data.length and shifting the data to the right of the specified index towards the end of the buffer.

buf.clear(offset, length);: Zero out a section of the buffer. Binary buffers have bytes replaced with 0 bytes and text buffers have characters replaced with '\0' (null characters).

buf.remove(offset, length);: Remove a section of the buffer starting at offset and continuing for length units, shrinking it by length.

buf.splice(offset, length, data, ...);: Remove a section of the buffer and insert chunks of data starting from the place it was removed from.
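The intended results of insert, remove, and splice can be shown on an immutable string. The helper names are hypothetical, and each returns a new string, whereas the proposed Buffer methods would mutate the buffer itself.

```javascript
// Hypothetical helpers: each returns the content a text buffer would hold afterwards.
function insertAt(seq, data, index) {
    return seq.slice(0, index) + data + seq.slice(index);
}

function removeAt(seq, offset, length) {
    return seq.slice(0, offset) + seq.slice(offset + length);
}

function spliceAt(seq, offset, length /*, data, ... */) {
    // Every argument after `length` is a chunk inserted where the section was removed.
    var chunks = Array.prototype.slice.call(arguments, 3).join("");
    return seq.slice(0, offset) + chunks + seq.slice(offset + length);
}

console.log(insertAt("abef", "cd", 2));            // "abcdef"
console.log(removeAt("abcdef", 1, 2));             // "adef"
console.log(spliceAt("abcdef", 1, 2, "X", "YZ"));  // "aXYZdef"
```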

buf.slice();
buf.slice(start);
buf.slice(start, end);: Extract a subsection of the buffer and return it as a new sequence. (Behaves the same as the string and blob counterparts)

buf.split();
buf.split(separator);
buf.split(separator, limit);: Splits the buffer based on a sequence and returns an array of strings or blobs. (When used on text buffers an implementation may or may not choose to support regular expressions)

buf.indexOf(sequence, offset=0);
buf.lastIndexOf(sequence, offset=0);: Returns the index within the calling buffer object of the first or last (depending on which method is used) occurrence of the specified value, or -1 if not found.

buf.reverse();: Causes the sequence to be replaced by the reverse of the sequence.

buf.valueOf();

Return the non-mutable sequence for the buffer.

In binary mode this returns a Blob which matches the contents of the buffer.
In text mode this returns a String which matches the contents of the buffer.

String extensions

These extensions may be optional; however, it would be ideal if implementations added these prototypes to the standard objects. Implementations may choose how to implement these (load binary themselves beforehand, prototype methods that use require('binary') within them, etc.)

string.toBlob(toCharset);: Converts a UTF-16 string into the specified charset and returns a blob containing that binary data.

string.valueAt(index);: An alias for string.charAt(index);. The point of this prototype is so that (string or blob).valueAt(index); may be used independently of whether the sequence is a string or a blob. This allows strings to keep .charAt and blobs to keep .byteAt without returning unintuitive results, while still allowing code to work abstractly without relying on things like (str or blob)[index], which may not be implemented in some engines.
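A minimal sketch of the string side of this alias, together with a generic function that relies only on .length and .valueAt. lastUnit is a hypothetical name used for illustration.

```javascript
// The alias described above, guarded so an existing implementation is kept.
if (typeof String.prototype.valueAt !== "function") {
    String.prototype.valueAt = function (index) {
        return this.charAt(index);
    };
}

// Works on any sequence exposing .length and .valueAt(), string or blob alike.
function lastUnit(sequence) {
    return sequence.valueAt(sequence.length - 1);
}

console.log(lastUnit("abc")); // "c"
```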

General requirements

Any operation that requires encoding, decoding, or transcoding among charsets may throw an error if that charset is not supported by the implementation. All implementations MUST support "us-ascii" and "utf-8".

Charset strings are as defined by IANA http://www.iana.org/assignments/character-sets.

Charsets are case insensitive.
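One way an implementation might apply these rules is sketched below. The table and the requireCharset name are assumptions for illustration; the sketch lists only the two mandatory charsets and ignores IANA aliases.

```javascript
// Hypothetical lookup table; a real implementation would list every charset it supports.
var SUPPORTED_CHARSETS = { "us-ascii": true, "utf-8": true };

function requireCharset(name) {
    var key = String(name).toLowerCase(); // charset names are case insensitive
    if (!Object.prototype.hasOwnProperty.call(SUPPORTED_CHARSETS, key)) {
        throw new Error("Unsupported charset: " + name);
    }
    return key;
}
```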

Notes

A high priority in this proposal was String/Blob interoperability. While implicit string conversion was avoided, it was important to make sure there was an API which could abstractly work with a sequence of data, ignorant of whether the data was a string or a blob.
- .valueAt was added to string so that there was a common method for both blobs and strings without implementing a counterintuitive .charAt on blob. Note that as a result you can actually check .charAt vs .byteAt and string will only have .charAt, while blob will only have .byteAt.
- Buffer was made independent of whether the data is binary or text. To avoid implicit string conversion TypeErrors are thrown when giving data of the incorrect type to a buffer. But you are still able to write code using buffer that works on either strings or blobs and doesn't care which mode it is in.
  - Note how Buffer accepts String or Blob to determine its data type. You could actually write code like var buf = new Buffer(sequence.constructor); and create a buffer based on the type of a sequence without checking what it is.
- While same-type rules apply, .slice can be used abstractly on both strings and blobs (arrays too, actually), and the same goes for .concat, .length, .split (without regex), and .indexOf/lastIndexOf.
Some experimentation with .valueOf needs to be done. .valueOf has type hinting (the first argument is a string hint of what type it may be converted to; operators like > and < make use of it, as do a few other cases). It would be nice to see if it's possible to use the native < and > operators to compare blobs on their binary order.
For now I've ignored things like .eq/equals, .lt, .gt, etc. Do note that Rhino actually implements .equals on String already. Also, if we do add these things to blob we should probably implement the same on string.
Aristid Breitkreuz notes Buffer could be moved to IO.