E4X

This page is for material helpful to implementors of the E4X specification. As such, it is not intended to be a tutorial on E4X. For more information intended for users of E4X, see the Mozilla Developer Center E4X page.

ECMA-357 2nd edition, the current version of the standard, can be downloaded from ECMA's site.

Whitespace in literals

I had a question about the handling of whitespace in E4X code, when XML.ignoreWhitespace is false. This came up when attempting to pass the Mozilla conformance test relating to the normalize() method (LXR version of test).

The testcase is buggy -- see bug 368459.

The conformance test constructs an XML object (as specified by ECMA-357 11.1.4). According to 11.1.4, we're using the production XMLElement, which contains XMLElementContent, which in turn contains { Expression }, XMLMarkup, XMLText, and/or XMLElement. XMLText (back in 8.3) consists of SourceCharacters, minus the { and < characters. SourceCharacters is defined as a sequence of one or more SourceCharacter, which refers back to ECMA-262 5.1, which defines it as "any Unicode character."

According to 13.4.3.4, "If ignoreWhitespace is true, insignificant whitespace characters are ignored when processing constructing new XML objects." Nothing is said about false, but presumably those characters are not ignored in that case.

Do note that, even if it's not explicitly stated (it might be, but I don't feel like rereading the E4X spec intro), the prose here is intended more as informative than normative. (In other words, the numbered steps are the better guide here.) The only place XML.ignoreWhitespace is mentioned is in MapInfoItemToXML, which is pretty clear about its effects.

Since XMLText can contain whitespace (any Unicode character), I'd say this means that an XML literal initialized when XML.ignoreWhitespace = false can contain whitespace and therefore will contain text nodes consisting of only whitespace characters.

Correct.
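
To make the distinction concrete, here is a minimal sketch in plain JavaScript (not E4X, and not taken from any implementation). It models the children of a parsed element as an array and shows where whitespace-only text nodes would survive. The node shape and the names isWhitespaceOnly and filterChildren are hypothetical; the whitespace set (space, tab, CR, LF) is the one I believe 13.4.3.4 gives, so check it against the spec.

```javascript
// Hypothetical model: a child is { kind: "text", value } or { kind: "element", name }.
// Whitespace characters assumed from ECMA-357 13.4.3.4: space, tab, CR, LF.
function isWhitespaceOnly(s) {
  return /^[\u0020\u0009\u000D\u000A]*$/.test(s);
}

// Drop whitespace-only text nodes only when ignoreWhitespace is true.
function filterChildren(children, ignoreWhitespace) {
  if (!ignoreWhitespace) {
    return children.slice(); // every text node is kept, including "\n  "
  }
  return children.filter(function (c) {
    return !(c.kind === "text" && isWhitespaceOnly(c.value));
  });
}

// Children of a literal shaped like <a>, newline, two spaces, <b/>, newline, </a>
var parsed = [
  { kind: "text", value: "\n  " },
  { kind: "element", name: "b" },
  { kind: "text", value: "\n" }
];

var kept = filterChildren(parsed, false);    // whitespace-only text nodes remain
var stripped = filterChildren(parsed, true); // only the <b/> element remains
```

Under this model, a test that expects no whitespace-only text nodes after the initializer runs is only describing the ignoreWhitespace = true case.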

If that's right, the conformance test is wrong as I see it, because it deletes any nodes consisting solely of whitespace (which, as far as I can tell, means that they did not exist after the initializer executed). And as I see it, when ignoreWhitespace is false, those text nodes exist in the resulting document.

One thing I don't like about my interpretation is that platform effects come into play. As I understand it, ECMAScript accepts CR, LF, or a combination of the two as a line terminator (along with some more obscure terminators with which I am not familiar; see ECMA-262 7.3), so our XML document would contain CRLF if the JavaScript was composed on a DOS platform, CR if composed on Mac OS 9, and LF if composed on *nix.

Normal linebreak processing doesn't occur in E4X literals -- a CR is a CR, an LF is an LF, etc. Yes, though, the platform on which a script is written thus does affect the script's semantics when using E4X literals containing line breaks. Those who care might often not need to adapt (you might want CRLF endings precisely because you're on Windows), and since whitespace handling isn't a typical task for XML-using programmers, it probably wasn't worth the effort to invent some hack to keep whitespace from affecting semantics.

As LF is the most XML-ish, my implementation currently converts the underlying terminators to LF (I think; I should check). I think most XML parsers do this as well, but I should check that as well.

I seem to recall something about LF never appearing in an XML document unless inserted with an entity, but I could be wrong.
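
For comparison, XML 1.0 section 2.11 has a conforming parser translate each CRLF pair and each lone CR to a single LF before parsing, so it is a literal CR, not LF, that reaches the application only through a character reference such as &#13;. A minimal sketch of that translation in plain JavaScript follows. The function name normalizeLineEnds is hypothetical, and nothing here claims that ECMA-357 requires this step for E4X literals.

```javascript
// Hypothetical helper modeling XML 1.0 section 2.11 end-of-line handling:
// CRLF and lone CR both become a single LF.
function normalizeLineEnds(text) {
  return text.replace(/\r\n?/g, "\n");
}

var dos = "<a>\r\n  <b/>\r\n</a>";  // composed with CRLF line ends
var mac = "<a>\r  <b/>\r</a>";      // composed with CR line ends
var unix = "<a>\n  <b/>\n</a>";     // composed with LF line ends

// After normalization, all three inputs equal the LF-only form.
var sameAfter = normalizeLineEnds(dos) === unix &&
                normalizeLineEnds(mac) === unix &&
                normalizeLineEnds(unix) === unix;
```

An implementation that applied this step to literal text would remove the platform dependence described above, at the cost of no longer treating "a CR as a CR."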

So what's the right answer here? --Inonit 11:58, 23 October 2006 (PDT)

Anyway, the problem you describe is with the testcase and is being handled in bug 368459. --Waldo 22:02, 27 January 2007 (PST)