From MozillaWiki

Feature HTML 5 Parser

  • Development Status: - In progress (date)
  • Feature Testing: - In progress (date)
  • Team: Developer Henri Sivonen (hsivonen), Matt Evans (mevans)
  • Tracking Bugs: Bug 373864 - (html5-parsing) Replace HTML parser with an HTML5 parser

Feature Description

Replace the existing HTML parser with the new HTML5 parser.

Feature Release Readiness Assessment

The table below provides a top-level go/no-go assessment of whether the feature is release ready for the given milestone. Each milestone link references a section below that discusses the criteria and evaluation that went into the QA go/no-go decision.

Milestone Assessment
#Beta1 N/A
#Beta2 N/A
#Beta3 N/A


Feature Documentation

Item Description Status
#Project_Wiki Wiki Links to all feature related entries
#Developer_Links (blogs) Developer links to feature related sites
#Other_Docs Web links to feature related sites
#Developer_QA_Review Details from developer and qa discussions regarding feature test strategies and issues.


Feature Bug Management

Item Description Status
#Bug_Tracking Top level bugs tracking feature
#Bug_Verification Feature bugs that need verification
#Bug_Triage Links triage bug tasks


Feature Test Items

The table below provides a breakdown of all feature items that should be covered and how they will be tested. Not all items will be covered by internal QA team members. It is important to list what should be covered. If it is not covered, list it as not covered.

Note: not all items listed below will apply for a given feature

Test Item Description Covered By Status
Item 1 Item 1 Description Developer Tests
Item 2 Item 2 Description Beta tester exposure
#Localization Feature localization
#Accessibility Feature accessibility
#Plugins Plugins compatibility
#Addons Addons compatibility
#Topsites Top internet sites compatibilities


Feature Tests

Automated Tests

Item Description Status
#Developer_Tests Links to automated developer tests
#Mozmill_Tests Links to automated mozmill feature test cases

Manual Tests

Item Description Status
#Smoke_Tests link to smoke tests
#Regression_Tests link to BFT and/or regression tests
#Functional_Tests link to FFT and/or complete functional tests


Community Test Events

Item Description Status
#Testdays Links to test day event results for feature
#Bugdays Links to bug day event results for feature
#Meetups Links to Meetup events for feature


Feature Documentation Details

Project Wiki

  • Provide link to all project related wikis


Developer Links

  • Provide links to all feature related developer links to blogs and other internet sites


Other Docs

  • Provide links to all feature related developer links to blogs and other internet sites


Developer QA Review

The QA person responsible for the feature should hold a formal interview with the lead developer or feature champion. Below are questions that should be asked in the interview:

Do we have automated tests for the feature?


The Java version of the HTML5 parser is tested using the html5lib test suite. At present, all tokenizer test failures are cases where the test suite assumes implementation details of html5lib itself, and the two tree builder failures are deliberately left unfixed because it's not clear that the spec is optimal on the point tested.

The encoding tests from html5lib aren't being run, and I'm not sure if those tests are up-to-date.

The Java to C++ translation is believed to preserve the properties of the parser that the html5lib tree builder and tokenizer tests test, but it is, of course, possible for this belief to be incorrect.

In Mochitest, there is a test harness that makes it possible to run html5lib tree builder tests on the in-Gecko C++ version of the parser. However, currently only the first 3 of the 15 tree builder test data files have been imported from upstream into mozilla-central. Fixing this is .
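For anyone importing more of the upstream data files, here is a minimal sketch of a reader for the html5lib tree builder test format. It assumes each test is introduced by a "#data" line followed by "#errors", an optional "#document-fragment", and "#document" sections. The function name and the shape of the returned objects are hypothetical and are not taken from the existing Mochitest harness.

```javascript
// Minimal sketch, not the harness that is in mozilla-central.
// Assumes section markers appear on lines of their own.
function parseTreeConstructionTests(text) {
  const tests = [];
  let current = null;
  let section = null;
  for (const line of text.split("\n")) {
    if (line === "#data") {
      // Every "#data" marker starts a new test case.
      current = { data: [], errors: [], fragment: null, document: [] };
      tests.push(current);
      section = "data";
    } else if (current && line === "#errors") {
      section = "errors";
    } else if (current && line === "#document-fragment") {
      section = "fragment";
    } else if (current && line === "#document") {
      section = "document";
    } else if (current) {
      if (section === "fragment") {
        current.fragment = line; // context element name
      } else {
        current[section].push(line);
      }
    }
  }
  // Drop the blank separator lines that end up at the tail of a test.
  const trimTail = (lines) => {
    const copy = lines.slice();
    while (copy.length > 0 && copy[copy.length - 1] === "") copy.pop();
    return copy;
  };
  return tests.map((t) => ({
    data: t.data.join("\n"),
    errors: t.errors.filter((e) => e !== ""),
    fragment: t.fragment,
    document: trimTail(t.document),
  }));
}
```

A harness could feed `data` to the parser and compare the serialized result against `document` line by line.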

The C++ version of the parser in Gecko doesn't currently expose the tokenizer independently of the tree builder. For this reason, it's not possible to run the html5lib tokenizer tests in the browser directly. In most cases, however, it's possible to transform the tokenizer tests into tree builder tests and run the same test using the browser's built-in parser and with a GWT-compiled JavaScript version of the parser. If a tokenizer test passes as Java and the results from C++ and JavaScript match, the only way to have a false pass would be having the same bug in the Java to C++ translator and in the Java to JavaScript GWT compiler.
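The cross-check described above boils down to serializing the tree from each parser in the same way and comparing the results. The sketch below assumes a simplified, hypothetical node shape (`type`, `name`, `attrs`, `children`, `value`) rather than real DOM nodes, and it approximates the html5lib "| " tree dump layout. It is an illustration of the comparison idea, not the harness attached to the bug.

```javascript
// Serialize a list of simplified nodes into html5lib-style dump lines.
// Node shape is an assumption made for this sketch only.
function serializeTree(nodes, depth = 0) {
  const indent = "| " + "  ".repeat(depth);
  const lines = [];
  for (const node of nodes) {
    if (node.type === "text") {
      lines.push(`${indent}"${node.value}"`);
    } else if (node.type === "comment") {
      lines.push(`${indent}<!-- ${node.value} -->`);
    } else {
      lines.push(`${indent}<${node.name}>`);
      // Attributes are sorted so that source order does not affect the dump.
      const attrIndent = "| " + "  ".repeat(depth + 1);
      for (const key of Object.keys(node.attrs || {}).sort()) {
        lines.push(`${attrIndent}${key}="${node.attrs[key]}"`);
      }
      lines.push(...serializeTree(node.children || [], depth + 1));
    }
  }
  return lines;
}

// Two parser outputs agree if their dumps are identical.
function treesMatch(treeA, treeB) {
  return serializeTree(treeA).join("\n") === serializeTree(treeB).join("\n");
}
```

In a browser harness, one tree would come from the built-in parser and the other from the GWT-compiled parser, each converted to this node shape first.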

Jonathan Griffin implemented a test harness for the tokenizer tests. The patch is attached to bug but it hasn't landed. I suggest unrotting the patch. (Note that the JavaScript version of the parser needs refreshing; I can recompile the parser using GWT so that QA doesn't need to set up the GWT build.)

There are also other HTML parser-relevant tests scattered around and

What do they cover?

The html5lib tokenizer tests cover most of the different tokenizer state transitions.

The html5lib tree builder tests cover interesting cases in the HTML5 tree building algorithm.

The mochitests under content/base/test/ have "smoketest" level coverage for script execution, document.write() and doctype sniffing. These tests were written as part of older Gecko bug fixing and haven't been written systematically for the HTML5 spec.

What do they not cover?

Coverage for encoding sniffing is in theory (I believe bitrotted, but I'm not sure) in the html5lib repository. Those tests aren't run with Gecko, though. Thus, encoding sniffing is an area that lacks proper test coverage. (There's one notable known bug in this area: )

The html5lib tests don't test script execution or document.write() at all.

There are no tests making sure that SVG and MathML features work on the higher level when the DOM has been built by the HTML5 parser. For example, there's no coverage for checking that SMIL animations actually start.

How well do they cover the feature?

The html5lib tokenizer tests are incomplete in their coverage of U+0000 and line breaks in various tokenizer states. Also, they only spot-test named character references. Otherwise, I believe the tokenizer tests are very complete.
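One way QA could close the U+0000 and line break gap is to generate inputs mechanically and then fill in the expected tokens by hand. The sketch below is only a starting point: the context templates, the "@" placeholder and the function name are all made up for illustration, and the objects carry only a description and an input, so they are not complete html5lib tokenizer tests until expected output is added.

```javascript
// Characters named above as having incomplete coverage.
const SPECIAL_CHARS = {
  "U+0000": "\u0000",
  "LF": "\n",
  "CR": "\r",
  "CRLF": "\r\n",
};

// Hypothetical templates; "@" marks where the character is inserted.
const TEMPLATES = {
  "data": "a@b",
  "tag name": "<di@v>",
  "attribute name": "<div cl@ass=x>",
  "attribute value (double-quoted)": "<div class=\"x@y\">",
  "attribute value (unquoted)": "<div class=x@y>",
  "comment": "<!--a@b-->",
  "doctype name": "<!DOCTYPE ht@ml>",
};

// Build one input per (context, character) pair.
function generateTokenizerInputs() {
  const cases = [];
  for (const [context, template] of Object.entries(TEMPLATES)) {
    for (const [label, ch] of Object.entries(SPECIAL_CHARS)) {
      cases.push({
        description: `${label} in ${context}`,
        input: template.replace("@", ch),
      });
    }
  }
  return cases;
}
```

The list of contexts here is far from every tokenizer state; extending `TEMPLATES` is where the real coverage work would be.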

The html5lib tree builder tests have very superficial coverage for the SVG and MathML features. The coverage for the HTML-only parts isn't fully systematic. It has happened that real-world testing has required a change to the parsing algorithm and the test suite hasn't had a test for the old behavior of the algorithm. That is, a test has only been added after the case has been found to be of interest as opposed to coverage having been built systematically for "everything".

If QA writes more tree construction or tokenization test cases, I would like to encourage you to use the html5lib test formats and to contribute the tests under the MIT license to the html5lib project and then pull them to mozilla-central from there. Having everyone contribute to the same test suite has been a huge productivity and compliance boost so far, so it would be good to continue that.

I believe the script execution and document.write() tests cover the relevant area sufficiently, but they do not cover the whole spec systematically. For example, I discovered by reading WebKit bugs instead of by running our tests.

What are the important areas we should focus on?

I think we should focus on real-world site compatibility on one hand (going through "top site" lists navigating deeper than the front page and seeing if stuff breaks) and on SVG features working above the parser layer (that is, checking that selectors match camelCase names right, DOM getters do the right thing, SMIL animations work). It may also be worthwhile to stress nested document.write() order some more.
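For the SVG checks, it may help to have the expected camelCase names at hand when writing assertions. The sketch below lists a small subset of the SVG element names that the HTML5 tree builder case-adjusts; the helper name is hypothetical, and the table is deliberately incomplete, so check the spec for the full list before relying on it.

```javascript
// Subset of the SVG tag name adjustments made by the HTML5 tree builder.
// Not exhaustive; meant as a seed for test expectations.
const SVG_CAMEL_CASE_NAMES = new Map([
  ["animatemotion", "animateMotion"],
  ["animatetransform", "animateTransform"],
  ["clippath", "clipPath"],
  ["foreignobject", "foreignObject"],
  ["lineargradient", "linearGradient"],
  ["radialgradient", "radialGradient"],
  ["textpath", "textPath"],
]);

// Returns the local name a test would expect for an SVG element
// written with any casing in text/html source.
function expectedSvgLocalName(sourceTagName) {
  const lower = sourceTagName.toLowerCase();
  return SVG_CAMEL_CASE_NAMES.get(lower) || lower;
}
```

A browser-side test could then compare an element's `localName`, or the result of a selector lookup by that name, against the value this helper returns.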

About real-world site compat, there are two things I think the QA and triagers of incoming bugs should be particularly aware of:

  1. document.write() only writes to the stream if it is called from a parser-inserted script that is being executed by the parser synchronously with the parse. In other cases, document.write() implies a call to, which blows away the document. These "other cases" include calls from: 'defer' scripts, 'async' scripts, scripts created with createElement() and inserted to the DOM, timeouts, intervals and event handlers. Previously, Gecko only blew away the document if the parser was done and allowed document.write() to insert content into a timing-dependent point in the stream if the parser wasn't done. The HTML5 behavior is like IE's behavior but, it turns out, not exactly. So far, whenever I've seen problems related to document.write(), they have been ad or analytics scripts that do browser sniffing and serve different code to Firefox and IE. The problem manifests as the page going blank and not finishing loading. There is a pending spec bug about mitigating this problem: (See also the b.m.o evangelism bugs linked from that bug.)
  2. The HTML5 parser never reparses due to hitting end of file inside a comment, <title>, <script>, <style>, <xmp> or <textarea>. Previously, browsers have reparsed in that case. Old browsers, when hitting the end of file inside a comment, rewind to the start of the comment and reparse so that '>' ends the comment instead of '-->' ending the comment. Old browsers, when hitting the end of file after <script> ... <!-- ... </script> ... -->, reparse looking for </script> ignoring the <!-- escape. Reparsing is a potential XSS problem and involves complexity, so if at all feasible, it's desirable to get rid of reparsing.
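For triage, the document.write() rule from the first item can be written down as a small decision function. This is a sketch of the rule as stated above, not of Gecko's implementation; the property names and the returned strings are invented for the example.

```javascript
// Models only the rule stated in the list above.
// `caller` describes where the document.write() call comes from.
function documentWriteEffect(caller) {
  // Only a parser-inserted script that the parser is executing
  // synchronously with the parse writes into the stream.
  if (caller.parserInserted && caller.runsSynchronouslyWithParse) {
    return "inserts-into-stream";
  }
  // Every other caller blows away the document.
  return "replaces-document";
}

// Hypothetical descriptions of the callers named in the list.
const CALLERS = {
  inlineParserScript: { parserInserted: true, runsSynchronouslyWithParse: true },
  deferScript: { parserInserted: true, runsSynchronouslyWithParse: false },
  asyncScript: { parserInserted: true, runsSynchronouslyWithParse: false },
  createElementScript: { parserInserted: false, runsSynchronouslyWithParse: false },
  timeoutOrInterval: { parserInserted: false, runsSynchronouslyWithParse: false },
  eventHandler: { parserInserted: false, runsSynchronouslyWithParse: false },
};
```

When a report describes a page going blank, mapping the offending script to one of these callers is a quick first check.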

I'm aware of two failures on major sites that involve not reparsing:

Since the HTML5 parsing algorithm doesn't reparse, it currently closes comments a bit more eagerly than old parsers. This causes I'm planning on landing a fix for that bug. Afterwards, it's very important to find out if this causes more breakage elsewhere than it solves on CNN.
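To make the end-of-file comment case from the list above concrete, here is a toy model of the two behaviors. It assumes the input starts with "<!--" and ignores every other tokenizer rule for comments, so it only illustrates the difference between reparsing and not reparsing; both function names are made up.

```javascript
// Toy model only. `src` is assumed to start with "<!--".
// Each function returns the comment text and the remaining input.

// No reparsing: if "-->" never appears, the rest of the file is the comment.
function scanCommentNoReparse(src) {
  const body = src.slice(4);
  const end = body.indexOf("-->");
  if (end === -1) {
    return { comment: body, rest: "" };
  }
  return { comment: body.slice(0, end), rest: body.slice(end + 3) };
}

// Old behavior as described above: on end of file inside the comment,
// rewind and let the first ">" end the comment instead.
function scanCommentWithReparse(src) {
  const body = src.slice(4);
  const end = body.indexOf("-->");
  if (end !== -1) {
    return { comment: body.slice(0, end), rest: body.slice(end + 3) };
  }
  const gt = body.indexOf(">");
  if (gt === -1) {
    return { comment: body, rest: "" };
  }
  return { comment: body.slice(0, gt), rest: body.slice(gt + 1) };
}
```

For a page containing an unterminated comment, the first function swallows everything after "<!--", while the second lets content after the first ">" become visible again, which is the kind of difference to look for in site compatibility reports.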

In order to avoid reparsing scripts, the HTML5 parsing algorithm does some carefully researched (by Opera QA) trickery to guess if </script> after <!-- means to close the script or not. This is probably the riskiest change in the HTML5 parsing algorithm compared to traditional browser behavior. Currently, I am aware of this breaking one banking site: . I *really* hope we don't need to change the parser here. However, in case the solution isn't good enough for real-world compatibility, I'd like to know sooner rather than later. That's why it would be good for QA to be aware of this issue so that it can be recognized if it shows up. (I can't say how much exactly would have to break to warrant redesigning this part of the parsing algorithm.)

What are the dependencies?

I'm not sure I understand the question, but the main dependency is round-tripping findings as spec feedback and implementing the spec changes, which in practice means comparing notes with Hixie, Opera QA and Chrome implementors.

What is our comfort level with this feature in its current state?

Very comfortable, except I'm not entirely comfortable with the level of evidence about the real-world compatibility of the new non-reparsing script and comment closing behavior.

What feedback would you like from QA?

I'm most interested in data showing if late-breaking changes to the parsing algorithm cause more problems than they fix. Unfortunately, this isn't as much a thing QA can answer directly, but it's a problem QA can hopefully recognize from incoming beta tester reports that haven't made it to the right bugzilla component. For example, caused


Feature Release Readiness Assessment Details




Feature Bug Management Details

Bug Tracking

  • Top level bugs tracking feature. Include any relevant bug queries that are helpful for tracking feature status.
Query Name Description
bugzilla query url link query description


Bug Verification

  • Feature bugs that need verification


Bug Triage

  • Bug triage information


Feature Test Items Details


Localization

  • Details of feature localization test requirements


Accessibility

  • Details of feature accessibility test requirements


Plugins

  • Details of plugins compatibility test requirements


Addons

  • Details of addons compatibility


Topsites

  • Details of top internet sites test requirements


Feature Tests Details

Automated Tests Details

Developer Tests

  • Links to automated developer tests


Mozmill Tests

If a particular feature needs manual tests which should also be covered by Mozmill tests please add the "[mozmill-test-needed]" whiteboard entry to the feature implementation or regression bug.

List of Mozmill Tests:

  • Links to automated mozmill feature test cases


Manual Tests Details


Smoke Tests

  • links to litmus smoke tests or description


Regression Tests

  • links to litmus BFT and/or regression tests description


Functional Tests

  • links to litmus FFT and/or complete functional tests description


Community Test Events Details


Testdays

  • Links to test day event results for feature


Bugdays

  • Links to bug day event results for feature


Meetups

  • Links to Meetup events for feature