Security/Reviews/Firefox4/HTML5 Parser Security Review

Overview

Describe the goals and objectives of the feature here.

  1. Make HTML parsing well-defined commodity functionality that everyone does the same way instead of having product-specific magic.
  2. Enable the use of SVG and MathML in text/html.
  3. Replace Gecko's HTML parser with something better understood and more maintainable.
  4. Move HTML parsing off the main thread in the hope of improving responsiveness.

Background links

Security and Privacy

  • Is this feature a security feature? If it is, what security issues is it intended to resolve?

The HTML5 parser is not a security feature. However, the HTML5 parsing algorithm attempts to have these defense-in-depth security features:

  1. U+0000 is not ignored where script or style sheet data can occur; it is turned into U+FFFD instead. Suppose a security gatekeeper is blacklist-based (which it shouldn't be; everyone should use whitelist-based gatekeepers) and an attacker tries to fool it by injecting U+0000 into blacklisted identifiers. Because U+0000 is turned into U+FFFD rather than dropped, the browser doesn't treat the parsed identifiers as those dangerous identifiers.
  2. Forcing a premature end of file doesn't change the executability of a given piece of the page compared to the situation where a premature end of file hasn't been forced. This is achieved by not retokenizing in a different mode if the EOF is seen inside [R]CDATA text or inside a comment.
  3. If the EOF occurs within a token, the incomplete token is discarded. This way, a premature EOF can't truncate the code in event handler attributes.
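
The U+0000 handling in item 1 can be illustrated with a minimal sketch. The gatekeeper, the payload, and the function names below are hypothetical and exist only for illustration; they are not Gecko code.

```python
REPLACEMENT_CHARACTER = "\ufffd"


def drop_nul(text: str) -> str:
    """Assumed 'ignore U+0000' behaviour: the character silently disappears."""
    return text.replace("\x00", "")


def replace_nul(text: str) -> str:
    """Behaviour described above: U+0000 becomes U+FFFD in script/style data."""
    return text.replace("\x00", REPLACEMENT_CHARACTER)


def naive_blacklist_allows(text: str) -> bool:
    """Hypothetical blacklist-based gatekeeper that looks for one literal identifier."""
    return "expression" not in text.lower()


# Hypothetical payload: a NUL splits the blacklisted identifier.
payload = "width: expr\x00ession(alert(1))"

gatekeeper_fooled = naive_blacklist_allows(payload)          # the blacklist misses it
dangerous_if_dropped = "expression" in drop_nul(payload)     # identifier is restored
dangerous_if_replaced = "expression" in replace_nul(payload) # identifier stays broken
```

With dropping, the identifier that the gatekeeper failed to see is reassembled after parsing; with replacement, it remains split by U+FFFD.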

  • What potential security issues in your feature have you already considered and addressed?

Gecko's layout system runs algorithms that are recursive along the depth of the DOM tree. This means that deep trees lead to an overflow of the runtime stack, especially on Windows. The HTML5 parser limits the depth of the tree it creates to 200. This works against DoS-by-incompetence but not against DoS-by-malice (since deep trees can be created by other means).
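
One way such a depth cap could work is sketched below. This is an illustration under assumptions, not the Gecko implementation: the `Node` class, `build_capped_tree`, and the policy of attaching over-deep elements to the deepest allowed ancestor are all hypothetical, and only the limit of 200 comes from the text above.

```python
MAX_DEPTH = 200  # limit stated in the text above


class Node:
    def __init__(self, name):
        self.name = name
        self.children = []


def build_capped_tree(start_tags, max_depth=MAX_DEPTH):
    """Build a tree from a run of nested start tags, never nesting deeper than max_depth."""
    root = Node("#document")
    stack = [root]  # stack of open elements; the root sits at depth 0
    for name in start_tags:
        node = Node(name)
        # Assumed policy: once the stack is deeper than the limit, attach the
        # new element to the deepest allowed ancestor instead of the stack top.
        parent = stack[min(len(stack), max_depth) - 1]
        parent.children.append(node)
        stack.append(node)
    return root


def tree_depth(root):
    """Measure depth iteratively, so the measurement itself cannot overflow the stack."""
    deepest = 0
    todo = [(root, 0)]
    while todo:
        node, depth = todo.pop()
        deepest = max(deepest, depth)
        todo.extend((child, depth + 1) for child in node.children)
    return deepest


shallow = tree_depth(build_capped_tree(["div"] * 5))
capped = tree_depth(build_capped_tree(["div"] * 1000))
```

A thousand nested start tags produce a tree only 200 levels deep here, while shallow input is unaffected. As the text notes, this does not stop a script from building a deeper tree through other means.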

The Adoption Agency algorithm has two nested loops, which means that the work done by the parser can grow more than linearly as a function of the length of the input. A patch for limiting the number of iterations is in the queue. See bug 596180.
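
The effect of capping nested loops can be shown with a toy model. This is not the Adoption Agency algorithm and not the patch in bug 596180; the function name, its parameters, and the limit values are hypothetical and chosen only to show how the step count changes.

```python
def nested_loop_steps(outer_items, inner_items, outer_limit=None, inner_limit=None):
    """Count the steps taken by two nested loops, with optional iteration caps."""
    outer = outer_items if outer_limit is None else min(outer_items, outer_limit)
    inner = inner_items if inner_limit is None else min(inner_items, inner_limit)
    steps = 0
    for _ in range(outer):
        for _ in range(inner):
            steps += 1
    return steps


# Uncapped: if both loop counts scale with the input size n, work grows as n * n.
uncapped_small = nested_loop_steps(10, 10)
uncapped_large = nested_loop_steps(100, 100)

# Capped (hypothetical limits): each pass is bounded by a constant,
# no matter how large the input grows.
capped_small = nested_loop_steps(10, 10, outer_limit=8, inner_limit=3)
capped_large = nested_loop_steps(100, 100, outer_limit=8, inner_limit=3)
```

In this model a tenfold increase in input size multiplies the uncapped work by a hundred, whereas the capped work stays constant.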

  • Is system or subsystem security compromised in any way if your project's configuration files / prefs are corrupt or missing?

No.

  • Include a thorough description of the security assumptions, capabilities and any potential risks (possible attack points) being introduced by your project.

When the tag <isindex> is parsed, a string that depends on the UI localization of the browser is inserted into the resulting DOM. An untrusted JavaScript program can use this string to obtain configuration-dependent entropy for fingerprinting or can infer the UI locale of the user. However, Gecko already leaks this data elsewhere.

  • How are transitions in/out of Private Browsing mode handled?

They are not handled.

Exported APIs

  • Please provide a table of exported interfaces (APIs, ABIs, protocols, UI, etc.)
  • Does it interoperate with a web service? How will it do so?
  • Explain the significant file formats, names, syntax, and semantics.
  • Are the externally visible interfaces documented clearly enough for a non-Mozilla developer to use them successfully?
  • Does it change any existing interfaces?

Module interactions

  • What other modules are used (REQUIRES in the makefile, interfaces)?

Data

  • What data is read or parsed by this feature?
  • What is the output of this feature?
  • What storage formats are used?

Reliability

  • What failure modes or decision points are presented to the user?
  • Can its files be corrupted by failures? Does it clean up any locks/files after crashes?

Configuration

  • Can the end user configure settings, via a UI or about:config? Hidden prefs? Environment variables?
  • Are there build options for developers? [#ifdefs, ac_add_options, etc.]
  • What ranges for the tunable are appropriate? How are they determined?
  • What are its on-going maintenance requirements (e.g. Web links, perishable data files)?

Relationships to other projects

Are there related projects in the community?

  • If so, what is the proposal's relationship to their work? Do you depend on others' work, or vice-versa?
  • Are you updating, copying or changing functional areas maintained by other groups? How are you coordinating and communicating with them? Do they "approve" of what you propose?

Review comments