Accessibility/Video Text Format

An HTML-based media markup language for HTML5

This page introduces an HTML-based time-aligned text markup for audio and video. It is particularly targeted for use with HTML5 audio and video elements, but can also be used in stand-alone applications.

The new markup is called "Web Media Markup Language" (WMML) and has a MIME type of text/wmml.

The main motivation for this markup is to provide a text format for specifying captions, subtitles, karaoke, and similar time-aligned text that works by reusing existing Web technologies such as CSS and HTML. It does so by creating a new file format that reuses existing HTML5 elements where they are appropriate. Only a small number of elements are introduced that do not currently exist in HTML5.

The new elements are not an extension to HTML5 and are not planned to be. There are hooks into HTML through the TimedTracks API in HTML5 by which the WMML elements are exposed to the Web page that bears the media resource and the link to the WMML document. Some of these HTML5 APIs are only objects in HTML5, but are actual elements in WMML.

The aim behind this way of defining WMML is to create a format that can reuse existing HTML5 snippet parsing, rather than having to invent a completely new parsing model. A WMML parser will consist of only a small amount of new parsing code and rely on an existing HTML5 snippet parser for the bulk of its parsing needs. Also, the reuse of CSS will allow reuse of existing implementations for styling and positioning. This should make it considerably easier for Web browsers to implement support for WMML, including its richer features.

Note: A WMML document is a non-HTML document that contains HTML elements but is not an XML-with-namespaces document. This is on purpose so as to allow reuse of HTML snippet parsing.

Also note that as it stands WMML is not XML conformant, since it has some relaxed parsing rules (e.g. <cue> doesn't have to be explicitly closed - it will be implicitly closed by the next <cue> element). Stricter parsing rules could however be introduced.
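The implicit closing rule can be illustrated with a minimal Python sketch. It is not a WMML parser: it uses Python's standard html.parser module as a stand-in for an HTML5 snippet parser, and the names CueCollector and parse_cues are hypothetical helpers introduced only for this illustration.

```python
from html.parser import HTMLParser


class CueCollector(HTMLParser):
    """Collects cues, treating a new <cue> start tag as closing any open cue."""

    def __init__(self):
        super().__init__()
        self.cues = []        # finished cues as (attributes, text) pairs
        self._attrs = None    # attributes of the currently open cue, if any
        self._text = []       # text fragments of the currently open cue

    def _close_open_cue(self):
        if self._attrs is not None:
            self.cues.append((self._attrs, "".join(self._text).strip()))
            self._attrs, self._text = None, []

    def handle_starttag(self, tag, attrs):
        if tag == "cue":
            self._close_open_cue()   # implicit close, so cues never nest
            self._attrs = dict(attrs)

    def handle_endtag(self, tag):
        if tag in ("cue", "cuelist"):
            self._close_open_cue()

    def handle_data(self, data):
        if self._attrs is not None:
            self._text.append(data)


def parse_cues(source):
    """Return the list of (attributes, text) pairs found in the source."""
    parser = CueCollector()
    parser.feed(source)
    parser.close()
    parser._close_open_cue()  # a cue still open at the end of the input
    return parser.cues
```

With this sketch, a first cue that lacks its closing tag is still returned as a separate, non-nested cue once the next <cue> start tag is seen.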

Introductory Examples

1. A simple example

This is an example with two 10 sec long text cues provided in the default language "en-US". We use the word "cue" as a general abstraction of the time-aligned text (or "event") that is being provided. It is more general than "caption" or "subtitle" etc.

<!DOCTYPE wmml>
<wmml lang="en-US">
  <cuelist>
    <cue start="00:00:00.00" end="00:00:10.00">This should render from time=0s to time=10s.</cue>
    <cue start="00:00:10.00" end="00:00:20.00">This should render from time=10s to time=20s.</cue>
  </cuelist>
</wmml>

HTML5 defines a timed track API for cues and the list of cues inside a WMML document maps neatly onto the TimedTrackCueList interface.

Unless specified otherwise, the default rendering region for a WMML resource that is related to a video is the bottom part of the video viewport. Alternatively, the top, right, and left viewport regions are possible rendering regions, too. Further, the cues could be rendered by a Web page outside the video element, but that is decided by the rendering Web page and not by the WMML file itself. The Web page's settings always override any settings provided in the WMML file.

In this example, the cues are rendered from 0s to 10s and from 10s to 20s as an overlay onto the bottom area of the video viewport.

2. A formatted and positioned example

This is an example with two 10 sec long text cues provided in the default language "en-US" which are placed at the top third of the video.

<!DOCTYPE wmml>
<wmml lang="en-US">
  <head>
    <style type="text/css">
    wmml {
      min-width: 320px;
      min-height: 240px;
    }
    cue {
      text-align: center;
      margin: 2% 2% 64% 2%;
    }
    cue#c1 {
      color: red;
      font-family: sans-serif;
      background: rgba(0,0,0,0.5);
    }
    cue#c2 span {
      font-style: italic;
    }
    </style>
  </head>
  <cuelist>
    <cue id="c1" start="00:00:00.00" end="00:00:10.00">This should render from time=0s to time=10s.</cue>
    <cue id="c2" start="00:00:10.00" end="00:00:20.00">This should render from <span>time=10s</span> to <span>time=20s</span>.</cue>
  </cuelist>
</wmml>

It is possible to define the viewport's minimum width and height by styling the <wmml> element. This helps to communicate what CSS box the document expects to be reserved for it. All the formatting specifications in the cues are relative to that viewport. It is preferred that cues be placed relative to the viewport so that they scale easily with the video, e.g. for fullscreen viewing.

The cue elements c1 and c2 are formatted - the first one with red color, a different font, and a background transparency of 50%. The second one has spans that are italicised.

The elements of WMML

the <wmml> element

is the root element of WMML
analogous to <html> element
contains a <head> element followed by a <cuelist> element.
supports global attributes that <html> supports, too
additionally supports the following attribute:
- kind: the kind of track that this document provides, see HTML5 kind attribute

the <head> element

same definition as HTML <head> element
<link> elements inside <head> can be used to link to external style sheets
<script> should be avoided (the need for JavaScript support in WMML is not clear yet)
<style> can be used to put styling information directly in the document

the <cuelist> element

only contains a sequence of <cue> elements
is just a grouping element for the <cue> elements and doesn't support any of the attributes of the HTML body element

the <cue> element

is analogous to the HTML <div> element and supports all of the attributes and content elements of <div>, in particular all flow content (which includes <ruby>).
<cue> elements cannot appear inside <cue> elements; it is possible to introduce a parsing rule for <cue> that is similar to the one for <dt>/<dd>, where the opening of a new <cue> element implicitly closes the previous one. In this way, it is impossible to write a nested <cue> element - the parser always turns it into something non-nested.
it has the following additional attributes:
- start (float, optional): the start time of the cue (in relation to a media resource that is externally specified in an HTML media element); if missing, start=0 is assumed
- end (float, optional): the end time of the cue; if missing, it implicitly ends with the start of the next cue or at the end of the resource; thus, if time-overlapping cues are needed, specification of the end attribute is required
- voice (optional): a string identifying the voice with which the cue is associated (as defined in the HTML5 specification)
- width/height: per cue width/height in %

the <t> element

a flow content element that is used inside the <cue> element for further specification of starting times of smaller elements
by default, the content inside the <t> element inherits its style from the parent <cue> element; its own styling is only activated when its time stamp is reached
it has the following attributes:
- at (float): a time stamp specifying at what time the style for the element becomes active
- style: the styles to be activated

Note: this could also be achieved with a span element and some CSS3, in particular the transition-delay property, but the markup would be a lot more verbose and make it unreadable.
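How a renderer might decide which <t> elements have their own style activated can be sketched as follows. This is a hedged Python illustration only: the timestamp syntax is assumed from the examples on this page ("HH:MM:SS.ss" or "MM:SS.ss") rather than from a normative definition, and parse_timestamp and styled_segments are hypothetical helper names.

```python
def parse_timestamp(value):
    """Convert 'HH:MM:SS.ss', 'MM:SS.ss' or plain seconds into seconds (assumed syntax)."""
    seconds = 0.0
    for part in value.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def styled_segments(segments, now):
    """Return the text of every <t> segment whose 'at' time stamp has been reached.

    segments is a list of (at, text) pairs in document order;
    now is the current playback time in seconds.
    """
    return [text for at, text in segments if parse_timestamp(at) <= now]
```

For a karaoke line, the segments returned at a given playback time are the words that would already carry the style defined for t, while the remaining words still inherit the style of the parent <cue>.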

Matching WMML elements using selectors

For all formatting purposes in WMML, CSS is used.

The pseudo-classes as introduced for HTML5 are applicable here, e.g. <a> attribute pseudo-classes.

The HTML element selectors as introduced for HTML5 are also applicable here.

With the use of attributes, CSS selectors can be applied e.g. to all cues that belong to a certain speaker, like this: cue[voice="speaker1"] { ... } .

Rendering

The WMML file's <cue> elements are not rendered into an existing HTML page; rather, a WMML file creates its own iframe-like nested browsing context. It is linked to the parent HTML page through a track element that is inserted as a child of the video element. A nested browsing context is important because a WMML file can come from a different URI than the Web page. For security reasons and for general base URI computations, it is therefore better for the DOM nodes of the hosting page and the DOM nodes of the WMML document to live in different owner documents. That way, the hosting document has the security origin of its own URL and the WMML document has the security origin of its URL.

As the browser plays the video, it must render the WMML <cue> tags in sync. As the start time of a <cue> tag is reached, the <cue> tag is made active, and it is made inactive as the <cue> tag's end time is reached. If no start time is given, the start is assumed to be 0, and if no end time is given, the cue ends with the start of the next one or at the end of the resource.
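The defaulting rules for missing start and end times can be sketched in a few lines of Python. This is an illustration under assumptions, not a normative algorithm: cues are modelled as (start, end, text) tuples with times in seconds, a cue is treated as active from its start time up to but not including its end time, and resolve_times and active_cues are hypothetical helper names.

```python
def resolve_times(cues, duration):
    """Fill in missing start and end times.

    cues is a list of (start, end, text) tuples in document order, where
    start and end are seconds or None; duration is the resource length.
    """
    # A missing start time is assumed to be 0.
    starts = [0.0 if start is None else start for start, _, _ in cues]
    resolved = []
    for i, (_, end, text) in enumerate(cues):
        if end is None:
            # A missing end is the start of the next cue, or the end of the resource.
            end = starts[i + 1] if i + 1 < len(cues) else duration
        resolved.append((starts[i], end, text))
    return resolved


def active_cues(resolved, t):
    """Return the text of all cues that are active at playback time t."""
    return [text for start, end, text in resolved if start <= t < end]
```

Because more than one cue can match a given time, the same lookup also covers time-overlapping cues, which require explicit end attributes.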

The content of WMML cue elements is made available to the HTML page that includes the WMML file and the media resource through the timed track API in HTML. In particular, the getCueAsHTML and getCueAsSource API calls will provide a copy of the DOM subtree for the <cue>. You lose style information that was being applied by <style> elements in the WMML document, but since the main reason for the JavaScript API is to run your own styles, this is acceptable. The returned content needs to be sanitized in case a malicious cue contains a <script> element.
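The sanitizing step could look roughly like the following Python sketch, which re-serializes cue markup while dropping <script> elements and their content. It is deliberately incomplete and only illustrates the idea: a real sanitizer would also have to deal with event handler attributes, script-bearing URLs, and similar vectors. ScriptStripper and strip_scripts are hypothetical names, and Python's html.parser stands in for a browser's HTML parser.

```python
from html import escape
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-serializes markup, leaving out <script> elements and their content."""

    def __init__(self):
        super().__init__()
        self.out = []
        self._in_script = 0   # greater than zero while inside a <script> element

    def _open_tag(self, tag, attrs, closing=">"):
        parts = [tag]
        for name, value in attrs:
            # Boolean attributes have no value; others are re-escaped.
            parts.append(name if value is None else '%s="%s"' % (name, escape(value)))
        return "<" + " ".join(parts) + closing

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script += 1
        elif not self._in_script:
            self.out.append(self._open_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept in that form.
        if tag != "script" and not self._in_script:
            self.out.append(self._open_tag(tag, attrs, closing="/>"))

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = max(0, self._in_script - 1)
        elif not self._in_script:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self._in_script:
            self.out.append(escape(data, quote=False))


def strip_scripts(markup):
    """Return the markup without <script> elements (illustrative only)."""
    parser = ScriptStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```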

Concrete Examples

1. Subtitles

Simple, unformatted subtitles:

<!DOCTYPE wmml>
<wmml lang="en-US" kind="subtitles">
  <cuelist>
    <cue start="00:00:15.00" end="00:00:17.95">At the left we can see...</cue>
    <cue start="00:00:18.16" end="00:00:20.08">At the right we can see the...</cue>
    <cue start="00:00:20.11" end="00:00:21.96">...the head-snarlers</cue>
    <cue start="00:00:21.99" end="00:00:24.36">Everything is safe.<br/>Perfectly safe.</cue>
  </cuelist>
</wmml>

Formatted subtitles with hyperlink:

<!DOCTYPE wmml>
<wmml lang="de-DE" kind="subtitles">
  <head>
    <style type="text/css">
    wmml {
      background: rgba(0,0,0,0.5);
    }
    cue#c1 {
      text-align: left;
    }
    cue#c2 {
      text-align: right;
    }
    </style>
  </head>
  <cuelist>
    <cue id="c1" start="00:00:15.00" end="00:00:17.95">Auf der <i>linken</i> Seite sehen wir...</cue>
    <cue id="c2" start="00:00:18.16" end="00:00:20.08">Auf der <b>rechten</b> Seite sehen wir die....</cue>
    <cue id="c3" start="00:00:20.11" end="00:00:21.96" style="color: red;">...die <a href="http://orange.blender.org/blog/creative-commons-license-2/">Enthaupter</a>.</cue>
    <cue id="c4" start="00:00:21.99" end="00:00:24.36">Alles ist <mark>sicher</mark>.<br/>Vollkommen sicher.</cue>
  </cuelist>
</wmml>

Top-to-bottom rendered subtitles:

<!DOCTYPE wmml>
<wmml lang="zh" kind="subtitles">
  <head>
    <style type="text/css">
    cue {
      writing-mode: tb-rl;
      margin: 2% 2% 2% 64%;
      text-align: right;
    }
    </style>
  </head>
  <cuelist>
    <cue start="00:00:15.00" end="00:00:17.95">在左边我们可以看到...</cue>
    <cue start="00:00:18.16" end="00:00:20.08">在右边我们可以看到...</cue>
    <cue start="00:00:20.11" end="00:00:21.96">...捕蝇草械.</cue>
    <cue start="00:00:21.99" end="00:00:24.36">一切都安全.<br/>非常地安全.</cue>
  </cuelist>
</wmml>

2. Captions

<!DOCTYPE wmml>
<wmml lang="en-US" kind="captions">
  <cuelist>
    <cue start="00:00:15.00" end="00:00:17.95" voice="Proog">At the left we can see...</cue>
    <cue start="00:00:18.16" end="00:00:20.08" voice="Proog">At the right we can see the...</cue>
    <cue start="00:00:20.11" end="00:00:21.96" voice="Proog">...the head-snarlers<br/>[Whizzing noises]</cue>
    <cue start="00:00:21.99" end="00:00:24.36" voice="Proog">Everything is safe.<br/>Perfectly safe.</cue>
  </cuelist>
</wmml>

3. Lyrics

<!DOCTYPE wmml>
<wmml lang="en-US" kind="lyrics">
  <head>
    <title>Can't buy me Love</title>
    <meta name="m:title" content="Can't buy me love"/>
    <meta name="m:artist" content="Beatles, The"/>
    <meta name="m:author" content="Lennon & McCartney"/>
    <meta name="m:album" content="Beatles 1 - 27 #1 Singles"/>
    <meta name="m:by" content="Wooden Ghost"/>
  </head>
  <cuelist>
    <cue start="00:00:00.45" end="00:00:05.60">
      Can't <t at="00:00.75">buy</t> <t at="00:00.95">me</t> <t at="00:01.40">love,</t> <t at="00:02.60">love,</t> <t at="00:03.95">love, </t> <t at="00:05.30">love</t>
    </cue>
    <cue start="00:00:05.90" end="00:00:08.90">
      Can't <t at="00:06.20">buy</t> <t at="00:06.40">me</t> <t at="00:06.70">love,</t> <t at="00:08.00">love</t>
    </cue>
    <cue start="00:00:09.35" end="00:00:11.55">
      I'll <t at="00:09.50">buy</t> <t at="00:09.75">you</t> <t at="00:10.15">a</t> <t at="00:10.25">dia</t><t at="00:10.50">mond</t> <t at="00:10.75">ring</t> <t at="00:11.10">my</t> <t at="00:11.40">friend</t>
    </cue>
  </cuelist>
</wmml>

4. Karaoke

<!DOCTYPE wmml>
<wmml lang="en-US" kind="karaoke">
  <head>
    <title>Can't buy me Love</title>
    <meta name="m:title" content="Can't buy me love"/>
    <meta name="m:artist" content="Beatles, The"/>
    <meta name="m:author" content="Lennon & McCartney"/>
    <meta name="m:album" content="Beatles 1 - 27 #1 Singles"/>
    <meta name="m:by" content="Wooden Ghost"/>
    <style>
      t {
        font-weight: bold;
        color: yellow;
      }
    </style>
  </head>
  <cuelist>
    <cue start="00:00:00.45" end="00:00:05.60">
      Can't <t at="00:00.75">buy</t> <t at="00:00.95">me</t> <t at="00:01.40">love,</t> <t at="00:02.60">love,</t> <t at="00:03.95">love, </t> <t at="00:05.30">love</t>
    </cue>
    <cue start="00:00:05.90" end="00:00:08.90">
      Can't <t at="00:06.20">buy</t> <t at="00:06.40">me</t> <t at="00:06.70">love,</t> <t at="00:08.00">love</t>
    </cue>
    <cue start="00:00:09.35" end="00:00:11.55">
      I'll <t at="00:09.50">buy</t> <t at="00:09.75">you</t> <t at="00:10.15">a</t> <t at="00:10.25">dia</t><t at="00:10.50">mond</t> <t at="00:10.75">ring</t> <t at="00:11.10">my</t> <t at="00:11.40">friend</t>
    </cue>
  </cuelist>
</wmml>

5. Chapter markers

Simple:

<!DOCTYPE wmml>
<wmml lang="en-US" kind="chapters">
  <cuelist>
    <cue start="00:00:00.00" end="00:00:18.00">Introductory Titles</cue>
    <cue start="00:00:18.01" end="00:01:10.00">The Jack Plugs</cue>
    <cue start="00:01:10.01" end="00:02:30.00">Robotic Birds</cue>
  </cuelist>
</wmml>

With image:

<!DOCTYPE wmml>
<wmml lang="en-US" kind="chapters">
  <cuelist>
    <cue start="00:00:00.00" end="00:00:18.00"><img src="intro.png"/> Introductory Titles</cue>
    <cue start="00:00:18.01" end="00:01:10.00"><img src="plugs.png"/> The Jack Plugs</cue>
    <cue start="00:01:10.01" end="00:02:30.00"><img src="birds.png"/> Robotic Birds</cue>
  </cuelist>
</wmml>

6. Texted audio descriptions

<!DOCTYPE wmml>
<wmml lang="en-US" kind="descriptions">
  <cuelist>
    <cue start="00:00:00.00" end="00:00:05.00">The orange open movie project presents</cue>
    <cue start="00:00:05.01" end="00:00:12.00">Introductory titles are showing on the background of a water pool with fishes swimming and mechanical objects lying on a stone floor.</cue>
    <cue start="00:00:12.01" end="00:00:14.80">elephants dream</cue>
  </cuelist>
</wmml>

7. Timed Metadata

Timed slides for a presentation (slide transcripts given, too):

<!DOCTYPE wmml>
<wmml lang="en-US" kind="metadata">
  <head>
    <title>Really Achieving Your Childhood Dreams</title>
    <meta name="dc:title" content="Really Achieving Your Childhood Dreams"/>
    <meta name="dc:author" content="Randy Pausch"/>
    <meta name="dc:location" content="Carnegie Mellon University"/>
    <meta name="dc:date" content="18th Sept 2007"/>
    <meta name="dc:source" content="http://www.labnol.org/home/randy-pausch-last-lecture-video-with-transcript/4211/"/>
  </head>
  <cuelist>
    <cue start="00:00:00.00" end="00:00:44.00"><img src="intro.png"/>"Really Achieving Your Childhood Dreams" by Randy Pausch, Carnegie Mellon University, Sept 18, 2007</cue>
    <cue start="00:00:44.00" end="00:01:18.00"><img src="elephant.png"/>The elephant in the room...</cue>
    <cue start="00:01:18.00" end="00:02:05.00"><img src="denial.png"/>I'm not in denial...</cue>
  </cuelist>
</wmml>

Differences to other proposed formats for use in HTML5

Other formats have been proposed as out-of-the-box supported markup for external time-aligned text documents for HTML5 media elements. The most popular examples are SRT, WebSRT, and TTML (formerly DFXP).

The main difference between SRT and WMML is that WMML is HTML-like and thus requires more markup. But that is offset by the ability to easily extend WMML with existing HTML and CSS features.

WebSRT tries to extend SRT with features that have been deemed required for a collection of use cases around captions, subtitles, and karaoke. While this results in a fairly dense document definition, it also has the drawback that it is not easily extensible to slightly new applications, such as overlays on videos with ads, or captions with images, icons, or hyperlinks in them. Further, WebSRT is not an XML/HTML-based markup and thus requires a new parsing unit to be implemented in Web browsers. Such new parsing code should be kept to a minimum, while continuing to provide flexibility in what can be displayed in time-synchronisation with videos.

TTML has tried to be such a format. It is XML-based and has CSS-like formatting instructions. However, it has diverged so much from HTML/CSS that it is not easily possible to reuse existing HTML and CSS parsing code to interpret a TTML document. At the time of its definition, it seemed sensible to stay in sync with XHTML, XML namespaces, and XSL-FO, but in the modern HTML5 space these have proven to be a hindrance to implementation in Web browsers.

WMML provides a solution to this situation. It is very similar to HTML and reuses CSS for formatting and styling. It tries to keep what it newly introduces as simple as possible. It references HTML and CSS for the bulk of its functionality, which makes it easily extensible, since any new functionality introduced into HTML and CSS is available to WMML, too.

Uptake concerns

Uptake has to concern itself with support by several user groups:

web browsers,
manual authors,
authoring applications,
stand-alone players.

Generally, it is expected that applications ignore CSS and HTML elements that they do not understand rather than failing to parse a WMML document with such elements.

Web browsers should be able to implement support for WMML fairly easily, since they already have support for most of the required CSS and HTML functionalities.

For (manual) authoring of WMML documents, it is expected that authors exercise restraint in the elements they actually use. The reason is that the more HTML elements are used in a WMML document, the less usable that document becomes to players that do not support Web technologies. Over time, an increasing number of HTML elements may be supported by authoring tools and stand-alone players, and can then be used in typical WMML documents.

Since many new players are already capable of parsing HTML pages, implementation of support for WMML in stand-alone players may not be much of an issue.

As for the authoring side of WMML documents: for hand-coding, WMML is a bit more verbose than e.g. SRT. It is frequently pointed out that the XML-based caption format USF (Universal Subtitle Format), as defined by Matroska developers, never achieved any uptake. The reasoning is that the fansubbing community refused to author documents in such a verbose format. However, no media player ever implemented support for more than the basic features of USF, so the verbosity had a big impact while the benefits of the features were never visible.

The situation with WMML is different, though, since it is not built from scratch. If all Web browsers support WMML and its advanced features, then authors will see what the verbosity buys them. Also, because WMML would reuse HTML parsers, all features would be available immediately in a Web browser without having to wait for player developers to catch up. Exporting to WMML from a subtitling or captioning creation application also wouldn't be hard, at least for the most fundamental needs - and it would provide for all the features of advanced formats, too. Finally, stand-alone players that consider implementing support for WMML will look at it in the context of also implementing support for HTML documents - something increasingly useful to media players (as exemplified in iTunes etc). Thus, there is no additional overhead (or only minimal overhead) in implementing WMML.