Accessibility/Video Text Format
An HTML-based media markup language for HTML5
This page introduces an HTML-based time-aligned text markup for audio and video. It is particularly targeted at use with the HTML5 audio and video elements, but can also be used in stand-alone applications.
The main motivation for creating this markup is to provide a text format for specifying captions, subtitles, karaoke, and similar time-aligned text that works by reusing existing Web technologies such as CSS and HTML. It does so by creating the absolute minimum of new markup, reusing several tags from HTML5, and reusing existing CSS constructs. The aim is to be able to parse the temporal cues inside this format with existing Web browser code paths, without introducing new parsing code for a new format.
This markup is called "Web Media Markup Language" (WMML) and has a mime type of text/wmml.
Introductory Examples
1. A simple example
This is an example with two 10-second text cues provided in the default language "en-US". We use the word "cue" as a general abstraction of the time-aligned text (or "event") that is being provided. It is more general than "caption", "subtitle", etc.
<!DOCTYPE wmml>
<wmml lang="en-US">
<cuelist>
<cue start="00:00:00.00" end="00:00:10.00">This should render from time=0s to time=10s.</cue>
<cue start="00:00:10.00" end="00:00:20.00">This should render from time=10s to time=20s.</cue>
</cuelist>
</wmml>
HTML5 defines a timed track API for cues and the list of cues inside a WMML document maps neatly onto the TimedTrackCueList interface.
Unless specified otherwise, the default rendering region for a WMML resource that is related to a video is the bottom part of the video viewport. Alternatively, the top, right, and left viewport regions are possible rendering regions, too. Further, the cues could be rendered by a Web page outside the video element, but that is decided by the rendering Web page and not by the WMML file itself. The Web page's settings always override any settings provided in the WMML file.
In this example, the first cue is rendered from 0s to 10s and the second from 10s to 20s, each as an overlay on the bottom area of the video viewport.
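To illustrate how little structure the simple example carries, the following Python sketch reads it with a generic XML parser and lists the cues. This is only an assumption for illustration: the proposal intends browsers to reuse their HTML code paths, and an XML parser only works for WMML files that happen to be well-formed XML. The names read_cues and WMML_SOURCE are hypothetical.

```python
import xml.etree.ElementTree as ET

# The simple example from above, embedded as a string.
WMML_SOURCE = """<!DOCTYPE wmml>
<wmml lang="en-US">
<cuelist>
<cue start="00:00:00.00" end="00:00:10.00">This should render from time=0s to time=10s.</cue>
<cue start="00:00:10.00" end="00:00:20.00">This should render from time=10s to time=20s.</cue>
</cuelist>
</wmml>"""


def read_cues(source):
    """Return (start, end, text) tuples for every <cue> in a WMML string."""
    root = ET.fromstring(source)
    cues = []
    for cue in root.iter("cue"):
        # itertext() flattens any inline markup inside the cue to plain text.
        text = "".join(cue.itertext()).strip()
        cues.append((cue.get("start"), cue.get("end"), text))
    return cues


cues = read_cues(WMML_SOURCE)
for start, end, text in cues:
    print(start, end, text)
```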
2. A formatted and positioned example
This is an example with two 10-second text cues provided in the default language "en-US", which are placed in the top third of the video.
<!DOCTYPE wmml>
<wmml lang="en-US">
<head>
<style type="text/css">
wmml {
min-width: 320px;
min-height: 240px;
}
cue {
text-align: center;
margin: 2% 2% 64% 2%;
}
cue#c1 {
color: red;
font-family: sans-serif;
background: rgba(0,0,0,0.5);
}
cue#c2 span {
font-style: italic;
}
</style>
</head>
<cuelist>
<cue id="c1" start="00:00:00.00" end="00:00:10.00">This should render from time=0s to time=10s.</cue>
<cue id="c2" start="00:00:10.00" end="00:00:20.00">This should render from <span>time=10s</span> to <span>time=20s</span>.</cue>
</cuelist>
</wmml>
It is possible to define the minimum width and height of the viewport by styling the <wmml> element. This communicates the size of the CSS box that the document expects to have reserved for it. All formatting specifications in the cues are relative to that viewport. Cues should preferably be placed relative to the viewport so that they scale easily with the video, e.g. for fullscreen viewing.
The cue elements c1 and c2 are formatted - the first one with red color, a different font, and a background transparency of 50%. The second one has spans that are italicised.
The elements of WMML
the <wmml> element
- is the root element of WMML
- analogous to <html> element
- contains a <head> element followed by a <cuelist> element
- supports global attributes that <html> supports, too
- additionally supports the following attribute:
- kind: the kind of track that this document provides, see HTML kind attribute
DOM interface:
interface WMMLElement {
attribute DOMString kind;
};
the <head> element
- same definition as HTML <head> element
- <link> elements inside <head> can be used to link to external style sheets
- <script> should be avoided
- <style> can be used to put styling information directly in the document
the <cuelist> element
- only contains a sequence of <cue> elements
- is just a grouping element for the <cue> elements and doesn't support any of the attributes of the HTML body element
DOM interface:
interface TimedTrackCueList {
readonly attribute unsigned long length;
getter TimedTrackCue (in unsigned long index);
TimedTrackCue getCueById(in DOMString id);
};
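For readers who prefer code to IDL, the following Python sketch mirrors the three members of the interface above: the length, the indexed getter, and the lookup by id. It is not a browser implementation; the class names Cue and CueList and their fields are hypothetical stand-ins chosen for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cue:
    # Hypothetical stand-in for a cue; times are in seconds.
    id: str
    start_time: float
    end_time: float
    text: str


class CueList:
    """Read-only list of cues, loosely modelled on TimedTrackCueList."""

    def __init__(self, cues):
        self._cues = list(cues)

    def __len__(self):
        # Corresponds to the `length` attribute.
        return len(self._cues)

    def __getitem__(self, index):
        # Corresponds to the indexed getter.
        return self._cues[index]

    def get_cue_by_id(self, cue_id):
        # Corresponds to getCueById(); returns None when no cue matches.
        for cue in self._cues:
            if cue.id == cue_id:
                return cue
        return None


cue_list = CueList([
    Cue("c1", 0.0, 10.0, "This should render from time=0s to time=10s."),
    Cue("c2", 10.0, 20.0, "This should render from time=10s to time=20s."),
])
print(len(cue_list), cue_list[1].id, cue_list.get_cue_by_id("c1").end_time)
```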
the <cue> element
- is analogous to the HTML <div> element and supports all of the attributes and content elements of <div>, in particular all flow content
- <cue> elements cannot appear inside <cue> elements
- it has the following additional attributes:
- start: the start time of the cue (in relation to a media resource that is externally specified in a HTML media element)
- end: the end time of the cue
- voice (optional): a string identifying the voice with which the cue is associated (as defined in the HTML5 specification)
- width/height: per cue width/height in %
DOM interface:
interface TimedTrackCue {
readonly attribute float startTime;
readonly attribute float endTime;
readonly attribute DOMString voice;
DocumentFragment getCueAsHTML();
};
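The start and end attributes are written as clock strings in the examples on this page, while startTime and endTime are floats. A minimal Python sketch of that conversion could look as follows. It assumes that the values are seconds, and that both the "hh:mm:ss.ss" form used on <cue> and the shorter "mm:ss.ss" form used on <t> in the lyrics examples should be accepted; the function name parse_timestamp is hypothetical.

```python
def parse_timestamp(value):
    """Convert 'hh:mm:ss.ss' or 'mm:ss.ss' into seconds as a float."""
    parts = value.strip().split(":")
    if not 2 <= len(parts) <= 3:
        raise ValueError(f"unsupported time stamp: {value!r}")
    seconds = 0.0
    for part in parts:
        # Each field to the left is worth 60 times the field after it.
        seconds = seconds * 60 + float(part)
    return seconds


print(parse_timestamp("00:00:10.00"))
print(parse_timestamp("00:00.75"))
```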
the <t> element
- a flow content element that is used inside the <cue> element for further specification of starting times of smaller elements
- by default, the content inside the <t> element inherits its style from the parent <cue> element; its own styling is only activated when its time stamp is reached
- it has the following attributes:
- at: a time stamp specifying at what time the style for the element becomes active
- style: the styles to be activated
DOM interface:
interface TimedTrackOffset {
readonly attribute float at;
};
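The activation rule for <t> elements can be sketched in a few lines of Python: at a given playback time, every <t> whose time stamp has been reached counts as activated. The sketch below uses a generic XML parser on a single cue taken from the lyrics example, which is an assumption for illustration only; the helper names to_seconds and highlighted_words are hypothetical.

```python
import xml.etree.ElementTree as ET

# One cue from the lyrics example on this page.
CUE_SOURCE = """<cue start="00:00:05.90" end="00:00:08.90">
Can't <t at="00:06.20">buy</t> <t at="00:06.40">me</t> <t at="00:06.70">love,</t> <t at="00:08.00">love</t>
</cue>"""


def to_seconds(stamp):
    """Convert a colon-separated clock string into seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def highlighted_words(cue, t):
    """Return the text of every <t> whose time stamp is at or before t."""
    return [el.text for el in cue.iter("t") if to_seconds(el.get("at")) <= t]


cue = ET.fromstring(CUE_SOURCE)
print(highlighted_words(cue, 6.5))
```

A renderer would apply the style of each returned element and leave the remaining <t> elements with the style inherited from the parent cue.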
Matching WMML elements using selectors
For all formatting purposes in WMML, CSS is used.
The pseudo-classes defined for HTML5 apply wherever they are applicable, e.g. the pseudo-classes for <a> elements.
The same holds for the HTML element selectors defined for HTML5.
With the use of attributes, CSS selectors can be applied e.g. to all cues that belong to a certain speaker, like this: cue[voice="speaker1"] { ... } .
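Outside a browser, the same attribute-based selection can be approximated without a CSS engine. The following Python sketch uses the limited XPath support of the standard XML parser to pick all cues of one speaker. The sample document, the voice names, and the function name cues_for_voice are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document with two speakers.
SOURCE = """<wmml lang="en-US">
<cuelist>
<cue voice="speaker1" start="00:00:00.00" end="00:00:02.00">Hello.</cue>
<cue voice="speaker2" start="00:00:02.00" end="00:00:04.00">Hi there.</cue>
<cue voice="speaker1" start="00:00:04.00" end="00:00:06.00">How are you?</cue>
</cuelist>
</wmml>"""


def cues_for_voice(root, voice):
    """Rough equivalent of the selector cue[voice="..."]."""
    return root.findall(f".//cue[@voice='{voice}']")


root = ET.fromstring(SOURCE)
speaker1_texts = [cue.text for cue in cues_for_voice(root, "speaker1")]
print(speaker1_texts)
```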
Rendering
The <cue> elements of a WMML file are not rendered into the existing HTML page. Instead, a WMML file creates its own iframe-like nested browsing context, which is linked to the parent HTML page through a track element inserted as a child of the video element. Creating a nested browsing context matters because a WMML file can come from a different URI than the Web page. For security reasons and for general base URI computations, it is therefore better to keep the DOM nodes of the hosting page and the DOM nodes of the WMML document in different owner documents. That way, the hosting document has the security origin of its own URL and the WMML document has the security origin of its URL.
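The notion of two documents having different security origins can be made concrete with a small Python sketch that compares scheme, host, and port of two URLs. This is a simplification of what browsers do, and the URLs and function names are hypothetical examples.

```python
from urllib.parse import urlsplit

# Default ports, so that an explicit ":443" compares equal to no port on https.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return a simplified (scheme, host, port) origin tuple for a URL."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def same_origin(url_a, url_b):
    return origin(url_a) == origin(url_b)


# Hypothetical hosting page and WMML resources.
page = "https://example.com/video.html"
print(same_origin(page, "https://example.com/subs/en.wmml"))
print(same_origin(page, "https://cdn.example.net/subs/en.wmml"))
```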
As the browser plays the video, it must render the WMML <cue> elements in sync. When the start time of a <cue> is reached, the <cue> is made active; it is made inactive when its end time is reached. If no start time is given, the start is assumed to be 0, and if no end time is given, the end is assumed to be infinity.
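The activation rule, including the defaults for missing start and end times, can be written down as a short Python sketch. Times are assumed to be seconds, and the function names cue_interval and is_active are hypothetical. The interval is treated as including the start and excluding the end, which follows from the cue being made inactive once its end time is reached.

```python
import math


def cue_interval(start=None, end=None):
    """Apply the defaults: a missing start is 0, a missing end is infinity."""
    begin = 0.0 if start is None else start
    finish = math.inf if end is None else end
    return begin, finish


def is_active(t, start=None, end=None):
    """True while the playback time t lies in [start, end)."""
    begin, finish = cue_interval(start, end)
    return begin <= t < finish


print(is_active(5.0, 0.0, 10.0))
print(is_active(10.0, 0.0, 10.0))
print(is_active(10.0, 10.0, 20.0))
```

With this rule, the two cues of the simple example hand over cleanly at 10s: the first becomes inactive at the same instant the second becomes active.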
Concrete Examples
1. Subtitles
<!DOCTYPE wmml>
<wmml lang="en-US" kind="subtitles">
<cuelist>
<cue start="00:00:15.00" end="00:00:17.95">At the left we can see...</cue>
<cue start="00:00:18.16" end="00:00:20.08">At the right we can see the...</cue>
<cue start="00:00:20.11" end="00:00:21.96">...the head-snarlers</cue>
<cue start="00:00:21.99" end="00:00:24.36">Everything is safe.<br/>Perfectly safe.</cue>
</cuelist>
</wmml>
<!DOCTYPE wmml>
<wmml lang="de-DE" kind="subtitles">
<head>
<style type="text/css">
wmml {
background: rgba(0,0,0,0.5);
}
cue#c1 {
text-align: left;
}
cue#c2 {
text-align: right;
}
</style>
</head>
<cuelist>
<cue id="c1" start="00:00:15.00" end="00:00:17.95">Auf der <i>linken</i> Seite sehen wir...</cue>
<cue id="c2" start="00:00:18.16" end="00:00:20.08">Auf der <b>rechten</b> Seite sehen wir die...</cue>
<cue id="c3" start="00:00:20.11" end="00:00:21.96" style="color: red;">...die Enthaupter.</cue>
<cue id="c4" start="00:00:21.99" end="00:00:24.36">Alles ist <mark>sicher</mark>.<br/>Vollkommen sicher.</cue>
</cuelist>
</wmml>
<!DOCTYPE wmml>
<wmml lang="zh" kind="subtitles">
<head>
<style type="text/css">
cue {
writing-mode: tb-rl;
margin: 2% 2% 2% 64%;
text-align: right;
}
</style>
</head>
<cuelist>
<cue start="00:00:15.00" end="00:00:17.95">在左边我们可以看到...</cue>
<cue start="00:00:18.16" end="00:00:20.08">在右边我们可以看到...</cue>
<cue start="00:00:20.11" end="00:00:21.96">...捕蝇草械.</cue>
<cue start="00:00:21.99" end="00:00:24.36">一切都安全.<br/>非常地安全.</cue>
</cuelist>
</wmml>
2. Captions
<!DOCTYPE wmml>
<wmml lang="en-US" kind="captions">
<cuelist>
<cue start="00:00:15.00" end="00:00:17.95">At the left we can see...</cue>
<cue start="00:00:18.16" end="00:00:20.08">At the right we can see the...</cue>
<cue start="00:00:20.11" end="00:00:21.96">...the head-snarlers<br/>[Whizzing noises]</cue>
<cue start="00:00:21.99" end="00:00:24.36">Everything is safe.<br/>Perfectly safe.</cue>
</cuelist>
</wmml>
3. Lyrics
<!DOCTYPE wmml>
<wmml lang="en-US" kind="lyrics">
<head>
<title>Can't buy me Love</title>
<meta name="m:title" content="Can't buy me love"/>
<meta name="m:artist" content="Beatles, The"/>
<meta name="m:author" content="Lennon &amp; McCartney"/>
<meta name="m:album" content="Beatles 1 - 27 #1 Singles"/>
<meta name="m:by" content="Wooden Ghost"/>
</head>
<cuelist>
<cue start="00:00:00.45" end="00:00:05.60">
Can't <t at="00:00.75">buy</t> <t at="00:00.95">me</t> <t at="00:01.40">love,</t> <t at="00:02.60">love,</t> <t at="00:03.95">love, </t> <t at="00:05.30">love</t>
</cue>
<cue start="00:00:05.90" end="00:00:08.90">
Can't <t at="00:06.20">buy</t> <t at="00:06.40">me</t> <t at="00:06.70">love,</t> <t at="00:08.00">love</t>
</cue>
<cue start="00:00:09.35" end="00:00:11.55">
I'll <t at="00:09.50">buy</t> <t at="00:09.75">you</t> <t at="00:10.15">a</t> <t at="00:10.25">dia</t> <t at="00:10.50">mond</t> <t at="00:10.75">ring</t> <t at="00:11.10">my</t> <t at="00:11.40">friend</t>
</cue>
</cuelist>
</wmml>
4. Karaoke
<!DOCTYPE wmml>
<wmml lang="en-US" kind="karaoke">
<head>
<title>Can't buy me Love</title>
<meta name="m:title" content="Can't buy me love"/>
<meta name="m:artist" content="Beatles, The"/>
<meta name="m:author" content="Lennon &amp; McCartney"/>
<meta name="m:album" content="Beatles 1 - 27 #1 Singles"/>
<meta name="m:by" content="Wooden Ghost"/>
<style>
t {
font-weight: bold;
color: yellow;
}
</style>
</head>
<cuelist>
<cue start="00:00:00.45" end="00:00:05.60">
Can't <t at="00:00.75">buy</t> <t at="00:00.95">me</t> <t at="00:01.40">love,</t> <t at="00:02.60">love,</t> <t at="00:03.95">love, </t> <t at="00:05.30">love</t>
</cue>
<cue start="00:00:05.90" end="00:00:08.90">
Can't <t at="00:06.20">buy</t> <t at="00:06.40">me</t> <t at="00:06.70">love,</t> <t at="00:08.00">love</t>
</cue>
<cue start="00:00:09.35" end="00:00:11.55">
I'll <t at="00:09.50">buy</t> <t at="00:09.75">you</t> <t at="00:10.15">a</t> <t at="00:10.25">dia</t> <t at="00:10.50">mond</t> <t at="00:10.75">ring</t> <t at="00:11.10">my</t> <t at="00:11.40">friend</t>
</cue>
</cuelist>
</wmml>
5. Chapter markers
<!DOCTYPE wmml>
<wmml lang="en-US" kind="chapters">
<cuelist>
<cue start="00:00:00.00" end="00:00:18.00">Introductory Titles</cue>
<cue start="00:00:18.01" end="00:01:10.00">The Jack Plugs</cue>
<cue start="00:01:10.01" end="00:02:30.00">Robotic Birds</cue>
</cuelist>
</wmml>
6. Texted audio descriptions
<!DOCTYPE wmml>
<wmml lang="en-US" kind="descriptions">
<cuelist>
<cue start="00:00:00.00" end="00:00:05.00">The orange open movie project presents</cue>
<cue start="00:00:05.01" end="00:00:12.00">Introductory titles are showing on the background of a water pool with fishes swimming and mechanical objects lying on a stone floor.</cue>
<cue start="00:00:12.01" end="00:00:14.80">elephants dream</cue>
</cuelist>
</wmml>
7. Timed Metadata
8. Linguistic Markup
Differences to other proposed formats for use in HTML5
Other formats have been proposed as out-of-the-box supported markup for external time-aligned text documents for HTML5 media elements. The most popular examples are SRT, WebSRT, and TTML (formerly DFXP).
The main difference between SRT and WMML is that WMML is HTML-like and thus requires more markup. But that is offset by the ability to easily extend WMML with existing HTML and CSS features.
WebSRT tries to extend SRT with features that have been deemed necessary for a collection of use cases around captions, subtitles, and karaoke. While this results in a fairly dense document definition, it has the drawback that it is not easily extensible to slightly new applications, such as overlays on videos with ads, or captions with images, icons, or hyperlinks in them. Further, WebSRT is not an XML-based markup and thus requires a new parser to be implemented in Web browsers. Such new parsing code should be kept to a minimum, while still providing flexibility in what can be displayed in time-synchronisation with videos.
TTML has tried to be such a format. It is XML-based and has CSS-like formatting instructions. However, it has diverged so far from HTML/CSS that existing HTML and CSS parsing code cannot easily be reused to interpret a TTML document. At the time of its definition, it seemed sensible to stay in sync with XHTML, XML namespaces, and XSL-FO, but in the modern HTML5 space these have proven to be a hindrance to implementation in Web browsers.
WMML provides a solution to this situation. It is very similar to HTML and reuses CSS for formatting and styling. It tries to introduce as little new markup as possible. It references HTML and CSS for the bulk of its functionality, which makes it easily extensible, since any new functionality introduced into HTML and CSS is available to WMML, too.
Uptake concerns
Uptake has to concern itself with support by several user groups:
- web browsers,
- manual authors,
- authoring applications,
- stand-alone players.
Generally, it is expected that applications ignore CSS and HTML elements that they do not understand rather than failing to parse a WMML document with such elements.
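For a player without HTML or CSS support, ignoring unknown elements can be as simple as flattening each cue to plain text. The following Python sketch keeps the text of any inline element it does not understand and turns <br/> into a line break. It relies on a generic XML parser, which is an assumption for illustration only, and the function name plain_text is hypothetical.

```python
import xml.etree.ElementTree as ET


def plain_text(element):
    """Flatten an element to text: unknown tags are dropped, <br/> is a newline."""
    pieces = [element.text or ""]
    for child in element:
        if child.tag == "br":
            pieces.append("\n")
        else:
            # Keep the text content of elements we do not understand.
            pieces.append(plain_text(child))
        # Text that follows the child element inside the parent.
        pieces.append(child.tail or "")
    return "".join(pieces)


# A cue from the German subtitle example on this page.
cue = ET.fromstring(
    '<cue id="c4" start="00:00:21.99" end="00:00:24.36">'
    "Alles ist <mark>sicher</mark>.<br/>Vollkommen sicher.</cue>"
)
flattened = plain_text(cue)
print(flattened)
```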
Web browsers should be able to implement support for WMML fairly easily, since they already have support for most of the required CSS and HTML functionalities.
For (manual) authoring of WMML documents, authors are expected to exercise restraint in the elements they actually use. The reason is that the more HTML elements are used in a WMML document, the less usable that document becomes for players that do not support Web technologies. Over time, an increasing number of HTML elements may be supported by authoring tools and stand-alone players, and can then be used in typical WMML documents.
Since many new players are already capable of parsing HTML pages, implementation of support for WMML in stand-alone players may not be much of an issue.
As for the authoring side of WMML documents: for hand-coding, WMML is a bit more verbose than e.g. SRT. It is frequently pointed out that the XML-based caption format USF (Universal Subtitle Format), as defined by Matroska developers, never achieved any uptake. The reasoning is that the fansubbing community refused to author documents in such a verbose format. However, no media player ever implemented more than the basic features of USF, so the verbosity carried a high cost while the features it paid for were never visible.
The situation with WMML is different, though, since it is not built from scratch. If all Web browsers support WMML and its advanced features, authors will see what the verbosity buys them. Also, because WMML would reuse HTML parsers, all features would be available immediately in a Web browser without having to wait for player developers to catch up. Exporting to WMML from a subtitling or captioning application also wouldn't be hard, at least for the most fundamental needs, and it would provide for all the features of advanced formats, too. Finally, stand-alone players that consider implementing support for WMML will look at it in the context of also implementing support for HTML documents, something increasingly useful to media players (as exemplified by iTunes etc.). Thus, there is no additional overhead (or only minimal overhead) in implementing WMML.
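To give an idea of how small a basic exporter could be, the following Python sketch converts plain SRT text into a WMML document with one <cue> per SRT block. It is a sketch under assumptions, not a complete converter: it handles only well-formed SRT blocks, truncates milliseconds to the two-digit fractions used in the examples on this page, and always writes kind="subtitles". The function names are hypothetical.

```python
from xml.sax.saxutils import escape


def srt_time_to_wmml(stamp):
    """'00:00:17,950' -> '00:00:17.95' (milliseconds truncated to centiseconds)."""
    clock, millis = stamp.strip().split(",")
    return f"{clock}.{int(millis) // 10:02d}"


def srt_to_wmml(srt_text, lang="en-US"):
    """Convert well-formed SRT text into a minimal WMML document."""
    lines = ["<!DOCTYPE wmml>", f'<wmml lang="{lang}" kind="subtitles">', "<cuelist>"]
    for block in srt_text.strip().split("\n\n"):
        rows = block.splitlines()
        # rows[0] is the SRT counter, rows[1] the timing line, the rest is text.
        start, end = (srt_time_to_wmml(s) for s in rows[1].split("-->"))
        body = "<br/>".join(escape(row) for row in rows[2:])
        lines.append(f'<cue start="{start}" end="{end}">{body}</cue>')
    lines += ["</cuelist>", "</wmml>"]
    return "\n".join(lines)


SRT_SAMPLE = """1
00:00:15,000 --> 00:00:17,950
At the left we can see...

2
00:00:21,990 --> 00:00:24,360
Everything is safe.
Perfectly safe.
"""

wmml = srt_to_wmml(SRT_SAMPLE)
print(wmml)
```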