Accessibility/Video Text Format
A HTML-based media markup language for HTML5
This page introduces a HTML-based time-aligned text markup for audio and video. It is particularly targeted for use with HTML5 audio and video elements, but can be used in stand-alone applications.
The main motivation for creating this markup is to create a text format for specifying captions, subtitles, karaoke, and similar time-aligned text which work by reusing existing Web technologies such as CSS and HTML. It does so by creating the absolute minimum of new markup, reusing several tags from HTML5, and reusing existing CSS constructs. The aim is to be able to simply parse the temporal cues inside this format with existing Web browser code paths without requiring the introduction of new parsing code for a new format.
This markup is called "Web Media Markup Language" (WMML) and has a mime type of text/wmml.
Differences to other proposed formats for use in HTML5
Other formats have been proposed to be used as out-of-the-box supported markup for external time-aligned text documents for HTML5 media elements. The most popular examples are SRT, WebSRT, and TTML (former DFXP).
The main difference between SRT and WMML is that WMML is HTML-like and thus requires more markup. But that is offset by the ability to easily extend WMML with existing HTML and CSS features.
WebSRT tries to extend SRT with features that have been deemed required for a collection of use cases around captions, subtitles, and karaoke. While this results in a fairly dense document definition, it also has the drawback that it is not easily extensible to slightly new applications, such as overlays on videos with ads, or captions with images, icons, or hyperlinks in them. Further, WebSRT is not a XML-based markup and thus requires implementation of a new parsing unit into Web browsers. Such new parsing code should be kept to a minimum, while continuing to provide flexibility of what can be displayed in time-synchronisation with videos.
TTML has tried to be such a format. It is XML-based and has CSS-like formatting instructions. However, it has diverged too much from HTML/CSS that it is not easily possible to reuse existing HTML & CSS parsing code to interprete a TTML document. At the time of its definition, it seemed like a sensible thing to do in order to stay in sync with XHTML, with XML namespaces and with XSL-FO, but in the modern HTML5 space, these have proven to be a hinderance to implementation in modern Web browsers.
WMML provides a solution to this situation. It is very similar to HTML and reuses CSS for formatting and styling. It tries to be as simple as possible with what it introduces newly. It references HTML and CSS for the bulk of its functionality, which makes it easily extensible, since any new functionality introduced into HTML and CSS is available to WMML, too.
Uptake concerns
For the actual authoring of WMML document it is expected that authors exert constraint in the actual elements they use. The reason is that the more elements of HTML are being used in WMML documents, the less usable the WMML document becomes to players that do not support Web technologies. Over time, increasing amounts of HTML elements may be used in typical WMML documents.
Since many new players are already capable of parsing HTML pages, implementation of support for WMML in stand-alone players may not be much of an issue.
As for the authoring side of WMML documents: for hand-coding, WMML is a bit more verbose than e.g. SRT. It is frequently pointed out that the XML-based caption format USF (Universal Subtitle Format) as it was defined by Matroska developers never achieved any uptake. Reasoning is that the fansubbing community refused to author documents in such a verbose format. However, there was never any support implemented for USF for more than the basic features in any media player, thus the verbose overhead had a big impact and the features were never visible.
The situation with WMML is different though, since it's not built completely new from scratch. If all Web browsers support WMML and its advanced features, then authors understand the usefulness of the verbosity. Also, because WMML would reuse HTML parsers, all features would be available immediately in a Web browser without having to wait for player developers to catch up. Exporting to WMML from a subtitling or captioning creation application also wouldn't be hard, at least for the most fundamental needs - and it would provide for all the features of advanced formats, too. Finally, stand-alone players the consider implementation of support for WMML will look at it in the context of also implementing support for HTML documents - something increasingly useful to media players (as exemplified in iTunes etc). Thus, there is no additional overhead (or only minimal overhead) in implementing WMML.