Accessibility/Video a11y Architecture

< Back to main video accessibility page


This page describes a general architecture for media accessibility on the Web.

Side Conditions:

  • Our focus for video (and audio) accessibility here stems from a need for the Web; accessibility will also be required in applications other than the Web Browser, so some thought should also be spent on how to enable offline video accessibility without much loss of functionality (e.g. reduced styling capabilities)
  • Also, Web Browsers already implement a lot of functionality for styling text; we should try to re-use that functionality without enforcing a complex styling model on offline video applications
  • Audio and video files may be found anywhere on the Web and may contain annotation tracks such as captions, subtitles or audio descriptions (let's call them "text codecs"). Alternatively, these annotation tracks may be stored in companion resources (files) that reside on the same server or on a different one
  • There will be Web services that multiplex audio/video files and their annotation tracks as text codecs together to make more self-contained media resources to store to disk, view offline, and share with friends
  • Web Browsers may receive annotation tracks for media resources either within the resource itself or from a separate resource, possibly even from a different server; they have to cope with both situations and be able to play back multiplexed as well as companion resources. This could, however, be hidden from the Web Browser by the media framework. [I am not 100% sure about this decision; it would be easier, but less flexible, to focus only on multiplexed resources.]
  • There will always be a multitude of media formats and codecs to deal with; whatever scheme we develop for annotations has to work across different encapsulation formats and services. Ogg is the main target for our solution here, though
  • There will also be a need to keep the potential for innovation open on subtitles, captions, audio annotations and similar schemes, so we cannot outright decide on one particular format but should rather deal with an abstract model of time-aligned annotations.
  • The Web Browser needs to know what types of tracks it is dealing with and be able to select them for display or route them to the right device, depending on Browser settings provided by the user.


So, given these conditions, here are some thoughts on architecture:

  • The User generally uses online video and audio for one of two purposes: playback and download
  • Download requires the Browser to provide one file of multiplexed video or audio with text codecs, since it's easier to share one file than many.
  • Playback can work via one connection to a multiplexed media resource, or via multiple connections to the media data and the text data separately. Assuming the user can only provide one URL to the media resource, the latter case would need that URL to point to a resource description such as SMIL or ROE, through which the player can then re-issue requests to the several separate resources (a rough sketch of such a description is given after this list). I am not sure this is desirable and would like to discuss it.
  • For the multiplexed media resource case, one can imagine having partial resources available from one or more Web servers and a Web service that combines them into a multiplexed media resource based on the Browser's request, which in turn is based on the user's preferences.
  • In either case the media decoding subsystem of the Browser will need to hand over audio, video and text data to the Browser. If the text data is already in some XML format, the Browser's internal XML parser could be re-used to create a DOM in a nested browsing context for the Web page that the video or audio element is part of. It would in general be easier and more flexible to provide a nested browsing context DOM for each text codec of a media resource than to define a standard JavaScript API.
  • A text codec would then be an XML format that follows a standard structure and can be temporally multiplexed into a media resource. An example would be:
 <head> ... </head>
 <body>
  <div start="t1" end="t2"> ... </div>
  <div start="tx" end="ty"> ... </div>
  ...
  <div start="tz1" end="tz2"> ... </div>
 </body>
Further tags in the head and in the divs could then be defined freely, but the file would still map generically into a media resource by taking the <head> as header data and each <div> as a codec packet.
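To make this more concrete, a caption track following the template might look roughly as follows; the head metadata, timestamps and caption text are purely illustrative and not a proposed specification:
 <head>
  <meta name="language" content="en"/>
  <meta name="category" content="caption"/>
 </head>
 <body>
  <div start="00:00:01.000" end="00:00:04.500">Welcome to the demonstration.</div>
  <div start="00:00:05.000" end="00:00:09.200">The head above would become the header packet.</div>
  <div start="00:00:10.000" end="00:00:13.250">Each div would become one time-stamped codec packet.</div>
 </body>
Metadata such as the (illustrative) language and category entries would also give the Browser the track-type information it needs for selection and routing, as required above.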
  • Since it's XML (or maybe even better: HTML-like), the HTML means of attaching style information to elements can also be applied to these elements to provide styling commands to the Browser (see the styling sketch after this list).
  • We would further define a media mapping of such a resource into Ogg and implement this in a little library such that it can be used to create multiplexed media resources for the download case.
  • Having defined this generic interface for text codecs, it will be easy afterwards to write a converter, e.g. an XSLT (XML transformation) for XML-based formats such as 3GPP TimedText or CMML, or a small parser for srt files, that turns them into the text codec format, maps it into Ogg, and allows the Browser to create a nested browsing context DOM for it that can be accessed by accessibility devices and by the user (an example transform is sketched after this list).
  • I would further offer to get involved in the newly re-opened Timed Text Working Group at the W3C to work out the best XML format template for text codecs.
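To illustrate the resource description idea from the playback point above: the single URL the user provides could point to a SMIL-like description that ties the media data and the separate text resources together. The following is only a rough sketch with placeholder URLs, not a worked-out proposal:
 <smil>
  <body>
   <par>
    <video src="http://example.com/video.ogg"/>
    <textstream src="http://captions.example.net/video-captions.xml"/>
   </par>
  </body>
 </smil>
The <par> groups the two resources for parallel playback; the player would fetch both and present the text stream according to the user's accessibility settings.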
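To illustrate the styling point: since the text codec is HTML-like, existing CSS mechanisms could be reused unchanged. A minimal sketch, with class names and properties chosen purely for illustration:
 <head>
  <style type="text/css">
   div.caption { color: white; background: black; font-family: sans-serif; }
  </style>
 </head>
 <body>
  <div class="caption" start="00:00:01.000" end="00:00:04.500">A styled caption</div>
 </body>
An offline player that does not implement full CSS could simply ignore the style information and still render the text content.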
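And to sketch the conversion idea: for XML-based input formats, an XSLT can map the source's timed elements onto the <div start end> structure. The stylesheet below assumes a hypothetical input format with a <captions> root and <cue begin end> children, purely to show the shape of such a transform; it is not a real 3GPP TimedText or CMML mapping, and the output root element name is a placeholder since the template above leaves it open:
 <xsl:stylesheet version="1.0"
                 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/captions">
   <textcodec>
    <head/>
    <body>
     <!-- one div per timed cue; start/end copied from the hypothetical begin/end attributes -->
     <xsl:for-each select="cue">
      <div start="{@begin}" end="{@end}">
       <xsl:value-of select="."/>
      </div>
     </xsl:for-each>
    </body>
   </textcodec>
  </xsl:template>
 </xsl:stylesheet>
Non-XML formats such as srt would need a small dedicated parser instead, but could target the same output structure.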


< Back to main video accessibility page