Accessibility/Experiment1 feedback

From MozillaWiki

Experiment 1: Video Accessibility

The Specification

A first specification for how to extend the HTML5 video element to support out-of-band subtitles (and other time-aligned text) using the itext approach was developed in July 2009. It is based on a previous proposal and on several other proposals at WHATWG.

The Implementation

An implementation of this specification was developed.

It includes a specification of the <itext> elements that reference out-of-band time-aligned text files. It also includes a JavaScript implementation of the proposed JavaScript API for the <itext> elements.

It also includes use of a skin for the video player because of the need to extend the video controls - this can be ignored for the purposes of discussion of the specification.

The video used is "Elephants Dream", for which a large number of subtitles in different languages are available in srt format and in different character sets.

The demo is here.

The demo works in Safari (with XiphQT installed), in Opera (experimental build), in Chrome (experimental build) and in Firefox 3.5.

To make use of the textual audio annotations, you will typically need to install a screen reader such as JAWS, NVDA, or Fire Vox. JAWS and NVDA seem to expose some bugs with the JavaScript-updated aria-live attributes, but otherwise read the textual audio annotations nicely.

The software is available from here.

(Please note that there are bugs in the demo, but that the idea is to discuss the concepts.)

Features of the Implementation

The demo:

  • contains four different types of time-aligned text: subtitles, captions, chapters, and textual audio annotations
  • extends the video controls with a menu button for the time-aligned text tracks which enables the user to switch between different languages for the different tracks
  • the textual audio annotations are mapped into an aria-live activated div element, such that they are indeed read out by screen-readers; this div sits behind the video, invisible to everyone else
  • the chapters are displayed as text on top of the video
  • the subtitles and captions are displayed as overlays at the bottom of the video
  • these three display mechanisms are meant to be the default display mechanisms for these kinds of tracks; a Web developer who intends to place the text elsewhere on screen could override them in a stylesheet
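
The aria-live mapping for textual audio annotations listed above can be sketched roughly as follows. This is a minimal sketch and not the demo's actual code: the function names createLiveRegion and announce are hypothetical, the "assertive" politeness level is an assumption, and the off-screen styling is one assumed way of keeping the div out of sight (the demo instead places the div behind the video).

```javascript
// Hypothetical helper: build a div that screen readers announce on change.
// `doc` is a DOM Document (e.g. `document` in a browser).
function createLiveRegion(doc) {
  const region = doc.createElement("div");
  // Assumption: "assertive" so that annotations are read immediately.
  region.setAttribute("aria-live", "assertive");
  // Assumed styling: move the region off-screen rather than hiding it,
  // because hidden elements are usually skipped by screen readers.
  region.setAttribute("style", "position:absolute; left:-10000px;");
  return region;
}

// Replace the region's text with the current audio annotation.
function announce(region, text) {
  region.textContent = text;
}
```

In a page, the region would be appended to the document once, and announce would be called each time a new annotation becomes active.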

Bugs / missing features / limitations of the demo:

  • the "delay" functionality of the specification has not been implemented yet
  • only srt files have been used to implement time-aligned text functionality
  • subtitles and captions currently overlap each other in the display space - they are likely to also overlap lyrics and maybe even transcripts - there needs to be some thought about how to solve this issue
  • several time-aligned text categories (KTV, TIK, NB, META, TRX and LRC) have not been implemented / demonstrated yet
  • selecting a different track through the menu does not work reliably yet
  • switching off tracks that have been activated is not possible yet
  • callbacks, e.g. for cue ranges, are missing and need to be dealt with
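
Since the demo relies only on srt files, a rough idea of what parsing them involves may help the discussion. The following is a minimal sketch, not the demo's parser: the names srtTimeToSeconds and parseSrt are hypothetical, and it assumes the common srt layout of a counter line, a timing line with "-->", and one or more text lines, with blocks separated by blank lines.

```javascript
// Convert an SRT timestamp such as "00:01:02,500" to seconds.
// Anything after the milliseconds (e.g. position hints) is ignored.
function srtTimeToSeconds(stamp) {
  const m = /^\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})/.exec(stamp);
  if (!m) throw new Error("bad SRT timestamp: " + stamp);
  return Number(m[1]) * 3600 + Number(m[2]) * 60 + Number(m[3]) + Number(m[4]) / 1000;
}

// Parse SRT text into an array of { start, end, text } cues.
function parseSrt(srt) {
  const cues = [];
  // Normalise line endings; blocks are separated by blank lines.
  const blocks = srt.replace(/\r\n?/g, "\n").trim().split(/\n{2,}/);
  for (const block of blocks) {
    const lines = block.split("\n");
    // The first line is normally a counter; tolerate files without it.
    const timingIndex = lines[0].includes("-->") ? 0 : 1;
    const timing = lines[timingIndex];
    if (!timing || !timing.includes("-->")) continue; // skip malformed blocks
    const [from, to] = timing.split("-->");
    cues.push({
      start: srtTimeToSeconds(from),
      end: srtTimeToSeconds(to),
      text: lines.slice(timingIndex + 1).join("\n"),
    });
  }
  return cues;
}
```

The sketch works on already-decoded text; which character set to decode the file with is a separate question.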

Thoughts / Feedback

SP = Silvia Pfeiffer/Xiph, Mozilla

PJ = Philip Jagenstedt/Opera

GF = Geoff Freed/WGBH

FS = Felipe Sanches/collaborative subtitling

MG = User:Ms2ger

RE = Roberto Ellero/a11y expert

OG = OggK

  • SP: the distinction between captions and subtitles may not be necessary
GF: The distinction between captions and subtitles is definitely necessary, especially if you're planning to follow the North American nomenclature (which it appears you are going to do). Subtitles are for hearing people; they're on-screen text that reflects a translation of the original audio into another language. Captions are for people who are deaf or hard-of-hearing; they are on-screen text that reflects the same language as the original audio. Captions also contain additional information (speaker cues, music indicators, placement of text) not normally found in subtitles.
SP: yes, that was the reason there is a distinction. While there is definitely a difference between the audiences of captions and subtitles and their needs, I wonder if they need a technical distinction: they are both displayed on-screen and typically in the same location. Translations can exist for both subtitles and captions. The only reason to keep them separate is that both a subtitle file and a caption file may be available in the same language. However, they should be alternatives, not additions. So, it might make sense to somehow group them together.
GF: Still, captions are not subtitles and subtitles are not captions. Even if you've got Spanish subtitles and Spanish captions, they're different because the captions will contain information that the subtitles won't. Really, the only thing they have in common is that they are text. The technical distinction can be made by identifying captions with one type of metadata and subtitles with another. Place them in different GUI menus, as well.
  • SP: the HTML specification could be improved by including an extra hierarchical element, such as itextlist. This allows all time-aligned text categories to be handled in the same way with itext, but provides a selection mechanism for the alternative tracks. category is a required attribute.
 <video ...>

 <itextlist category="CC" activelang="de">
  <itext src="" lang="de" type="text/srt" />
  <itext src="" lang="en" type="text/srt" />
  <itext src="" lang="it" type="text/srt" />
 </itextlist>

 <itextlist category="TAD" activelang="en">
  <itext src="" lang="de" type="text/srt" charset="ISO-8859" />
  <itext src="" lang="en" type="text/srt" charset="ISO-8859" />
 </itextlist>

 </video>

  • PJ: can we make this fit in with or replace the addCueRange/removeCueRanges interface? Basically, I believe it should be possible to, using a DOM interface, add the same timed text ranges that would result from letting the browser parse SRT. The only difference between itext and the cue ranges interface is that one is associated with text while the other uses callbacks. The allText property would need to be replaced with another representation where both the times and the text (or callbacks) can be created/modified/deleted. Something like an array of
interface MediaTimeRange {
 attribute double start;
 attribute double end;
 attribute DOMString text;
 attribute Function onenter;
 attribute Function onleave;
};
SP reply: I suppose, similar to Ian's proposal for extending srt to support karaoke and lyrics, it could also be extended with functions for onenter and onleave.
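
To make the callback side of this proposal concrete, here is a minimal sketch of how onenter and onleave might be fired for ranges shaped like the interface above. It is not part of any specification: the function name dispatchCues is hypothetical, and treating each range as the half-open interval [start, end) and passing the range to its callback are assumptions of this sketch.

```javascript
// Fire onenter/onleave for ranges whose active state changed between
// the previous playback position and the current one (both in seconds).
// Each range is { start, end, text, onenter, onleave }.
function dispatchCues(ranges, previousTime, currentTime) {
  for (const range of ranges) {
    // Assumption: a range is active for start <= t < end.
    const wasActive = previousTime >= range.start && previousTime < range.end;
    const isActive = currentTime >= range.start && currentTime < range.end;
    if (!wasActive && isActive && range.onenter) range.onenter(range);
    if (wasActive && !isActive && range.onleave) range.onleave(range);
  }
}
```

A player would presumably call this whenever the playback position changes, remembering the position from the previous call.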
PJ: Then, the delay method would become somewhat redundant; better to handle this by rewriting the times via DOM (this also allows fixing drift, not just constant delay). currentText also wouldn't be needed.
SP reply: that forces everyone who has a problem with the start time (and the drift) to write the function to fix it themselves
FS: I'd suggest adding another synchronization parameter on top of the "delay". It could be called "stretch". Sometimes different video encodings lead to slightly different playback speed (maybe due to issues related to different framerates). So, one would not only delay the subtitles but maybe also time-stretch them a bit in order to synchronize them. The default value for this parameter would be "100%". If one needs the subtitles to be displayed 3% faster, then he/she would explicitly set stretch="103%". The itext tag (especially if it provides both delay and stretch attributes) will make it easier to implement a clean version of the collaborative subtitling system.
SP reply: yes, I like this idea.
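
The combined effect of "delay" and the suggested "stretch" can be illustrated with a short sketch. This is only one possible reading: the function name retime is hypothetical, the stretch value is interpreted as a speed factor (so 103 divides every cue time by 1.03), and the stretch is applied before the delay. None of this is fixed by the discussion above.

```javascript
// Return retimed copies of the cues; the input array is not modified.
// delaySeconds: constant offset added to every time (may be negative).
// stretchPercent: assumed to be a speed factor, so 103 means the text
// track runs 3% faster and every cue time is divided by 1.03.
function retime(cues, delaySeconds = 0, stretchPercent = 100) {
  if (!(stretchPercent > 0)) throw new RangeError("stretch must be positive");
  const factor = stretchPercent / 100;
  return cues.map((cue) => ({
    ...cue,
    // Assumption: stretch first, then shift by the delay.
    start: cue.start / factor + delaySeconds,
    end: cue.end / factor + delaySeconds,
  }));
}
```

Because a constant factor is applied to every time, this corrects linear drift; irregular drift would still need per-cue edits of the kind PJ describes.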

SP reply: we could remove the fetch() function. All I wanted to avoid was to have all files downloaded even if they're not ever required.
  • PJ: Making enabled writable would remove the need for enable()/disable().
SP reply: I am considering moving it to the itextlist.

  • PJ: Depending on the user agent's preferred language setting has failed miserably so far - most users just leave it as the default. Sites are forced to use explicit language selection or guess the language based on IP; I expect it would be no different for this feature. I honestly don't know what a good solution for this is.
SP reply: By providing a selection mechanism through the "display" attribute, it is possible for the Web developer to override the preferred language setting. Further, the user can do explicit selection through the menu.
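
The selection order SP describes could look roughly like the sketch below. It is an illustration only: the function name selectTrack and the option names userChoice, authorChoice and preferred are hypothetical, and the order of precedence (menu choice, then author override, then user agent preference) is an assumption rather than something the specification states.

```javascript
// Pick the track to show. `tracks` is an array of { lang, src } objects.
// Order of precedence (an assumption of this sketch):
//   1. a language the user chose explicitly in the menu,
//   2. a language forced by the page author,
//   3. the first match in the user agent's preferred-language list.
// Returns null when nothing matches.
function selectTrack(tracks, { userChoice, authorChoice, preferred = [] } = {}) {
  const candidates = [userChoice, authorChoice, ...preferred].filter(Boolean);
  for (const wanted of candidates) {
    // Compare primary subtags only, so "en-US" matches a track tagged "en".
    const primary = wanted.toLowerCase().split("-")[0];
    const hit = tracks.find((t) => t.lang.toLowerCase().split("-")[0] === primary);
    if (hit) return hit;
  }
  return null;
}
```

In a browser, the preferred list could be taken from navigator.languages; as PJ notes, that setting is often left at its default, which is why the explicit choices come first here.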
  • PJ: For scripts, the charset attribute is ignored for cross-origin because interpreting something under a different charset than intended can give different results. The cross-origin problem is probably more relevant when it comes to allText/currentText/MediaTimeRange. I'm not sure if verifying that the resource is in fact a supported type is enough, as that would still allow web sites to read subtitle files from the intranet of the client if they can guess the URL. Imagine if the full text of http://internal/ was available for all to see through this API.
SP reply: srt does not have a specification for charset, so the server can only guess the correct charset to provide with the srt file. Thus, IMHO the only means in which a Web developer can provide the correct charset to use for a srt file is by providing it in such an attribute. If that could be avoided, I'd be all for it.
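
To show why the charset has to come from somewhere, here is a minimal sketch of decoding srt bytes with an author-supplied label. The function name decodeSubtitleBytes is hypothetical, and the UTF-8 fallback is an assumption; TextDecoder is the standard decoding API available in current browsers and in Node.js, not something the itext proposal defines.

```javascript
// Decode raw subtitle bytes using a charset label supplied by the page
// author, falling back to UTF-8 when no label is given (an assumption).
function decodeSubtitleBytes(bytes, charset = "utf-8") {
  // TextDecoder throws a RangeError for labels it does not recognise.
  return new TextDecoder(charset).decode(bytes);
}

// The same bytes read differently depending on the label:
const sample = new Uint8Array([0x45, 0x73, 0x70, 0x61, 0xf1, 0x61]);
const asLatin1 = decodeSubtitleBytes(sample, "iso-8859-1"); // "España"
const asUtf8 = decodeSubtitleBytes(sample); // 0xF1 is not valid UTF-8 here
```

With the wrong label the text is silently garbled rather than rejected, which is the failure mode both PJ and SP are concerned about.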
  • FS: I suggest that we provide a standard way of adding to the text category menu the subtitles that come embedded in videos. What would be the proposed procedure for adding these to the list? The ogg-decoder would need a way to provide this info to the subtitles list code. It would be good if this info were accessible through the DOM so that client code can be aware of embedded subs.
SP reply: I haven't started working on in-band time-aligned text yet, but the plan is to experiment with Ogg and Kate to see how far we can get to expose an identical interface to the browser as through the out-of-band time-aligned text. So, while parsing the binary file, the text should be added to the DOM as is happening with the out-of-band text.
  • MG: It looks interesting to me, but from a quick read I'd say that in the current proposal (at least in the demo) there's too much boilerplate code. At least one of the arguments pro HTML5 is that it doesn't require more typing than necessary (see the doctype, meta charset, etc.).
So, personally, I'd drop @type and shorten @category. I'd also put @charset in the SRT file itself, or at least in the HTTP headers, where it is less likely to be out of sync with the content. (I guess that's not possible with the format as it stands today, but it needs a better spec anyway. I guess a charset declaration could be added at the same time.) Finally, I'd make @display default to "auto", and make the behavior dependent on the language and the category. (E.g. always display the cue points.)
I.e., instead of <itext lang="es" type="text/srt" charset="ISO-8859" display="auto" src="subtitles/" category="CC"></itext> I'd prefer something closer to <itext lang="es" src="subtitles/" cat="CC">. I'd drop the end-tag.
Also, shouldn't that use "ISO-8859-1" rather than "ISO-8859"?
SP reply: the charset is required since SRT does not provide it. Also, I had to use ISO-8859-1 to display the correct characters, but I'd be happy to take advice on the charset to use. As for shortening: yes, I was thinking that, too - see the thoughts on itextlist above.

  • RE: It seems to be impossible to listen to the audiodescription and use the other controls using JAWS.
SP reply: I hope that will be just a temporary issue while ARIA live regions are still new.

SP reply: with the javascript API to the itext elements it will be possible to build such interfaces easily using HTML5 video.

(see SMIL file here:

SP reply: with the HTML5 video element and its javascript API this is already possible.

SP reply: this is a Firefox bug that I think is fixed now (Oct 09).

  • OG: I see a problem for embedded streams about 'readonly attribute HTMLCollection allText'. '4. Itext text extraction' says it is the entire contents of time+text, so you can do processing on the whole text, which won't be possible on streamed resources. Even if it was to preload the entire thing, it would break for 'live' sources that are realtime captioned (though that's probably very rare, but it's done on TV I think). Might want to add a boolean 'partial content' if it's acceptable to get filled as streaming goes.
SP reply: It will always contain what is available - in real-time resources that would be what has been available so far. I don't think that's "partial" content either.

  • OG: On errors: there's a language mismatch error; there probably should also be a category mismatch for a category that's not in the list? Or a warning. (Or is that a parse error? Could be, too.)
SP reply: yes, the error interface is still very raw. Thanks for noticing.

  • OG: langName: you say it's the full text, suitable for display. I was wondering where to do that conversion from language code to user visible name. Having the JS see the language code allows for doing something with it (eg, fetching a dictionary, whatever). User visible name means for en you could have English, Ingles, etc. Less useful for that.
SP reply: no, it's not the full text, but the code, as in the spec. The browser has to map that to the visible name as it does with other interfaces.
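
The mapping from language code to visible name that SP leaves to the browser can be sketched with an API that exists today. The function name languageDisplayName is hypothetical; Intl.DisplayNames is part of the present-day ECMAScript Internationalization API and postdates this discussion, so this is an illustration rather than what the specification proposes.

```javascript
// Map a language code to a human-readable name in the UI locale.
function languageDisplayName(code, uiLocale = "en") {
  const names = new Intl.DisplayNames([uiLocale], { type: "language" });
  return names.of(code);
}

const label = languageDisplayName("de"); // "German" in an English UI
```

Scripts keep access to the raw code, as OG wanted, and only convert it when a menu label is needed.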

  • more feedback encouraged!