Accessibility/Video a11y requirements


< Back to main video accessibility page


Accessibility for Web audio & video is an interesting beast. It mainly refers to the need to extend content with functionality that gives people with varying disabilities (in particular deaf or blind people) access to Web content.

However, such functionality is not only usable by disabled people - it quickly becomes very useful to people who are not generally regarded as "disabled", e.g. subtitles for internationalisation or for loud environments, or the "mis"-use of captions for deep search and direct access to subparts of a video. (See for example the use cases at the WHATWG wiki.)

Therefore, the analysis of requirements for accessibility cannot stop at the requirements for disabled people, but will need to go further.

The aim here is to analyse as many requirements as possible for attaching additional information to audio & video files for diverse purposes. From these we can generalise a framework for attaching such additional information, in particular to Ogg files/streams. As a specialisation of this, we can eventually give solutions for captions and audio annotations in particular.

Mojito is an example of a video player with such functionality.

As an alternative to embedding some of this information into Ogg files, it is possible to keep it in a separate file and simply refer to that file as related to the a/v file. This is a problem to solve outside the Ogg format.


Model of Video Resource

How to include a certain data type into Ogg generally follows straightforwardly from the description of that data type and is therefore included below as a suggestion.

The advantages of including timed text data inside Ogg are:

  • The captions travel with the file wherever the soundtrack travels, so they don't get dropped when the file is moved around.
  • Using a design accepted by non-Web player apps is more likely to lead to positive network effects through applicability of the same code or authoring effort in different contexts.

The disadvantages are:

  • To replace or add a timed text track to an existing video file, one needs to re-encode (re-encapsulate) the file.
  • A user agent can only access the text data as it decodes the file, which may take the full duration of the media file - with a separate text file, it would instead get all the data up-front and already in readable form.
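
To make the out-of-band alternative mentioned above concrete, here is a minimal sketch (TypeScript) of a player fetching a whole text track up-front from a sidecar file; the file name and the simple "start --> end | text" line format are invented for illustration and do not refer to any existing format:

  // Out-of-band approach: the text track lives in a separate file next to the
  // video and is fetched in full before any media data has been decoded.
  interface Cue {
    start: number;  // seconds
    end: number;    // seconds
    text: string;
  }

  async function loadSidecarCaptions(url: string): Promise<Cue[]> {
    const response = await fetch(url);
    const lines = (await response.text()).split("\n");
    const cues: Cue[] = [];
    for (const line of lines) {
      const match = line.match(/^([\d.]+) --> ([\d.]+) \| (.*)$/);
      if (match) {
        cues.push({ start: Number(match[1]), end: Number(match[2]), text: match[3] });
      }
    }
    return cues;
  }

  // Usage: the whole track is available up-front, before the video has buffered.
  loadSidecarCaptions("/media/talk.captions.txt").then(cues => {
    console.log(`${cues.length} cues available up-front`);
  });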


Additional Data tracks required for audio & video

Requirements for hearing-impaired people

Hearing-impaired people need a visual or haptic replacement for the auditive content of an audio or video file.

The following channels are available to replace the auditive content of an audio or video file:

  • textual representation: text (or braille, which is a text display)
  • visual representation: sign-language

The following are actual technologies that can provide such channels for audio & video files:

Burned-in ("open") captions

  • caption text comprises the spoken content of an audio/video file plus a transcription of non-speech audio (speaker names, music, sound effects)
  • open captions are caption text that is irreversibly merged into the original video frames, so all viewers have to watch them
  • open captions are in use today mostly for providing open subtitles on TV

Discussion:

  • (-) open captions are restrictive because they are just pixels in the video and the actual caption text is not directly accessible other than through OCR
  • (-) open captions can only be viewed and not (easily) be used for other purposes such as braille or search indexing
  • (-) in a digital world where the distribution of additional information can easily be achieved without destroying the format of the original data (both: video and text in this case), such a transfer of captions is not preferred
  • (-) no flexibility to turn off the captions
  • (-) you can have only one of them per video track (eg, one video per language)
  • (-) doesn't usually play nice with video
  • (+) no special software needs to be implemented to enable display of open captions in a video player

MULTIPLEX INTO OGG: not necessary, since it is part of the video track

Textual (closed) captions

Layout:

  • textual captions are rendered over either the top or the bottom of the video screen
  • they are normally no more than 1-2 lines of text
  • textual captions do not overlap in time, but are normally sequential
  • sometimes captions are split between the left and the right half of the screen to better indicate who is saying what
  • sometimes captions are coloured differently for different people
  • sometimes a colour that is the negative of the default is used because the video's background colours conflict with the default


Discussion:

  • (+) textual captions can be accessed as a textual representation of the video and used for purposes that go beyond the on-screen display, e.g. for display as braille or search purposes
  • (+) textual captions can be internationalised and given in multiple languages
  • (+) textual captions can be compressed easily
  • (-) textual captions essentially combine the spoken content with music descriptions, sound effect descriptions, and metadata such as speaker names in one representation, which makes these components semantically difficult to distinguish
  • (-) being extra data that accompanies a video or audio file, a player needs to be specifically adapted to support display and handling of this kind of data

MULTIPLEX INTO OGG: as extra timed text track(s)
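
A minimal sketch of how a player could render such a timed text track over the video, following the layout conventions above; the cue fields (colour, position) are illustrative only and not taken from any existing caption format:

  // Closed caption rendering over a <video> element. The overlay element is
  // assumed to be absolutely positioned over the video.
  interface CaptionCue {
    start: number;
    end: number;
    text: string;                    // normally 1-2 lines
    color?: string;                  // per-speaker colour, falls back to the default
    position?: "top" | "bottom";     // rendered over the top or bottom of the video
  }

  function attachCaptions(video: HTMLVideoElement, overlay: HTMLElement, cues: CaptionCue[]): void {
    video.addEventListener("timeupdate", () => {
      const t = video.currentTime;
      // Captions are sequential, so at most one cue is active at any time.
      const active = cues.find(c => t >= c.start && t < c.end);
      overlay.textContent = active ? active.text : "";
      overlay.style.color = active?.color ?? "white";
      overlay.style.top = active?.position === "top" ? "0" : "auto";
      overlay.style.bottom = active?.position === "top" ? "auto" : "0";
    });
  }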

Bitmap (closed) captions

  • sometimes closed captions are represented as a sequence of transparent bitmaps which are overlaid onto the video at given time offsets to display the closed captions (used e.g. on DVDs, where they are also called SPU subtitles, and in DVB)

Discussion:

  • (-) like open captions, bitmap captions are restrictive because they are just pixels in the video and the actual caption text is not directly accessible other than through OCR
  • (-) bitmap captions can only be viewed and not (easily) be used for other purposes such as braille or search indexing
  • (+) bitmap captions can be internationalised and given in multiple languages
  • (+) you can have total control over the appearance of the text (eg, move it to where it doesn't interfere with viewing, add font style with no need for player support, etc)

  • (-) in a digital world where the distribution of additional information can easily be achieved without destroying the format of the original data (both: video and text in this case), such a transfer of captions is not preferred
  • (+) compared to open captions, bitmap captions can actually be selectively turned on and off
  • (-) being extra data that accompanies a video or audio file, a player needs to be specifically adapted to support display and handling of this kind of data

MULTIPLEX INTO OGG: probably as extra timed bitmap track(s) (or extra video track) - the OggMNG and OggKate codecs will provide for this need

Sign-language

  • a sign language is a set of gestures, body movements, facial expressions and mouth movements used to communicate between people
  • a sign-language is defined for a specific localised community, e.g. American Sign Language (ASL), British Sign Language (BSL)
  • there are a large number of different sign languages "spoken" around the world
  • a sign language typically does not correspond to the spoken language of the same geographic area (ASL and BSL, for example, are totally unrelated languages)
  • there is typically no simple means to automatically translate a spoken-language word into a sign-language representation, because sign language typically works more on ideas and concepts than on letters and words; in addition, multiple things can be communicated in one sign-language sign that may take multiple sentences to communicate in a spoken language
  • for most sign languages there is no written representation available - people instead use the written representation of their local spoken language as a written communication means
  • multiple sign-language tracks are necessary to cover internationalisation of sign-language

Discussion:

  • (-) sign languages can at this point in time only be represented through a visual recording of a person performing the language, because there are no standards for specifying sign language in avatars
  • (-) automated analysis, transcription, or translation between text and sign-language is generally not possible (proprietary approaches exist, e.g. IBM ASR-to-ASL or VCom3D Text-to-ASL)
  • (-) being extra data that accompanies a video or audio file, a player needs to be specifically adapted to support display and handling of this kind of data

MULTIPLEX INTO OGG: as extra video tracks, one per sign-language - a specification on how these tracks should be presented (e.g. PIP) and what language they represent should go into the message header fields of the skeleton track

Requirements for visually impaired people

Visually impaired people need a replacement for the image content of a video so they can follow a video or any other image-based content (e.g. an image track with slides).

The following channels are available to replace the visual content of an audio or video file:

  • haptic representation: braille text
  • aural representation: auditive descriptions, text-to-speech (TTS) screen-reader

The following are actual technologies that can provide such channels for audio & video files:

Auditive audio descriptions

  • auditive descriptions describe everything that happens only visually in a video; this includes e.g. scenery, people & objects appearing, disappearing or moving, facial expressions, and weather

Example: http://www.narrativetv.com/

Discussion:

  • (-) an auditive description of the visual scene is restrictive because the description is spoken and not directly accessible other than through speech recognition
  • (-) auditive descriptions can only be listened to and not (easily) be used for other purposes such as braille or search indexing
  • (-) it can be problematic to synchronise the auditive description with the actual timeline of the video, because the auditive description may sometimes take longer to play than the original video provides a break for
  • (-) in a digital world where the distribution of additional information can easily be achieved without destroying the format of the original data (both: text description and the video sync in this case), such a transfer of scene descriptions is not preferred
  • (+) the auditive description can be played back on a different audio channel from the rest of the video's audio and be sped up separately during playback
  • (-) being extra data that accompanies a video file, a player needs to be specifically adapted to support playback and handling of this kind of data

MULTIPLEX INTO OGG: as extra audio track with editing of breaks into original a/v file (preferably Speex encoded, since it's a pure speech track)

Textual audio descriptions

  • textual representation of an audio description

Layout:

  • textual audio descriptions are not normally displayed as text on screen, since they are aimed at blind users
  • a textual scene description can be made accessible in a multitude of ways, e.g. through a TTS & screenreader, through braille

Discussion:

  • (+) the spoken rendering of the textual description (e.g. via TTS) can be played back on a different audio channel from the rest of the video's audio and be sped up separately during playback
  • (+) the textual scene description can be used for other purposes, such as search indexing
  • (+) the textual scene description together with captions may coincide in many ways with the annotation needs of archives
  • (-) the textual scene description may be too unstructured to be usable as a semantic representation of video
  • (-) being extra data that accompanies a video file, a player needs to be specifically adapted to support playback and handling of this kind of data

MULTIPLEX INTO OGG: as extra timed text track
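
As a rough sketch of how a textual audio description could be rendered aurally on the client, the following uses the browser's speech synthesis API as one possible TTS backend; the cue shape and the rate value are illustrative assumptions:

  // Speak textual audio descriptions through TTS rather than displaying them.
  // The speaking rate is independent of the video, which addresses the point
  // above about speeding the description up separately.
  interface DescriptionCue { start: number; text: string; }

  function speakDescriptions(video: HTMLVideoElement, cues: DescriptionCue[], rate = 1.3): void {
    const spoken = new Set<DescriptionCue>();
    video.addEventListener("timeupdate", () => {
      for (const cue of cues) {
        if (!spoken.has(cue) && video.currentTime >= cue.start) {
          spoken.add(cue);
          const utterance = new SpeechSynthesisUtterance(cue.text);
          utterance.rate = rate;               // sped up independently of the video
          window.speechSynthesis.speak(utterance);
        }
      }
    });
  }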

Requirements for internationalisation

subtitles

    • translation of what is being said into different languages
    • same options as open / closed text / closed bitmap captions above

Recommendation: closed text as timed text track

dubbed audio tracks

    • translation of what is being said into a different language, provided as audio

MULTIPLEX INTO OGG: as extra audio tracks (could be Speex, if the background sound is kept in a separate file and both are played back at the same time) - there needs to be a specification in the skeleton track of which language each track represents and that it replaces the main audio track if selected

Requirements for entertainment

karaoke

    • remove spoken words from audio track and add timed text track
    • similar to subtitles with style information

Layout:

  • karaoke text is rendered anywhere over the video screen
  • karaoke texts often overlap in time, e.g. two lines of text are used to alternately display the next line of lyrics
  • karaoke text sometimes scrolls
  • karaoke text is progressively coloured as the words are being sung
  • karaoke text is sometimes coloured differently for different people

MULTIPLEX INTO OGG: as timed text track
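
Karaoke mainly adds word-level timing and progressive colouring on top of an ordinary timed text track; a small sketch with invented field names:

  // Karaoke highlighting: each cue carries per-word times, and words whose
  // time has passed are wrapped in a highlight <span>.
  interface KaraokeCue {
    start: number;
    end: number;
    words: { time: number; text: string }[];   // time = when the word is sung
  }

  function renderKaraoke(cue: KaraokeCue, currentTime: number): string {
    return cue.words
      .map(w => currentTime >= w.time
        ? `<span class="sung">${w.text}</span>`  // already sung: coloured
        : w.text)
      .join(" ");
  }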


ticker text

  • news often have a ticker text at the bottom of the screen
  • ticker text always stays at the same position and scrolls through from left to right
  • the ticker text often has small ads in it, too
  • there are often graphics included

MULTIPLEX INTO OGG: as timed text track


image maps

  • advertisers in particular want to mark regions on online video with hyperlinks and some scroll-over text, which is linked to the region
  • there is the active region and there is the region for the scroll-over text
  • there can be text & graphics in the scroll-over text box
  • can be placed anywhere on top of the video screen
  • the active region is typically empty and lights up upon scroll-over

MULTIPLEX INTO OGG: as timed text track
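
A sketch of how such a timed image-map region could be realised as a hyperlink element overlaid on the video; the region geometry, link target and roll-over text fields are invented example values:

  // An image-map cue defines an active region (as fractions of the video
  // size), a link target, and roll-over text shown while hovering. The
  // container element is assumed to be positioned over the video.
  interface RegionCue {
    start: number; end: number;
    x: number; y: number; w: number; h: number;  // fractions of the video area
    href: string;
    rollover: string;
  }

  function showRegion(container: HTMLElement, cue: RegionCue): HTMLAnchorElement {
    const a = document.createElement("a");
    a.href = cue.href;
    a.title = cue.rollover;                      // shown on scroll-over
    a.style.position = "absolute";
    a.style.left = `${cue.x * 100}%`;
    a.style.top = `${cue.y * 100}%`;
    a.style.width = `${cue.w * 100}%`;
    a.style.height = `${cue.h * 100}%`;
    container.appendChild(a);
    return a;
  }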

Requirements for archiving

metadata & semantic annotations

  • (structured) annotations that describe chapters, scenes and similar semantics
  • similar to audio description, but tends to be structured along different fields
  • may also be a random note left by somebody to leave a comment

Discussion:

  • there are different types of annotations that are typically handled differently
  • RDF-type annotations, for example, are not rendered as text, but are available for machine processing only
  • chapter markers and similar structural subdivisions are mostly used for navigation
  • visible annotations are short pieces of text that are attached through speech/thought bubbles, or as notes
  • visible annotations require a time range, on-screen position, text styling, and annotation box styling
  • visible annotations could also contain graphics, but mostly simple formatting such as for notes is required

MULTIPLEX INTO OGG: as timed text track
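
The distinction drawn above between machine-only, structural and visible annotations could be modelled roughly as follows (an illustrative sketch, not an existing schema):

  // Three kinds of annotation cues that a player or indexer would treat differently.
  type AnnotationCue =
    | { kind: "machine"; start: number; end: number; rdf: string }      // never rendered, machine processing only
    | { kind: "chapter"; start: number; title: string }                 // structural, used for navigation
    | { kind: "note"; start: number; end: number; text: string;         // rendered on screen as a note/bubble
        x: number; y: number; style?: Record<string, string> };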

story board

  • sequence of images that represent the original way in which the video was scripted
  • if it is possible to align it with the resulting video, it provides insights into the video planning and production
  • if it cannot be time-aligned with the video, it should be referenced separately (text with url on webpage)

MULTIPLEX INTO OGG: as timed image track


transcript

  • textual description of everything that happens in the video
  • basically similar to what captions and audio annotations provide together
  • it is probably better to keep the different types of annotations as separate as possible and to call the complete set of textual annotations available for an audio/video file its "transcript"
  • if it cannot be time-aligned with the video, it should be referenced separately (text with url on webpage)

Discussion:

  • transcripts are mostly not displayed on top of the video, but in parallel to it, with the current text block highlighted in sync with the video
  • rich formatting is possible
  • wiki-style editing is often encouraged
  • often there are hyperlinks in transcripts

MULTIPLEX INTO OGG: as timed text track


script

  • original description of the video for production purposes
  • if it is possible to align it with the resulting video, it provides insights into the video planning and production
  • if it cannot be time-aligned with the video, it should be referenced separately (text with url on webpage)

Discussion:

  • same as transcript

MULTIPLEX INTO OGG: as timed text track


lyrics

  • for music files, a special type of transcript / subtitle format has evolved - the lyrics .lrc file format
  • this is also a text codec and can be time-aligned with audio / video

Discussion:

  • same as transcript

MULTIPLEX INTO OGG: as timed text track


titles / credits / on-screen text

  • most videos have titles and/or credits at the start/end respectively
  • sometimes there are statistics or scoreboards or other on-screen texts in the video, which could also be provided as normal text
  • where it is possible to make these available as separate text rather than burnt in (i.e. open), they should be made part of a transcript

Discussion:

  • same as transcript

MULTIPLEX INTO OGG: as timed text track


linguistic research markup

  • linguists like to mark up audio (and video) by single words or even by syllables, linking them back to the original audio
  • some markups have different hierarchical segmentations - by syllable, by word, by sentence, by paragraph

Discussion:

  • this is a fairly fine-grained temporal markup, but otherwise no different to a subtitle or caption markup

MULTIPLEX INTO OGG: as timed text track

Functionalities required around additional data tracks & audio / video

Access to text from Web page

  • Any text multiplexed inside an audio / video file should be exposed to the Web page, so we can use live region semantics where assistive technologies are notified when there is new text.

Solutions:

  • One way to get the text into the Web page is through inclusion into the DOM as subelements of the <video> or <audio> element.
  • Another way would be to provide a javascript API for the text elements.

Discussion:

  • Exposing timed text to scripts as DOM nodes is an interesting idea.
  • However, exposing them as direct subelements is probably not the way to go for two reasons:

1) The general expectation is that the DOM tree is the parse tree of one HTTP resource - that is, there's one Document object per HTTP resource.

2) For security reasons, allowing a video file to cause node insertion into an HTML DOM tree seems like a potentially dangerous thing.

  • If a DOM tree representation is natural for the captioning format, it should probably be available under a separate Document object like iframe content.
  • OTOH, the DOM is rather heavyweight, so if the caption renderer isn't DOM-based, it would probably be better to expose a lighter specialized JS API.
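
To make the lighter-weight option concrete, such a script interface might look roughly like the following; every name here (TimedTextTrack, getTimedTextTracks, onCueChange) is hypothetical and only illustrates the shape such an API could take:

  // Hypothetical lightweight API: timed text is exposed as plain objects and
  // callbacks rather than as nodes in the page's own DOM tree.
  interface TimedTextCue { start: number; end: number; text: string; }

  interface TimedTextTrack {
    category: "caption" | "subtitle" | "description" | "metadata";
    language: string;                            // e.g. "en"
    cues: readonly TimedTextCue[];
    onCueChange?: (active: readonly TimedTextCue[]) => void;
  }

  // A player would populate these from the multiplexed text and invoke
  // onCueChange as playback crosses cue boundaries; an assistive technology
  // could subscribe to drive a live region or a braille display.
  declare function getTimedTextTracks(video: HTMLVideoElement): TimedTextTrack[];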

Specification needs

  • Timed text tracks need to:
    • specify the category of timed text (e.g. caption, metadata etc)
    • specify their primary language at the start (e.g. lang="en")
    • specify their text encoding at the start (e.g. UTF-8)
    • if there are sections given in a different language they need to be specified as such
    • text segments should be allowed to overlap in time
    • it must be allowed to have time segments without text (discontinuous codec)
    • there should be a way to specify adaptable layout of overlay text, in particular relative positioning
    • there should be a way to specify mathematical formulae
  • Audio tracks should:
    • specify their primary language in the header
  • Sign language video tracks should:
    • specify their primary language in the header
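
Collected into one structure, the header and cue information listed above might minimally look as follows (field names are illustrative only, not a proposed wire format):

  // Header-level information per track, plus the per-cue fields for text tracks.
  interface TimedTrackHeader {
    kind: "text" | "audio" | "sign-language-video";
    category?: "caption" | "subtitle" | "description" | "metadata";  // text tracks only
    language: string;      // primary language, e.g. "en"
    encoding?: string;     // text tracks only, e.g. "UTF-8"
  }

  interface TimedTextCueSpec {
    start: number;
    end: number;           // cues may overlap; gaps without any cue are allowed
    language?: string;     // overrides the track language for this cue
    text: string;          // may carry markup for relative positioning or formulae
  }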

Access to structure of video / audio from Web page

  • Similarly to the access of text, it is important to be able to access the structure of the multiplexed text through the Web page.
  • This is of particular importance for blind users who cannot use the mouse to access random offset points in videos but need to rely on tabbing.
  • Video (in particular video that has audio descriptions) needs to have a structure that can be accessed via tabbing through it, to allow (blind) people to directly jump to areas of interest

Discussion:

  • (+) providing video with a structure for direct access to fragments is useful not just to blind people
  • (+) audio should also be able to be structured in this way

Solutions:

  • Both options, a DOM representation and a JavaScript API, are possible.
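
As an illustration of the tabbing requirement, a player script could expose structural markers as focusable elements and map activation to seeking (the chapter structure and function names are invented for the sketch):

  // Expose the structural markers of a video as focusable buttons so that
  // keyboard-only (e.g. blind) users can jump directly to fragments of interest.
  interface Chapter { start: number; title: string; }

  function buildChapterNavigation(video: HTMLVideoElement, list: HTMLElement, chapters: Chapter[]): void {
    for (const chapter of chapters) {
      const button = document.createElement("button");  // reachable via Tab by default
      button.textContent = chapter.title;
      button.addEventListener("click", () => {
        video.currentTime = chapter.start;
        void video.play();
      });
      list.appendChild(button);
    }
  }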

Text Display

  • Text available in association with a video should not be hidden from the user but made available.
  • Text that is displayed in a Web page needs to be stylable.
  • Sometimes there is more than one text region to be defined at one time.

Solutions:

  • Define default display mechanisms for the different types of text.
  • Define for the different text types which are open by default, and which can be toggled on.
  • For text that can be toggled on, default icons need to be specified.
  • Use style sheet mechanisms and style tags for displayable text - maybe select a simple subpart of CSS to provide styling.

Here are examples of two services that have solved the display challenge.

Dynamic content selection

  • The most suitable content for a video / audio file should be made available to the user in accordance with user's abilities and preferences.
  • This includes the ability to have the user agent ask the Web server about the available content tracks for a video.
  • This also includes the ability to have the user agent tell a Web server the specific preference settings of a user, e.g. the suppression of video content because the user is blind, so the Web server can adapt the content and provide only the tracks that the user prefers.
  • Another example where content may need to be dynamically extended with a11y text is where a caption author may provide the captions for a video that he does not host, but his server would be able to multiplex the two together and deliver them to a user agent.

Solution:

  • The Web server needs to have the ability to dynamically compose a video together from its constituent tracks according to a user agent's request. The different tracks may possibly be reused from different Web servers.
  • There needs to be a protocol (possibly a HTTP extension) that allows user agent and Web server to communicate about the composition.
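
No such protocol exists yet; purely as an illustration of the idea, a user agent might first ask the server what is available and then request a matching composition. The /tracks resource, the JSON shape and the "tracks" query parameter below are all invented:

  // Hypothetical track-selection exchange between user agent and Web server.
  async function requestPreferredComposition(videoUrl: string): Promise<string> {
    // 1. Ask the server which tracks exist for this video.
    const available: { id: string; kind: string; language: string }[] =
      await (await fetch(`${videoUrl}/tracks`)).json();

    // 2. Pick tracks according to the user's preferences - here a blind user
    //    who wants everything except the main video track.
    const wanted = available
      .filter(track => track.kind !== "video")
      .map(track => track.id)
      .join(",");

    // 3. Request a dynamically composed file containing only those tracks.
    return `${videoUrl}?tracks=${encodeURIComponent(wanted)}`;
  }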

Direct access to fragments

  • In search results, in particular search results on large audio / video files, the results need to point to the fragments that actually contain the content that relates to the query.
  • This makes content more accessible to any user, including disabled users.
  • When the fragment is selected for playback, it should only include the fragment and not the full video.

Solution:

  • A Web server needs to be enabled to deliver just the subpart of a video / audio that is requested (see Metavid for an example).
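
Purely for illustration (the time-range query parameter below is invented; Metavid and related work use their own URI conventions), a search result could point at a fragment like this:

  // Build a URL that asks the server for just one temporal fragment of a video.
  function fragmentUrl(videoUrl: string, startSeconds: number, endSeconds: number): string {
    return `${videoUrl}?t=${startSeconds},${endSeconds}`;
  }

  // A search hit pointing into a long recording: only the 30s-90s fragment is delivered.
  const hit = fragmentUrl("http://example.com/lecture.ogv", 30, 90);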

Linking to fragments

Hyperlinking out of audio / video

  • The power of the Web comes from hyperlinks between resources; data that has no outgoing hyperlinks is essentially dead data on the Web.
  • Hyperlinks from within a/v to other Web resources allow users of varying abilities to link to related information and return back for further viewing.
  • Imagine, as a blind user, using a Web of audio files - hyperlinks would greatly improve the usability of the Web for blind users, who are still relatively underrepresented on the Web because of the dominantly visual nature of Web browsing.

Solution:

  • One option is to enable the inclusion of <a href ..> into all timed text. The problem with that approach is that if the text is displayed as e.g. a caption inside a video, it appears and disappears too quickly for people to select and activate the link (even if they use 'tab' to get there).
  • An alternative is to enable the inclusion of a hyperlink for a fragment of time and display in some special way that a hyperlink is present. (see e.g. CMML for an example)
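
A sketch of the second option: a hyperlink that stays active (and keyboard-reachable) for a whole time range rather than only while a caption line is visible; the cue fields are invented:

  // Keep an outgoing hyperlink visible and focusable for the full duration of
  // its time range, so it can be reached with Tab and activated in time.
  interface LinkCue { start: number; end: number; href: string; label: string; }

  function attachTimedLink(video: HTMLVideoElement, container: HTMLElement, cue: LinkCue): void {
    const a = document.createElement("a");
    a.href = cue.href;
    a.textContent = cue.label;
    a.hidden = true;
    container.appendChild(a);
    video.addEventListener("timeupdate", () => {
      a.hidden = !(video.currentTime >= cue.start && video.currentTime < cue.end);
    });
  }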


Specification language

A text codec such as is required for the specification of captions, subtitles, textual audio descriptions and annotations needs a textual markup language for authoring. A few opinions on the choice of markup language are floating around and are captured here.

  • Typically for Web purposes an XML-based language - preferably even an HTML-like language - is preferred.
  • If a subset of HTML is selected as a choice for the markup language, parsing and rendering engines of Web browsers can be re-used.
  • XML-based markup of text codecs, however, often renders the text specification unreadable. Examples: Kate chose a C-like tree syntax that is easy to parse with lex & yacc; RDF has an N3 notation as an alternative to RDF/XML, and RELAX NG has a compact non-XML notation; srt and other simple subtitle/caption formats avoid XML
  • When encapsulating text codecs into a binary audio/video file such as Ogg, where compression is everything that counts, a "talkative" XML codec may not be the best choice.

Solution:

  • For specification languages, it may be possible to provide both an XML and a non-XML specification (similar to RDF).
  • For encapsulation (e.g. into Ogg), it may be best to provide the encapsulation framework in a denser way than XML, but to allow XML fragments to appear as codec pages.
  • For decoding into a Web application / browser, there must be a simple way to provide the text codec in a DOM structure.
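
For the last point, an XML caption fragment handed over by the decoder can already be turned into a separate, script-accessible Document with existing browser machinery; a brief sketch with an invented element vocabulary:

  // Expose an XML text-codec fragment to scripts as its own Document,
  // separate from the page's DOM tree as discussed above.
  function parseCaptionFragment(xml: string): Document {
    return new DOMParser().parseFromString(xml, "application/xml");
  }

  const fragment = parseCaptionFragment('<caption start="12.0" end="15.5">Hello world</caption>');
  console.log(fragment.documentElement.textContent);  // "Hello world"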


< Back to main video accessibility page