Accessibility/Video a11y requirements
Accessibility for Web audio & video is an interesting beast. It mainly refers to the need to extend content with functionality that gives people with varying disabilities (in particular deaf or blind people) access to Web content.
However, such functionality is not only usable by disabled people - it quickly becomes very useful to people who are not generally regarded as "disabled", e.g. subtitles for internationalisation or for loud environments, or the "mis"-use of captions for deep search and direct access to subparts of a video. (See for example the use cases at the WHATWG wiki.)
Therefore, the analysis of accessibility requirements cannot stop at the requirements of disabled people, but needs to go further.
The aim here is to analyse as many requirements as possible for attaching additional information to audio & video files for diverse purposes. From these we can generalise a framework for attaching such additional information, in particular to Ogg files/streams. As a specialisation of this, we can eventually give solutions for captions and audio annotations in particular.
Mojito is an example of a video player with such functionality.
As an alternative to embedding some of this information into Ogg files, it is possible to keep them in a separate file and just refer to them as related to the a/v file. This is a problem to solve outside the Ogg format.
The description of how to include a certain data type into Ogg generally follows straightforwardly from the data type's description and is therefore included below as a suggestion.
The advantages of including timed text data inside Ogg are:
- The captions travel with the file whenever the soundtrack travels, so they don't get dropped when the file is moved around.
- Using a design accepted by non-Web player apps is more likely to lead to positive network effects through applicability of the same code or authoring effort in different contexts.
The disadvantages are:
- To replace or add a timed text track to an existing video file, one needs to re-encode (re-encapsulate) the file.
- A user agent will only be able to access the text data after decoding, which may take the full duration of the media file - with a separate text file, it would instead get all the data up front and already in readable form.
Contents
- 1 Additional Data tracks required for audio & video
- 2 Functionalities required around additional data tracks & audio / video
Additional Data tracks required for audio & video
Requirements for hearing-impaired people
Hearing-impaired people need a visual or haptic replacement for the auditive content of an audio or video file.
The following channels are available to replace the auditive content of an audio or video file:
- textual representation: text (or braille, which is a text display)
- visual representation: sign-language
The following are actual technologies that can provide such channels for audio & video files:
- caption text comprises the spoken content of an audio/video file plus a transcription of non-speech audio (speaker names, music, sound effects)
- open captions are caption text that is irreversibly merged in the original video frames in such a way that all viewers have to watch them
- open captions are in use today mostly for providing open subtitles on TV
Discussion:
- (-) open captions are restrictive because they are just pixels in the video and the actual caption text is not directly accessible other than through OCR
- (-) open captions can only be viewed and not (easily) be used for other purposes such as braille or search indexing
- (-) in a digital world where the distribution of additional information can easily be achieved without destroying the format of the original data (both: video and text in this case), such a transfer of captions is not preferred
- (-) no flexibility to turn off the captions
- (-) you can have only one of them per video track (eg, one video per language)
- (-) open caption text does not usually interact well with the video itself (e.g. it degrades when the video is scaled or re-encoded)
- (+) no special software needs to be implemented to enable display of open captions in a video player
MULTIPLEX INTO OGG: not necessary, since it is part of the video track
- closed captions are caption text that can be activated/deactivated to be displayed on top of the video
- while there is a multitude of methods to encode closed captions into TV signals, we will here refer to closed captions only as a textual representation of captions relating to a digital video file
Layout:
- textual captions are rendered over either the top or the bottom of the video screen
- they normally consist of no more than 1-2 lines of text
- textual captions do not overlap in time, but are normally sequential
- sometimes captions are split between the left and the right half of the screen to better associate the text with who is speaking
- sometimes captions are coloured differently for different people
- sometimes a colour inverse to the default is used because the background colours of the video conflict with the default
Discussion:
- (+) textual captions can be accessed as a textual representation of the video and used for purposes that go beyond the on-screen display, e.g. for display as braille or search purposes
- (+) textual captions can be internationalised and given in multiple languages
- (+) textual captions can be compressed easily
- (-) textual captions mix the represented spoken content with music descriptions, sound effect descriptions, and metadata such as speaker names in one representation, which makes these components semantically difficult to distinguish
- (-) being extra data that accompanies a video or audio file, a player needs to be specifically adapted to support display and handling of this kind of data
MULTIPLEX INTO OGG: as extra timed text track(s)
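As an illustration only, the layout features listed above could be captured in a small data structure like the following sketch (all names are made up for this example and are not part of any existing caption format):

```typescript
// Illustrative sketch of a textual closed-caption cue covering the layout
// features described above; none of these names come from an existing format.
interface CaptionCue {
  startTime: number;                    // seconds from the start of the media
  endTime: number;                      // cues are normally sequential, not overlapping
  lines: string[];                      // normally no more than 1-2 lines of text
  verticalPosition: "top" | "bottom";   // rendered over the top or bottom of the video
  horizontalHalf?: "left" | "right";    // optional split to associate text with a speaker
  speaker?: string;                     // speaker name, kept separate from the spoken text
  colour?: string;                      // per-speaker or contrast-adjusted colour
}

// Example cue: two lines at the bottom of the screen, coloured for one speaker.
const cue: CaptionCue = {
  startTime: 12.0,
  endTime: 15.5,
  lines: ["I can't believe", "you did that!"],
  verticalPosition: "bottom",
  speaker: "Anna",
  colour: "#ffff00",
};
```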
- sometimes closed captions are represented as a sequence of transparent bitmaps which are overlaid onto the video at given time offsets to display the closed captions (used e.g. on DVDs, where they are also called SPU subtitles, and in DVB)
Discussion:
- (-) like open captions, bitmap captions are restrictive because they are just pixels in the video and the actual caption text is not directly accessible other than through OCR
- (-) bitmap captions can only be viewed and not (easily) be used for other purposes such as braille or search indexing
- (+) bitmap captions can be internationalised and given in multiple languages
- (+) you can have total control over the appearance of the text (eg, move it to where it doesn't interfere with viewing, add font style with no need for player support, etc)
- (-) in a digital world where the distribution of additional information can easily be achieved without destroying the format of the original data (both: video and text in this case), such a transfer of captions is not preferred
- (+) compared to open captions, bitmap captions can actually be selectively turned on and off
- (-) being extra data that accompanies a video or audio file, a player needs to be specifically adapted to support display and handling of this kind of data
MULTIPLEX INTO OGG: probably as extra timed bitmap track(s) (or extra video track) - the OggMNG and OggKate codecs will provide for this need
Sign-language
- a sign language is a set of gestures, body movements, face expressions and mouth movements to communicate between people
- a sign language is defined for a specific localised community, e.g. American Sign Language (ASL) or British Sign Language (BSL)
- there are a large number of different sign languages "spoken" around the world
- a sign language does not typically correspond geographically to any particular spoken language (ASL and BSL, for example, are totally unrelated languages even though both communities are surrounded by spoken English)
- there is typically no simple means to automatically translate a spoken-language word into a sign-language representation, because sign languages typically work on ideas and concepts rather than letters and words; in addition, multiple things can be communicated in one sign that may take multiple sentences to communicate in a spoken language
- for most sign languages there is no written representation available - people instead use the written representation of their local spoken language for written communication
- multiple sign-language tracks are necessary to cover internationalisation of sign-language
Discussion:
- (-) sign languages can at this point in time only be represented through a visual recording of a person performing the language, because there are no standards for specifying sign language in avatars
- (-) automated analysis, transcription, or translation between text and sign-language is generally not possible (proprietary approaches: IBM ASR-to-ASL or VCom3D Text-to-ASL)
- (-) being extra data that accompanies a video or audio file, a player needs to be specifically adapted to support display and handling of this kind of data
MULTIPLEX INTO OGG: as extra video tracks, one per sign-language - a specification on how these tracks should be presented (e.g. PIP) and what language they represent should go into the message header fields of the skeleton track
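As a sketch of what such message header fields would need to carry, the per-track information might look roughly like this (the field names are hypothetical and would need to be specified for the skeleton track):

```typescript
// Hypothetical description of a sign-language video track as it could be
// announced in the skeleton track's message headers; names are illustrative only.
interface SignLanguageTrackInfo {
  role: "sign-language";                      // distinguishes it from the main video track
  language: string;                           // e.g. "ase" (ISO 639-3 code for American Sign Language)
  presentation: "picture-in-picture" | "replace-main-video";
  position?: "bottom-right" | "bottom-left";  // suggested PIP placement
}

const aslTrack: SignLanguageTrackInfo = {
  role: "sign-language",
  language: "ase",
  presentation: "picture-in-picture",
  position: "bottom-right",
};
```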
Requirements for visually impaired people
Visually impaired people need a replacement of the image content of a video to allow them to follow the display of a video or any other image-based content (e.g. an image track with slides).
The following channels are available to replace the visual content of an audio or video file:
- haptic representation: braille text
- aural representation: auditive descriptions, text-to-speech (TTS) screen-reader
The following are actual technologies that can provide such channels for audio & video files:
Auditive audio descriptions
- auditive descriptions are descriptions of everything that is happening only visually in a video; this includes e.g. scenery, people & objects appearing, disappearing or moving, facial expressions, and weather
Example: http://www.narrativetv.com/
Discussion:
- (-) an auditive description of the visual scene is restrictive because the description is spoken and not directly accessible other than through speech recognition
- (-) auditive descriptions can only be listened to and not (easily) be used for other purposes such as braille or search indexing
- (-) it can be problematic to synchronise the auditive description with the actual timeline of the video, because the auditive description may sometimes take longer to play than the original video provides a break for
- (-) in a digital world where the distribution of additional information can easily be achieved without destroying the format of the original data (both: text description and the video sync in this case), such a transfer of scene descriptions is not preferred
- (+) it is possible to play back the auditive description on a different audio channel from the rest of the video's audio and to speed it up independently during playback
- (-) being extra data that accompanies a video file, a player needs to be specifically adapted to support playback and handling of this kind of data
MULTIPLEX INTO OGG: as extra audio track, with editing of breaks into the original a/v file (preferably Speex encoded, since it is a pure speech track)
Textual audio descriptions
- textual representation of an audio description
Layout:
- textual audio descriptions are not displayed as text, since they are aimed at blind users
- a textual scene description can be made accessible in a multitude of ways, e.g. through a TTS & screenreader, through braille
Discussion:
- (+) the textual description can be rendered aurally (e.g. through TTS) on a different audio channel from the rest of the video's audio and sped up independently during playback
- (+) the textual scene description can be used for other purposes, such as search indexing
- (+) the textual scene description together with captions may coincide in many ways with the annotation needs of archives
- (-) the textual scene description may be too unstructured to be usable as a semantic representation of video
- (-) being extra data that accompanies a video file, a player needs to be specifically adapted to support playback and handling of this kind of data
MULTIPLEX INTO OGG: as extra timed text track
Requirements for internationalisation
subtitles
- a transcription of what is being said, translated into different languages
- same options as open / closed text / closed bitmap captions above
Recommendation: closed text as timed text track
dubbed audio tracks
- a translation of what is being said into a different language, provided as audio
MULTIPLEX INTO OGG: as extra audio tracks (could be Speex, if the sound background is kept in a separate file and both are played back at the same time) - there needs to be a specification in the skeleton track about which language each track represents and that it replaces the main audio track if selected
Requirements for entertainment
karaoke
- remove spoken words from audio track and add timed text track
- similar to subtitles with style information
Layout:
- karaoke text is rendered anywhere over the video screen
- karaoke texts often overlap in time, e.g. two lines of text are used alternately to display the next line of lyrics
- karaoke text sometimes scrolls
- karaoke text is progressively coloured as the words are being sung
- karaoke text is sometimes coloured differently for different people
MULTIPLEX INTO OGG: as timed text track
ticker text
- news often have a ticker text at the bottom of the screen
- ticker text always stays at the same position and scrolls through horizontally
- the ticker text often has small ads in it, too
- there are often graphics included
MULTIPLEX INTO OGG: as timed text track
image maps
- advertisers in particular want to mark regions on online video with hyperlinks and some scroll-over text, which is linked to the region
- there is the active region and there is the region for the scroll-over text
- there can be text & graphics in the scroll-over text box
- can be placed anywhere on top of the video screen
- the active region is typically empty and lights up upon scroll-over
MULTIPLEX INTO OGG: as timed text track
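As an illustration, an image map cue of the kind described above could be modelled roughly as follows (coordinates, names and structure are made up for this sketch):

```typescript
// Illustrative sketch of a timed image-map cue with an active region and an
// associated scroll-over text box; coordinates are fractions of the video frame.
interface Region {
  x: number; y: number;            // top-left corner, 0..1 relative to the frame
  width: number; height: number;
}

interface ImageMapCue {
  startTime: number;               // seconds during which the region is active
  endTime: number;
  activeRegion: Region;            // typically empty, lights up on scroll-over
  href: string;                    // hyperlink attached to the region
  rolloverBox?: Region;            // where the scroll-over text/graphics are shown
  rolloverText?: string;
}

const adCue: ImageMapCue = {
  startTime: 30,
  endTime: 45,
  activeRegion: { x: 0.7, y: 0.1, width: 0.25, height: 0.2 },
  href: "https://example.com/product",
  rolloverText: "Find out more about this product",
};
```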
Requirements for archiving
metadata & semantic annotations
- (structured) annotations that describe chapters, scenes and similar semantics
- similar to audio description, but tends to be structured along different fields
- may also be a random note left by somebody to leave a comment
Discussion:
- there are different types of annotations that are typically handled differently
- RDF-type annotations, for example, are not rendered as text, but are available for machine processing only
- chapter markers and similar structural subdivisions are mostly used for navigation
- visible annotations are short pieces of text that are attached through speech/thought bubbles, or as notes
- visible annotations require a time range, on-screen position, text styling, and annotation box styling
- visible annotations could also contain graphics, but mostly simple formatting such as for notes is required
MULTIPLEX INTO OGG: as timed text track
story board
- sequence of images that represent the original way in which the video was scripted
- if it is possible to align it with the resulting video, provides insights into the video planning and production
- if it cannot be time-aligned with the video, it should be referenced separately (text with url on webpage)
MULTIPLEX INTO OGG: as timed image track
transcript
- textual description of everything that happens in the video
- basically similar to what captions and audio annotations provide together
- it is probably better to keep the different types of annotations separate as much as possible and to call the complete set of textual annotations available for an audio/video file its "transcript"
- if it cannot be time-aligned with the video, it should be referenced separately (text with url on webpage)
Discussion:
- transcripts are mostly not displayed on top of the video, but in parallel to it, with the current text block highlighted in sync with the video
- rich formatting is possible
- wiki-style editing is often encouraged
- often there are hyperlinks in transcripts
MULTIPLEX INTO OGG: as timed text track
script
- original description of the video for production purposes
- if it is possible to align it with the resulting video, provides insights into the video planning and production
- if it cannot be time-aligned with the video, it should be referenced separately (text with url on webpage)
Discussion:
- same as transcript
MULTIPLEX INTO OGG: as timed text track
lyrics
- for music files, a special type of transcript / subtitle format has evolved - the lyrics .lrc file format
- this is also a text codec and can be time-aligned with audio / video
Discussion:
- same as transcript
MULTIPLEX INTO OGG: as timed text track
titles / credits / on-screen text
- most videos have titles and/or credits at the start/end respectively
- sometimes there are statistics or scoreboards or other on-screen texts in the video, which could also be provided as normal text
- where it is possible to make these available as separate text rather than burnt in (i.e. open), they should be made part of a transcript
Discussion:
- same as transcript
MULTIPLEX INTO OGG: as timed text track
linguistic research markup
- linguists like to mark up audio (and video) by single words or even by syllables, linking them back to the original audio
- some markups have different hierarchical segmentations - by syllable, by word, by sentence, by paragraph
Discussion:
- this is a fairly fine-grained temporal markup, but otherwise no different to a subtitle or caption markup
MULTIPLEX INTO OGG: as timed text track
Functionalities required around additional data tracks & audio / video
Access to text from Web page
- Any text multiplexed inside an audio / video file should be exposed to the Web page, so that live region semantics can be used and assistive technologies are notified when there is new text.
Solutions:
- One way to get the text into the Web page is through inclusion into the DOM as subelements of the <video> or <audio> element.
- Another way would be to provide a JavaScript API for the text elements.
Discussion:
- Exposing timed text to scripts as DOM nodes is an interesting idea.
- However, exposing them as direct subelements is probably not the way to go for two reasons:
1) The general expectation is that the DOM tree is the parse tree of one HTTP resource--that is, there's one Document object per HTTP resource.
2) For security reasons, allowing a video file to cause node insertion into an HTML DOM tree seems like a potentially dangerous thing.
- If a DOM tree representation is natural for the captioning format, it should probably be available under a separate Document object like iframe content.
- OTOH, the DOM is rather heavyweight, so if the caption renderer isn't DOM-based, it would probably be better to expose a lighter specialized JS API.
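As a sketch of what the lighter, specialised JS API could look like in combination with the live region semantics mentioned above (the "captioncue" event and its payload are hypothetical, not an existing API):

```typescript
// Sketch only: a player surfaces active caption text through a custom event,
// and the page mirrors it into an ARIA live region so assistive technologies
// are notified of new text. The event name and shape are hypothetical.
const video = document.querySelector("video") as HTMLVideoElement;

// A visually hidden live region that screen readers announce politely.
const liveRegion = document.createElement("div");
liveRegion.setAttribute("aria-live", "polite");
liveRegion.style.position = "absolute";
liveRegion.style.left = "-10000px";     // off-screen but still in the accessibility tree
document.body.appendChild(liveRegion);

// Hypothetical event fired by the player whenever a new timed text cue becomes active.
video.addEventListener("captioncue", (event: Event) => {
  const detail = (event as CustomEvent<{ text: string }>).detail;
  liveRegion.textContent = detail.text; // updating the live region triggers an announcement
});
```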
Specification needs
- Timed text tracks need to:
- specify the category of timed text (e.g. caption, metadata etc)
- specify their primary language at the start (e.g. lang="en")
- specify their text encoding at the start (e.g. UTF-8)
- if there are sections given in a different language they need to be specified as such
- text segments should be allowed to overlap in time
- it must be allowed to have time segments without text (discontinuous codec)
- there should be a way to specify adaptable layout of overlay text, in particular relative positioning
- there should be a way to specify mathematical formulae
- Audio tracks should:
- specify their primary language in the header
- Sign language video tracks should:
- specify their primary language in the header
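A minimal sketch of a data model that would satisfy the timed text requirements above (the names are illustrative and not taken from any existing text codec):

```typescript
// Illustrative data model for the specification needs listed above.
interface TimedTextTrackHeader {
  category: "caption" | "subtitle" | "textual-audio-description" | "metadata" | "other";
  language: string;                 // primary language of the track, e.g. "en"
  encoding: "UTF-8";                // text encoding declared at the start
}

interface TimedTextSegment {
  startTime: number;                // seconds; segments are allowed to overlap in time
  endTime: number;
  text: string;
  language?: string;                // overrides the track language for this segment only
  position?: { relativeX: number; relativeY: number };  // relative/adaptable overlay layout
}
// Gaps between segments are allowed, i.e. the codec is discontinuous.
```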
Access to structure of video / audio from Web page
- Similarly to the access of text, it is important to be able to access the structure of the multiplexed text through the Web page.
- This is of particular importance for blind users that cannot use the mouse to access random offset points in videos but need to rely on tabbing.
- Video (in particular video that has audio descriptions) needs to have a structure that can be accessed via tabbing through it, to allow (blind) people to directly jump to areas of interest
Discussion:
- (+) providing video with a structure for direct access to fragments is useful not just to blind people
- (+) audio should also be able to be structured in this way
Solutions:
- Both options, DOM and JavaScript API, are possible.
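As a sketch of the kind of keyboard-accessible structure navigation meant here, assuming the chapter list has already been extracted from the multiplexed structure track:

```typescript
// Sketch: expose chapter-style structure as focusable buttons so a (blind)
// user can Tab to a point of interest and jump there directly.
interface Chapter { title: string; startTime: number; }

function renderChapterNavigation(video: HTMLVideoElement, chapters: Chapter[]): void {
  const nav = document.createElement("nav");
  nav.setAttribute("aria-label", "Video chapters");
  for (const chapter of chapters) {
    const button = document.createElement("button");   // buttons are reachable via Tab
    button.textContent = chapter.title;
    button.addEventListener("click", () => {
      video.currentTime = chapter.startTime;            // jump directly to the fragment
      video.play();
    });
    nav.appendChild(button);
  }
  video.insertAdjacentElement("afterend", nav);
}
```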
Text Display
- Text available in association with a video should not be hidden from the user but made available.
- Text that is displayed in a Web page needs to be stylable.
- Sometimes there is more than one text region to be defined at one time.
Solutions:
- Define default display mechanisms for the different types of text.
- Define for the different text types which are open by default, and which can be toggled on.
- For text that can be toggled on, default icons need to be specified.
- Use style sheet mechanisms and style tags for displayable text - maybe select a simple subpart of CSS to provide styling.
Dynamic content selection
- The most suitable content for a video / audio file should be made available to the user in accordance with the user's abilities and preferences.
- This includes the ability to have the user agent ask the Web server about the available content tracks for a video.
- This also includes the ability to have the user agent tell a Web server the specific preference settings of a user, e.g. the suppression of video content for a blind user, so the Web server can adapt the content and provide only the tracks that the user prefers.
- Another example where content may need to be dynamically extended with a11y text is where a caption author may provide the captions for a video that he does not host, but his server would be able to multiplex the two together and deliver them to a user agent.
Solution:
- The Web server needs to have the ability to dynamically compose a video together from its constituent tracks according to a user agent's request. The different tracks may possibly be reused from different Web servers.
- There needs to be a protocol (possibly a HTTP extension) that allows user agent and Web server to communicate about the composition.
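A sketch of what such a negotiation could look like from the user agent side; the query parameters and the track listing format are entirely hypothetical, since the actual protocol (e.g. an HTTP extension) still needs to be specified:

```typescript
// Hypothetical negotiation: list available tracks, filter by user preference,
// then request a dynamically composed file with only the wanted tracks.
async function fetchPreferredComposition(baseUrl: string): Promise<Response> {
  // 1. Ask the server which tracks are available for this resource.
  const listing = await fetch(`${baseUrl}?action=list-tracks`);
  const tracks: { id: string; kind: string; language: string }[] = await listing.json();

  // 2. Apply the user's preferences, e.g. a blind user drops the main video
  //    and keeps the audio plus the textual audio description.
  const wanted = tracks.filter(t => t.kind !== "video").map(t => t.id).join(",");

  // 3. Request a composition containing only those tracks.
  return fetch(`${baseUrl}?tracks=${encodeURIComponent(wanted)}`);
}
```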
Direct access to fragments
- In search results, in particular search results on large audio / video files, the results need to point to the fragments that actually contain the content that relates to the query.
- This makes content more accessible to any user, including disabled users.
- When the fragment is selected for playback, it should only include the fragment and not the full video.
Solution:
- A URI mechanism to address the fragment is required (see Media Fragments WG at W3C).
- A Web server needs to be enabled to deliver just the subpart of a video / audio that is requested (see Metavid for an example).
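For temporal addressing, the Media Fragments work allows a time range to be appended to the URI; a search result could then link straight to the matching interval, roughly as sketched below (the search-hit structure is made up for this example):

```typescript
// Build a link from a search hit to just the matching interval of the video,
// using the temporal fragment syntax (#t=start,end in seconds).
interface SearchHit { videoUrl: string; start: number; end: number; snippet: string; }

function fragmentLink(hit: SearchHit): HTMLAnchorElement {
  const a = document.createElement("a");
  a.href = `${hit.videoUrl}#t=${hit.start},${hit.end}`;  // e.g. video.ogv#t=120,140
  a.textContent = hit.snippet;
  return a;
}
```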
Hyperlinking out of audio / video
- The power of the Web comes from hyperlinks between resources; data that has no outgoing hyperlinks is essentially dead data on the Web.
- Hyperlinks from within a/v to other Web resources allow users of varying abilities to link to related information and return back for further viewing.
- Imagine, as a blind user, navigating a Web of audio files - hyperlinks would greatly improve the usability of the Web for blind users, who are still relatively underrepresented on the Web because of the dominantly visual nature of Web browsing.
Solution:
- One option is to enable the inclusion of <a href ..> into all timed text. The problem with that approach is that if the text is displayed as e.g. a caption inside a video, it appears and disappears too quickly for people to select and activate the link (even if they use 'tab' to get there).
- An alternative is to enable the inclusion of a hyperlink for a fragment of time and display in some special way that a hyperlink is present. (see e.g. CMML for an example)
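A sketch of the second option: the hyperlink is attached to a fragment of time rather than to fast-disappearing caption text, loosely modelled on what CMML clips provide (the structure and names are illustrative):

```typescript
// Illustrative structure for a hyperlink that is valid for a time range.
interface TimedHyperlink {
  startTime: number;    // seconds during which the link is active
  endTime: number;
  href: string;         // target resource
  label: string;        // what to display/announce while the link is active
}

// While playback is inside the range, the player can show a persistent,
// focusable indicator (e.g. a button) instead of a transient caption link.
function activeLinks(links: TimedHyperlink[], currentTime: number): TimedHyperlink[] {
  return links.filter(l => currentTime >= l.startTime && currentTime < l.endTime);
}
```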
Specification language
A text codec, as required for the specification of captions, subtitles, textual audio descriptions and annotations, needs a textual markup language for authoring. A few opinions on the choice of markup language are floating around and are captured here.
- Typically, for Web purposes, an XML-based language - preferably even an HTML-like language - is preferred.
- If a subset of HTML is selected as a choice for the markup language, parsing and rendering engines of Web browsers can be re-used.
- XML-based markup of text codecs, however, often renders the text specification unreadable. Examples: Kate chose a C-like tree syntax that was easy to parse with lex & yacc; RDF has the N3 notation or a RELAX NG notation as alternatives to RDF-XML; srt and other simple subtitle/caption formats avoid XML
- When encapsulating text codecs into a binary audio/video file such as Ogg, where compression counts above everything else, a "talkative" XML codec may not be the best choice.
Solution:
- For specification languages, it may be possible to provide both an XML and a non-XML specification (similar to RDF).
- For encapsulation (e.g. into Ogg), it may be best to provide the encapsulation framework in a denser way than XML, but to allow XML fragments to appear as codec pages.
- For decoding into a Web application / browser, there must be a simple way to provide the text codec in a DOM structure.