SGML for captioning, audio description, subtitling, and dubbing: Who needs it? And who cares?

I've identified what I consider a grave need for Standard Generalized Markup Language document-type definitions (SGML DTDs) to handle four types of accessible media-- captioning, audio description, subtitling, and dubbing.

I assume here that you know what SGML is but are not up to speed on what those four media are. (One useful link for SGML information is sil.org. Also try SoftQuad and the newsgroup comp.text.sgml.)

First, definitions:

Captioning

Rendering dialogue and other sounds in written words. Sign language has nothing to do with captioning.

Closed-captioning

Captions transmitted in the form of a code. You need a decoder (or, more likely, just a decoder chip) to turn the captions into visible words. Nearly all North American TVs carry decoder chips as standard equipment now.

Open-captioning

Captions that are an indelible part of the picture and are always visible. (Open-captioning effectively does not exist. It is assumed, on the basis of no scientific studies whatsoever, that hearing people simply will not tolerate open-captioning.)

Note: Captioning and subtitling have as little in common as bicycles and motorcycles. Three big differences are:

Captions are in the same language as the audio (with relatively rare exceptions).
Captions denote sound effects meaningful to the narrative or to understanding the program.
Captions move to indicate the position of the speaker.

Subtitles are a translation, ignore sound effects, and are always located in the same spot on-screen.

Audio description

Rendering visual details in a spoken narrative. In audio description, a special narrator succinctly describes action, settings, facial expressions, onscreen graphics, clothing, and other visual details. The narrator speaks out loud; A.D. is an auditory medium, not a visual one. Narrators typically speak during pauses in dialogue or at other appropriate moments, but sometimes they narrate over dialogue, over music, and so on.

How does this relate to information technology and SGML? Some facts to consider:

TV closed-captioning of prerecorded programs in North America is done using any of several rather primitive DOS programs. Real-time captioning of live programs typically uses the same software and hardware with the addition of a very skilled court reporter who enters dialogue into a stenotype machine (along with other annotations necessary to captioning). (That's stenotypy, not typing on a QWERTY keyboard. Stenotypists use compact keyboards that require pressing as many as a dozen keys, sometimes more, simultaneously to produce a word. For further information, look at Gary Robson's captioning FAQ.) Those entries are in shorthand and are then translated into actual words via lookup tables. (This means that homophones like "four," "for," "fore," "IV," and "4" require distinct keystrokes. It's not exactly easy keeping track of all those keystrokes, which number in the hundreds, any of which could come up at any time in dialogue being captioned.) The words are then sent out for display on a decoder-equipped TV.
Closed-captioning in North America is encoded on Line 21 of the vertical blanking interval. The VBI is a narrow band of picture lines, all of which are normally invisible, positioned between the bottom and the top of the TV picture. (That's not a totally accurate description, but if you have a TV with a vertical-hold control, you can set the picture rolling slowly and see the VBI as a mostly-black bar between the top and bottom of the picture.) North American TV signals are made up of 525 lines (again, not totally accurate); the top 21.5 lines are in the VBI and are ordinarily invisible. (They're not magic. They're perfectly visible if you look for them. It's just that new TV sets are adjusted to keep the VBI out of sight.) Captions are encoded on line number 21 of those 21.5 lines. The caption codes are relatively wide rectangles of light that flit back and forth. Home VCRs have no trouble recording and playing those signals.
CC in PAL-standard countries like most of Europe and Australia comes about as an offshoot of the World System Teletext technology. You just tune to a certain page of teletext (888, usually) and you suddenly see captions on any captioned show. This technology uses several lines of the VBI; all the encoding takes the form of tiny dots in the VBI which are too small for anything but Super-VHS VCRs to record. This is a severe limitation, but there are some provisos to it.
Typography in both the Line 21 and WST systems is crap. Megacrap, even. Fonts are not under the control of anyone with typographic knowledge or training. Characters are generated only in the decoder or decoder chip, and decoder or chip manufacturers decide what the font will look like. There is no known case in which TV or chip manufacturers have contracted with qualified type designers to create caption fonts.
In all cases, we are talking about fonts reminiscent of dot-matrix printers circa 1982. Most fonts in Line 21 systems do not offer descenders in the lowercase gypqj, making the lowercase so poorly readable that, since Day 1 of closed-captioning, captioners have used uppercase for nearly all text even though uppercase is also hard to read. In Line 21, activating or deactivating italics, underlining, or the like inserts a space. Italics simply are not available in PAL World System Teletext captioning. Alignment in Line 21 systems is poor but, by industry agreement, by the year 2002 captioners will have available to them new codes that will permit niceties like true centering and right justification. (For further information on this topic, which I should really write a full treatise about, check my article "Typography and TV Captioning," Print, January/February 1989. Also look at the bibliography of captioning articles I've written.)
Captioning is a huge industry. Effectively all prime-time shows on all U.S. and Canadian networks, virtually everything remotely resembling a newscast, many daytime shows, thousands of home videos, most national commercials, lots of music videos, training tapes, and more are captioned. This is a source of money and a source of intellectual property. But the tools being used for captioning are primitive. (Also, caption quality is generally poor. Don't let anyone tell you otherwise.)
Audio description on TV is relatively rare. PBS is the biggest source of A.D.; described programs carry a mix of descriptions plus main audio in the Second Audio Program subchannel of stereo TV. (If you have a stereo TV-- most midrange to high-end models are stereo-- you can set your TV to SAP. It won't do you much good, though, on everyday programs that don't use audio description-- only a few stations broadcast in stereo and virtually none use SAP.) The descriptions, then, are "closed": You needn't be bothered with them unless you want to be. Unfortunately, while all TV signals have a VBI, not all have SAP, so audio description is not a ubiquitous medium the way CC is.
WGBH, the Boston PBS Überstation, is a dynamo in access technology. It is home to the Caption Center (oldest captioner on earth, and the best, though their standards are slipping), the Descriptive Video Service (does A.D. for PBS and other clients, and also sells a small home-video line of movies with always-audible descriptions), and the National Center for Accessible Media, which researches new technologies, like Web captioning and captioning in movie houses. Even these people aren't really thinking all that broadly about the potential of access technologies, though again that has many provisos.
To caption a prerecorded program, you transcribe it. Usually the captions are an edited version of that transcript-- reading is slower than speaking, and there are speed limits to caption transmission-- but if you retained a verbatim transcript with all proper annotations of sound effects (phone ringing, thunder, etc.) and speaker identification, among other structural issues clearly amenable to SGML encoding, suddenly you have a viable text-only analogue of an audiovisual program.
It gets better: Audio description typically happens during pauses in dialogue. A.D. scripts, then, are quite short-- up to 100 or 200 bursts of narration. However, it's possible to describe a whole program nonstop, and in fact one project I'm working on will do just that. If you unite either or both of these A.D. scripts (i.e., conventional and continuous description scripts) with the CC script, suddenly you have a rich and complete text-only approximation of an audiovisual program.
What can you do with that information? Archive it, either on the Web or your own computer or elsewhere. Monitor it continuously for keywords. (It is believed that the NSA has done exactly that for years.) Use it for people who don't want to wait 20 minutes to download a choppy videoclip from a Web site. And, of course, use it for its intended purpose, access.
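
To make the idea of a text-only analogue concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: no such DTD exists yet, so the element and attribute names (program, caption, sound, description, start, end, speaker) are hypothetical, and XML stands in for SGML only because Python's standard library can parse it. The sketch merges captions, sound-effect annotations, and audio descriptions into one time-ordered transcript and then searches that transcript for a keyword.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: these element and attribute names are invented
# for this sketch and do not come from any existing DTD.
SAMPLE = """
<program title="Example">
  <caption start="1.0" end="3.5" speaker="ANNA">Did you hear <emph>that</emph>?</caption>
  <sound start="0.5" end="1.0">phone ringing</sound>
  <description start="4.0" end="6.0">Anna crosses to the window.</description>
  <caption start="6.5" end="8.0" speaker="BEN">It's only thunder.</caption>
</program>
"""

def build_transcript(markup):
    """Return a time-ordered list of (start_seconds, text_line) tuples."""
    root = ET.fromstring(markup.strip())
    events = []
    for el in root:
        start = float(el.get("start"))
        # itertext() flattens nested markup such as <emph> into plain text.
        text = "".join(el.itertext()).strip()
        if el.tag == "caption":
            line = f"{el.get('speaker')}: {text}"
        elif el.tag == "sound":
            line = f"[{text}]"
        elif el.tag == "description":
            line = f"(Description: {text})"
        else:
            continue  # ignore elements this sketch does not know about
        events.append((start, line))
    return sorted(events)

def find_keyword(events, keyword):
    """Return the start times of every transcript line containing keyword."""
    kw = keyword.lower()
    return [start for start, line in events if kw in line.lower()]

transcript = build_transcript(SAMPLE)
for start, line in transcript:
    print(f"{start:6.1f}  {line}")
```

Calling find_keyword(transcript, "thunder") returns the start times of the matching lines, which is the kernel of both archiving and keyword monitoring. A real DTD would have to settle much more than this, such as caption positioning and the reserved uses of italics.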

Where research is needed:

SGML. Markups for everything from what takes the overt form of italics (which have reserved functions in captioning along with all the regular uses of italics in print) to speaker IDs to caption-on and -off times to various annotations for A.D. tracks are all needed. How is this useful? Really sophisticated captioning/A.D. software could be developed. More relevantly, existing nonlinear video-editing systems (à la Avid, Scitex, and Media 100) and programs like Premiere and Acrobat could be extended to understand SGMLified access codes. This same development process would have to encompass subtitling and dubbing, too, which I am not talking a whole lot about here.
Also, if captions were stored as part of an SGML structure, they could be automatically reformatted in real time for different display devices, like an LED screen (with a character set different from TV and/or inverted for viewing in a mirrorized display), TV pop-up captions, TV scroll-up captions, a continuous text-only stream without paragraph and caption breaks destined for computers, or an offscreen large-print display for visually-impaired viewers. Or captions created with one software package could be read and understood by another-- or another country's system. Right now it is quite tedious to reformat Line 21 CC for PAL CC, and there are various typographic issues that come up here.
Web access. Trying to educate Webmasters that the WWW is not an excuse to post pretty pictures is a battle we've already lost. But making those graphics accessible is possible; the WGBH site shows some preliminary techniques, particularly the use of offboard descriptions (look for the D links). It's also possible, though currently difficult, to make Web-based audioclips and videoclips accessible. However, in the absence of standards like SGML, there is no way to define the data types necessary for access and no way to make such data interoperable and readily translatable. Worse, most software used in the creation and playback or display of graphics, audioclips, and videoclips offers no provisions at all for access technologies.
Subtitling and dubbing are the norm outside English-speaking countries and are not unheard-of within those countries. Both subtitling and dubbing can be found in the same movie; it is then possible to caption subtitled and/or dubbed movies, and also to audio-describe them. But subtitling and dubbing rely on analogue techniques, like title cameras, typescript, and recording-studio sessions. Apart from the fact that both techniques are badly in need of automation, if SGML DTDs existed for subtitling and dubbing it would be easier to create derivative versions and to archive and otherwise make use of the resulting data.
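
As a rough illustration of reformatting one set of stored captions for different display devices, here is a short Python sketch. The caption records, the function names, and the row widths (32 columns for a TV pop-up caption, 16 for a large-print display) are all assumptions chosen for the example, not figures from any standard or existing product.

```python
import textwrap

# Hypothetical caption records: (start_seconds, end_seconds, speaker, text).
CAPTIONS = [
    (1.0, 3.5, "Anna", "Did you hear that noise coming from the basement just now?"),
    (6.5, 8.0, "Ben", "It's only thunder."),
]

def popup_rows(caption, width=32):
    """Wrap one caption, with a speaker ID, into rows for a TV pop-up display.

    The 32-column default is an assumption made for this sketch.
    """
    _, _, speaker, text = caption
    return textwrap.wrap(f"{speaker.upper()}: {text}", width=width)

def large_print_rows(caption, width=16):
    """Wrap the caption text into much shorter rows for a large-print display."""
    return textwrap.wrap(caption[3], width=width)

def text_stream(captions):
    """Join all captions into one continuous stream with no caption breaks."""
    return " ".join(text for _, _, _, text in captions)

for row in popup_rows(CAPTIONS[0]):
    print(row)
print(text_stream(CAPTIONS))
```

The point of the design is that the stored captions never change; each display device gets its own small formatting function. Converting between caption systems in different countries would involve far more than line wrapping, so treat this as a starting point only.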

So: I am interested in setting up a working group to create DTDs for only the four access technologies I mentioned. SoftQuad isn't interested. Is anyone else? Let me know. With sufficient interest, I may set up a mailing list to work on these topics; in the interim, consider subscribing to the Media Access mailing list, where we discuss all manner of topics related to captioning, audio description, and other means of making media of information accessible.
