Unsolicited Advice from Tiffany B. Brown

Subtitles and captions with WebVTT

One drawback of HTML5 multimedia is accessibility. For hearing-impaired users, audio and video content is nearly useless without an alternative. This is where the track element and WebVTT come in handy.

What WebVTT looks like when combined with `@font-face`.

WebVTT, short for “Web Video Text Tracks”, can be used to provide timed subtitles and/or captions for multimedia content. WebVTT files are plain text, but must be served with a Content-Type header of text/vtt.

Even though it's plain text, WebVTT does adhere to a special format. The first line must be WEBVTT, and it must be separated from the series of cues that follows by a blank line. Each cue is made of a start time, an end time, and some text: subtitles, translated dialogue, or a description of background audio. Below is an example of dialogue from an excerpted clip of Nina Paley's Sita Sings the Blues.


WEBVTT

00:00:05.000 --> 00:00:11.000
When? I don't remember what year. There's no year.
How do you know there's a year for that?

00:00:11.000 --> 00:00:12.800
I think they say the 14th century.

The line containing the --> arrow marks the boundaries of our first cue. It starts 5 seconds into the clip and ends at 11 seconds. During that time, the text beneath it will appear on screen. Cues must be separated by a blank line. Our next cue begins at 11 seconds and ends at 12.8 seconds.

Cues and CSS

In Chrome, Safari on iOS 7, and Opera 16+, we can style our cues using CSS and the ::cue pseudo-element.

::cue {
    font: 18px / 1.5 verdana, sans-serif;
}

Firefox and Internet Explorer don't support this yet. Firefox's support for ::cue is in progress. I assume the same is true of Internet Explorer.

Simple WebVTT Markup

WebVTT supports a handful of HTML-like tags plus a few elements of its own. We can bold or italicize text using <b> and <i> elements. It's also possible to specify the language of a particular snippet of cue text using the <lang> element, as in <lang fr>Bonjour</lang>.

Perhaps most useful is the ability to mark up different speakers using voice elements, which are written with the <v> tag.

00:00:16.500 --> 00:00:20.499
<v Man1>That's when the Moguls were ruling. Babur was in India.</v>
<v Woman>The 11th then...</v>

Then we can style them using the ::cue pseudo-element as a function.
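
A minimal sketch of what that can look like, assuming the Man1 and Woman voice names from the cue above. The colors are arbitrary, chosen only for illustration, and the rules are wrapped in a style element so the snippet can be pasted into the page that holds the video.

```html
<style>
  /* Applies only to text inside <v Man1>...</v> */
  ::cue(v[voice="Man1"]) {
    color: #ff0; /* illustrative color, not taken from the demo */
  }

  /* Applies only to text inside <v Woman>...</v> */
  ::cue(v[voice="Woman"]) {
    color: #0ff; /* illustrative color, not taken from the demo */
  }
</style>
```

The argument to ::cue() is a selector matched against the tags in the cue text, so ::cue(b) or ::cue(i) would target bold or italic runs in the same way.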



Browser support for this is still a bit scattershot, though. Chromium and its derivatives (Chrome and Opera) have the most robust support for WebVTT features. Those browsers support most of WebVTT's tags, and allow the most control over the appearance of captions and subtitles with CSS. Chromium-based browsers even support using @font-face with WebVTT cues.
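
As a rough sketch of the @font-face case, assuming a hypothetical font file named caption-font.woff2 and a made-up family name, CaptionFont (neither comes from the demo):

```html
<style>
  /* Hypothetical web font; the family name and file path are placeholders */
  @font-face {
    font-family: "CaptionFont";
    src: url("caption-font.woff2") format("woff2");
  }

  /* Use the web font for every cue, falling back to a generic family */
  ::cue {
    font-family: "CaptionFont", sans-serif;
  }
</style>
```

Browsers that don't apply web fonts to cues should simply render the cue text in their default caption font.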

Using with <track>

To use WebVTT with the track element, set the src attribute to the path of your WebVTT file. By default, track elements are treated as subtitles. If you'd like the browser to treat them as captions, set the kind attribute to captions. Though a label isn't required, some browsers (notably Internet Explorer) will display less-than-helpful defaults, so give your label a descriptive name.

<track kind="captions" srclang="en-US" src="dialogue.vtt" label="English">

The srclang attribute is only required when kind="subtitles". Without it, subtitles won't work. The value of srclang should be a BCP 47 language code. We've included it here even though our track is a captions track. Safari will prioritize the srclang as a label when both are present.
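
Putting the pieces together, a video with both a captions track and a subtitles track might be marked up like this. It is only a sketch: clip.webm and dialogue-es.vtt are hypothetical file names, and the default attribute is an optional extra that asks the browser to enable that track unless the user's preferences say otherwise.

```html
<video controls src="clip.webm">
  <!-- Captions track: srclang is optional for this kind -->
  <track kind="captions" srclang="en-US" src="dialogue.vtt" label="English captions" default>
  <!-- Subtitles track: srclang is required for this kind -->
  <track kind="subtitles" srclang="es" src="dialogue-es.vtt" label="Español">
</video>
```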

Most browsers support the track element for the video element only. For audio, there are two options. Either:

  • include a text transcript in the document that embeds your audio; or
  • play the audio file through the video element instead of the audio element
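
The second option can be sketched as follows, assuming a hypothetical audio file named interview.mp3 and a caption file named interview.vtt. The explicit width and height are an assumption too: a video element playing audio only falls back to a default box size, so setting dimensions gives some control over the area where cues are drawn.

```html
<!-- A video element can play an audio-only file; cues render inside its box -->
<video controls src="interview.mp3" width="480" height="120">
  <track kind="captions" srclang="en" src="interview.vtt" label="English" default>
</video>
```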

You can see how it all comes together in the related demo.

Want more?

I cover HTML5 audio and video, as well as the ins and outs of WebVTT, in Jump Start HTML5 Multimedia from Learnable and SitePoint.