The First Complete Guide to YouTube Captions (Old SRV3 Format)

Jan 10, 2020

UPDATE NOTES (22 Jan): The Sunday after this story was published, YouTube quietly rolled out a new caption format, dubbed JSON3. It’s not user-editable, and as of this edit, the implementation has completely broken text colors in an attempt to restrict them to an 8-bit teletype palette. It’s a bit more user-readable, though.

Closed captions are an often overlooked part of the YouTube experience.

On one hand, they’re used by communities to share in-jokes and comment on the action of a video. On the other hand, they’re treated by professional studios as an obligation, a leftover of FCC and international regulations. And for deaf and hard-of-hearing audiences, it can be frustrating to see captions used for trolling or commentary.

As a frequent contributor and occasional evangelizer of captions on YouTube, I’ve done my part to keep a decent level of standards, but today I’m happy to announce and share a breakthrough.

Prelude

It started when I watched the music video to No Doubt’s “Just a Girl”.

It wasn’t the first video I watched with a different caption layout on YouTube, but it was the one where my interest finally clicked. By that point, I was already used to opening the Web Inspector to see what the extra files requested by the page looked like.

When I dug up the caption file, it was like finding the entrance to a treasure trove of secrets, of which only a few gold nuggets shone forth here.

From there, I brought up an already-uploaded test video I had on hand, and looked at YouTube’s canonical list of supported caption file formats. They recommend the Scenarist (SCC) format, which uses a horrifyingly dense implementation of the original analog TV captions and formatting.

But with it, I got a way in… and a nice reward for the effort.

After a month of experimenting and reverse engineering from that point on, I now present to you the first complete guide to YouTube captions.

Why Do This Anyway?

Closed captions are an immensely vital part of the audio-visual experience. There’s a reason we’ve had decades of regulatory oversight to make sure captions are a guaranteed part of broadcast television.

For one, in the United States alone, people with hearing loss make up 20% of the population. More stats are available here. They also get very vocal when a video’s captions are subpar or misleading. Suffice it to say, captions are a public good for your fellow viewers.

On top of that, the normal set of captions YouTube provides is generally below government standards, which include repositioning captions to dodge on-screen text, and in the case of the United Kingdom, assigning unique colours to individual on-screen voices.

Finally, there is an argument to be made that within reason, a caption expert might use their own style guide and techniques to enhance the experience.

What Does YouTube Actually Use?

So first off, let’s talk about how captions get made in the first place, and no better place to start than the caption editor.

I use the caption editor religiously to begin and tweak work on captions, or use the one at Amara when the channel in question hasn’t opened up contributions yet. Both editors are pretty basic and straightforward, and they don’t give much leeway for formatting by themselves. In the editor itself, you can upload a file to circumvent those limits, but then you have to consult the aforementioned format list, and it’s… not all that helpful.

As a sampling, you have SubRip (.srt) files, which are incredibly basic, then you have WebVTT (.vtt), which is basically just SubRip with extra formatting, and at the extreme end, you have SCC. Seriously though, don’t bother with SCC unless you’re in television already or willing to deal with converters every time you make an edit.

Behind the scenes, YouTube converts any supported input you give it (with some holes in certain format parsers), and it turns out that it spits out a format only marginally recognizable to anyone in the caption world.

There’s a format called Timed-Text Markup Language (TTML), and it’s an XML-based format with a standardized tag set for captions.

Well… I say standardized.

There’s a series of W3C recommendations for the TTML format, and YouTube also suggests the closed-spec iTunes Timed Text (iTT) version. But none of these are what YouTube actually ends up with.

Internally, their adapted TTML format is called SRV3, and once I share it with you, you’ll wonder how we ever got along without it. Here goes.

Reference Guide

Here’s a bare-bones sample file written with SRV3:

<timedtext version="3">
<head>
<pen id="1" b="1" fc="#FF0055" />
</head>
<body>
<p t="4050" d="1020"><s p="1">Love it or leave it!</s></p>
</body>
</timedtext>

Let’s start with the framework. Stripped down, it almost looks like an HTML file, with head and body tags, and these are where the formatting and caption lines go.

Notice that it’s wrapped with <timedtext version="3">. YouTube used to have an even more basic version for syncing transcripts called SRV1, but it’s deprecated and irrelevant now.
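To check that a file has this framework before uploading it, a quick parse is enough. The following is a minimal Python sketch using the standard library; the `summarize` helper is a made-up name for illustration, and the checks only cover what the sample above shows (a `timedtext` root, pens in `head`, cues in `body`).

```python
import xml.etree.ElementTree as ET

# The sample file from above, with straight quotes so an XML parser accepts it.
SAMPLE = """<timedtext version="3">
<head>
<pen id="1" b="1" fc="#FF0055" />
</head>
<body>
<p t="4050" d="1020"><s p="1">Love it or leave it!</s></p>
</body>
</timedtext>"""

def summarize(srv3_text):
    """Return (version, pen ids, cue count) for an SRV3 document."""
    root = ET.fromstring(srv3_text)
    if root.tag != "timedtext":
        raise ValueError("root element is not <timedtext>")
    pen_ids = [pen.get("id") for pen in root.findall("./head/pen")]
    cues = root.findall("./body/p")
    return root.get("version"), pen_ids, len(cues)

print(summarize(SAMPLE))
```

Note that a file with curly quotes around attribute values will fail to parse here, which is a handy way to catch that mistake early.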

Tags

Let’s begin with the tags that go inside <body>:

<!-- Paragraph/Cue Tag -->
<p
   t="0000"       <!-- Timestamp in ms (required) -->
   d="0000"       <!-- Duration in ms (required)  -->
   ws="#"         <!-- <ws> ID   -->
   wp="#"         <!-- <wp> ID   -->
   w="#"          <!-- <w> parent ID -->
   a="0|1"        <!-- used by auto captions for padding  -->
  />

<!-- Span Tag -->
<s
   p="#"          <!-- <pen> ID  -->
   t="#"          <!-- Timestamp (relative to <p> parent) -->
   ac="#"         <!-- Unused (found in auto captions)    -->
  />

<!-- Window/Region Tag (for scrolling text) -->
<w
   id="#"         <!-- Tag ID -->
   t="0000"       <!-- Timestamp (child <p> timestamps relative) -->
   ws="#"         <!-- <ws> ID   -->
   wp="#"         <!-- <wp> ID   -->
  />

The <p> tag represents every caption cue, and styles are controlled by individual <s> tags within. Normally, each cue is given a window all to itself, but having a <w> tag at the start is useful for linking them together.
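Since t and d are both in milliseconds, a cue’s end time is simply t + d. Here is a small Python sketch that lists cues with readable timestamps; `format_ms` and `list_cues` are hypothetical helper names, and cues without a d attribute are skipped as an assumption about how to treat padding cues.

```python
import xml.etree.ElementTree as ET

def format_ms(ms):
    """Format a millisecond count as HH:MM:SS.mmm."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"

def list_cues(srv3_text):
    """Return (start, end, text) for every <p> cue that has a duration."""
    root = ET.fromstring(srv3_text)
    cues = []
    for p in root.iter("p"):
        if p.get("t") is None or p.get("d") is None:
            continue  # assumption: skip cues without explicit timing
        start = int(p.get("t"))
        end = start + int(p.get("d"))
        # itertext() joins the text of all nested <s> spans in order
        cues.append((format_ms(start), format_ms(end), "".join(p.itertext())))
    return cues

doc = (
    '<timedtext version="3"><body>'
    '<p t="4050" d="1020"><s p="1">Love it or leave it!</s></p>'
    "</body></timedtext>"
)
print(list_cues(doc))
```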

That’s it for the content side of the format, so now let’s get down to the formatting tags in <head>:

<!-- Cue Pen/Style Tag -->
<pen
   id="#"         <!-- Tag ID    -->
   b="0|1"        <!-- Bold      -->
   i="0|1"        <!-- Italic    -->
   u="0|1"        <!-- Underline -->
   fs="#"         <!-- Font Style/Family:
                          0|4 - YouTube Noto/Roboto/sans-serif
                          1   - Courier/monospace
                          2   - Times New Roman/serif
                          3   - Deja Vu Sans Mono/monospace
                          5   - Comic Sans MS/Impact
                          6   - Monotype Corsiva/cursive
                          7   - Carrois Gothic SC/small-caps -->
   sz="#"         <!-- Font Size (in % of default) -->
   fc="#FFFFFF"   <!-- Font Color -->
   bc="#000000"   <!-- Background Color -->
   ec="#000000"   <!-- Edge Color -->
   fo="#"         <!-- Font Opacity -->
   bo="#"         <!-- Background Opacity-->
   et="#"         <!-- Edge Style/Text Shadow:
                          0   - None
                          1   - Drop Shadow
                          2   - Raised
                          3   - Uniform -->
   of="#"         <!-- Offset -->
   rb="#"         <!-- Ruby Text (see tutorial below) -->
   hg="0|1"       <!-- Horizontal Guide
                          (used for horizontal text
                           within vertical text) -->
   te="#"         <!-- Text Emphasis -->
  />

<!-- Window Style Tag -->
<ws
   id="#"         <!-- Tag ID -->
   ju="#"         <!-- Justify/Text Align:
                          0   - Start (Left in LTR)
                          1   - End (Right in LTR)
                          2   - Center
                          3   - Justify -->
   pd="#"         <!-- Print Direction:
                          0   - LTR Horizontal
                          1   - RTL Horizontal
                          2   - Vertical RTL, Upright Text
                          3   - Vertical LTR, Sideways Text -->
   sd="#"         <!-- Scroll Direction:
                          0   - LTR
                          1   - RTL (requires vertical pd) -->
   mh="#"         <!-- Mode Hint:
                          0|1 - Default
                          2  - Scroll -->
   wfc="#000000"  <!-- Window Fill Color   -->
   wfo="#"        <!-- Window Fill Opacity -->
  />

<!-- Window Position Tag (necessary for scrolling text) -->
<wp
   id="#"         <!-- Tag ID -->
   ap="#"         <!-- Anchor Point:
      0 - Top Left    | 1 - Top Center    | 2 - Top Right 
      3 - Center Left | 4 - True Center   | 5 - Center Right
      6 - Bottom Left | 7 - Bottom Center | 8 - Bottom Right -->
   cc="#"         <!-- Column Count (en dash width) -->
   rc="#"         <!-- Row Count -->
   ah="#"         <!-- Align Horizontal (X from left) -->
   av="#"         <!-- Align Vertical (Y from top) -->
  />

These are basically like meta tags standing in for CSS classes. <pen> tags can apply to <p> or <s> tags to style the text itself. The <wp> and <ws> tags are meant for the uppermost point of the caption hierarchy, applying to the window the text is contained in.
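The CSS-class comparison can be made concrete: each <s> carries a pen ID, and the pen’s attributes are looked up from <head>. The Python sketch below does that lookup; `styled_spans` is an illustrative name, and the second pen in the test document is invented for the example.

```python
import xml.etree.ElementTree as ET

DOC = """<timedtext version="3">
<head>
<pen id="1" b="1" fc="#FF0055" />
<pen id="2" i="1" />
</head>
<body>
<p t="4050" d="1020"><s p="1">Love it</s><s p="2"> or leave it!</s></p>
</body>
</timedtext>"""

def styled_spans(srv3_text):
    """Return (text, pen attributes) for each <s>, resolving p="#" via <head>."""
    root = ET.fromstring(srv3_text)
    pens = {}
    for pen in root.findall("./head/pen"):
        attrs = dict(pen.attrib)
        pen_id = attrs.pop("id", None)
        if pen_id is not None:
            pens[pen_id] = attrs
    # Spans with no pen, or an unknown pen ID, get an empty style here.
    return [(s.text or "", pens.get(s.get("p"), {}))
            for s in root.findall("./body/p/s")]

for text, style in styled_spans(DOC):
    print(repr(text), style)
```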

Quirks and Caveats

Of course, since YouTube made this format for themselves to be used by themselves, there isn’t a lot of forgiveness in screwing up what it expects.

As such, don’t expect to get the right results the first time if you’re just getting in casually. Expect to be tweaking your files constantly and re-uploading them if you want to be particular about it. To save you headaches, here are a few major stumbling blocks I found while trying to reverse engineer this thing:

Anchor Points Move

You might’ve seen the <wp ap="#"> attribute pass by. It’s one you have to master in some respects while giving up control in others, because the simple fact is that anchor points can move.

The control bar at the bottom pushes up against caption windows that are aligned to the bottom. For embedded videos, the title bar on top pushes down on the top-aligned ones.

If you need to position your captions precisely, you should plan for that.
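The anchor numbers in the reference table follow a 3×3 grid, so the row and column can be recovered with integer division. A minimal Python sketch, with made-up function names; the “pushed” rule only encodes the two behaviours described here and is not taken from the player’s code.

```python
VERTICAL = ("Top", "Center", "Bottom")
HORIZONTAL = ("Left", "Center", "Right")

def describe_anchor(ap):
    """Map an ap value (0-8) to its (vertical, horizontal) position."""
    if not 0 <= ap <= 8:
        raise ValueError("ap must be between 0 and 8")
    row, col = divmod(ap, 3)
    return VERTICAL[row], HORIZONTAL[col]

def may_be_pushed(ap, embedded=False):
    """True if player UI can shift a window with this anchor.

    Assumption: the bottom row is pushed by the control bar, and the
    top row is pushed by the title bar on embedded videos only.
    """
    row = ap // 3
    return row == 2 or (embedded and row == 0)

print(describe_anchor(7), may_be_pushed(7))
print(describe_anchor(1), may_be_pushed(1, embedded=True))
```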

Sometimes You Need Empty <s> Tags

I found early on that if I start a <p> cue with a styled <s>, then sometimes the style won’t actually apply. It seems to happen with multi-line captions in particular, but a quick fix that works for me is to put an empty <s> tag at the start of the cue.

<!-- BUG -->
<p t="1000" d="500"><s p="2">BOB:</s><s> Hey Alice?
What are you doing?</s></p>

<!-- Correct -->
<p t="1000" d="500"><s></s><s p="2">BOB:</s><s> Hey Alice?
What are you doing?</s></p>
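If you have many cues to patch, the workaround can be applied mechanically. This Python sketch inserts an empty <s> only when a cue begins with a styled span; `add_leading_empty_span` is a hypothetical helper, and whether every such cue actually needs the fix is an assumption, since the bug only shows up sometimes.

```python
import xml.etree.ElementTree as ET

def add_leading_empty_span(p):
    """Insert an empty <s> before a cue's first span if that span is styled.

    Returns True when the cue was changed.
    """
    children = list(p)
    if not children or (p.text or "").strip():
        return False  # no spans, or the cue already starts with plain text
    first = children[0]
    if first.tag != "s" or first.get("p") is None:
        return False  # first span is unstyled (or already the empty one)
    p.insert(0, ET.Element("s"))
    return True

cue = ET.fromstring(
    '<p t="1000" d="500"><s p="2">BOB:</s>'
    "<s> Hey Alice?\nWhat are you doing?</s></p>"
)
changed = add_leading_empty_span(cue)
# short_empty_elements=False writes <s></s> rather than <s />
fixed = ET.tostring(cue, encoding="unicode", short_empty_elements=False)
print(fixed)
```

Running the helper a second time on the same cue leaves it untouched, because the first span is then the unstyled empty one.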

Mobile Doesn’t Accept Vertical Text or Rubies

My testing here was entirely reliant on my Samsung Galaxy S8, which did not render either vertical text or ruby text. In fact, you’ll see in the tutorial below that instead of supporting rubies on mobile, their system accounts for it with a compulsory procedure that works on both desktop and mobile.

From what it seems, they compromised on small mobile screens by just turning off distractions and making things more streamlined there.

Some Necessary Tutorials

While we’re here, let’s get to some specific tutorials that a general reference guide won’t be much help with.

Rubies

If you’re not familiar with East Asian languages, rubies are text put on top of Chinese characters to show how they’re pronounced in a certain language. Japanese rubies are called furigana, and you may find them in anime subtitles.

For YouTube captions, there is a specific way to make rubies work.

<pen id="1" rb="1" />
<pen id="2" rb="2" />
<pen id="3" rb="3" />
<pen id="4" rb="4" />

<p t="1000" d="500">
         <s p="1">太孫</s>     <!-- <ruby> Main Text -->
         <s p="2">(</s>       <!-- <rp> Open Parenthesis -->
         <s p="3">たいそん</s>  <!-- <rt> Ruby Text -->
         <s p="4">)</s>       <!-- <rp> Close Parenthesis -->
</p>

<!-- Newlines and tabs used for illustrative clarity.
Don't actually use them. They trigger new caption lines. -->

The rb attribute in this case is used by the player’s code to lay out a sequence of ruby text when it finds a sequence of <s> tags with different rb numbers.

On web browsers, the ruby text works perfectly, even with vertical text. However, the YouTube mobile app will instead render all four as a string of text. That’s why the parentheses are there, to pick up the slack in a historically acceptable way.
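Because the four spans have to sit on one line with no whitespace between them, it can be easier to generate them than to type them. A small Python sketch follows; `ruby_spans` is an invented helper, and it assumes pens 1 to 4 are defined with rb="1" to rb="4" as in the snippet above.

```python
from xml.sax.saxutils import escape

def ruby_spans(base, reading, pens=("1", "2", "3", "4")):
    """Build the base / "(" / reading / ")" span sequence on a single line.

    `pens` holds the IDs of the pens carrying rb="1".."4", in that order.
    """
    base_pen, open_pen, text_pen, close_pen = pens
    parts = [
        (base_pen, base),      # main text
        (open_pen, "("),       # fallback open parenthesis
        (text_pen, reading),   # ruby text
        (close_pen, ")"),      # fallback close parenthesis
    ]
    # No newlines or tabs between spans: those would start new caption lines.
    return "".join(f'<s p="{pen}">{escape(text)}</s>' for pen, text in parts)

print(ruby_spans("太孫", "たいそん"))
```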

Scrolling Text

If for some reason you want to mimic live TV captions, you can use this guideline. The snippet below is adapted from a video with Mark Hamill:

<ws id="1" mh="2" ju="0" sd="0"/>
<wp id="1" ap="6" ah="20" av="100" rc="2" cc="40"/>

<w id="1" t="0" wp="1" ws="1"/>
<p t="79" d="3000" w="1"><s>Wow</s></p>
<p t="9780" w="1" a="1">
</p>
<p t="9790" d="8300" w="1"><s t="1790">We</s><s t="2790"> better</s><s t="3090"> be</s><s t="3360"> good</s></p>

The key elements to look for are the mh="2" attribute in <ws> and parenting all the <p> cues to a single <w> above them.

YouTube’s auto captions like to break up a single cue line into a number of spans, each synchronized to the moment its word is spoken in the video.
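Going by the reference guide, the t on each <s> is relative to its parent <p>, so the moment a word appears is the cue’s t plus the span’s t. A minimal Python sketch of that arithmetic, using the last cue from the snippet above; `word_timings` is an illustrative name.

```python
import xml.etree.ElementTree as ET

CUE = (
    '<p t="9790" d="8300" w="1"><s t="1790">We</s><s t="2790"> better</s>'
    '<s t="3090"> be</s><s t="3360"> good</s></p>'
)

def word_timings(p):
    """Return (absolute time in ms, word) for each <s> in a cue."""
    cue_start = int(p.get("t"))
    return [
        # A span without its own t is assumed to appear with the cue itself.
        (cue_start + int(s.get("t", "0")), (s.text or "").strip())
        for s in p.findall("s")
    ]

timings = word_timings(ET.fromstring(CUE))
print(timings)
```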

Evangelizing

Finally, some tips on how to spread the word.

If you see any captions in need of fixing, consider downloading the originals in SRV3 format and working with the code in your favorite text editor.

If the channel doesn’t accept new captions, try out Amara to get a head start, then get in touch with the showrunner and ask them to consider what you provide.

Happy captioning!