Subtitling on zero budget

Accessible subtitling is critical for disabled users, and increasingly essential for all users because so many people mute the sound while watching social video. Producing subtitles is not difficult, and can be done for free. This guide explains how subtitling works and takes you through the process of creating a subtitle file at zero cost. A video is not finished until it has been subtitled or captioned.

What does a subtitle file look like?

A subtitle file connects speech in the video with text. Depending on the intent of the caption writer, other elements of the video may also be conveyed. Subtitles can be stored in many different file formats. The simplest is SubRip (.srt), which is accepted by Facebook and YouTube as well as semi-professional applications like Adobe Premiere Pro CC.

What you need:

  • a transcript of your video
  • a video player that you are familiar with and that can display ‘audio time units’, i.e. the time as hours, minutes, seconds and milliseconds (HH:MM:SS,mmm)
  • a text editor (Notepad will do if you are desperate)

SRT formatting is exceptionally simple. This makes it suitable for hand coding into a plain text file. The format is simply this:

  • Subtitle index (starts at 1 and increments by 1 per subtitle)
  • Start time --> end time (hh:mm:ss,mmm, with a comma before the milliseconds)
  • Subtitle content (as plain text, hard line breaks are honoured on screen)
  • Blank line to terminate this subtitle

For example:

1
00:20:41,150 --> 00:20:45,109
In the name of the Father, and of the Son and of the Holy Spirit,
Let’s open up this tricky passage of scripture.

2
00:20:45,110 --> 00:20:50,001
The wise ones, Magi, brought gifts to the infant Christ of
gold, frankincense and myrrh

There is a very readable guide to the format on Wikipedia, which also covers some formatting that many SRT clients permit by convention.
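If you would rather not type the timing lines by hand, the layout described above is simple enough to generate. What follows is only a minimal sketch: srt_time and srt_cue are names I have made up for illustration, and it assumes you have your start and end times as whole milliseconds.

```shell
#!/usr/bin/env bash
# srt_time and srt_cue are hypothetical helper names used only in this sketch.

# Convert a time in milliseconds to the SRT timestamp form HH:MM:SS,mmm.
srt_time() {
  local ms="$1"
  printf '%02d:%02d:%02d,%03d' \
    "$(( ms / 3600000 ))" "$(( ms / 60000 % 60 ))" "$(( ms / 1000 % 60 ))" "$(( ms % 1000 ))"
}

# Print one complete cue: index, timing line, text, then the terminating blank line.
srt_cue() {
  local index="$1" start_ms="$2" end_ms="$3" text="$4"
  printf '%d\n%s --> %s\n%s\n\n' \
    "$index" "$(srt_time "$start_ms")" "$(srt_time "$end_ms")" "$text"
}

# Example: a cue running from 20 min 41.150 s to 20 min 45.109 s.
srt_cue 1 1241150 1245109 "Let's open up this tricky passage of scripture."
```

Redirect the output of several srt_cue calls into a file and you have the skeleton of an .srt file.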

Step 1: Obtaining a transcript of your video

  • The simplest strategy is just to type one. I recommend not trying to type your transcript in SRT format to start with. Instead transcribe it into natural chunks. If you are very familiar with what was said, you may find that transcribing chunks from the end of the video and moving ‘backwards’ towards the beginning gets you a more accurate transcript.
    • This technique is great if you have poor internet upload speed, or are constrained to keep the content of the video on your local machine – for example in the developing world
    • Hand coding also works well if you have poor quality sound in the video
  • An alternative method is to use an AI speech recognition service. There are several with free plans that are adequate for a lot of work:
    • At the time of writing Otter works well, provided you give it clean audio. Helpfully, Otter can handle multiple speakers competently.

Step 2: Splitting your transcript for best effect

Subtitling for effective communication is an art. You can’t expect to just split every n words; that makes the subtitles very hard to follow. You want to help the viewer through the video. Your role is important and should not be rushed. Here’s an example:

Christmas has come and gone and also the new year.
And if we are honest,

we would probably have to admit that most of us,
if not all of us have been feeling rather exhausted.

We do it every year, the mad rush to buy the presents before
the shop closes on Christmas Eve.

Step 3: Adding timings

Now there is nothing to do but some hard work. In the absence of any other tools you need VLC, the excellent free video player, to play the video. You also need the VLC extension ‘Time’, which puts the playback time on screen in a useful format that includes milliseconds.

Simply go down the file adding in the time stamps and index numbers.
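The index numbers, at least, do not have to be typed. Here is a rough sketch, assuming awk is available (it is on OS X and Linux) and that you have typed just the timing line and text for each subtitle, with a blank line between subtitles. number_cues is a made-up name, and the timings in the example are invented for illustration.

```shell
#!/usr/bin/env bash
# number_cues is a hypothetical helper name used only in this sketch.
# It reads blocks separated by blank lines and writes each one back out
# with a running index number (starting at 1) on the line above it.
number_cues() {
  awk 'BEGIN { RS = ""; ORS = "\n\n" } { print NR "\n" $0 }' "$@"
}

# Example input: timing line plus text, no index numbers yet.
number_cues <<'EOF'
00:00:01,000 --> 00:00:03,500
Christmas has come and gone and also the new year.
And if we are honest,

00:00:03,600 --> 00:00:07,000
we would probably have to admit that most of us,
if not all of us have been feeling rather exhausted.
EOF
```

With a real file you would run something like number_cues timed.txt > video.srt (both file names are placeholders). The blank lines between subtitles must be truly empty, because a line containing only spaces does not count as a separator.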

Step 4: Finalise

Now save the file with a .srt extension and you are good to go.
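Hand-typed timing lines are where most mistakes creep in, so a quick check before uploading can save a rejected file. This is a sketch only: check_srt_timings is a made-up name, and it checks nothing except the shape of the lines containing the arrow.

```shell
#!/usr/bin/env bash
# check_srt_timings is a hypothetical helper name used only in this sketch.
# It prints (with line numbers) every line that contains "-->" but is not
# exactly "HH:MM:SS,mmm --> HH:MM:SS,mmm". Carriage returns are stripped
# first so files saved with Windows line endings are not flagged.
# Exit status 0 means at least one problem line was found.
check_srt_timings() {
  tr -d '\r' < "$1" |
    grep -n -e '-->' |
    grep -Ev '^[0-9]+:[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}$'
}

# Example: a scratch file with one good and one bad timing line.
sample="$(mktemp)"
printf '1\n00:00:01,000 --> 00:00:02,000\nHello\n\n2\n00:00:02;500 -->  00:00:04,000\nWorld\n' > "$sample"
if check_srt_timings "$sample"; then
  echo "Fix the lines listed above."
else
  echo "All timing lines look well formed."
fi
rm -f "$sample"
```

The second timing line in the example is caught twice over: it uses a semicolon before the milliseconds and has two spaces after the arrow.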


Sony PXW-Z190, ffmpeg batch transcode to ProRes QuickTime: Explained

A specific guide to using ffmpeg to transcode to QuickTime (ProRes) from the native MXF recorded by Sony PXW-Z190 cameras. It should also be of interest to users of the PXW-Z280. For those of you desperate to cut and paste an incantation into Terminal, here you go:

for i in *.MXF; do ffmpeg -i "$i" -map 0:0 -map 0:1 -map 0:2 -map 0:3 -map 0:4 -c:v prores_ks -profile:v 1 -quant_mat:v 3 -qscale:v 13 "./output/${i%.*}.mov"; done

Read on to understand what it does and why it works. If you’re not technical or not used to Terminal, don’t worry: we do it step by step.

For the transcode we mainly want to operate on the video stream, which is in h264, and transcode that to ProRes. We could do things to the audio streams, but here we are just going to copy the four tracks of audio straight over.

Video transcode

Using ffprobe, find the information about the zeroth stream, which on this camera is the video:

Stream #0:0: Video: h264 (High), yuv420p(tv, bt709, progressive), 3840x2160 [SAR 1:1 DAR 16:9], 29.97 fps, 29.97 tbr, 29.97 tbn, 59.94 tbc    

So let’s write an ffmpeg command that just makes a QuickTime video file with no audio, to illustrate what we are about. The -an option tells ffmpeg to leave the audio out.

ffmpeg -i INPUT.MXF -c:v prores_ks -profile:v 1 -quant_mat:v 3 -qscale:v 20 -an OUTPUT.MOV

Exploding the command term by term

  • -c:v sets the codec for the video stream. By default ffmpeg picks a single video stream for you (here the zeroth stream). If your video is somewhere else, or there are n video streams, then you need to map it / them formally with -map
  • prores_ks selects the codec. This is one of a number of ProRes encoders in ffmpeg, but it is the one that I’ve had most luck with. Here’s a slightly despairing blog by one of the original authors: ProRes KS
  • -profile:v selects a preconfigured profile. In prores_ks the profiles are specified by an integer, thus: -profile:v 3
    • 0 ‘proxy’
    • 1 ‘lt’ (surprisingly useful, especially when output to web)
    • 2 ‘standard’
    • 3 ‘hq’
    • 4 ‘4444’
    • 5 ‘4444xq’
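  • -quant_mat selects a preconfigured quantiser matrix. In prores_ks the matrices are specified by an integer, thus: -quant_mat:v 3. The names are accepted too (for example -quant_mat:v lt), and ffmpeg -h encoder=prores_ks lists the values your build understands
    • 0 ‘auto’
    • 1 ‘default’
    • 2 ‘proxy’
    • 3 ‘lt’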
    • 4 ‘standard’
    • 5 ‘hq’
  • -qscale sets the quantiser. Very broadly this controls the amount of compression the encoder applies per frame. Setting a fixed qscale speeds up the encode because a whole chunk of processing to get best quality is short-circuited.
  • -bits_per_mb higher values will improve speed. I don’t recommend specifying both this and -qscale at the same time, because predicting the outcome gets tricky if you aren’t intimate with the maths
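Remembering which integer means which profile gets old quickly. One option is a small lookup that turns the profile names listed above into their numbers and prints the video half of the command. This is a sketch rather than anything official: prores_profile and prores_video_args are names I have made up, and -quant_mat is deliberately left out to keep it short.

```shell
#!/usr/bin/env bash
# prores_profile and prores_video_args are hypothetical helper names.

# Map a profile name to the integer given in the list above.
prores_profile() {
  case "$1" in
    proxy)    echo 0 ;;
    lt)       echo 1 ;;
    standard) echo 2 ;;
    hq)       echo 3 ;;
    4444)     echo 4 ;;
    4444xq)   echo 5 ;;
    *) echo "unknown ProRes profile: $1" >&2; return 1 ;;
  esac
}

# Print the video options for a given profile name and qscale value.
prores_video_args() {
  local number
  number="$(prores_profile "$1")" || return 1
  printf -- '-c:v prores_ks -profile:v %s -qscale:v %s\n' "$number" "$2"
}

prores_video_args lt 13
```

The last line prints -c:v prores_ks -profile:v 1 -qscale:v 13. To use it in anger, something like ffmpeg -i INPUT.MXF $(prores_video_args lt 13) -an OUTPUT.MOV should work, with the substitution left unquoted so the shell splits it back into separate options.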

Audio copy

ffmpeg -i INPUT.MXF -map 0:1 -map 0:2 -map 0:3 -map 0:4 -c:a copy  OUTPUT.MOV    

Copying the audio straight over as PCM, i.e. without transcoding, needs the -map option, because by default ffmpeg only picks up one audio stream. There’s a better explanation of what’s going on than I can give on the ffmpeg wiki. Be aware that some systems may struggle with the raw audio.

Synthesising the command

ffmpeg -i INPUT.MXF -map 0:0 -map 0:1 -map 0:2 -map 0:3 -map 0:4 -c:v prores_ks -profile:v 1 -quant_mat:v 3 -qscale:v 12 -c:a:0 pcm_alaw -c:a:1 pcm_alaw -c:a:2 pcm_alaw -c:a:3 pcm_alaw OUTPUT.MOV

Note that we now need to map the video channel explicitly.

Now some magic to make it process a directory of files.

This works in OS X and probably works in most Linux distributions. I’m not going to explain it in detail, because why ‘$i’ works isn’t straightforward…

The only requirement here is that you should pre-create a directory called output in the active directory. So if your working directory is
$usr/video then create $usr/video/output

for i in *.MXF; do ffmpeg -i "$i" -map 0:0 -map 0:1 -map 0:2 -map 0:3 -map 0:4 -c:v prores_ks -profile:v 1 -quant_mat:v 3 -qscale:v 13 -c:a:0 pcm_alaw -c:a:1 pcm_alaw -c:a:2 pcm_alaw -c:a:3 pcm_alaw "./output/${i%.*}.mov"; done
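For the curious, the part of the loop that looks most like line noise is "${i%.*}", which takes the file name held in i and trims the final dot and extension off it. A dry run makes this easier to see. The sketch below does not call ffmpeg at all; it just creates the output directory and prints which output file each input would produce. output_name is a made-up helper name.

```shell
#!/usr/bin/env bash
# output_name is a hypothetical helper name used only in this sketch.
# "${input%.*}" removes the shortest match of ".*" from the end of the value,
# i.e. the final dot and the extension after it.
output_name() {
  local input="$1"
  printf './output/%s.mov\n' "${input%.*}"
}

mkdir -p output   # make ./output if it is not already there

# Dry run: show which output file each input would produce, without calling ffmpeg.
for i in *.MXF; do
  [ -e "$i" ] || continue   # nothing matched, so "$i" is the literal text *.MXF
  printf '%s -> %s\n' "$i" "$(output_name "$i")"
done
```

Keeping the quotes around "$i" and the output path is what lets file names with spaces survive the trip.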

Use your friend ffprobe

Using ffmpeg without a solid understanding of the file you are feeding it as an input is fairly futile for all but the most straightforward cases. So the first step is always to use ffprobe to examine the input.

The bits we need from the screed of the metadata report are the stream definitions, since they help with specifying the right options for the transcode.

 Video definition

 Stream #0:0: Video: h264 (High), yuv420p(tv, bt709, progressive), 3840x2160 [SAR 1:1 DAR 16:9], 29.97 fps, 29.97 tbr, 29.97 tbn, 59.94 tbc    

The first stream is the video: h264 encoded, with 4:2:0 chroma subsampling (yuv420p), 3840x2160 at 29.97 fps.

N audio streams

 Stream #0:1: Audio: pcm_s24le, 48000 Hz, 1 channels, s32 (24 bit), 1152 kb/s

Then we get to the audio streams, of which there are at least four: two external microphones and two internal mics. I haven’t been able to test the Sony special interface.

Mysterious fifth stream

There is also a data stream, and it isn’t quite clear what it does. It’s timecode related, but I don’t do enough TC-specific stuff to have worked out what.
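Since the number of audio streams can vary, it may be safer to build the -map options from what ffprobe reports than to hard-code them. The following is a sketch under assumptions: maps_from_csv is a made-up name, and the sample input is a stand-in I typed to mirror the layout described above, not output captured from a camera file.

```shell
#!/usr/bin/env bash
# maps_from_csv is a hypothetical helper name used only in this sketch.
# It reads "index,codec_type" lines on stdin and prints one -map option
# for each video or audio stream, skipping anything else (such as data).
maps_from_csv() {
  awk -F, '
    $2 == "video" || $2 == "audio" { printf "%s-map 0:%s", sep, $1; sep = " " }
    END { print "" }
  '
}

# Stand-in for the ffprobe output of a file laid out as described above:
# one video stream, four audio streams and one data stream.
printf '0,video\n1,audio\n2,audio\n3,audio\n4,audio\n5,data\n' | maps_from_csv
```

This prints -map 0:0 -map 0:1 -map 0:2 -map 0:3 -map 0:4, leaving the data stream behind. With a real file, ffprobe -v error -show_entries stream=index,codec_type -of csv=p=0 INPUT.MXF should produce lines of the same shape to pipe into maps_from_csv, though it is worth eyeballing the ffprobe output first.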