Adding subtitles to videos

In my ever-growing collection of FFmpeg-related blog posts, I will today show how to add subtitles to videos. These tricks are based on the need to create a captioned version of a video I made to introduce the Workshop on NIME Archiving for the 2022 edition of the International Conference on New Interfaces for Musical Expression (NIME).

The video I discuss in this blog post. YouTube supports turning on and off the subtitles (CC button).

Why add subtitles to videos?

I didn’t think much about subtitles previously but have become increasingly aware of their importance. Firstly, adding subtitles is essential from an accessibility point of view. In fact, at UiO, it is now mandatory to add subtitles to all videos we upload to the university web pages. The main reason is that people who have problems hearing the content can read it instead.

Subtitles are also helpful for people who can hear the sound in a video. On Twitter, for example, videos will play automatically when you hover over them. However, the sound will usually be off, so without subtitles, it is impossible to know what is said. There are also times when you may want to watch some content without listening, for example, if you don’t have headphones available in a public setting.

I guess that having subtitles also helps search engines find your content more efficiently, which may lead to better dissemination of the content.

Creating subtitles

There are numerous ways of doing this, but I usually rely on some machine listening service as the first step these days. At UiO, we have a service called Autotekst that will create a subtitle text file from an audio recording. The nice thing about that service is that it supports both English and Norwegian and two people talking. It is pretty ok but does require some manual cleanup. I typically do that in a text editor while checking the video.

YouTube offers a more streamlined approach for videos uploaded to their service. It has machine listening built in that works quite well for one person talking in English. It also has a friendly GUI where you can go through and check the text and align it with the audio.

YouTube has a nice window for adding/editing subtitles.

I typically use YouTube for all my public videos in English and UiO’s service for material in Norwegian and with multiple people.

Closed Caption formats

Various services and platforms support at least 25 different subtitle formats. I have found that SRT (SubRip) and VTT (Web Video Text Tracks) are the two most common formats. Both are supported by YouTube, while Twitter prefers SRT (although I still haven’t been able to upload one such file successfully). The video player on UiO prefers VTT, so I often have to convert between these two formats.

Fortunately, FFmpeg comes to the rescue (again). Converting from one subtitle format to another is as simple as writing this one-liner:

ffmpeg -i caption.vtt

This will convert an SRT file to VTT in a split second. As the screenshot below shows, there is not much difference between the two formats.
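A minimal sketch of the same conversion for several files at once. It assumes FFmpeg is installed and that the SRT files are passed as arguments; the function name srt_to_vtt and the file names are hypothetical. FFmpeg picks the output format from the .vtt extension.

```shell
# Convert each given SRT file to a VTT file with the same base name.
# srt_to_vtt is a hypothetical helper name, not part of FFmpeg.
srt_to_vtt() {
  for f in "$@"; do
    # "${f%.srt}.vtt" swaps the .srt extension for .vtt
    ffmpeg -i "$f" "${f%.srt}.vtt"
  done
}

# Example call (hypothetical file name): srt_to_vtt caption.srt
```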

A few lines of the same subtitle file in VTT format (left) and SRT (right).

Playing back videos with subtitles

Web players will play subtitles associated with them, but what about on a local machine? If the subtitle file is named the same as the video file, most video players will use the subtitles when playing the file. Here is how it looks in VLC on Ubuntu:

VLC will show the subtitles automatically if it finds an SRT file with the same name as a video file.

It is also possible to go into the settings to turn the subtitles on and off. I guess it is also possible to have multiple files available to add additional language support, and that would be interesting to explore another time.

The benefit of having subtitles as a separate text file is that they can be turned on and off.

Embedding subtitles in video files

For the NIME conference, we are exploring PubPub, a modern publication platform developed by The MIT Press. There are many good things to say about PubPub, but some features are still missing. Adding a subtitle file to uploaded videos is one missing feature. I, therefore, started exploring whether it is possible to embed the subtitles inside the video file.

A video file is built as a “container” that holds different content, of which video and audio are two (or more) “tracks” within the file. The nice thing about working with FFmpeg is that one quickly understands how such containers are constructed. And, just as I expected, it is possible to embed an SRT file (and probably others too) inside of a video container.

As discussed in this thread, many things can go wrong when you try to do this. I ended up with this one-liner:

ffmpeg -i video.mp4 -i -c copy -c:s mov_text video_cc.mp4

The trick is to think of the subtitle file as just another input file that should be added to the container. The result is a video file with subtitles built in, as shown below:
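The idea of treating the subtitle file as a second input can be sketched as a small shell function. This assumes an MP4 container and an SRT subtitle file; the function name embed_subs and the file names in the example call are hypothetical. The video and audio streams are copied unchanged, and only the subtitles are converted to mov_text, the subtitle codec that MP4 containers accept.

```shell
# Embed a subtitle file as a separate, switchable track in an MP4 file.
# Arguments: input video, subtitle file, output video.
embed_subs() {
  # -c copy      : copy video and audio without re-encoding
  # -c:s mov_text: encode the subtitle stream in the MP4 text format
  ffmpeg -i "$1" -i "$2" -c copy -c:s mov_text "$3"
}

# Example call (hypothetical file names):
# embed_subs video.mp4 video.srt video_cc.mp4
```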

It may have been a long shot, but the PubPub player didn’t support such embedded subtitles either. Then I started exploring a more old-school approach, “burning” the text into the video file. I feared that I had to do this within a video editor, but, again, it turned out that FFmpeg could do the trick:

ffmpeg -i video.mp4 -vf video_cc.mp4

This “burns” the text into the video, which is not the best way of doing it, I think. After all, the nice thing about having the subtitles in text form is that they can be turned on and off and adjusted in size. Still, having some subtitles may be better than nothing.
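A sketch of the burn-in step, assuming an FFmpeg build that includes the subtitles video filter (it depends on libass) and a subtitle file name without characters that need escaping inside a filter string. The function name burn_subs and the example file names are hypothetical.

```shell
# Render the subtitles into the picture itself. The video is re-encoded,
# so this is slower than copying streams into the container.
# Arguments: input video, subtitle file, output video.
burn_subs() {
  ffmpeg -i "$1" -vf "subtitles=$2" "$3"
}

# Example call (hypothetical file names):
# burn_subs video.mp4 video.srt video_cc.mp4
```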

The video with the subtitle text “burned” into the video content.
The video with subtitles in a separate text layer.
The video embedded on the workshop page, with subtitles.

Posting on Twitter

After having gone through all that trouble, I wanted to post the video on Twitter. This turned out to be more difficult than expected. Three problems arose.

First, Twitter does not support 4K videos, so I had to downsample to Full HD. Fair enough, that is easily done with FFmpeg. Second, Twitter only supports videos shorter than 2:20 minutes; mine is 2:34. Fortunately, I could easily cut out the video’s first and last sentence, and it still made sense. However, this also leads to trouble with the subtitles. The subtitles are based on the timing of the original video. So if I were to trim the video, I would also need to edit the subtitle file to adjust all the timings (Happy to get input on tools for doing that!).

After spending too much time on this project, I reverted to the “burned” text approach. Writing the text into the video and trimming it would ensure some text together with the video. While preparing the one-liner, I wondered whether FFmpeg would be smart enough to also “trim” the subtitles when trimming the video file:

ffmpeg -i video.mp4 -vf,scale=1920:1080,fps=30 -ss 00:00:11 -t 00:02:17  video_hd_cc.mp4

The command above does it all in one go: downscale from 4K to HD, add the subtitles, and trim the video to the desired duration. Unfortunately, this command kept text from the sentence that was trimmed out in the beginning:

When trimming a captioned video, you get some text that is not part of the video.
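The all-in-one step described above can be sketched as follows. The scale, frame rate, start time, and duration are the values from the text; the function name burn_scale_trim and the file names are hypothetical, and the subtitles filter is assumed to come first in the filter chain.

```shell
# Burn in subtitles, downscale to Full HD at 30 fps, and trim in one pass.
# Arguments: input video, subtitle file, output video.
burn_scale_trim() {
  ffmpeg -i "$1" \
    -vf "subtitles=$2,scale=1920:1080,fps=30" \
    -ss 00:00:11 -t 00:02:17 \
    "$3"
}

# Example call (hypothetical file names):
# burn_scale_trim video.mp4 video.srt video_hd_cc.mp4
```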

The extra words at the beginning of the video are perhaps not the biggest problem. I would still be interested to hear thoughts on how to avoid this in the future. After all, subtitles are here to stay.
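One possible way to adjust the subtitle timings after trimming is to shift every timestamp in the SRT file by the length of the part that was cut. The sketch below does that with awk; it is an assumption-laden helper, not a tested workflow. The function name shift_srt is hypothetical, it only understands the SRT timestamp format (HH:MM:SS,mmm), and it clamps negative times to zero rather than removing cues that fall entirely before the cut.

```shell
# Shift all SRT timestamps by a number of seconds (negative = earlier).
# Reads SRT on standard input and writes the shifted SRT to standard output.
shift_srt() {
  awk -v off="$1" '
    function to_ms(t,   p) {           # "HH:MM:SS,mmm" -> milliseconds
      split(t, p, /[:,]/)
      return ((p[1] * 60 + p[2]) * 60 + p[3]) * 1000 + p[4]
    }
    function to_ts(ms,   h, m, s) {    # milliseconds -> "HH:MM:SS,mmm"
      if (ms < 0) ms = 0               # clamp cues that start before the cut
      h = int(ms / 3600000); ms -= h * 3600000
      m = int(ms / 60000);   ms -= m * 60000
      s = int(ms / 1000);    ms -= s * 1000
      return sprintf("%02d:%02d:%02d,%03d", h, m, s, ms)
    }
    / --> / {                          # timing line: start --> end
      print to_ts(to_ms($1) + off * 1000) " --> " to_ts(to_ms($3) + off * 1000)
      next
    }
    { print }                          # cue numbers and text pass through
  '
}

# Example call (hypothetical file names), moving everything 11 seconds earlier:
# shift_srt -11 < video.srt > video_trimmed.srt
```

Cues that end up with a start and end time of zero would still need to be removed by hand before burning the shifted file into the trimmed video.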

Running a disputation on YouTube

Last week, Ulf Holbrook defended his dissertation at RITMO. I was in charge of streaming the disputation, and here are some reflections on the technical setup and streaming.

Zoom Webinars vs YouTube Streaming

I have previously written about running a hybrid disputation using a Zoom webinar. We have used variations of that setup also for other events. For example, last year, we ran RPPW as a hybrid conference. There are some benefits of using Zoom, particularly when having many presenters. Zoom rooms are the best for small groups where everyone should be able to participate. For larger groups, and particularly (semi-)public events, Zoom Webinars are the only viable solution. I have only experienced Zoom bombing once (when someone else organised a public event with more than 100 people present), and it was an unpleasant experience. That is why we have run all our public events using Zoom Webinars, where we have more fine-grained control of who is allowed to talk and share their video and screen.

I find that a streaming solution (such as on YouTube) is the best for public events where there is no need for much interaction. A public PhD defence is one such type of event. This is a one-way delivery, where the audience is passive; hence streaming is perfectly fine. So for Ulf’s defence, we opted for using YouTube as our streaming service. We could have used UiO’s streaming service, but then we would have missed out on the social media parts of YouTube as a channel, particularly the chat functionality that people could use to ask questions. In the end, nobody asked any questions, but it still felt good to be able to communicate with the audience.


As described previously, we have a relatively complex audiovisual installation in the hall. There are two pairs of PTZ cameras: one pair connected to the lecture desk’s PC in the front and another pair connected to a video mixer in the control room. It is possible to run two 2-camera “productions” at once, one from the back and one from the front. In the Zoom Webinars, we have used the two cameras connected to the front PC. However, we used the cameras connected to the streaming station in the back of the hall for the streaming setup.

The view from the control room in the back of Forsamlingssalen at RITMO. Two PTZ cameras can be seen on the shelf in the lower left corner of the picture.

During the defence, I sat in the control room, switching between cameras and checking that everything went as it should. As can be seen in the image below, the setup consists of a hardware video controller and video mixer, and the mixed signal is then passed on to a streaming PC.

A video from the control room, with a screen of the video mixer’s view to the left, the streaming PC on the right, and my laptop for “monitoring” in the middle.

The nice thing about working with a hardware video mixer is that you can turn it on, and it works. Computers are powerful and versatile, but hardware solutions are more reliable. As seen in the image below, the mixing consisted of choosing between the two PTZ cameras, one that gave an overview shot and another a close-up. In addition, I mixed in the slides shown on the screen. There is a picture-in-picture mode on the video mixer, but one challenge is that you don’t know what is coming on the slides. So you may end up covering some of the slides, as shown in the image below.

The video mixer was used to switch between two camera views and the image from the laptop.

I still struggle to understand the logic of setting up a live stream on YouTube. The video grabber on the streaming PC (a Blackmagic Web Presenter HD) shows up as a “webcam” and can be used directly to stream to YouTube. I had made a test run some days before the defence, and everything worked well. However, when we were about to start the scheduled stream, I realised that the source was changed to “broadcast software” instead of “webcam”. I have no idea why that happened, but fortunately, I had OBS installed on the PC and could start the stream from there. The nice thing about OBS is that you also get access to a software-based audio mixer, which came in handy for tuning the sound slightly during the defence.

The streaming PC was running OBS, getting the signal from the Web Presenter HD.

Being able to monitor the output is key to any production. Unfortunately, for the first part of Ulf’s presentation, I was monitoring on the video mixer. There everything sounded fine. It was only after a while that I connected to the YouTube stream with my laptop and realised that there was a slight “echo” on the sound.

There was an echo on the sound in the introduction’s opening, caused by two audio streams with a short offset.

It turned out that this was because we were streaming sound twice through OBS, both through the incoming audio on the PC and through the audio channel passed alongside the HDMI signal. I had briefly checked that things were fine when starting up the YouTube channel, but I think what happened was that the two audio streams were slightly getting out of sync, leading to first an echo-like sound and later a delay. After discovering the problem, I quickly managed to turn off one of the audio streams.

The sound was better after turning off one of the audio streams.

For the rest of the disputation, I was careful to monitor the YouTube stream at regular intervals to check that things worked well. But since the YouTube stream was 10+ seconds delayed, I had to do the main monitoring on the video mixer to change camera positions in time.

I have found it important to monitor the final stream on YouTube, in case there are any problems.

Summing up

Apart from the audio problems initially, I think the streaming went well. Of course, we are not aiming at TV production quality. Still, we try to create productions that are ok to watch and listen to from a distance. We have had 3-4 people involved in the productions for some of our previous events. This time, we were two people involved. I ran the online production, and Eirik Slinning Karlsen from the RITMO administration was in control of the microphones in the hall. This setup works well and is more realistic for future events.

Programmatically resizing a folder of images

This is a note to self about how to programmatically resize and crop many images using ImageMagick.

It all started with a folder full of photos with different pixel sizes and ratios. That is because they had been captured with various cameras and had also been manually cropped. This could be verified by running this command to print their pixel sizes:

identify -format "%wx%h\n" *.JPG

Fortunately, all the images had a reasonably large pixel count, so I decided to go for a 5MP pixel count (2560×1920 in 4:3 ratio). That was achieved with this one-liner:

for i in *.JPG; do convert "$i" -resize 3000x1920 -crop 2560x1920+0+0 "$i"; done

The little script looks for all the image files in the folder and starts by resizing them to the preferred height (1920 pixels) and then cropping them to the correct width (2560 pixels). The result is a folder full of equally sized images.

Ps: the script above overwrites the original files in the folder.
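A variant that leaves the originals untouched can be sketched as below. It assumes ImageMagick’s convert command and uses a different approach from the one-liner above: the ^ flag resizes each image so that it fills the 2560x1920 box, and -gravity center -extent then crops the overflow evenly from both sides instead of from the top-left corner. The function name resize_crop_all and the default output folder resized are hypothetical.

```shell
# Resize and centre-crop all JPG files to 2560x1920, writing the results
# to a separate folder so that the original files are kept.
resize_crop_all() {
  outdir="${1:-resized}"
  mkdir -p "$outdir"
  for i in *.JPG; do
    [ -e "$i" ] || continue   # skip if the pattern matched no files
    convert "$i" -resize "2560x1920^" -gravity center -extent 2560x1920 "$outdir/$i"
  done
}

# Example call: resize_crop_all resized
```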

A new figure of the disciplinarities: intra, cross, multi, inter, trans

Back in 2012, I published what has become my (by far) most-read blog post: Disciplinarities: intra, cross, multi, inter, trans. There I introduced a figure that I regularly receive permission requests to republish (which I always give, in the spirit of open research).

The challenge with the previous blog post has been that I based my figure on a combination of a textual description by Stember and a more limited figure by Zeigler. This led to inconsistency when it comes to two of the disciplinarities: cross and multi. That is because the figure and the textual description do not match up.

I have received many comments about this mixup over the years, and have also thought a great deal about the differences between the two. In my new book I write about interdisciplinarity in the introduction and decided to remake the figure and fix the inconsistency in my argument. I now think about multidisciplinarity as “the step” before interdisciplinarity, while crossdisciplinarity is closer to interdisciplinarity.

Anyways, here is the new figure:

An illustration of different types of disciplinarities. License: CC-BY.

I hope it can be useful to people interested in the differences between the terms. Feel free to use it if you like it. This one comes with a CC-BY license to allow for reuse.

Merge multiple MP4 files

I have been doing several long recordings with GoPro cameras recently. The cameras automatically split the recordings into 4GB files, which leaves me with a myriad of files to work with. I have therefore made a script to help with the pre-processing of the files.

This is somewhat similar to the script I made to convert MXF files to MP4, but with better handling of the temp file for storing information about the files to merge:
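A minimal sketch of what such a merge script could look like, not necessarily the original one. It assumes FFmpeg’s concat demuxer, GoPro files ending in .MP4 in the current folder, and alphabetical order matching recording order; the function name merge_mp4, the temp file pattern, and the default output name merged.mp4 are all hypothetical.

```shell
#!/bin/bash
# Merge all .MP4 files in the current folder into one file without re-encoding.
merge_mp4() {
  out="${1:-merged.mp4}"
  # Temp file that lists the files to merge, one "file 'name'" line each
  list="$(mktemp filelist.XXXXXX)" || return 1
  for f in *.MP4; do
    [ -e "$f" ] || continue   # skip if the pattern matched no files
    printf "file '%s'\n" "$f" >> "$list"
  done
  if [ ! -s "$list" ]; then
    echo "No .MP4 files found" >&2
    rm -f "$list"
    return 1
  fi
  # -f concat reads the list; -c copy joins the streams without recompression
  ffmpeg -f concat -safe 0 -i "$list" -c copy "$out"
  status=$?
  rm -f "$list"               # clean up the temp file
  return $status
}

# Example call: merge_mp4 merged.mp4
```

To run this as a standalone script, the example call at the bottom would need to be uncommented.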

Save the script above as, put it in the folder of your files, make it executable, with a command like:

chmod u+x

run the file:


and watch the magic.

The script above can be remixed in various ways. For example, if you want a smaller output file (the original GoPro files are quite large), you can use FFmpeg’s default MP4 compression settings by removing the “-c copy” part in the last line above. That will also make the script take much longer, since it will recompress the output file.