Frame differencing with FFmpeg

I often want to create motion videos, that is, videos that only show what changed between frames. Such videos are nice to look at, and so-called “frame differencing” is also the starting point for many computer vision algorithms.

We have made several tools for creating motion videos (and more) at the University of Oslo: the standalone VideoAnalysis app (Win/Mac) and the different versions of the Musical Gestures Toolbox. These are all great tools, but sometimes it would be nice also to create motion videos in the terminal using FFmpeg.

I have previously written about the tblend function in FFmpeg, which I thought would be a good starting point. However, it turned out to be slightly more challenging than I had expected. Hence, this blog post is to help others looking to do the same.

Here is the source video I used for testing:

Source video of dance improvisation.

First, I tried this one-liner:

ffmpeg -i input.mp4 -filter_complex "tblend=all_mode=difference" output.mp4

It does the frame differencing, but I end up with a green image:

The result of the tblend filter.

I spent quite some time looking for a solution. Several people report a similar problem, but there are few answers. Finally, I found this explanation suggesting that the source video is in YUV while the filter expects RGB. To get the correct result, we need to convert to planar RGB by adding format=gbrp before tblend in the filter chain:

ffmpeg -i input.mp4 -filter_complex "format=gbrp,tblend=all_mode=difference" output.mp4

The final motion video after RGB conversion.
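
To see why the YUV version turns green, it helps to do the arithmetic for a single pixel. The sketch below is only an illustration: the function names are made up, and it assumes full-range BT.601 conversion coefficients (many videos use other ranges and matrices, but the tendency is the same). The difference mode takes the absolute difference of each plane, so two nearly identical frames give values close to zero in all three planes. In YUV, neutral colour sits at 128 in the U and V planes, not at 0, so "zero chroma" is read as a strong green.

```shell
# Keep a value inside the 8-bit range 0..255.
clamp() {
  local x=$1
  if [ "$x" -lt 0 ]; then x=0; fi
  if [ "$x" -gt 255 ]; then x=255; fi
  echo "$x"
}

# Convert one YUV pixel to RGB. Assumption: full-range BT.601 coefficients,
# scaled by 1000 so that integer arithmetic is enough.
yuv_to_rgb() {
  local y=$1 u=$2 v=$3 r g b
  r=$(( (1000 * y + 1402 * (v - 128)) / 1000 ))
  g=$(( (1000 * y - 344 * (u - 128) - 714 * (v - 128)) / 1000 ))
  b=$(( (1000 * y + 1772 * (u - 128)) / 1000 ))
  echo "$(clamp "$r") $(clamp "$g") $(clamp "$b")"
}

# Absolute difference of two plane values, as a difference blend computes it.
absdiff() {
  local d=$(( $1 - $2 ))
  if [ "$d" -lt 0 ]; then d=$(( -d )); fi
  echo "$d"
}

# The same grey pixel in two consecutive frames, with a tiny change in brightness.
dy=$(absdiff 120 118)
du=$(absdiff 128 128)
dv=$(absdiff 128 128)
yuv_to_rgb "$dy" "$du" "$dv"   # prints "0 137 0", i.e. green instead of near-black
```

After converting to RGB first, the same near-zero differences in the R, G and B planes simply mean "almost black", which is what a motion video should show where nothing moves.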

I have now also added this to the function mgmotion in the Musical Gestures Toolbox for Terminal.

Kayaking motion analysis

Like many others, I bought a kayak during the pandemic, and I have had many nice trips in the Oslo fjord over the last year. Working at RITMO, I think a lot about rhythm these days, and the rhythmic nature of kayaking made me curious to investigate the pattern a little more.

Capturing kayaking motion

My spontaneous investigations into kayak motion began with simply recording a short video of myself kayaking. This was done by placing an action camera (a GoPro Hero 8, to be precise) on my life vest. The result looks like this:

In the future, it would be interesting to also test with a proper motion capture system (see this article for an overview of different approaches). However, as they say, the best motion capture system is the one you have at hand, and cameras are by far the easiest one to bring around.

Analysing kayaking motion

For the analysis, I reached for the Musical Gestures Toolbox for Python. It has matured nicely over the last year and is also where we are putting in most new development efforts these days.

The first step of motion analysis is to generate a motion video:

From the motion video, MGT will also create a motiongram:

Motiongram of a kayaking video.

From the motiongram, it is pretty easy to see the regularity of the kayaking strokes. This may be even easier from the videogram:

Videogram of a kayaking video.

We also get information about the centroid and quantity of motion:

Centroid and quantity of motion of the kayaking video.

The quantity of motion can be used for further statistical analysis. But for now, I am more interested in exploring how it is possible to better visualise the rhythmic properties of the video itself. It was already on the list to implement directograms in MGT, and this is even higher on the list now.

The motion average image (generated from the motion video) does not reveal much about the motion.

Motion average image of the kayaking video.

It is generated by calculating the average of all the frames. What puzzles me are the colour artefacts. I wonder whether they come from some compression error in the video or a bug somewhere in MGT for Python. I cannot see the same artefacts in the average image:

Average image of the kayaking video.

Analysing the sound of kayaking

The video recording also has sound, so I was curious to see if this could be used for anything. True, kayaking is a quiet activity, so I didn’t have very high hopes. Also, GoPros don’t have particularly good microphones, and they compress the sound a lot. Still, there could be something in the signal. To begin with, the waveform display of the sound does not tell that much:

A waveform of the sound of kayaking.

The spectrogram does not reveal that much either, although it is interesting to see the effects of the sound compression done by the GoPro (the horizontal lines from 5 kHz and upward).

A spectrogram of the sound of kayaking.

Then the tempogram is more interesting.

A tempogram of the sound of kayaking.

It is exciting to see that it estimates the tempo to be 122 BPM, and this resonates with theories about 120 BPM being the average tempo of moderate human activity.

This little investigation into the sound and video of kayaking made me curious about what else can be found from such recordings. In particular, I will continue to explore approaches to analysing the rhythm of audiovisual recordings. It also made me look forward to a new kayaking season!

Preparing video for Matlab analysis

Typical video files, such as MP4 files with H.264 compression, are usually small and of high visual quality. Such files are suitable for visual inspection but do not work well for video analysis. In most cases, computer vision software prefers to work with raw data or other compression formats.

The Musical Gestures Toolbox for Matlab works best with these file types:

  • Video: use MJPEG (Motion JPEG) as the compression format. This compresses each frame individually. Use .AVI as the container, since this is the one that works best on all platforms.
  • Audio: use uncompressed audio (16-bit PCM), saved as .WAV files (.AIFF usually also works fine). If you need to use compression, MP3 compression (MPEG-1, Layer 3) is still more versatile than AAC (used in .MP4 files). If you use a bitrate of 192 kbit/s or higher, you should not get too many artefacts.

Many people ask me how to convert from typical MP4 files (with H.264 video compression and AAC audio compression). The easiest solution (I think) is to use FFmpeg, the versatile command-line utility. Here is a one-liner that will convert from an .MP4 file into an .AVI file with MJPEG and PCM audio:

ffmpeg -i input.mp4 -c:a pcm_s16le -c:v mjpeg -q:v 3 -huffman optimal output.avi

The resultant file should work well in Matlab and other video analysis tools. We have included this conversion by default in the new Musical Gestures Toolbox for Python. So there, you can directly load an MP4 file, which will be converted to an AVI file using a script similar to the one above.
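
If you have a folder with a mix of files, it can be handy to check what is inside a file before converting it. Here is a rough sketch of that idea. The function names are my own inventions for illustration, and the check only looks at the codecs of the first video and audio streams with ffprobe, not at the container. The conversion itself is the same command as above.

```shell
# Succeed when the codec pair matches the MJPEG + 16-bit PCM recommendation.
# $1 = video codec name, $2 = audio codec name (as reported by ffprobe).
is_mgt_friendly() {
  [ "$1" = "mjpeg" ] && [ "$2" = "pcm_s16le" ]
}

# Print the codec name of the first video ("v") or audio ("a") stream.
# $1 = stream type, $2 = file name.
stream_codec() {
  ffprobe -v error -select_streams "$1:0" -show_entries stream=codec_name \
    -of default=noprint_wrappers=1:nokey=1 "$2"
}

# Convert a file only when its codecs differ from the recommendation.
# The output name is the input name with the extension replaced by .avi.
convert_if_needed() {
  local src=$1
  local dst="${src%.*}.avi"
  if is_mgt_friendly "$(stream_codec v "$src")" "$(stream_codec a "$src")"; then
    echo "skip: $src"
  else
    ffmpeg -i "$src" -c:a pcm_s16le -c:v mjpeg -q:v 3 -huffman optimal "$dst"
  fi
}
```

With these functions loaded, running convert_if_needed input.mp4 would create input.avi, while a file that already holds MJPEG video and PCM audio is left alone.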

Running a successful Zoom Webinar

I have been involved in running some Zoom Webinars over the last year, culminating with the Rhythm Production and Perception Workshop 2021 this week. I have written a general blog post about the production. Here I will write a little more about some lessons learned on running large Zoom Webinars.

In previous Webinars, such as the RITMO Seminars by Rebecca Fiebrink and Sean Gallagher, I ran everything from my office. These were completely online events, based on each person sitting with their own laptop. This is where the Zoom Webinar solution shines.

Things become more complex once you try to run a hybrid event, where some people are online and others are on-site. Then you need to combine methods for video production and streaming with those of a streamlined video conferencing solution. It is doable but quite tricky.

RPPW 2021 chair Anne Danielsen introduces one of the panels from RITMO.

The production of the RPPW Webinar involved four people: myself as “director”, RITMO admin Marit Furunes as “Zoom captain”, and two MCT students, Thomas Anda and Wenbo Yi as “Zoom station managers”. We could probably have done it with three people, but given that other things were going on simultaneously, I am happy we decided to have four people involved. In the following, I will describe each person’s role.

Zoom station 1

The first Zoom station, RITMO 1, was operated by Wenbo Yi. He was sitting behind the lecture desk in the hall, equipped with one desktop PC with two screens. The right screen was split to be shown with a projector on the wall. This was the screen that showed displays when there were breaks, and so on.

MCT student Wenbo Yi in control of RITMO 1 during RPPW.

There were three cameras connected to the PC: one front-facing that we used for the moderator of the keynote presentations, one next to the screen on the wall that showed a close-up of conference chair Anne Danielsen during the introductions, and one in the back of the hall that showed the whole space.

Four microphones were connected to the PC and the PA system in the hall. One, on the desk, was used for the keynote moderation. Of the wireless microphones, we only used one, a handheld that Anne used during her introductions.

The nice thing about Zoom is that it is easy for a person to turn cameras and microphones on and off. However, this is designed around the concept that you are sitting in front of your computer. When you are standing in the middle of the room, someone else will need to click. That was the job of Wenbo. He switched between cameras and microphones and turned slides on and off.

Zoom station 2

The second Zoom station, RITMO 2, was operated by Thomas Anda, sitting in the control room behind the hall. He was controlling a station that was originally designed as a regular video streaming setup. This includes two remote-controlled PTZ cameras connected to a video mixer. For regular streams, we would have tapped the audio and video from the auditorium and made a mix to be streamed to, for example, YouTube. Now, we mainly used one of the PTZ cameras to display the general picture from the hall.

MCT student Thomas Anda sitting in the control room with the RITMO 2 station during RPPW.

The main job of Thomas was to play back all 100 pre-recorded videos. We had a separate, Ethernet-cabled PC for this task, which was connected to the Zoom Webinar and shared its screen. The videos were cued in VLC, with poster images inserted between the videos. During our testing of this setup, we discovered the uneven sound level of the video files, which led to a normalization procedure for all of them.

In theory, we could have played the video files from RITMO station 1. However, both Wenbo and Thomas had plenty of things to think about, so it would have been hard to do it all alone. Also, having two stations allowed for having two camera views and also added redundancy for the stream.

Zoom captain station

The third station was controlled by our “Zoom captain”, Marit Furunes. She served as the main host of the Webinar most of the time and was responsible for promoting and de-promoting people to panellists during the conference.

Marit Furunes was the “Zoom captain”. Here, with conference chair Anne Danielsen in the background.

It is possible to set up panels in advance, but that requires separate Zoom Webinars and individualized invitation e-mails. We have experienced in the past that people often forget about these e-mails, so we decided to just have one Zoom Webinar for the entire conference and rather move people in and out of the panels manually. That required some manual work by Marit, but it also meant that she was in full control of who could talk in each session.

She was also in charge of turning people’s video and sound on and off, and ensuring that the final stream looked fine.

Director station

I was sitting next to Marit, controlling the “director station”. I was mainly checking that things were running as they should, but I also served as a backup for Marit when she took breaks. In between, I also tweeted some highlights, replied to e-mails that came in, and commented on things in Slack.

From the “director station” I controlled one laptop as host and had another laptop for watching the output stream.

Monitoring and control

Together, the four of us involved in the production managed to create a nice-looking result. There were some glitches, but in general, things went as planned. The most challenging part of working with a Webinar-based setup is the lack of control and monitoring. What we have learned is that “what you see is not what you get”. We never really understood what to click on to get the final result we wanted. For example, we often had to click back and forth between the “gallery” and “speaker” view to get the desired result.

Also, as a host, you can turn off other cameras, but you cannot turn them on, only ask the person to turn them on. That makes sense in many ways. After all, you should not be allowed to turn on the camera of another person remotely. However, as a production tool in an auditorium, this was cumbersome. It happened that Marit and I wanted to turn on the main video camera in the hall (from RITMO 2) or the front-facing camera (from RITMO 1). But we were not allowed to do this. Instead, we had to ask Thomas or Wenbo to turn on the cameras.

Summing up

The Zoom Webinar function was clearly made for a traditional video-conferencing-like setup. For that, it works very well. As described above, we also managed to make it work quite well in a hybrid setup. However, this required a team of four people and five computers. The challenge was that we never really felt that we were completely in control of things. Also, we could not properly monitor the different signals.

The alternative would be a regular video streaming solution, based on video and audio mixers. That would have given us much more control of the final stream, including good monitoring. It would have required more equipment (that we have) but not necessarily more people. We would have lost out on some of the Zoom functionality, though, like the Q&A functionality that works very well.

Next time I do something like this, I will try to run a stream-based setup instead. Panellists could then come in through a Zoom Room, which could be mixed into a stream using either our hardware video mixer or a software mixer like OBS. Time will tell if that ends up being better or worse.

Normalize audio in video files

We are organizing the Rhythm Production and Perception Workshop at RITMO next week. As mentioned in another blog post, we have asked presenters to send us pre-recorded videos. They are all available on the workshop page.

During the workshop, we will play sets of videos in sequence. When doing a test run today, we discovered that the sound levels differed wildly between files. There is clearly a need to normalize the sound levels to create a good listener experience.

Batch normalization

How does one normalize around 100 video files without too much pain and effort? As always, I turn to my go-to video companion, FFmpeg. Here is a small script I made to do the job:


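shopt -s nullglob
for i in *.mp4 *.MP4 *.mov *.MOV *.flv *.webm *.m4v; do
   # Strip the extension (everything after the last dot).
   name="${i%.*}"
   # Copy the video stream as-is and normalize only the audio.
   ffmpeg -i "$i" -c:v copy -af loudnorm=I=-16:LRA=11:TP=-1.5 "${name}_norm.mp4"
done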

This was the result of some searching around for a smart solution (in Qwant, btw, my new preferred search engine). For example, I use the “nullglob” trick so that I can list multiple file types in the for loop, and the types that have no matching files are simply skipped.
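
If you are curious about what nullglob actually does, here is a small bash demonstration that is safe to run, since it only works inside a temporary folder. The file names and the count_matches helper are made up for the example. Without nullglob, a pattern that matches nothing is passed on as a literal string (like "*.webm"), which FFmpeg would then try to open as a file name.

```shell
# Work in a throwaway directory so that no real files are touched.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch talk.mp4 poster.MOV

# Count how many items the for loop would visit.
count_matches() {
  local n=0 f
  for f in *.mp4 *.MOV *.webm; do
    n=$((n + 1))
  done
  echo "$n"
}

shopt -u nullglob
without=$(count_matches)   # 3: the unmatched "*.webm" stays as a literal string
shopt -s nullglob
with=$(count_matches)      # 2: the unmatched pattern expands to nothing
echo "without nullglob: $without, with nullglob: $with"
```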

The most important part of the script is the normalization, which I found in this blog post. The settings are described as:

  • loudnorm: the name of the normalization filter
  • I: the target integrated loudness in LUFS (from -70 to -5.0)
  • LRA: the target loudness range in LU (from 1.0 to 20.0)
  • TP: the maximum true peak in dBTP (from -9.0 to 0.0)

The settings in the script normalize to a high but not maximum signal, which leaves some headroom.
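
If you want to experiment with other settings, a small helper can catch typos before FFmpeg starts working. This is just a sketch: the function names are hypothetical, and the allowed ranges are copied from the list above, so they may differ from what your FFmpeg version accepts. It relies on awk for the comparisons because the shell itself cannot compare decimal numbers.

```shell
# in_range VALUE MIN MAX: succeed when MIN <= VALUE <= MAX.
in_range() {
  awk -v x="$1" -v lo="$2" -v hi="$3" 'BEGIN { exit !(x >= lo && x <= hi) }'
}

# loudnorm_filter I LRA TP: print the filter string, or fail with a message
# when a value is outside the ranges listed above.
loudnorm_filter() {
  local i=$1 lra=$2 tp=$3
  in_range "$i" -70 -5 || { echo "I out of range: $i" >&2; return 1; }
  in_range "$lra" 1 20 || { echo "LRA out of range: $lra" >&2; return 1; }
  in_range "$tp" -9 0 || { echo "TP out of range: $tp" >&2; return 1; }
  echo "loudnorm=I=$i:LRA=$lra:TP=$tp"
}

loudnorm_filter -16 11 -1.5   # prints loudnorm=I=-16:LRA=11:TP=-1.5
```

The output can be passed straight to FFmpeg, for example with -af "$(loudnorm_filter -16 11 -1.5)".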

To compress or not

To save processing time and avoid recompressing the video, I have included “-c:v copy” in the script above. Then FFmpeg copies over the video content directly. This is fine for videos with “normal” H.264 compression, which is the case for most .MP4 files. However, when getting 100 files made on all sorts of platforms, there are surely some oddities. There were a couple of cases with weird compression formats that, for some reason, failed with the above script. One also had interlacing issues. For them, I modified the script to recompress the files.


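shopt -s nullglob
for i in *.mp4 *.MP4 *.mov *.MOV *.flv *.webm *.m4v; do
    # Strip the extension (everything after the last dot).
    name="${i%.*}"
    # Re-encode the video with de-interlacing and normalize the audio.
    ffmpeg -i "$i" -vf yadif -af loudnorm=I=-16:LRA=11:TP=-1.5 "${name}_norm.mp4"
done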

In this script, the copy part is removed. I have also added “-vf yadif”, which is a de-interlacing video filter.

Summing up

With the first script, I managed to normalize all 100 files in only a few minutes. Some of the files turned up with 0 bytes due to issues with copying the video data. So I ran through these with the second script. That took longer, of course, due to the need for compressing the video.
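
The two passes could also be combined, so that the slow recompression only happens for the files that need it. Below is a sketch of that idea, not the script I actually ran. The function names are hypothetical, and it assumes that a failed copy shows up either as an FFmpeg error or as a 0-byte output file, which was the case for my files.

```shell
# Succeed when the file is missing or has zero bytes (a failed stream copy).
is_empty_or_missing() {
  [ ! -s "$1" ]
}

# Try the fast stream copy first and fall back to recompression when it fails.
normalize_with_fallback() {
  local src=$1
  local dst="${src%.*}_norm.mp4"
  local af="loudnorm=I=-16:LRA=11:TP=-1.5"
  # First attempt: copy the video stream and only re-encode the audio.
  if ! ffmpeg -nostdin -y -i "$src" -c:v copy -af "$af" "$dst" \
      || is_empty_or_missing "$dst"; then
    # Second attempt: re-encode the video, de-interlacing with yadif.
    ffmpeg -nostdin -y -i "$src" -vf yadif -af "$af" "$dst"
  fi
}
```

The function can be called from the same kind of for loop as in the scripts above, with normalize_with_fallback "$i" as the loop body. Note that the second attempt applies yadif to every file that failed the copy, also those that are not interlaced.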

All in all, the processing took around half an hour. I cannot even imagine how long it would have taken to do this manually in a video editor. I haven’t really thought about the need for normalizing the audio in videos like this before. Next time I will do it right away!