Frame differencing with FFmpeg

I often want to create motion videos, that is, videos that only show what changed between frames. Such videos are nice to look at, and so-called “frame differencing” is also the starting point for many computer vision algorithms.
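The core idea can be sketched in a few lines of NumPy (a toy illustration of the principle, not what FFmpeg does internally):

```python
import numpy as np

def frame_difference(prev_frame: np.ndarray, next_frame: np.ndarray) -> np.ndarray:
    """Absolute per-pixel difference between two frames.

    Static regions become 0 (black); only the moving parts remain visible.
    """
    # Work in a signed type to avoid uint8 wrap-around on subtraction.
    diff = next_frame.astype(np.int16) - prev_frame.astype(np.int16)
    return np.abs(diff).astype(np.uint8)

# Toy example: a white square moves one pixel to the right.
a = np.zeros((4, 4), dtype=np.uint8)
b = np.zeros((4, 4), dtype=np.uint8)
a[1:3, 0:2] = 255
b[1:3, 1:3] = 255
motion = frame_difference(a, b)
```

Pixels that are white in both frames cancel out; only the leading and trailing edges of the moving square survive.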

We have made several tools for creating motion videos (and more) at the University of Oslo: the standalone VideoAnalysis app (Win/Mac) and the different versions of the Musical Gestures Toolbox. These are all great tools, but sometimes it would also be nice to create motion videos directly in the terminal using FFmpeg.

I have previously written about the tblend function in FFmpeg, which I thought would be a good starting point. However, it turned out to be slightly more challenging than I had expected. Hence, this blog post is to help others looking to do the same.

Here is the source video I used for testing:

Source video of dance improvisation.

First, I tried this one-liner:

ffmpeg -i input.mp4 -filter_complex "tblend=all_mode=difference" output.mp4

It does the frame differencing, but I end up with a green image:

The result of the tblend filter.

I spent quite some time looking for a solution. Several people report a similar problem, but there are few answers. Finally, I found this explanation suggesting that the source video is in YUV while the filter expects RGB. To get the correct result, we need to add a format=gbrp to the filter chain:

ffmpeg -i input.mp4 -filter_complex "format=gbrp,tblend=all_mode=difference" output.mp4
The final motion video after RGB conversion.

I have now also added this to the function mgmotion in the Musical Gestures Toolbox for Terminal.
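For scripting, the corrected one-liner can be wrapped in a small helper. Here is a sketch in Python (the function names are my own, not the actual mgmotion implementation):

```python
import subprocess

def motion_video_cmd(infile: str, outfile: str) -> list:
    """Build the ffmpeg command for a frame-differenced motion video.

    format=gbrp converts the YUV input to planar RGB before tblend,
    which avoids the green tint described above.
    """
    return [
        "ffmpeg", "-i", infile,
        "-filter_complex", "format=gbrp,tblend=all_mode=difference",
        outfile,
    ]

def make_motion_video(infile: str, outfile: str) -> None:
    """Run the command (requires ffmpeg on the PATH)."""
    subprocess.run(motion_video_cmd(infile, outfile), check=True)

cmd = motion_video_cmd("input.mp4", "output.mp4")
```

Separating command construction from execution makes it easy to inspect the filter chain before running it on hours of material.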

New online course: Motion Capture

After two years in the making, I am happy to finally introduce our new online course: Motion Capture: The art of studying human activity.

The course will run on the FutureLearn platform and is for everyone interested in the art of studying human movement. It has been developed by a team of RITMO researchers in close collaboration with the pedagogical team and production staff at LINK – Centre for Learning, Innovation & Academic Development.


In the past, we had so few users in the fourMs lab that they could be trained individually. With all the new exciting projects at RITMO and an increasing number of external users, we realized that it was necessary to have a more structured approach to teaching motion capture to new users.

The idea was to develop an online course that would teach incoming RITMO students, staff, and guests about motion capture basics. After completing the online course, they would move on to hands-on training in the lab. However, once the team started sketching the content of the course, it quickly grew in scope. The result is a six-week online course, a so-called massive open online course (MOOC) that will run on the FutureLearn platform.

People talking in lab
From one of the early workshops with LINK, in which I explain the basics of a motion capture system (Photo: Nina Krogh).

MOOC experience

Developing a MOOC is a major undertaking, but we learned a lot when we developed Music Moves back in 2015-2016. Thousands of people have been introduced to embodied music cognition through that course. In fact, we will run it for the seventh time on 24 January 2022.

Motion capture is only mentioned in passing in Music Moves. Many learners ask for more. Looking around, we haven’t really found any general courses on motion capture. There are many system-specific tutorials and courses, but not any that introduce the basics of motion capture more broadly. As I have written about in the Springer Handbook of Systematic Musicology (open access version), there are many types of motion capture systems. Most people think about the ones where users wear a suit with reflective markers, but this is only one type of motion capture.

From biomechanics to data management

In the new Motion Capture course, we start by teaching the basics of human anatomy and biomechanics. I started using motion capture without that knowledge myself and have later realized that it is better to understand a bit about how the body moves before playing with the technology.

People talking in front of a whiteboard
RITMO lab engineer Kayla Burnim discusses the course structure with Audun Bjerknes and Mirjana Coh from LINK (Photo: Nina Krogh).

The following weeks in the course contain all the information necessary to conduct a motion capture experiment: setting up cameras, calibrating the system, post-processing, and analysis. The focus is on infrared motion capture, but some other sensing technologies are also presented, including accelerometers, muscle sensors, and video analysis. The idea is not to show everything but to give people a good foundation when walking into a motion capture lab.

The last week is dedicated to data management, including documentation, privacy, and legal issues. These are not the most exciting topics if you just want to get going with motion capture, but they are necessary if you are going to do research according to today’s regulations.

From idea to course

Making a complete online course is a major undertaking. Having done it twice, I would compare it to writing a textbook. Prior experience and a good team help, but it is still a significant team effort.

We worked with UiO’s Centre for Learning, Innovation and Academic Development, LINK, when developing Music Moves, and I also wanted to get them on board for this new project. They helped structure the development into different stages: ideation, development of learning outcomes, production planning, and production. It is tempting to start filming right away, but the result is much better if you plan properly. Last time, we made the quizzes and tests last; this time, I pushed to make them first so that we knew the direction we were heading.

People talking in front of a table
Mikkel Kornberg Skjeflo from LINK explains how the learning experience becomes more engaging by using different learning activities in the course (Photo: Nina Krogh).

Video production

In Music Moves, we did a lot of “talking head” studio recordings, like this one:

It works for conveying the content, but I look uncomfortable and don’t get the material across very well. I find the “dialogue videos” much more engaging:

Looking at the feedback from learners (we have had around 10 000 people in Music Moves over the years!), they also seem to engage more with less polished video material. So for Motion Capture, we decided to avoid “lecture videos”. Instead, we created situations where pairs would talk about a particular topic. We wrote scripts first, but the recordings were spontaneous, making for a much more lively interaction.

The course production coincided with MusicTestLab, an event for testing motion capture in a real-world venue. The team agreed to use this event as a backdrop for the whole course, making for a quite chaotic recording session. Filming an online course in parallel with running an actual experiment that was also streamed live was challenging, but it also gives the learners an authentic look into how we work.

Musicians on stage with motion capture equipment.
Audun Bjerknes and Thea Dahlborg filming a motion capture experiment in the foyer of the Science Library.

Ready for Kick-off

The course will run on FutureLearn from 24 January 2022. Over the last months, we have done the final tweaking of the content. Much effort has also been put into ensuring accessibility. All videos have been captioned, images have been labelled, and copyrights have been checked. That is why I compare it to writing a textbook. Writing the content is only part of the process. Similarly, developing a MOOC is not only about writing texts and recording videos. The whole package needs to be in place.

Music Moves has been running since 2016 and is still going strong. I am excited to see how Motion Capture will be received!

Try not to headbang challenge

I recently came across a video of the so-called “Try not to headbang” challenge, where the idea is, well, not to headbang while listening to music. This immediately caught my attention. After all, I have been researching music-related micromotion over the last few years and have run the Norwegian Championship of Standstill since 2012.

Here is an example of Nath & Johnny trying the challenge:

As seen in the video, they are doing ok, although they are far from sitting still. Running the video through the Musical Gestures Toolbox for Python, it is possible to see clearly when and how much they moved.

Below is a quick visualization of the 11-minute-long sequence. The videogram (similar to a motiongram but of the original video) shows quite a lot of motion throughout. There is no headbanging, but they do not sit still.

A videogram of the complete video recording (top) with a waveform of the audio track. Two selected frames from the sequence and “zoomed-in” videograms show the motion of specific passages.
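Conceptually, a videogram is simple: each video frame is collapsed into a single column, and the columns are stacked side by side over time, so horizontal position in the image reads as time. Here is a toy NumPy sketch of the idea (a simplification, not the actual Musical Gestures Toolbox code):

```python
import numpy as np

def videogram(frames: np.ndarray) -> np.ndarray:
    """Collapse each frame to one column and stack the columns over time.

    frames: array of shape (time, height, width); the result has shape
    (height, time), preserving vertical position while showing time on x.
    """
    # Average over the width axis: one column per frame.
    columns = frames.mean(axis=2)   # shape: (time, height)
    return columns.T                # shape: (height, time)

# Toy clip: 5 frames of 4x4 pixels, brightness increasing over time.
clip = np.stack([np.full((4, 4), 10.0 * t) for t in range(5)])
vg = videogram(clip)
```

In the toy clip, the videogram simply brightens from left to right; in a real recording, it reveals when and where in the frame the motion happened.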

There are many good musical examples listed here. We should consider some of them for our next standstill championship. If corona allows, we plan to run a European Championship of Standstill in May 2022. More information soon!

2022, a Year of Sound Actions

I have over the last few years worked on a book project with the working title Sound Actions. The manuscript has been through peer review and several rounds of editing. Now it is in the editorial process and will be published by The MIT Press later in 2022.

Action-Sound Couplings and Mappings

The book is based on the action-sound theory I developed as part of my dissertation. My main point is that we experience the world through action-sound couplings and mappings. An action-sound coupling arises from the interaction between physical objects, and its sound is determined by the mechanical and acoustical properties of the objects involved.

An action-sound coupling is based on the interaction with physical objects.

On the other hand, an action-sound mapping is designed and constructed using either analogue or digital technologies.

An action-sound mapping is designed and constructed using electronic technologies.

My main argument in the book is that action-sound couplings and mappings are different. This is not to say that one type is better than the other; they are just different. I also suggest that we experience them differently. This is based on ecological psychology, focusing on how we experience the world through our interaction with the environment. The environment in my context can be anything that leads to sound production, which, in fact, is a lot of the things we do.

Sound-producing actions can be thought of as “chunks” of continuous motion. These actions are related to “chunks” of the continuous sound. Such sound actions form the basis for our experience of musical gestures.

Sound Action Types

We all produce and experience a number of sound actions every day. Most of them we don’t think about. Sometimes we do, for example, when we enter a quiet place, such as a library. Then we try to be quiet, that is, adjust our actions to produce less sound. We may also be aware of others’ sound actions, particularly if someone is loud. There is clearly a social aspect of sound actions.

There are many different types of sound actions. Based on the taxonomy of Pierre Schaeffer, we may talk about three main types of sound actions: impulsive, sustained, and iterative.

An illustration of the three main sound types proposed by Pierre Schaeffer, and their relationship to sound-producing actions.

Daily Sound Actions

Throughout 2022, I will publish one sound action daily. These will be short video recordings of particular sound actions. I have three goals with this project:

  • Allow people to reflect on various types of everyday sound actions. Hopefully, I can inspire others to contribute with some more sound actions as well.
  • Serve as a “countdown” to the publication of my book later this year.
  • Explore sound actions as I am about to start up my new research project AMBIENT: Bodily Entrainment to Audiovisual Rhythms.

With that, here we go:

Pre-processing Garmin VIRB 360 recordings with FFmpeg

I have previously written about how it is possible to “flatten” a Ricoh Theta+ recording using FFmpeg. Now, I have spent some time exploring how to process some recordings from a Garmin VIRB camera.

Some hours of recordings

The starting point was a bunch of recordings from our recent MusicLab Copenhagen featuring the amazing Danish String Quartet. A team of RITMO researchers went to Copenhagen and captured the quartet in both rehearsal and performance. We have data and media from motion capture, eye tracking, physiological sensing, audio, video, and more. The plan is to make it all available on OSF.

When it comes to video, we have many different recordings, ranging from small GoPro cameras hanging around the space to professional streaming cameras operated by a camera crew. In addition, we have one recording from a Garmin VIRB 360 camera hanging in the chandelier close to the musicians. Those recordings are what I will explore in this post.

An upside-down 360 recording

The Garmin VIRB camera records a 360-degree video using two 180-degree lenses. Unlike Ricoh Theta’s stereo-spherical videos, the Garmin stores the recording with an equirectangular projection. Here is a screenshot from the original recording:

An image from the original video recorded from a Garmin VIRB camera.

There are some obvious problems with this recording. First, the recording is upside down since the camera was hanging upside down from a chandelier above the musicians. The panning and tilting of the camera are also slightly off relative to the placement of the musicians. So some pre-processing is necessary before analysing the files.

Most 360-degree cameras come with software for adjusting the image. The Garmin app can do it, but I already have all the files on a computer. It could also be done in video editing software, although I haven’t explored that. In any case, I was looking for an option that would allow me to batch process a bunch of videos (yes, we have hours of recordings, and they are split into different files).

Since working on the Ricoh files last year, I have learned that FFmpeg’s new v360 filter is part of the regular release. So I wanted to give it a spin. Along the way, I learned more about different image projection types, which I will outline in the following.

Equirectangular projection

The starting point was the equirectangular projection coming out of the Garmin VIRB. The first step in making it more useful is to flip the video around and place the musicians in the centre of the image.

Rotating, flipping, panning, and tilting the image to place the musicians in the centre.

The different functions of the v360 filter in FFmpeg are documented but not explained very well. So it took me quite some time to figure out how to make the adjustments. This is the one-liner I ended up with to create the image above:

ffmpeg -i input.mp4 -vf "v360=input=e:output=e:yaw=100:pitch=-50:v_flip=1:h_flip=1" output.mp4

There are some tricks I had to figure out to make this work. First, I use the v360 filter with equirectangular (shortened to e) as both the input and output of the filter. The rotation was done using the v_flip and h_flip options, which flip the image around the horizontal and vertical axes. In the original image, the cellist was on the edge, so I also had to turn the whole image horizontally using yaw and move the entire image down a bit using pitch. It took me some manual testing to figure out the correct numbers here.
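Since it is easy to lose track of the different options, it can help to build the filter string programmatically. Here is a small Python helper illustrating the structure (my own convenience sketch, not part of FFmpeg):

```python
def v360_filter(output="e", yaw=0, pitch=0, roll=0, v_flip=False, h_flip=False):
    """Build a v360 filter string for an equirectangular input.

    output: projection type ("e", "flat", "c6x1", "c3x2", "eac", ...).
    yaw/pitch/roll: rotation in degrees; v_flip/h_flip: mirror the image.
    """
    parts = ["v360=input=e:output={}".format(output),
             "yaw={}".format(yaw),
             "pitch={}".format(pitch)]
    if roll:
        parts.append("roll={}".format(roll))
    if v_flip:
        parts.append("v_flip=1")
    if h_flip:
        parts.append("h_flip=1")
    return ":".join(parts)

# Reproduce the filter string from the one-liner above.
flt = v360_filter(yaw=100, pitch=-50, v_flip=True, h_flip=True)
```

Because the v360 options are named, their order within the string does not matter; the helper just makes the knobs explicit.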

Since the analysis will be focused on the musicians, I have also cropped the image using the general crop filter (note that you chain multiple filters with commas in FFmpeg; if you instead add another -vf option, only the last one will be used):

ffmpeg -i input.mp4 -vf "v360=input=e:output=e:yaw=100:pitch=-50:v_flip=1:h_flip=1, crop=1700:1000:1000:550" output_crop.mp4

This gives us a nicely cropped video of the musicians:

Cropped equirectangular image.

This video already looks quite good and could be used for analysis (for example, in one of the versions of Musical Gestures Toolbox). But I wanted to explore if other projections may work better.

Gnomonic projection

An alternative projection is called gnomonic in fancy terminology and “flat” in more plain language. It looks like this:

A gnomonic projection of the video.

ffmpeg -i input.mp4 -vf "v360=input=e:output=flat:v_flip=1:h_flip=1:yaw=90:pitch=-30:h_fov=150:v_fov=150" output_flat.mp4

Here I used the flat output type in FFmpeg and did the same flipping, panning and tilting as above. I had to use slightly different numbers for yaw and pitch to make it work, though. Also, here I added some cropping to focus on the musicians:

ffmpeg -i input.mp4 -vf "v360=input=e:output=flat:v_flip=1:h_flip=1:yaw=90:pitch=-30:h_fov=150:v_fov=150, crop=3800:1100:0:800" output_flat_crop.mp4

This left me with the final video:

Cropped gnomonic projection.

There are many problems with this projection, and the most obvious is the vast size difference between the musicians. So I won’t use this version for anything, but it was still interesting to explore.

Cube map

A different projection is the cube map. Here is an illustration of how it relates to the equirectangular projection:

Overview of different projection types (from Sizima).

The v360 filter also allows for creating such projections. It has multiple versions of this idea. I found a nice blog post by Anders Jirås that helped me understand how this function works.

First, I tested the c6x1 output function:

ffmpeg -i input.mp4 -vf "v360=input=e:output=c6x1:out_forder=frblud:yaw=50:pitch=-30:roll=50:v_flip=1:h_flip=1" output_c6x1.mp4

I changed the order of images using out_forder (as documented here) and (again) played around with the yaw, pitch, and roll to make something that worked well. This resulted in an image like this:

A cube map projection of the video. Here with a 6×1 cube layout.

There is also a function called c3x2, which will generate an image like this:

A 3×2 cube projection.

Adding some cropping to the 3×2 projection:

ffmpeg -i input.mp4 -vf "v360=input=e:output=c3x2:out_forder=frblud:yaw=50:pitch=-30:roll=50:v_flip=1:h_flip=1, crop=1500:1080:150:0" output_c3x2_crop.mp4

Then we end up with an image like this:

A cropped 3×2 projection.

This looks quite weird, mainly because the cellist wraps into a different cube face than the others.

Equi-angular cubemap

Finally, I wanted to test a new projection invented by Google a couple of years ago: the Equi-Angular Cubemap. The idea is to create a projection with fewer artefacts at the edges:

Equirectangular Projection (left), Standard Cubemap (middle), Equi-Angular Cubemap (right) (from a Google blog post).

In FFmpeg, this can be achieved with the eac function:

ffmpeg -i input.mp4 -vf "v360=input=e:output=eac:yaw=100:pitch=-50:roll=0:v_flip=1:h_flip=1" output_eac.mp4

The resultant image looks like this:

An equi-angular cubemap projection.

Only the top part of the image is useful for my analysis; it can be cropped out like this:

ffmpeg -i input.mp4 -vf "v360=input=e:output=eac:yaw=100:pitch=-50:roll=0:v_flip=1:h_flip=1, crop=2200:1200:750:0" output_eac_crop.mp4

The final image looks like this:

A cropped equi-angular projection.

The equi-angular cubemap should give a better projection overall because it avoids too much distortion at the edges. However, that comes at the cost of some more artefacts in the central parts of the image. So when cropping into the image as I did above, the equirectangular projection may actually work best.

Summing up

After quite some time fiddling around with FFmpeg and trying to understand the various parts of the new v360 filter, I can conclude that the original equirectangular projection is probably the best one to use for my analysis. The other projections probably work better for various types of 3D playback. Still, it was useful to learn how to run these processes using FFmpeg. This will surely come in handy when I am going to process a bunch of these files in the near future.
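For that batch processing, the chosen settings can be looped over all the files with a small script. Here is a hedged sketch in Python (the folder layout and file names are placeholders, and the filter string echoes the equirectangular example above):

```python
import subprocess
from pathlib import Path

# The v360 filter chain to apply to every file (values as in the
# equirectangular example; adjust per recording session).
FILTER = "v360=input=e:output=e:yaw=100:pitch=-50:v_flip=1:h_flip=1"

def build_command(infile: Path, outdir: Path) -> list:
    """Construct the ffmpeg command for one file without running it."""
    outfile = outdir / "{}_processed.mp4".format(infile.stem)
    return ["ffmpeg", "-i", str(infile), "-vf", FILTER, str(outfile)]

def batch_process(indir: str, outdir: str) -> None:
    """Apply the filter chain to every .mp4 in a folder (needs ffmpeg)."""
    out = Path(outdir)
    out.mkdir(exist_ok=True)
    for f in sorted(Path(indir).glob("*.mp4")):
        subprocess.run(build_command(f, out), check=True)

# Hypothetical file name, for illustration only.
cmd = build_command(Path("GarminVIRB_0001.mp4"), Path("processed"))
```

Keeping the filter string in one place means the yaw/pitch values only need to be tuned once for the whole session of split-up files.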