Add fade-in and fade-out programmatically with FFmpeg

There is always a need to add fade-in and fade-out to audio tracks. Here is a way of doing it for a bunch of video files. It may come in handy with the audio normalization script I have shown previously. That script is based on continuously normalizing the audio, which may result in some noise in the beginning and end (because there is little/no sound in those parts, hence they are normalized more).

It is easy to add a fade-in to the beginning of a file using FFmpeg’s afade filter. From the documentation, you can do a 15-second fade-in like this:

afade=t=in:ss=0:d=15

And a 25-second fade-out like this:

afade=t=out:st=875:d=25

Unfortunately, the latter requires that you specify when to start the fade-out. That doesn’t work well in general, and particularly not for batch processing.
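
One could, of course, script around this by first asking ffprobe for the file duration and subtracting the fade length. A minimal sketch, assuming ffprobe and bc are available (the file names and the 25-second fade are just placeholders):

# Find the duration in seconds, then start the fade-out 25 seconds before the end.
duration=$(ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 input.mp4)
start=$(echo "$duration - 25" | bc)
ffmpeg -i input.mp4 -c:v copy -af "afade=t=out:st=${start}:d=25" output.mp4

That works, but it adds an extra lookup for every file.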

A neat trick

Searching for solutions, I found a neat trick that solved the problem. First, you create the normal fade-in. Then you make the fade-out by reversing the audio stream, applying a fade-in, and then reversing again. The whole thing looks like this:

ffmpeg -i input.mp4 -c:v copy -af "afade=d=5, areverse, afade=d=5, areverse" output.mp4

A hack, but it works like a charm! And you don’t need to re-encode the video (hence the -c:v copy option above).

Putting it together

If you want to run this on a folder of files and run a normalization in the same go (so you avoid recompressing more than once), then you can use this bash script:

#!/bin/bash

shopt -s nullglob
for i in *.mp4 *.MP4 *.mov *.MOV *.flv *.webm *.m4v; do
   name="${i%.*}"   # file name without its extension (also works for names with spaces or extra dots)
   ffmpeg -i "$i" -c:v copy -af "loudnorm=I=-16:LRA=11:TP=-1.5, afade=d=5, areverse, afade=d=5, areverse" "${name}_norm.mp4"
done

Save, run, and watch the magic!
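
For instance, assuming the script is saved as fadenorm.sh (the name is arbitrary) in the folder with the videos:

chmod +x fadenorm.sh
./fadenorm.sh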

Video visualizations of mountain walking

After exploring some visualizations of kayaking, I was eager to see how a similar approach could work for walking. On a trip to the Norwegian mountains, at Haugastøl, halfway between Oslo and Bergen, I strapped a GoPro Hero 10 Black to my chest and walked up and down a nearby hill called Storevarden. The walk took approximately 25 minutes, and a fast-forward version of the video can be seen here:

What can one get from the audio and video of such a trip? Here are some results generated with various functions from the Musical Gestures Toolbox for Python.

Static visualizations

The first trial was to create some static visualizations from the video recording.

A keyframe image display shows nine sampled images from the video. The first ones mainly show the path, since I was leaning forward while walking uphill, and the last ones show the scenery.
An average image of the whole video does not tell much in this case, and I guess it shows that (on average) I looked up most of the time. Hence the horizon can be seen toward the bottom of the image.

The average image is not particularly interesting in this case. Then it may be better to create a history video that averages images over a shorter period, such as in this video:

A history video averages over several seconds of video footage.

Still quite shaky, but it creates an interesting soft-focus rendition of the video. This may resemble how I perceived the scenery as I walked up and down.
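
The history video above was made with the toolbox, but a comparable sliding-average effect can also be sketched directly in FFmpeg with the tmix filter. This is only an approximation, not the toolbox’s implementation, and the 60-frame window (roughly two seconds at 30 fps) is an arbitrary choice:

# Average each frame with its preceding frames over a 60-frame window; -an drops the audio.
ffmpeg -i input.mp4 -vf "tmix=frames=60" -an history_sketch.mp4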

Videograms

Better visualizations, then, are the videograms, which give more information about the spatiotemporal features of the video recording.

A horizontal videogram of the 25-minute walking sequence reveals the spatiotemporal differences in the recording: first walking upward facing the ground, then having a short break on the top, and then walking downward facing the scenery.
A vertical videogram is less interesting in this case.

Motiongrams

The videograms are based on collapsing the original images in the video sequence. Motiongrams, on the other hand, collapse the motion image sequence, clearly showing what changed between frames.

A horizontal motiongram reveals the same information as the videogram and clearly shows the break I took at the top of the hill (the black part in the middle).
A vertical motiongram is not particularly relevant.

Audio analysis

What can one get out of the audio recording of walking? The waveform does not tell much, except that the average levels look higher in the second half (where I was walking down).

A waveform of the audio that I recorded during the 25-minute walking.
The sonogram shows a lot of energy throughout the spectrum, and my break at the top can be seen a little over halfway through. A peculiar black line at 8.7 kHz has to come from the GoPro, and the camera also cuts all sound above approximately 13 kHz.
The tempogram also reveals the break in the middle and estimates the tempo of my walking at almost 120 BPM.

It is fascinating how the estimated tempo of my walking was almost 120 BPM, which happens to be similar to the 2 Hz frequency found in many studies of walking and everyday activities. It will be interesting to try a similar approach for other walking videos.

Removing audio hum using a highpass filter in FFmpeg

Today, I recorded Sound Action 194 – Rolling Dice as part of my year-long sound action project.

The idea has been to do as little processing as possible to the recordings. That is because I want to capture sounds and actions as naturally as possible. The recorded files will also serve as source material for both scientific and artistic explorations later. For that reason, I only trim the recordings non-destructively using FFmpeg.
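
Such a non-destructive trim can be done as a stream copy in FFmpeg, where nothing is re-encoded. A sketch, with placeholder file names and timings:

# Start 3 seconds in and keep the next 42 seconds, copying the streams as-is.
# With -c copy, the start snaps to a nearby keyframe, so the cut is not frame-accurate.
ffmpeg -ss 00:00:03 -i input.mp4 -t 42 -c copy trimmed.mp4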

Recording the dice example, however, I noticed an unfortunate low-frequency hum in the original recording:

The original recording has an unfortunate low-frequency hum.

I like the rest of the recording, so I thought it would be a pity to skip publishing this sound action only because of the hum. So I decided to break my rule of not processing the sound and apply a simple highpass filter to remove the noise.

Fortunately, FFmpeg, as always, comes to the rescue. It has myriad audio filters that can be combined in various ways. I only needed to add a highpass filter, which can be accomplished using this one-liner:

ffmpeg -i input.mp4 -c:v copy -af highpass=400 output.mp4

Here I use -c:v copy to copy the video stream directly, which avoids re-compressing the video and saves time. Then I use -af highpass=400 to apply a highpass filter to the audio stream with a cutoff frequency of 400 Hz. That is a relatively high cutoff, but it works well for this example.
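
A handy way to find a suitable cutoff is to preview the filter with ffplay, which accepts the same filter syntax (assuming ffplay is installed alongside ffmpeg):

# Listen to the filtered audio without writing a file; adjust the frequency and re-run.
ffplay -af "highpass=f=400" input.mp4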

The recording with highpass-filtered audio.

Adding a filter means that the audio stream needs to be re-compressed, so it breaks with the original (conceptual and technical) idea. However, the result sounds more like how I experienced it: I didn’t notice the hum while recording, and this project is focused on foreground sounds, not the background. This example is, however, relevant for my upcoming project, AMBIENT, in which I will focus on the background sound of various indoor environments.

Kayak motion analysis with video-based horizon leveling

Last year, I wrote about video-based motion analysis of kayaking. Those videos were recorded with a GoPro Hero 8, and I tested some of the video visualization methods of the Musical Gestures Toolbox for Python. This summer, I am testing some 360 cameras for my upcoming AMBIENT project, so I thought I should take one of them, a GoPro Max, out for some kayaking in the Oslo fjord. Here are some impressions of the trip (and the recording).

Horizon leveling

I stumbled upon the “horizon leveling” feature by accident when going through the settings on the GoPro Max. The point is that it stabilizes the recorded image so that the horizon always stays level. I haven’t found any technical details about the feature, but I assume it uses the built-in gyroscope for the leveling. As it turns out, this feature also appears to be included on the GoPro Hero 9 and 10.

This feature works amazingly well, as can be seen from an excerpt of my kayaking adventure below:

Video visualizations

The Musical Gestures Toolbox for Python is in active development, and I therefore thought it would be interesting to test some video visualization methods on the kayaking video. The whole recording is 1.5 hours long, which is a good starting point for exploring some of the visualization techniques in the toolbox. After all, one of the points of the toolbox was to develop solutions for visualizing long video recordings without scene changes.

The kayaking video is similar to my recordings of music and dance performances in that both are continuous, single-camera recordings. So I tested some of the basic visualization techniques.

A “keyframe” display based on sampling 9 images from the recording. These give “snapshots” of the scenery but don’t tell much about the motion.
A horizontal videogram of the whole recording (time running from left to right) shows more about what happened. Here you can really see that the horizon leveling worked flawlessly throughout. It is interesting to see the “ascending” lines at various intervals. These are due to the fact that I kayaked around an island, and kept turning right.
The vertical videogram shows the sideways motion. It is, perhaps, less informative than the horizontal videogram, but more beautiful. The yellow line in the middle is the kayak.
An average image of the whole recording blurs out all the details but leaves the essential information: the kayak, the fjord, and the horizon.

Audio analysis

Kayaking is a rhythmic activity, so I was also interested in whether I could find any patterns in the audio signal. For now, I have only calculated a tempogram, which estimates the tempo of my kayaking strokes at 114 BPM. I am not sure if that is good or bad (I am only a recreational kayaker), but I will try to make some new recordings and compare.

Tempogram of the audio from the kayaking video. It is made from a resampled video file, hence the short duration (the original video is 1.5 hours long).

Solid toolbox

After testing the Musical Gestures Toolbox for Python more actively over the last few weeks, I see that we have now managed to get it to a state where it really works well. Most of the core functions are stable, and they can be combined in various ways, just as a toolbox should allow. There are still some optimization issues to sort out, particularly when it comes to improving the creation of motion videos. But overall, it is a highly versatile package.

Running a disputation on YouTube

Last week, Ulf Holbrook defended his dissertation at RITMO. I was in charge of streaming the disputation, and here are some reflections on the technical setup and streaming.

Zoom Webinars vs YouTube Streaming

I have previously written about running a hybrid disputation using a Zoom webinar, and we have used variations of that setup for other events; for example, last year we ran RPPW as a hybrid conference. There are some benefits to using Zoom, particularly when there are many presenters. Regular Zoom meetings are best for small groups where everyone should be able to participate. For larger groups, and particularly for (semi-)public events, Zoom Webinars are the only viable solution. I have only experienced Zoom bombing once (when someone else organised a public event with more than 100 people present), and it was unpleasant. That is why we have run all our public events as Zoom Webinars, where we have fine-grained control over who is allowed to talk and share video and screens.

I find that a streaming solution (such as YouTube) is best for public events where there is little need for interaction. A public PhD defence is one such event: it is a one-way delivery to a largely passive audience, so streaming is perfectly fine. For Ulf’s defence, we therefore opted for YouTube as our streaming service. We could have used UiO’s streaming service, but then we would have missed out on the social features of YouTube as a channel, particularly the chat that people could use to ask questions. In the end, nobody asked any questions, but it still felt good to have a way to communicate with the audience.

Setup

As described previously, we have a relatively complex audiovisual installation in the hall. There are two pairs of PTZ cameras: one pair connected to the lecture desk’s PC in the front and another pair connected to a video mixer in the control room. It is possible to run two 2-camera “productions” at once, one from the back and one from the front. In the Zoom Webinars, we have used the two cameras connected to the front PC. However, we used the cameras connected to the streaming station in the back of the hall for the streaming setup.

The view from the control room in the back of Forsamlingssalen at RITMO. Two PTZ cameras can be seen on the shelf in the lower left corner of the picture.

During the defence, I sat in the control room, switching between cameras and checking that everything went as it should. As can be seen in the image below, the setup consists of a hardware video controller and video mixer, and the mixed signal is then passed on to a streaming PC.

A view from the control room, with the video mixer’s screen to the left, the streaming PC on the right, and my laptop for “monitoring” in the middle.

The nice thing about working with a hardware video mixer is that you can turn it on, and it works. Computers are powerful and versatile, but hardware solutions are more reliable. As seen in the image below, the mixing consisted of choosing between the two PTZ cameras, one that gave an overview shot and another a close-up. In addition, I mixed in the slides shown on the screen. There is a picture-in-picture mode on the video mixer, but one challenge is that you don’t know what is coming on the slides. So you may end up covering some of the slides, as shown in the image below.

The video mixer was used to switch between two camera views and the image from the laptop.

I still struggle to understand the logic of setting up a live stream on YouTube. The video grabber on the streaming PC (a Blackmagic Web Presenter HD) shows up as a “webcam” and can be used directly to stream to YouTube. I had done a test run some days before the defence, and everything worked well. However, when we were about to start the scheduled stream, I realised that the source had changed to “broadcast software” instead of “webcam”. I have no idea why that happened, but fortunately, I had OBS installed on the PC and could start the stream from there. The nice thing about OBS is that you also get access to a software-based audio mixer, which came in handy for tuning the sound slightly during the defence.

The streaming PC was running OBS, getting the signal from the Web Presenter HD.

Being able to monitor the output is key to any production. Unfortunately, for the first part of Ulf’s presentation, I was monitoring on the video mixer, where everything sounded fine. It was only after a while that I connected to the YouTube stream on my laptop and realised that there was a slight “echo” on the sound.

There was an echo on the sound in the introduction’s opening, caused by two audio streams with a short offset.

It turned out that this was because we were streaming sound twice through OBS: both through the incoming audio on the PC and through the audio channel embedded in the HDMI signal. I had briefly checked that things were fine when starting the YouTube stream, but the two audio streams appear to have drifted slightly out of sync, leading first to an echo-like sound and later to a noticeable delay. After discovering the problem, I quickly managed to turn off one of the audio streams.

The sound was better after turning off one of the audio streams.

For the rest of the disputation, I was careful to check the YouTube stream at regular intervals to make sure that everything worked well. But since the YouTube stream was delayed by more than ten seconds, I had to do the main monitoring on the video mixer to change camera positions in time.

I have found it important to monitor the final stream on YouTube, in case there are any problems.

Summing up

Apart from the initial audio problems, I think the streaming went well. Of course, we are not aiming at TV-production quality, but we do try to create productions that are pleasant to watch and listen to from a distance. For some of our previous events, we have had three or four people involved in the production. This time, there were only two of us: I ran the online production, and Eirik Slinning Karlsen from the RITMO administration was in control of the microphones in the hall. This setup works well and is more realistic for future events.