Rigorous Empirical Evaluation of Sound and Music Computing Research

At the NordicSMC conference last week, I was part of a panel discussing the topic Rigorous Empirical Evaluation of SMC Research. This was the original description of the session:

The goal of this session is to share, discuss, and appraise the topic of evaluation in the context of SMC research and development. Evaluation is a cornerstone of every scientific research domain, but is a complex subject in our context due to the interdisciplinary nature of SMC coupled with the subjectivity involved in assessing creative endeavours. As SMC research proliferates across the world, the relevance of robust, rigorous empirical evaluation is ever-increasing in the academic and industrial realms. The session will begin with presentations from representatives of NordicSMC member universities, followed by a more free-flowing discussion among these panel members, followed by audience involvement.

The discussion was moderated by Sofia Dahl (Aalborg University). The other panelists were Nashmin Yeganeh (University of Iceland), Razvan Paisa (Aalborg University), and Roberto Bresin (KTH).

The challenge of interdisciplinarity

Everyone in the panel agreed that rigorous evaluation is important. The challenge is to figure out what types of evaluation are useful and feasible within sound and music computing research. This was effectively illustrated by a list of the different methods employed by the researchers at KTH.

A list of methods in use by the sound and music computing researchers at KTH.

Roberto Bresin had divided the KTH list into methods that they have been working with for decades (in red) and newer methods that they are currently exploring. The challenge is that each of these methods requires different knowledge and skills, and they all have different types of evaluation.

Although we have a slightly different research profile at UiO than at KTH, we also have a breadth of methodological approaches in SMC-related research. I pointed to a model I often use to explain what we are doing:

A simplified model of explaining my research approach.

The model has two axes. One shows a continuum between artistic and scientific research methods and outputs. Another is a continuum between performing research on natural and cultural phenomena. In addition, we develop and use various types of technologies for all of these.

The reason I like to bring up this model is to explain that things are connected. I often hear that artistic and scientific research are completely different things. Sure, they are different, but there are also commonalities. Similarly, there is an often unnecessary divide between the humanities and the social and natural sciences. True, they have different foci but when studying music we need to take all of these into account. Music involves everything from “low-level” sensory phenomena to “high-level” emotional responses. One can focus on one or the other, but if we really want to understand musical experiences – or make new ones for that matter – we need to see the whole picture. Thus, evaluations of whatever we do also need to have a holistic approach.

Open Research as a Tool for Rigorous Evaluation

My entry into the panel discussion was that we should use the ongoing transition to Open Research practices as an opportunity to also perform more rigorous evaluations. I have previously argued why I believe open research is better research. The main argument is that sharing things (methods, code, data, publications, etc.) openly forces researchers to document everything better. Nobody wants to make sloppy things publicly available. So the process of making all the different parts of the open research puzzle openly available is a critical component of a rigorous evaluation.

In the process of making everything open, we realize, for example, that we need better tools and systems. We also experience that we need to think more carefully about privacy and copyright. That is also part of the evaluation process and lays the ground for other researchers to scrutinize what we are doing.

Summing up

One of the challenges of discussing rigorous evaluation in the “field” of sound and music computing is that we are not talking about one discipline with one method. Instead, we are talking about a set of approaches to developing and using computational methods for sound and music signals and experiences. If you need to read that sentence a couple of times, it is understandable. Yes, we are combining a lot of different things. And, yes, we are coming from different backgrounds: the arts and humanities, the social and natural sciences, and engineering. That is exactly what is cool about this community. But it is also why it is challenging to agree on what a rigorous evaluation should be!

Preparing video for Matlab analysis

Typical video files, such as MP4 files with H.264 compression, are usually small and have high visual quality. Such files are suitable for visual inspection but do not work well for video analysis. In most cases, computer vision software prefers to work with raw data or with compression formats that store each frame individually.

The Musical Gestures Toolbox for Matlab works best with these file types:

  • Video: use MJPEG (Motion JPEG) as the compression format. This compresses each frame individually. Use .AVI as the container, since this is the one that works best on all platforms.
  • Audio: use uncompressed audio (16-bit PCM), saved as .WAV files (.AIFF usually also works fine). If you need to use compression, MP3 compression (MPEG-1, Layer 3) is still more versatile than AAC (used in .MP4 files). If you use a bitrate of 192 kbps or higher, you should not get too many artefacts.

Many people ask me how to convert from typical MP4 files (with H.264 video compression and AAC audio compression). The easiest solution (I think) is to use FFmpeg, the versatile command-line utility. Here is a one-liner that will convert an .MP4 file into an .AVI file with MJPEG video and PCM audio:

ffmpeg -i input.mp4 -c:a pcm_s16le -c:v mjpeg -q:v 3 -huffman optimal output.avi

The resultant file should work well in Matlab and other video analysis tools. We have included this conversion by default in the new Musical Gestures Toolbox for Python. So there, you can directly load an MP4 file, which will be converted to an AVI file using a script similar to the one above.
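If you have a whole folder of recordings, the one-liner above can be wrapped in a few lines of Python. This is only a minimal sketch and not the code used in the toolbox: the function names (`mjpeg_avi_command`, `wav_command`, `convert_folder`) are made up for illustration, and it assumes that the `ffmpeg` executable is available on your PATH. The extra `-vn` flag in the audio-only variant tells FFmpeg to skip the video stream.

```python
import shutil
import subprocess
from pathlib import Path


def mjpeg_avi_command(src, quality=3):
    """Build the FFmpeg arguments for MP4 -> AVI (MJPEG video, 16-bit PCM audio)."""
    src = Path(src)
    dst = src.with_suffix(".avi")  # same name, new container
    return [
        "ffmpeg",
        "-i", str(src),
        "-c:a", "pcm_s16le",     # uncompressed 16-bit PCM audio
        "-c:v", "mjpeg",         # each frame compressed individually
        "-q:v", str(quality),    # lower value = higher quality
        "-huffman", "optimal",
        str(dst),
    ]


def wav_command(src):
    """Build the FFmpeg arguments for extracting only the audio as a WAV file."""
    src = Path(src)
    dst = src.with_suffix(".wav")
    return ["ffmpeg", "-i", str(src), "-vn", "-c:a", "pcm_s16le", str(dst)]


def convert_folder(folder):
    """Convert every .mp4 file in a folder (hypothetical helper, assumes ffmpeg on PATH)."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on the PATH")
    for src in sorted(Path(folder).glob("*.mp4")):
        subprocess.run(mjpeg_avi_command(src), check=True)
```

Calling `convert_folder("recordings")` would then write an .avi file next to each .mp4 file in that (hypothetical) folder. Building the command as a list, rather than as one long string, avoids problems with spaces in file names.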

Releasing the Musical Gestures Toolbox for Python

After several years in the making, we finally “released” the Musical Gestures Toolbox for Python at the NordicSMC Conference this week. The toolbox is a collection of modules targeted at researchers working with video recordings.

Below is a short video in which Bálint Laczkó and I briefly describe the toolbox:

About MGT for Python

The Musical Gestures Toolbox for Python includes video visualization techniques such as creating motion videos, motion history images, and motiongrams. These visualizations allow for studying video recordings from different temporal and spatial perspectives. The toolbox also includes basic computer vision methods, and it is designed to integrate well with audio analysis toolboxes.
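To give a rough idea of what a motiongram is, here is a minimal pure-Python sketch. It is not the toolbox's implementation, which works on real video files. The function names, the nested-list frame format, and the `threshold` parameter are assumptions made for illustration. The idea is to take the difference between consecutive grayscale frames (a "motion image"), average each row of that image into a single value, and stack the resulting columns over time.

```python
def motion_image(prev, curr, threshold=0):
    """Absolute per-pixel difference between two grayscale frames.

    Frames are lists of rows, each row a list of pixel values (0-255).
    Differences at or below the threshold are set to zero to suppress noise.
    """
    result = []
    for prev_row, curr_row in zip(prev, curr):
        row = []
        for p, c in zip(prev_row, curr_row):
            diff = abs(c - p)
            row.append(diff if diff > threshold else 0)
        result.append(row)
    return result


def motiongram(frames, threshold=0):
    """Collapse each motion image to one column and stack the columns over time.

    Returns a list of rows: one row per image row, one column per frame pair.
    """
    columns = []
    for prev, curr in zip(frames, frames[1:]):
        motion = motion_image(prev, curr, threshold)
        # Average every row of the motion image into a single value
        columns.append([sum(row) / len(row) for row in motion])
    # Transpose so that time runs along the horizontal axis
    return [list(row) for row in zip(*columns)]


# Three tiny 3x3 frames with a bright pixel moving downwards
frames = [
    [[0, 255, 0], [0, 0, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 255, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 0, 0], [0, 255, 0]],
]
print(motiongram(frames))
```

In the printed result, the non-zero values move from the top rows to the bottom rows as time progresses, which is the trace of the pixel moving downwards. With real video you would use array libraries instead of nested lists, but the principle is the same.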

It is possible to run the toolbox from the terminal:

Example of running MGT for Python in a terminal.

Many people would probably prefer to run it in a Jupyter notebook:

Screenshots from the example Jupyter Notebook.

The MGT was initially developed to analyze music-related body motion (of musicians, dancers, and perceivers) but is equally helpful for other disciplines working with video recordings of humans, such as linguistics, pedagogy, psychology, and medicine.


This toolbox builds on the Musical Gestures Toolbox for Matlab, which in turn builds on the Musical Gestures Toolbox for Max. The latest version was primarily developed by Bálint Laczkó, Frida Furmyr, and Marcus Widmer.

Read more

To learn more about Musical Gestures Toolbox for Python, take a look at our paper presented at NordicSMC:

Experiences with running a hub-based conference

For the last couple of days, I have participated in the NordicSMC conference. It was organized by a team of Ph.D. fellows from Aalborg University Copenhagen, supported by the Nordic Sound and Music Computing network. UiO is happy to be a partner in this network, together with colleagues in Copenhagen (AAU), Stockholm (KTH), Helsinki (Aalto), and Reykjavik (UoI).

Choosing a conference format

When we began discussing the conference earlier this year, it quickly became apparent that it was unrealistic to meet in person. Restrictions have been lifted in most Nordic countries, but the pandemic is still ongoing. One option could have been to run it as an online-only conference. But since we are an adventurous group of researchers, we wanted to explore running it as a hub-based hybrid conference. Hybrid means that it was run both in-person and online. Hub-based means that there were multiple physical locations. This makes sense, given that there are five partners in the network.

Over the last few years, we have gained much experience running hybrid events, such as a hybrid disputation and a hybrid conference. Also, most RITMO events have been hybrid in this period. We also have much experience with hub-based teaching in the MCT master’s programme, with daily teaching between Oslo and Trondheim for three years. All of these experiences have made us aware of the technical details that need to be in place to master the different “formats” successfully, as well as the social aspects of handling various constellations of people in local and remote locations.

NordicSMC2021 was our first attempt at running a multi-hub conference. The idea is old, and there have been several attempts at running hub-based conferences over the last few years. For example, Richard Parncutt has made interesting reflections on running ESCOM/ICMPC conferences in hubs on different continents. In that case, time zone issues may be the biggest challenge for a successful event.

In comparison, NordicSMC is a small conference without timezone problems. That also makes it easier to experiment with the setup. Here are some thoughts on what we did and how it worked.

Benefits of hub-based conferences

The easiest would, of course, have been to run an online-only conference using the Zoom Webinar functionality. Most people know this format well, and it is technically easy to set up and run smoothly. I say easy, but you still need to pay attention to many details to run such events well.

However, many of us are tired of sitting alone in our offices, particularly when we can meet on campus. The nice thing about a hub-based conference is that people can meet in various interconnected hubs. This means that you get some of the social aspects of being together in your local hub while at the same time taking part in something larger.

Setting up for a hub-based conference is not particularly difficult. Most places have decent audiovisual equipment these days. Still, there are many ways of making things work well.


Camera choice and placement are essential elements of such a setup. We used a seminar room with a Crestron UC-SB1-CAM system containing a Huddly camera placed below the TV. That is a very wide-angle camera, which has the benefit of capturing the whole space. The downside is that people are tiny in the image. I tested setting up a Logitech web camera on top of the screen instead. I think the image quality was better there, but the bird’s eye view didn’t work too well. So we decided to use the Huddly camera.

One nice option in the Huddly camera is that it can automatically “zoom” and “pan” in on people in the room. This is a software feature based on some computer vision algorithm running in the background. That only partly worked, though. So we ended up having to move the image around manually. Not ideal; we should try to figure out why the auto-tracking didn’t work correctly.

Another option could have been to use one wide-angle camera (like the Huddly) to capture the whole room and a second camera for close-ups of people talking. We have taken this approach for large-scale hybrid events, but it requires more equipment and a small production team. We wanted to explore what can quickly be done in a typical seminar room with video conferencing equipment.

The video setup also requires attention to the physical organization in the room. From the MCT programme, we have experienced that sitting in a V-shape works well from a communicative point of view. Such a setup allows the local participants to see each other while at the same time seeing the screen. The aim is to create a sense of being together both locally and remotely at the same time.


Being a conference on sound and music, we care about the sound quality. We know from other activities that the Crestron sound system below the TV is ok for short meetings. However, that system sounds like, eh, a video conferencing sound system. Given the size of the speaker elements, it sounds “thin” and does not project music very well. I struggle sitting through hours of meetings with the semi-poor audio quality found in regular video conferencing systems. Therefore, we decided to use the B&W speaker setup in the room instead. Of course, a good sound system doesn’t save people’s bad microphone quality. But it makes it a joy to listen to those that care about their sound, such as the excellent keynote lecture by Ludvig Elblaus. He also used the stereo sound functionality in Zoom (usually the sound is mono only) for his sound examples, which came through beautifully.

We chose to play sound over the B&W speaker system in the room instead of the Crestron panel. A Catchbox provided good sound from our side.

One of the benefits of using a combined video conferencing solution is that it typically has sound feedback cancellation built-in. This saves a lot of trouble when it comes to handling unwanted feedback issues. So when we decided to move to the Hi-Fi sound system in the room, we had to figure out another microphone solution.

We have for some time used a Catchbox microphone in meetings at RITMO. It is a wireless microphone embedded in a soft box that can be thrown around. And the best thing is that it has some very smart anti-feedback and anti-noise circuitry. While testing the setup, we found that the microphone array located in the Crestron panel could pick up talking in the room. However, when you sit some meters from a microphone, it will sound relatively muffled. Having the Catchbox close by results in much better sound quality.

Another nice thing about the Catchbox is that it helps clarify who is talking. Others could see the box, so remote participants could understand who had the microphone even when we didn’t zoom in on people. Having to wait for the microphone is also disciplining. The downside to using such a semi-directional microphone is that it does not capture the sonic ambiance of the room. That is something to explore more later.

The need for two screens

We started with only having one connected TV screen in the room. That worked well for presentations, but we quickly realized it was challenging to keep track of the chat and Q&A windows simultaneously. This is also problematic when sitting on your own computer, and I don’t understand why Zoom didn’t solve these multi-sub-window problems a long time ago.

The image quickly became cluttered with multiple windows.

Since we had a second TV on another wall in the room, we connected that one to move the chat and Q&A windows there. Not ideal to have them on separate walls, though. This made me realize that “old-style” video conferencing setups with two TVs may be a better solution for such events.

As for running presentations, our default option nowadays is always to connect and run these from a separate laptop. It is much easier to have a dedicated Zoom machine to handle the communication part. It is less risky from a technical point of view and makes it easier to arrange images in the way you want.

We ended up connecting two TVs to get enough screen real estate in the room.

In-person or online chairing

During the conference, we explored different solutions for chairing the sessions. Most chairs were doing it online, which allowed for more easily using their local multi-view setups. However, my colleague Stefano Fasciani decided to chair his session from our hub.

Stefano Fasciani chairing a session from the Oslo hub.

This worked well, I think. The Catchbox provided good sound, and the zooming from the Huddly camera made it possible to see him in the image when speaking. This was before we connected the second TV, so he monitored the chat and Q&A through a Zoom session running on his own laptop.

Also, from a conceptual point of view, I think it was nice to have the session chair in the room. It felt more like a “normal” conference. For that reason, I also decided to be present in the hub for the final panel discussion.

Handling interactions

As always, juggling multiple platforms is a challenge for such events. There was audiovisual communication happening in Zoom, together with the chat and Q&A windows. Then we used Discord as a social channel in between. I didn’t want to keep a separate Zoom window running on my laptop for the entire two days, which meant that I couldn’t easily follow or contribute in the chat and Q&A windows. For such events, I think there should be an option to turn off incoming video in Zoom. It feels like a waste of bandwidth that everyone in a room should run separate Zoom instances to communicate in text.

Many conferences use Slack for in-between discussions. This time we used Discord, which offers similar functionality. Still, it is a challenge to figure out where and how to interact. In online-only events, written communication in various channels has worked well. But when we were trying to create a hub-based setup, it was more challenging. When you sit next to people, you naturally talk to them instead of sending a message.

When it came to the Q&A sessions, I realized that I ended up asking questions orally rather than in writing. This was because I didn’t have Zoom running on my laptop, and I had access to a microphone. Of course, the ability to ask questions live was limited to those of us in the hubs. The online participants had to interact through the written channels.

For such a small conference, we could have used a regular Zoom meeting instead of a Webinar. That would have allowed everyone to show their face and talk. However, there are risks involved in having relatively large Zoom meetings, so it is much safer on the organizing side to run a Webinar. I generally think that running a Webinar with always-on hub rooms worked well. That gave the presenters the feel of someone being present. And the local hub hosts were in control of the technical and social communication.

It would be interesting to hear how such an asymmetric communication form was experienced by those attending remotely. They, obviously, got a very different experience than those of us present in the hubs. I would imagine that they didn’t feel as connected as those in the hubs, but this may also be what they preferred. One could say that the ability to interact more directly could be a motivating factor to join a hub (for those who can and want to).

Hub-based conferences are the future

All in all, I think this year’s NordicSMC conference was a great success. Many engaging presentations showed the breadth of activities in sound and music computing in our region.

Technically speaking, I also think things went well. The AAU hosts ran things smoothly. The conference was built on what has become the “normal” setup: a Zoom webinar with pre-recorded presentations, panel-based Q&A, and written communication through both Zoom and Discord.

What was new was that we tried out a hub-based approach. We ended up only having hubs in Oslo, Stockholm, and Copenhagen, and we all set them up slightly differently. Still, we managed to create a sense of being “together”. We weren’t that many people in the Oslo hub, but people came in and out during the conference. As such, it felt like being at a regular conference.

Aleksander Tidemann got into a lively discussion after his presentation.

The experience was different than if I had been sitting alone in my office. I noticed that I followed the presentations more carefully than I would have otherwise done. And it was great chatting with colleagues and students during the breaks.

There are lots of small details that can be improved, both technically and conceptually. On the technical side, I think paying attention to camera and microphone placement is critical. Often the equipment is ok, but it is set up and used sub-optimally. One challenge is that you are often not in complete control of the systems in university seminar rooms. For example, we did not have admin privileges on the Zoom computer, making it difficult to reach some settings.

I also think it is essential to pay attention to social interaction. One often says that running a hybrid event is more complex than running an in-person or online-only one. That is because you need to think about two different social groups at once. In that respect, running a hub-based and hybrid conference is triply difficult. You need to cater to the well-being of the people in all the hubs and online at the same time. The best solution here is to assign “hub hosts” responsible for the social interaction in their own hubs. It is also vital that these hub hosts interact with each other. If done well, I think this can make such hub-based events successful. They would not be the same as in-person events, but they can capture the feeling of being together.

Rotate video using FFmpeg

Here is another FFmpeg-related blog post, this time to explain how to rotate a video using the command-line tool FFmpeg. There are two ways of doing this, and I will explain both in the following.

Rotation in metadata

The best first try could be to make the rotation by only modifying the metadata in the file. This does not work for all file types, but should work for some, including .mp4 files.

ffmpeg -i input.mp4 -metadata:s:v rotate="-90" -codec copy output.mp4

The nice thing here is that it is superfast and also non-destructive.

Rotation with compression

If the above does not work, you will need to recompress the file. That is not ideal, but will do the trick. To rotate a movie by 90 degrees, you can do:

ffmpeg -i input.mp4 -vf "transpose=1" output.mp4

The trick here is to pass the right value to the transpose filter:

  • 0 – Rotate by 90 degrees counter-clockwise and flip vertically. This is the default.
  • 1 – Rotate by 90 degrees clockwise.
  • 2 – Rotate by 90 degrees counter-clockwise.
  • 3 – Rotate by 90 degrees clockwise and flip vertically.

I often record with cameras hanging upside down. Then I want to rotate 180 degrees, which can be done like this:

ffmpeg -i input.mp4 -vf "transpose=2,transpose=2" output.mp4

Even though the video is re-compressed using this method, you can force the audio to be copied without re-encoding by adding the -c:a copy option:

ffmpeg -i input.mp4 -vf "transpose=1" -c:a copy output.mp4
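If you rotate files often, the transpose values and the audio-copy option can be combined in a small helper. The following is a minimal Python sketch under some assumptions: the lookup table and the function name `rotate_command` are illustrative and not part of FFmpeg, and the function only builds the argument list (you would pass it to `subprocess.run` to actually run FFmpeg).

```python
# Filter chains for clockwise rotation, using the transpose values listed above
ROTATION_FILTERS = {
    90: "transpose=1",               # 90 degrees clockwise
    180: "transpose=2,transpose=2",  # two counter-clockwise quarter turns
    270: "transpose=2",              # 90 degrees counter-clockwise
}


def rotate_command(src, dst, degrees_clockwise):
    """Build FFmpeg arguments that rotate the video and copy the audio as-is."""
    angle = degrees_clockwise % 360  # e.g. -90 becomes 270
    if angle not in ROTATION_FILTERS:
        raise ValueError("rotation must be a non-zero multiple of 90 degrees")
    return [
        "ffmpeg",
        "-i", src,
        "-vf", ROTATION_FILTERS[angle],
        "-c:a", "copy",  # the video is re-compressed, the audio is not
        dst,
    ]


print(" ".join(rotate_command("input.mp4", "output.mp4", 180)))
```

Negative angles are interpreted as counter-clockwise rotation, so `rotate_command("input.mp4", "output.mp4", -90)` uses the `transpose=2` filter.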

Of course, these commands can also be included in a chain of other FFmpeg-commands.