The status of FAIR in higher education

I participated in the closing event of the FAIRsFAIR project last week. For that, I was asked to share thoughts on the status of FAIR in higher education. This is a summary of the notes that I wrote for the event.

What is FAIR?

First of all, the FAIR principles state that data should be:

  • Findable: The first step in (re)using data is to find them. Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for the automatic discovery of datasets and services, which makes them a key component of the FAIRification process.
  • Accessible: Once users find the required data, they need to know how the data can be accessed, possibly including authentication and authorisation.
  • Interoperable: The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing.
  • Reusable: The ultimate goal of FAIR is to optimise the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings.

This all sounds good, but the reality is that FAIRifying data is a non-trivial task.
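To make the point about machine-readable metadata concrete, here is a minimal sketch of what a dataset description could look like, using the schema.org Dataset vocabulary. All names and the DOI below are hypothetical placeholders, not an actual dataset of ours:

    import json

    # Minimal machine-readable dataset description using the schema.org
    # "Dataset" vocabulary. All values below are hypothetical placeholders.
    metadata = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": "Example motion capture recordings",
        "description": "Placeholder description of a hypothetical dataset.",
        "identifier": "https://doi.org/10.5281/zenodo.0000000",  # placeholder DOI
        "creator": {"@type": "Person", "name": "Jane Researcher"},
        "license": "https://creativecommons.org/licenses/by/4.0/",
        "keywords": ["motion capture", "music research"],
    }

    # Save as a JSON-LD file that both humans and harvesters can read.
    with open("dataset.jsonld", "w") as f:
        json.dump(metadata, f, indent=2)

A record like this can be published even when the underlying data files stay closed, which is exactly the point of the next section.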

FAIR is not (necessarily) Open

Readers of this blog know that I am an open research advocate, and that also means that I embrace the FAIR principles. It is impossible to make data genuinely open if they are not FAIR. However, FAIR data is not the same as Open Data. It is perfectly possible to “FAIRify” closed data: that would entail making metadata about the data openly available according to the FAIR principles while keeping the actual data closed.

There are many cases where it is impossible to make data openly available. In the fourMs Lab, we often have to keep the data closed. There are usually two reasons for this:

  • Privacy: we perform research on and with humans and, therefore, almost always record audio and video files from which it is possible to identify the people involved. Some participants in our studies consent to us sharing their data, but others do not. So we try to anonymize our material, for example by blurring faces (see the sketch after this list), creating motion videos, or keeping only motion capture data, but sometimes that is not possible. Anonymized data make sense in some cases, such as when we capture hundreds of people for the Championships of Standstill. But expert musicians are not so easy to anonymize. The sound of an expert hardanger fiddler is enough for anyone in the community to recognize the person in question, and their facial expressions may be an important part of understanding the effects of a performance. If we cannot share their face and their sound, the data is not useful.
  • Copyright: even more challenging than privacy are the questions related to copyright. When working with real music (as opposed to the “synthetic” music used in many music psychology studies), we need to consider the rights of composers, musicians, producers, and so on. This is a tricky matter. On the one hand, we would like to work with new music by living musicians. On the other hand, the legal territory is challenging, with many actors, national legislations, copyright unions, and so on. Unfortunately, we do not have the legal competency and administrative capacity to handle all the practicalities of ensuring that we are allowed to share much of the music we use openly.
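As an illustration of the anonymization point above, here is a minimal sketch of how faces can be blurred in video stills with OpenCV. The file names are hypothetical, and a real pipeline would need manual checking, since automatic detection misses faces:

    import cv2

    # Load OpenCV's bundled frontal face detector.
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )

    # Hypothetical file name for a video still from a study.
    image = cv2.imread("participant_a001.jpg")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Replace every detected face region with a heavy Gaussian blur.
    for (x, y, w, h) in detector.detectMultiScale(gray, 1.1, 5):
        face = image[y:y + h, x:x + w]
        image[y:y + h, x:x + w] = cv2.GaussianBlur(face, (51, 51), 0)

    cv2.imwrite("participant_a001_blurred.jpg", image)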

These challenges make sharing all our data openly difficult, but we can still make them FAIR.

How far have we come?

Three things need to be in place to FAIRify data properly:

  1. Good and structured data. This is the main responsibility of the researcher. However, it is much easier said than done. Take the example of MusicLab Copenhagen, an event we ran in October. It was a huge undertaking with lots of data and media being collected and recorded by around 20 people. We are still working on organizing the data in meaningful ways. The plan is to release as much as possible as fast as possible, but it takes an astonishing amount of time to pre-process the data and structure it in a meaningful way. After all, if the data does not make sense to the person who collected it, nobody else will be able to use it either.
  2. Data repository to store the data. Once the data is structured, it needs to be stored somewhere that provides the necessary tools and, in particular, persistent identifiers (such as DOIs). We don’t have our own data repository at UiO, so here we need to rely on other solutions. There are two main types of repositories: (a) “bucket-based” repositories that researchers can use themselves, and (b) data archives run by institutions with data curators. That brings me to the third point:
  3. Data wranglers and curators. With some training and the right infrastructures, researchers may take on this role themselves. Tools such as Zenodo, Figshare, and OSF allow researchers to drop in their files and get DOIs, version control, and timestamping (see the sketch after this list). However, in my experience, even when these technical preservation parts are in place, the data may not be (a) good and structured enough in itself, and/or (b) accompanied by sufficient metadata and “padding” to be meaningful to others. That is why institutional archives employ professional data wranglers and curators to help with these tasks.
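To illustrate how low the technical barrier has become, here is a rough sketch of depositing a file on Zenodo through its REST API. The token and file name are placeholders, and the exact request and response fields should be checked against the current API documentation:

    import requests

    TOKEN = "..."  # personal access token (placeholder)
    API = "https://zenodo.org/api/deposit/depositions"

    # Create an empty deposit and retrieve its file "bucket" URL.
    deposit = requests.post(API, params={"access_token": TOKEN}, json={}).json()
    bucket = deposit["links"]["bucket"]

    # Upload a (hypothetical) data file to the bucket.
    with open("standstill_mocap.csv", "rb") as f:
        requests.put(f"{bucket}/standstill_mocap.csv",
                     data=f, params={"access_token": TOKEN})

Getting the file uploaded is the easy part; describing it well enough for others to make sense of it is where the curators come in.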

More people and institutions have started to realize that data handling is a skill and profession of its own. Many universities have begun to employ data curators in their libraries to help with the FAIRification process. However, the challenge is that they are too few and too far removed from where the research is happening. In my experience, much trouble at the later stages of the data archiving “ladder” can be avoided if the data is handled well from the start.

At RITMO, we have two lab engineers who double as data managers. When we hired them back in 2018, they were among the first local data managers at UiO, and that has proven to be essential for moving things forward. As lab engineers, they are involved in data collection from the start, and they can therefore help with data curation long before we get to the archival stage. They also help train new researchers in thinking about data management and follow data from beginning to end.

Incentives and rewards

There are many challenges to solve before we have universal FAIRification in place. Fortunately, many things are moving in the right direction. Policies and recommendations are being made at international, national, and institutional levels, infrastructures are being established, and personnel are being trained.

The biggest challenge now, I think, is to get incentives and rewards in place. Sharing data openly, or at least making the data FAIR, is still seen as costly, cumbersome, and/or unnecessary. That is because of the lack of incentives and rewards for doing so. We are still at a stage where publications “count” the most in the system. Committees primarily look at publication lists and h-indexes when making decisions about hiring, promotion, or funding allocations.

Fortunately, there is a lot of focus on research assessment these days. I have been involved in developing the Norwegian Career Assessment Matrix (NOR-CAM) as a model for broadening the focus. I am also pleased to see that research evaluation and assessment are at the forefront of the Paris Open Science European Conference (OSEC) starting today. When researchers get proper recognition for FAIRifying their data, we will see a radical change.

New article: Best versus Good Enough Practices for Open Music Research

After a fairly long publication process, I am happy to finally announce a new paper: Best versus Good Enough Practices for Open Music Research in Empirical Musicology Review.

Summary

The abstract reads:

Music researchers work with increasingly large and complex data sets. There are few established data handling practices in the field and several conceptual, technological, and practical challenges. Furthermore, many music researchers are not equipped for (or interested in) the craft of data storage, curation, and archiving. This paper discusses some of the particular challenges that empirical music researchers face when working towards Open Research practices: handling (1) (multi)media files, (2) privacy, and (3) copyright issues. These are exemplified through MusicLab, an event series focused on fostering openness in music research. It is argued that the “best practice” suggested by the FAIR principles is too demanding in many cases, but “good enough practice” may be within reach for many. A four-layer data handling “recipe” is suggested as concrete advice for achieving “good enough practice” in empirical music research.

The article is based on challenges we have faced in adhering to Open Research principles within music research. I mention our experiences with MusicLab in particular.

Perhaps the most important take-home message from the article is the set of recommendations at the end.

1. DATA COLLECTION (“RAW”)

1a. Create analysis-friendly data. Planning what to record will save time afterward, and will probably lead to better results in the long run. Write a data management plan (DMP).

1b. Plan for mistakes. Something will always go wrong. Ensure redundancy in critical parts of the data collection chain.

1c. Save the raw data. In most cases, the raw data will be processed in different ways, and it may be necessary to go back to the start.

1d. Agree on a naming convention before recording. Cleaning up the names of files and folders after recording can be tedious. Get it right from the start instead. Use unique identifiers for all equipment (camera1, etc.), procedures (pre-questionnaire1, etc.) and participants (a001, etc.).
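As a small illustration, here is one possible way to encode and check such a naming convention in a script. The convention itself is hypothetical; the point is that names should be generated and validated programmatically rather than typed by hand:

    import re

    # Hypothetical convention: participant_equipment_procedure_take.extension
    PATTERN = re.compile(r"^a\d{3}_camera\d+_task\d+_take\d{2}\.\w+$")

    def make_name(participant, camera, task, take, ext):
        """Build a file name that follows the agreed convention."""
        return f"a{participant:03d}_camera{camera}_task{task}_take{take:02d}.{ext}"

    name = make_name(1, 2, 1, 3, "mp4")  # 'a001_camera2_task1_take03.mp4'
    assert PATTERN.match(name), "file name violates the naming convention"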

1e. Make backups of everything as quickly as possible. Losing data is never fun, and particularly not the raw data.

2. DATA PRE-PROCESSING (“PROCESSED”)

2a. Separate raw from processed data. Nothing is as problematic as overwriting the original data in the pre-processing phase. Make the raw data folder read-only once it is organized.
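A minimal sketch of how this could be done, assuming a hypothetical data/raw folder:

    import os
    import stat

    # Recursively remove write permissions from the (hypothetical) raw folder.
    for root, dirs, files in os.walk("data/raw"):
        for name in files:
            path = os.path.join(root, name)
            mode = os.stat(path).st_mode
            os.chmod(path, mode & ~stat.S_IWUSR & ~stat.S_IWGRP & ~stat.S_IWOTH)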

2b. Use open and interoperable file formats. Often the raw data will be based on closed or proprietary formats. The data should be converted to interoperable file formats as early as possible.
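As an illustration, here is a sketch of batch-converting video files with ffmpeg. The folder names, source format, and codec choices (lossless FFV1 and FLAC in an open MKV container) are examples, not a definitive recommendation:

    import subprocess
    from pathlib import Path

    # Convert proprietary camera files (here: MXF) to an open container
    # with lossless open codecs. Paths and formats are illustrative.
    out_dir = Path("data/processed/video")
    out_dir.mkdir(parents=True, exist_ok=True)

    for src in Path("data/raw/video").glob("*.mxf"):
        dst = out_dir / (src.stem + ".mkv")
        subprocess.run(
            ["ffmpeg", "-i", str(src), "-c:v", "ffv1", "-c:a", "flac", str(dst)],
            check=True,
        )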

2c. Give everything meaningful names. Nothing is as cryptic as 8-character abbreviations that nobody will understand. Document your naming convention.

3. DATA STORAGE (“COOKED”)

3a. Organize files into folders. Creating a nested and hierarchical folder structure with meaningful names is a basic, but system-independent and future-proof solution. Even though search engines and machine learning improve, it helps to have a structured organizational approach in the first place.
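One possible layout, mirroring the stages of this recipe (the folder names are hypothetical):

    from pathlib import Path

    # One possible project layout, mirroring the stages of this recipe.
    # The folder names are hypothetical.
    layout = [
        "project/data/raw",        # untouched recordings, made read-only
        "project/data/processed",  # converted and cleaned files
        "project/data/cooked",     # analysis-ready data
        "project/docs",            # DMP, naming convention, processing notes
    ]
    for folder in layout:
        Path(folder).mkdir(parents=True, exist_ok=True)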

3b. Make incremental changes. It may be tempting to save only the last processed version of your data, but then it may be impossible to go back to make corrections or verify the process.

3c. Record all the steps used to process data. This can be in a text file describing the steps taken. If working with GUI-based software, be careful to note down details about the software version, and possibly include screenshots of settings. If working with scripts, document the scripts carefully, so that others can understand them several years from now. If using a code repository (recommended), store current snapshots of the scripts with the data. This makes it possible to validate the analysis.
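A minimal sketch of such a provenance record, written as a small JSON file next to the processed data (the package list and processing steps are examples only):

    import json
    import platform
    import sys
    from datetime import datetime, timezone
    from importlib import metadata

    # Write a small provenance record next to the processed data.
    # The package list and processing steps are examples only.
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python": sys.version,
        "platform": platform.platform(),
        "packages": {p: metadata.version(p) for p in ["numpy", "pandas"]},
        "steps": ["trimmed silence", "resampled audio to 48 kHz"],
    }
    with open("PROCESSING.json", "w") as f:
        json.dump(record, f, indent=2)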

4. DATA ARCHIVE (“PRESERVED”)

4a. Always submit data with manuscripts. Publications based on data should be considered incomplete if the data is not accessible in such a way that it is possible to evaluate the analysis and claims in the paper.

4b. Submit the data to a repository. To ensure the long-term preservation of your data, also independently of publications, it should be uploaded to a reputable DOI-issuing repository so that others can access and cite it.

4c. Let people know about the data. Data collection is time-consuming, and in general, most data is under-analyzed. Most datasets deserve to be analyzed more than once.

4d. Put a license on the data. This should ideally be an open and permissive license (such as those suggested by Creative Commons). However, even when using a closed license, it is important to label the data clearly so that others can understand how they may be used.

MusicLab Copenhagen

After nearly three years of planning, we can finally welcome people to MusicLab Copenhagen. This is a unique “science concert” involving the Danish String Quartet, one of the world’s leading classical ensembles. Tonight, they will perform pieces by Bach, Beethoven, and Schnittke, as well as folk music, in a normal concert setting at Musikhuset in Copenhagen. However, the concert is anything but normal.

Live music research

During the concert, about twenty researchers from RITMO and partner institutions will conduct investigations and experiments informed by phenomenology, music psychology, complex systems analysis, and music technology. The aim is to answer some big research questions, like:

  • What is musical complexity?
  • What is the relation between musical absorption and empathy?
  • Is there such a thing as a shared zone of absorption, and is it measurable?
  • How can musical texture be rendered visually?

The concert will be live-streamed (on YouTube and Facebook) and it will also be aired on Danish radio. There will also be a short film documenting the whole process.

Researchers and staff from RITMO (and friends) in front of the concert venue.

Real-world Open Research

This concert will be the biggest and most complex MusicLab event to date. Still, all the normal “ingredients” of a MusicLab will be in place. The core is a spectacular performance. We will capture a lot of data using state-of-the-art technologies, but in a way that is as unobtrusive as possible for performers and the audience. After the concert, both performers and researchers will talk about the experience.

Of course, being a flagship Open Research project, all the collected data will be shared openly. The researchers will show glimpses of data processing procedures as part of the “data jockeying” at the end of the event. However, the real data analysis can only start once everything has been properly uploaded and pre-processed. All the involved researchers will dig into their respective data. But since everything is openly available, anyone can go in and work on the data as they wish.

Proper preparation

Due to the corona situation, the event has been postponed several times. That has been unfortunate and stressful for everyone involved. On the positive side, it has also meant that we have been able to rehearse and prepare very well. A year ago, we ran a full rehearsal of the technical setup of the concert. We even live-streamed the whole preparation event, in the spirit of “slow TV”:

I am quite confident that things will run smoothly during the concert. Of course, there are always obstacles. For example, one of our eye-trackers broke in one of the last tests. And it is always exciting to wait for Apple and Google to approve updates of our MusicLab app in their respective app stores.

Want to see how it went? Have a look here.

From Open Research to Science 2.0

Earlier today, I presented at the national open research conference Hvordan endres forskningshverdagen når åpen forskning blir den nye normalen? (“How does everyday research change when open research becomes the new normal?”). The conference is organized by the Norwegian Forum for Open Research and coordinated by Universities Norway. It has been great to follow the various discussions at the conference. One observation is that very few question the transition to Open Research. We have, finally, come to a point where openness is the new normal. Instead, the discussions have focused on how we can move forward. Having many active researchers on the panels also led to a focus on solutions instead of policy.

Openness leads to better research

In my presentation, I began by explaining why I believe opening the research process leads to better research:

  • Opening the process makes researchers document everything more carefully. For example, nobody wants to make messy data or code available. Adding metadata and descriptions also helps improve the quality of what is made available and helps remove irrelevant content.
  • Making the different parts openly available is important for ensuring transparency in the research process. This allows reviewers (and others) to check claims in published papers. It also allows for others to replicate results or use data and methods in other research.
  • This openness and accessibility will ultimately lead to better quality control. Some people complain that we make lots of irrelevant information available. True, not everything that is made available will be checked or used. The same is the case for most other things on the web. That does not mean that nobody will ever be interested. We also need to remember that research is a slow activity. It may take years for research results to be used.

Of course, we face many challenges when trying to work openly. As I have described previously, we particularly struggle with privacy and copyright issues. We also don’t have the technical solutions we need. That led me to my main point in the talk.

Connecting the blocks

The main argument in my presentation was that we need to think about connecting the various blocks in the Open Research puzzle. There has, over the last few years, been a lot of focus on individual blocks. First, making publications openly available (Open Access). Nowadays, there is a lot of discussion about Open Data and how to make data FAIR (Findable, Accessible, Interoperable, Reusable). There is also some development in the other building blocks. What is lacking today is a focus on how the different blocks are connected.

There is now a need to connect the different blocks. Dark blue blocks are part of the research process, while the light blue blocks focus on applications and assessment.

By developing individual blocks without thinking sufficiently about their interconnectedness, I fear that we lose out on some of the main points of opening everything. Moving towards Open Research is not only about making things open; it is about rethinking the way we research. That is the idea of the concept of Science 2.0 (or Research 2.0, as I would prefer to call it).

There is much to do before we can properly connect the blocks. But some elements are essential:

  • Persistent identifiers (PIDs): Having unique and permanent digital references makes it possible to find and reuse digital material. This could be DOIs for data, ORCID iDs for researchers, and so on (see the sketch after this list).
  • Timestamping: Many researchers are concerned about who did something first. For example, many people wait to release their data because they want to publish an article first. That is because the data (currently) does not have any “value” in itself. In my thinking, if data had PIDs and timestamps, they would also be citable. This should be combined with proper recognition of such contributions.
  • Version control: It has been common to archive research results only when the research is done, a practice based on pre-digital workflows. Today, it is much better to provide solutions for proper version control of everything we do.
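As a small illustration of what PIDs already make possible, here is a sketch of resolving a DOI to machine-readable citation metadata using standard DOI content negotiation. The DOI below is a placeholder and should be replaced with a real one:

    import requests

    # Resolve a DOI to machine-readable citation metadata via standard
    # DOI content negotiation. The DOI below is a placeholder.
    doi = "10.5281/zenodo.0000000"
    resp = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )
    citation = resp.json()  # title, authors, issue date, and so on
    print(citation.get("title"))

Because the same identifier works for humans and machines, citing a dataset becomes as easy as citing an article, which is precisely the connection between blocks that I am arguing for.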

Fortunately, things are moving in the right direction. It is great to see more researchers trying to work openly. That also exposes the current “holes” in infrastructures and policies.

More research should be solid instead of novel

Novelty is often highlighted as the most important criterion for getting research funding. That a manuscript is novel is also a major concern for many conference/journal reviewers. While novelty may be good in some contexts, I find it more important that research is solid.

I started thinking about novelty versus solidity when I read through the (excellent) blog posts about the ISMIR 2021 Reviewing Experience. These blog posts deal with many topics, but the question about novelty caught my attention. Even though the numbers are small, it turned out that the majority of the survey respondents listed novelty as the most important selection criterion for the conference. This is not unique to ISMIR; I think many journals and conferences ask about novelty.

Defining novelty

Given that novelty is a criterion “everyone” considers all the time, surprisingly few people discuss what it actually means. What does it mean for something to be novel? Merriam-Webster suggests that it is “something new or unusual.” But what should be new or unusual? The questions? The answers? The methods?

Research is about contributing new knowledge to humankind. After all, there is not really any point in reinventing the wheel. Still, most research is incremental. We all stand on the shoulders of giants. New research questions spring out of the “future work” sections of our colleagues’ articles. Our methods are based on the refinement of disciplinary developments. Even so-called “groundbreaking” projects are incremental in nature if you scrutinize the details. Still, we have an idea that “something unheard of before” is ideal.

Research needs to be solid

My research is creative in both form and content. As such, many people think that my projects are novel in the sense of being new. My work is also both multi- and interdisciplinary, which means that I don’t really fit well anywhere. That could also be considered novel in the sense of being unusual. Still, what I am doing is not particularly new or unusual. From my perspective, I am working incrementally; everything I am doing builds on other people’s work. True, I combine theories and methods from different fields, which makes it look novel.

I can illustrate this with a research project I just finished: MICRO. Over the last few years, we have studied human music-related micromotion, the smallest actions it is possible to produce and perceive. This is new because no one has studied such motion in a musical context before. It is also unusual because the team comprised researchers from musicology, psychology, human movement science, and computer science.

The MICRO project can be considered novel. However, does that mean that everything we did in the project was novel? Some parts were, I guess. For example, we collected data by running the Norwegian Championship of Standstill annually. This was new and unusual the first time we did it. We even got quite a lot of media interest (it is not so often that music research is featured in the sports news on national TV).

However, collecting data once does not make for outstanding science. Research is about asking questions, finding answers, and verifying those answers. Repeating experiments, making slight modifications to the research design, improving the methods, refining the analyses. That is what solid research is about.

I have researched human music-related micromotion for nearly ten years now. We have some answers, but there are many open questions. Many of these questions are neither new nor unusual any longer. But if we want to understand more about what is actually going on inside our bodies when we experience music, we need to keep researching what is no longer new and unusual. That is about doing solid research, not novel research.

Open Research is better research

I believe that open research is better research. Opening the research process makes researchers think more carefully about what they do and how they document it. This takes (some) more time than working behind closed doors. But it also makes it easier for others to understand what has been done. This is important from a peer review perspective. It also facilitates incremental research.

The MICRO project has been an open research flagship project. I began by sharing the funding application openly. Throughout the project, we have continuously described how we have worked. The data has been released in the Oslo Standstill Database, and source code has been shared on GitHub. All of this has taken time “away” from publishing journal articles. However, I think it is time for researchers to publish fewer articles and focus more on making data, code, and other materials available.

Opening the research process is part of solidifying the research. As researchers, we cannot hide behind a “black box” any longer. Everyone can scrutinize what we have done. In fact, I hope that more people will analyze our data and develop our code. That is part of the incremental nature of science.

Summing up

I am not against novel research. However, I think we have gotten to a point where there is too much focus on novelty. If you are applying for a large research grant, it may make sense to do something new. But it must also be possible to submit a presentation to a conference or a manuscript to a journal based on plain, solid research. That may, in fact, be novel in itself! Hopefully, the transition to open research will help shift the focus from novelty to solidity.