The status of FAIR in higher education

I participated in the closing event of the FAIRsFAIR project last week. For that, I was asked to share thoughts on the status of FAIR in higher education. This is a summary of the notes that I wrote for the event.

What is FAIR?

First of all, the FAIR principles state that data should be:

  • Findable: The first step in (re)using data is to find them. Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for automatic discovery of datasets and services, so this is an essential component of the FAIRification process.
  • Accessible: Once the user finds the required data, they need to know how the data can be accessed, possibly including authentication and authorisation.
  • Interoperable: The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing.
  • Reusable: The ultimate goal of FAIR is to optimise the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings.

This all sounds good, but in reality, FAIRifying data is a non-trivial task.
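The "machine-readable metadata" mentioned under Findable can be as simple as a structured record with standard fields. Below is a minimal sketch in Python; the field names are loosely modeled on the DataCite schema, and all values (title, creator, DOI) are hypothetical:

```python
import json

# A minimal, illustrative machine-readable metadata record.
# Field names are loosely modeled on the DataCite schema; the
# values (title, creator, DOI) are hypothetical placeholders.
record = {
    "identifier": {"identifierType": "DOI", "identifier": "10.1234/example"},
    "title": "Motion capture recordings (example dataset)",
    "creators": [{"name": "Doe, Jane"}],
    "publicationYear": 2022,
    "resourceType": "Dataset",
    "rightsList": [{"rights": "Restricted access; metadata openly available"}],
}

# Serializing to JSON makes the record easy for machines to harvest,
# even when the underlying data files themselves remain closed.
print(json.dumps(record, indent=2))
```

Note that a record like this can be published openly even for a closed dataset, which is exactly what separates FAIR from Open.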

FAIR is not (necessarily) Open

Readers of this blog know that I am an open research advocate, and that also means that I embrace the FAIR principles; it is impossible to make data genuinely open if they are not FAIR. However, it is essential to note that FAIR data is not the same as Open Data. It is perfectly possible to “FAIRify” closed data: that entails making metadata about the data openly available according to the FAIR principles while keeping the actual data closed.

There are many cases where it is impossible to make data openly available. In the fourMs Lab, we often have to keep the data closed. There are usually two reasons for this:

  • Privacy: we perform research on and with humans and therefore almost always record audio and video files from which it is possible to identify the people involved. Some participants in our studies consent to us sharing their data, but others do not. We try to anonymize our material, for example by blurring faces, creating motion videos, or keeping only motion capture data, but sometimes that is not possible. Anonymized data make sense in some cases, such as when we capture hundreds of people for the Championships of Standstill. Expert musicians, however, are not so easy to anonymize. The sound of an expert Hardanger fiddler is enough for anyone in the community to recognize the person in question, and their facial expressions may be an important part of understanding the effects of a performance. If we cannot share their face or their sound, the data is not useful.
  • Copyright: even more challenging than the privacy matters are questions related to copyright. When working with real music (as opposed to the “synthetic” music used in many music psychology studies), we need to consider the rights of composers, musicians, producers, and so on. This is a tricky matter. On the one hand, we would like to work with new music by living musicians. On the other hand, the legal territory is challenging: there are many actors, national legislations, copyright unions, and so on. Unfortunately, we do not have the legal competency and administrative capacity to handle all the practicalities of ensuring that we are allowed to share much of the music we use openly.
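One of the anonymization steps mentioned above, blurring a face region in a video frame, can be sketched in a few lines. This is a toy example using a synthetic NumPy array as a stand-in frame; a real pipeline would read video frames and locate faces with a detector, which is not shown here:

```python
import numpy as np

def box_blur(region: np.ndarray, k: int = 5) -> np.ndarray:
    """Naive box blur: average each pixel over a k x k window (clipped at edges)."""
    h, w = region.shape
    out = np.empty_like(region, dtype=float)
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - k // 2), min(h, y + k // 2 + 1)
            x0, x1 = max(0, x - k // 2), min(w, x + k // 2 + 1)
            out[y, x] = region[y0:y1, x0:x1].mean()
    return out

# Synthetic grayscale "frame" and a hypothetical detected face box.
frame = np.arange(100, dtype=float).reshape(10, 10)
face = (slice(2, 8), slice(2, 8))

# Blur only the face region, leaving the rest of the frame intact.
frame[face] = box_blur(frame[face])
```

In practice, the hard part is not the blurring itself but deciding whether blurring is sufficient anonymization at all, as the Hardanger fiddle example shows.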

These challenges make sharing all our data openly difficult, but we can still make them FAIR.

How far have we come?

Three things need to be in place to FAIRify data properly:

  1. Good and structured data. This is primarily the researcher’s responsibility. However, it is much easier said than done. Take the example of MusicLab Copenhagen, an event we ran in October. It was a huge undertaking, with lots of data and media collected and recorded by around 20 people. We are still working on organizing the data in meaningful ways. The plan is to release as much as possible as fast as possible, but it takes an astonishing amount of time to pre-process and structure the data in a meaningful way. After all, if the data does not make sense to the person who collected it, nobody else will be able to use it either.
  2. A data repository to store the data. Once the data is structured, it needs to be stored somewhere that provides the necessary tools and, in particular, persistent identifiers (such as DOIs). We don’t have our own data repository at UiO, so here we need to rely on other solutions. There are two main types of repositories: (a) “bucket-based” repositories that researchers can use themselves, and (b) data archives run by institutions with data curators. That brings me to the third point:
  3. Data wranglers and curators. With some training and the right infrastructure, researchers may take on this role themselves. Tools such as Zenodo, Figshare, and OSF allow researchers to drop in their files and get DOIs, version control, and timestamping. However, in my experience, even when these technical preservation parts are in place, the data may not be (a) good and structured enough in itself, and/or (b) have sufficient metadata and “padding” to be meaningful to others. That is why institutional archives employ professional data wranglers and curators to help with these tasks.
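The gap between "files with a DOI" and genuinely reusable data can be made concrete with a simple pre-deposit check. The sketch below is purely illustrative: the required-field list is my own assumption, not the actual schema of Zenodo, Figshare, OSF, or any institutional archive.

```python
# Illustrative pre-deposit check: before dropping files into a
# repository, verify that the record carries the metadata others
# will need. REQUIRED_FIELDS is an assumption for this sketch.
REQUIRED_FIELDS = {"title", "creators", "description", "license", "keywords"}

def missing_metadata(record: dict) -> set:
    """Return the required fields that are absent or empty."""
    return {f for f in REQUIRED_FIELDS if not record.get(f)}

# A hypothetical half-finished deposit.
draft = {
    "title": "MusicLab Copenhagen motion data (example)",
    "creators": ["Doe, Jane"],
    "license": "CC-BY-4.0",
}

print(sorted(missing_metadata(draft)))  # → ['description', 'keywords']
```

A check like this catches the mechanical gaps; judging whether a description is actually meaningful to outsiders is where human data curators come in.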

More people and institutions have started to realize that data handling is a skill and a profession of its own. Many universities have begun to employ data curators in their libraries to help with the FAIRification process. However, the challenge is that they are too few and too far removed from where the research is happening. In my experience, much trouble at the later stages of the data archiving “ladder” can be avoided if the data is handled better from the start.

At RITMO, we have two lab engineers who double as data managers. When we hired them back in 2018, they were among the first local data managers at UiO, and that has proven to be essential for moving things forward. As lab engineers, they are involved in data collection from the start, so they can help with data curation long before we get to the archival stage. They also help train new researchers in thinking about data management and following data from beginning to end.

Incentives and rewards

There are many challenges to solve before we have universal FAIRification in place. Fortunately, many things are moving in the right direction: policies and recommendations are being made at international, national, and institutional levels, infrastructures are being established, and personnel trained.

The biggest challenge now, I think, is to get incentives and rewards in place. Sharing data openly, or at least making the data FAIR, is still seen as costly, cumbersome, and/or unnecessary. That is because of the lack of incentives and rewards for doing so. We are still at a stage where publications “count” the most in the system: committees primarily look at publication lists and h-indices when making decisions about hiring, promotion, or funding allocations.

Fortunately, there is a lot of focus on research assessment these days. I have been involved in developing the Norwegian Career Assessment Matrix (NOR-CAM) as a model for broadening the focus. I am also pleased to see that research evaluation and assessment are at the forefront of the Paris Open Science European Conference (OSEC), which starts today. When researchers get proper recognition for FAIRifying their data, we will see a radical change.