The status of FAIR in higher education

Last week, I participated in the closing event of the FAIRsFAIR project, where I was asked to share some thoughts on the status of FAIR in higher education. This is a summary of the notes I wrote for the event.

What is FAIR?

First of all, the FAIR principles state that data should be:

  • Findable: The first step in (re)using data is to find them. Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for the automatic discovery of datasets and services, which makes them a key component of the FAIRification process (see the sketch after this list).
  • Accessible: Once users find the required data, they need to know how the data can be accessed, possibly including authentication and authorisation.
  • Interoperable: The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing.
  • Reusable: The ultimate goal of FAIR is to optimise the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings.
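To make the idea of machine-readable metadata concrete, here is a minimal sketch of a dataset description expressed as schema.org JSON-LD, the vocabulary harvested by services such as Google Dataset Search. All names, identifiers, and dates are made up for illustration.

    import json

    # A minimal, hypothetical dataset description using the schema.org
    # vocabulary. Harvesters can parse this JSON-LD to discover the
    # dataset automatically, without any human intervention.
    metadata = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": "Example motion capture recordings",      # made-up title
        "identifier": "https://doi.org/10.1234/example",  # made-up DOI
        "license": "https://creativecommons.org/licenses/by/4.0/",
        "creator": {"@type": "Person", "name": "Jane Researcher"},
        "dateCreated": "2021-10-26",
        "keywords": ["motion capture", "music", "FAIR"],
    }

    print(json.dumps(metadata, indent=2))

Embedding a description like this on a dataset's landing page is one simple way to address the Findable principle for machines as well as for humans.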

This all sounds good, but the reality is that FAIRifying data is a non-trivial task.

FAIR is not (necessarily) Open

Readers of this blog know that I am an open research advocate, which also means that I embrace the FAIR principles. It is impossible to make data genuinely open if they are not FAIR. However, it is essential to note that FAIR data are not the same as Open Data. It is perfectly possible to “FAIRify” closed data: that would entail making metadata about the data openly available according to the FAIR principles while keeping the actual data closed.

There are many cases where it is impossible to make data openly available. In the fourMs Lab, we often have to keep data closed, usually for one of two reasons:

  • Privacy: we perform research on and with humans and, therefore, almost always record audio and video files from which it is possible to identify the people involved. Some participants in our studies consent to us sharing their data, but others do not. So we try to anonymize our material, such as by blurring faces, creating motion videos, or keeping only motion capture data, but sometimes that is not possible. Anonymization makes sense in some cases, such as when we capture hundreds of people for the Championships of Standstill. But expert musicians are not so easy to anonymize. The sound of an expert hardanger fiddler is enough for anyone in the community to recognize the person in question, and their facial expressions may be an important part of understanding the effects of a performance. So if we cannot share their faces and their sound, the data are not very useful.
  • Copyright: even more challenging than privacy are questions related to copyright. When working with real music (as opposed to the “synthetic” music used in many music psychology studies), we need to consider the rights of composers, musicians, producers, and so on. This is a tricky matter. On the one hand, we would like to work with new music by living musicians. On the other hand, the legal territory is challenging, with many actors, national legislations, copyright unions, and so on. Unfortunately, we do not have the legal competency and administrative capacity to tackle all the practicalities of ensuring that we are allowed to openly share much of the music we use.

These challenges make sharing all our data openly difficult, but we can still make them FAIR.

How far have we come?

Three things need to be in place to FAIRify data properly:

  1. Good and structured data. This is the main responsibility of the researcher, but it is much easier said than done. Take the example of MusicLab Copenhagen, an event we ran in October. It was a huge undertaking, with lots of data and media being collected and recorded by around 20 people. We are still working on organizing the data in meaningful ways. The plan is to release as much as possible as fast as possible, but it takes an astonishing amount of time to pre-process the data and structure it meaningfully. After all, if the data do not make sense to the person who collected them, nobody else will be able to use them either.
  2. A data repository to store the data. Once the data is structured, it needs to be stored somewhere that provides the necessary tools, in particular persistent identifiers (such as DOIs). We don’t have our own data repository at UiO, so here we need to rely on other solutions. There are two main types of repositories: (a) “bucket-based” repositories that researchers can use themselves, and (b) data archives run by institutions with data curators. That brings me to the third point:
  3. Data wranglers and curators. With some training and the right infrastructure, researchers may take on this role themselves. Tools such as Zenodo, Figshare, and OSF allow researchers to drop in their files and get DOIs, version control, and timestamping (see the sketch after this list). However, in my experience, even when these technical preservation parts are in place, the data may not be (a) good and structured enough in themselves, and/or (b) accompanied by sufficient metadata and “padding” to be meaningful to others. That is why institutional archives employ professional data wranglers and curators to help with these parts.
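To give a feeling for how low the technical barrier of the “bucket-based” route can be, here is a rough sketch of depositing a file and minting a DOI through Zenodo’s REST API using Python. The access token, file name, and metadata are placeholders, and the exact endpoints should be checked against Zenodo’s developer documentation.

    import requests

    ZENODO = "https://zenodo.org/api"
    params = {"access_token": "YOUR-ACCESS-TOKEN"}  # placeholder token

    # 1. Create an empty deposition (a draft record).
    r = requests.post(f"{ZENODO}/deposit/depositions", params=params, json={})
    r.raise_for_status()
    deposition = r.json()
    bucket = deposition["links"]["bucket"]

    # 2. Upload the data file to the deposition's file bucket.
    with open("experiment_data.zip", "rb") as fp:  # placeholder file
        requests.put(f"{bucket}/experiment_data.zip", data=fp, params=params)

    # 3. Attach minimal descriptive metadata.
    metadata = {"metadata": {
        "title": "Example experiment data",  # placeholder metadata
        "upload_type": "dataset",
        "description": "Pre-processed data from a hypothetical study.",
        "creators": [{"name": "Researcher, Jane"}],
    }}
    requests.put(f"{ZENODO}/deposit/depositions/{deposition['id']}",
                 params=params, json=metadata)

    # 4. Publishing mints a DOI and timestamps the record.
    r = requests.post(
        f"{ZENODO}/deposit/depositions/{deposition['id']}/actions/publish",
        params=params)
    print(r.json()["doi"])

The ease of this workflow is exactly why the quality of the data and metadata, rather than the technical deposit itself, is usually the bottleneck.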

More people and institutions have started to realize that data handling is a skill and a profession of its own. Many universities have begun to employ data curators in their libraries to help with the FAIRification process. However, the challenge is that they are too few and too far removed from where the research is happening. In my experience, much trouble at the later stages of the data archiving “ladder” can be avoided if the data is handled well from the start.

At RITMO, we have two lab engineers who double as data managers. When we hired them back in 2018, they were among the first local data managers at UiO, and that has proven essential for moving things forward. As lab engineers, they are involved in data collection from the start and can therefore help with data curation long before we get to the archival stage. They also help train new researchers in thinking about data management and in following data from beginning to end.

Incentives and rewards

There are many challenges to solve before we have universal FAIRification in place. Fortunately, many things are moving in the right direction. Policies and recommendations are being made at international, national, and institutional levels; infrastructures are being established; and personnel are being trained.

The biggest challenge now, I think, is to get incentives and rewards in place. Sharing data openly, or at least making the data FAIR, is still seen as costly, cumbersome, and/or unnecessary. That is because of the lack of incentives and rewards for doing so. We are still at a stage where publications “count” the most in the system. Committees primarily look at publication lists and h-indexes when making decisions about hiring, promotion, or funding allocations.

Fortunately, there is a lot of focus on research assessment these days. I have been involved in developing the Norwegian Career Assessment Matrix (NOR-CAM) as a model for broadening the focus. I am also pleased to see that research evaluation and assessment is at the forefront of the Paris Open Science European Conference (OSEC) starting today. When researchers get proper recognition for FAIRifying their data, we will see a radical change.

From Open Research to Science 2.0

Earlier today, I presented at the national open research conference Hvordan endres forskningshverdagen når åpen forskning blir den nye normalen? (“How does everyday research life change when open research becomes the new normal?”). The conference is organized by the Norwegian Forum for Open Research and is coordinated by Universities Norway. It has been great to follow the various discussions at the conference. One observation is that very few question the transition to Open Research. We have, finally, come to a point where openness is the new normal. Instead, the discussions have focused on how we can move forward. Having many active researchers on the panels also shifted the focus toward solutions instead of policy.

Openness leads to better research

In my presentation, I began by explaining why I believe opening the research process leads to better research:

  • Opening the process makes the researcher document everything more carefully. For example, nobody wants to make messy data or code available. Adding metadata and descriptions also helps improve the quality of what is made available, and it helps remove irrelevant content.
  • Making the different parts openly available is important for ensuring transparency in the research process. This allows reviewers (and others) to check claims in published papers. It also allows others to replicate results or use data and methods in other research.
  • This openness and accessibility will ultimately lead to better quality control. Some people complain that we make lots of irrelevant information available. True, not everything that is made available will be checked or used; the same is the case for most other things on the web. That does not mean that nobody will ever be interested. We also need to remember that research is a slow activity. It may take years for research results to be used.

Of course, we face many challenges when trying to work openly. As I have described previously, we particularly struggle with privacy and copyright issues. We also don’t have the technical solutions we need. That led me to my main point in the talk.

Connecting the blocks

The main argument in my presentation was that we need to think about connecting the various blocks in the Open Research puzzle. Over the last few years, there has been a lot of focus on individual blocks. First came making publications openly available (Open Access). Nowadays, there is a lot of discussion about Open Data and how to make data FAIR (Findable, Accessible, Interoperable, Reusable). There is also some development in the other building blocks. What is lacking today is a focus on how the different blocks are connected.

[Figure: the building blocks of Open Research. Dark blue blocks are part of the research process, while the light blue blocks focus on applications and assessment.]

By developing individual blocks without thinking sufficiently about their interconnectedness, I fear that we miss out on some of the main points of opening everything. Moving towards Open Research is not only about making things open; it is about rethinking the way we research. That is the idea behind the concept of Science 2.0 (or Research 2.0, as I would prefer to call it).

There is much to do before we can properly connect the blocks. But some elements are essential:

  • Persistent identifiers (PIDs): Unique and permanent digital references make it possible to find and reuse digital material. Examples include DOIs for data, ORCID iDs for researchers, and so on (see the sketch after this list).
  • Timestamping: Many researchers are concerned about who did something first. For example, many people hold off on releasing their data because they want to publish an article first. That is because the data (currently) do not have any “value” in themselves. In my thinking, if data had PIDs and timestamps, they would also be citable. This should also be combined with proper recognition of such contributions.
  • Version control: It has been common to archive research results once the research is done, a practice based on pre-digital workflows. Today, it is much better to provide solutions for proper version control of everything we do.
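As a small illustration of how PIDs and timestamps work together, the sketch below resolves a DOI through content negotiation at doi.org and prints the machine-readable citation metadata, including the publication date. The DOI is a placeholder; any registered DOI can be resolved this way.

    import requests

    doi = "10.5281/zenodo.1234567"  # placeholder DOI

    # doi.org supports content negotiation: asking for CSL JSON returns
    # machine-readable citation metadata instead of the landing page.
    r = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )
    r.raise_for_status()
    record = r.json()

    # The record carries both the persistent identifier and a timestamp,
    # which together make the item findable and citable.
    print(record.get("title"))
    print(record.get("issued"))  # publication date, as CSL "date-parts"
    print(record.get("DOI"))

Because the identifier and the date travel with the metadata, a dataset resolved this way can be cited just like an article.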

Fortunately, things are moving in the right direction. It is great to see more researchers trying to work openly. That also exposes the current “holes” in infrastructures and policies.