The FAIR data principles: the key to unlocking the value of your R&D data

James Hinchliffe

7 Jul 2022

Applying the FAIR data principles as part of a robust data management program can solve many long-standing data-related problems in R&D organizations and enable new kinds of innovation – but their power is often overlooked.

In the last few years many large R&D organizations have realized that there is huge potential locked up in their scientific data archives, and that applying modern data science and machine learning techniques can release unexpected new insights. But many efforts to make secondary use of data are frustrated not by the amount of data available, but by its condition: it isn’t well-organized, correct, unambiguous and machine-readable. Addressing this means getting to grips with data management – which can be a lot more complex than it first appears.

There have been many attempts to codify the discipline of data management over the years, but for scientific research data the FAIR principles [1] – findability, accessibility, interoperability, reusability – have become the most popular and well-known.

There are three main reasons for this:

FAIR focuses on the unique challenges of scientific data.
FAIR starts from the scientist’s point of view, not the IT department’s.
FAIR is simple to explain and is easy for people – whatever their level of IT knowledge – to understand and get behind.

But although FAIR is simple on the surface, implementing it in the real world can be complex, subtle and time-consuming, and organizations can’t wait forever for data management initiatives to finish before they unleash the potential of their data assets.

In this short blog series from Capgemini Hybrid Intelligence, we’ll draw on our years of experience with FAIR to both demystify it and demonstrate its benefits. We’ll start in this first post with a reminder of the problems that led to the creation of the FAIR principles in the first place.

Findability issues

Obviously, you can’t do anything with data that you can’t find. Sometimes the most urgent problem in a large R&D organization is helping people to easily discover data that already exists. Despite (or maybe because of) all the modern document storage and collaboration tools that today’s workplaces have, the classic ‘data silos’ problem is still very much with us – data is fragmented across document stores, databases, lab systems and even individual researchers’ local hard drives.

Non-findable data leads to a lot of obvious problems, such as research time lost as scientists hunt for data sets that may or may not exist or, worse, repeat experimental work that’s already been done. But there are subtler problems too – if data is locked up in silos, then the serendipity of discoveries that cut across functional and geographic boundaries is impeded. That serendipity plays a bigger part in scientific discovery than many people realize [2].

Accessibility issues

Knowing that an interesting data set exists in your organization is one thing, but actually getting your hands on it can still be difficult. Without transparent access models and methods, scientists from one team simply don’t have access to other teams’ data stores and have no way to find out how to request it. That means that a researcher’s ability to access data is, too often, favor-based and proportional to the size of their personal network and their ability to influence others to share.

Interoperability issues

Having found and accessed some interesting data, the next challenge is to put it to productive use. But isolated data sets often don’t yield their full meaning and value until they’re connected to and merged with other data sets, and it’s common for data scientists to spend over half their time ‘data wrangling’ – reformatting and recombining data sets – before they can start the real work of developing new insights from them.

Why is data wrangling so time-consuming? Firstly, because of ambiguity of meaning – does X in data set 1 mean the same as X in data set 2? Secondly, because of physical data formats – when the structure of a data record is left to individual experiment owners to define, the result is variability, which in turn limits how much of the data processing can be automated.
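To make those two causes concrete, here is a minimal Python sketch of the kind of wrangling involved. Everything in it is an assumption for illustration: the two labs, their field names, the shared vocabulary and the unit conversion are all hypothetical, not taken from any real system.

```python
# Hypothetical mapping from each lab's local field names to a shared vocabulary.
# This answers the "does X in data set 1 mean the same as X in data set 2?" question.
FIELD_MAP = {
    "lab_a": {"cmpd": "compound_id", "conc_uM": "concentration_um"},
    "lab_b": {"compound": "compound_id", "conc_nM": "concentration_um"},
}

# Hypothetical factors that convert each local field into the shared unit (micromolar).
UNIT_FACTOR = {"conc_uM": 1.0, "conc_nM": 0.001}


def harmonize(record, source):
    """Rename a record's fields to the shared vocabulary and convert units."""
    out = {}
    for local_name, value in record.items():
        shared_name = FIELD_MAP[source][local_name]
        factor = UNIT_FACTOR.get(local_name)
        out[shared_name] = value * factor if factor is not None else value
    return out


# Two records describing the same measurement, structured differently by each owner.
lab_a = [{"cmpd": "C-001", "conc_uM": 2.5}]
lab_b = [{"compound": "C-001", "conc_nM": 2500}]

combined = [harmonize(r, "lab_a") for r in lab_a]
combined += [harmonize(r, "lab_b") for r in lab_b]
print(combined)
```

The code itself is trivial. The expensive part is building and maintaining the mapping tables, which is exactly the knowledge that is missing when every experiment owner defines their own record structure.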

Reusability issues

Reusability is the central aim of FAIR and usually the factor that triggers organizations’ adoption of data management and FAIR in the first place. In principle, achieving F, A and I should give you most of R – but there is another important aspect of the reusability principle that needs to be resolved. Data capture processes must be designed with reusability beyond the initial purpose in mind. It’s usually extremely difficult to retrospectively make non-FAIR data reusable. Data must be born reusable, with context and implicit knowledge built in from the start. If not, the risk is that data sets will be found and analyzed under mistaken assumptions, leading to disrupted projects and sometimes to researchers’ reluctance to share ‘their’ data with others.
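As a rough illustration of what ‘born reusable’ could look like in practice, the Python sketch below attaches a context record to a data set at capture time and flags any context that was left blank. The field names and the choice of which fields are required are assumptions made for this example; a real schema would be agreed within each organization.

```python
from dataclasses import dataclass, asdict


@dataclass
class CaptureContext:
    """Hypothetical context recorded alongside a data set when it is captured."""
    experiment_id: str
    instrument: str
    units: dict
    protocol_version: str
    captured_by: str
    notes: str = ""


# Assumed list of fields a later reader would need in order to reuse the data safely.
REQUIRED = ("experiment_id", "instrument", "units", "protocol_version", "captured_by")


def missing_context(ctx):
    """Return the names of required context fields that were left empty."""
    values = asdict(ctx)
    return [name for name in REQUIRED if not values[name]]


ctx = CaptureContext(
    experiment_id="EXP-0042",
    instrument="plate-reader-3",
    units={"absorbance": "AU"},
    protocol_version="",  # implicit knowledge that was never written down
    captured_by="jdoe",
)

problems = missing_context(ctx)
print(problems)
```

Running a check like this at capture time is cheap. Reconstructing a missing protocol version years later, once the original researcher has moved on, usually is not.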

Conclusion

The FAIR principles provide a powerful solution framework that effectively overcomes several long-standing obstacles to working with R&D data. In our upcoming blogs, we’ll move from talking about problems to discussing how FAIR delivers solutions. But if you recognize any of these problems, rest assured that Capgemini can help – take a look at the FAIR data and data management services we offer as part of our vision for Data-Driven R&D.

[1] https://www.nature.com/articles/sdata201618
[2] https://qz.com/1070732/viagras-famously-surprising-origin-story-is-actually-a-pretty-common-way-to-find-new-drugs/