Sharing Data Responsibly: The FAIR Principles

So you’ve submitted your paper, made your code publicly available, and maybe even provided documentation to ensure somebody can reproduce your work. But what about the data your work is based on? Is that readily available to your readers, too?

Maybe it’s too large to put on GitHub alongside your code. Maybe it’s sensitive, or subject to GDPR restrictions, so you can’t just stick a download link on your website. Maybe it’s stored in a proprietary format that can only be read with closed-source software. There are many reasons sharing data can be less straightforward than sharing code, and often it’s not entirely clear what ‘best practices’ are for a given situation. Data management is a complicated topic, and to do it justice would require far more than a quick blog post. Instead, I’d like to focus on a single source of guidance that serves as a useful starting point for thinking about responsible data management: the FAIR principles.

What is FAIR?

Originally introduced in a paper in Nature Scientific Data, FAIR stands for “Findable, Accessible, Interoperable, and Reusable”, with these four concepts underpinning a set of guiding principles for data management. Today, the GO FAIR website maintains an up-to-date description of these principles and a host of practical resources for implementing them with your own data. Here, we’ll briefly outline the core ideas behind FAIR.

Findability

“Findability” refers to whether it is easy for both humans and machines to locate data and accompanying metadata. Most importantly, the data should have a globally unique, persistent identifier. One example of a ‘findable’ identifier is a unique database label, such as that of a protein in UniProt. Human (pro)thrombin is uniquely identified by the UniProt ID P00734 and can be accessed manually or programmatically via the UniProt website. Another example is the DOI (Digital Object Identifier) assigned to a digital resource, such as a paper. The original FAIR paper, for example, can be uniquely identified and located by its DOI: https://doi.org/10.1038/sdata.2016.18.
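To make the ‘machines’ part concrete, here’s a minimal sketch of resolving both kinds of identifier programmatically, using only Python’s standard library. It assumes UniProt’s current REST endpoint (rest.uniprot.org) and the DOI proxy’s handle API; the exact URLs and JSON field names may well change over time, but the identifiers P00734 and 10.1038/sdata.2016.18 should not.

```python
import json
import urllib.request

# Fetch the UniProt entry for human prothrombin by its persistent accession.
# Assumes the current UniProt REST API layout; the accession itself is stable.
with urllib.request.urlopen("https://rest.uniprot.org/uniprotkb/P00734.json") as r:
    entry = json.load(r)
print(entry["proteinDescription"]["recommendedName"]["fullName"]["value"])

# Ask the DOI resolver where the FAIR paper's DOI currently points.
doi = "10.1038/sdata.2016.18"
with urllib.request.urlopen(f"https://doi.org/api/handles/{doi}") as r:
    record = json.load(r)
print(next(v["data"]["value"] for v in record["values"] if v["type"] == "URL"))
```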

An important, yet often overlooked, aspect of findability is ensuring that any links used are static and persistent. When I linked to the UniProt entry above, I made the assumption that the link would remain valid in the future. In practice, however, this is far from guaranteed, and it’s easy for resources to be lost to the sands of time if their URL changes – the data might still be somewhere on the website, but if the URL in your paper is no longer valid, readers might struggle to find your data. For example, the PDB entry 6LU7, released in early 2020, features a structure of the SARS-CoV-2 main protease. Currently, this may be accessed at https://www.rcsb.org/structure/6LU7, but at the time of publication the URL provided in the paper made use of a now-retired API and is no longer accessible: https://www.rcsb.org/pdb/search/structidSearch.do?structureId=6LU7. For a resource as well-known as the PDB, we can simply look up the entry 6LU7 (which is a findable identifier), but for a less well-known resource, a dead link might lead to data being difficult or even impossible to find.
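This is why, where possible, it’s better to lean on the identifier than on any particular URL. As a rough sketch, assuming RCSB’s current data API at data.rcsb.org (which is, of course, subject to the same caveats about changing over time), 6LU7 can be resolved programmatically like so:

```python
import json
import urllib.request

# Resolve a PDB entry by its identifier rather than a hard-coded page URL.
pdb_id = "6LU7"
url = f"https://data.rcsb.org/rest/v1/core/entry/{pdb_id}"
with urllib.request.urlopen(url) as r:
    entry = json.load(r)
print(entry["struct"]["title"])  # title of the deposited structure
```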

Accessibility

Once data has been found, it should be easily accessible via a clearly-communicated protocol. In principle, anybody with an internet-connected computer should be able to use the protocol to access your data. This does not necessarily mean you need to host your data somewhere that anybody can download it directly, as there are many reasons this may not always be practical or even legal. However, the process by which your data can be accessed should be clearly described and should not require any proprietary or platform-specific tools.

Common reasons for not being able to share your data directly include lacking sufficient storage on a public-facing server and being unable to distribute data subject to GDPR restrictions. In such cases, FAIR suggests that we can still make data more accessible by providing a single point of contact for users to request access (e.g. a group email address or request form) and a clear description of the procedure for requesting access to the data.

Another concern is that hosting data is expensive and, while you might be able to host data for your new paper on your website, you might not be able to keep it available for download forever. In this case, FAIR tells us that it is important to ensure the URL for the data is preserved even when the data itself is no longer directly available. By using this URL to host the associated metadata and a point of contact for requesting access, you give users the best possible chance of locating and accessing your data, even when you can’t afford to maintain hosting forever.
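What might such a metadata record look like? Here’s a minimal, hypothetical sketch – the field names, DOI, and contact address below are all placeholders, and in practice you’d want to follow an established metadata scheme such as DataCite:

```python
import json

# A hypothetical metadata record to serve at the dataset's persistent URL
# once the data itself can no longer be hosted. All values are placeholders.
metadata = {
    "title": "Example kinase structure dataset",
    "identifier": "https://doi.org/10.xxxx/example",  # placeholder DOI
    "description": "Curated kinase structures derived from the PDB.",
    "access": {
        "status": "available on request",
        "contact": "data-requests@example-lab.org",  # hypothetical address
        "procedure": "Email the address above with a brief research summary.",
    },
}

# Serve (or statically host) this file at the dataset's persistent URL.
with open("metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```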

Interoperability

While making your data easy to access might tick all of the boxes required by the journal’s data availability policy, it doesn’t guarantee that the data is easy to use. ‘Interoperability’ refers to ensuring that data is easy to integrate with other data sources and to use with different software and workflows. This is simple in concept but often hard to execute in practice. However, FAIR provides some guiding principles to maximise the interoperability of your data.

Perhaps the most obvious aspect of interoperability is ensuring that the data and metadata are provided in a static format that is readable using open, maintained tools. For example, CSV files are good; an Excel sheet full of macros and graphs, or a dump from a long-deprecated relational database management system, is clearly not so good. A less obvious example might be the use of Python’s pickle module to serialise an object representing your data. Although pickle itself is backward-compatible (files created using older versions can be read using newer versions of the module), the pickled object itself might not be. A user who tries to open your data using different versions of Python packages than those used to generate it might therefore be unable to use your data at all.
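To illustrate the contrast (a contrived sketch, with a made-up Measurement class): the pickle below can only be loaded by Python, and only while this exact class definition remains importable, whereas the CSV can be read by essentially any tool, now or decades from now.

```python
import csv
import pickle

# A custom class standing in for some analysis object. Unpickling later
# requires this exact class (and compatible package versions) to be present.
class Measurement:
    def __init__(self, sample_id, value):
        self.sample_id = sample_id
        self.value = value

data = [Measurement("A1", 0.42), Measurement("A2", 0.57)]

# Fragile: Python-only, and tied to the class definition above.
with open("data.pkl", "wb") as f:
    pickle.dump(data, f)

# Robust: a plain, static format any tool can read.
with open("data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sample_id", "value"])
    for m in data:
        writer.writerow([m.sample_id, m.value])
```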

In addition to data format, it’s also important to ensure that the structure of the data is clearly documented, and that this documentation is easily findable and accessible alongside the data itself. This might include, for example, schema documentation for a SQL database or a description of row and column labels for a CSV file. Further, any relationships between your data and other data sources should be clearly described (e.g. this data was derived from X data source by applying Y method). If you’ve ever accidentally included some data points from your test set in the training data when working with machine learning, you’ll know just how important this last point can be!
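As a simple illustration, such documentation – covering both the structure of the data and its relationship to other sources – could be as lightweight as a data dictionary shipped alongside the CSV. The structure below is purely illustrative rather than a formal standard (schemes such as Frictionless Table Schema offer a standardised alternative):

```python
import json

# An illustrative data dictionary to ship alongside data.csv, documenting
# each column and the data's provenance. All content here is hypothetical.
data_dictionary = {
    "file": "data.csv",
    "columns": {
        "sample_id": "Unique sample identifier, e.g. 'A1'.",
        "value": "Mean absorbance at 280 nm over three technical replicates.",
    },
    "derived_from": "Raw plate-reader output, averaged per sample.",
}

with open("data_dictionary.json", "w") as f:
    json.dump(data_dictionary, f, indent=2)
```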

Reusability

Finally, it’s important to ensure that data can actually be reused by different users for different purposes, and to make this clear to users. The context and limitations of the data should be clearly described (e.g. “this data set features only kinase protein structures present in the PDB as of 01/01/2022”), and the data should conform to any established community standards (e.g. the AIRR community provides a detailed set of standards for AIRR sequencing datasets to ensure maximum reusability by different users).

It’s also important to think about the licence under which data is distributed. As scientists, we often want to say “here’s the data, you can do whatever you like with it”. However, just like with software, this often isn’t sufficient, particularly when it comes to permitting the use of data in commercial applications. It’s therefore important to ensure that data is distributed with an open licence that gives explicit permissions to users for different applications (e.g. including/excluding commercial use). Creative Commons licences are commonly used for this purpose; however, if you’re distributing a database, you may find the Open Data Commons licences more appropriate.

Summary

Best practices for open data management are difficult to define, as the world of data, like data itself, is often noisy and heterogeneous, and what works for one field might be completely useless for another. By its very nature, science requires that we work with data in novel ways, making it difficult to anticipate exactly what users might like to do with data in the future. However, by keeping in mind a few core principles, such as those outlined by FAIR, it’s possible to ensure that the data we release is as open and useful as possible to the scientific community.