Reproducibility of data analyses is a big topic in scientific communities. On March 24th, 2023, we had a chance to talk about this subject with Dr. Markus Konkol, research software engineer at 52°North in Münster.
Geodata tend to be vast, and analyzing them is no simple task: it involves multiple steps across various tools and requires a consistent toolchain from collection to evaluation. Collecting the same kind of data again and again for similar purposes, although practiced frequently in the industry, adds no value. The goal, instead, should be: collect once, evaluate multiple times (and do the same with updates of data collections).
This is exactly what initiated the talk with Markus Konkol. How do you make sure that multiple evaluations lead to similar or, at best, the same results? This is a question that affects not only geodata, but scientific research in general. Researchers publish results of their scientific work and it may be hard for reviewers or other peers to reproduce them. Markus worked on this topic for his PhD and that’s where it caught our attention: “Publishing Reproducible Geoscientific Papers: Status quo, benefits, and opportunities”.
The way of providing research results is about to change. Whereas in the past it was enough to describe one’s own analysis and its outcome, it is now necessary to provide the original data, the analysis method (and software) and the results. Full reproducibility is the means by which the integrity of the results can be verified by others. In recent years, some innovative findings turned out to be mere misinterpretations of research results. To avoid this, more eyes should be on the work. Additionally, new findings should inspire other researchers to build upon them. It is much easier to do so if one can understand methods by reading code rather than just condensed text.
An article by Monya Baker about the “Reproducibility Crisis” (published in Nature in May 2016) highlighted this subject. It pointed out that most publications lack data and code. In the aftermath, the scientific community demanded that results presented in publications be reproducible.
How can it be achieved? Simply by providing all data and all evaluation software, one might think. An executable research compendium (others call it a Knowledge Package) could be the solution. But the task is a bit more complex: Firstly, the sheer size of geodata can be a challenge. Secondly, not every evaluation is fully automated; parts are often carried out manually, e.g., with QGIS. Making these manual steps fully reproducible would require a full instruction set of where to click and when. This is not typical for a scientific paper and definitely not within the scope of a reviewer to verify, especially considering that reviewers are already scarce and can hardly manage additional work.
So, what can be done? Let’s start by clarifying the goal: it should be possible to reproduce scientific results in a way that neither the peer reviewer, nor the publishing journal, nor the consumer of a paper are overwhelmed.
Part of this can be achieved by bundling data as well as software and providing both in an executable way. Virtual machines, Docker containers and the like are viable technologies for this.
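As a minimal sketch of what such an executable compendium could look like, here is a Dockerfile that bundles a small dataset with the analysis code (the file names, base image, and layout are our own illustrative assumptions, not a recommendation from the talk):

```dockerfile
# Minimal sketch of an executable research compendium image.
# File names (data/, analysis.py, requirements.txt) are hypothetical.
FROM python:3.11-slim

WORKDIR /compendium

# Bundle the (small) dataset and the analysis code together
COPY data/ ./data/
COPY analysis.py requirements.txt ./

# Pin dependencies so the environment can be rebuilt identically later
RUN pip install --no-cache-dir -r requirements.txt

# Running the container reruns the full analysis
CMD ["python", "analysis.py"]
```

Anyone with Docker installed can then rebuild the environment and rerun the analysis with two commands, without reconstructing the original toolchain by hand.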
However, this does not resolve the problem of large data repositories. Packing them in containers doesn’t quite make them any smaller. And journals, or reviewers, don’t really want to become repositories for large data sets.
For this, another approach is required: keep data and software separate and link both in a reliable manner from within a publication. This requires that data be kept for an extensive time period (e.g., ten years) in a certified way so as to make sure they don’t get changed. In addition, data sets and the source code to analyze them must be citable as individual instances so they are clearly and uniquely identifiable over their lifetime. Platforms like Zenodo (funded by the European Commission and CERN) and PANGAEA provide good entry points for digital object identifiers for data sets and software versions, thus making them citable.
An often-mentioned issue in reproducibility is the use of big data. One way to overcome this is to provide a subset of the data, making it possible for others to check the analysis workflow. In the case of sensitive data, dummy data can be a way to demonstrate reproducibility.
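The key point of such a subset is that it must itself be reproducible: publishing the random seed alongside the sample lets others re-create exactly the same subset from the full dataset. A minimal sketch (the function name and seed value are our own choices):

```python
import random

def reproducible_subset(records, k, seed=42):
    """Draw a fixed-size sample that is identical on every run.

    Using a local RNG with a published seed (instead of the global
    random state) keeps the sample independent of other code.
    """
    rng = random.Random(seed)
    return rng.sample(records, k)

# Stand-in for a large collection of observation IDs
full_dataset = list(range(1_000_000))
subset = reproducible_subset(full_dataset, k=5)
print(subset)
```

Because the seed is fixed, rerunning the script yields the same five records every time, which is exactly what a reviewer needs to verify the workflow.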
As for the software, GitHub repositories seem the way to go. They are an established technology with clear versioning, and the platform has already been around for well over ten years.
What about changes in data formats, though? Reproducibility may not only mean that you can reproduce the results with the original software, but maybe with different software even after quite some time. The support of the original data formats may not be guaranteed in this case. Therefore, it is highly recommended that data be provided in open formats (e.g. GeoJSON) and that proprietary formats be avoided wherever possible. This increases the chances of still being able to read the data.
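One practical advantage of GeoJSON is that it is plain JSON, so even a standard library suffices to read and write it; no proprietary tooling is required. A minimal sketch (the coordinates and properties are illustrative):

```python
import json

# A single point feature in GeoJSON: note longitude first, then latitude.
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [7.6261, 51.9607]},
    "properties": {"name": "Münster", "source": "illustrative example"},
}
collection = {"type": "FeatureCollection", "features": [feature]}

# Serialize to text that any JSON parser, GIS tool, or web map can consume
geojson_text = json.dumps(collection, indent=2)
print(geojson_text)
```

Round-tripping the text through `json.loads` recovers the exact structure, which is the property that keeps such data readable long after any particular desktop tool is gone.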
Now for the tricky questions: What about licensing?
The data and software in one’s research efforts may be owned by third parties and may not be sublicensed to others. Also, some data may be restricted by privacy regulations, concerns about national security, potential dual use, etc. In all these cases, it is impossible to provide the package that is required for reproducibility. The outcome can only be a “classic” paper.
For this reason, reproducible research concentrates on openly licensed data from which sensitive content is excluded. Public funding increasingly requires that data be made available to the public and that researchers provide a data management plan. This clearly strengthens the case for reproducibility. The same methodology could be applied to software, but it is harder to enforce because proprietary software solutions, toolboxes or simulation frameworks are often used. Nevertheless, researchers are increasingly turning to the powerful open-source solutions that are available.
Classic publication formats, by contrast, are static. Providing interactive user interfaces, with elements to modify not only the presentation of results but also the parameters of the underlying algorithm, makes the presentation even more insightful. However, this does not work for every kind of complex data.
Next question: When do you publish the data?
Scientists have a vital interest in making sure they can work on “their” data and benefit from their results before others get access to the same data pool. The “fear of being scooped” must be taken seriously. One solution is to grant access to the data after a specific amount of time (e.g., three years), but this seems to be just a workaround and not the general answer. In the interest of proper peer review and of not hindering scientific progress, it is also common understanding that researchers should make their datasets available upon publication of their results.
Datasets themselves can even be (an additional) subject of publication. Special journals exist (e.g., ESSD, published by Copernicus), and tools like Binder or The Whole Tale provide the necessary infrastructure.
But as everywhere in business (and science is a business – funding is often closely related to peer-reviewed publications), you can achieve the best results with incentives. These might come in different forms:
First, reviewers of journal papers should prefer papers that come with their data (and may refuse the ones that don’t).
Second, open positions at universities should require that the successful candidate support open science practice.
Third, research funding (especially government funding) should be prioritized for projects that intend to publish their data afterwards.
Last but certainly not least, reproducibility practices should be taught in courses addressing different career stages.
Does all of that render the classic journal paper irrelevant? The clear answer is no. Having data and evaluation software does not replace a description of methodology, underlying techniques, key findings, conclusions, etc. The reasoning will still need to be explained and references to related works will still need to be made.
But sticking to the old methods will not contribute to the advancement of science. Today’s decision makers may have spent most of their careers with the established paper-based peer-review process, built on trust in the results that have been provided.
Today’s technology, though, provides many more means to verify findings and share not only results, but also the underlying data and software. The mindset in the (research) society has changed, and openness is a clear expectation these days. It will be hard for young researchers to resist the gravity of this setting. And we deem this a good thing.
One final aspect we addressed in our talk with Markus was the difference between reproducibility and replicability. We only focused on reproducibility, because it is “easy” to define its boundary conditions: provide the data and the software and state which steps to apply. But it does not prevent you from being handed “manipulated” data or, to put it mildly, data selected to guarantee the reproducibility of the findings. Another limitation of reproducible research is the possibility of reproducing errors if code and data are reused without critical inspection.
Therefore, replicability is the next step, and a far harder one to achieve. You will need to reproduce results not only with the original data set, but also with different data sets and/or different evaluation methods.
For now, let’s concentrate on the first step: full reproducibility of scientific results. Markus gave us great insight into his work on this subject, and we learned quite a lot in our one-hour talk.
Big thanks to Markus for fitting us into his tight schedule.
This article is non-reproducible since it was hand-written by human beings who definitely come up with different results each time, even if you feed them the same raw data (i.e. the notes taken during our talk) again. And we love it that way – at least for this kind of publication.