CARMEN, reproducible research and push-button papers

Researchers release a treasure trove of data on the developing retina, pushing the boundaries of neuroscience publishing by presenting it dynamically and reproducibly.
A new paper in GigaScience today demonstrates a major step forward for reproducible research and public data-sharing in the neurosciences with the publication and release of a huge cache of electrophysiology data resources. Multielectrode array recordings are important for studying visual development, and many groups have been using them to look at developmental changes and the effects of various genetic defects on the spontaneous activity of the retina. We’ve written previously about the difficulties in sharing neuroscience data, and to date there are not many publicly available electrophysiology datasets. In this new study, researchers made available 366 recordings from 12 electrophysiology studies collected between 1993 and 2014, spending several months reading the data from different formats, then converting and annotating them in a standardized and interoperable manner as HDF5 files. The authors hope the data will serve both as an example for developing future standards and as a resource for addressing new scientific questions.

A tribute to CARMEN
We’ve written extensively on the importance of workflow management systems promoting reproducible research, and following a similar example to our series of papers utilizing our Galaxy platform, this study takes advantage of the CARMEN virtual laboratory for neurophysiology. CARMEN is an eScience project put together by a consortium of British universities to enable sharing and collaborative exploitation of data, analysis code and expertise along the complete lifecycle of neurophysiology data. Tragically, Prof Colin Ingram, the lead investigator for the CARMEN project, passed away while this paper was undergoing review, and the authors have included a dedication in this paper to his memory, as the project was initiated with his discussion and support.

On top of the 1GB of supporting data and code being available from our GigaDB database and the CARMEN portal, the authors also produced the paper in a dynamic manner, creating it using R and the knitr package. Following the reproducible research paradigm, this allows readers to see and use the code that generated each figure and table and know exactly how the results were calculated, adding confidence in the research output and allowing others to easily build upon previous work.

In a similar manner to our previous Q&A articles, we asked first author Dr Stephen Eglen from the University of Cambridge (with input from the other authors) to explain how this kind of work can encourage practices that enhance openness and reproducibility. We will also follow up in a later posting on dynamic documents, where, taking advantage of our open peer-review process (see the pre-publication history here), we discuss with the referees how easy it was to review and recreate the work in the paper.

Q&A with Dr Stephen Eglen

With PLOS tightening their data availability policies (making them closer to ours) there has been some debate and push-back recently on the feasibility of enforcing this, so do you have anything to say on this issue? You (and the authors of the papers providing data) have gone to great effort to make data and code available, and the methods transparent. Do you think it is worth it, and what can we do to encourage other authors to follow?

Our view is that wherever possible, data and materials related to a publication should be shared at the same time the article is made available. For some fields, such as genomics, data sharing is commonplace and straightforward. In neuroscience, however, public sharing of data is not routine. One way of encouraging neuroscientists to share their data is to provide some form of academic credit. Data papers in journals such as GigaScience are therefore one way forward.

As funding agencies now require data to be shared, we hope that we will see an increase in data being made available. It will be interesting to see how the funders assess compliance with data sharing.

Why is the reproducible research paradigm important to you, and how does this paper address that? What would you hope future users of these resources will be able to do with them?

Many computer programs currently used in research, e.g. for analyzing data, are quite complicated and insufficiently described in academic papers. The reproducible research (RR) paradigm encourages people to encapsulate all the materials needed to write a paper in a format that can be easily shared. When reading articles written using reproducible research techniques, it is comforting to know that you can go, see and use the code that generates each figure and table to know exactly how something was calculated. This adds confidence in the research output, and allows others to easily build upon previous work.

The RR paradigm also helps keep documents consistent and up-to-date. As a simple example, the legend of Figure 1 includes the phrase “We currently have 366 recordings in the repository, occupying 298 MB on disc”. Neither of those numbers is hard-coded in the article; both are dynamically computed by examining the files in our database. While we were writing our paper, we kept adding new datasets and these numbers always reflected the current size of the database.
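To make the idea concrete, here is a minimal sketch in Python of computing such a legend sentence from the files on disc rather than typing the numbers by hand. It is not the authors' code (their paper was written in R); the function name `repository_summary`, the `*.h5` file pattern and the use of decimal megabytes are assumptions made purely for illustration.

```python
from pathlib import Path


def repository_summary(data_dir: Path, pattern: str = "*.h5") -> str:
    """Build a legend sentence from the files currently in data_dir.

    Hypothetical helper: the directory layout and file pattern are
    assumptions, not the layout of the published repository.
    """
    files = sorted(data_dir.glob(pattern))
    total_bytes = sum(f.stat().st_size for f in files)
    total_mb = total_bytes / 1_000_000  # decimal megabytes
    return (
        f"We currently have {len(files)} recordings in the repository, "
        f"occupying {total_mb:.0f} MB on disc"
    )
```

Calling the function each time the document is built means the sentence changes by itself whenever a recording is added to or removed from the directory.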

We cannot say for sure, but we hope that our repository will continue to grow over the years. Furthermore, we hope the data will serve both as an example for developing future standards, as well as being used to address new scientific questions. We are currently working on an article developing new methods using some of these data as test cases.

Can you give a little insight about how you put the paper together? How long did it take you to collect and curate the data? As you used Knitr to make it a dynamic literate programming document, how different was this to putting together a regular static document? Was it more effort and did it take you any longer than usual?

Converting and curating the data took a few months, especially as the data from colleagues came in a variety of formats, mostly text-based. However, it seems there are only so many ways of describing spike-time data and the process got easier with each new dataset we converted. Checking the data was more difficult and in some cases required brief discussions with our colleagues.
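As a rough illustration of why conversion gets easier over time, the Python sketch below reads two made-up text layouts for spike times (one spike per line, and one channel per line) into the same structure. The layouts, the channel names and the function `parse_spike_text` are all hypothetical; the formats the authors actually received are not described here, and their converted files were stored as HDF5 rather than as in-memory dictionaries.

```python
from typing import Dict, List


def parse_spike_text(text: str) -> Dict[str, List[float]]:
    """Map each channel name to a sorted list of spike times in seconds.

    Accepts two hypothetical layouts:
      "ch_1 0.25"        one spike per line
      "ch_1: 0.25 0.50"  all spikes of a channel on one line
    """
    spikes: Dict[str, List[float]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if ":" in line:
            channel, _, rest = line.partition(":")
            times = [float(t) for t in rest.split()]
        else:
            channel, *values = line.split()
            times = [float(v) for v in values]
        spikes.setdefault(channel.strip(), []).extend(times)
    return {ch: sorted(ts) for ch, ts in spikes.items()}
```

Once every source format is reduced to the same channel-to-times mapping, writing it out in a single standard file format is one shared step instead of one per dataset.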

Writing the paper with knitr was actually fairly straightforward. The knitr framework is a pleasure to use, and highly recommended to anyone wishing to learn about reproducible research, especially, but not exclusively, for the R programming language. One of its key features is that it can cache computations, so that it only performs some task if it thinks the corresponding code in the article has changed. Furthermore, in the longer term we think it saves time: as all of the details about what you computed need to be specified in the document, you have a natural record of what you did. You therefore don’t need to keep a record elsewhere, where it would probably not get updated, or which you may forget you even had. Following our example above, by writing a dynamic document, you can also avoid worrying about what to change (e.g. a figure legend) when you add or remove datasets.
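The caching idea can be sketched in a few lines of Python. This is not how knitr itself is implemented; it is a simplified, in-memory illustration in which a chunk is re-run only when a hash of its source text changes. The class name `ChunkCache` and its methods are invented for the example.

```python
import hashlib
from typing import Any, Callable, Dict, Tuple


class ChunkCache:
    """Re-run a chunk only when its source text has changed."""

    def __init__(self) -> None:
        # chunk label -> (hash of source text, cached result)
        self._store: Dict[str, Tuple[str, Any]] = {}
        self.runs = 0  # number of times a chunk was actually executed

    def run(self, label: str, source: str, compute: Callable[[], Any]) -> Any:
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        cached = self._store.get(label)
        if cached is not None and cached[0] == digest:
            return cached[1]  # source unchanged: reuse the stored result
        result = compute()
        self.runs += 1
        self._store[label] = (digest, result)
        return result
```

A chunk whose text is untouched between builds returns its stored result immediately, which matters when a single figure takes minutes to compute. Note that this sketch only looks at the chunk's own text, so a change in the underlying data would not trigger a re-run.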

It was very sad to hear about the passing of Prof Colin Ingram at the end of last year, and you have added a dedication in the paper about this. Do you have anything you would like to say on top of this, and about the legacy of his CARMEN project?

Colin was passionate about the CARMEN project from the start. As Principal Investigator of the project and an experimentalist himself, he was a strong advocate of seeing the CARMEN platform used by neuroscientists for data analysis. It was he who initially suggested we undertake this study on retinal waves and he gave us strong and generous support all along. We hope to be able to carry his dream on and keep developing the platform, making it attractive to a broader community of basic and clinical neuroscientists.

UPDATE 26/4/14: Updated to include link to subsequent related posting on dynamic documents. This work has also just been highlighted in the April edition of the Nature Neuroscience NeuroPod podcast where you can hear an interview with Stephen talking about the feelgood factor of sharing your data.

Further Reading:
1. Eglen, SJ; Weeks, M; Jessop, M; Simonotto, J; Jackson, T; Sernagor, E. A data repository and analysis framework for spontaneous neural activity recordings in developing retina. GigaScience 2014, 3:3 http://dx.doi.org/10.1186/2047-217X-3-3
2. Eglen, SJ; Weeks, M; Jessop, M; Simonotto, J; Jackson, T; Sernagor, E. (2014): Supporting material for “A data repository and analysis framework for spontaneous neural activity recordings in developing retina”. GigaScience Database. http://dx.doi.org/10.5524/100089