Fermenting a Reproducible Research Revolution

July 30, 2015

An anaerobic digester at BGI.

Biofuels Research Shows the Utility of Container Publishing
One of the most promising areas in biofuel development is biogas, which has huge potential as a renewable and clean source of energy. Biogas is produced through the anaerobic digestion (fermentation) of organic matter such as agricultural or food waste, which yields methane (see the picture of a typical digester used to collect methane, from our colleagues at BGI’s experimental farm). Detailed knowledge of how the fermentation process functions is key to optimizing it; however, the vast majority of the microbes involved remain unknown and cannot be cultivated in laboratories.

In a Data Note out today, researchers from Bielefeld University in Germany have characterized the complex communities of micro-organisms in a biogas plant that generates heat and power from maize silage and pig manure. Furthermore, the authors demonstrate the state of the art in making research more reproducible by disseminating everything in a virtual “Dockerised” container of their data and tools.

For their study, the researchers carried out metagenomic and metatranscriptomic analyses, which generated DNA and RNA sequences from the thousands of microbial species present. From these they were able to create a catalogue of 250,000 genes that enabled them to begin defining the underlying biology of methane production. While this data production only scratches the surface of the vast amount of information gathered, the authors furthered the usefulness of this resource by releasing all of the data and computational methods as a shareable Docker container to enable others to execute the same analyses in the cloud. This not only makes the research reproducible, but also allows researchers around the world to build on these resources to more rapidly delineate the important processes involved in biogas generation and to better explore its use for biofuel.

Elucidating the pathways involved in methane production.

Andreas Schlüter, microbiologist at the Center for Biotechnology of Bielefeld University, highlights the utility of this data, saying: “Metagenomics in connection with advanced bioinformatics approaches greatly promoted our understanding of complex microbial communities involved in biofuel production.” Microbial conversion of biomass and organic wastes under anaerobic conditions is important for the generation of valuable products and biofuels. Detailed knowledge of how the members of the fermentation community function is considered key to process shaping and optimization strategies. Hence, metagenome and metatranscriptome sequence data obtained from biomass-degrading communities are a valuable resource for elucidating their structural composition and functional potential. Moreover, identification of dominant key microorganisms and functions will help to improve and advance production of valuable intermediate metabolites and biofuels such as biogas from renewable resources.

As experiments become more data-intensive, reviewing and publishing the methods and results of scientific studies becomes increasingly challenging. To get around this, the authors used the Docker platform, which effectively wraps software in a system that includes everything needed to rerun it. This removes the need for other researchers to install and maintain the many complex bioinformatics tools and software libraries, something that can be very technically challenging for researchers without the necessary computational resources and skills.

Bioboxes for Biogas
Reviewing and publishing the methods and results of scientific studies are becoming increasingly challenging, especially as experiments become more data-intensive. To ensure reproducibility, scientific journals are increasingly asking authors to make their code and data publicly available. Nevertheless, complex analysis workflows, which depend on particular versions of bioinformatics tools and software libraries, are not trivial to install and maintain. “We decided to use virtualisation techniques to encapsulate our analysis workflow and make it basically independent from the host it is executed on,” says Andreas Bremges, first author of the study. “We containerized our analysis workflow in Docker which can be executed virtually anywhere.”
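
As a rough illustration of what running a containerized workflow like this involves, the Python sketch below assembles a `docker run` command that mounts a host data directory into a container. The image name `example/biogas-workflow` and the paths are hypothetical placeholders, not the identifiers used in the study; the only things assumed about Docker are the standard `--rm` and `-v` options of `docker run`.

```python
import shlex


def build_docker_run_command(image, host_dir, container_dir="/data", read_only=False):
    """Return the argument list for a `docker run` call that mounts host_dir.

    Every name passed in is supplied by the caller; nothing here is
    specific to the biogas study.
    """
    mount = f"{host_dir}:{container_dir}"
    if read_only:
        mount += ":ro"  # standard Docker suffix for a read-only bind mount
    # --rm removes the container once the workflow exits
    return ["docker", "run", "--rm", "-v", mount, image]


if __name__ == "__main__":
    # Hypothetical image name and path, for illustration only.
    cmd = build_docker_run_command("example/biogas-workflow", "/home/user/biogas-data")
    print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run` on a machine with Docker installed. Building the command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.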

The reproducibility of published research is an important aspect of science, and one that GigaScience is trying hard to tackle, highlights Peter Li, our Lead Data Manager, who took on the task of trying to exactly recreate the results in the paper. “Andreas and his colleagues provided a Docker container that encapsulated the method used to process the data from their biogas study. This made my job of checking the reproducibility of their results much easier as their Docker container took care of installing the bioinformatics tools and their dependencies on my cloud server.” As GigaScience is a journal that embraces open review, you can see his and the other reviewers’ reports on Publons here, as well as a blog post by reviewer Titus Brown that conveniently crowdsourced advice while he was reviewing the paper. You can even try to recreate the results yourself with the Docker-accessible version of the study, and with the snapshots of the container and supporting data hosted in our GigaDB database.
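
One simple way to check whether a rerun reproduces published output files is to compare checksums. The Python sketch below is an illustration under that assumption only; it is not the procedure used in the review, and the directory layout and function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_outputs(reference_dir, rerun_dir):
    """Map each file name in reference_dir to whether the rerun copy matches.

    A file that is missing from rerun_dir is reported as False.
    """
    results = {}
    for ref in sorted(Path(reference_dir).iterdir()):
        if ref.is_file():
            candidate = Path(rerun_dir) / ref.name
            results[ref.name] = (
                candidate.is_file()
                and sha256_of_file(candidate) == sha256_of_file(ref)
            )
    return results
```

Byte-for-byte comparison is a strict test: tools that write timestamps or run multithreaded may produce files that differ trivially, so a mismatch here is a prompt for closer inspection rather than proof that the results differ.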

Docker is a rapidly emerging framework that keeps virtualization environments small by using the Linux OS container concept. “We like the idea of Docker to ensure reproducibility of analysis workflows so much that we adopted the approach in our CAMI challenge,” says Alexander Sczyrba, senior author of the study and one of the organisers of CAMI, the Critical Assessment of Metagenome Interpretation. In collaboration with nucleotid.es we started the bioboxes project to standardize interchangeable bioinformatics software containers. Peter Belmann, core team member of bioboxes, helped in building the Docker container for the biogas study. “The container for this study is not yet bioboxes-conforming, but the next step will be to define a bioboxes standard for this kind of workflow.” The bioboxes community is currently gathering feedback on the standards they are developing, and contributions are welcomed via their GitHub RFC page. For more on container publishing, you can see Mike Barton’s talk on nucleotid.es at this year’s DockerCon (much of this was also presented at the Balti & Bioinformatics – Open data and reproducible bioinformatics Google Hangout, at which we also presented and participated).


Scott Edmunds
