Notes from an E. coli “tweenome” – lessons learned from our first data DOI.


Last week marked two important milestones in the deadly 2011 European E. coli O104:H4 outbreak: the Robert Koch Institute announced the end of the outbreak, and several papers were published by the many groups sequencing the pathogen. These included a publication in the New England Journal of Medicine by groups from the BGI, UMC Hamburg-Eppendorf, and Birmingham University, which acknowledged members of the crowdsourcing community and the work achieved using the genome sequence our colleagues at the BGI made available via our GigaScience database. This was our first dataset released with a DOI and under the freest CC0 public domain license, so now is a great opportunity to look back at the consequences of this novel form of data release.

Due to the unusual severity of the outbreak (thousands severely ill and 50 deaths to date), it was clear that the usual scientific procedure of producing data, analyzing it slowly and then releasing it to the public after a potentially long peer-review process would have been unhelpful in this case. The first genomic data was announced via Twitter before it had even finished uploading to NCBI, and its use was promoted and subsequent improved assemblies were released in the same way. As a result, a huge community of microbial genomicists around the world took up the challenge of studying the organism collaboratively (a process that some said had made E. coli the first “Tweenome”). Once a GitHub repository had been created (thanks to the efforts of the Era7 team in Spain) to provide a home for these analyses and data, groups around the world started producing their own annotations and assemblies within 24 hours. Within a couple of days a potential ancestral strain had been identified (further clearing Spanish farmers of the blame), and the many antibiotic resistance genes and pathogenic features were much more clearly understood. Releasing the data under a CC0 license allowed truly open-source analysis, and the UK HPA and GitHub members followed suit in releasing their work in this way.

Huge progress was achieved in record time, and from this incredibly speedy work a free diagnostic protocol and free primers were distributed by the BGI to help track the source of the outbreak immediately. On top of the good feeling and positive coverage this generated (despite some inevitable disagreement over credit and what exactly was achieved), these novel forms of pre-publication data release did not prevent the acquisition of more traditional forms of scientific credit: publication in prestigious scientific and medical journals.

On top of all of the scientific and public health lessons to be learned, from a journal perspective this is a very important example and test case of how new and faster methods of scientific communication and data dissemination can still complement and work alongside the traditional systems. This is particularly clear as the open-source analysis was published in the New England Journal of Medicine, a prestigious organ with a nearly 200-year history and the originator of the Ingelfinger rule, which causes issues in some (mainly medical) journals regarding certain pre-publication forms of data release. Maximizing the use of the data by putting it into the public domain did not trump the scientific etiquette and convention that allow those producing the data to be attributed and take credit. This is a great argument in favour of open data, and an important lesson to all scientists worrying about setting their data free.

This was (we think) the first ever citable data DOI issued for an unpublished genome, and this new form of intermediate credit (similar to microattribution) did not hinder the eventual publication of the genome analysis paper. We’d like to thank our collaborators at DataCite and the British Library for their help issuing the DOIs, and hope it provides a good example for similar data producers and projects to follow. We have followed this example with the release of additional unpublished genomes, and large supplementary datasets associated with articles in GigaScience will be given DOIs to make them more trackable and findable, further showing their interoperability with traditional scientific articles and forms of data release. This particular outbreak was unusually severe, and the sterling efforts of the medical community and the suffering of those affected should not be forgotten. Whilst there are still many unanswered questions and huge amounts of work still to be done, many lessons have hopefully been learned, and (as highlighted here) this project provides an excellent example for the future of how a more collaborative and open form of science can be carried out. As GigaScience would like to be a forum for the discussion of these issues, as well as to promote and work with the open-science movement, we strongly hope that this can continue and grow.
