GigaScience – a repository for large datasets


The recent explosion of genomics technology has revolutionized biology, but the resulting sequences are only really of use if people are able to access and analyze them. Storing such vast quantities of data is problematic, as the ongoing uncertainty over the future of NCBI’s arm of the Sequence Read Archive (SRA) shows. The BGI, in conjunction with BioMed Central, recently launched GigaScience, a journal aimed specifically at projects that generate large amounts of data, which can accommodate such datasets alongside the articles describing them. GigaScience also anticipates becoming a repository for stand-alone datasets, such as those resulting from genome sequencing projects. One such dataset has just been released: it contains the assembled and annotated genome sequences of three strains of sorghum, a plant of huge economic importance in the developing world as a source of food, fodder, fuel and fiber. The article describing these data has been published in Genome Biology; the raw reads are available from the SRA, and the assembled reads from GigaScience. This is the first time that a genome dataset has been cited with a DOI in an article’s reference list, so it is the first step in the process leading to researchers getting citation credits for the data they generate.

Andrew Cosgrove

Andrew obtained his PhD in molecular biology from the University of Dundee in 2005. He joined Genome Biology in 2009 after a postdoctoral research position at the University of Sheffield, where he investigated chromosome positioning during meiosis in yeast.



beatriz fernandes

Are there examples of good use of sequencing data for teaching material/motivation?

todd vision

BMC and BGI deserve credit for seeing that datasets such as this are accessible with global, resolvable identifiers. It should be recognized, however, that this is not the first time a DOI has appeared in a reference list (it is, in fact, relatively common in the earth sciences). Nor does the appearance of the dataset citation in the reference list, with or without DOI, automatically bring with it the kinds of discoverability and bibliometric prestige that authors may expect when their articles get cited. So I couldn’t agree more that it is just “the first step in the process leading to researchers getting citation credits for the data they generate.”

hans pfeiffenberger

This is really an impressive development and clearly shows how data can be expected to serve the public good!

However, to improve on one detail:
There have previously been full references to datasets (with DOIs), e.g.:

Katharina Pahnke and Rainer Zahn, Southern Hemisphere Water Mass Conversion Linked with North Atlantic Climate Variability, Science 18 March 2005: 307 (5716), 1741-1746. [DOI:10.1126/science.1102163]

Reference 37 points to this dataset:

Shackleton, NJ et al. (2000): Mean stable carbon isotope ratios of Cibicidoides wuellerstorfi from sediment core MD95-2042 on the Iberian margin, North Atlantic. doi:10.1594/PANGAEA.58229

andrew cosgrove

@Pfeiffenberger, @Vision

Thank you for drawing our attention to this. Obviously we’re not as familiar with practice in the earth sciences as we should be. It’s always good to learn from other fields. I did mean to say that this is the first genomic dataset to be cited in this way. Other genomes have been assigned DOIs but, as far as we are aware, this is the first one that has been formally cited in an article’s reference list.

I have amended the blog post accordingly.
