One of the biggest concerns around Open Data is not whether we have the technology to enable researchers to make their data open, but whether the cultural incentives are in place for researchers to share their data freely. Several publishers have recently started publishing ‘data journals’ or ‘data notes’. Is this latest publishing buzzword the answer to incentivising Open Data?
I try not to write in the first person (partly to avoid flashbacks of big red X’s from my high school essays) but this post—about something I myself have debated quite a bit—seems to demand it. As head of open data initiatives and policy here at BioMed Central, I’ve spent the last year questioning the need for ‘data notes’. Do we need them? Or are they salami slicing?
In fact, as I write, I recall discussions with well-known open science advocates pushing them to explain their dislike for such publications. I’m sure they were annoyed. They would have been surprised to know my real position at the time: I was completely on the fence. Below I sum up what has pushed me to one side or the other over this last year.
In October last year Pedro Beltrao of the EMBL-EBI argued that data note journals like Scientific Data were offering no added value that repository infrastructure did not accommodate, and thus the money spent on Article Processing Charges for such journals would be better spent supporting the current data repository infrastructure.
There were two arguments that continually made me think twice about data notes: one was the reference to repository infrastructure; the other, treating code and data as equal research objects (I’ll get back to the latter).
Most repositories are publicly funded and generally struggle for resources. Repositories like GenBank are a huge community resource, and I would never want to drain further resources from them.
So, let’s start with that first concern: my not wanting a publication type I am promoting as a publisher to purloin resources from invaluable community initiatives like subject repositories. But do these publications truly add value that the repositories themselves do not?
The benefits of narrative
As Gavin Simpson says in the comments to the above-mentioned blog, Beltrao “overestimate[s] the skills possessed by many researchers; they just aren’t going to wade through XML or learn other metadata standards in order to fully understand data placed in many subject-specific repositories.”
The narrative format has its purpose: people don’t have time to struggle with raw files and metadata schemas. A data note is more digestible, letting a researcher quickly decide whether they wish to invest more time in the dataset.
Simpson goes on, however: “Furthermore, there are many aspects of a data set’s metadata that may not be amenable to an existing repository structure . . . [for example,] photomicrographs explaining details of taxonomy used in identifying taxa in a community ecology data set; these are not the data but they are essential metadata for understanding what the researchers called what species.”
Since May I have been working on bringing over to the BioMed Central platform what one could essentially point to as a ‘data journal’—Standards in Genomic Sciences (SIGS).
SIGS has been around for several years (see archive here) and mostly publishes highly standardised ‘Genome Reports’. This journal has shown the value of such a publication without having to use the frills of the buzzword ‘data notes’.
‘Highly standardised’ refers to the fact that all genome reports comply with MIGS/MIMS, a set of metadata standards for sharing information about the data and the project. A few years ago, Jonathan Eisen was so thrilled about SIGS that he awarded the journal’s Editor-in-Chief, George Garrity, the Open Access Pioneer Award. Eisen explains well the usefulness of the genomic sequencing data reports (again, without having to use the buzzword!):
I confess, when I first heard about these standards developments, I was bored almost to tears. But now I realize that this is a very important aspect of getting the most out of genome data. If people who sequence a genome not only release the sequence data, but also a table of information about the project, such as information about the organism (e.g., aerobic vs anaerobic, location of isolation) and about the data production (e.g., sequencing methods used) then people will be able to do high throughput analyses of these features. Then we will not just be looking at sequence but also connecting these sequences to organismal features. Right now that is very hard to do since genome data is rarely accompanied by machine usable information about the organism that has been sequenced.
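The “high throughput analyses” Eisen describes become possible precisely because the organismal and project features travel as structured, machine-readable fields rather than prose. A minimal sketch of the idea, in Python, using a handful of hypothetical records (the field names and values here are illustrative, not the official MIGS/MIMS checklist terms):

```python
# Hypothetical MIGS-style metadata records attached to genome sequences.
# Field names and values are illustrative examples, not real checklist data.
records = [
    {"organism": "Organism A", "rel_to_oxygen": "anaerobe",
     "isolation_source": "hot spring", "seq_method": "Illumina HiSeq"},
    {"organism": "Organism B", "rel_to_oxygen": "aerobe",
     "isolation_source": "soil", "seq_method": "454 pyrosequencing"},
    {"organism": "Organism C", "rel_to_oxygen": "anaerobe",
     "isolation_source": "clinical sample", "seq_method": "Illumina HiSeq"},
]

# Because organismal features are machine-readable, a cross-genome question
# ("which sequenced organisms are anaerobes?") becomes a one-line filter
# rather than a trawl through the literature.
anaerobes = [r["organism"] for r in records if r["rel_to_oxygen"] == "anaerobe"]
print(anaerobes)
```

This is the connection Eisen draws: the sequence alone tells you nothing about aerobic lifestyle or isolation site, but a standardised companion table makes those features queryable at scale.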
I have to tell you, reading Eisen’s blog on this (as well as becoming very familiar with exactly what SIGS is doing, as we bring it onto our platform) has very much persuaded me of the need for these data notes.
The carrot approach to open data
Short data publications could also serve to incentivize researchers to make their data open. As Titus Brown notes of Beltrao’s blog, “Pedro is missing the idea that this is publication of data, in a peer-reviewed journal. For those of us trying to push open data, this is incredibly important; there are few such journals out there that let me argue to my evaluators that I am doing something more significant than posting unreviewed tarballs on figshare.”
What I like about Titus’s blog is, first, that a picture of Titus arguing with his evaluators immediately comes to mind and makes me smile; but, more relevant, second, that he goes on to point out that detailed data analysis can take up to five years, which means we may not see data until five years after it is generated. That’s bad for science. But it’s also bad for scientific credit.
Titus sees these data publications as good for science. They help get data out there earlier. I think people underestimate how important this can be for scientific innovation, as data traditionally has not been a shared resource.
For example, the 3,000 Rice Genomes Project recently released a data note in the journal GigaScience that quadrupled the amount of rice sequencing data available in the world. During the review process, the reviewers of the paper all commented on how crucial it was to release the data as soon as possible, as it is a massive resource for scientists working in food security and other areas.
Indeed, with climate change, and with total worldwide rice production needing to increase by 25% by 2030 to keep up with global population growth, we can’t sit on such data resources until we’re ready to publish a research article. Time is of the essence.
Data validation and peer review
As large collaborative datasets like the 3,000 Rice Genomes dataset become more common, peer review and validation of data will become increasingly important. This validation has to happen before data can be used, yet it is not consistently done by repositories. The danger is that people begin treating data coming out of repositories as gospel.
In April I wrote about an article published in the journal BioData Mining on mycoplasma contamination in the 1000 Genomes Project. Although contamination is a well-known problem in genomics, people were surprised to see it in a project of that profile. The article by William Langdon served as a good reminder of the peer review and validation needed for public datasets.
A data note, though not perfect, does offer more validation than merely depositing a dataset in a repository. There are, of course, obstacles to peer reviewing datasets, which Carly Strasser nicely sums up in her blog on the subject. As Strasser points out, validation of data can also come from the community and users of the data. Indeed, reuse is perhaps the true test.
To return to credit, Titus also sees data notes as a way to motivate his collaborators to open up their data, dangling the citation carrot in front of them. That is the argument I have always given as well. But all the while my unspoken fear was that we were actually hurting ourselves by doing this. That is, if we are fighting to make datasets equal to research articles, are data notes not taking citations away from datasets and reinforcing the idea that only published articles are worth any credit?
When to cite the dataset versus the data note?
Yet, there is added value in these publications, as they both validate and contextualise data in ways that make the data itself more useful, for example by helping to link genotype to phenotype. If there is one thing we have all learned from the genomic data deluge, it is that data in itself does not deliver meaning. But when does one cite the dataset versus the data note?
Sarah Callaghan uses the metaphor of a Work of Art (the dataset) versus a Fine Arts PhD thesis on the Work of Art (the data note), describing the dataset and data note as “two separate though related things”.
Thus, if I write a critique of the Work of Art, there’s no reason for me to then cite the thesis. But if I use information from the thesis to support my critique of the Work of Art, then I should cite the thesis. Put simply, she says “cite what you use”:
- If you use a data article to understand and make use of a dataset, cite them both.
- If you use a dataset, but don’t use any of the extra information given in the data article, cite the dataset.
- If you use a data article, but don’t do anything with the dataset, cite the article.
In reality it sounds to me like citation will mostly fall on the dataset or both, but rarely the data note alone. This seems fair enough. Indeed, in all cases in GigaScience, and probably likewise for many other data journals, reading the data note might remind someone to cite the dataset in the correct manner.
Skip to the end…
I realise this was a long post. But I’ve been thinking about this for a while! For those of you still curious as to what point I’m making, it is this: I do believe data notes add value that is currently not added by repositories. They also help us to recognise data as a first-class research object on a par with research articles.
To sum up, I believe they add value through peer review, contextualisation via metadata, and visibility through narrative. I also believe they act as a tool to help create a cultural change toward citing data and software as well as research articles.
I’d love your thoughts, though. Please feel free to comment below.