On 2nd September 2010 BioMed Central issued a draft position statement in support of open data, which provides a number of recommendations – and aspirations – for the role of publishers in promoting reproducible research by increasing scientific data sharing, data reuse and open data.

In the months since the BioMed Central statement, a number of high-level policy statements on data sharing have emerged, including the Oxford Data Sharing Statement, a report to the European Commission, and a joint pledge from a consortium of 17 major public health research funding agencies.

These are positive developments, but there are many details and practicalities still to be worked out. So rather than ‘why share data?’, the question now is ‘how?’. A number of projects, initiatives and working groups have formed around issues of data sharing and reproducible research, but a need has been identified for shared understanding on three key issues affecting authors, editors, publishers and funders of life science research.

Licenses and waivers
BioMed Central believes that open data should mean that data are freely available on the public internet, permitting any user to download, copy, analyse, re-process, pass them to software or use them for any other purpose without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. Around the same time that BioMed Central’s open data statement was issued, Heather Piwowar and Peter Murray-Rust of the Open Knowledge Foundation sent enquiries to a number of publishers – Nature, PLoS, BioMed Central included – about the openness of the data published in their journals. The enquiry applied to raw data and metadata published as supplementary material (additional files), and to data extractable from article full text, tables and figures. They established that much data are freely available for harvesting, under Creative Commons attribution licenses, but are not yet fully openly available according to the Panton Principles for Open Data in Science. Many members of the scientific community have signed in support of these principles; it is time to put them into practice (see proposed goal #1).

Supplementary (additional) files
Editors and publishers are acutely aware of the limited pool of peer reviewers, who are increasingly called upon to help ensure the integrity of the published record. The online availability of research data as supplementary (additional) files has prompted debate about the role of peer review in this non-written material, and indeed the role of journals in publishing this material (see proposed goal #2).

Journal and funder policies
Journal and research funding agency policies are important for sharing best practices and influencing researcher (author) behavior. They are usually consensus-driven, and motivated by the needs of the communities and audiences they serve. There are a number of different approaches to journal data sharing policies, and the most appropriate approach may depend on the field of research. However, with more broad-scope, multi-disciplinary – and predominantly open access – journals emerging, a single data sharing policy must increasingly apply to a large number of fields (see proposed goal #3).

It seems a way forward would be to organize a meeting of editors (authors), publishers, and funding agencies to encourage debate, and to investigate the potential for developing widely agreeable and transferable guidelines, processes and policies enabling reproducible research and open data. BioMed Central has previously led an initiative incorporating similar stakeholders in scientific research, and is willing to coordinate a meeting, and form a working group, to facilitate this initiative.

We look forward to hearing from you.

Potential goals of the proposed Publishing Open Data Working Group

1. Establish a process and policy for implementing a variable publishers’/authors’ license agreement, allowing public domain dedication of data and data elements of scientific articles

The copyrightability of data and facts varies by jurisdiction, creating potential obstacles to reuse. Explicit open licenses or waivers that place data in the public domain ensure maximum reproducibility and interoperability. This is necessary because providing full legal attribution, as is required by Creative Commons attribution licenses, for all facts/data in a large collection may not be practical. Moreover, applying appropriate licenses to the different components of the products of research aims to ensure attribution, facilitate sharing of knowledge and ensure reproducibility.
A key proposal in BioMed Central’s open data statement was that from a specific date any author submitting to a BioMed Central journal would agree to dedicate the data elements of their article and supplementary material to the public domain and apply an open data conformant license or waiver, such as Creative Commons CC0. This needs to be done in careful consultation with the scientific community to ensure that researchers still receive appropriate credit for their contributions, and that authors understand the implications of different licenses and waivers applied to their work.

2. Consensus on the role of peer reviewers in articles including supplementary (additional) data files
There has been continuing debate in scholarly circles about the role of journals in publishing data online as supplementary material (additional files). Although concerns over space limitations are less applicable for (predominantly) online open access journals, different journals/publishers are taking different stances on the risks and benefits of publishing supplementary data. Some believe that supplementary material may have a negative impact on already-overloaded peer reviewers, whereas others – BioMed Central included – believe that transparency, and providing a service to scientific communities which do not have widely-supported data repositories, take precedence. Moreover, increasing numbers of journals are publishing data papers (data notes), where a data set is the main component of the article. This suggests the role and types of peer reviewers for published data may need to be better defined.

3. Sharing of information and best practices on implementation of journal data sharing/deposition policies
There are many recognized benefits of openly sharing data, for science, the economy and public good. But while comprehensive evidence of the benefits to the individual scientist of openly sharing data is still emerging, journal and funder policies are important drivers for change in researcher behaviour. Different journals and publishers are taking different approaches in their requirements for data sharing from their authors. Data sharing can be implied, on request, as a condition of submission or publication (e.g. BioMed Central, PLoS); it can be required for editors and peer reviewers (e.g. Nature); a link to the data set via DOI can be required as a condition of acceptance and publication (Joint Data Archiving Policy, e.g. American Naturalist); or a statement as to the availability of data can be required without any change in behaviour other than the provision of the statement (e.g. Annals of Internal Medicine, BMJ). The full impact of these policies may not yet be known, but it might be reasonable to assume that by increasing transparency in scientific research we are promoting reproducibility, a core principle of the scientific method, and avoiding bias. What are the advantages and disadvantages of each approach, and what experience can be shared from journals that have implemented the most stringent policies?

egon willighagen

Dear Iain,

thanx for this interesting and important post! I absolutely agree that some standards need to be set. Scientists have been unable to do this, and publishers can distinguish themselves from competition in doing this right.

Without going into detail about what ‘right’ is (I have very strong opinions on that :), what is important for BioMedCentral right now is to put the advantages so closely in front of scientists that they can no longer ignore them, or say ‘whatever’ (which they do now).

BMC must therefore demonstrate what this reuse, reproducibility, etc, practically means. So, Goal 0 must be: do something with the ‘additional files’: process them yourself and 1) associate every single additional file with facts about that file; 2) index them, and create a search engine to search additional files based on their content, *across* all BMC journals; 3) provide alternative download formats, showing what it means to use Open Standards.

Open Data is not the goal; it’s the means to do science better.

About 1. Every additional file should have a separate web page (or page section), listing not just size, but also the exact format (MS-Excel 2000, rather than ‘Excel’… versioning matters!), metadata present in that file (author, creation date, does it have macros defined, etc), and statistics about that file (number of sheets in the spreadsheet, number of filled cells, etc).
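
As a rough illustration of the kind of facts meant here, the following is a minimal sketch in Python. It assumes the additional file is a UTF-8 CSV file (reading Excel formats would need a spreadsheet library), and the function name `describe_csv` and the chosen fields are hypothetical, not anything BMC provides.

```python
import csv
import mimetypes
import os


def describe_csv(path):
    """Collect a few simple facts about a CSV additional file.

    Hypothetical helper: the field names are illustrative only.
    """
    # Guess the media type from the file extension (may be None if unknown).
    mime_type, _ = mimetypes.guess_type(path)

    rows = 0
    filled_cells = 0
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            rows += 1
            # A cell counts as filled if it holds anything besides whitespace.
            filled_cells += sum(1 for cell in row if cell.strip())

    return {
        "name": os.path.basename(path),
        "size_bytes": os.path.getsize(path),
        "mime_type": mime_type,
        "rows": rows,
        "filled_cells": filled_cells,
    }
```

The resulting dictionary could be rendered on the per-file page; format version and embedded author metadata would need format-specific readers and are left out of this sketch.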

About 2. It is of utmost importance that we can discover this supplementary information, and it must be easy to search for stuff using free text (e.g. I want to find all additional files across all BMC journals that have ‘tryptamine’ somewhere in the additional file, even if that information is stored in Excel files *inside* zip files). Current technology, such as Strigi, makes that very easy.
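
To make the ‘files inside zip files’ case concrete, here is a minimal sketch in Python using only the standard library. It is not how Strigi works; it simply descends into nested zip archives and does a case-insensitive byte search. The function name `find_term` is hypothetical. It assumes the text is stored as plain (or zip-compressed) UTF-8, which covers CSV and the zip-based .xlsx container in simple cases, but not legacy binary .xls files.

```python
import io
import zipfile


def find_term(data, term, name="<root>"):
    """Yield the names of files whose bytes contain `term`.

    `data` is the raw content of a file. If it is a zip archive, every
    member is searched recursively, so archives inside archives are covered.
    """
    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        with zipfile.ZipFile(buffer) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                member_name = f"{name}/{info.filename}"
                yield from find_term(archive.read(info), term, member_name)
    elif term.lower().encode("utf-8") in data.lower():
        # Plain file: naive case-insensitive substring match on the bytes.
        yield name
```

A real search engine would build an index once rather than scan every file per query; the sketch only shows that the nesting itself is not an obstacle.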

About 3. As reuse is the key here, the use of Open Standards is important. This could be stressed by showing that files with Open Standards can easily be interconverted, such as spreadsheets into CSV or HTML tables. Just offering alternative download formats makes the ‘Additional files’ more useful, and encourages the authors to ensure that they provide data in the right formats.
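
The interconversion point can be sketched in a few lines of Python with the standard library. This is a minimal example that assumes the data are already available as CSV text with a header row; the function name `csv_to_html_table` is hypothetical.

```python
import csv
import html
import io


def csv_to_html_table(csv_text):
    """Convert CSV text into a plain HTML table, treating row 1 as the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return "<table></table>"

    def render(cells, tag):
        # Escape each cell so that characters like < and & stay valid HTML.
        inner = "".join(f"<{tag}>{html.escape(cell)}</{tag}>" for cell in cells)
        return f"  <tr>{inner}</tr>"

    lines = ["<table>", render(rows[0], "th")]
    lines.extend(render(row, "td") for row in rows[1:])
    lines.append("</table>")
    return "\n".join(lines)
```

For example, `csv_to_html_table("compound,mass\ntryptamine,160.22\n")` gives a table with one header row and one data row, which could be offered next to the original download.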