Implementing Reproducible Research, recently released by CRC Press and edited by Victoria Stodden, Friedrich Leisch, and Roger Peng, clearly describes the changes needed in science and publishing to help foster reproducible research.
With contributions from key leaders in computational science, such as Titus Brown, the book covers topics ranging from good programming practice and open source computational tools to the role of publishers in reproducible research.
Below is an interview with the authors of the chapter ‘Open Science and the Role of Publishers in Reproducible Research’, Iain Hrynaszkiewicz (Outreach Director at F1000), Peter Li (Data Organisation Manager at GigaScience) and Scott Edmunds (Executive Editor at GigaScience).
Your chapter ‘Open Science and the Role of Publishers in Reproducible Research’ points to many of the advances in open science initiatives in the last 10 years. What do you think are the biggest challenges that remain?
SE: While there are obviously challenges with the scale of the data outputs we are now having to deal with, I think the main challenges remain cultural rather than technical. If the only metric all scientific research is judged on is how many citations it can gather in a very short time, then the incentive system rewards those that can churn out lots of faddish but risky research more than those that produce and disseminate carefully collected and documented datasets and useful tools—even though the latter may have a longer term legacy and may get more use (and citations) in the long term.
The recent acid-bath stem-cell papers are a perfect example: the research articles seem to have come out before detailed protocols and all the supporting materials were made available. If the incentive systems were different and the protocols and data came out first, this probably would not have happened.
IH: Despite some good examples in some communities, many scientists still aren’t sharing their data or seeing the need to. We need to change the incentive structure—in funding, promotion, research assessment—to encourage more transparency and close the skills gap there seems to be in data management and curation.
These are cultural problems, which publishers can play a role in solving. Journal policies on data access often aren’t being comprehensively enforced or aren’t being effective. Also, while we’re seeing tremendous growth in open access publishing, we’re not yet reaching half of publications even in biomedicine. For science publishing to better support reproducibility in all areas of research, we need wider adoption of gold open access—which scales much better than self-archiving and provides various other benefits. It’s important to realise that open science isn’t the goal, but a means to enhance science and discovery.
PL: I think there is still a reproducibility issue when readers of scientific papers would like to replicate the results from analyses of data. Journals and publishers can play a key role in enabling reproducibility. However, it will take extra funds and effort to achieve this; for example, data and their analyses will need to be verified and curated, and then hosted in a publicly accessible location with a unique identifier which is resolvable.
The analyses will also need to be made available in a manner that users can understand. I think the tools to do all this already exist, but they need to be integrated into the websites of those journals interested in making the work they publish reproducible.
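As a minimal illustration of the “unique, resolvable identifier” idea mentioned above, the sketch below shows how a dataset DOI can be turned into a resolvable URL via the doi.org proxy and formatted as a data citation. The DOI, author names, and citation style here are hypothetical placeholders, not any specific journal’s implementation.

```python
# Illustrative sketch: turning a dataset DOI (e.g. a GigaDB-style DataCite
# DOI) into a resolvable URL and a human-readable data citation.
# The DOI and metadata below are hypothetical placeholders.

def doi_to_url(doi: str) -> str:
    """A DOI becomes globally resolvable via the doi.org proxy."""
    return f"https://doi.org/{doi}"

def format_data_citation(authors: str, year: int, title: str,
                         publisher: str, doi: str) -> str:
    """Format an 'Author (Year): Title. Publisher. URL' style data citation."""
    return f"{authors} ({year}): {title}. {publisher}. {doi_to_url(doi)}"

citation = format_data_citation(
    authors="Example Consortium",
    year=2012,
    title="Genomic data from an example species",
    publisher="GigaScience Database",
    doi="10.5524/100000",  # hypothetical dataset DOI
)
print(citation)
```

Because the identifier resolves to one persistent landing page, citations of the dataset can be counted and tracked in the same way as citations of articles.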
The 2003 Ft Lauderdale guidelines in genomics set out clear protocols for data reuse and credit, and dispersed responsibility for these across various stakeholders—data generators, data users, funders. How effective do you think this has been within the genomics community in motivating researchers to share their data and creating best data sharing practices among researchers? Do you think similar field-specific initiatives are needed for other research fields to better embrace data sharing (e.g. in neuroscience or ecology)?
SE: As we discussed in the book chapter, while not perfect, and perhaps not working as well as they once did, the guidelines have made genomics a success story compared to most other areas of research. The backlash against the new PLoS data deposition policies from some seems totally alien to people used to the genomics community, and is a reminder that there is still a long way to go in most of biology. As with the rise of open access, I think improvements in data sharing are inevitable and will be driven by funder mandates.
IH: I agree that scientists need to see field-specific initiatives and field specific benefits—for the individual researchers, as well as benefits at the societal and funder level. While mega journals are needed and are successful, the growth in smaller specialised journals is also strong.
The PLoS data policy is, ultimately, the right approach (F1000Research has required open data for all publications since its launch over a year ago), but we need to be careful to avoid any perception that a publisher can assume one size fits all. Good examples of ‘data sharing done well’ across different disciplines can help address that. Evidence of the benefits, such as increased visibility and citations, also needs to be continually gathered and communicated.
One significant issue that arises in data sharing, if not the main issue, is that of credit. Scientists fear being scooped and not receiving credit for the data they produced. How successful has the genomics community been in integrating (or perhaps the better word is ‘naturalising’) scientific credit for data sharing?
SE: Looking back on the legacy of the Bermuda and Ft Lauderdale policies a decade on, despite them being more etiquette than legally binding guidelines, they have generally been effective in preventing scooping, except for a few rare examples. Generally they are a lesson and role model to other fields that scooping and scientists trying to take unfair credit from each other is a lot rarer than most people fear, and the benefits far outweigh the potential risks.
IH: Credit is undoubtedly important. It helps determine who gets hired and who gets funded—and data citation and data papers are important tools. However, we would be wrong to assume that credit is everything. Scientists contribute to research not purely for personal gain but also at a more altruistic level, helping to solve some of the hardest problems in the world. As well as demonstrating citation share, examples of data sharing facilitating collaboration or advancing discovery are important.
Using DataCite DOIs, the GigaScience database (GigaDB) has been able to experiment with releasing BGI datasets in a citable form before the publication of analyses. What have you found?
SE: In the two and a half years since our first DOIs we’ve been able to track many interesting and unexpected examples of re-use, for example the pigeon genome being used in studies looking at pigeon milk. At the same time there have been no negatives: journals have had no problems with this form of pre-publication release, no problems with the DOIs being integrated into the references (other than this unfortunate mix-up by Nature Genetics), and no issues with groups trying to scoop BGI with genome-scale analyses of this work. One of our DOIs also made it onto the front cover of the influential Royal Society ‘Science as an Open Enterprise’ report, so we’ve been very pleased with the result.
IH: GigaScience is showing leadership in data linking and citation, and the joint data citation principles are helping to change policies at other publishers—but we need to see them more broadly implemented.
One thing the Toronto agreement pointed to was Project Descriptions as a new type of scientific publication intended to address the issue of credit and data reuse. These didn’t really take off, but you can see the motivation behind them (i.e., they provide a citation for the data generator). Recently, we have seen the arrival of Data Notes, such as in GigaScience, and the assignment of DOIs to data, allowing them to be cited and tracked. Do you think Data Notes will be the way to conquer this problem of credit?
SE: We hope so, and as we saw with our community funded Puerto Rican Parrot genome Data Note, this can really speed research up, as it can be a great way to get some early data out there and get attention and credit for doing so. In the case of the parrot this gained them huge amounts of media coverage, and more sponsorship, funding, and potential collaborators. It has helped them fund follow-up functional and comparative genomics work.
IH: Data papers certainly work in some fields and are a way to assign credit to large or complex data sets that require additional context outside of a regular article, or where datasets have different groups of contributors/authors to the paper. But as data repositories become more sophisticated, and meet standards for discoverability, permanence, reusability, citability and licensing, we can better integrate data in repositories—such as figshare—with journal articles. Not all scientists are incentivised to publish data papers (perhaps that will change) but I see a mixed landscape evolving, where journals offer data papers where the community or the project demands it, but I also think we can dramatically improve the linking and integration of data held in repositories to articles.
PL: I think DOIs and Data Notes can help producers of data to receive credit. When data are cited via their DOIs and Data Notes, these citations can be used to track the re-use of the data. It is important that this usage information is officially considered by universities and government agencies when measuring the impact of data-producing scientists for tenure, research funding, etc.
You mention the better metadata and standards needed for data and point to the Genomics Standards Consortium (GSC), specifically their ‘minimum information about a genome sequence’ (MIGS) and ‘minimum information about a metagenome sequence’ (MIMS) checklists. This has largely been implemented at the journal level, as a requirement for their journal Standards in Genomic Sciences. This information is key to contextualising the data. Do you think enforcement by journals is the way to increase uptake of better standards documentation? Or do you think the place for this is elsewhere, say, at the repository level?
SE: With so many different standards out there, you do need to be careful to pick the correct ones (and avoid the classic XKCD ‘How standards proliferate’ trap), and editors may not be best placed to judge which these are. That said, BioSharing and MIBBI have done quite a good job of cataloguing the useful ones, and journals are one of the only groups with any teeth to enforce these things, so they need to work closely with experts to determine, and stay up to date with, which ones are essential.
IH: I think there needs to be shared responsibility. Journals can have strong influence on author/research behaviour, if their policies must be met for authors to publish, but arguably the point of publication of an article is too late. Good practices for data management and documentation need to be encouraged or required throughout the research process—from the lab/field/clinic, to the repository, to the journal. Finding ways to demonstrate the benefits of using standards is also important—such as increased reuse (citation, collaboration) potential, which helps justify any extra effort needed to use them.
PL: I think the responsibility for ensuring that data are accompanied by metadata meeting official guidelines lies primarily with the repositories. However, journals can play a more active role in enforcing the use of metadata standards for scientific data. For example, GigaScience employs a full-time data curator who is responsible for ensuring that data submitted to GigaDB and other public repositories meet the information standards required by the journal.
One standards-compliant framework you mention for data collection, management, and reuse is the ISA (Investigation–Study–Assay) Framework. This has been particularly successful for journals like GigaScience but has also been adopted by many repositories. Why do you think this initiative has been so successful?
SE: Datasets from different sources often need to be harmonized to open their content to integrative analysis, and ISA is one of the main players trying to tackle this interoperability issue. I think this will be of growing importance as research gets more data driven, and we’ve found ISA particularly useful for work such as the ‘cyber centipede’ dataset, which combined morphological description, sequence data, DNA barcoding data, and detailed morphological (X-ray microtomography) data.
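To make the interoperability point concrete, the sketch below models the ISA hierarchy—one Investigation containing Studies, each grouping Assays—as plain data structures. The field names and identifiers are simplified illustrations, not the full ISA-Tab/ISA-JSON specification, and the assays loosely echo the multi-modal ‘cyber centipede’ dataset mentioned above.

```python
# Minimal sketch of the ISA (Investigation-Study-Assay) hierarchy as plain
# nested data structures. Field names and identifiers are simplified
# illustrations, not the actual ISA-Tab/ISA-JSON specification.

investigation = {
    "identifier": "INV-EXAMPLE-1",  # hypothetical identifier
    "title": "Multi-modal description of a centipede specimen",
    "studies": [
        {
            "identifier": "STUDY-1",
            "description": "A study groups related assays on a sample",
            "assays": [
                {"measurement": "genome sequencing",
                 "technology": "Illumina"},
                {"measurement": "morphology",
                 "technology": "X-ray microtomography"},
            ],
        }
    ],
}

# Datasets structured this way can be harmonised for integrative analysis:
# the same Investigation -> Study -> Assay walk works regardless of source.
assay_types = [a["measurement"]
               for s in investigation["studies"]
               for a in s["assays"]]
print(assay_types)
```

The value of the shared hierarchy is that tools and repositories can traverse any compliant dataset the same way, which is what makes combining heterogeneous sources tractable.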
Some of the best examples of ‘open science’ (data, source code, etc.) come out of the necessity for collaboration. Specifically, you mention the ENCODE project from 2012, where data, tools, and pipelines were shared from 1600 experiments and 450 authors to produce 30 papers. How has collaboration in research changed, and will it continue to?
SE: Living in a more connected world, coupled with shrinking research budgets, means this is inevitable if we are to maximise the resources we have. Those who don’t collaborate and share the results of their publicly funded research outputs will become increasingly marginalised and irrelevant, so even if things seem slow at times, I’m generally optimistic that things will continue to change for the better.
IH: The growth of open access should stimulate and facilitate more interdisciplinary collaborations as knowledge can be more easily shared. I think we will see more examples of interdisciplinary research, and the evolution of new fields of research such as synthetic biology.
Integration and reuse of large datasets brings a need for collaboration, not just with other researchers in the same field but also with software developers, legal experts and patient groups. Open data enables independent researchers to answer questions never imagined by those who generated the data and first analysed it, and opens the door to collaborations between re-users of the data and its generators. Outside of data-intensive research, we are also seeing more engagement and involvement of patients in clinical research and peer review.
To conclude, your chapter largely points to our needing to better connect how we do science with how we communicate it. If you could point to one thing that needs to change to do this, what would it be?
SE: Death to the Impact Factor! Long live the Data Impact Factor (or next metrics).
IH: It’s hard to identify one thing. What we’re getting at here is that publishers are evolving from being stewards of, and delivery mechanisms for, content, into service providers and facilitators throughout the research process. Publishers are increasingly involved at multiple points in research through the acquisition and development of, and investment in, software tools.
From generating a hypothesis (supported by literature discovery and organisational services, like Papers) to project planning and managing data (with electronic lab notebooks and other workflow tools), as well as the more traditional areas of peer review and publication, publishers can increasingly be at the centre of the research workflow. Better connecting all of these processes and tools should make research more efficient, and hopefully more transparent. If I had to pick one aspect, however, integrating datasets with journal articles across more journals and publishers would be a good place to start.