The author, William Langdon, analysed the available raw data from the project (50 billion DNA measurements) and found that around 7% of it did not match human genomes. These sequences were in fact Mycoplasma genomes.
Contamination of samples is well known in genomics, especially in groups such as Nematoda and related animals, which often ingest the cells of their hosts. Even with free-living animals, it is not unheard of for samples to be contaminated with cells from ingested food. Genomic contamination with eukaryotes is also particularly common. Species aside, contamination often comes from stray DNA in the laboratory and from fossils that have been heavily handled while lying in museums for decades. Within-species contamination, though less common, also exists.
One well-known case of contamination comes from recent work examining the interbreeding of Neanderthals with modern humans. The Neanderthal Genome Project, coordinated by the Max Planck Institute for Evolutionary Anthropology in Germany and 454 Life Sciences in the United States, sequenced the entire genome of a 130,000-year-old Neanderthal found in a Siberian cave. The researchers reported these findings in December 2013.
One of the biggest hurdles of the project was contamination of the samples by the bacteria that had colonized the Neanderthal’s body and by DNA from the humans who handled the bones at the excavation site and in the laboratory.
According to the authors, “A special challenge in analyzing DNA sequences from the Neandertal nuclear genome is that most DNA fragments in a Neandertal are expected to be identical to present-day humans. Thus, contamination of the experiments with DNA from present-day humans may be mistaken for endogenous DNA.”
Contamination in the 1000 Genomes Project, however, is surprising. It’s not the kind of ‘messy’, hard-to-retrieve data that we associate with contamination. The paper highlights the importance of best practice in data reuse when working with shared data.
Stephan Beck, professor of medical genomics at University College London, says: “When scientists download these raw data from The 1000 Genomes website, or any similar project, they should be aware of the caveat that this data is exactly that: raw. It is not surprising that contamination was found, but this should act as a warning to the community that they need to be more vigilant in filtering out this contamination.”
William Langdon, the author, echoes Beck’s concerns: “As scientists use publicly available datasets rather than collecting their own samples, there is a risk of people using data and taking it as gospel. Mycoplasma contamination is a common problem, but it’s just a case of catching it and annotating data.”
In a 2011 study, Piwowar, Vision, and Whitlock estimated the actual reuse of available data: for every ten datasets submitted to GEO, four new papers using those datasets had been published three years later. Much has been said about the better metadata and tracking of computational methods needed in order to reuse publicly archived data. However, as the sharing and reuse of data becomes more common and expands beyond the field of genomics, data provenance and best practice for data reuse will become increasingly important. This matters all the more as health care systems like the NHS offer up their own ‘data’ (such as medical records, which are often free text and weren’t generated in a consistent manner) for use by researchers.
Contamination is a serious concern for downstream analyses of data, and clean data should not be taken for granted. Although much opportunity comes with the sharing of large datasets from high-throughput sequencing, it is also clear that the removal of contaminants should be treated by default as a required step in any sequencing project.
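To make the idea of a contaminant-removal step concrete, here is a minimal toy sketch in Python. It is not the method used in Langdon's analysis, and real pipelines align reads against full reference genomes with dedicated tools. The function names, the k-mer length, and the threshold below are all assumptions chosen for illustration: a read is flagged when at least half of its k-mers also occur in a known contaminant sequence or in its reverse complement.

```python
from typing import Iterable, List, Set, Tuple

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def contaminant_fraction(read: str, contaminant_kmers: Set[str], k: int) -> float:
    """Fraction of the read's distinct k-mers found in the contaminant index."""
    read_kmers = kmers(read, k)
    if not read_kmers:  # read shorter than k: nothing to compare
        return 0.0
    return len(read_kmers & contaminant_kmers) / len(read_kmers)


def split_reads(
    reads: Iterable[str],
    contaminant_reference: str,
    k: int = 8,              # hypothetical k-mer length for this toy example
    threshold: float = 0.5,  # hypothetical cut-off, not a recommended value
) -> Tuple[List[str], List[str]]:
    """Split reads into (clean, flagged) by k-mer overlap with a contaminant."""
    # Index both strands, since a read may come from either one.
    index = kmers(contaminant_reference, k) | kmers(
        reverse_complement(contaminant_reference), k
    )
    clean: List[str] = []
    flagged: List[str] = []
    for read in reads:
        if contaminant_fraction(read, index, k) >= threshold:
            flagged.append(read)
        else:
            clean.append(read)
    return clean, flagged
```

Flagged reads would then be annotated or excluded before any downstream analysis. In practice the threshold and k-mer length would need tuning, and a human reference would be consulted as well, so that reads matching both human and contaminant sequence are handled deliberately rather than discarded.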