Genome Informatics is an annual conference focusing on computational approaches for understanding the biology of genomes. It alternates between the Wellcome Trust conference center in Hinxton, UK and Cold Spring Harbor Laboratory, NY, USA. Last year was the turn of Hinxton, so I went along, as I had the previous two times it was held in the UK.
The two keynote presentations were from Katie Pollard (University of California San Francisco, USA) and Rafael Irizarry (Dana-Farber Cancer Institute, Boston, USA). Pollard discussed the use of machine learning in genomics research, and in particular the problems that can arise. She pointed out that you shouldn’t use balanced training data if the problem you are looking at is very unbalanced (i.e. few positives and many negatives, as when identifying promoter sequences). She also noted that many machine learning models assume that data are independent and identically distributed, which is very much not the case with genomics data. Nevertheless, even when the assumptions of the model are violated, useful results can still be obtained.
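To illustrate why class balance matters, here is a minimal sketch (my own, not taken from Pollard’s talk) of how the precision of a classifier changes with the fraction of true positives in the data. The function name and the sensitivity, specificity and prevalence values are hypothetical, chosen only for illustration.

```python
def precision_at_prevalence(sensitivity, specificity, prevalence):
    """Expected precision (positive predictive value) from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical classifier: 95% sensitivity and 95% specificity.
# Evaluated on a balanced set (half positives, half negatives):
balanced = precision_at_prevalence(0.95, 0.95, 0.5)

# Evaluated where only 1 in 1,000 candidates is a true positive
# (an assumed figure, standing in for a rare-feature problem):
realistic = precision_at_prevalence(0.95, 0.95, 0.001)

print(f"precision on balanced data:   {balanced:.3f}")
print(f"precision at 0.1% prevalence: {realistic:.3f}")
```

Under these assumed numbers, the same classifier that looks excellent on balanced data returns mostly false positives once positives are rare, which is one reason a balanced benchmark can be misleading for an unbalanced problem.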
Irizarry’s talk also dealt with problems in analysis, and why you shouldn’t just blindly trust the results you get. Sometimes you can get a good idea of whether your results are plausible just by eyeballing the data, a point that came up in many talks. Irizarry gave an example of a study which reported that a quarter of genes expressed in blood were differentially expressed between two human populations. This seemed implausibly high, so he looked into it and found a batch effect caused by the two populations having been sampled in two separate projects.
At previous editions of this conference, attendees have told me how it has changed since it first started: there are now more talks discussing the biology revealed by the informatics rather than the informatics methods themselves. This iteration was no different, with several talks about analyzing large numbers of cancer genomes to find variants, or large cohorts of personal genomes to find variants associated with developmental disorders. Going beyond identifying variants associated with a condition, Sri Kosuri (University of California Los Angeles, USA) talked about experiments in which he tested thousands of SNPs for their effects on splicing in a reporter gene construct.
One biology talk that I found particularly interesting was from Lucia Spangenberg (Institut Pasteur de Montevideo, Uruguay), who has been attempting to reconstruct the genome of the Charruas, the indigenous people of Uruguay who were exterminated in the 19th century. Spangenberg found that the genomes of ten modern-day Uruguayans between them contain enough Charruan DNA to reconstruct 99% of the Charruan genome. In general, people’s native genetic ancestry was higher than their self-reported native identity would suggest.
Several talks discussed how modern techniques, such as long-read sequencing from Pacific Biosciences, linked reads from 10x Genomics, and genome contact information from Hi-C, can be used to improve genome assemblies. This was shown in a variety of systems: birds (Alexander Suh, Uppsala University, Sweden), donkeys (Nikka Keivanfar, 10x Genomics, USA), and moss (Sarah Carey, University of Florida, USA). Jeffrey Kidd (University of Michigan, USA) showed that PacBio sequencing can be used to produce a reference genome for the dog that is more complete than the original one sequenced using Sanger technology.
One trend that particularly intrigued us at Genome Biology was the increased number of methods for representing genomes in a graph format, with variants shown as alternative branches, rather than the traditional linear reference representation. This was described for both prokaryotic genomes (Rachel Colquhoun, Oxford University, UK) and eukaryotic genomes (Prithicka Sritharan, Quadram Institute Bioscience, UK). We found this interesting, as we have been discussing this for a while, and have just issued a call for papers for an article collection on graph genomes.
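For readers unfamiliar with the idea, a toy sketch of a graph genome might look like the following. It is my own illustration, not the data structure used by any of the speakers: the node identifiers, sequences and the `spell_paths` helper are all hypothetical. Each node holds a stretch of sequence, a SNP appears as two alternative branches, and each path through the graph spells out one possible haplotype.

```python
# Hypothetical toy graph: node id -> sequence segment.
segments = {1: "ACGT", 2: "A", 3: "G", 4: "TTCA"}

# Directed edges: node id -> list of successor node ids.
# Nodes 2 and 3 are the two alleles of a single SNP.
edges = {1: [2, 3], 2: [4], 3: [4], 4: []}


def spell_paths(node, segments, edges):
    """Yield every sequence spelled by a path from `node` to a sink."""
    if not edges[node]:
        # Sink node: the path ends here.
        yield segments[node]
        return
    for successor in edges[node]:
        for suffix in spell_paths(successor, segments, edges):
            yield segments[node] + suffix


haplotypes = sorted(spell_paths(1, segments, edges))
for haplotype in haplotypes:
    print(haplotype)
```

A linear reference would store only one of these two sequences and record the other as a difference from it; the graph keeps both on an equal footing.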
I am planning on attending this year’s Genome Informatics conference in Cold Spring Harbor, and it will be fascinating to see how the different location, with a different set of delegates, affects the feel and focus of the conference. However it differs, I predict it will be just as fascinating as last year’s conference.