The main outcome of the Human Genome Project is the human reference assembly, a resource intended to be used by everyone. The human reference assembly is a linear sequence compiled from the collapsed genomes of 50 individuals that has, with the development of new sequencing technologies, been improved upon over the last decade.
Clearly, a reference assembly built this way does not represent the whole of humanity, nor does it attempt to. To have a reference that is more thorough, that better represents the many world populations, and that accurately depicts the variation between humans, we would need to change how the reference is structured in the first place.
And this is what the Genome Reference Consortium (GRC) has been working on since its early days: it proposed that alternative sequences be included in the reference to reflect the complexity of variation between humans. This change was already visible in the previous release of the reference, GRCh37. However, the full scope of this change only became apparent when GRCh38 was released about a year ago.
So what is the new reference GRCh38 like? Well, to start with, it is not linear, and it does not represent a haploid or diploid version of the genome. It is a collection of multiple sequences: more of them for some regions of the genome, and far fewer for others. It contains much more information than any previous release of the human reference.
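To make the idea of "more than one sequence per region" concrete, here is a minimal Python sketch. It is only an illustration, not how the GRC stores the assembly: the class names, the sequence name `chrToy`, and the toy bases are all hypothetical. It assumes each alternative sequence is anchored to a start and end coordinate on a primary sequence.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AltLocus:
    """One alternative sequence for a region (hypothetical toy model)."""
    name: str       # hypothetical name of the alternative sequence
    chrom: str      # primary sequence the alternative is anchored to
    start: int      # 0-based start of the region on the primary sequence
    end: int        # exclusive end of the region on the primary sequence
    sequence: str   # bases of the alternative representation


@dataclass
class Reference:
    """A primary sequence per chromosome plus any number of alternatives."""
    primary: Dict[str, str]
    alts: List[AltLocus] = field(default_factory=list)

    def representations(self, chrom: str, start: int, end: int) -> List[str]:
        """Return every sequence the reference offers for chrom[start:end]."""
        seqs = [self.primary[chrom][start:end]]
        for alt in self.alts:
            # Keep alternatives whose anchored region overlaps the query.
            if alt.chrom == chrom and alt.start < end and start < alt.end:
                seqs.append(alt.sequence)
        return seqs


# Toy data: one 20-base "chromosome" with a single alternative for bases 4-12.
ref = Reference(
    primary={"chrToy": "ACGTACGTACGTACGTACGT"},
    alts=[AltLocus("chrToy_alt1", "chrToy", 4, 12, "ACGAACGT")],
)

print(ref.representations("chrToy", 4, 12))   # two representations
print(ref.representations("chrToy", 14, 18))  # primary sequence only
```

In this toy model, some regions return a single sequence and others return several, which is the uneven, non-linear shape described above.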
As we gain understanding of biological systems, we must update the models we use to represent these data.
Deanna Church and colleagues, Genome Biology 2015, 16:13
As it turns out, however, this may not necessarily be a good thing (not in the short run anyway). As Deanna Church, Richard Durbin, Paul Flicek and their many colleagues from the GRC explain in a Comment in Genome Biology, many of the bioinformatics tools designed for aligning sequences to the human reference genome and for downstream analysis are limited in how they perform quality control of the sequenced DNA. They simply cannot handle the possibility of multiple reference sequences being available for the same region of the genome.
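A toy Python sketch can show why multiple sequences for one region trip up such tools. It is a deliberately simplified illustration, not the behaviour of any real aligner: matching is exact, the names `chrToy` and `chrToy_alt1` are hypothetical, and the alternative sequence is assumed to line up base-for-base with the primary one from a known offset (no insertions or deletions).

```python
def find_hits(read, sequences):
    """Return (sequence name, position) for every exact match of the read."""
    hits = []
    for name, seq in sequences.items():
        pos = seq.find(read)
        while pos != -1:
            hits.append((name, pos))
            pos = seq.find(read, pos + 1)
    return hits


def naive_is_unique(hits):
    """Treat every sequence as an independent location in the genome."""
    return len(hits) == 1


def alt_aware_is_unique(hits, alt_anchor):
    """Project hits on alternative sequences back to primary coordinates.

    alt_anchor maps an alternative's name to (primary name, offset), an
    assumption of this sketch that only holds without insertions or deletions.
    """
    loci = set()
    for name, pos in hits:
        if name in alt_anchor:
            chrom, offset = alt_anchor[name]
            loci.add((chrom, offset + pos))
        else:
            loci.add((name, pos))
    return len(loci) == 1


# Toy reference: the alternative covers primary bases 4-16 and differs by one base.
sequences = {
    "chrToy": "TTTTACGTGGCATCAAAA",
    "chrToy_alt1": "ACGTGGCTTCAA",
}
alt_anchor = {"chrToy_alt1": ("chrToy", 4)}

hits = find_hits("ACGTGG", sequences)
print(hits)
print(naive_is_unique(hits))
print(alt_aware_is_unique(hits, alt_anchor))
```

The read comes from a single place in the genome, yet the naive check sees two placements and calls it ambiguous. Only a tool that knows how the alternative relates to the primary sequence can recognise both hits as the same locus.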
What does it mean for us? Primarily, it means a lot of work for bioinformaticians, who will have to either adapt existing tools or develop new ones that can make proper use of the new reference assembly. In the long run this effort should pay off: it is bound to drive further advances in the available tools and, more importantly, once we are able to fully explore the potential of the new genome reference, the sky is the limit for what we can discover.