CGAL: a new metric for assessing genome assembly quality

So you have just spent the last couple of years on a project: using shiny brand-new machines to sequence the most complex genomes on Earth. You dotted your ‘i’s, crossed the ‘t’s, identified all ‘g’s and ‘c’s. From the (still growing) range of available assemblers, you picked the one you thought best. And you ended up with an assembly that might be perfect, but might just as well be a disaster waiting to happen.

Both genome assemblies and assemblers can be assessed using a number of different quality metrics. For a long time, N50 was the leading metric used for that purpose. Although N50 scaffold and contig lengths usually correlate with assembly quality, the measure itself can be terribly misleading: a badly assembled genome with the reads simply stitched together can give very high N50 values and yet be utterly useless.
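To make the weakness concrete, here is a minimal sketch of the usual N50 definition (the length of the contig at which the running total, taken from longest to shortest, first reaches half of all assembled bases). The contig lengths are made-up numbers chosen for illustration, and `n50` is our own helper rather than a function from any particular toolkit.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running total to half of all
        # assembled bases defines the N50.
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


# Hypothetical contig lengths (in bp) from a cautious, fragmented assembly.
careful = [400, 300, 200, 100, 50, 50]

# The same bases chained end to end into a single contig, with no regard
# for whether the joins are correct.
stitched = [sum(careful)]

careful_n50 = n50(careful)
stitched_n50 = n50(stitched)
print(careful_n50, stitched_n50)
```

The stitched assembly wins on N50 by a wide margin, even though nothing in the calculation checks whether a single join is right. The metric sees lengths only, never sequence.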

No surprise then that in 2011, after the experience of the first Assemblathon, Ian Korf suggested that N50 must die (with a question mark loaded with meaning). During the Assemblathon, a number of different metrics were used alongside N50. Unlike N50 though, many of these require some knowledge about the genome that is being assembled. And this is the problem that Atif Rahman and Lior Pachter, from Berkeley, encountered: you could either use the imperfect N50 metric, which didn’t require a priori knowledge of your genome, or you could use metrics which were more accurate, but which left you doomed if the species you worked on didn’t have at least a close relative sequenced already.

Source: JJ Harrison (CC BY 3.0)

In the January issue of Genome Biology, Rahman and Pachter publish a Method article describing a new metric for assessing genome assembly quality: a likelihood-based approach that doesn’t require prior knowledge of the genome. The metric’s implementation, swankily named ‘CGAL’, is available from the authors’ website. The metric was used to assess the performance of four assemblers applied to a few different datasets. The authors found that the likelihood metric accurately reflects sequence similarity, which is often missed by other metrics.
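For a rough feel of what "likelihood-based" can mean here, the toy sketch below scores an assembly by how probable the observed reads are if that assembly were the true genome. It is emphatically not CGAL's actual model, which this post does not describe; it assumes reads start uniformly at random along the assembly and that each base is miscalled independently with a hypothetical `error_rate`. The sequences, reads and function names are all invented for illustration.

```python
import math


def read_log_likelihood(read, assembly, error_rate=0.01):
    """Log-probability of one read under a toy uniform-sampling model."""
    n_positions = len(assembly) - len(read) + 1
    if n_positions <= 0:
        return float("-inf")
    total = 0.0
    for start in range(n_positions):
        window = assembly[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        matches = len(read) - mismatches
        # Probability of reading this window as `read`: correct bases with
        # probability (1 - e), each specific miscall with probability e / 3.
        total += (1 - error_rate) ** matches * (error_rate / 3) ** mismatches
    # Every start position is assumed equally likely.
    return math.log(total / n_positions)


def assembly_log_likelihood(reads, assembly, error_rate=0.01):
    """Sum of per-read log-likelihoods; higher means the reads fit better."""
    return sum(read_log_likelihood(r, assembly, error_rate) for r in reads)


# Invented example: four error-free reads drawn from a short "true" genome.
reads = ["ACGTT", "TTGCA", "CAAGG", "GGCTA"]
good_assembly = "ACGTTGCAAGGCTA"  # matches the genome the reads came from
bad_assembly = "AAGGCTAACGTTGC"   # same bases, but mis-joined

good_score = assembly_log_likelihood(reads, good_assembly)
bad_score = assembly_log_likelihood(reads, bad_assembly)
print(good_score, bad_score)
```

Both toy assemblies have the same length and base composition, yet the mis-joined one scores lower because two of the reads no longer fit anywhere without errors. No reference genome is consulted at any point, only the reads and the candidate assembly.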

Assembler comparisons aside though, we hope that CGAL will become a useful tool for all researchers working at the forefront of the genome assembly field, sequencing away where no other researcher has sequenced before, and will help them evaluate how good their new assemblies are.