David Kelley and Steven Salzberg at the University of Maryland have developed a pipeline to correct misassemblies caused by false duplications, in a study published today in Genome Biology. Diploid genomes harbour a significant amount of variation between homologous chromosomes. This causes problems for genome assembly algorithms, which may construct two DNA sequences corresponding to one divergent region and incorporate both into an assembly as a false segmental duplication.
Their approach is to align DNA sequence fragments to the surrounding sequence and use mate pair information to determine whether duplicated segments should be merged into one copy. Mate pairs are two sequence reads taken from opposite ends of the same DNA fragment, so the approximate distance between them is known, and this study is the first to use mate pair reads to detect duplications. Kelley and Salzberg apply their pipeline to the cow, chimpanzee, dog and chicken genomes and identify many single-copy regions that have been falsely incorporated as segmental duplications in these assemblies. Resolving these regions also allows previously undetected polymorphisms to be identified. This promises to be a valuable method for correcting existing errors in many genome assemblies and for controlling misassembly errors in future genome sequencing efforts.
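To give a feel for how mate pair distances can flag a false duplication, here is a minimal Python sketch. It is not the authors' pipeline: the function name `supports_merge`, its parameters, and the numbers in the example are all hypothetical, and it assumes the simple case of two adjacent copies where mate pairs straddle the duplicated region. If the duplication is false, those mate pairs appear stretched in the assembly by roughly the length of the extra copy, and removing one copy brings them back to the expected insert size.

```python
from statistics import mean


def supports_merge(spans, copy_length, expected_insert, tolerance):
    """Return True if mate pair spans fit better once one copy is removed.

    spans           -- observed distances (bp) between mates that straddle
                       the putative duplication in the current assembly
    copy_length     -- length (bp) of one duplicated segment
    expected_insert -- expected mate pair distance (bp) for the library
    tolerance       -- largest mean deviation (bp) still considered a fit

    All names and thresholds here are illustrative assumptions.
    """
    if not spans:
        return False  # no evidence either way

    # Mean deviation from the expected insert size, as currently assembled.
    as_assembled = mean(abs(s - expected_insert) for s in spans)

    # Mean deviation if one copy were removed, shortening every span.
    after_merge = mean(abs(s - copy_length - expected_insert) for s in spans)

    return after_merge < as_assembled and after_merge <= tolerance


# Hypothetical example: a 3 kb library spanning a 2 kb duplicated segment.
stretched = [5100, 4950, 5050]   # spans look ~2 kb too long
consistent = [3000, 2900, 3100]  # spans already match the library

print(supports_merge(stretched, 2000, 3000, 300))
print(supports_merge(consistent, 2000, 3000, 300))
```

In this toy setting the stretched spans argue for merging the two copies, while the consistent spans argue that the duplication is real. A real analysis would also need to account for read orientation, the spread of insert sizes within a library, and ambiguous read placements.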