As next-generation sequencing methods quickly become ubiquitous tools of genomics, more and more effort is being directed at understanding the limitations of these approaches. These limitations quite often present themselves in the form of coverage biases.
Last year Genome Biology published a study from David Jaffe and colleagues that looked at coverage biases in DNA sequencing. The authors used a suite of computational tools for bias assessment and applied them to a number of commonly used technologies. It turned out that, for instance, PacBio coverage is the least biased, and that high- and low-GC regions and long homopolymer runs are very prone to coverage biases. They emphasized that the presence of such biases may lead to undersampling of certain regions of the genome in sequencing projects.
No less important is the bias issue for RNA sequencing data. A number of bioinformatic approaches for assessing – and correcting – this issue have recently been described (for example, here and here). While these computational studies identify some sources of bias, such as PCR enrichment, read errors in the sequencing-by-synthesis reaction and rRNA depletion protocols, they cannot distinguish between coverage aberrations caused by the technology and those caused by the very nature of the sample.
This week Genome Biology publishes a study by John Hogenesch and colleagues that addresses this last point in an elaborate yet elegant way. Hogenesch et al. introduce a new method: in vitro transcribed RNA sequencing (IVT RNA-seq). They create a pool of over 1000 in vitro transcribed (IVT) RNAs from a human cDNA library, and sequence these RNAs with the two most common protocols, poly-A and total RNA-seq. This method ensures that each transcript is – on a sequencing level – equally represented. Any visible differences in coverage therefore have to come from the library preparation steps.
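To make that logic concrete: if the input for each IVT transcript is known and uniform, read depth along the transcript would ideally be flat, so unevenness can be summarised with a simple statistic. The snippet below is a minimal sketch of that idea, not the metric used in the study. It assumes per-base read depths have already been computed for each transcript; the function names, the coefficient-of-variation measure, the 0.5 threshold and the toy numbers are all illustrative assumptions.

```python
from statistics import mean, pstdev


def coverage_cv(depths):
    """Coefficient of variation (std / mean) of per-base read depth."""
    avg = mean(depths)
    if avg == 0:
        return float("nan")  # no reads at all: unevenness is undefined
    return pstdev(depths) / avg


def flag_uneven(transcripts, threshold=0.5):
    """Names of transcripts whose depth CV exceeds an (arbitrary) threshold."""
    return sorted(
        name
        for name, depths in transcripts.items()
        if coverage_cv(depths) > threshold
    )


# Toy, made-up depth profiles (one value per base position).
example = {
    "flat": [100, 102, 98, 101, 99],    # near-uniform coverage
    "dropout": [100, 100, 5, 5, 100],   # a region with almost no reads
}

print(flag_uneven(example))  # ['dropout']
```

With real data, a profile like "dropout" in a transcript whose input was uniform would point to a step in library preparation rather than to the biology of the sample.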
The authors find that as many as one in ten transcripts displays a substantial difference in within-transcript sequence coverage. Additionally, more than 5% of all transcripts contain regions of very unpredictable coverage, with huge between-sample differences. The authors show that rRNA depletion is the main culprit. Finally, they note that RNA-seq coverage for one species can depend on contamination with RNA from another species (an observation worth bearing in mind in light of the recent discovery that 7% of the data from the 1000 Genomes Project is contaminated with Mycoplasma sequences).
All is not doom and gloom, though. The authors hope that IVT RNA-seq will enable better characterization of RNA-seq coverage biases in many mammalian studies – and similar methods should be able to do this for other species. The approach should also be helpful for benchmarking new sequencing protocols, thus making RNA-seq technology even better and more powerful than it already is.