Saturday, August 14, 2010

Identifying Copy Number Variations based on Next Generation Sequencing Data by a Mixture of Poisson Model

Next generation sequencing (NGS) technologies have profoundly impacted biological research and are becoming more and more popular due to cost effectiveness and their speed. NGS can be utilized to identify DNA structural variants, namely copy number variations (CNVs) which showed association with diseases like HIV, diabetes II, or cancer.There have been first approaches to detect CNVs in NGS data, where most of them detect a CNV by a significant difference of read counts within neighboring windows at the chromosome. However these methods suffer from systematical variations of the underlying read count distributions along the chromosome due to biological and technical noise. In contrast to these global methods, we locally model the read count distribution characteristics by a mixture of Poissons which allows to incorporate a linear dependence between copy numbers and read counts. Model selection is performed in a Bayesian framework by maximizing the posterior through an EM algorithm. We define a CNV call which indicates a deviation of the Poisson mixture parameters from the null hypothesis represented by the prior which is a model for constant copy number across the samples. A CNV call requires sufficient information in the data to push the model away from the null hypothesis given by the prior.We test our approach on the HapMap cohort where we rediscovered previously found CNVs which validates our approach. It is then tested on the tumor genome data set where we are able to considerably increase the detection while reducing the false discoveries.

(this Post content was reproduced from: http://dx.doi.org/10.1038/npre.2010.4716.1, Via Browsing Bioinformatics : Nature Precedings.)