Tuesday, January 7, 2014

Fwd: Diminishing Return for Increased Mappability with Longer Sequencing Reads: Implications of the k-mer Distributions in the Human Genome


Diminishing Return for Increased Mappability with Longer Sequencing Reads:
Implications of the k-mer Distributions in the Human Genome
: "Background:
The amount of non-unique sequence (non-singletons) in a genome directly affects the difficulty of read alignment to a reference assembly for high-throughput sequencing data. Although a longer read is more likely to be uniquely mapped to the reference genome, a quantitative analysis of the influence of read lengths on mappability has been lacking. To address this question, we evaluate the k-mer distribution of the human reference genome. The k-mer frequency is determined for k ranging from 20 bp to 1000 bp.
Results:
We observe that the proportion of non-singleton k-mers decreases slowly with increasing k, and can be fitted by piecewise power-law functions with different exponents at different ranges of k. A slower decay at greater values of k indicates more limited gains in mappability for read lengths between 200 bp and 1000 bp. The frequency distributions of k-mers exhibit long tails with a power-law-like trend, and rank-frequency plots exhibit a concave Zipf's curve. The most frequent 1000-mers comprise 172 regions, which include four large stretches on chromosomes 1 and X, containing genes of biomedical relevance. Comparison with other databases indicates that the 172 regions can be broadly classified into two types: those containing LINE transposable elements and those containing segmental duplications.
Conclusion:
Read mappability as measured by the proportion of singletons increases steadily up to the length scale around 200 bp. When read length increases above 200 bp, smaller gains in mappability are expected. Moreover, the proportion of non-singletons decreases with read length much more slowly than linearly. Even a read length of 1000 bp would not allow the unique alignment of reads for many coding regions of human genes. A mix of techniques will be needed for efficiently producing"

(Via BMC Bioinformatics - Latest articles.)
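
To make the quantities in the abstract concrete, here is a minimal Python sketch of the two measurements it describes: the proportion of non-singleton k-mers for a given k, and a power-law exponent estimated from how that proportion changes with k. It is an illustration, not the authors' pipeline. The function names are hypothetical. The "proportion" is assumed here to be the fraction of k-mer positions whose k-mer occurs more than once, and only the forward strand is counted. The exponent is a plain least-squares slope in log-log space over a single range of k, whereas the abstract reports a piecewise fit.

```python
import math
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer in `sequence` (forward strand only)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def non_singleton_fraction(sequence, k):
    """Fraction of k-mer positions whose k-mer occurs more than once.

    Assumption: the proportion is taken over positions, not over
    distinct k-mers; the abstract does not spell out which is meant.
    """
    counts = count_kmers(sequence, k)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total


def fit_power_law_exponent(ks, fractions):
    """Least-squares slope of log(fraction) against log(k).

    If fraction ~ C * k**a over the given range, the slope estimates a.
    Points with a zero fraction are skipped because log(0) is undefined.
    """
    points = [(math.log(k), math.log(f)) for k, f in zip(ks, fractions) if f > 0]
    if len(points) < 2:
        raise ValueError("need at least two points with a non-zero fraction")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    if sxx == 0:
        raise ValueError("all k values are identical")
    return sxy / sxx


if __name__ == "__main__":
    toy = "ACGTACGTTT"  # toy sequence, not real genomic data
    print(non_singleton_fraction(toy, 4))
    print(fit_power_law_exponent([1, 10, 100], [1.0, 0.1, 0.01]))
```

A piecewise fit in the spirit of the abstract could call `fit_power_law_exponent` separately on each range of k, for example below and above 200 bp. Holding every k-mer in an in-memory `Counter` is only practical for small sequences; a whole-genome run at k up to 1000 would need a different counting strategy.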