Saturday, September 22, 2012

Fwd: Vinay's bioinformatics blog: How to estimate insert size for paired-end RNA-Seq data

Fwd: please follow footer link
I searched long before finally finding this great post.


How to estimate insert size for paired-end RNA-Seq data
Reference genome alignment high-throughput RNA-seq reads generated from "paired-end" library often requires "mate-pair insert length" and associated standard deviation for alignment tools such as BowTie, TopHat or BWA to work properly. These values can be obtained from the library prep that was used for the sequencing. But these values are not always accompanied with the data in public databases hence need to be estimated computationally. Here are some simple steps:

Required sowftare: BowTie or BWA, Picard tools.
Reference transcriptome: Reference transcript sequences from the same species as the RNA-Seq data is from. Instead of reference transcript, reference genome can also be used. I prefer reference transcriptome since it saves a magnitude of alignment as well as downstream processing time over the whole genome and also generates more alignments as splice junction reads are easily aligned to the transcript sequence than the reference genome sequence.

Steps:

1.) Create reference transcriptome index for BowTie (or BWA).

2.) Align paired-end RNA-Seq data to the reference transcripts using minimum insert-length as zero and maximum insert-length as 500 (or more).
BowTie: -I 0 -X 500, BWA: -a 500. Set output format as SAM(BowTie) or BAM(BWA).

3.) Sort resulting SAM/BAM file using Picard's SortSam.jar utility as follows:
java -Xmx2g -jar picard-tools/SortSam.jar INPUT= OUTPUT= SORT_ORDER=coordinate

4.) Now run Picard's CollectInsertSizeMetrics.jar utility as follows:
java -Xmx2g -jar picard-tools/CollectInsertSizeMetrics.jar INPUT= HISTOGRAM_FILE= OUTPUT=

5.) Insert-size distribution is printed in output_file while corresponding histogram image file is generated as pdf in histogram_file.

6.) Insert-matrix output_file contain estimated insert-size and corresponding standard deviation. Picard generates two types estimates: "MEDIAN_INSERT_SIZE" and "MEAN_INSERT_SIZE". Anyone of them can be used but I prefer the second one since it is estimated after trimming off the outliers in from the original insert-size distribution in order to render normal distribution.

Note: If you plan to run TopHat on the same data afterwards, set -r as (estimated insert-size-2*read_length) since TopHat requires distance between mate-pairs and not the insert-size.


(Source: Posted by vinay mittal at 12:18 PM Vinay's bioinformatics blog.)