Saturday, August 27, 2011

Blue Collar Bioinformatics

Blue Collar Bioinformatics: Distributed exome analysis pipeline with CloudBioLinux and CloudMan

(read more in the original post; link below)

A major challenge in building analysis pipelines for next-generation sequencing data is combining a large number of processing steps in a flexible, scalable manner. Current best-practice software needs to be installed and configured alongside the custom code that chains the individual programs together. Scaling to handle increasing throughput requires running that custom code on a wide variety of parallel architectures, from single multicore machines to heterogeneous clusters.

Establishing community resources that meet the challenges of building these pipelines lets bioinformatics programmers share the burden of building large-scale systems. Two open-source efforts that aim to provide this type of architecture are:

CloudBioLinux — A community effort to create shared images filled with bioinformatics software and libraries, using an automated build environment.

CloudMan — Uses CloudBioLinux as a platform to build a full SGE cluster environment. Written by Enis Afgan and the Galaxy Team, CloudMan is used to provide a ready-to-run, dynamically scalable version of Galaxy on Amazon AWS.

Here we combine CloudBioLinux software with a CloudMan SGE cluster to build a fully automated pipeline for processing high-throughput exome sequencing data:

The underlying analysis software is from CloudBioLinux.
CloudMan provides an SGE cluster managed via a web front end.
RabbitMQ is used for communication between cluster nodes.
An automated pipeline, written in Python, organizes parallel processing across the cluster.

Below are instructions for starting a cluster on Amazon EC2 resources to run an exome sequencing pipeline that processes FASTQ sequencing reads, producing fully annotated variant calls.
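
As a rough illustration of the architecture listed above (not the pipeline's actual code), the sketch below shows how a Python driver might hand per-sample work to several workers through a queue and collect the results. An in-process `queue.Queue` and threads stand in for RabbitMQ and the SGE cluster nodes. The function names, stage names, and sample file names are all hypothetical, chosen only for the example.

```python
import queue
import threading

# Hypothetical stage names; the real pipeline's steps are not spelled out here.
STAGES = ("align", "call_variants", "annotate")


def process_sample(fastq_name):
    """Pretend to run each stage on one FASTQ file and return a summary."""
    return {"sample": fastq_name, "stages_done": list(STAGES)}


def worker(tasks, results):
    """Stand-in for a cluster node: take tasks until a None sentinel arrives."""
    while True:
        fastq_name = tasks.get()
        if fastq_name is None:
            break
        results.put(process_sample(fastq_name))


def run_pipeline(fastq_names, num_workers=2):
    """Distribute samples to workers through a queue and collect the results."""
    tasks = queue.Queue()    # stand-in for the message broker
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for name in fastq_names:
        tasks.put(name)
    for _ in threads:
        tasks.put(None)      # one sentinel per worker so each one exits
    for t in threads:
        t.join()
    collected = [results.get() for _ in fastq_names]
    # Sort so the output order does not depend on thread scheduling.
    return sorted(collected, key=lambda r: r["sample"])


if __name__ == "__main__":
    for summary in run_pipeline(["sample1.fastq", "sample2.fastq"]):
        print(summary["sample"], "->", ", ".join(summary["stages_done"]))
```

In a real deployment the queue would be replaced by a message broker and the threads by jobs on cluster nodes, but the shape stays the same: one driver enqueues independent per-sample tasks, and any free worker picks up the next one.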
