Sunday, February 27, 2011

Maybe we have to sequence everybody! Every fish! BGI Cloud

Bio-IT World ran an interesting article with this quote:


“The data are growing so fast, the biologists have no idea how to handle this data,” says Li. “I think the Cloud will be the solution. We have to sequence more and more data. Maybe we have to sequence everybody! Every fish! The data keep growing and we need a lot of compute power to process.”
For Chen, there are three priorities for BGI Cloud:
  • Connectivity: With partners across China and the world, “we’ve connected all the people and resources—the sequencers, the samples, the ideas, the compute power, and the storage together to make a greater contribution.”
  • Scalability: Calling the explosion in next-gen sequencing (NGS) a “data tsunami,” Chen says BGI aims to provide the parallel computing resources to help users manage and process these datasets. “If you can’t do the analysis, it’s pointless. We use distributed computing technology in the bioinformatics area. We’re confident we can solve the scalability problem.”
  • Reproducibility: Chen says bioinformatics researchers are happy to show their data and their pet program—SOAP, BWA, and so on. “That’s fine. But analysis is very complicated. The methodology he is actually using is a homemade pipeline. It’s very difficult to reproduce that result. We built this platform not only to solve the capability and connectivity of computing, we want to resolve the problems in reproducing designs and procedures.”
With new NGS gene assembly and SNP calling programs such as Hecate and Gaea about to be released (see, “In the Name of Gods”), Li says it was essential to develop a “run-time environment, a Web-based platform for Cloud storage and reference data, with a feature-rich GUI, and effective bioinformatics analysis software.”


Kevin: It will be interesting to see how Amazon and other cloud providers, together with Galaxy (usegalaxy.org), respond to BGI's offering for reproducible data analysis (commercial software providers aside). The offering also comes at a strange time, just as NCBI is discontinuing SRA. Might BGI Cloud fill the void that SRA leaves?
Everyone is trying to come up with a 'standard' workflow that everyone else will adopt, but I feel the ecology of bioinformatics is such that there is always another 'better' way to tweak an analysis. 'Custom analysis' is a pet phrase of a lot of bench biologists.
Every bioinformatician will remember their treasure trove of throwaway scripts that worked beautifully, but only once, for one set of data.
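
For what it's worth, even a throwaway script gets easier to rerun if it leaves a small record of what went into it. Here is a minimal sketch of that idea in Python. It has nothing to do with how BGI Cloud or Galaxy actually track provenance; the function names (sha256_of, build_manifest, write_manifest) and the manifest layout are my own hypothetical choices for illustration.

```python
import hashlib
import json
import platform
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(input_paths, params, tool_versions):
    """Describe one analysis run: which files went in, with which settings.

    The layout of this dictionary is a hypothetical example, not a standard.
    """
    return {
        # Checksums pin the exact input files, not just their names.
        "inputs": {p: sha256_of(p) for p in sorted(map(str, input_paths))},
        "params": dict(params),
        # Tool versions are supplied by the caller; nothing is detected here.
        "tools": dict(tool_versions),
        "python": platform.python_version(),
    }


def write_manifest(manifest, out_path):
    """Write the manifest as sorted, indented JSON so diffs stay readable."""
    text = json.dumps(manifest, indent=2, sort_keys=True) + "\n"
    Path(out_path).write_text(text)
```

Calling build_manifest at the top of a script and write_manifest next to the results is enough to tell, months later, whether a rerun used the same inputs and settings. It does not capture the analysis logic itself, which is the harder part of the reproducibility problem Chen describes.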