Wednesday, May 5, 2010

Guest Post: CoGe, The Suite for Comparative Genomics – Eric Lyons

This next post in our continuing semi-regular Guest Post series is from Eric Lyons, of CoGe at the University of California, Berkeley. If you are a provider of a free, publicly available genomics tool, database or resource and would like to convey something to users on our guest post feature, please feel free to contact us at wlathe AT openhelix DOT com.

Thanks both for the prior CoGe post (editor’s note: a tip of the week on CoGe) and the invitation to write a bit about CoGe. Since most people are probably not familiar with CoGe, let me begin with how it is designed:

CoGe’s architecture and philosophy:  Solve a problem once

CoGe is a web-based platform for comparative genomics and consists of many interconnected web-based tools. The entire system is hooked up to a database that can store any version of any genome in any state of assembly from any organism (currently ~9000 genomes from ~8000 organisms). Each of CoGe’s tools is designed to do one task (e.g. search and display information about a genome, compare two genomes and generate syntenic dotplots, search any number of genomes for similar sequence, manage a list of genes, etc.), and the tools are linked to one another. This means that there is no predefined analysis workflow. Instead, people can begin exploring a genome of interest, compare it to whatever they want, find something interesting, explore that, find something else, explore that, and so on. People anywhere in the world can perform computationally intense analyses by clicking a few buttons on a web page and letting our servers crunch away on whatever genomes we currently have loaded in our system. Since each tool is web-based, links are used to move from tool to tool, which creates an easy way to save an analysis for future work or to send it to a colleague. This also has the benefit that as we develop new tools to solve a specific problem, we can generalize the solution, plug it into CoGe’s database, and connect it to the pre-existing tool set. Overall, this gives us an easy way to expand CoGe’s functionality.
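To make the idea of one shared store for many genome versions a little more concrete, here is a minimal sketch in Python using SQLite. It is not CoGe’s actual schema: the table names, column names, and helper functions are all made up for illustration, and the real database certainly tracks far more than this.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only,
# not CoGe's actual database design.
SCHEMA = """
CREATE TABLE organism (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE genome (
    id          INTEGER PRIMARY KEY,
    organism_id INTEGER NOT NULL REFERENCES organism(id),
    version     TEXT NOT NULL,
    assembly    TEXT NOT NULL,  -- e.g. 'contigs', 'scaffolds', 'chromosomes'
    UNIQUE (organism_id, version)
);
"""


def open_store():
    """Create an in-memory database holding every genome in one place."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_genome(conn, organism, version, assembly):
    """Register a genome version; the organism row is created on first use."""
    conn.execute("INSERT OR IGNORE INTO organism (name) VALUES (?)", (organism,))
    (org_id,) = conn.execute(
        "SELECT id FROM organism WHERE name = ?", (organism,)
    ).fetchone()
    cur = conn.execute(
        "INSERT INTO genome (organism_id, version, assembly) VALUES (?, ?, ?)",
        (org_id, version, assembly),
    )
    return cur.lastrowid


def genomes_for(conn, organism):
    """List (version, assembly) pairs for one organism, in load order."""
    rows = conn.execute(
        "SELECT g.version, g.assembly FROM genome g "
        "JOIN organism o ON o.id = g.organism_id "
        "WHERE o.name = ? ORDER BY g.id",
        (organism,),
    )
    return rows.fetchall()
```

The point of the sketch is only that loading a new genome, or a new version of an old one, is a matter of adding rows rather than standing up another database.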

This paradigm has its roots in CoGe’s genomes database. Many genomics systems require a separate database for each genome being used, which usually means significant administrative overhead (aka headache) when a new genome is deployed. Instead, CoGe’s database is designed to store any number of genomes simultaneously. That way, as new genomes are released, they just get loaded into the database and immediately become available for analysis. For those interested, if there is a genome you’d like to see loaded into CoGe, just send me an e-mail with some necessary information. I have a variety of programs to load genomes into CoGe. Some places, like NCBI, have a consistent data format (GenBank record) for their data, and I have programs that periodically run through their system to check for new genomes or new versions of genomes and load them automatically. Other places, like JGI, use GFF3 for their annotations, but frequently change exactly what identifiers they use in the file to denote feature types (e.g. “3′” versus “three-prime”), names and annotations. For those places, it takes a bit more time to figure out what’s important, but I can usually get genomes loaded in a couple of hours to days (depending on regular work stuff).
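The “3′” versus “three-prime” problem is the kind of thing a small normalization step can absorb before loading. The following is a rough Python sketch of that idea, not the loader CoGe uses. The alias table and function names are hypothetical; the canonical names on the right-hand side are ordinary Sequence Ontology feature types as used in GFF3.

```python
import re

# Hypothetical alias table: cleaned-up spelling -> canonical GFF3 feature type.
ALIASES = {
    "three_prime_utr": "three_prime_UTR",
    "five_prime_utr": "five_prime_UTR",
    "cds": "CDS",
    "mrna": "mRNA",
    "gene": "gene",
    "exon": "exon",
}


def _key(raw):
    """Reduce a feature-type spelling to a lowercase, underscore-joined key."""
    s = raw.strip().lower()
    s = s.replace("\u2032", "'")               # Unicode prime -> ASCII apostrophe
    s = re.sub(r"\b3'", "three_prime", s)      # 3' -> three_prime
    s = re.sub(r"\b5'", "five_prime", s)       # 5' -> five_prime
    s = re.sub(r"[\s\-]+", "_", s)             # spaces and hyphens -> underscore
    return s


def normalize_type(raw):
    """Return the canonical feature type, or the input unchanged if unknown."""
    return ALIASES.get(_key(raw), raw.strip())


def normalize_gff3_line(line):
    """Rewrite column 3 (feature type) of one GFF3 line; pass comments through."""
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return line
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated GFF3 columns, got %d" % len(cols))
    cols[2] = normalize_type(cols[2])
    return "\t".join(cols)
```

Unknown feature types are left untouched on purpose, so anything the alias table does not cover still shows up for a human to look at.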

If you are interested in learning a bit more about CoGe’s system architecture and database design please see:

Designing new tools in CoGe — zombie-style: “the need for biologists’ brains”

While I’m trained as a biologist, there is no substitute for the expertise and depth of knowledge that comes from years of working with a group of organisms.  By working closely with biologists who have specific questions about their genomes of interest, CoGe’s team develops new solutions or extends existing tools.  As these solutions are tested and validated, some become the basis for new tools in CoGe.  What is particularly fun about this process is that many times, tools developed for one problem can be used to solve problems that were not originally intended.

An example of this type of development was the recent extension of SynMap, CoGe’s tool for generating syntenic dotplots between any two genomes. Working with a group sequencing E. coli genomes in order to determine polymorphisms, I developed a hybrid assembly pipeline that would first use de novo assembly to generate contigs, and then use syntenic comparison to a reference genome to order and orient the contigs. The second part of this assembly process was integrated into SynMap as an option. While useful for putting together small genomes with a relatively close reference genome, this syntenic path assembly can also be used to generate very approximate assemblies between very divergent genomes, for example grape and potato (~120MYA). This type of analysis turned a 68,000 piece genome into a visual pattern that confirms that these two genomes have nearly identical genome structure, and that shows evidence for ancient whole genome duplication events.
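The “order and orient” step can be sketched in a few lines of Python. This is a simplified illustration rather than the algorithm inside SynMap: it assumes you already have a list of syntenic hits, each given as a contig name, the position of the match on the reference, and the strand of the match. Each contig is placed at the median reference position of its hits and flipped if most of its hits are on the minus strand. The function name and the input format are assumptions made for this example.

```python
from collections import defaultdict
from statistics import median


def order_and_orient(hits):
    """Order contigs along a reference and choose an orientation for each.

    hits: iterable of (contig, ref_pos, strand) tuples, strand being '+' or '-'.
    Returns a list of (contig, orientation) sorted by position on the reference.
    Contigs with no hits are absent from the input and so are not placed.
    """
    by_contig = defaultdict(list)
    for contig, ref_pos, strand in hits:
        by_contig[contig].append((ref_pos, strand))

    placed = []
    for contig, rows in by_contig.items():
        # The median is less sensitive than the mean to a stray hit far away.
        anchor = median(pos for pos, _ in rows)
        minus = sum(1 for _, strand in rows if strand == "-")
        # Flip the contig only when a strict majority of hits are on '-'.
        orientation = "-" if minus > len(rows) / 2 else "+"
        placed.append((anchor, contig, orientation))

    placed.sort()
    return [(contig, orientation) for _, contig, orientation in placed]
```

For example, calling `order_and_orient([("c2", 500, "+"), ("c1", 100, "-"), ("c1", 200, "-")])` places `c1` before `c2` and marks `c1` as reversed. With very divergent genomes the hits get sparse and noisy, which is why the resulting assembly should be read as very approximate.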

Another example using SynMap is comparing differences in genome assemblies. While designed to identify syntenic regions within and between genomes, it works equally well when comparing two versions of a genome. Two examples, comparing versions 1 and 2 of the grape genome and versions 2 and 3 of the Medicago genome, made me appreciate how much salt to take when analyzing genome structure evolution.

Learning to use CoGe:

Using CoGe is very much like the old “choose your own adventure” books, which is fun for its open-ended story-lines, but does mean that there is a learning curve to the system. To help with this learning process, there are several text and video tutorials available that are set up as “How to do [something]”. Also, there is some necessary background information that will help you along an analytical path. Knowing something about the types of patterns for which you are searching, or the types of information you’d like to obtain, is required. To help with that aspect of learning about comparative genomics, and genome structure and evolution in general, we maintain a wiki to describe these terms and provide links to CoGe for analyzing and visualizing different patterns of genome structure and evolution. Still, there are many tricks that you learn by doing, and the best way to start is finding a genome of interest, looking at it, and comparing it to other genomes.

What’s coming up for CoGe:

For the most part, CoGe is under constant development. New genomes are loaded, tools are tweaked, bugs are squashed, new features are implemented, and new tools are deployed. The latest tool developments are:

-A much more powerful analytical pipeline for SynMap (deployed)

-A new tool for binning genome/gene contents by percent GC and codon usage (a pet obsession of mine) called CodeOn (deployed)

-A new tool for finding all syntenic regions for a given gene through any number of other genomes called SynFind (in progress)
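For readers who have not met percent GC or codon usage before, here is a small Python sketch of the two quantities and of a simple binning by GC content. It is only meant to show what is being measured; it is not how CodeOn is implemented, and the function names and the default bin width are assumptions made for this example.

```python
from collections import Counter


def gc_percent(seq):
    """Percent of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def codon_usage(cds):
    """Count codons in a coding sequence, ignoring a trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def bin_by_gc(seqs, width=10):
    """Count sequences per GC bin; keys are the lower edge of each bin.

    width is assumed to divide 100 evenly. A sequence at exactly 100% GC
    is counted in the top bin rather than in a bin of its own.
    """
    bins = Counter()
    for seq in seqs:
        start = min(int(gc_percent(seq) // width) * width, 100 - width)
        bins[start] += 1
    return dict(bins)
```

Run over all the coding sequences in a genome, counts like these are the raw material for the kind of percent GC and codon usage comparisons described above.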

Much more significantly, we will be rolling out a new version of CoGe as a whole in the next 2-4 months (we are waiting for new hardware). This will sport a new UI layout, some changes to CoGe’s genomes database, and updated system code (e.g. updating the code base for new versions of dependent modules).

How to help:

If anyone is interested in helping:

1. Provide feedback on what is working and what is not.

2. Make a tutorial. Just a couple of pictures and some words can go a long way to help someone else. Or better yet, a 5 minute video.


I have to thank several people who have been instrumental in putting CoGe together. Their programming, algorithm, visualization, and UI skills have been and continue to be invaluable: Josh Kane, Brent Pedersen, James Schnable, Shabari Subramaniam, Haibao Tang.

Eric Lyons

(This post content was reproduced via The OpenHelix Blog.)