LO6: Omics and systems biology

COMPARATIVE GENOME ANALYSIS

The first complete genome sequences of living organisms became available only recently. In 1995, the genomes of the first two bacteria, Haemophilus influenzae and Mycoplasma genitalium, were reported. One year later, the first archaeal (Methanococcus jannaschii) and the first eukaryotic (yeast Saccharomyces cerevisiae) genomes were completely sequenced. Next, in 1997, the genomes of the two best-studied bacteria, Escherichia coli and Bacillus subtilis, were sequenced. Many more bacterial and archaeal genomes, as well as the genomes of multicellular eukaryotes such as the nematode Caenorhabditis elegans, have been sequenced since then.

A striking outcome of these first genome projects is that at least one-third of the genes encoded in each genome had no known or predictable function. For many of the remaining genes, only a general function could be predicted. The depth of our ignorance becomes particularly obvious on examination of the genome of Escherichia coli K12, arguably the most extensively studied organism among both prokaryotes and eukaryotes. Even in this well-known model organism of molecular biologists, at least 40% of the genes have unknown function. On the other hand, it turned out that the level of evolutionary conservation of microbial proteins is rather uniform, with ∼70% of gene products from each of the sequenced genomes having orthologs in distant genomes. Thus, the functions of many of these genes can be predicted simply by comparing different genomes and by transferring functional annotation of proteins from better-studied organisms to their orthologs from lesser-studied organisms. This makes comparative genomics a powerful tool for achieving a better understanding of the genomes and, subsequently, of the biology of the respective organisms.

PROGRESS IN GENOME SEQUENCING

By the beginning of 2000, genomes of 23 different unicellular organisms (5 archaeal, 17 bacterial, and 1 eukaryotic) had been completely sequenced. By 2018, thousands of microbial and eukaryotic genomes were in different stages of sequencing. Periodically updated lists of both finished and unfinished publicly funded genome sequencing projects are available in GenBank Entrez Genomes. A complete list of sequencing centers worldwide can be found at the NHGRI Web site. One can retrieve the actual sequence data from the NCBI FTP site or from the FTP sites of the individual sequencing centers. A convenient sequence retrieval system is also maintained at the DNA Data Bank of Japan. In the framework of the Reference Sequences (RefSeq) project, NCBI has started to supplement the lists of gene products with valuable sequence analysis information, such as the lists of best hits in different taxa, predicted functions for uncharacterized gene products, frame-shifted proteins, etc. On the other hand, sequencing centers like TIGR regularly update their sequence data and correct some of the sequencing errors, so their sites may contain more recent data on unfinished genome sequences.

General-Purpose Databases for Comparative Genomics

Because the Web makes genome sequences available to anyone with Internet access, there exists a variety of databases that offer more or less convenient access to basically the same sequence data. However, several research groups, specializing in genome analysis, maintain databases that provide important additional information, such as operon organization, functional predictions, three-dimensional structure, and metabolic reconstructions.

PEDANT

This useful Web resource provides answers to most standard questions in genome comparison. PEDANT provides an easy way to ask simple questions, such as how many proteins in H. pylori have known (or confidently predicted) three-dimensional structures or how many NAD+-dependent alcohol dehydrogenases (EC 1.1.1.1) are encoded in the C. elegans genome. The list of standard PEDANT queries includes EC numbers, PROSITE patterns, Pfam domains, BLOCKS, and SCOP domains, as well as PIR keywords and PIR superfamilies (Fig. 1). Although PEDANT does not allow users to enter their own queries, the variety of data available in this database makes it a convenient entry point into the field of comparative genome analysis.

Fig. 1. Helicobacter pylori P12 in the PEDANT database


COGs

The Clusters of Orthologous Groups (COGs) database was designed to simplify evolutionary studies of complete genomes and to improve functional assignments of individual proteins. It consists of more than 4,800 conserved families of proteins (COGs) from the completely sequenced genomes. Each COG contains orthologous sets of proteins from at least three phylogenetic lineages, which are assumed to have evolved from a single ancestral protein. By definition, orthologs are genes that are connected by vertical evolutionary descent (the ‘‘same’’ gene in different species), as opposed to paralogs, which are genes related by duplication within a genome. Because orthologs typically perform the same function in all organisms, delineation of orthologous families from diverse species allows the transfer of functional annotation from better-studied organisms to the lesser-studied ones. The protein families in the COG database are separated into 25 functional groups that include a group of uncharacterized yet conserved proteins, as well as a group of proteins for which only a general function prediction has been made (Fig. 2). This site is particularly useful for functional predictions in borderline cases, where protein similarity levels are fairly low. Due to the diversity of proteins in COGs, sequence similarity searches against the COG database can often suggest a possible function for a protein that otherwise has no clear database hits.

Fig. 2. Bacteroides thetaiotaomicron VPI-5482 functional categories in COG


KEGG

The Kyoto Encyclopedia of Genes and Genomes (KEGG) is focused on cellular metabolism. This database presents a comprehensive set of metabolic pathway charts, both general and specific, for each of the completely-sequenced genomes, as well as for Schizosaccharomyces pombe, Arabidopsis thaliana, Drosophila melanogaster, mouse, and human. Enzymes that are already identified in a particular organism are color-coded, so that one can easily trace the pathways that are likely to be present or absent in a given organism (Fig. 3). For the metabolic pathways covered in KEGG, lists of orthologous genes that code for the enzymes participating in these pathways are also provided. It is also indicated whenever these genes are adjacent, forming likely operons. A very convenient search tool allows the user to compare two complete genomes and identify all cases in which conserved genes in both organisms are adjacent or located relatively close (within 5 genes) to each other. The KEGG site is continuously updated and serves as an ultimate source of data for the analysis of metabolism in various organisms.

Fig. 3. Metabolic pathway chart of glycerophospholipid metabolism


MBGD

The Microbial Genome Database (MBGD) offers another convenient tool for comparative analysis of completely sequenced microbial genomes, the number of which is now growing rapidly (Fig. 4). Here, the homology relationships are based only on sequence similarity (BLASTP values of 10^-2 or less). MBGD allows the user to submit several sequences at once (up to 2,000 residues) for searching against all of the completely sequenced genomes. The result is displayed as color-coded functions of the detected homologs and shows their location on a circular genome map. The output of MBGD’s BLAST search also shows the degree of overlap between the query and target sequences. For each sequenced genome, MBGD provides convenient lists of all recognized genes that are involved in a particular function, e.g., the biosynthesis of branched-chain amino acids or the degradation of aromatic hydrocarbons.

Fig. 4. MBGD database


Organism-Specific Databases

In addition to general genomics databases, there exists a variety of databases dedicated to a particular organism or group of organisms. Although all of them are useful for specific purposes, those devoted to E. coli, B. subtilis, and yeast are probably the ones most widely used for functional assignments in other, less studied organisms.

Escherichia coli. The importance of E. coli for molecular biology is reflected in the large number of databases dedicated to this organism. One of them is maintained at the University of Wisconsin-Madison by the research group that carried out the actual sequencing of the E. coli genome (Fig. 5). The Wisconsin group is also involved in sequencing the enterohemorrhagic E. coli O157:H7 and other enterobacteria, so their database is also very useful for analysis of enteric pathogens. Another useful E. coli database, EcoCyc, lists all experimentally studied E. coli genes and provides comprehensive coverage of the metabolic pathways identified in E. coli. The aim of another E. coli database, Bacteriome, is to integrate a high-quality functional interaction dataset of E. coli proteins with experimental datasets generated through tandem affinity purification screens. Finally, Colibri and GenExpDB are the databases of choice for those interested in regulatory networks of E. coli. The E. coli Genetic Stock Center (CGSC) Web site also provides gene and function information.

Fig. 5. E. coli Genome Project


Mycoplasma genitalium. M. genitalium has the smallest genome of any organism that can be grown in pure culture, which offers some hints as to the lower limit of genes necessary to sustain life (the ‘‘minimal genome’’). Its comparison to the closely related, slightly larger genome of Mycoplasma pneumoniae is available online. Recent data from VFDB provide insight into the range of Mycoplasma genes that can be mutated without loss of viability (Fig. 6). From both computational analysis and mutagenesis studies, it appears that 250–300 genes are absolutely essential for the survival of mycoplasmas.

Fig. 6. Mycoplasma Genome Database at VFDB


Bacillus subtilis. The B. subtilis genome also attracts considerable attention from biologists and, like that of E. coli, is being actively studied from the functional perspective. The SubtiList World-Wide Web Server, maintained at the Institut Pasteur, is constantly updated to include the most recent information on functions of new B. subtilis genes. In addition, DBTBS is a comprehensive database of transcriptional regulation in Bacillus subtilis that also includes upstream intergenic conservation information.

Saccharomyces cerevisiae. The major database specifically devoted to functional analysis of the S. cerevisiae genome is the Saccharomyces Genome Database (SGD) (Fig. 7). It provides regularly updated lists of yeast proteins with known or predicted functions, appropriate references, and mutant phenotypes, and it reflects the ongoing efforts aimed at complete characterization of all yeast proteins. SGD is probably the largest and most comprehensive source of information on the current status of yeast genome analysis and includes the Saccharomyces Gene Registry.

Other useful sites for yeast genome analysis include the Saccharomyces cerevisiae Promoter Database, which lists known regulatory elements and transcription factors in yeast, and the Saccharomyces Cell Cycle Expression Database, which presents the first results on changes in mRNA transcript levels during the yeast cell cycle.

Fig. 7. Saccharomyces Genome Database


GENOME ANALYSIS AND ANNOTATION

One of the limiting steps in most genome projects is the sequence analysis and annotation of the complete genome. This task is particularly daunting given the lack of functional information for a large number of genes even in the best-understood model organisms. The standard stages involved in the structural-functional annotation of uncharacterized proteins include:

  • sequence similarity searches using programs such as BLAST, FASTA, or the Smith-Waterman algorithm;
  • identifying functional motifs and structural domains by comparing the protein sequence against PROSITE, BLOCKS, SMART, or Pfam;
  • predicting structural features of the protein, such as likely signal peptides, transmembrane segments, coiled-coil regions, and other regions of low sequence complexity; and
  • generating a secondary (and, if possible, tertiary) structure prediction.

All these steps have been automated in several software packages, such as GeneQuiz, MAGPIE, PEDANT, Imagene, and others. Of these, however, MAGPIE and PEDANT do not allow outside users to submit their own sequences for analysis and display only the authors’ own results. GeneQuiz offers a limited number of searches (up to 100 a day) to general users but is still a good entry point for comparative genome analysis. It relies, however, on unrealistically high cutoff scores to deduce homology, which results in relatively low sensitivity. One package that is available for free downloading is SEALS, developed at NCBI. It consists of a number of UNIX-based tools for retrieving sequences from GenBank, running database search programs such as BLAST, viewing and analyzing search outputs, searching for sequence motifs, and predicting protein structural features. A similar package, Imagene, has been developed at Université Paris VI.
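The similarity-search step named in the list above can be illustrated with a minimal sketch of Smith-Waterman local alignment scoring. The function name and the simple match/mismatch/gap scores are assumptions chosen for illustration; real protein searches use substitution matrices and affine gap penalties, which are omitted here.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of sequences a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Toy sequences (hypothetical) that share only the short segment "ACG".
print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Only the score is returned; recovering the alignment itself would require keeping the full matrix and tracing back from the best cell.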

Genome Comparison for Prediction of Protein Functions

Analysis of the first sequenced bacterial, archaeal, and eukaryotic genomes using sequence comparison methods failed to predict protein function for at least one-third of gene products in any given genome. In these cases, other approaches can be used that take into consideration all other available data, putting them into ‘‘genome context’’. These approaches rely on the same basic principle: the organization of the genetic information in each particular genome reflects a long history of mutations, gene duplications, gene rearrangements, gene function divergence, and gene acquisition and loss that has produced organisms uniquely adapted to their environment and capable of regulating their metabolism in accordance with the environmental conditions. In this respect, cross-genome similarities can be regarded as meaningful in the evolutionary sense and thus are potentially useful for functional analysis. The most useful comparative methods specifically employ information derived from multiple genomes, thus achieving reliability and sensitivity that are not easily attainable with standard tools. Some of these new approaches are briefly reviewed below.

Transfer of Functional Information

The simplest and most common way to exploit the information embedded in multiple genomes is the transfer of functional information from well-characterized genomes to poorly studied ones. In effect, this is done whenever a prediction is made for a newly sequenced gene on the basis of one or more database hits. There are, however, many pitfalls that tend to hamper accurate functional prediction on the basis of such hits. The most important ones are insufficient sensitivity, error propagation caused by reliance on incorrect or imprecise annotations already present in the databases, and the difficulty in distinguishing orthologs from paralogs. The issue of orthology vs. paralogy is critical because transfer of functional information can be considered reliable for orthologs (direct evolutionary counterparts) but may not be reliable for paralogs (products of gene duplications). All these problems are, in part, avoided in the COG system, which consists of carefully annotated sets of likely orthologs and does not rely on arbitrary cutoffs for assigning new proteins to them.

The COGs can be employed for annotation of newly-sequenced genomes using the COGNITOR program. This program allocates new proteins to COGs by comparing them to protein sequences from all genomes included in the COG database and detecting genome-specific best hits (BeTs). When three or more BeTs fall into the same COG, the query protein is considered a likely new COG member. The requirement of multiple BeTs for a protein to be assigned to a COG serves, to some extent, as a safeguard against the propagation of errors that might be present in the COG database itself. Indeed, if a COG contains one or even two false-positives, this will not result in a false assignment by COGNITOR under the three-BeT cutoff rule.
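The three-BeT rule described above can be sketched in a few lines. This is an illustration of the rule only, not the COGNITOR program: the function name, the input layout, and all protein and COG identifiers are hypothetical, and the best hits are assumed to have been computed beforehand.

```python
from collections import Counter


def assign_cog(best_hits, protein_to_cog, min_bets=3):
    """Assign a query protein to a COG if enough genome-specific best hits agree.

    best_hits: dict mapping genome name -> ID of the query's best hit there.
    protein_to_cog: dict mapping protein ID -> COG ID for known COG members.
    """
    counts = Counter(
        protein_to_cog[hit]
        for hit in best_hits.values()
        if hit in protein_to_cog  # best hits outside any COG are ignored
    )
    if not counts:
        return None
    cog, n_bets = counts.most_common(1)[0]
    return cog if n_bets >= min_bets else None


# Hypothetical toy data: three best hits agree on COG0001, one points elsewhere.
PROTEIN_TO_COG = {"eco_p1": "COG0001", "bsu_p7": "COG0001",
                  "hin_p3": "COG0001", "mja_p9": "COG0002"}
BEST_HITS = {"Eco": "eco_p1", "Bsu": "bsu_p7", "Hin": "hin_p3", "Mja": "mja_p9"}
print(assign_cog(BEST_HITS, PROTEIN_TO_COG))
```

With this threshold, one or two misplaced members of a COG cannot pull a query into it on their own, which is the safeguard the text refers to.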

Phylogenetic Patterns (Profiles)

The COG-type analysis applied to multiple genomes provides the basis for deriving phylogenetic patterns, which are potentially useful in many aspects of genome analysis and annotation. The phylogenetic pattern for each protein family (COG) is defined as the set of genomes in which the family is represented. The COG database is accompanied by a pattern search tool that allows the user to select COGs with a particular pattern. The underlying assumption is that functionally related genes tend to have the same phylogenetic pattern. For this reason, phylogenetic patterns can be used to improve functional predictions in complete genomes. When a particular genome is represented in the COGs for a subset of components of a particular complex or pathway but is missing in the COGs for other components, a focused search for the latter is justified. The same applies to cases in which a gene is found in one of two closely related genomes, but not the other.
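A phylogenetic pattern as defined above is simply a presence/absence vector over genomes. The sketch below, with hypothetical COG identifiers, genome codes, and function names, encodes patterns as strings, groups COGs that share a pattern, and lists the pathway components that appear to be missing from one genome and would therefore deserve a focused search.

```python
from collections import defaultdict


def phylogenetic_pattern(members, genomes):
    """Encode presence (+) or absence (-) of a family in each genome, in order."""
    return "".join("+" if g in members else "-" for g in genomes)


def group_by_pattern(cogs, genomes):
    """Group COG IDs by their phylogenetic pattern."""
    groups = defaultdict(list)
    for cog_id, members in cogs.items():
        groups[phylogenetic_pattern(members, genomes)].append(cog_id)
    return dict(groups)


def missing_components(pathway, cogs, genome):
    """Pathway COGs in which the given genome is not represented."""
    return [cog_id for cog_id in pathway if genome not in cogs[cog_id]]


# Hypothetical toy data: each COG maps to the set of genomes it occurs in.
GENOMES = ["Eco", "Bsu", "Mja", "Sce"]
COGS = {
    "COG_A": {"Eco", "Bsu", "Mja", "Sce"},
    "COG_B": {"Eco", "Bsu"},
    "COG_C": {"Eco", "Bsu"},
    "COG_D": {"Eco", "Mja"},
}
print(group_by_pattern(COGS, GENOMES))
print(missing_components(["COG_A", "COG_B", "COG_D"], COGS, "Mja"))
```

COGs that end up in the same group are candidates for a functional link, but a shared pattern is only a hint and needs support from other evidence.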

Use of Phylogenetic Patterns for Differential Genome Display

The phylogenetic pattern approach and, specifically, the pattern search tool associated with the COGs can be used to perform systematic logical operations (AND, OR, NOT) on gene sets, an approach called ‘‘differential genome display’’. This type of genome comparison makes it possible to delineate subsets of gene products that are likely to contribute to the specific characteristics of the studied organisms, for example, thermophily. The approach is of particular interest for identifying candidate drug targets in pathogenic bacteria. It seems logical to look for such targets among those genes that are shared by several pathogenic organisms but are missing in eukaryotes. On the other hand, it is appealing to suggest that the best targets for new broad-spectrum antimicrobial agents would be genes that are shared by all pathogenic microbes, but not by any other organisms. However, such genes do not seem to exist. The most practical strategy when searching for targets of potentially universal antimicrobial agents is therefore to identify the genes that are present in most of the pathogens, but not in eukaryotes.
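The logical operations behind differential genome display map directly onto set operations. The following sketch assumes each COG is stored as the set of genomes in which it occurs; the function name, the `min_present` parameter, and the toy data are hypothetical. It selects families present in all (or at least a given number) of the chosen genomes and absent from every excluded genome.

```python
def differential_display(cogs, present_in, absent_from, min_present=None):
    """Select COGs found in the `present_in` genomes but in none of `absent_from`.

    min_present relaxes the AND condition to "at least this many" genomes,
    which corresponds to "present in most of the pathogens".
    """
    present_in = set(present_in)
    absent_from = set(absent_from)
    if min_present is None:
        min_present = len(present_in)  # strict AND over all listed genomes
    selected = []
    for cog_id, members in cogs.items():
        if members & absent_from:  # NOT: skip families seen in excluded genomes
            continue
        if len(members & present_in) >= min_present:
            selected.append(cog_id)
    return selected


# Hypothetical toy data: three pathogens and one eukaryote.
TOY_COGS = {
    "COG_1": {"pathogen1", "pathogen2", "pathogen3"},
    "COG_2": {"pathogen1", "pathogen2", "yeast"},
    "COG_3": {"pathogen1", "pathogen2"},
    "COG_4": {"pathogen3"},
}
PATHOGENS = ["pathogen1", "pathogen2", "pathogen3"]
print(differential_display(TOY_COGS, PATHOGENS, ["yeast"], min_present=2))
```

Relaxing `min_present` reflects the observation in the text that genes shared by all pathogens and by no other organism do not seem to exist.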

Study of Gene (Domain) Fusions

Another recently developed comparative genomic approach involves systematic analysis of protein and domain fusion (and fission). The basic hypothesis is that fusion would be maintained by selection only when it facilitates functional interaction between proteins, for example, kinetic coupling of consecutive enzymes in a pathway. Thus, proteins that are fused in some species can be expected to interact, perhaps physically or at least functionally, in other organisms. A straightforward example of functional inferences that can be drawn from domain fusion is seen in the histidine biosynthesis pathway, which in E. coli and H. influenzae includes two two-domain proteins, HisI and HisB. The two domains of HisI catalyze two consecutive steps of histidine biosynthesis and thus represent subunits that are likely to physically interact even when produced as separate proteins. In contrast, the two domains of HisB catalyze the seventh and ninth steps of the pathway and hence are not likely to physically interact. The COG database includes about 700 distinct multidomain architectures. Thus, using domain fusion for functional prediction has considerable empirical potential although this approach will not work for ‘‘promiscuous’’ domains such as, for example, the DNA-binding helix-turn-helix domain, which can be found in combination with a wide variety of other domains.
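The fusion-based inference described above can be sketched as follows, assuming domain architectures have already been determined (for example, with one of the domain databases). All function names, protein IDs, and domain names are hypothetical. Domains listed as promiscuous are excluded, as the text recommends, and the predicted links are functional hypotheses only; as the HisB example shows, fused domains need not interact physically.

```python
from itertools import combinations


def fused_domain_pairs(architectures, promiscuous=frozenset()):
    """Domain pairs that co-occur in at least one multidomain protein."""
    pairs = set()
    for domains in architectures.values():
        informative = sorted(set(domains) - set(promiscuous))
        pairs.update(combinations(informative, 2))
    return pairs


def predict_links(fused_pairs, target_proteins):
    """Pair up single-domain proteins of a target genome whose domains are
    fused elsewhere; each pair is a candidate functional link."""
    by_domain = {}
    for protein, domains in target_proteins.items():
        if len(domains) == 1:
            by_domain.setdefault(domains[0], []).append(protein)
    links = set()
    for d1, d2 in fused_pairs:
        for p1 in by_domain.get(d1, []):
            for p2 in by_domain.get(d2, []):
                links.add((p1, p2))
    return links


# Hypothetical reference architectures and a hypothetical target genome.
REFERENCE = {"fusedAB": ("domA", "domB"), "regulator": ("HTH", "domC")}
TARGET = {"t1": ("domA",), "t2": ("domB",), "t3": ("domC",)}
pairs = fused_domain_pairs(REFERENCE, promiscuous={"HTH"})
print(predict_links(pairs, TARGET))
```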

In addition, several databases have recently been developed for detecting domains and exploring architectures of multidomain proteins: Pfam, ProDom, and SMART.

Of these, SMART seems to be the most advanced, combining high sensitivity of domain detection with accuracy, high speed, and extremely informative presentation of domain architectures. Rapid searches for protein domains, based on a modification of the PSI-BLAST program, are now also available through the Conserved Domain Database (CDD) at NCBI.

Analysis of Operons

An approach that is conceptually similar to the analysis of gene fusions, but is more general, involves systematic analysis of gene ‘‘neighborhoods’’ in genomes. Because functionally linked genes frequently form operons in bacteria and archaea, gene adjacency may provide important functional suggestions. However, many functionally related genes never form operons, and, in many instances, adjacent genes are not connected in any way. Due to the lack of overall conservation of gene order in prokaryotes, the presence of a pair of adjacent orthologous genes in three or more genomes or the presence of three orthologs in a row in two genomes can be considered a statistically meaningful event and can be used to infer potential functional interaction for the products of these genes. The simplest current tool for identification of conserved gene strings in any two genomes is available as part of KEGG. It allows the user to select any two complete genomes (e.g., B. burgdorferi and R. prowazekii) and look for all genes whose products are similar to each other and are located within a certain distance from each other (for example, separated by 0–5 genes). The results are displayed in a graphical format illustrating the gene order and the presumed functions of gene products. The conservation of gene position in phylogenetically distant bacteria suggests a functional connection.
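The two-genome comparison described above can be sketched as a search for gene pairs that lie close together in both genomes. This is not the KEGG tool itself: the function name, the inputs (gene order lists and a precomputed ortholog map), and the toy data are hypothetical, and the sketch treats genomes as linear and ignores strand.

```python
def conserved_neighbors(order_a, order_b, orthologs, max_gap=5):
    """Gene pairs from genome A separated by at most `max_gap` genes whose
    orthologs in genome B are also separated by at most `max_gap` genes.

    order_a, order_b: gene IDs in chromosomal order.
    orthologs: dict mapping a gene of A to its ortholog in B.
    """
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    pairs = []
    for i, a1 in enumerate(order_a):
        b1 = orthologs.get(a1)
        if b1 not in pos_b:
            continue
        # Genes i and j are separated by j - i - 1 intervening genes.
        for j in range(i + 1, min(i + max_gap + 2, len(order_a))):
            a2 = order_a[j]
            b2 = orthologs.get(a2)
            if b2 not in pos_b:
                continue
            if abs(pos_b[b1] - pos_b[b2]) - 1 <= max_gap:
                pairs.append((a1, a2))
    return pairs


# Hypothetical toy genomes: a1/a2 stay together in B, a4 has moved away.
ORDER_A = ["a1", "a2", "a3", "a4"]
ORDER_B = ["b2", "b1", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "b4"]
ORTHOLOGS = {"a1": "b1", "a2": "b2", "a4": "b4"}
print(conserved_neighbors(ORDER_A, ORDER_B, ORTHOLOGS))
```

As the text notes, a conserved pair in only two genomes is weak evidence; the same check repeated across three or more genomes, or extended to runs of three genes, gives a more meaningful signal.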

APPLICATION OF COMPARATIVE GENOMICS—RECONSTRUCTION OF METABOLIC PATHWAYS

To illustrate the genome analysis tools discussed above, a reconstruction of the glycolytic pathway in the archaeon Methanococcus jannaschii is presented. Metabolic reconstruction is one of the crucial final steps of all genome analyses and a convergence point for the data produced by different methods. Glycolysis is one of the central pathways of cellular biochemistry, as becomes obvious from a cursory exploration of the general scheme of biochemical pathways, available in interactive form on the KEGG Web site (Fig. 8).

Fig. 8. Glycolysis in KEGG


The names of all the enzymes and metabolites on this map are hyperlinked and searchable. Each enzyme name is linked to a page of enzyme information, which lists the names and catalyzed reactions, the official Enzyme Commission (EC) numbers, and whether or not the protein sequences are known. Thus, clicking on the name ‘‘hexokinase’’ will bring up the corresponding page (Fig. 9).

Fig. 9. Hexokinase information


Error Propagation and Incomplete Information in Databases

Sequence databases are prone to error propagation, whereby wrong annotation of one protein causes multiple errors as it is used for annotation of new genomes. Furthermore, database searches have the potential for noise amplification: the original annotation may have involved only a minor inaccuracy or incompleteness, but its transfer on the basis of sequence similarity worsens the problem and eventually results in outright false functional assignments. These aspects of sequence databases make the common practice of assigning gene function on the basis of the annotation of the best database hit (or even a group of hits with compatible annotations) highly error-prone. Although it is time- and labor-consuming, adequate genome annotation requires that each gene be considered in the context of both its phylogenetic relationships and the biology of the respective organism, hence the rather disappointing performance of automated systems for genome annotation. There are numerous reasons why functional annotation may be wrong in the first place, but the two main groups of problems are due to the database search methods and to the complexity and diversity of the genomes themselves.

False Positives and False Negatives in Database Searches

It is usual in genome annotation to use a cutoff for ‘‘statistically significant’’ database hits. It can be expressed in terms of the false-positive expectation (E) value for BLAST searches and is routinely set at values such as E = 0.001 or E = 10^-5. The problem with this approach is that the distribution of similarity scores for evolutionarily and functionally relevant sequence alignments is very broad and that a considerable fraction of them fail the E-value cutoff, resulting in undetected relationships and missed opportunities for functional prediction (false negatives). Conversely, spurious hits may have E-values lower than the cutoff, resulting in false positives. The latter is most frequently caused by compositional bias (low-complexity regions) in the query sequence and in the database sequences. Clearly, there is a trade-off between sensitivity (false-negative rate) and selectivity (false-positive rate) in all database searches, and it is particularly difficult to optimize the process in genome-wide analyses. There is no simple solution to these problems. To minimize the false-positive rate, appropriate procedures for filtering low-complexity sequences are critical. Filtering using the SEG program is the default for Web-based BLAST searches, but additional filtering is justified for certain types of proteins. For example, filtering of predicted nonglobular domains using SEG with specifically adjusted parameters and filtering for coiled-coil domains using the COILS2 program is one way to minimize the false-positive rate. Minimizing the false-negative rate (that is, maximizing sensitivity) is an open-ended problem. It should be kept in mind that a standard database search (e.g., using BLAST) with the protein sequences encoded in the given genome as queries is insufficient for an adequate annotation. To increase the sensitivity of genome analysis, it should be supplemented by other, more powerful methods such as screening the set of protein sequences from the given genome with preformed profile libraries.
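The idea behind low-complexity filtering can be illustrated with a much simplified stand-in: masking every sequence window whose Shannon entropy falls below a threshold. This is not the SEG algorithm, and the function names, window length, and threshold are arbitrary assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits per residue) of the residue composition."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace residues in every low-entropy window with 'X'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for k in range(start, start + window):
                masked[k] = "X"
    return "".join(masked)


# Hypothetical sequence with a serine run between two diverse segments.
TOY_SEQ = "ACDEFGHIKLMN" + "S" * 12 + "PQRTVWYACDEF"
print(mask_low_complexity(TOY_SEQ))
```

Because windows overlap, masking extends a few residues beyond the biased run itself; the masked sequence would then be used as the query so that the biased region cannot produce spurious hits.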

Genome, Protein, and Organismal Context as a Source of Errors

As discussed above, protein domain architecture, genomic context and an organism’s biology may serve as sources of important, even if indirect, functional information. However, those same context features, if misinterpreted, may become one of the major sources of error and confusion in genome annotation. Standard database search programs are not equipped with the means to clearly address the implications of the multidomain organization of proteins. Therefore, unless specialized tools such as SMART or COGs are employed and/or the search output is carefully examined, assignment of the function of a single-domain protein to a multidomain homolog and vice versa becomes frequent in genome annotation. Mobile domains in particular can cause chaos in the annotation process, as demonstrated by the proliferation of ‘‘IMP-dehydrogenase-related proteins’’ in several genomes. In reality, most or all of these proteins (depending on the genome) share with IMP dehydrogenase the mobile CBS domain but not the enzymatic part.
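One simple guard against this kind of error is to check, before transferring an annotation, whether the aligned region of the database hit actually covers the domain that carries the annotated function. The sketch below is an assumption-laden illustration: the function name, the 80% coverage threshold, and all coordinates are hypothetical.

```python
def alignment_covers_domain(aln_start, aln_end, dom_start, dom_end,
                            min_fraction=0.8):
    """True if the aligned region of the hit covers at least `min_fraction`
    of the domain (all coordinates are 1-based and inclusive)."""
    overlap = max(0, min(aln_end, dom_end) - max(aln_start, dom_start) + 1)
    return overlap / (dom_end - dom_start + 1) >= min_fraction


# Hypothetical hit: a catalytic domain at residues 1-200 and a mobile
# domain at residues 250-480; the query aligns only to the mobile domain.
CATALYTIC = (1, 200)
ALIGNED = (250, 480)
print(alignment_covers_domain(*ALIGNED, *CATALYTIC))
```

When the check fails for the catalytic domain, the enzymatic annotation should not be transferred, even if the overall hit is statistically significant.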

As discussed above, it is also critical for reliable genome annotation that the biological context of the given organism is taken into account. For example, it is inappropriate to annotate archaeal gene products as nucleolar proteins, even if their eukaryotic homologs are correctly described as such. As a general guide to functional annotation, it should be kept in mind that current methods for genome analysis, even the most powerful and sophisticated of them, facilitate but do not replace the work of a human expert.

FINAL REMARKS

With an increasing number of complete genome sequences becoming available and specialized tools for genome comparison being developed, the comparative approach is becoming the most powerful strategy for genome analysis. It seems that the future should belong to databases and tools that consistently organize the genomic data according to phylogenetic, functional, or structural principles and explicitly take advantage of the diversity of genomes to increase the resolution power and robustness of the analysis. Many tasks in genome analysis can be automated, and, given the rapidly growing amount of data, automation is critical for the progress of genomics. This being said, the ultimate success of comparative genome analysis and annotation critically depends on complex decisions based on a variety of inputs, including the unique biology of each organism. Therefore, the process of genome analysis and annotation taken as a whole is, at least at this time, not automatable, and human expertise is necessary for avoiding errors and extracting the maximum possible information from the genome sequences.

Funding

Disclaimer

The European Commission support for the production of this publication does not constitute endorsement of the contents which reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein.