LO3: Omics and systems biology
- Tools for Genomics and Proteomics
- Sequencing Genes and Genomes
- Analysis of Raw Sequence Data: Basecalling
- Sequencing an Entire Genome
- LIMS: Tracking mini sequences
- Sequence Assembly
- Accessing Genome Information on the Web
- NCBI Genome Resources
- Genome Annotation
- Genome Comparison
- Functional Genomics
- Sequence-Based Approaches for Analyzing Gene Expression
- DNA Microarrays
- Bioinformatics Challenges in Microarray Design and Analysis
- Planning array experiments
- Analyzing scanned microarray images
- Clustering expression profiles
- Experimental Approaches in Proteomics
- Informatics Challenges in 2D-PAGE Analysis
- Tools for Proteomics Analysis
- Biochemical Pathway Databases
Tools for Genomics and Proteomics
The sequence alignment methods can be used to analyze a single sequence or structure and compare multiple sequences of single-gene length. These methods can help in understanding the function of a particular gene or the mechanism of a particular protein. However, it is also interesting to understand how gene functions manifest in the observable characteristics of an organism: its phenotype. In this respect, some datatypes and tools are available that allow studying the integrated function of all the genes in a genome.
Experimental strategies for analysing one gene or one protein are progressively being replaced by parallel approaches in which many genes are examined simultaneously. Using bioinformatics algorithms, information from multiple sources can be integrated to form a complete picture of genomic function and its expression, as well as to allow comparison between the genomes of different organisms. Figure 1 shows how genome information is transformed into phenotypic expression.
Figure 1. Transferring genome information to phenotype
For decades biologists have been collecting information from the molecular to the cellular level and beyond to see the functions of the genome as a whole. The process of automating and scaling up biochemical experimentation, and treating biochemical data as a public resource, is significantly facilitated by the use of bioinformatics.
The Human Genome Project has not only made gigabytes of biological sequence information available, but has also begun to change the entire landscape of biological research by its example. Protein structure determination has not yet been automated at the same level as sequence determination, but several structural genomics projects have been launched with the main goal of creating high-speed structure determination approaches. The concept behind the DNA microarray experiment, in turn, makes it possible to perform comprehensive biochemical and molecular biology experiments.
One of the major tasks of bioinformatics is creating software systems for information management that can effectively annotate each part of a genome sequence with information about everything from its function, to the structure of its protein product (if it has one), to the rate at which the gene is expressed at different life stages of an organism. Another task of genome information management systems is to allow users to make intuitive, visual comparisons between large data sets. Many new data integration projects, from visual comparison of multiple genomes to visual integration of expression data with genome map data, are under development.
Sequencing Genes and Genomes
One of the first computational challenges in the process of sequencing a gene (or a genome) is the interpretation of the pattern of fragments on a sequencing gel.
Analysis of Raw Sequence Data: Basecalling
The process of assigning a sequence to raw data from DNA sequencing is called basecalling. If this step doesn't produce a correct DNA sequence, any subsequent analysis of the sequence is affected. All sequences deposited in public databases are affected by basecalling errors due to uncertainties in sequencer output or to equipment malfunctions. EST and genome survey sequences have the highest error rates (1/10 to 1/100 errors per base), followed by finished sequences from small laboratories (1/100 to 1/1,000 per base) and finished sequences from large genome sequencing centers (1/10,000 to 1/100,000 per base). Any sequence in GenBank is likely to have at least one error. Improving sequencing technology, and especially the signal detection and processing involved in DNA sequencing, is still the subject of active research.
There are two popular high-throughput methods for DNA sequencing. Both rely on the ability to create a ladder of DNA fragments at single-base resolution and to separate the fragments by gel electrophoresis. In the first method, the fragmented DNA is labeled with four different fluorescent labels, one for each base-specific fragmentation, and a mixture of the four samples is run in one gel lane. The second method runs each sample in a separate, closely spaced lane. In both cases, the gel is scanned with a laser, which excites each fluorescent band on the gel in sequence. Each of these protocols has its advantages in different types of experiments, so both are in common use.
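To see why even low per-base error rates matter, the chance that a sequence of n bases contains at least one error can be estimated as 1 - (1 - p)^n. This assumes that errors are independent and occur at a uniform rate p per base, which is a simplification (real errors tend to cluster toward the ends of reads). The following Python sketch uses illustrative function names and the error rates quoted above:

```python
def prob_at_least_one_error(error_rate: float, length: int) -> float:
    """Probability that a sequence of `length` bases has >= 1 miscalled base,
    assuming independent errors at a uniform per-base `error_rate`."""
    return 1.0 - (1.0 - error_rate) ** length


def expected_errors(error_rate: float, length: int) -> float:
    """Expected number of miscalled bases under the same assumptions."""
    return error_rate * length


if __name__ == "__main__":
    # Per-base error rates taken from the ranges quoted in the text.
    for label, rate in [("EST (1/100)", 1e-2),
                        ("small laboratory (1/1,000)", 1e-3),
                        ("genome center (1/10,000)", 1e-4)]:
        p = prob_at_least_one_error(rate, 1000)
        print(f"{label}: P(>=1 error in 1,000 bases) = {p:.3f}")
```

Under these assumptions, a 1,000-base record at the EST error rate is almost certain to contain an error, while the same record from a large sequencing center usually will not.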
There are a variety of commercial and noncommercial tools for automated basecalling. Some of them are fully integrated with particular sequencing hardware and input datatypes. Most of them allow, and in fact require, curation by an expert user as sequence is determined.
The raw result of sequencing is a record of fluorescence intensities at each position in a sequencing gel. Figure 2 shows detector output from a modern sequencing experiment. The challenge for automated basecalling software is to translate the fluorescence peaks into the four-letter DNA sequence code. Because the separation of bands on a sequencing gel isn't perfect, the quality of the separation and the shape of the bands worsen over the length of the gel. Peaks broaden and intermix, and at some point (usually after 400-500 bases) they become impossible to resolve. These systematic errors are well understood, so basecalling algorithms are designed to compensate for them. The main goal of basecalling software is to improve the accuracy of each sequence read, as well as to extend the range of sequencing runs, by providing means to deconvolute the less distinct fluorescence peaks at the end of the run.
Figure 2. Detector output from a sequencing experiment
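The idea of translating fluorescence peaks into bases can be illustrated with a deliberately naive sketch. It assumes that the four dye channels are already aligned and background-corrected, takes the strongest channel at each sampling point, and calls a base wherever that channel shows a local maximum above a threshold. The trace format, the threshold value, and the function name are assumptions made for illustration; real basecallers also model peak spacing and peak broadening.

```python
from typing import Dict, List


def call_bases(traces: Dict[str, List[float]], min_height: float = 50.0) -> str:
    """Naive basecaller: emit a base at every local maximum of the
    strongest channel, provided the peak reaches `min_height`."""
    n = len(next(iter(traces.values())))
    called = []
    for i in range(1, n - 1):
        # Channel (base) with the highest intensity at sampling point i.
        base = max(traces, key=lambda b: traces[b][i])
        signal = traces[base]
        is_peak = signal[i] > signal[i - 1] and signal[i] >= signal[i + 1]
        if is_peak and signal[i] >= min_height:
            called.append(base)
    return "".join(called)


# Synthetic trace with clean, well-separated peaks for A, C, G, T in that order.
example = {
    "A": [0, 90, 0, 0, 0, 0, 0, 0, 0],
    "C": [0, 0, 0, 80, 0, 0, 0, 0, 0],
    "G": [0, 0, 0, 0, 0, 85, 0, 0, 0],
    "T": [0, 0, 0, 0, 0, 0, 0, 95, 0],
}
print(call_bases(example))  # prints ACGT
```

On overlapping or broadened peaks, such as those late in a run, this simple rule breaks down, which is exactly where dedicated basecalling software earns its keep.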
Modern sequencing technologies replace gels with microscopic capillary systems, but the core concepts of the process are the same as in gel-based sequencing: fragmentation of the DNA and separation of individual fragments by electrophoresis.
Sequencing an Entire Genome
Genome sequencing isn't simply a scaled-up version of a gene-sequencing run. A single sequencing run is limited to about 500 base pairs, whereas the length of a genome can range from tens of thousands to billions of base pairs. So, in order to sequence an entire genome, the genome has to be cleaved into fragments, and the sequenced fragments then need to be reassembled into a continuous sequence.
There are two popular strategies for sequencing genomes: the shotgun approach and the clone contig approach. Combinations of these strategies are often used to sequence larger genomes.
The shotgun approach
Shotgun DNA sequencing is an automated approach for DNA sequencing. Here, DNA is broken into random fragments of manageable length (around 2,000 base pairs, or 2 kb), which are cloned into plasmids; the resulting collection is called a clone library. If a sufficiently large amount of genomic DNA is fragmented, the set of clones spans every base pair of the genome many times. The end of each cloned DNA fragment is then sequenced, or in some cases, both ends are sequenced. Although only 400-500 bases at the end(s) of each fragment are sequenced, if enough clones are randomly selected from the library, the amount of sequenced DNA still encompasses every base pair of the genome several times. The next step in shotgun sequencing is sequence assembly. Usually, assembly results in multiple contigs: clearly assembled lengths of sequence that don't overlap each other. The final steps in sequencing a complete genome by shotgun sequencing are either to find clones that can fill in the missing regions, or to use PCR or other techniques to amplify DNA sequence from the gaps.
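How many reads are "enough" can be estimated with the idealized Lander-Waterman model. The model is not part of the description above and is used here only as a rough guide. With N reads of length L from a genome of size G, the average coverage is c = NL/G, and the expected fraction of bases never sequenced is about e^(-c), assuming that reads start at uniformly random positions. The genome size and function names below are illustrative.

```python
import math


def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sequenced (c = N * L / G)."""
    return num_reads * read_length / genome_size


def fraction_uncovered(cov: float) -> float:
    """Expected fraction of the genome hit by no read, assuming reads
    start at uniformly random positions (Poisson approximation)."""
    return math.exp(-cov)


def reads_needed(target_cov: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_cov * genome_size / read_length)


if __name__ == "__main__":
    genome = 4_600_000   # illustrative, roughly bacterial-sized genome
    read_len = 500       # upper end of the read length quoted in the text
    n = reads_needed(8, read_len, genome)
    c = coverage(n, read_len, genome)
    print(f"{n} reads give {c:.1f}x coverage; "
          f"about {fraction_uncovered(c) * genome:.0f} bases expected in gaps")
```

Even at several-fold coverage the model predicts some unsequenced bases, which is consistent with the need for a separate gap-closing step.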
The clone contig approach
The clone contig approach relies on shotgun sequencing as well, but on a smaller scale. Instead of starting by breaking down the entire genome into random fragments, the clone contig approach starts by breaking it down into restriction fragments, which can then be cloned into artificial chromosome vectors and amplified. Each of the cloned restriction fragments can be sequenced and assembled by a standard shotgun approach. When the genome is cleaved into restriction fragments, it is only partially degraded. The amount of restriction enzyme applied to the DNA sample is sufficient to cut at only approximately 50% of the available restriction sites in the sample. This means that some fragments will span a particular restriction site, while other fragments will be cut at that particular site and will span other restriction sites. So, the clone library that is made up of these restriction fragments will contain overlapping fragments. The process of assembly starts with so-called chromosome walking: finding a specific clone, then finding the next clone that overlaps it, then the next, and so on. Usually, a probe hybridization technique or PCR is used to help identify the restriction fragment that has been inserted into each clone.
Genomes can be mapped at various levels of detail. Genetic linkage maps can be created, which assign the genes that give rise to particular traits to specific loci on the chromosome. They provide a set of ordered markers, sometimes very detailed depending on the organism, which can help researchers understand genome function (and provide a framework for assembling a full genome map). Physical maps can also be built in several ways: by digesting the DNA with restriction enzymes that cut at particular sites, by developing ordered clone libraries, and by fluorescence microscopy of single, restriction enzyme-cleaved DNA molecules fixed to a glass substrate. The key to each method is that, using a combination of labeled probes and known genetic markers (in restriction mapping) or by identifying overlapping regions (in library creation), the fragments of a genome can be ordered correctly into a highly specific map.
LIMS: Tracking mini sequences
Tracking the millions of unique DNA samples that may be isolated from the genome is one of the biggest information technology challenges. The systems that manage output from high-throughput sequencing are called Laboratory Information Management Systems (LIMS), and their development and maintenance make up the biggest share of bioinformatics work in industrial settings. Other high-throughput technologies, such as microarrays and cheminformatics, also require complicated LIMS support.
Sequence Assembly
Basecalling is only the first step in putting together a complete genome sequence (Fig. 3). Once the short fragments of sequence are obtained, they must be assembled into a complete sequence that may be many thousands of times their length. The next step is sequence assembly.
DNA sequencing using a shotgun approach provides thousands or millions of mini sequences, each 400-500 bases in length. The fragments are random and can partially or completely overlap each other. Because of these overlaps, every fragment in the set can be identified by sequence identity as adjacent to some number of other fragments. Each of those fragments overlaps yet another set of fragments, and so on. Finally, all the fragments need to be optimally joined together into one continuous sequence. However, repetitive sequences can complicate the assembly process. In addition, some fragments will be uncloneable, and the sequencing process will sometimes fail, leaving gaps in the DNA sequence that complicate automated assembly. If there isn't sufficient information at some point in the sequence for assembly to continue, the sequence contig that is being created comes to an end, and a new contig starts.
Figure 3. The shotgun DNA sequencing approach
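The overlap-and-join idea behind assembly can be sketched with a small greedy procedure: find the pair of fragments with the longest suffix-to-prefix overlap, merge them, and repeat until no overlaps remain. This is a toy illustration under strong assumptions (error-free reads, a single strand, exact overlaps), not the method used by any particular assembler, and all names are hypothetical.

```python
from itertools import permutations
from typing import List, Tuple


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`
    (0 if it is shorter than `min_len`)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: List[str], min_len: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the largest overlap.
    Returns the remaining contigs; more than one means gaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best: Tuple[int, str, str] = (0, "", "")
        for a, b in permutations(contigs, 2):
            k = overlap(a, b, min_len)
            if k > best[0]:
                best = (k, a, b)
        k, a, b = best
        if k == 0:  # no remaining overlaps: the contigs stay separate
            break
        contigs.remove(a)
        contigs.remove(b)
        contigs.append(a + b[k:])
    return contigs


print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

When two fragments share no overlap, the function returns them as separate contigs, mirroring the situation where assembly stops and a new contig starts. A greedy strategy like this is easily misled by repeated sequence, which is one reason repeats complicate real assemblies.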
Accessing Genome Information on the Web
Partial or complete DNA sequences from hundreds of genomes are available in GenBank. Putting those sequence records together into an intelligible representation of genome structure isn't so easy. There are several efforts underway to integrate DNA sequence with higher-level maps of genomes in a user-friendly format. So far, these efforts are focused on the human genome and genomes of important plant and animal model systems.
NCBI Genome Resources
NCBI offers access to a wide selection of web-based genome analysis tools from the Genomic Biology section of its main web site. Their interfaces are user-friendly, and NCBI supplies plenty of documentation explaining how to use the provided tools and databases.
Some of the available genomic tools are:
Entrez Genomes
Genome project information is available from the Entrez Genomes page at NCBI. Database listings are available for the full database or for related groups of organisms such as microorganisms, archaea, bacteria, eukaryotes, and viruses. Each entry in the database is linked to a taxonomy browser entry or a home page with further links to available information about the organism. If a genome map of the organism is available, a "See the Genome" link shows up on the organism's home page. From the home page, you can also download genome sequences and references.
Map Viewer
Depending on the genome, you can access links to overview maps showing known protein-coding regions, listings of coding regions for protein and RNA, and other information. Map Viewer distinguishes between four levels of information: the organism's home page, the graphical view of the genome, the detailed map for each chromosome (aligned to a master map from which the user can select where to zoom in), and the sequence view, which graphically displays annotations for regions of the genome sequence.
ORF Finder
The Open Reading Frame (ORF) Finder is a tool for locating open reading frames in a DNA sequence. ORF finders translate the sequence using the standard or a user-specified genetic code. In noncoding DNA, stop codons are frequently found, so a long stretch without an in-frame stop codon suggests a coding region. Information from the ORF finder can provide hints about the precise reading frame for a DNA sequence and about where coding regions start and stop. For many genomes found in the Entrez Genomes database, ORF Finder is available as an integrated tool from the map view of the genome.
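The basic scan performed by an ORF finder can be sketched in a few lines. The example below is a simplified illustration, not NCBI's implementation: it uses the standard stop codons, treats ATG as the only start codon, and scans only the three forward reading frames (a complete tool would also scan the reverse complement). The function name and the `min_codons` parameter are assumptions.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def find_orfs(dna: str, min_codons: int = 3) -> List[Tuple[int, int, int]]:
    """Return (frame, start, end) for each ORF on the forward strand.

    An ORF runs from an ATG to the first in-frame stop codon; `end` is the
    index just past the stop codon, so dna[start:end] is the full ORF.
    `min_codons` is the minimum number of codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None
    return orfs


seq = "CCATGAAACCCGGGTAACC"
for frame, start, end in find_orfs(seq):
    print(frame, seq[start:end])  # prints: 2 ATGAAACCCGGGTAA
```

Raising `min_codons` filters out the short ORFs that appear by chance in noncoding DNA.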
HomoloGene
HomoloGene is an automated system for constructing putative homology groups from the complete gene sets of a wide range of eukaryotic species. The ortholog pairs are identified either by curation of literature reports or calculation of similarity. The HomoloGene database can be searched using gene symbols, gene names, GenBank accession numbers, and other features.
Clusters of Orthologous Groups (COG)
COG is a database of orthologous protein groups. The database was developed by comparing protein sequences across 97 genomes. The entries in COG represent genome functions that are conserved throughout much of evolutionary history. The COG database can be searched by functional category, phylogenetic pattern, and a number of other properties.
Genome Annotation
In practice, genome annotation is the hyperlinking of content between multiple databases (sequence, structure, and functional genomics) so that they are fully linked together in a queryable system. It is a difficult process because a huge number of different pieces of information are attached to every gene in a genome, and it generally relies on relational databases to integrate genome sequence information with other data.
Genome Comparison
Pairwise or multiple comparison of genomes is a tool that can be used in many different studies, from answering basic questions of evolutionary biology (genetic polymorphisms) to very specific clinical questions (variations in phenotype).
Comparing whole genomes, rather than comparing genes one at a time, can help define regions of similarity within uncharacterized or even supposedly redundant DNA. Genome comparison also aids genomic annotation. Prototype genome comparisons help justify the sequencing of additional genomes, and they are useful both at the map level and directly at the sequence level.
PipMaker is a tool that computes alignments of similar regions in two DNA sequences. This is useful in identifying large-scale patterns of similarity in longer sequences. The process of using PipMaker is relatively simple. Starting with two FASTA-format sequence files, you first generate a set of instructions for masking sequence repeats (using the RepeatMasker server). This reduces the number of uninformative hits in the sequence comparison. The resulting information, plus a simple file containing a numerical list of known gene positions, is submitted to the PipMaker web server at Penn State University and the results are emailed to you.
Another program for ultra-fast alignment of large-scale DNA and protein sequences is MUMmer. Its first application was a detailed comparison of the genomes of two strains of M. tuberculosis. MUMmer can compare sequences millions of base pairs in length and produce colorful visualizations of regions of similarity. MUMmer is based on a data structure called a suffix tree, which makes it easy for the system to rapidly handle a large number of pairwise comparisons. MUMmer can also align incomplete genomes; it can easily handle the hundreds or thousands of contigs from a shotgun sequencing project and will align them to another set of contigs or a genome using the NUCmer program included with the system. If the species are too divergent for a DNA sequence alignment to detect similarity, the PROmer program can generate alignments based upon the six-frame translations of both input sequences.
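The notion of anchoring a genome comparison on matches that are unique in both sequences can be illustrated with a much simpler stand-in. The sketch below is not MUMmer's suffix-tree algorithm: it uses a hash table of fixed-length k-mers and reports positions where a k-mer occurs exactly once in each sequence. The function names and the choice of k are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def unique_kmers(seq: str, k: int) -> Dict[str, int]:
    """Map each k-mer that occurs exactly once in `seq` to its position."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    return {kmer: i for i, kmer in enumerate(kmers) if counts[kmer] == 1}


def unique_anchors(a: str, b: str, k: int) -> List[Tuple[int, int]]:
    """Positions (i, j) where a k-mer unique in both sequences matches."""
    ua, ub = unique_kmers(a, k), unique_kmers(b, k)
    return sorted((ua[kmer], ub[kmer]) for kmer in ua.keys() & ub.keys())


print(unique_anchors("ACGTTGCA", "TTACGTTG", 4))
# prints [(0, 2), (1, 3), (2, 4)]
```

Consecutive anchors on the same diagonal, as in the output above, indicate one longer shared region. Requiring uniqueness discards repeated sequence, which would otherwise produce many ambiguous matches.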
Functional Genomics
The advent of high-speed sequencing methods has changed the way we study the DNA sequences that code for proteins. It is now becoming possible to view the whole DNA sequence of a chromosome as a single entity and to examine how its parts work together to produce the complexity of the organism as a whole.
The functions of the genome break down loosely into a few obvious categories: metabolism, regulation, signaling, and construction. Metabolic pathways convert chemical energy derived from environmental sources into useful work in the cell. Regulatory pathways are biochemical mechanisms that control what genomic DNA does: when it is expressed or not. Genomic regulation involves not only expressed genes but structural and sequence signals in the DNA where regulatory proteins may bind. Signaling pathways control the fluxes of chemicals from one compartment in a cell to another. Many regulatory systems for the control of DNA transcription have been studied. Mapping these metabolic, regulatory, and signaling systems to the genome sequence is the goal of the field of functional genomics.
Sequence-Based Approaches for Analyzing Gene Expression
In addition to genome sequence, GenBank contains many other kinds of DNA sequence. Expressed sequence tag (EST) data for an organism can be an extremely useful starting point for analysis of gene expression. ESTs are partial sequences of cDNA clones of cellular mRNA. mRNA levels respond to changes in the cell or its environment; mRNA levels are tissue dependent, and they change during the life cycle of the organism as well. Quantitation of mRNA or cDNA provides a good measure of what a genome is doing under particular conditions.
NCBI offers a database called dbEST that provides access to several thousand libraries of ESTs. Quite a large number of these are human EST libraries, but there are libraries from dozens of other organisms as well.
DNA Microarrays
Recently, new technology has made it possible for researchers to rapidly explore expression patterns of entire genomes. A microarray (or gene chip) is a small glass slide whose surface is covered with 20,000 or more precisely placed spots, each containing a different DNA oligomer. cDNA can also be affixed to the slide to function as probes. Other media, such as thin membranes, can be used in place of slides. The key to the experiment is that each piece of DNA is immobilized, so any reaction that results in a change in microarray signal can be precisely assigned to a specific DNA sequence.
Microarrays are conceptually no different from traditional hybridization experiments such as Southern blots or Northern blots. In traditional blotting, the sample is immobilized; in microarray experiments, the probe is immobilized, and the amount of information that can be collected in one experiment is vastly larger. Figure 4 shows just a portion of a microarray scan.
Figure 4. A microarray scan
Microarray technology is now routinely used for DNA sequencing experiments, for instance in testing for the presence of polymorphisms. Another development is the use of microarrays for gene expression analysis. When a gene is expressed, an mRNA transcript is produced. If DNA oligomers complementary to the genes of interest are placed on the microarray, mRNA or cDNA can be hybridized to the chip, providing a rapid assay of whether or not those genes are being expressed. Experiments like these have been performed in yeast, for example, to test differences in whole-genome expression patterns in response to changes in ambient sugar concentration. Microarray experiments can provide information about the behavior of every one of an organism's genes in response to environmental changes.
Bioinformatics Challenges in Microarray Design and Analysis
Bioinformatics plays multiple roles in microarray experiments. In fact, it is difficult to conceive of microarrays as useful without the involvement of computers and databases. From the design of chips for specific purposes, to the quantitation of signals, to the extraction of groups of genes with linked expression profiles, microarray analysis is a process that is difficult, if not impossible, to do without the use of specific bioinformatics software.
In the public domain, several projects for linking expression data with associated sequences and annotations are ongoing. The biggest microarray database is the EMBL-EBI's ArrayExpress. The National Human Genome Research Institute (NHGRI) is currently offering a demonstration version of an array data management system that includes both analysis tools and relational database support.
Planning array experiments
A key element in microarray experiments is chip design. Chip design is a process that can take months. In order for microarray results to be clear and unambiguous, each DNA probe in the array must be sufficiently unique that only one specific target gene can hybridize with it. Otherwise, the amount of signal detected at each spot will be quantitatively incorrect.
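The uniqueness requirement can be checked computationally during chip design. The sketch below is a minimal illustration that considers exact matches only: it accepts a candidate probe if it (or its reverse complement) occurs in the intended target gene and in no other gene of the set. Real probe design also has to consider near matches and hybridization conditions; the gene names, sequences, and function names here are hypothetical.

```python
from typing import Dict, List


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def genes_matching(probe: str, genes: Dict[str, str]) -> List[str]:
    """Names of genes containing the probe or its reverse complement."""
    rc = reverse_complement(probe)
    return [name for name, seq in genes.items() if probe in seq or rc in seq]


def is_specific(probe: str, target: str, genes: Dict[str, str]) -> bool:
    """True if the probe matches the target gene and no other gene."""
    return genes_matching(probe, genes) == [target]


# Hypothetical gene set used only for illustration.
genes = {
    "geneA": "ATGGCCATTGTAATG",
    "geneB": "ATGGCCGGGTTTAAA",
}
print(is_specific("CATTGTAA", "geneA", genes))  # True: found only in geneA
print(is_specific("ATGGCC", "geneA", genes))    # False: shared with geneB
```

A probe that fails this test would pick up signal from more than one gene, which is the source of the quantitative error described above.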
Analyzing scanned microarray images
Once the array experiment is complete, you'll find yourself in possession of a lot of very large TIFF files containing scanned images of your arrays. The standard public-domain microarray analysis tools are the packages developed at Stanford. One of them, ScanAlyze, is a well-regarded and widely used image analysis tool. It supports TIFF files as well as the Stanford SCN format.
Numerous other software packages exist for microarray data analysis, such as:
GenomeStudio Software enables you to visualize and analyze microarray data generated on Illumina platforms. The software package is composed of discrete application modules that enable you to obtain a comprehensive view of the genome, gene expression, and gene regulation.
TM4 Microarray Software Suite is a set of open-source tools for microarray data management and reporting, image analysis, normalization and pipeline control, and data mining and visualization.
MAIA is a software package for automatic processing of the one- and two-color images produced in cDNA, CGH or protein microarray technologies.
AIM (Automatic Image Processing system for Microarray) provides a method for uncalibrated microarray gridding and quantitative image analysis. The system can be used as a standalone application as well as through command-line tools.
Koadarray is a fully automatic array image analysis package that can process single or multiple array images entirely unattended.
Clustering expression profiles
The most popular strategy for analysis of microarray data is the clustering of expression profiles. An expression profile can be visualized as a plot that describes the change in expression at one spot on a microarray grid over the course of the experiment. What the course of the experiment means depends on the context: it can be anything from changes in the concentration of nutrients in the medium in which cells are grown before their mRNA or cDNA is hybridized to the array, to cell cycle stages.
Different clustering methods, such as hierarchical clustering or SOMs (self-organizing maps) may work better in different situations, but the general aim of each of these methods is the same. If two genes change expression levels in the same way in response to a change in environment, it can be assumed that those genes are related. They may share something as simple as a promoter, or more likely, they are controlled by the same complex regulatory pathway. Automated clustering of expression profiles looks for similar features but doesn't necessarily point to causes for those changes.
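Hierarchical clustering of expression profiles can be sketched in pure Python. The example below is one possible choice among many, not a prescribed method: it measures the distance between two genes as 1 minus their Pearson correlation and merges the closest clusters by average linkage until a requested number of clusters remains. The gene names and expression values are made up, and constant profiles (zero variance) are assumed not to occur.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length, non-constant profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster_profiles(profiles: Dict[str, List[float]],
                     n_clusters: int) -> List[List[str]]:
    """Average-linkage agglomerative clustering with distance 1 - r."""
    clusters = [[gene] for gene in profiles]

    def dist(c1: List[str], c2: List[str]) -> float:
        pairs = [(g, h) for g in c1 for h in c2]
        total = sum(1 - pearson(profiles[g], profiles[h]) for g, h in pairs)
        return total / len(pairs)

    while len(clusters) > n_clusters:
        # Merge the two clusters with the smallest average distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: dist(*p))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(sorted(c1 + c2))
    return sorted(clusters)


# Made-up profiles: g1/g2 rise together, g3/g4 fall together.
profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8],
    "g3": [4, 3, 2, 1],
    "g4": [8, 6, 4, 3],
}
print(cluster_profiles(profiles, 2))  # prints [['g1', 'g2'], ['g3', 'g4']]
```

Correlation-based distance groups genes by the shape of their profile rather than by absolute expression level, so g1 and g2 cluster together even though g2 is expressed at twice the level of g1.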
Proteomics refers to techniques that simultaneously study the entire protein complement of a cell. While protein purification and separation methods are constantly improving, and the time to completion of protein structures determined by NMR and x-ray crystallography is decreasing, there is as yet no single way to rapidly crystallize the entire protein complement of an organism and determine every structure. The technological advance in biochemistry that most requires informatics support is the immobilized-gradient 2D-PAGE process and the subsequent characterization of separated protein products by mass spectrometry.
Experimental Approaches in Proteomics
Knowing when and at what levels genes are being expressed is only the first step in understanding how the genome determines phenotype. While mRNA levels are correlated with protein concentration in the cell, proteins are subject to post-translational modifications that can't be detected with a hybridization experiment. Experimental tools for determining protein concentration and activity in the cell are the crucial next step in the process.
Another high-throughput technology that is emerging as a tool in functional genomics is 2D gel electrophoresis. Two-dimensional gel electrophoresis can be used to separate protein mixtures containing thousands of components. The first dimension of the experiment is separation of the components of a solution along a pH gradient (isoelectric focusing). The second dimension is separation of the components orthogonally by molecular weight. Separation in these two dimensions can resolve even a complicated mixture of components. Figure 5 shows an example of a 2D-PAGE map from E. coli. The 2D-PAGE experiment separates proteins from a mixed sample so that individual proteins can be identified. Each spot on the map represents a different protein.
Figure 5. A 2D-PAGE map from E. coli
Using 2D gel electrophoresis allows very precise protein separations, resulting in standardized high-density data arrays. They can therefore be subjected to automated image analysis and quantitation and used for accurate comparative studies. The other advance that has put 2D gel technology at the forefront of modern molecular biology methods is the capacity to chemically analyze each spot on the gel using mass spectrometry. This allows the measurable biochemical phenomenon—the amount of protein found in a particular spot on the gel—to be directly connected to the sequence of the protein found at that spot.
Informatics Challenges in 2D-PAGE Analysis
The analysis pathway for 2D-PAGE gel images is quite similar to that for microarrays. The first step is image analysis, in which the positions of spots on the gel are identified and the boundaries between different spots are resolved. The molecular weight and isoelectric point (pI) of each protein in the gel can be estimated from its position.
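Estimating pI and molecular weight from spot position amounts to interpolating between marker proteins of known properties. The sketch below makes two assumptions that are not stated above: pI varies linearly with horizontal position between markers, and the logarithm of molecular weight varies linearly with migration distance in the second dimension. The marker coordinates and function names are hypothetical.

```python
import math
from typing import List, Tuple


def interpolate(x: float, points: List[Tuple[float, float]]) -> float:
    """Piecewise-linear interpolation through (x, y) calibration points."""
    pts = sorted(points)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("position outside the calibrated range")


def estimate_pi(x_pos: float, pi_markers: List[Tuple[float, float]]) -> float:
    """Estimate pI from horizontal position using (position, pI) markers."""
    return interpolate(x_pos, pi_markers)


def estimate_mw(y_pos: float, mw_markers: List[Tuple[float, float]]) -> float:
    """Estimate molecular weight from vertical position by interpolating
    log10(MW) against migration distance, then undoing the logarithm."""
    log_pts = [(y, math.log10(mw)) for y, mw in mw_markers]
    return 10 ** interpolate(y_pos, log_pts)


# Hypothetical markers: (pixel position, known value).
pi_markers = [(0.0, 4.0), (1000.0, 7.0)]
mw_markers = [(100.0, 100000.0), (500.0, 10000.0)]  # molecular weight in Da

print(f"pI = {estimate_pi(500.0, pi_markers):.2f}")     # pI = 5.50
print(f"MW = {estimate_mw(300.0, mw_markers):.0f} Da")  # MW = 31623 Da
```

Positions outside the range spanned by the markers raise an error rather than being extrapolated, since the calibration is only trustworthy between known points.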
Next, the spots are identified, and sequence information is used to make the connection between a particular spot and its gene sequence. In proteome analysis, the immobilized proteins can either be sequenced in situ or spots of protein can be physically removed from the gel, eluted, and analyzed using mass spectrometry methods such as electrospray ionization mass spectrometry (ESI-MS) or matrix-assisted laser desorption ionization mass spectrometry (MALDI).
Tools for Proteomics Analysis
Several public-domain programs for proteomics analysis are available on the Web. Most of these can be accessed through the excellent proteomics resource at Expert Protein Analysis System (ExPASy). ExPASy is the Swiss Institute of Bioinformatics Resource Portal which provides access to scientific databases and software tools (i.e., resources) in different areas of life sciences including proteomics, genomics, phylogeny, systems biology, population genetics, transcriptomics etc.
Biochemical Pathway Databases
Gene and protein expression are only two steps in the translation of genetic code to phenotype. Once genes are expressed and translated into proteins, their products participate in complicated biochemical interactions called pathways, as shown in Figure 6. Each pathway may supply chemical precursors to many other pathways, meaning that each protein has relationships not only to the preceding and following biochemical steps in a single pathway, but possibly to steps in several pathways. The complicated branching of metabolic pathways is far more difficult to represent and search than the linear sequences of genes and genomes.
Figure 6. A complex metabolic pathway
Several web-based services offer access to metabolic pathway information.
The best known metabolic pathway resource on the Web is the Kyoto Encyclopedia of Genes and Genomes (KEGG). KEGG provides its metabolic overviews as map illustrations rather than as text only, which can be easier to use for the visually oriented user. KEGG also provides listings of EC numbers and their corresponding enzymes broken down by level, and many helpful links to sites describing enzyme and ligand nomenclature in detail. The LIGAND database, associated with KEGG, is a useful resource for identifying small molecules involved in biochemical pathways. KEGG is searchable by sequence homology, keyword, and chemical entity; you can also input the LIGAND ID codes of two small molecules and find all of the possible metabolic pathways connecting them.
PathDB is another type of metabolic pathway database. It contains roughly the same information as KEGG (identities of compounds and metabolic proteins, and information about the steps that connect these entities), but it handles that information in a far more flexible way than the other metabolic databases. Instead of limiting searches to arbitrary metabolic pathways and describing pathways with preconceived images, PathDB allows you to find any set of connected reactions that link point A to point B, or compound A to compound B. In addition to the usual search tools, PathDB contains a pathway visualization interface that allows you to review any selected pathway and display different representations of it.
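Finding a chain of reactions that links compound A to compound B is a graph search problem. The sketch below is an illustration of the idea only and does not use the KEGG or PathDB interfaces: it stores reactions as a directed graph from each substrate to its products and uses breadth-first search to find a shortest route. The small network contains a few well-known steps of glucose metabolism; the data structure and function name are assumptions.

```python
from collections import deque
from typing import Dict, List, Optional


def find_pathway(reactions: Dict[str, List[str]],
                 start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search for a shortest chain of compounds that leads
    from `start` to `goal`; returns None if no route exists."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for product in reactions.get(path[-1], []):
            if product not in seen:
                seen.add(product)
                queue.append(path + [product])
    return None


# Toy directed network: substrate -> products (branching at glucose-6-phosphate).
network = {
    "glucose": ["glucose-6-phosphate"],
    "glucose-6-phosphate": ["fructose-6-phosphate", "6-phosphogluconolactone"],
    "fructose-6-phosphate": ["fructose-1,6-bisphosphate"],
}
print(" -> ".join(find_pathway(network, "glucose", "fructose-1,6-bisphosphate")))
```

Because each compound can feed several reactions, the search has to explore branches, which illustrates why pathway data is harder to represent and query than a linear sequence.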