LO2: Alignments and phylogenetic trees

Bioinformatics tools

There are several tools that study protein and DNA sequences, the most abundant type of biological data available electronically. The importance of sequence databases is from crucial importance to biological investigations and the pairwise sequence comparison is the most essential technique in bioinformatics. It allows you to search sequence-based datasets, to build evolutionary trees, to recognize specific features of protein families, to create homology models. But it's also the key for the development of larger projects, such as analyzing whole genomes, exploring the sequence determinants of protein structure, connecting expression data to genomic information, etc.

The following types of analysis can be performed by using sequence data:

  • Single sequence analysis and sequence characterization
  • Pairwise alignment and DNA / protein sequence searching
  • Multiple sequence alignment
  • Sequence motif discovery in multiple alignments
  • Phylogenetic analysis

Pairwise sequence comparison is the main tool of connecting biological function with genome and of transferring known information from one genome to another. The techniques for analysis of biological sequences is the most significant approaches for sequence data assessment. There are numerous freely accessible software tools for performing pairwise sequence comparison. Some of them are summarized in Table 1.

Table 1. Sequence Analysis Tools and Techniques

What you do Why you do it What you use to do it
Gene finding Identify possible coding regions in genomic DNA sequences GENSCAN, GeneWise, PROCRUSTES, GRAIL
DNA feature detection Locate splice sites, promoters, and sequences involved in regulation of gene expression CBS Prediction Server
DNA translation and reverse translation Convert a DNA sequence into protein sequence or vice versa "Protein machine" server at EBI
Pairwise sequence alignment (local) Locate short regions of homology in a pair of longer sequences BLAST, FASTA
Pairwise sequence alignment (global) Find the best full-length alignment between two sequences ALIGN
Sequence database search by pairwise comparison Find sequence matches that aren't recognized by a keyword search; find only matches that actually have some sequence homology BLAST, FASTA, SSEARCH

Mechanisms of Molecular Evolution

The discovery of DNA as the molecular basis of heredity and evolution made it possible to understand the process of evolution in a whole new way. It is known that often mutations occur in different parts of an organism's DNA: in the middle of genes that code for proteins or functional RNA molecules, in the middle of regulatory sequences that govern whether a gene to be expressed or not, or in the "middle of nowhere", in the regions between gene sequences. Mutations can have important effects on the organism's phenotype or they can have no apparent consequence. Over time mutations that are beneficial or at least not harmful to a species can become fixed in the population.

By comparative study of DNA sequences or of whole genomes, it's possible to develop quantitative methods for understanding when and how mutational events occurred, as well as how and why they were preserved to survive in existing species and populations. Genomics and bioinformatics have made it possible to study the evolutionary record and make statements about the phylogenetic relationship of one species to another. Changes in the identity of the residue (nucleotide or amino acid) at a given position in the sequence are scored using standard substitution scores (for example, a positive score for a match and a negative score for a mismatch) or substitution matrices. Insertions and deletions are scored with penalties for gap opening and gap extension.

Genefinders and DNA Features Detection

Once a large piece of DNA has been mapped and sequenced, the next important task is to understand its function. Analysis of single DNA for sequence features is a rapidly growing research area in bioinformatics. There are two reasons that genefinding and feature detection represent difficult problems. First, there are a huge number of protein-DNA interactions, many of which have not yet been experimentally characterized, and some of which differ from organism to organism. Current promoter detection algorithms yield about 20-40 false positives for each real promoter identified. Some proteins bind to specific sequences; others are more flexible and recognize different attachment sites. To complicate matters further, a protein can bind in one part of a chromosome but affect completely different region hundreds or thousands of base pairs away.

Genefinders are programs that try to identify all the open reading frames in unannotated DNA. They use a variety of approaches to locate genes, but the most successful combine content-based and pattern-recognition approaches. Content-based tools for gene prediction take advantage of the fact that the distribution of nucleotides in genes is different than in non-genes. Pattern-recognition methods look for characteristic sequences associated with genes (start and stop codons, promoters, splice sites) to deduce the presence and structure of a gene. In fact, the current generation of genefinders combine both methods with additional knowledge, such as gene structure or sequences of other, known genes.

Some genefinders are accessible only though web interfaces: the sequence that needs to be examined for genes is submitted to the program, it is processed, and the corresponding result is returned. On one hand, this eliminates the need for installation and maintenance of the specific software on your system, and it provides a relatively uniform interface for the different programs. On the other, if you plan to rely on the results of a genefinder, you should take the time to understand underlying algorithm, find out if the model is specific for a given species or family, and, in the case of content-based models, know which sequences they are.

Some frequently used programs in gene finding include Oak Ridge National Labs' GRAIL, GENSCAN, PROCRUSTES, and GeneWise. GRAIL combines evidence from a variety of signal and content information using a neural network. GENSCAN combines information about content statistics with a probabilistic model of gene structure. PROCRUSTES and GeneWise find open reading frames by translating the DNA sequence and comparing the resulting protein sequence with known protein sequences. PROCRUSTES compares potential ORFs with close homologs, while GeneWise compares the gene against a single sequence or a model of an entire protein family.

Feature Detection

In addition to their role in genefinder systems, feature-detection algorithms can be used on their own to find patterns in DNA sequences. Frequently, these tools help interpret newly sequenced DNA or choose targets for designing PCR primers or microarray oligomers. Some starting places for tools like these include the Center for Biological Sequence Analysis at the Technical University of Denmark, the CodeHop server at the Fred Hutchinson Cancer Research Center, and the Tools collection at the European Bioinformatics Institute. In addition to these special-purpose tools, another popular approach is to use motif discovery programs that automatically find common patterns in sequences.

DNA Translation

Before a protein can be synthesized, its sequence must be translated from the DNA into protein sequence. However, any DNA sequence can be translated in six possible ways. The sequence can be translated backward and forward. Because each amino acid in a protein is specified by three bases in the DNA sequence, there are three possible translations of any DNA sequence in each direction: one beginning with the very first character in the sequence, one beginning with the second character, and one beginning with the third character.

Figure 1 shows "back-translation" of a protein sequence (shown on the top line) into DNA, using the bacterial and plant plastid genetic code. However, note that nature has grouped the codons "sensibly": alanine (A) is always specified by a "G-C-X" codon, arginine (R) is specified either by a "C-G-X" codon or an "A-G-pyrimidine" codon, etc. This reduces the number of potential sequences that have to be checked if you (for example) try to write a program to compare a protein sequence to a DNA sequence database.

The more computationally efficient solution to this problem is simply to translate the DNA sequence database in all six reading frames.

Figure 1. Back-translation from a protein sequence

There are no markers in the DNA sequence to indicate where one codon ends and the next one begins. Consequently, unless the location of the start codon is known ahead of time, a double-stranded DNA sequence can be interpreted in any of six ways: an open reading frame can start at nucleotide i, at i+1, or at i+2 on either of both DNA strand. To interpret this uncertainty, when a protein is compared with a set of DNA sequences, the DNA sequences are translated into all six possible amino acid sequences, and the protein query sequence is compared with these resulting conceptual translations. This exhaustive translation is called a "six-frame translation" and is illustrated in Figure 2.

Figure 2. A DNA sequence and its translation in three of six possible reading frames

Because of the large number of codon possibilities for some amino acids, back-translation of a protein into DNA sequence can result in an extremely large number of possible sequences. However, codon usage statistics for different species are available and can be used to suggest the most likely backtranslation out of the range of possibilities. However, if you need to produce a six-frame translation of a single DNA sequence or translate a protein back into a set of possible DNA sequences, and you don't want to script it yourself, the Protein Machine server at the European Bioinformatics Institute (EBI) will do it for you.

Pairwise Sequence Comparison

Comparison of protein and DNA sequences is one of the fundamentals of bioinformatics. The ability to perform rapid automated comparisons of sequences facilitates assignment of function to a new sequence, prediction and construction of model protein structures, design and analysis of gene expression experiments. As biological sequence data has accumulated, it has become apparent that nature is conservative. A new biochemistry isn't created for each new species, and new functionality isn't created by the sudden appearance of whole new genes. Instead, incremental modifications give rise to genetic diversity and novel function. Thus, detection of similarity between sequences allows transferring of information about one sequence to other similar sequences with reasonable, though not always total, confidence.

Before making a comparative conclusion about one nucleic acid or protein sequence, a sequence alignment is required. The basic concept of selecting an optimal sequence alignment is simple. The two sequences are matched up in an arbitrary way. The quality of the match is scored. Then one sequence is moved with respect to the other and the match is scored again, until the best-scoring alignment is found.

What sounds simple in principle isn't at all simple in practice. So, using an automated method for finding the optimal alignment is the most suitable approach. Next question is how should alignments be scored? A scoring scheme can be as simple as +1 for a match and -1 for a mismatch. But, should gaps be allowed to open in the sequences to facilitate better matches elsewhere? If gaps are allowed, how should they be scored? What is the best algorithm for finding the optimal alignment of two sequences? And when an alignment is produced, is it necessarily significant? Can an alignment of similar quality be produced for two random sequences?

Figure 3 shows examples of three kinds of alignment. In each alignment, the sequences being compared are displayed, one above the other, such that matching residues are aligned. Similarities are indicated with plus (+). Information about the alignment is presented at the top, including percent identity (the number of identical matches divided by the length of the alignment) and score. Finally, gaps in one sequence relative to another are represented by dashes (-) for each position in that sequence occupied by a gap.

Figure 3. Three alignments: random, high scoring, and low scoring but meaningful

The first alignment is a random alignment, a comparison between two unrelated sequences. Notice that, in addition to the few identities and conservative mutations between the two, large gaps have been opened in both sequences to achieve this alignment. Second alignment is a high-scoring one: it shows a comparison of two closely related proteins. Compare that alignment with the third, a comparison of two distantly related proteins. It shows that fewer identical residues are shared by the sequences in the low-scoring alignment than in the high-scoring one. Still, there are several similarities or conservative changes.

In describing sequence comparisons, several different terms are frequently used. Sequence identity, sequence similarity, and sequence homology are the most important. Sequence similarity is meaningful only when possible substitutions are scored according to the probability with which they occur. In protein sequences, amino acids of similar chemical properties are found to substitute for each other much more readily than dissimilar amino acids. Sequence homology is a more general term that indicates evolutionary relatedness among sequences. It is common to speak of a percentage of sequence homology when comparing two sequences, although that percentage may include a mixture of identical and similar sites. Finally, sequence homology refers to the evolutionary relatedness between sequences. Two sequences are said to be homologous if they are both derived from a common ancestral sequence. The terms similarity and homology are often used interchangeably to describe sequences, but, however, they mean different things. Similarity refers to the presence of identical and similar sites in the two sequences, while homology reflects a sharing of a common ancestor.

Scoring Matrices

The most important information when evaluating a sequence alignment is whether it is random, or meaningful. If the alignment is meaningful, the question is how meaningful it is. This is assessed by constructing a scoring matrix. A scoring matrix is a table of values that describe the probability of a residue (amino acid or base) pair occurring in an alignment. The values in a scoring matrix are logarithms of ratios of two probabilities. One is the probability of random occurrence of an amino acid in a sequence alignment. This value is simply the product of the independent frequencies of occurrence of each of the amino acids. The other is the probability of meaningful occurrence of a pair of residues in a sequence alignment. These probabilities are derived from samples of actual sequence alignments that are known to be valid.

Figure 4 shows an example of a BLOSUM62 substitution matrix for amino acids.

See the source image

Figure 4. The BLOSUM62 substitution matrix for amino acids

Substitution matrices for amino acids are complicated because they reflect the chemical nature and frequency of occurrence of the amino acids. For example, in the BLOSUM matrix, glutamic acid (E) has a positive score for substitution with aspartic acid (D) and also with glutamine (Q). Both these substitutions are chemically conservative. Aspartic acid has a sidechain that is chemically similar to glutamic acid, though one methyl group shorter. On the other hand, glutamine is similar in size and chemistry to glutamic acid, but it is neutral while glutamic acid is negatively charged. Substitution scores for glutamic acid with residues such as isoleucine (I) and leucine (L) are negative

Substitution matrices for bases in DNA or RNA sequence are very simple. In most cases, it is reasonable to assume that A:T and G:C occur in roughly equal proportions. Commonly used substitution matrices include the BLOSUM and PAM matrices. When using BLAST, you need to select a scoring matrix. Most automated servers select a default matrix for you, and if you're just doing a quick sequence search, it's fine to accept the default.

BLOSUM matrices are derived from the Blocks database. The numerical value (e.g., 62) associated with a BLOSUM matrix represents the cutoff value for the clustering step. A value of 62 indicates that sequences were put into the same cluster if they were more than 62% identical. By allowing more diverse sequences to be included in each cluster, lower cutoff values represent longer evolutionary time scales, so matrices with low cutoff values are appropriate for seeking more distant relationships. BLOSUM62 is the standard matrix for ungapped alignments, while BLOSUM50 is more commonly used when generating alignments with gaps.

Point accepted mutation (PAM) matrices are scaled according to a model of evolutionary distance from alignments of closely related sequences. The most commonly used PAM matrix is PAM250. However, comparison of results using PAM and BLOSUM matrices suggest that BLOSUM matrices are better at detecting biologically significant similarities.

Gap Penalties

DNA sequences change not only by point mutation, but by insertion and deletion of residues as well. Consequently, it is often necessary to introduce gaps into one or both of the sequences being aligned to produce a meaningful alignment between them. Most algorithms use a gap penalty for the introduction of a gap in the alignment. Most sequence alignment models use affine gap penalties, in which the rate of opening a gap in a sequence is different from the rate of extending a gap that has already been started. Of these two penalties—-the gap opening penalty and the gap extension penalty—-the gap opening penalties tend to be much higher than the associated extension penalty. Scores of -11 for gap opening and -1 for gap extension are commonly used in conjunction with the BLOSUM 62 matrix.

Global Alignment

One possibility is to align two sequences along their whole length. This algorithm is called the Needleman-Wunsch algorithm. In this case, an optimal alignment is built up from high-scoring alignments of subsequences, stepping through the matrix from top left to bottom right. Only the best-scoring path can be traced through the matrix, resulting in an optimal alignment.

Local Alignment

The most commonly used sequence alignment tools rely on a strategy called local alignment. The global alignment strategy assumes that the two sequences to be aligned are known and are to be aligned over their full length. However, often a sequence is searched against a sequence database with unknown sequences, or a short query sequence is used to match with a very long DNA sequence. For example, in protein or gene sequences that do have some evolutionary relatedness, but which have diverged significantly from each other, short homologous segments may be all the evidence of sequence homology that remains. The algorithm that performs local alignment of two sequences is known as the Smith-Waterman algorithm. A local alignment isn't required to extend from beginning to end of the two sequences being aligned. If the cumulative score up to some point in the sequence is negative, the alignment can be abandoned and a new alignment started. The alignment can also end anywhere in the matrix.

Tools for local alignment

One of the most frequently reported implementations of the Smith-Waterman algorithm for database searching is the program SSEARCH, which is part of the FASTA distribution. LALIGN, also part of the FASTA package, is an implementation of the Smith-Waterman algorithm for aligning two sequences.

Sequence Queries Against Biological Databases

A common application of sequence alignment is searching a database for sequences that are similar to a query sequence. In these searches, an alignment of a sequence hundreds or thousands of residues long is matched against a database of at least tens of thousands of comparably sized sequences.

Local Alignment-Based Searching Using BLAST

By far, the most popular tool for searching sequence databases is a program called BLAST (Basic Local Alignment Search Tool). It performs pairwise comparisons of sequences, seeking regions of local similarity, rather than optimal global alignments between whole sequences. BLAST can perform hundreds or even thousands of sequence comparisons in a matter of minutes. And in less than a few hours, a query sequence can be compared to an entire database to find all similar sequences.

The BLAST algorithm

Local sequence alignment searching using a standard Smith-Waterman algorithm is a fairly slow process. The BLAST algorithm, which speeds up local sequence alignment, has three basic steps. First, it creates a list of all short sequences (called WORDS) that score above a threshold value when aligned with the query sequence. Next, the sequence database is searched for occurrences of these words. Because the word length is so short (3 residues for proteins, 11 residues for nucleic acids), it's possible to search a precomputed table of all words and their positions in the sequences for improved speed. These matching words are then extended into ungapped local alignments between the query sequence and the sequence from the database. Extensions are continued until the score of the alignment drops below a threshold. The top-scoring alignments in a sequence, or maximal-scoring segment pairs (MSPs), are combined where possible into local alignments. The new additions to the BLAST software package also search for gapped alignments.

NCBI BLAST and WU-BLAST

There are two implementations of the BLAST algorithm: NCBI BLAST and WU-BLAST. Both can be used as web services and as downloadable software packages. NCBI BLAST is available from the National Center for Biotechnology Information (NCBI), while WU-BLAST is developed and maintained at Washington University. NCBI BLAST is the more commonly used of the two. The most recent versions of this program have focused on the development of methods for comparing multiple-sequence profiles. WU-BLAST, on the other hand, has developed a different system for handling gaps as well as a number of features that are useful for searching genome sequences.

Different BLAST programs

The four main executable programs in the BLAST distribution are:

[blastall]

Performs BLAST searches using one of five BLAST programs: blastp, blastn, blastx, tblastn, or tblastx

[blastpgp]

Performs searches in PSI-BLAST or PHI-BLAST mode

[bl2seq]

Performs a local alignment of two sequences

[formatdb]

Converts a FASTA-format flat file sequence database into a BLAST database

blastall encompasses all the major options for ungapped and gapped BLAST searches. A full list of its command-line arguments can be displayed with the command blastall - :

[-p]

Program name. Its options include:

blastp

Protein sequence (PS) query versus PS database

blastn

Nucleic acid sequence (NS) query versus NS database

blastx

NS query translated in all six reading frames versus PS database

tblastn

PS query versus NS database dynamically translated in all six reading frames

tblastx

Translated NS query versus translated NS database—computationally intensive

blastpgp allows you to use two new BLAST modes: PHI-BLAST (Pattern Hit Initiated BLAST) and PSI-BLAST (Position Specific Iterative BLAST). PHI-BLAST uses protein motifs, such as those found in PROSITE and other motif databases, to increase the likelihood of finding biologically significant matches. PSI-BLAST uses an iterative alignment procedure to develop position-specific scoring matrices, which increases its capability to detect weak pattern matches.

bl2seq allows the comparison of two known sequences using the blastp or blastn programs. Most of the command-line options for bl2seq are similar to those for blastall.

Evaluating BLAST results

A BLAST search provides three related pieces of information that allow you to interpret its results: raw scores, bit scores, and E-values.

The raw score for a local sequence alignment is the sum of the scores of the maximal-scoring segment pairs (MSPs) that make up the alignment. Bit scores are raw scores that have been converted from the log base of the scoring matrix that creates the alignment to log base 2. E-values provide information about the likelihood that a given sequence alignment is significant. An alignment's E-value indicates the number of alignments one expects to find with a score greater than or equal to the observed alignment's score in a search against a random database. Thus, a large E-value (5 or 10) indicates that the alignment probably has occurred by chance, and that the target sequence has been aligned to an unrelated sequence in the database. E-values of 0.1 or 0.05 are typically used as cutoffs in sequence database searches. Using a larger E-value cutoff in a database search allows more distant matches to be found, but it also results in a higher rate of spurious alignments. Of the three, E values are the values most often reported in the literature.

There is a limit beyond which sequence similarity becomes uninformative about the relatedness of the sequences being compared. This limit is encountered below approximately 25% sequence similarity for protein sequences. In the case of protein sequences with low sequence similarity that are still believed to be related, structural analysis techniques may provide evidence for such a relationship. Where structure is unknown, sequences with low similarity are categorized as unrelated, but that may mean only that the evolutionary distance between sequences is so great that a relationship can't be detected.

Local Alignment Using FASTA

Another method for local sequence alignment is the FASTA algorithm. FASTA precedes BLAST and like BLAST, it is available both as a service over the Web and as a downloadable set of programs.

The FASTA algorithm

FASTA first searches for short sequences (called ktups) that occur in both the query sequence and the sequence database. Then, using the BLOSUM50 matrix, the algorithm scores the 10 ungapped alignments that contain the most identical ktups. These ungapped alignments are tested for their ability to be merged into a gapped alignment without reducing the score below a threshold. For those merged alignments that score over the threshold, an optimal local alignment of that region is then computed, and the score for that alignment (called the optimized score) is reported.

FASTA ktups are shorter than BLAST words, typically 1 or 2 for proteins, and 4 or 6 for nucleic acids. Lower ktup values result in slower but more sensitive searches, while higher ktup values yield faster searches with fewer false positives.

The FASTA programs

The FASTA distribution contains search programs that are analogous to the main BLAST modes, with the exception of PHI-BLAST and PSI-BLAST.

[fasta]

Compares a protein sequence against a protein database (or a DNA sequence against a DNA database) using the FASTA algorithm

[ssearch]

Compares a protein sequence against a protein database (or DNA sequence against a DNA database) using the Smith-Waterman algorithm

[fastx /fasty]

Compares a DNA sequence against a protein database, performing translations on the DNA sequence

[tfastx /tfasty]

Compares a protein sequence against a DNA database, performing translations on the DNA sequence database

[align]

Computes the global alignment between two DNA or protein sequences

[lalign]

Computes the local alignment between two DNA or protein sequences

Multifunctional Tools for Sequence Analysis

Several research groups and companies have assembled web-based interfaces to collections of sequence tools. The best of these have fully integrated tools, public databases, and the ability to save a record of user data and activities from one use to another. If you're searching for matches to just one or a few sequences and you want to search the standard public databases, these portals can save you a lot of time while providing most of the functionality and ease of use of a commercial sequence analysis package.

The Biology Workbench

The Biology Workbench resource is freely available to academic users and offers keyword and sequence-based searching of nearly 40 major sequence databases and over 25 whole genomes. Both BLAST and FASTA are implemented as search and alignment tools in the Workbench, along with several local and global alignment tools, tools for DNA sequence translation, protein sequence feature analysis, multiple sequence alignment, and phylogenetic tree drawing. Although its interface can be somewhat complicated, involving a lot of window scrolling and button clicking, the Biology Workbench is comprehensive, convenient, and accessible web-based toolkit. One of its main benefits is that many sequence file formats are accepted and can move easily from keyword-based database search, to sequence-based search, to multiple alignment, to phylogenetic analysis.

EMBOSS

EMBOSS is "The European Molecular Biology Open Software Suite". EMBOSS is a free Open Source software analysis package specially developed for the needs of the molecular biology user community. The software automatically copes with data in a variety of formats and even allows transparent retrieval of sequence data from the web. Within EMBOSS you will find numerous applications covering areas such as:

  • Sequence alignment,
  • Rapid database searching with sequence patterns,
  • Protein motif identification, including domain analysis,
  • Nucleotide sequence pattern analysis---for example to identify CpG islands or repeats,
  • Codon usage analysis for small genomes,
  • Rapid identification of sequence patterns in large scale sequence sets,
  • Presentation tools for publication, and much more.

Funding

Disclaimer

The European Commission support for the production of this publication does not constitute endorsement of the contents which reflects the views only of the authors, and the Commission cannot be held responsi-ble for any use which may be made of the information contained therein.