LO5: Alignments and phylogenetic trees

Introduction

The idea of using sequence alignment is to find and compare pairs of related sequences. Biologically interesting problems, however, often involve comparing more than two sequences at once. BLAST or FASTA search can yield a large number of sequences that match the query. One approach to compare all these resulting sequences with each other is to perform pairwise alignments of all pairs of sequences, then study these pairwise alignments individually. It's more efficient (and easier to comprehend), however, if you compare all the sequences at once, then examine the resulting ensemble alignment. This process is known as multiple sequence alignment. Multiple sequence alignments can be used to study groups of related genes or proteins, to infer evolutionary relationships between genes, and to discover patterns that are shared among groups of functionally or structurally related sequences.

Multiple Sequence Alignment

Multiple sequence alignment techniques are generally applied to protein sequences. They are used for both evolutionary and structural similarity search among the proteins encoded by each sequence in the alignment. The proteins with closely related functions are similar in both sequence and structure from organism to organism. However, that sequence tends to change more rapidly than structure in the course of evolution. In multiple alignments generated from sequence data alone, regions that are similar in sequence are usually found to be superimposable in structure as well.

Progressive Strategies for Multiple Alignment

A common approach to multiple sequence alignment is to progressively align pairs of sequences. This strategy can be described as follows: a starting pair of sequences is selected and aligned, then each subsequent sequence is aligned to the previous alignment. Like the Needleman - Wunsch and Smith-Waterman algorithms for sequence alignment, progressive alignment is an instance of a heuristic algorithm. It decomposes a problem into pieces, then choose the best solution to each piece without paying attention to the problem as a whole. In the case of progressive alignment, the overall problem (alignment of many sequences) is decomposed into a series of pairwise alignment steps.

Because it is a heuristic algorithm, progressive alignment isn't guaranteed to find the best possible alignment. However, it is efficient and produces biologically meaningful results. The methods used differ in several respects: how they choose the initial pair of sequences to align, whether they align every subsequent sequence to a single cumulative alignment or create subfamilies, and how they score individual alignments and alignments of individual sequences to previous alignments.

Multiple Alignment with Clustal Omega

One commonly used program for progressive multiple sequence alignment is Clustal Omega. The heuristic used in Clustal Omega is based on phylogenetic analysis. First, a pairwise distance matrix for all the sequences to be aligned is generated, and a guide tree is created using the neighbor-joining algorithm. Then, each of the most closely related pairs of sequences are aligned to each other. Next, each new alignment is analyzed to build a sequence profile. Finally, alignment profiles are aligned to each other or to other sequences until a full alignment is built.

This strategy produces reasonable alignments under a range of conditions. For example, it's not guaranteed for distantly related sequences. Pairwise sequence alignment by dynamic programming is very accurate for closely related sequences regardless of which scoring matrix or penalty values are used. Using multiple sequences to create profiles increases the accuracy of pairwise alignment for more distantly related sequences.

There are several parameters involved in multiple sequence alignment - scoring matrices and gap penalties associated with the pairwise alignment steps, weighting parameters that alter the scoring matrix used in sequence-profile and profile-profile alignments. In Clustal Omega, these are set from the Set your parameters menu (Fig. 1).

The pairwise alignment parameters are familiar and have the same meaning in multiple alignment as they do in pairwise alignment. The multiple alignment parameters include gap opening and gap extension penalties for the multiple alignment process—to be used when fine-tuning alignments—and a maximum allowable delay, in terms of sequence length, for the start of divergent sequences at the beginning of the alignment.

One of Clustal Omega's heuristics is that, in protein sequence alignment, different scoring matrices are used for each alignment based on expected evolutionary distance. If two sequences are close neighbors in the tree, a scoring matrix optimized for close relationships aligns them. Distant neighbors are aligned using matrices optimized for distant relationships. Thus, when prompted to choose a series of matrices in the Multiple Alignment Parameters menu, it means just that: use BLOSUM62 for close relationships and BLOSUM45 for more distant relationships, rather than the same scoring matrix for all pairwise alignments.

Fig. 1. Clustal Omega multiple sequence alignment program

Fig. 1. Clustal Omega multiple sequence alignment program

Sequence Logos

A way to view sequence alignments, and one which has become quite popular recently, is the sequence logo format. This format is especially good for shorter sequence regions, such as protein motifs. Consensus sequences represent each position in an alignment with the residue that is most commonly found in that position. Sequence logos, as illustrated in Figure 2, are a graphical way to represent relative frequencies, information content, order of substitution preference, and other characteristics of each site in an alignment.

Figure 2. A sequence logo

Figure 2. A sequence logo

The software for creating sequence logos is part of a larger group of programs called the DELILA package. You actually need only two of the many DELILA programs (alpro and makelogo) to create logos from aligned sequences. An easier approach for the novice is to use the Sequence logo web server. Aligned sequences can be submitted to this server in FASTA alignment format.

Phylogenetic Analysis

One of the applications of the multiple sequence alignment is the phylogenetic inference. Phylogenetic inference is the process of developing hypotheses about the evolutionary relatedness of organisms based on their observable characteristics.

While hand-drawn trees of life may branch according to what is essentially an artist's conception of evolutionary relationships, modern phylogenetic trees are strictly binary. Accordingly, at any branch point, a parent branch splits into only two daughter branches. Binary trees can approximate any other branching pattern, and the assumption that trees are binary greatly simplifies the tree-building algorithms.

The length of branches in a quantitative phylogenetic tree can be determined in more than one way. For example, the evolutionary distance between pairs of sequences is one way to assign branch length.

While a phylogeny of species generally has a root, assuming that all species have a specific common ancestor, a phylogenetic tree derived from sequence data may be rooted or unrooted. It isn't too difficult to calculate the similarity between any two sequences in a group and to determine where branching points belong. It is much harder to pinpoint which sequence in such a tree is the common ancestor, or which pair of sequences can be selected as the first daughters of a common ancestor. While some phylogenetic inference programs do offer a hypothesis about the root of a tree, most simply produce unrooted trees. Figure 3 and Figure 4 illustrate rooted and unrooted phylogenetic trees.

Figure 3. A rooted phylogenetic tree

Figure 3. A rooted phylogenetic tree

Figure 4. An unrooted phylogenetic tree

Figure 4. An unrooted phylogenetic tree

A phylogeny based on sequence alignment may be a tree, and it may describe a biological entity, but it takes far more than a single evolutionary analysis to draw conclusions about whole-organism phylogeny. Sequence-based phylogenies are quantitative. When they are built based on sufficient amounts of data, they can provide valuable, scientifically valid evidence to support theories of evolutionary history. However, a single sequence based phylogenetic analysis can only quantitatively describe the input data set. It isn't valid as a quantitative tool beyond the bounds of that data set.

It has been shown, by comparative analysis of phylogenies generated for different protein and gene families, that one protein may evolve more quickly than another, and that a single protein may evolve more quickly in some organisms than in others. Thus, the phylogenetic analysis of a sequence family is most informative about the evolution of that particular gene. Only by analysis of much larger sets of data can theories of whole-organism phylogeny be suggested.

Phylogenetic Trees Based on Pairwise Distances

One of the easiest algorithms for tree drawing is the pairwise distance method. This method produces a rooted tree. The algorithm is initialized by defining a matrix of distances between each pair of sequences in the input set. Sequences are then clustered according to distance, in effect building the tree from the branches down to the root.

Distances can be defined by more than one measure, but one of the more common and simple measures of dissimilarity between DNA sequences is the Jukes-Cantor distance, which is logarithmically related to the fraction of sites at which two sequences in an alignment differ. The fraction of matching positions in an ungapped alignment between two unrelated DNA sequences approaches 25%. Consequently, the Jukes-Cantor distance is scaled such that it approaches infinity as the fraction of unmatched residue pairs approaches 75%.

The pairwise clustering procedure used for tree drawing (UPGMA, unweighted pair group method using arithmetic averages) is intuitive. Each sequence is assigned to its own cluster, and a branch of the tree is started for that sequence at height zero in the tree. Then, the two clusters that are closest together in terms of whatever distance measure has been chosen are merged into a single cluster. A branch point (or node) is defined that connects the two branches. The node is placed at a height in the tree that reflects the distance between the two branches that have been joined. This process is repeated iteratively, until there are only two clusters left. When they are joined, the root of the tree is defined. The branch lengths in a tree constructed using this process theoretically reflect evolutionary time.

Phylogenetic Trees Based on Neighbor Joining

Neighbor joining is another distance matrix method. It eliminates a possible error that can occur when the UPGMA method is used. UPGMA produces trees in which the branches that are closest together by absolute distance are placed as neighbors in the tree. This assumption places a restriction on the topology of the tree that can lead to incorrect tree construction under some conditions.

In order to get around this problem, the neighbor-joining algorithm searches not just for minimum pairwise distances according to the distance metric, but for sets of neighbors that minimize the total length of the tree. Neighbor joining is the most widely used of the distance-based methods and can produce reasonable trees, especially when evolutionary distances are short.

Phylogenetic Trees Based on Maximum Parsimony

A more widely used algorithm for tree drawing is called parsimony. Parsimony is related to a principle that states the simplest explanation is probably the correct one. Parsimony searches among the set of possible trees to find the one requiring the least number of nucleic acid or amino acid substitutions to explain the observed differences between sequences.

The only sites considered in a parsimony analysis of aligned sequences are those that provide evolutionary information — that is, those sites that favor the choice of one tree topology over another. A site is considered to be informative if there is more than one kind of residue at the site, and if each type of residue is represented in more than one sequence in the alignment. Then, for each possible tree topology, the number of inferred evolutionary changes at each site is calculated. The topology that is maximally parsimonious is that for which the total number of inferred changes at all the informative sites is minimized. In some cases there may be multiple tree topologies that are equally parsimonious.

As the number of sequences increases, so does the number of possible tree topologies. After a certain point, it is impossible to exhaustively enumerate the scores of each topology. A shortcut algorithm that finds the maximally parsimonious tree in such cases is the branch-and-bound algorithm. This algorithm establishes an upper bound for the number of allowed evolutionary changes by computing a tree using a fast or arbitrary method. As it evaluates other trees, it throws out any exceeding this upper bound before the calculation is completed.

Phylogenetic Trees Based on Maximum Likelihood Estimation

Maximum likelihood methods also evaluate every possible tree topology given a starting set of sequences. Maximum likelihood methods are probabilistic. They search for the optimal choice by assigning probabilities to every possible evolutionary change at informative sites, and by maximizing the total probability of the tree. Maximum likelihood methods use information about amino acid or nucleotide substitution rates, analogous to the substitution matrices that are used in multiple sequence alignment.

Software for Phylogenetic Analysis

There is a variety of phylogenetic analysis software available for many operating systems. One of the most extensively is the PHYLIP package.

PHYLIP

The phylogenetic analysis package PHYLIP contains 30 programs that implement different phylogenetic analysis algorithms. Each of the programs runs separately, from the command line. By default, most of the programs look for an input file called infile and write an output file called outfile. Rather than entering parameters via command-line flags, as with BLAST, the programs have an interactive text interface that prompts you for information.

The following are frequently used the PHYLIP programs:

PROTPARS

Infers phylogenies from protein sequence input using the parsimony method

PROTDIST

Computes an evolutionary distance matrix from protein sequence input, using maximum likelihood estimation

DNAPARS

Infers phylogenies from DNA sequence input using parsimony

DNAPENNY

Finds all maximally parsimonious phylogenies for a set of sequences using a branch-and-bound search

DNAML

Infers phylogenies from DNA sequence input using maximum likelihood estimation

DNADIST

Computes a distance matrix from DNA sequence input using the Jukes-Cantor distance or one of three other distance criteria

NEIGHBOR

Infers phylogenies from distance matrix data using either the pairwise clustering or the neighbor joining algorithm

DRAWGRAM

Draws a rooted tree based on output from one of the phylogeny inference programs

DRAWTREE

Draws an unrooted tree based on output from one of the phylogeny inference programs

CONSENSE

Computes a consensus tree from a group of phylogenies

RETREE

Allows interactive manipulation of a tree by the user—not based on data

PHYLIP is a flexible package, and the programs can be used together in many ways. To analyze a set of protein sequences with PHYLIP, you can:

  1. Read a multiple protein sequence alignment using PROTDIST and create a distance matrix.

  2. Input the distance matrix to NEIGHBOR and generate a phylogeny based on neighbor joining.

  3. Read the phylogeny into DRAWTREE and produce an unrooted phylogenetic tree.

Or, you can:

  1. Read a multiple sequence alignment using PROTPARS and produce a phylogeny based on parsimony.

  2. Read the phylogeny using DRAWGRAM and produce a rooted tree.

Each of the PHYLIP programs is thoroughly documented in the \.doc* files available with the PHYLIP distribution. This documentation has been converted into HTML by several groups.

Generating input for PHYLIP with Clustal Omega

The multiple sequence alignment program Clustal Omega draws phylogenetic trees with the neighbor joining method. Perhaps more importantly, it can read sequences in various input formats and then write PHYLIP - format files from multiple sequence alignments.

Profiles and Motifs

In addition to studying relationships between sequences, one of the most successful applications of multiple sequence alignments is in discovering novel, related sequences. This profile- or motif-based analysis uses data derived from multiple alignments to construct and search for sequence patterns.

Multiple sequence alignments can span the full sequence of the proteins involved or a single region of similarity, depending on their purpose. Multiple sequence alignments, such as the one shown in Figure 5, are generally built up by iterative pairwise comparison of sequences and sequence groups, rather than by explicit multiple alignment.

Figure 5. A multiple sequence alignment, shown using Clustal Omega

Figure 5. A multiple sequence alignment, shown using Clustal Omega

A sequence motif is a locally conserved region of a sequence, or a short sequence pattern shared by a set of sequences. The term "motif" most often refers to any sequence pattern that is predictive of a molecule's function, a structural feature, or family membership. Motifs can be detected in protein, DNA, and RNA sequences, but the most common use of motif-based analyses is the detection of sequence motifs that correspond to structural or functional features in proteins. Motifs are generated from multiple sequence alignments and can be displayed as patterns of amino acids (such as those in the Prosite database) or as sequence logos.

Motifs can be created for protein families, or sets of proteins whose members are evolutionarily related. Protein families can consist of many proteins that range from very similar to quite diverse. A sequence profile is a quantitative or qualitative method of describing a motif. A profile can be expressed in its most basic form as a list of the amino acids occurring at each position in the motif. Position-specific scoring matrix (PSSM) is used when detecting a motif. Unlike a standard scoring matrix, the first dimension of the matrix is the length of the motif; the second dimension consists of the 20 amino acid possibilities. For each position in the matrix, there is a probability score for the occurrence of each amino acid. Most methods for developing position-specific scoring matrices normalize the raw probabilities with respect to a standard scoring matrix such as BLOSUM62.

Motif Databases

As profiles and other consensus representations of sequence families can be used to search sequence databases, it is no surprisingly that there are motif databases that can be searched using individual sequences. Motif databases contain representations of conserved sequences shared by a sequence family and their main use is in annotation of unknown sequences.

Motifs are generated by a variety of methods and with different aims. Some rely on automated analysis, but there is often a large amount of hands-on labor invested in the database by an expert curator. Because they store only those motifs that are present in reasonably large families, motif databases are small relative to GenBank, and they don't reflect the extent of the protein structure or sequence databases. An unsuccessful search against a motif database doesn't mean your sequence contains no detectable pattern. It could be part of a family that has not yet been curated or that doesn't meet the criteria of the particular pattern database you've searched. For proteins that do match defined families, a search against the pattern databases can yield a lot of homology information very quickly.

Blocks

Blocks, a service of the Fred Hutchinson Cancer Research Center, is an automatically generated database of ungapped multiple sequence alignments that correspond to the most conserved regions of proteins. Blocks is created using a combination of motif-detection methods, beginning with a step that exhaustively searches all spaced amino acid triplets in the sequence to discover a seed alignment, followed by a step that extends the alignment to find an aligned region of maximum length. The Blocks database provides several useful search services, including IMPALA (which uses the BLAST statistical model to compare a sequence against a library of profiles) and LAMA (Local Alignment of Multiple Alignments a program for comparing an alignment of your own sequences against a database of Blocks).

PROSITE

PROSITE is an expert-curated database of patterns hosted by the Swiss Institute of Bioinformatics. PROSITE uses a single consensus pattern to characterize each family of sequences. Patterns in PROSITE are carefully selected based on data published in the primary literature or on reviews describing the functionality of specific groups of proteins. PROSITE contains pattern information as well as position-specific scoring matrices that can detect new instances of the pattern.

Pfam

Pfam is a database of alignments of protein domain families. Pfam a curated database of over 2,700 gapped profiles, most of which cover whole protein domains. Its entries are generated automatically by applying a clustering method. Pfam entries begin with a seed alignment, a multiple sequence alignment that the curators are confident is biologically meaningful and that may involve some manual editing. From each seed alignment, a profile hidden Markov model is constructed and used to search a nonredundant database of available protein sequences. A full alignment of the family is produced from the seed alignments and any new matches. This process can be repeated to produce more extensive families and detect remote matches. Pfam entries are annotated with information extracted from the scientific literature, and incorporate structural data when available (Fig. 6).

Fig. 6. Pfam entries representation

Fig. 6. Pfam entries representation

PRINTS-S

PRINTS-S is a database of protein motifs similar to PROSITE, except that it uses "fingerprints" composed of more than one pattern to characterize an entire protein sequence. Motifs are often short relative to an entire protein sequence. In PRINTS, groups of motifs found in a sequence family can define a signature for that family.

COG

NCBI's Clusters of Orthologous Groups (COG) database is a different type of pattern database. COG is constructed by comparing all the protein sequences encoded in the complete sequenced genomes. Each cluster must consist of protein sequences from at least three separate genomes. The principle of COG is that proteins that are conserved across these genomes from many diverse organisms represent ancient functions that have been conserved throughout evolution. COG entries can be accessed by organism or by functional category from the NCBI web site.

Accessing multiple databases

When analyzing a new sequence it is recommendable to use as many as possible motif databases. Blocks uses InterPro as one of the sources for its own patterns and contains only ungapped patterns, at the same time profiles contained in Pfam and PROSITE are gapped. Thus keeping track of the best matches from each database, their scores, and (if available) the significance of the hit, will provide more profound information on the performed analysis.

One service that allows integrated searching of many motif databases is the European Bioinformatics Institute's Integrated Resource of Protein Domains and Functional Sites (InterPro). InterPro allows you to compare a sequence against all the motifs from Pfam, PRINTS, ProDom, and PROSITE. InterPro motifs are annotated with the name of the source protein, examples of proteins in which the motif occurs, references to the literature, and related motifs (Fig. 7).

Fig. 7. Structure of InerPro database

Fig. 7. Structure of InerPro database

Constructing and Using Your Own Profiles

Motif databases are useful when looking for protein families that are already well documented. However, if a new motif is found and it is intended to be used in GenBank search, or to look for patterns, it’s necessary to build an own profiles. Several software packages and servers are available for motif discovery - a process of finding and constructing your own motifs from a set of sequences. The simplest way to construct a motif is to find a well-conserved section out of a multiple sequence alignment. A number of programs are commonly used to search for and discover motifs, like Block Maker, MEME and HMMer.

Incorporating Motif Information into Pairwise Alignment

Multiple sequence information can optimize pairwise alignments. The BLAST package contains two new modes that use multiple alignment information to improve the specificity of database searches. These modes are accessed through the blastpgp – a program used to run PSI-BLAST and PHI-BLAST. The last are specialized protein BLAST comparisons that are more sensitive than the standard BLASTP search.

Position Specific Iterative BLAST (PSI-BLAST) is an enhancement of the original BLAST program that implements profiles to increase the specificity of database searches. Starting with a single sequence, PSI-BLAST searches a database for local alignments using gapped BLAST and builds a multiple alignment and a profile the length of the original query sequence. The profile is then used to search the protein database again, seeking local alignments. This procedure can be restated any number of times. One caution of using PSI-BLAST is that you need to know where to stop. Errors in alignment can be magnified by iteration, giving rise to false positives in the ultimate sequence search. The NCBI PSI-BLAST server is probably the optimal way to run a PSI-BLAST search.

Pattern Hit Initiated BLAST (PHI-BLAST) takes a sequence and a preselected pattern found in that sequence as input to query a protein sequence database. The pattern must be expressed in PROSITE syntax, which is described in detail on the PHI-BLAST server site. PHI-BLAST can also initiate a series of PSI-BLAST iterations, and can be a standalone program or a (vastly more user-friendly) web server.

Funding

Disclaimer

The European Commission support for the production of this publication does not constitute endorsement of the contents which reflects the views only of the authors, and the Commission cannot be held responsi-ble for any use which may be made of the information contained therein.