LO1: Biology, biological databases, and high-throughput data sources

Biology in the Computer Age

Bioinformatics is the science of using computers to work with biological data. It is a tool we can use to understand biological processes and to answer many other questions. Strictly speaking, bioinformatics is a subset of the larger field of computational biology, the application of quantitative analytical techniques to modelling biological systems. The field relies heavily on work by experts in statistical methods and pattern recognition. Researchers come to bioinformatics from many fields, including mathematics, computer science, and linguistics. Unfortunately, biology is a science of the specific as well as the general. Bioinformatics is full of pitfalls for those who look for patterns and make predictions without a complete understanding of where biological data comes from and what it means. By providing algorithms, databases, user interfaces, and statistical tools, bioinformatics makes it possible to do things like compare DNA sequences and generate results that are potentially significant. "Potentially significant" is perhaps the most important phrase. These new approaches also give us the opportunity to overinterpret data and assign meaning where none really exists. We can't overstate the importance of understanding the limitations of these tools. But once you gain that understanding and become an intelligent user of bioinformatics methods, the speed at which your research progresses can be truly astonishing.

Bioinformatics deals with any type of data that is of interest to biologists:

  • DNA and protein sequences
  • Gene expression (microarray)
  • Articles from the literature and databases of citations
  • Images
  • Raw data collected from any type of field or laboratory experiment
  • Software

How Does Informatics Change Biology?

Biological genetic and functional information is stored in DNA, RNA, and proteins, which are all linear chains composed of smaller molecules. These macromolecules are assembled from a fixed alphabet of well-understood chemicals: DNA is made up of four deoxyribonucleotides (adenine, thymine, cytosine, and guanine), RNA is made up of four ribonucleotides (adenine, uracil, cytosine, and guanine), and proteins are built from the 20 amino acids. Because these macromolecules are linear chains of defined components, they can be represented as sequences of symbols. These sequences can then be compared to find similarities that suggest the molecules are related by form or function. Sequence comparison is possibly the most useful computational tool to emerge for molecular biologists. The World Wide Web has made it possible for a single public database of genome sequence data to provide services through a uniform interface to a worldwide community of users. With a commonly used computer program called BLAST, a molecular biologist can compare an uncharacterized DNA sequence with the entire publicly available collection of DNA sequences.
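To make the idea of comparing sequences of symbols concrete, here is a minimal Python sketch. It assumes the two sequences are already the same length and simply counts matching positions; the function name and the example sequences are hypothetical, and real comparison tools also handle insertions and deletions.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Compare the sequences position by position, ignoring letter case.
    matches = sum(a == b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return 100.0 * matches / len(seq_a)


dna_1 = "ATGCGTACGT"  # hypothetical example sequences
dna_2 = "ATGCGAACGT"
print(percent_identity(dna_1, dna_2))  # 9 of 10 positions match -> 90.0
```

A high percentage only suggests that two molecules may be related; whether the similarity is biologically meaningful still has to be judged in context.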

Bioinformatics and Database Building

Much of what we currently think of as bioinformatics (sequence comparison, sequence database searching, and sequence analysis) is more complicated than simply designing and populating public databases. Bioinformaticians (or computational biologists) go beyond simply collecting, managing, and presenting data, drawing inspiration from a wide variety of quantitative fields, including statistics, physics, materials science, and software engineering. Figure 1 shows how quantitative science intersects with biology at every level, from analysis of sequence data and macromolecular structure, to metabolic modelling, to quantitative analysis of populations and ecology.

Figure 1. How technology intersects with biology

Bioinformatics is first and foremost a part of the biological sciences. The main goal of bioinformatics is not to develop the most elegant algorithms or the most arcane analyses; the goal is to find out how living organisms function. Like the molecular biology methods that greatly expanded what researchers were capable of studying, bioinformatics is a means and not an end in itself. Bioinformaticians are the tool builders, and it is important that they understand both biological problems and computational solutions in order to produce useful tools. Research in bioinformatics and computational biology can range from abstraction of the properties of a biological system into a mathematical or physical model, to implementation of new algorithms for data analysis, to the development of databases and web tools to access them.

Informatics and Biologists

The science of informatics is concerned with the representation, organization, manipulation, distribution, maintenance, and use of information, particularly in digital form. The practical side of bioinformatics is the representation, storage, and distribution of data. Intelligent design of data formats and databases, creation of tools to query those databases, and development of user interfaces that bring together different tools so that the user can ask complex questions about the data are all aspects of developing bioinformatics infrastructure.

Developing analytical tools to discover knowledge in data is the second, and more scientific, side of bioinformatics. There are many levels at which we use biological information, whether we are comparing sequences to develop a hypothesis about the function of a newly discovered gene, examining known 3D protein structures to find patterns that can help predict how the protein folds, or modelling how proteins and metabolites in a cell work together to make the cell function. An ultimate goal of analytical bioinformaticians is to develop predictive methods that allow researchers to model the function and phenotype of an organism based only on its genome sequence.

Bioinformatician Skills

There is a wide range of topics that are useful if you are interested in bioinformatics, and it is not possible to learn them all. However, the following "core requirements" for bioinformaticians can be highlighted:

  • Have a reasonably deep background in some area of molecular biology, such as biochemistry, molecular biophysics, or even molecular modelling.
  • Fully understand the “central dogma” of molecular biology: how and why DNA sequence is transcribed into RNA and then translated into protein.
  • Have substantial experience with at least one or two major molecular biology software packages, either for sequence analysis or for molecular modelling. The experience of learning one of these packages makes it much easier to learn other available programs.
  • Be comfortable working in a command-line computing environment.
  • Have experience with programming in a computer language such as C/C++, as well as in a scripting language such as Perl or Python.

Biologists and Computers

Computers are powerful tools for studying any system that can be described mathematically. As our understanding of biological processes has grown and deepened, it is not surprising that computational biology and bioinformatics have evolved from the intersection of traditional biology, mathematics, and computer science.

The increasing automation of experimental molecular biology and the growing volume of data in the biological sciences have led to a fundamental change in the way biological research is performed. In addition to anecdotal research (locating and studying in detail a single gene at a time), we are now cataloguing all the data that is available, making complete maps to which we can later return and mark the points of interest. This is happening in the domains of sequence and structure, and has begun to be the approach to other types of data as well. The trend is toward storage of raw biological data in numerous public databases with open access. Instead of doing preliminary research in the lab, researchers are going to the databases first to save time and resources.

Web Information Use

While you can quickly locate a single protein structure file or DNA sequence file by filling in a web form and searching a public database, it is likely that eventually you will want to work with more than one piece of data. You may be collecting and archiving your own data, and you may also want to make newly discovered data available to a broader research community. To do these things efficiently, you need to store data on your own computer. If you want to process your data with a computer program, you need to structure it. Understanding the difference between structured and unstructured data, and designing a data format that suits your storage and access needs, is the key to making your data useful and accessible.

There are many ways to organize data. While most biological data is stored in flat file databases, this type of database becomes inefficient when the quantity of data being stored becomes extremely large. GM2 (Advanced level) explains the differences between flat file and relational databases in more detail, introduces the best public-domain tools for managing databases, and shows how to use them to store and access your data.
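As an illustration of the difference between a flat file and a relational table, the following Python sketch reads a small FASTA-style flat file and loads the records into an in-memory SQLite database. FASTA is used here only as a familiar example of a flat-file format; the records, table name, and column names are hypothetical and not taken from any particular public database.

```python
import sqlite3

# A tiny flat file in FASTA format (hypothetical records).
FASTA_TEXT = """>seq1 hypothetical example record
ATGGCGTACG
TTAGC
>seq2 another hypothetical record
ATGCCC
"""


def _record(header, chunks):
    """Split a header line into identifier and description; join the sequence."""
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


def parse_fasta(text):
    """Yield (identifier, description, sequence) tuples from FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks)


def load_into_database(text):
    """Store the parsed records in an in-memory relational table."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE sequences (id TEXT PRIMARY KEY, description TEXT, residues TEXT)"
    )
    connection.executemany("INSERT INTO sequences VALUES (?, ?, ?)", parse_fasta(text))
    connection.commit()
    return connection


db = load_into_database(FASTA_TEXT)
# Once the data is structured, questions become queries instead of file scans.
for row in db.execute("SELECT id, LENGTH(residues) FROM sequences ORDER BY id"):
    print(row)
```

The flat file has to be read from top to bottom to answer any question, whereas the table can be filtered, sorted, and indexed by the database engine.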

Understanding Sequence Alignment Data

It is hard to make sense of your data, or to make a point, without visualization tools. Extracting cross sections or subsets of complex multivariate data is often required to make sense of biological data. Once you have stored data in an accessible, flexible format, the next step is to extract what is important to you and visualize it. You may need to make a histogram of your data, or to display a molecular structure in three dimensions and watch it move in real time, using dedicated visualization tools.

Predicting Protein Structure from Sequence

There are some questions that bioinformatics can't answer, and this is one of them. In fact, it is one of the biggest open research questions in computational biology. What is possible is to provide the tools to find information about such problems and about the other researchers who are working on them. Bioinformatics, like any other science, doesn't always provide quick and easy answers to every problem.

Questions That Bioinformatics Can Answer

The questions that drive bioinformatics development are the same questions that people have been asking in applied biology for the last few hundred years. How can we cure disease? How can we prevent infection? How can we produce enough food to feed all of humanity? Companies working in drug development, agricultural chemicals, hybrid plants, plastics and other petroleum derivatives, and biological approaches to environmental remediation, among others, are creating bioinformatics divisions and looking to bioinformatics to provide new targets and to help replace scarce natural resources.

The existence of genome projects implies our intention to use the data they generate. The important goals of modern molecular biology are to read the entire genomes of living organisms, to identify every gene, to match each gene with the protein it encodes, and to determine the structure and function of each protein. Detailed knowledge of gene sequence, protein structure and function, and gene expression patterns is expected to let us understand how life works at the highest possible resolution. With that knowledge, it should become possible to manipulate living organisms with accuracy and precision.

Computational Approaches to Biological Questions

There is a standard range of techniques that are applied in bioinformatics. Currently, most of the important methods rely on one key principle: that sequence and structural homology (or similarity) between molecules can be used to infer structural and functional similarity. This section gives an overview of the standard computational tools available to researchers; GM2 examines how specific software packages implement these methods and how a researcher should use them.

Molecular Biology's Central Dogma

The central dogma of molecular biology states that:

  • DNA acts as a template for its own replication,
  • DNA is transcribed into RNA, and
  • RNA is translated into protein.

In brief, genomic DNA contains all the information needed for a given living organism to function. Without DNA, organisms wouldn't be able to replicate themselves. The raw "one-dimensional" sequence of DNA, however, doesn't actually do anything biochemically; it only stores information, a blueprint that is read by the cell's protein-synthesizing machinery. DNA sequences are the punch cards; cells are the computers.

Replication of DNA

The specific structure of the DNA molecule gives it its special properties. These properties allow the information stored in DNA to be preserved and transferred from one cell to another, and thus from parents to their offspring.

Figure 2. Schematic replication of DNA helix

Genomes and Genes

The genome comprises individual genes. There are three classes of genes: protein-coding genes, RNA-specifying genes, and untranscribed genes.

Transcription of DNA

DNA acts as a blueprint for the synthesis of ribonucleic acid (RNA).

Figure 3. Schematic transcription of DNA into RNA

Translation of mRNA

Translation of mRNA into protein is the final key step in putting the information in the genome to work in the cell.

Figure 4. The genetic code
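The transcription and translation steps described above can be sketched in a few lines of Python. This is a simplified illustration: it assumes the input is the DNA coding strand, uses the standard genetic code, starts translating at the first base, and ignores splicing and other processing. The function names and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, written as one amino acid per codon in the
# order UUU, UUC, UUA, UUG, UCU, ... GGG ("*" marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons whose first base is U
    "LLLLPPPPHHQQRRRR"  # codons whose first base is C
    "IIIMTTTTNNKKSSRR"  # codons whose first base is A
    "VVVVAAAADDEEGGGG"  # codons whose first base is G
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA that has the same sequence as the DNA coding strand."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA from its first base, stopping at the first stop codon."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTTTTAA")  # hypothetical coding sequence
print(mrna)             # AUGGCCUUUUAA
print(translate(mrna))  # MAF
```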

Molecular Evolution

Errors in replication and transcription of DNA are relatively common. If these errors occur in dividing cells, they can be passed on to the offspring. Modifications in the DNA sequence can be harmful, beneficial, or neutral. If a mutation doesn't kill the organism before it reproduces, the mutation can become fixed in the population over many generations. The slow accumulation of such mutations is the basis of evolution. Thus, knowing DNA sequences provides us with a more precise understanding of evolution. Understanding evolution as a gradual accumulation of DNA sequence mutations is the justification for developing hypotheses based on DNA and protein sequence comparison.

Biological Models

One of the most important exercises in biology and bioinformatics is modeling. A model is an abstract way of describing a complicated system. Turning something as complex (and confusing) as a chromosome, or the cycle of cell division, into a simplified representation that captures all the features you are trying to study can be extremely difficult. A model helps us see the larger picture. One feature of a good model is that it makes systems that are otherwise difficult to study easier to analyze using quantitative approaches. Bioinformatics tools rely on our ability to extract relevant parameters from a biological system (be it a single molecule or something as complicated as a cell), describe them quantitatively, and then develop computational methods that use those parameters to compute the properties of a system or predict its behavior.

Accessing 3D Molecules Through a 1D Representation

In reality, DNA and proteins are complicated 3D molecules, composed of thousands or even millions of atoms bonded together. However, DNA and proteins are both polymers, chains of repeating monomers. Not too long after the chemical natures of DNA and proteins were understood, researchers recognized that it was convenient to represent them by strings of single letters. Instead of representing each nucleotide in a DNA sequence as a detailed chemical entity, researchers could represent it simply as A, T, C, or G. Thus, a short piece of DNA that contains thousands of individual atoms can be represented by a sequence of a few hundred letters.

Not only does this abstraction save storage space and provide a convenient form for sharing sequence information, it represents the nature of a molecule uniquely and correctly and ignores levels of detail (such as atomic structure of DNA and many proteins) that are experimentally inaccessible. Many computational biology methods exploit this 1D abstraction of 3D biological macromolecules.

The abstraction of nucleic acid and protein sequences into 1D strings has been one of the most fruitful modeling strategies in computational molecular biology, and analysis of character strings is a longstanding area of research in computer science. One of the elementary questions you can ask about strings is, "Do they match?" There are well-established algorithms in computer science for finding exact and inexact matches in pairs of strings. These algorithms are applied to find pairwise matches between biological sequences and to search sequence databases using a sequence query.
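A minimal Python sketch of exact and inexact matching follows. It uses the simplest possible approach, sliding the pattern along the text and counting substitutions, and it assumes that inexact matches differ only by substitutions (no insertions or deletions). Established string-matching algorithms are far more efficient; the function name and example strings are hypothetical.

```python
def find_matches(pattern: str, text: str, max_mismatches: int = 0):
    """Return start positions where pattern matches text with at most
    max_mismatches substitutions (insertions and deletions are not allowed)."""
    positions = []
    for start in range(len(text) - len(pattern) + 1):
        mismatches = 0
        for p, t in zip(pattern, text[start:start + len(pattern)]):
            if p != t:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many differences at this position
        else:
            # The inner loop finished without exceeding the mismatch limit.
            positions.append(start)
    return positions


text = "ACGTTGCATGCA"
print(find_matches("TGCA", text))                    # exact matches -> [4, 8]
print(find_matches("TGCT", text))                    # no exact match -> []
print(find_matches("TGCT", text, max_mismatches=1))  # one substitution allowed -> [4, 8]
```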

In addition to matching individual sequences, string-based methods from computer science have been successfully applied to a number of other problems in molecular biology. For example, algorithms for reconstructing a string from a set of shorter substrings can assemble DNA sequences from overlapping sequence fragments. Techniques for recognizing repeated patterns in single sequences or conserved patterns across multiple sequences allow researchers to identify signatures associated with biological structures or functions. Finally, multiple sequence-alignment techniques allow the simultaneous comparison of several molecules that can infer evolutionary relationships between sequences.
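The idea of reconstructing a string from overlapping substrings can be illustrated with a greedy merge, sketched below in Python. This is a toy version under strong assumptions: fragments are error-free, all come from the same strand, and no fragment lies entirely inside another. Real assemblers use more robust methods; the function names and fragments are hypothetical.

```python
def overlap_length(left: str, right: str, min_length: int = 3) -> int:
    """Length of the longest suffix of left that equals a prefix of right."""
    for length in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def greedy_assemble(fragments, min_length: int = 3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best = (0, None, None)
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    length = overlap_length(left, right, min_length)
                    if length > best[0]:
                        best = (length, i, j)
        length, i, j = best
        if length == 0:
            break  # no remaining overlaps; leave the contigs separate
        merged = contigs[i] + contigs[j][length:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
    return contigs


fragments = ["GGATTCAA", "ATGCGTAC", "GTACGGAT"]  # hypothetical reads
print(greedy_assemble(fragments))  # ['ATGCGTACGGATTCAA']
```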

This simplifying abstraction of DNA and protein sequence seems to ignore a lot of biology. The cellular context in which biomolecules exist is completely ignored, as are their interactions with other molecules and their molecular structure. And yet it has been shown over and over that matches between biological sequences can be biologically meaningful.

Abstractions for Modeling Protein Structure

There is more to biology than sequences. Proteins and nucleic acids also have complex 3D structures that provide clues to their functions in the living organism. Structure analysis can be performed on static structures, or movements and interactions in the molecules can be studied with molecular simulation methods.

Standard molecular simulation approaches model proteins as a collection of point masses (atoms) connected by bonds. The bond between two atoms has a standard length, derived from experimental chemistry, and an associated applied force that constrains the bond at that length. The angle between three adjacent atoms has a standard value and an applied force that constrains the bond angle around that value. The same is true of the dihedral angle described by four adjacent atoms. In a molecular dynamics simulation, energy is added to the molecular system by simulated "heating." Following standard Newtonian laws, the atoms in the molecule move. The energy added to the system provides an opposing force that moves atoms in the molecule out of their standard conformations. The actions and reactions of hundreds of atoms in a molecular system can be simulated using this abstraction.
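The "standard length plus constraining force" description of a bond can be sketched as a spring. The Python example below assumes the simple harmonic form, in which energy grows with the square of the deviation from the standard length; the force constant and bond lengths are illustrative numbers, not parameters from any published force field.

```python
def bond_energy(length, equilibrium_length, force_constant):
    """Harmonic potential: energy rises quadratically away from the standard length."""
    return 0.5 * force_constant * (length - equilibrium_length) ** 2


def bond_force(length, equilibrium_length, force_constant):
    """Restoring force along the bond (negative derivative of the energy)."""
    return -force_constant * (length - equilibrium_length)


# Illustrative values only (arbitrary but consistent units).
standard_length = 1.53
k_bond = 300.0

for current_length in (1.43, 1.53, 1.63):
    energy = bond_energy(current_length, standard_length, k_bond)
    force = bond_force(current_length, standard_length, k_bond)
    print(f"length={current_length:.2f}  energy={energy:.3f}  force={force:+.1f}")
```

A stretched bond feels a negative (shortening) force and a compressed bond a positive (lengthening) one; the same spring-like treatment can be applied to bond angles and dihedral angles.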

However, the computational demands of molecular simulations are huge, and there is some uncertainty both in the force field (the collection of standard forces that model the molecule) and in the modeling of nonbonded interactions (interactions between nonadjacent atoms). As a result, it has not proven possible to predict protein structure using the all-atom modeling approach.

A few researchers have recently had moderate success in predicting protein topology for small proteins using an intermediate level of abstraction: more than linear sequence, but less than an all-atom model. In this case, the protein is treated as a series of beads (representing the individual amino acids) on a string (representing the backbone). Beads may have different characters to represent the differences in the amino acid sidechains. They may be positively or negatively charged, polar or nonpolar, small or large. There are rules governing which beads will attract each other. Polar groups cluster with other polar groups, and nonpolar with nonpolar. There are also rules concerning the string; chiefly, that it can't pass through itself during the course of the simulation. The folding simulation itself is carried out through sequential or simultaneous perturbations of the position of each bead.

Mathematical Modeling of Biochemical Systems

Using theoretical models in biology goes far beyond the single molecule level. For years, ecologists have been using mathematical models to help them understand the dynamics of changes in interdependent populations. What effect does a decrease in the population of a predator species have on the population of its prey? What effect do changes in the environment have on population? The answers to those questions are theoretically predictable, given an appropriate mathematical model and a knowledge of the sizes of populations and their standard rates of change due to various factors.
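One classic model of this kind is the Lotka-Volterra predator-prey system, used here purely as an illustration. The Python sketch below integrates it with the forward Euler method; the rate constants, starting populations, and step size are hypothetical values chosen only to show the behavior, and a real study would use a proper ODE solver and fitted parameters.

```python
def simulate_predator_prey(prey, predators, steps, dt=0.01,
                           prey_growth=1.0, predation=0.1,
                           predator_death=1.5, conversion=0.075):
    """Integrate the Lotka-Volterra equations with the forward Euler method.

    d(prey)/dt      = prey_growth * prey - predation * prey * predators
    d(predators)/dt = conversion * prey * predators - predator_death * predators
    """
    history = [(prey, predators)]
    for _ in range(steps):
        d_prey = (prey_growth * prey - predation * prey * predators) * dt
        d_pred = (conversion * prey * predators - predator_death * predators) * dt
        prey, predators = prey + d_prey, predators + d_pred
        history.append((prey, predators))
    return history


# With these parameters, 20 prey and 10 predators is a balance point.
balanced = simulate_predator_prey(prey=20.0, predators=10.0, steps=200)
# Halving the predator population lets the prey population start to grow.
fewer_predators = simulate_predator_prey(prey=20.0, predators=5.0, steps=200)

print("prey after 2 time units, balanced:       ", round(balanced[-1][0], 2))
print("prey after 2 time units, fewer predators:", round(fewer_predators[-1][0], 2))
```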

In molecular biology, a similar approach, called metabolic control analysis, is applied to biochemical reactions that involve many molecules and chemical species. While cells contain hundreds or thousands of interacting proteins, small molecules, and ions, it's possible to create a model that describes and predicts a small corner of that complicated metabolism. For instance, if you are interested in the biological processes that maintain different concentrations of hydrogen ions on either side of the mitochondrial inner membrane in eukaryotic cells, it's probably not necessary for your model to include the distant group of metabolic pathways that are closely involved in biosynthesis of the heme structure.

Metabolic models describe a biochemical process in terms of the concentrations of chemical species involved in a pathway, and the reactions and fluxes that affect those concentrations. Reactions and fluxes can be described by differential equations; they are essentially rates of change in concentration.
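As a minimal sketch of fluxes expressed as rates of change in concentration, the Python example below models a hypothetical two-step pathway A -> B -> C. It assumes simple first-order kinetics (each flux is proportional to the concentration of its substrate) and uses forward Euler integration; the rate constants and initial concentrations are made up for illustration.

```python
def simulate_pathway(a0=1.0, b0=0.0, c0=0.0, k1=0.5, k2=0.2, dt=0.01, steps=1000):
    """Forward-Euler integration of a two-step pathway A -> B -> C.

    d[A]/dt = -k1*[A]
    d[B]/dt =  k1*[A] - k2*[B]
    d[C]/dt =  k2*[B]
    """
    a, b, c = a0, b0, c0
    for _ in range(steps):
        flux_ab = k1 * a  # rate of the reaction A -> B
        flux_bc = k2 * b  # rate of the reaction B -> C
        a += -flux_ab * dt
        b += (flux_ab - flux_bc) * dt
        c += flux_bc * dt
    return a, b, c


a, b, c = simulate_pathway()
print(f"[A]={a:.4f}  [B]={b:.4f}  [C]={c:.4f}  total={a + b + c:.4f}")
```

Changing `k1` or `k2` and rerunning the simulation is a simple way to see how the speed of one reaction affects the concentration of a downstream compound.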

What makes metabolic modeling interesting is the possibility of modeling many reactions simultaneously to see what effect they have on the concentration of a particular chemical compound. Using a properly constructed metabolic model, you can test different assumptions about cellular conditions and fine-tune the model to simulate experimental observations. That, in turn, can suggest testable hypotheses to drive further research.

Bioinformatics Approaches

Molecular biology research is a fast-growing area. The amount and type of data that can be gathered is exploding, and the trend of storing this data in public databases is spilling over from genome sequence to all sorts of other biological datatypes. The information landscape for biologists is changing so rapidly that much of the available information is often somewhat behind the times.

Yet, since the inception of the Human Genome Project, a core set of computational approaches has emerged for dealing with the types of data that are currently shared in public databases—DNA, protein sequence, and protein structure. Although databases containing results from new high-throughput molecular biology methods have not yet grown to the extent the sequence databases have, standard methods for analyzing these data have begun to emerge.

The following list gives an overview of the key computational methods:

Using public databases and data formats

The first key skill for biologists is to learn to use online search tools to find information. Literature searching is no longer a matter of looking up references in a printed index. You can find links to most of the scientific publications you need online. There are central databases that collect reference information, so you can search dozens of journals at once. You can even set up "agents" that notify you when new articles are published in an area of interest. Searching the public molecular-biology databases requires the same skills as searching for literature references: you need to know how to construct a query statement that will pluck the particular needle you're looking for out of the database haystack.

Sequence alignment and sequence searching

Being able to compare pairs of DNA or protein sequences and extract partial matches has made it possible to use a biological sequence as a database query. Sequence-based searching is another key skill for biologists; a little exploration of the biological databases at the start of a research project often saves a lot of valuable time in the lab. Identifying homologous sequences provides a basis for phylogenetic analysis and sequence pattern recognition. Sequence-based searching can be done online through web forms, so it requires no special computing skills, but to judge the quality of your search results you need to understand how the underlying sequence-alignment method works and how to go beyond simple sequence alignment to other types of analysis.
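To give a feel for how a partial match between two sequences can be scored, here is a small Python sketch of the Smith-Waterman local alignment score with a linear gap penalty. It is one standard dynamic-programming method, shown as an illustration rather than as the method any particular search service uses (database search tools generally rely on faster heuristics). The scoring values and example sequences are hypothetical.

```python
def local_alignment_score(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (linear gap penalty)."""
    best = 0
    previous = [0] * (len(seq_b) + 1)  # scores for the previous row of the table
    for a in seq_a:
        current = [0]
        for j, b in enumerate(seq_b, start=1):
            score = max(
                0,                                                  # start a new local alignment
                previous[j - 1] + (match if a == b else mismatch),  # align a with b
                previous[j] + gap,                                  # gap in seq_b
                current[j - 1] + gap,                               # gap in seq_a
            )
            current.append(score)
            best = max(best, score)
        previous = current
    return best


# The two sequences share the stretch GATTACA but differ at both ends.
print(local_alignment_score("TTTGATTACATTT", "CCGATTACACC"))  # 7 matches * 2 -> 14
print(local_alignment_score("AAAA", "CCCC"))                  # nothing in common -> 0
```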

Gene prediction

Gene prediction is just one of a cluster of methods for detecting meaningful signals in uncharacterized DNA sequences. Until recently, most sequences deposited in GenBank were already characterized at the time of deposition. That is, someone had already gone in and, using molecular biology, genetic, or biochemical methods, worked out what the gene did. However, now that the genome projects are in full swing, there is a lot of DNA sequence out there that isn't characterized.

Software for prediction of open reading frames, genes, exon splice sites, promoter binding sites, repeat sequences, and tRNA genes helps researchers make sense of this unmapped DNA.
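The simplest of these signals, the open reading frame, can be found with a short scan. The Python sketch below is deliberately naive: it assumes an ORF runs from an ATG to the first in-frame stop codon, searches only the three forward-strand reading frames, and knows nothing about introns or promoters. Real gene-prediction software uses much richer statistical models; the function name and test sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for ATG...stop ORFs on the forward strand.

    end is exclusive and includes the stop codon; min_codons counts the
    start codon but not the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        position = frame
        while position + 3 <= len(dna):
            if dna[position:position + 3] == "ATG":
                # Walk codon by codon until the first in-frame stop codon.
                for end in range(position + 3, len(dna) - 2, 3):
                    if dna[end:end + 3] in STOP_CODONS:
                        if (end - position) // 3 >= min_codons:
                            orfs.append((position, end + 3, dna[position:end + 3]))
                        position = end  # resume scanning after this ORF
                        break
            position += 3
    return orfs


print(find_orfs("CCATGAAATTTGGGTAACC"))  # [(2, 17, 'ATGAAATTTGGGTAA')]
```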

Multiple sequence alignment

Multiple sequence-alignment methods assemble pairwise sequence alignments for many related sequences into a picture of sequence homology among all members of a gene family. Multiple sequence alignments aid in visual identification of sites in a DNA or protein sequence that may be functionally important. Such sites are usually conserved; the same amino acid is present at that site in each member of a group of related sequences. Multiple sequence alignments can also be analyzed quantitatively to extract information about a gene family. Multiple sequence alignment is an integral step in phylogenetic analysis of a family of related sequences, and it also provides the basis for identifying sequence patterns that characterize particular protein families.
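Once an alignment exists, spotting conserved sites is a matter of examining it column by column. The Python sketch below assumes the alignment has already been computed by some other tool and is given as equal-length strings with "-" for gaps; it counts a gap as just another symbol. The function name and the small alignment are hypothetical.

```python
from collections import Counter


def column_conservation(alignment):
    """For each alignment column, return (most common symbol, its frequency)."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    summary = []
    for column in zip(*alignment):
        symbol, count = Counter(column).most_common(1)[0]
        summary.append((symbol, count / len(alignment)))
    return summary


alignment = [  # hypothetical aligned protein fragments
    "MKT-LV",
    "MKTALV",
    "MRTALI",
    "MKS-LV",
]
conserved = [
    index
    for index, (_, frequency) in enumerate(column_conservation(alignment))
    if frequency == 1.0
]
print(conserved)  # columns identical in every sequence -> [0, 4]
```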

Phylogenetic analysis

Phylogenetic analysis attempts to describe the evolutionary relatedness of a group of sequences. A traditional phylogenetic tree or cladogram groups species into a diagram that represents their relative evolutionary divergence. Branchings of the tree that occur furthest from the root separate individual species; branchings that occur close to the root group species into kingdoms, phyla, classes, families, genera, and so on.

The information in a molecular sequence alignment can be used to compute a phylogenetic tree for a particular family of gene sequences. The branchings in a phylogenetic tree represent evolutionary distance based on sequence similarity scores or on information-theoretic modeling of the number of mutational steps required to change one sequence into the other. Phylogenetic analyses of protein sequence families describe not the evolution of the entire organism but evolutionary change in specific coding regions, although our ability to create broader evolutionary models based on molecular information will expand as the genome projects provide more data to work with.

Extraction of patterns and profiles from sequence data

A motif is a sequence of amino acids that defines a substructure in a protein that can be connected to function or to structural stability. In a group of evolutionarily related gene sequences, motifs appear as conserved sites. Sites in a gene sequence tend to be conserved—to remain the same in all or most representatives of a sequence family—when there is selection pressure against copies of the gene that have mutations at that site. Nonessential parts of the gene sequence will diverge from each other in the course of evolution, so the conserved motif regions show up as a signal in a sea of mutational noise. Sequence profiles are statistical descriptions of these motif signals; profiles can help identify distantly related proteins by picking out a motif signal even in a sequence that has diverged radically from other members of the same family.
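A sequence profile can be sketched as a table of per-position residue probabilities, which is then used to score candidate windows in a new sequence. The Python example below assumes ungapped motif instances of equal length, add-one pseudocounts, and a uniform background frequency; these choices, the motif instances, and the function names are illustrative assumptions, and production profile methods are considerably more sophisticated.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(motif_instances, pseudocount=1.0):
    """Per-position residue probabilities, smoothed with a pseudocount."""
    profile = []
    for column in zip(*motif_instances):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append(
            {residue: (counts[residue] + pseudocount) / total for residue in ALPHABET}
        )
    return profile


def log_odds_score(profile, window, background=1.0 / len(ALPHABET)):
    """Sum of log2(profile probability / background probability)."""
    return sum(
        math.log2(column[residue] / background)
        for column, residue in zip(profile, window)
    )


def best_window(profile, sequence):
    """Return (score, start) of the highest-scoring window in sequence."""
    width = len(profile)
    return max(
        (log_odds_score(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Hypothetical instances of a four-residue motif from related proteins.
profile = build_profile(["CHGK", "CHGR", "CHAK", "CYGK"])
score, start = best_window(profile, "MALWCHGKTEE")
print(start)  # the motif-like window CHGK begins at index 4 -> 4
```

Because the profile scores each position by how often a residue was seen there, a sequence can still score well when it matches the motif at most positions but not all.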

Protein sequence analysis

The amino-acid content of a protein sequence can be used as the basis for many analyses, from computing the isoelectric point and molecular weight of the protein and the characteristic peptide mass fingerprints that will form when it's digested with a particular protease, to predicting secondary structure features and post-translational modification sites.
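Two of these analyses are easy to sketch in Python: the amino-acid composition and an in-silico protease digest. The example assumes trypsin and the commonly used rule that it cleaves after lysine (K) or arginine (R) unless the next residue is proline (P). Peptide masses, which would require a table of residue masses, are left out of this sketch; the function names and the example sequence are hypothetical.

```python
from collections import Counter


def amino_acid_composition(protein):
    """Fraction of each residue type in the sequence."""
    counts = Counter(protein)
    return {residue: count / len(protein) for residue, count in counts.items()}


def tryptic_peptides(protein):
    """Cleave after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide with no cleavage site
    return peptides


protein = "MKWVTFRPLLKGA"  # hypothetical sequence
print(tryptic_peptides(protein))  # ['MK', 'WVTFRPLLK', 'GA']
print(round(amino_acid_composition(protein)["L"], 3))
```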

Protein structure prediction

It's a lot harder to determine the structure of a protein experimentally than it is to obtain DNA sequence data. One very active area of bioinformatics and computational biology research is the development of methods for predicting protein structure from protein sequence. Methods such as secondary structure prediction and threading can help determine how a protein might fold, classifying it with other proteins that have similar topology, but they don't provide a detailed structural model. The most effective and practical method for protein structure prediction is homology modeling—using a known structure as a template to model a structure with a similar sequence. In the absence of homology, there is no way to predict a complete 3D structure for a protein.

Protein structure property analysis

Protein structures have many measurable properties that are of interest to crystallographers and structural biologists. Protein structure validation tools are used by crystallographers to measure how well a structure model conforms to structural rules extracted from existing structures or chemical model compounds. These tools may also analyze the "fitness" of every amino acid in a structure model for its environment, flagging such oddities as buried charges with no countercharge or large patches of hydrophobic amino acids found on a protein surface. These tools are useful for evaluating both experimental and theoretical structure models.

Another class of tools can calculate internal geometry and physicochemical properties of proteins. These tools are usually applied to help develop models of the protein's catalytic mechanism or other chemical features. Some of the most interesting properties of protein structures are the locations of deeply concave surface clefts and internal cavities, both of which may point to the location of a cofactor binding site or active site. Other tools compute hydrogen-bonding patterns or analyze intramolecular contacts. A particularly interesting set of properties includes the electrostatic potential field surrounding the protein and other electrostatically controlled parameters, such as individual amino acid pKa values, protein solvation energies, and binding constants.

Protein structure alignment and comparison

Even when two gene sequences aren't apparently homologous, the structures of the proteins they encode can be similar. New tools for computing structural similarity are making it possible to detect distant homologies by comparing structures, even in the absence of much sequence similarity. These tools are also useful for comparing constructed homology models with the known protein structures they are based on.

Biochemical simulation

Biochemical simulation uses the tools of dynamical systems modeling to simulate the chemical reactions involved in metabolism. Simulations can extend from individual metabolic pathways to transmembrane transport processes and even to properties of whole cells or tissues. Biochemical and cellular simulations have traditionally depended on the researcher's ability to describe a system mathematically, developing a system of differential equations that represent the different reactions and fluxes occurring in the system. However, new software tools can build the mathematical framework of a simulation automatically from a description provided interactively by the user. This makes mathematical modeling accessible to any biologist who knows enough about a system to describe it according to the conventions of dynamical systems modeling.
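
To make the idea of "a system of differential equations" concrete, here is a minimal sketch in Python. It assumes a hypothetical two-step pathway S -> I -> P with first-order kinetics; the rate constants, step size, and the function name simulate_pathway are illustrative choices, not taken from any particular simulation package. The equations dS/dt = -k1*S, dI/dt = k1*S - k2*I, and dP/dt = k2*I are integrated with the simple forward Euler method.

```python
def simulate_pathway(s0, k1, k2, dt, steps):
    """Integrate the toy pathway S -> I -> P with forward Euler.

    s0     : initial amount of substrate S (I and P start at zero)
    k1, k2 : first-order rate constants (hypothetical values)
    dt     : time step; steps : number of Euler steps
    Returns a list of (time, S, I, P) tuples.
    """
    s, i, p = s0, 0.0, 0.0
    trajectory = [(0.0, s, i, p)]
    for n in range(1, steps + 1):
        flux1 = k1 * s          # rate of S -> I
        flux2 = k2 * i          # rate of I -> P
        s += -flux1 * dt
        i += (flux1 - flux2) * dt
        p += flux2 * dt
        trajectory.append((n * dt, s, i, p))
    return trajectory


if __name__ == "__main__":
    final_time, s, i, p = simulate_pathway(1.0, 0.5, 0.2, 0.01, 1000)[-1]
    print(final_time, s, i, p)
```

Because every unit of material that leaves one pool enters the next, the sum S + I + P stays constant, which is a quick sanity check on any such model. Real simulation tools generally use more accurate integrators than forward Euler.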

Whole genome analysis

As more and more genomes are sequenced completely, the analysis of raw genome data has become a more important task. There are a number of perspectives from which one can look at genome data: for example, it can be treated as a long linear sequence, but it's often more useful to integrate DNA sequence information with existing genetic and physical map data. This allows you to navigate a very large genome and find what you want. The National Center for Biotechnology Information (NCBI) and other organizations are making a concerted effort to provide useful web interfaces to genome data, so that users can start from a high-level map and navigate to the location of a specific gene sequence.

Genome navigation is far from the only issue in genomic sequence analysis, however. Annotation frameworks, which integrate genome sequence with results of gene finding analysis and sequence homology information, are becoming more common, and the challenge of making and analyzing complete pairwise comparisons between genomes is beginning to be addressed.

Primer design

Many molecular biology protocols require the design of oligonucleotide primers. Proper primer design is critical for the success of polymerase chain reaction (PCR), oligo hybridization, DNA sequencing, and microarray experiments. Primers must hybridize with the target DNA to provide a clear answer to the question being asked, but they must also have appropriate physicochemical properties: they must not self-hybridize or dimerize, and they should not have multiple targets within the sequence under investigation. There are several web-based services that allow users to submit a DNA sequence and automatically detect appropriate primers, or to compute the properties of a desired primer DNA sequence.
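
The kinds of checks listed above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the melting temperature uses the Wallace rule of thumb (2 degrees per A or T, 4 degrees per G or C), which is a rough estimate for short oligos; binding sites are counted as exact matches only; and the function names are hypothetical rather than part of any primer design service.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq):
    """Rough melting temperature in degrees C (Wallace rule, short oligos only)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def count_binding_sites(primer, template):
    """Count exact occurrences of the primer on either strand of the template."""
    primer, template = primer.upper(), template.upper()

    def occurrences(pattern, text):
        width = len(pattern)
        return sum(1 for i in range(len(text) - width + 1)
                   if text[i:i + width] == pattern)

    return (occurrences(primer, template)
            + occurrences(primer, reverse_complement(template)))


def has_3prime_self_complementarity(primer, k=4):
    """True if the last k bases could pair with some stretch of the primer itself."""
    primer = primer.upper()
    return reverse_complement(primer[-k:]) in primer
```

A primer would pass this crude screen if count_binding_sites returns exactly 1 and has_3prime_self_complementarity returns False. Practical tools also weigh salt concentration, mismatches, and hairpin formation, none of which are modeled here.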

DNA microarray analysis

DNA microarray analysis is a relatively new molecular biology method that expands on classic probe hybridization methods to provide access to thousands of genes at once. Microarray experiments are amenable to computational analysis because of the uniform, standardized nature of their results—a grid of equally sized spots, each identifiable with a particular DNA sequence. Computational tools are required to analyze larger microarrays because the resulting images are so visually complex that comparison by hand is no longer feasible.

The main tasks in microarray analysis as it's currently done are an image analysis step, in which individual spots on the array image are identified and signal intensity is quantitated, and a clustering step, in which spots with similar signal intensities are identified. Computational support is also required for the chip-design phase of a microarray experiment to identify appropriate oligonucleotide probe sequences for a particular set of genes and to maintain a record of the identity of each spot in a grid that may contain thousands of individual experiments.
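
As a toy illustration of the clustering step, the sketch below groups spots whose log2 signal intensities lie close together. It assumes the image analysis step has already produced one positive intensity value per spot; the spot identifiers, the gap threshold, and the function name cluster_spots are made up for the example. The method is single-linkage clustering in one dimension, which is far simpler than the multi-experiment clustering used in practice.

```python
import math


def cluster_spots(intensities, max_gap=1.0):
    """Group spots by similar signal intensity.

    intensities : dict mapping spot id -> positive signal intensity
    max_gap     : largest log2 difference allowed between neighbouring
                  spots in the same cluster (hypothetical threshold)
    Returns a list of clusters, each a list of spot ids, ordered from
    weakest to strongest signal.
    """
    ordered = sorted(intensities.items(), key=lambda item: item[1])
    clusters = []
    previous_level = None
    for spot_id, value in ordered:
        level = math.log2(value)
        # Start a new cluster when the jump from the previous spot is too large.
        if previous_level is None or level - previous_level > max_gap:
            clusters.append([])
        clusters[-1].append(spot_id)
        previous_level = level
    return clusters
```

For example, spots with intensities 100 and 120 fall into one cluster, while a spot at 1600 starts a new one because it differs by more than one log2 unit.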

Proteomics analysis

Before they're ever crystallized and biochemically characterized, proteins are often studied using a combination of gel electrophoresis, partial sequencing, and mass spectrometry. 2D gel electrophoresis can separate a mixture of thousands of proteins into distinct spots; the individual spots of material can be blotted or even cut from the gel and analyzed. Simple computational tools can provide some information to aid in the process of analyzing protein mixtures. It's easy to calculate the molecular weight and pI from a protein sequence; by using these values, sets of candidate identities can be determined for each spot on a gel. It's also possible to compute, from a protein sequence, the peptide fingerprint that is produced when that protein is broken down into fragments by enzymes with specific protein cleavage sites. Mass spectrometry analysis of protein fragments can be compared with computed peptide fingerprints to further narrow the search.
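
The molecular weight and peptide fingerprint calculations described above can be sketched as follows. The sketch assumes approximate average residue masses (in daltons) and uses trypsin as the example enzyme, with the common simplified rule that it cleaves after K or R unless the next residue is P. The function names are illustrative, and pI calculation is left out.

```python
# Approximate average residue masses in daltons (amino acid minus one water).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528


def molecular_weight(seq):
    """Average molecular weight of an unmodified peptide or protein."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in seq.upper()) + WATER_MASS


def tryptic_digest(seq):
    """Split a protein after each K or R that is not followed by P."""
    seq = seq.upper()
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        next_aa = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides


def peptide_fingerprint(seq):
    """Sorted list of predicted peptide masses after tryptic digestion."""
    return sorted(molecular_weight(p) for p in tryptic_digest(seq))


def count_matching_masses(observed, predicted, tolerance=0.5):
    """Number of observed masses within the tolerance of any predicted mass."""
    return sum(1 for m in observed
               if any(abs(m - q) <= tolerance for q in predicted))
```

Ranking candidate proteins by count_matching_masses against a list of observed fragment masses is one simple way to narrow the search; real search tools also account for missed cleavages and chemical modifications.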

The Public Biological Databases

The nomenclature problem in biology at the molecular level is immense. Genes are commonly known by unsystematic names. These may come from developmental biology studies in model systems, so that some genes have names like flightless, shaker, and antennapedia due to the developmental effects they cause in a particular animal. Other names are chosen by cellular biologists and represent the function of genes at a cellular level, like homeobox. Still other names are chosen by biochemists and structural biologists and refer to a protein that was probably isolated and studied before the gene was ever found.

Though proteins are direct products of genes, they are not always referred to by the same names or codes as the genes that encode them. This kind of confusing nomenclature generally means that only a scientist who works with a particular gene, gene product, or the biochemical process that it's a part of can immediately recognize what the common name of the gene refers to. The biochemistry of a single organism is a more complex set of information than the taxonomy of living species was at the time of Linnaeus, so it isn't to be expected that a clear and comprehensive system of nomenclature will be arrived at easily. There are many things to be known about a given gene: its source organism, its chromosomal location, and the location of the activator sequences and identities of the regulatory proteins that turn it on and off. Genes also can be categorized by when during the organism's development they are turned on, and in which tissues expression occurs. They can be categorized by the function of their product, whether it's a structural protein, an enzyme, or a functional RNA. They can be categorized by the identity of the metabolic pathway that their product is part of, and by the substrate it modifies or the product it produces. They can be categorized by the structural architecture of their protein products. Clearly this is a wealth of information to be condensed into a reasonable nomenclature. Figure 5 shows a portion of the information that may be associated with a single gene.

Figure 5. Information associated with a single gene

The problem for maintainers of biological databases is mostly one of annotation; that is, putting enough information into the database that there is no question of what the gene is, even if it has a cryptic common name, and making the proper links between that information and the gene sequence and serial number. Correct annotation of genomic data is an active research area in itself, as scientists try to find ways to transfer information across genomes without propagating error. Storage of macromolecular data in electronic databases has given rise to a way of working around the nomenclature problem. The solution has been to give each new entry in the database a serial number and then to store it in a relational database that knows the correct linkages between that serial number, any number of names for the gene or gene product it encodes, and all manner of other information about the gene. This strategy is the one currently in use in the major biological databases.
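
The serial-number strategy can be illustrated with a tiny relational schema. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The table layout, the accession DEMO0001, the alias, and the short sequence are all invented for the example and do not reflect the schema or contents of any real database.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one entry known by several names."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entry (
            accession TEXT PRIMARY KEY,   -- the serial number
            organism  TEXT,
            sequence  TEXT
        );
        CREATE TABLE name (
            accession TEXT REFERENCES entry(accession),
            name      TEXT
        );
    """)
    # Made-up record: accession and sequence are placeholders.
    conn.execute("INSERT INTO entry VALUES (?, ?, ?)",
                 ("DEMO0001", "Drosophila melanogaster", "ATGGCC"))
    conn.executemany("INSERT INTO name VALUES (?, ?)",
                     [("DEMO0001", "shaker"),
                      ("DEMO0001", "example alias")])
    conn.commit()
    return conn


def find_by_name(conn, name):
    """Return (accession, organism) for every entry known by this name."""
    query = """
        SELECT e.accession, e.organism
        FROM entry AS e
        JOIN name AS n ON n.accession = e.accession
        WHERE n.name = ?
    """
    return conn.execute(query, (name,)).fetchall()
```

Whichever name a user searches for, the lookup resolves to the same accession, so the database never has to decide which name is the "right" one.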

The questions databases resolve are essentially the same questions that arise in developing a nomenclature. However, by using relational databases and complex querying strategies, they (perhaps somewhat unfortunately) avoid the issue of finding a concise way for scientists to communicate the identities of genes on a nondigital level.

Data Annotation and Data Formats

The representation and distribution of biological data is still an open problem in bioinformatics. The nucleotide sequences of DNA and RNA and the amino acid sequences of proteins reduce neatly to character strings in which a single letter represents a single nucleotide or amino acid. The remaining challenges in representing sequence data are verification of the correctness of the data, thorough annotation of data, and handling of data that comes in ever-larger chunks, such as the sequences of chromosomes and whole genomes.
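
Because sequences reduce to character strings, a first-pass correctness check can be as simple as testing each character against the expected alphabet. The sketch below is a minimal example; the function name is hypothetical, and it assumes the strict four-letter and twenty-letter alphabets, whereas real database records may also contain ambiguity codes that this check would flag.

```python
ALPHABETS = {
    "DNA": set("ACGT"),
    "RNA": set("ACGU"),
    "protein": set("ACDEFGHIKLMNPQRSTVWY"),
}


def invalid_positions(seq, kind):
    """Return (index, character) pairs for characters outside the alphabet."""
    alphabet = ALPHABETS[kind]
    return [(i, ch) for i, ch in enumerate(seq.upper()) if ch not in alphabet]
```

An empty result means every character belongs to the chosen alphabet; it says nothing about whether the sequence is biologically correct.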

The standard reduced representation of the 3D structure of a biomolecule consists of the Cartesian coordinates of the atoms in the molecule. This aspect of representing the molecule is straightforward. On the other hand, there are a host of complex issues for structure databases that are not completely resolved. Annotation is still an issue for structural data, although the biology community has attempted to form a consensus as to what annotation of a structure is currently required. In the last 15 years, different researchers have developed their own styles and formats for reporting biological data. Biological sequence and structure databases have developed in parallel in the United States and in Europe. The use of proprietary software for data analysis has contributed a number of proprietary data formats to the mix. While there are many specialized databases, we focus here on the fields in which an effort is being made to maintain a comprehensive database of an entire class of data.
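
A minimal sketch of the coordinate representation: each atom is stored with its name, residue, and x, y, z position, and geometric quantities follow directly from the coordinates. The Atom class and the coordinate values used with it are illustrative only and do not correspond to any particular file format.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    name: str       # atom name, e.g. "CA"
    residue: str    # residue name, e.g. "ALA"
    x: float        # Cartesian coordinates, typically in angstroms
    y: float
    z: float


def distance(a, b):
    """Euclidean distance between two atoms."""
    return math.sqrt((a.x - b.x) ** 2 + (a.y - b.y) ** 2 + (a.z - b.z) ** 2)


def centroid(atoms):
    """Geometric center of a non-empty list of atoms."""
    n = len(atoms)
    return (sum(a.x for a in atoms) / n,
            sum(a.y for a in atoms) / n,
            sum(a.z for a in atoms) / n)
```

Everything harder than this, such as recording how the structure was determined or which atoms are uncertain, is where the annotation issues mentioned above come in.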

3D Molecular Structure Data

Though DNA sequence, protein sequence, and protein structure are in some sense just different ways of representing the same gene product, these datatypes currently are maintained as separate database projects and in unconnected data formats. This is mainly because sequence and structure determination methods have separate histories of development.

The first public molecular biology database, set up about 10 years before the public DNA sequence databases, was the Protein Data Bank (PDB). It is the central repository for x-ray crystal structures of protein molecules. While the first complete protein structure was reported in the 1950s, there were not a significant number of protein structures available until the late 1970s. Computers had not developed to the point where graphical representation of protein structure coordinate data was possible, at least at useful speeds. Nevertheless, in 1971, the PDB was established at Brookhaven National Laboratory to store protein structure data in a computer-based archive. A data format was created that owed much of its style to the requirements of early computer technology. Throughout the 1980s, the PDB grew. From 15 entries in 1973, it grew to 69 entries in 1976. The number of coordinate sets deposited each year remained under 100 until 1988, at which time there were still fewer than 400 PDB entries.

Between 1988 and 1992, the PDB hit the turning point in its exponential growth curve. By January 1994, there were 2,143 entries in the PDB, and at the moment the PDB has more than 14,000 entries. Management of the PDB has been transferred to a consortium called the Research Collaboratory for Structural Bioinformatics, and a new format for recording crystallographic data, the Macromolecular Crystallographic Information File (mmCIF), is being introduced to replace the antiquated PDB format. Journals that publish crystallographic results require submission to the PDB as a condition of publication, which means that nearly all protein structure data obtained by academic researchers becomes available in the PDB.

A common problem for data-driven studies of protein structure is the redundancy and lack of comprehensiveness of the PDB. There are many proteins for which multiple crystal structures have been submitted to the database. Choosing subsets of the PDB data with which to work is therefore an important step in any statistical study of protein structure. Many statistical studies of protein structure are based on sets of protein chains that have no more than 25% of their sequence in common; if this criterion is used, there are still only around 1,000 unique protein folds represented in the PDB. As the amount of biological sequence data available has grown, the PDB now lags far behind the gene-sequence databases.
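
One simple way to pick such a subset is a greedy filter: go through the chains in order and keep a chain only if it is no more than 25% identical to every chain already kept. The sketch below assumes, unrealistically, that all chains have already been aligned to a common length with "-" marking gaps, so that identity can be computed column by column; real selection procedures compute pairwise alignments first. The function names and the toy sequences in the usage note are made up.

```python
def percent_identity(a, b):
    """Percent identical residues between two pre-aligned sequences.

    Both strings must have the same length; columns where either
    sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    matches = sum(1 for x, y in compared if x == y)
    return 100.0 * matches / len(compared)


def select_nonredundant(chains, max_identity=25.0):
    """Greedily keep chains that share at most max_identity percent identity.

    chains : list of (chain_id, aligned_sequence) pairs, in priority order
    Returns the ids of the chains that were kept.
    """
    kept = []
    for chain_id, seq in chains:
        if all(percent_identity(seq, other) <= max_identity
               for _, other in kept):
            kept.append((chain_id, seq))
    return [chain_id for chain_id, _ in kept]
```

With the toy input [("a", "ACDEFGHK"), ("b", "ACDEFGHR"), ("c", "WYWYWYWK")], chain b is dropped because it is 87.5% identical to a, while c is kept. The result depends on the input order, so chains are usually sorted by quality (for example, resolution) beforehand.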

DNA, RNA, and Protein Sequence Data

Sequence databases generally specialize in one type of sequence data: DNA, RNA, or protein. There are major sequence data collections and deposition sites in Europe, Japan, and the United States, and there are independent groups that mirror all the data collected in the major public databases, often offering some software that adds value to the data.

In 1970, Ray Wu sequenced the first segment of DNA: twelve bases that occurred as a single strand at the end of a circular DNA that had been opened using a cleaving enzyme. However, DNA sequencing proved much more difficult than protein sequencing, because there is no chemical process that selectively cleaves the first nucleotide from a nucleic acid chain. When Robert Holley reported the sequencing of a 76-nucleotide RNA molecule from yeast, it followed seven years of work. After Holley's sequence was published, other groups refined the sequencing protocols, even succeeding in sequencing a 3,200-base bacteriophage genome. Real progress with DNA sequencing came after 1975, with the chemical cleavage method developed by Allan Maxam and Walter Gilbert, and with Frederick Sanger's chain terminator procedure.

The first DNA sequence database, established in 1979, was the Gene Sequence Database (GSDB) at Los Alamos National Lab. While GSDB has since been supplanted by the worldwide collaboration that is the modern GenBank, up-to-date gene sequence information is still available from GSDB through the National Center for Genome Resources.

The European Molecular Biology Laboratory, the DNA Database of Japan, and the National Institutes of Health cooperate to make all publicly available sequence data accessible through GenBank. NCBI has developed a standard relational database format for the presentation and storage of sequence data, known as the ASN.1 format. While this format promises to make it easier to locate the right sequences of the right kind in GenBank, there are also various services that provide access to nonredundant versions of the database. The DNA sequence database grew slowly through its first decade. In 1992, GenBank contained just 78,000 DNA sequences, a little more than 100 million base pairs of DNA. In 1995, the Human Genome Project and advances in sequencing technology kicked GenBank's growth into high gear. GenBank currently doubles in size every 6 to 8 months, and its rate of increase is constantly growing.
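
To get a feel for what a doubling time of 6 to 8 months means, the short calculation below assumes steady exponential growth, so that size after t months is the starting size times 2^(t / doubling time). The function names are illustrative, and the steady-growth assumption is a simplification.

```python
def growth_factor(elapsed_months, doubling_time_months):
    """Factor by which a collection grows under a constant doubling time."""
    return 2 ** (elapsed_months / doubling_time_months)


def projected_size(current_size, elapsed_months, doubling_time_months):
    """Projected size after elapsed_months of steady exponential growth."""
    return current_size * growth_factor(elapsed_months, doubling_time_months)
```

Under this assumption, one year of growth multiplies the database by about 2.8 if it doubles every 8 months and by 4 if it doubles every 6 months.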

Genomic Data

In addition to the Human Genome Project, there are now separate genome project databases for a large number of model organisms. The sequence content of the genome project databases is represented in GenBank, but the genome project sites also provide everything from genome maps to supplementary resources for researchers working on that organism. As of October 2000, NCBI's Entrez Genome database contained the partial or complete genomes of over 900 species. Many of these are viruses. The remainder include bacteria; archaea; yeast; commonly studied plant model systems such as A. thaliana, rice, and maize; animal model systems such as C. elegans, fruit flies, mice, rats, and puffer fish; as well as organelle genomes. NCBI's web-based software tools for accessing these databases are constantly evolving and becoming more sophisticated.

Biochemical Pathway Data

The most important biological activities don't occur through the action of a single molecule, but as the orchestrated activities of multiple molecules. Since the mid-twentieth century, biochemists have studied these functional ensembles of enzymes and their substrates. A few research groups have begun work on systematically organizing and storing these pathways in databases. A key example of a pathway database is the Kyoto Encyclopedia of Genes and Genomes (KEGG), which stores comparative pathway information together with links to sequence, structure, and genetic linkage databases. This database is queryable through web interfaces and is curated by a combination of automation and human expertise. In addition to these whole genome "parts catalogs," other, more specialized databases that focus on specific pathways (such as intercellular signaling or degradation of chemical compounds by microbes) have been developed.

Gene Expression Data

DNA microarrays (or gene chips) are miniaturized laboratories for the study of gene expression. Each chip contains a deliberately designed array of probe molecules that can bind specific pieces of DNA or mRNA. Labeling the DNA or RNA with fluorescent molecules allows the level of expression of any gene in a cellular preparation to be measured quantitatively. Microarrays also have other applications in molecular biology, but their use in studying gene expression has opened up a new way of measuring genome functions.

Since the development of DNA microarray technology in the late 1990s, it has become clear that the growth in available gene expression data will eventually parallel the growth of the sequence and structure databases. Raw microarray data has begun to be made available to the public in dedicated databases, and a central repository for such data, the Gene Expression Omnibus, has been established.

Since many of the early microarray experiments were performed at Stanford, its genome resources site has links to raw data and databases that can be queried using gene names or functional descriptions. In addition, the European Bioinformatics Institute has been instrumental in setting standards for deposition of microarray data in databases. Several databases also exist for the deposition of 2D gel electrophoresis results, including SWISS-2DPAGE and HSC-2DPAGE. 2D-PAGE is a technology that allows quantitative study of protein concentrations in the cell, for many proteins at the same time. The combination of these two techniques is a powerful tool for understanding how genomes function.

Table 1 summarizes sources on the Web for some of the most important databases we've discussed in this section.

Table 1. Major Biological Data and Information Sources

Subject | Source | Link
Biomedical literature | PubMed | http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
Nucleic acid sequence | GenBank | http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Nucleotide
Nucleic acid sequence | SRS at EMBL/EBI | http://srs.ebi.ac.uk
Genome sequence | Entrez Genome | http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Genome
Genome sequence | TIGR databases | http://www.tigr.org/tdb/
Protein sequence | GenBank | http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Protein
Protein sequence | SWISS-PROT at ExPASy | http://www.expasy.ch/spro/
Protein sequence | PIR | http://www-nbrf.georgetown.edu
Protein structure | Protein Data Bank | http://www.rcsb.org/pdb/
Protein structure | Entrez Structure DB |
Protein and peptide mass spectroscopy | PROWL | http://prowl.rockefeller.edu
Post-translational modifications | RESID | http://www-nbrf.georgetown.edu/pirwww/search/textresid.html
Biochemical and biophysical information | ENZYME | http://www.expasy.ch/enzyme/
Biochemical and biophysical information | BIND | http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Structure
Biochemical pathways | PathDB | http://www.ncgr.org/software/pathdb/
Biochemical pathways | KEGG | http://www.genome.ad.jp/kegg/
Biochemical pathways | WIT | http://wit.mcs.anl.gov/WIT2/
Microarray | Gene Expression Links | http://industry.ebi.ac.uk/~alan/MicroArray/
2D-PAGE | SWISS-2DPAGE | http://www.expasy.ch/ch2d/ch2d-top.html
Web resources | The EBI Biocatalog | http://www.ebi.ac.uk/biocat/
Web resources | IUBio Archive | http://iubio.bio.indiana.edu



The European Commission support for the production of this publication does not constitute endorsement of the contents which reflects the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein.