LO4: Biology, biological databases, and high-throughput data sources

The Internet has completely changed the way scientists search for and exchange information. Data that once had to be communicated on paper is now digitized and distributed from centralized databases. Articles in journals are available online. And nearly every research group has a web page offering everything from reprints to software downloads to data to automated data-processing services.

Search Engines and Boolean Searching

AltaVista, Google, and dozens of other search engines exist to help you sift through the billion or more pages that may respond to your search. Scientists, however, are often looking for a couple of needles in a very large haystack. Knowing how to structure a query to exclude most of the junk that comes up in a search is very useful, both in web searching and in keyword-based database searching. Understanding how to formulate boolean queries that limit your search space is a critical research skill.

Most web surfers approach searching randomly at best. But each search engine makes different default assumptions, so if you enter protein structure into Excite's query field, you are asking for an entirely different search than if you enter protein structure into Google's query field. In order to search effectively, you need to use boolean logic, which is an extremely simple way of stating how a group of things should be divided or combined into sets.

Search engines and public biological databases use some form of boolean logic. Boolean queries restrict the results that are returned from a database by joining a series of search terms with the operators AND, OR, and NOT. For example, joining two key terms with AND finds only documents that contain both key term1 and key term2; using OR returns documents that contain either key term1 or key term2 (or both); and using NOT finds documents that contain key term1 but not key term2.

However, search engines differ in how they interpret a space. Some of them treat a space as OR, so when protein structure is typed, the search engine looks for protein or structure. As a result, a lot of advertisements for fad diets and protein supplements come up before you get to the scientific sites of interest. Google, on the other hand, treats a space as AND, so the only references found are those that contain both protein and structure.

Boolean queries are read from left to right, just like text. Parentheses can structure more complex boolean queries. For instance, if you look for documents that contain key term1 and one of either key term2 or key term3, but not key term4, your query would look like this: (key term1 AND (key term2 OR key term3)) NOT key term4.
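One way to see what the operators do is to treat each key term as the set of documents that contain it. The following Python sketch does this for a toy collection; the document ids and terms are invented for illustration, and real search engines are far more elaborate than this.

```python
# Toy collection: each document is reduced to the set of terms it contains.
# The ids and terms are invented for illustration only.
documents = {
    "doc1": {"protein", "structure", "enzyme"},
    "doc2": {"protein", "diet", "supplement"},
    "doc3": {"structure", "bridge", "engineering"},
    "doc4": {"protein", "structure", "diet"},
}

def containing(term):
    """Return the ids of all documents that contain the given term."""
    return {doc_id for doc_id, terms in documents.items() if term in terms}

# AND is set intersection, OR is set union, NOT is set difference.
protein_and_structure = containing("protein") & containing("structure")
protein_or_structure = containing("protein") | containing("structure")
protein_not_diet = containing("protein") - containing("diet")

# (key term1 AND (key term2 OR key term3)) NOT key term4
combined = (
    containing("protein") & (containing("structure") | containing("enzyme"))
) - containing("diet")

print(sorted(protein_and_structure))
print(sorted(protein_or_structure))
print(sorted(protein_not_diet))
print(sorted(combined))
```

The parentheses in the last expression play the same role as in the written query: the OR is evaluated first, the AND narrows it, and the NOT removes what is left over.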

Many search engines allow you to use quotation marks to specify a phrase. To find only documents in which the words enzyme and activity appear together in sequence, searching for "enzyme activity" is one way to narrow the results.

There are many excellent web tutorials available on boolean searching. Try a search with the phrase boolean searching in Google, and see what comes up.

Finding Scientific Articles

An excellent resource for searching the scientific literature in the biological sciences is the free server sponsored by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine. This server makes it possible for anyone with a web browser to search the Medline database. There are other literature databases of comparable quality available, but most of these are not free. Outside of refereed resources, however, anyone can publish information on the Web. Often research groups make papers available as technical reports on their web sites. These technical reports may never be peer reviewed or published outside the research group's home organization, and your only evidence of their quality is the reputation and expertise of the authors. This isn't to say that you shouldn't trust or seek out these sources. Many government organizations and academic research groups have reference material of near-textbook quality on their web sites. For example, the University of Washington Genome Center has an excellent tutorial on genome sequencing, and NCBI has a good practical tutorial on use of the BLAST sequence alignment program and its variants.

Using PubMed Effectively

PubMed is one of the most valuable web resources available to biologists. Over 4,000 journals are indexed in PubMed, including most of the well-regarded journals in cell and molecular biology, biochemistry, genetics, and related fields, as well as many clinical publications of interest to medical professionals. PubMed uses a keyword-based search strategy and allows the boolean operators AND, OR, and NOT in query statements. Users can specify which database fields to check for each search term by following the search term with a field name enclosed in square brackets. Additionally, users can search PubMed using Medical Subject Heading (MeSH) terms. MeSH is a library of standardized terms that may help locate manuscripts that use alternate terms to refer to the same concept. The MeSH browser allows users to enter a word or word fragment and find related keywords in the MeSH library. PubMed automatically finds MeSH terms related to query terms and uses them to enhance queries.

For example, we searched for "protein structure" in PubMed. The terms protein and structure are automatically joined with an AND unless otherwise specified. The resulting boolean query statement submitted to PubMed is actually:

("proteins"[MeSH Terms] OR "proteins"[All Fields] OR "protein"[All Fields]) AND ("Structure"[Journal] OR "structure"[All Fields])

The results of the search are shown in Figure 1.
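The expanded query shown above follows a regular pattern: quoted terms, an optional field name in square brackets, OR within a group of alternatives, and AND between groups. As a rough sketch, a query of that shape could be assembled in Python as follows. The helper names tag, any_of, and all_of are hypothetical and are not part of any PubMed software; the code only builds the query string and does not contact PubMed.

```python
def tag(term, field=None):
    """Quote a term and, if a field is given, append it in square brackets."""
    quoted = f'"{term}"'
    return f"{quoted}[{field}]" if field else quoted

def any_of(*clauses):
    """Join alternatives with OR and wrap the group in parentheses."""
    return "(" + " OR ".join(clauses) + ")"

def all_of(*clauses):
    """Join groups with AND."""
    return " AND ".join(clauses)

query = all_of(
    any_of(
        tag("proteins", "MeSH Terms"),
        tag("proteins", "All Fields"),
        tag("protein", "All Fields"),
    ),
    any_of(
        tag("Structure", "Journal"),
        tag("structure", "All Fields"),
    ),
)
print(query)
```

The printed string can be pasted into the PubMed query box as it stands.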

Figure 1. Results from a PubMed search

As you can see in Figure 2, PubMed also allows you to use a web interface to narrow your search.

The Advanced link immediately below the query box on the main PubMed page takes you to this web form.

Figure 2. Narrowing a search strategy using the Advanced menu in PubMed

The Advanced form allows you to add specificity to your query. You can limit your search to particular fields in the PubMed database record, such as the Author Name or MeSH Major Topic. Searches can also be limited by language, content (e.g., searching for review articles or clinical trials only), and date.

The Public Biological Databases

The nomenclature problem in biology at the molecular level is immense. Genes are commonly known by unsystematic names. These may come from developmental biology studies in model systems, so that some genes have names like flightless, shaker, and antennapedia due to the developmental effects they cause in a particular animal. Other names are chosen by cellular biologists and represent the function of genes at a cellular level, like homeobox. Still other names are chosen by biochemists and structural biologists and refer to a protein that was probably isolated and studied before the gene was ever found.

Though proteins are direct products of genes, they are not always referred to by the same names or codes as the genes that encode them. This kind of confusing nomenclature generally means that only a scientist who works with a particular gene, gene product, or biochemical process can immediately recognize what the common name of the gene refers to. The biochemistry of a single organism is a more complex set of information than the taxonomy of living species was at the time of Linnaeus, so it isn't to be expected that a clear and comprehensive system of nomenclature will be arrived at easily. There are many things to be known about a given gene: its source organism, its chromosomal location, the location of its activator sequences, and the identities of the proteins that down- and up-regulate it. Genes can also be categorized by when during the organism's development they are expressed, and in which tissues the expression occurs. They can be characterized by the function of their product, whether it's a structural protein, an enzyme, or a functional RNA. They can be characterized by the metabolic pathway that their product is part of, by the substrate they modify, or by the product they produce. Moreover, they can be categorized by the structural characteristics of their protein products. Figure 3 shows some of the information that can be associated with a single gene.

Figure 3. Part of the information associated with a single gene

The problem for maintainers of biological databases becomes mainly one of annotation. Correct annotation of genomic data may be achieved by putting enough information into the database that there is no question of what the gene is, even if it has a cryptic common name, and by creating the proper links between that information and the gene sequence and serial number. Storage of macromolecular data in electronic databases has given rise to a way of working around the problem of nomenclature. The solution has been to give each new entry in the database a serial number and then to store it in a relational database that knows the proper linkages between that serial number, any number of names for the gene or gene product it represents, and all manner of other information about the gene. This strategy is the one currently in use in the major biological databases.
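The serial-number strategy can be illustrated with a very small relational schema. The sketch below uses Python's built-in sqlite3 module; the table layout, the accession number EX000001, the alternate names, and the sequence are all invented for illustration and do not reflect how any particular public database is actually organized.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entry (
        accession TEXT PRIMARY KEY,   -- the stable serial number
        organism  TEXT,
        sequence  TEXT
    );
    CREATE TABLE entry_name (
        accession TEXT REFERENCES entry(accession),
        name      TEXT                -- any number of names per entry
    );
""")

# One hypothetical entry known under several names.
conn.execute(
    "INSERT INTO entry VALUES (?, ?, ?)",
    ("EX000001", "Drosophila melanogaster", "ATGGCC"),
)
conn.executemany(
    "INSERT INTO entry_name VALUES (?, ?)",
    [("EX000001", "shaker"), ("EX000001", "example potassium channel gene")],
)

def find_by_name(name):
    """Resolve any known name to the accession number and organism."""
    return conn.execute(
        "SELECT e.accession, e.organism "
        "FROM entry e JOIN entry_name n ON n.accession = e.accession "
        "WHERE n.name = ?",
        (name,),
    ).fetchall()

print(find_by_name("shaker"))
print(find_by_name("example potassium channel gene"))
```

Whichever name is used in the query, the lookup lands on the same accession number, which is the point of the scheme: names can multiply without the record itself becoming ambiguous.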

Data Annotation and Data Formats

The representation and distribution of biological data is still an open problem in bioinformatics. The nucleotide sequences of DNA and RNA and the amino acid sequences of proteins reduce neatly to character strings in which a single letter represents a single nucleotide or amino acid. The remaining challenges in representing sequence data are verification of the correctness of the data, thorough annotation of data, and handling of data that comes in ever-larger chunks, such as the sequences of chromosomes and whole genomes.

The standard reduced representation of the 3D structure of a biomolecule consists of the Cartesian coordinates of the atoms in the molecule. This aspect of representing the molecule is straightforward. On the other hand, there are a host of complex issues for structure databases that are not completely resolved. Annotation is still an issue for structural data, although the biology community has attempted to form a consensus as to what annotation of a structure is currently required. In the last 15 years, different researchers have developed their own styles and formats for reporting biological data. Biological sequence and structure databases have developed in parallel in the United States and in Europe. The use of proprietary software for data analysis has contributed a number of proprietary data formats to the mix. While there are many specialized databases, we focus here on the fields in which an effort is being made to maintain a comprehensive database of an entire class of data.

3D Molecular Structure Data

Though DNA sequence, protein sequence, and protein structure are in some sense just different ways of representing the same gene product, these datatypes currently are maintained as separate database projects and in unconnected data formats. This is mainly because sequence and structure determination methods have separate histories of development.

The first public molecular biology database, set up about 10 years before the public DNA sequence databases, was the Protein Data Bank (PDB). It is the central repository for x-ray crystal structures of protein molecules. While the first complete protein structure was reported in the 1950s, there was not a significant number of protein structures available until the late 1970s. Computers had not yet developed to the point where graphical representation of protein structure coordinate data was possible, at least at useful speeds. Nevertheless, in 1971 the PDB was established at Brookhaven National Laboratory to store protein structure data in a computer-based archive. A data format was developed that owed much of its style to the requirements of early computer technology. From 15 sets of coordinates in 1973, the PDB grew to 69 entries in 1976, and it continued to grow through the 1980s. The number of coordinate sets deposited each year remained under 100 until 1988, at which time there were still fewer than 400 PDB entries.

Between 1988 and 1992, the PDB hit the turning point in its exponential growth curve. By January 1994, there were 2,143 entries in the PDB, and at the time of writing the PDB has more than 14,000 entries. Management of the PDB has been transferred to a consortium called the Research Collaboratory for Structural Bioinformatics (RCSB), and a new format for recording crystallographic data, the Macromolecular Crystallographic Information File (mmCIF), is being introduced to replace the antiquated PDB format. Journals that publish crystallographic results require submission to the PDB as a condition of publication, which means that nearly all protein structure data obtained by academic researchers becomes available in the PDB.

A common problem for data-driven studies of protein structure is the redundancy and lack of comprehensiveness of the PDB. There are many proteins for which multiple crystal structures have been submitted to the database. Selecting subsets of the PDB data to work with is therefore an important step in any statistical study of protein structure. Many statistical studies of protein structure are based on sets of protein chains that have no more than 25% of their sequence in common; if this criterion is used, there are still only around 1,000 unique protein folds represented in the PDB. As the amount of biological sequence data available has grown, the PDB has fallen far behind the gene-sequence databases.
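The idea of selecting a nonredundant subset can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the sequences are taken to be already aligned to equal length, the greedy order-dependent selection is one simple choice among many, and the chain names and sequences are invented. Real studies use proper alignment programs and curated nonredundant sets.

```python
def percent_identity(a, b):
    """Percent of positions with identical residues in two pre-aligned,
    equal-length sequences; gap characters ('-') never count as matches."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)

def nonredundant(chains, max_identity=25.0):
    """Greedily keep each chain that shares at most max_identity percent
    of its sequence with every chain kept so far."""
    kept = {}
    for name, seq in chains.items():
        if all(percent_identity(seq, other) <= max_identity
               for other in kept.values()):
            kept[name] = seq
    return kept

# Invented example chains, already aligned to the same length.
chains = {
    "chainA": "ACDEFGHIKL",
    "chainB": "ACDEFGHIKV",  # 90% identical to chainA, so it is dropped
    "chainC": "WYWYWYWYWL",  # 10% identical to chainA, so it is kept
}
print(list(nonredundant(chains)))
```

Because the selection is greedy, the result depends on the order in which chains are examined; in practice the order is often chosen so that the best-quality structure of each group is seen first.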

DNA, RNA, and Protein Sequence Data

Sequence databases generally specialize in one type of sequence data: DNA, RNA, or protein. There are major sequence data collections and deposition sites in Europe, Japan, and the United States, and there are independent groups that mirror all the data collected in the major public databases, often offering some software that adds value to the data.

In 1970, Ray Wu sequenced the first segment of DNA: twelve bases that occurred as a single strand at the end of a circular DNA that had been opened using a cleaving enzyme. However, DNA sequencing proved considerably more difficult than protein sequencing, because there is no chemical process that selectively cleaves the first nucleotide from a nucleic acid chain. When Robert Holley announced the sequencing of a 76-nucleotide RNA molecule from yeast, it followed seven years of work. After Holley's sequence was published, other groups refined the sequencing protocols, even succeeding in sequencing a 3,200-base bacteriophage genome. Real progress in DNA sequencing came after 1975, with the chemical cleavage method developed by Allan Maxam and Walter Gilbert, and with Frederick Sanger's chain terminator method.

The first DNA sequence database, established in 1979, was the Gene Sequence Database (GSDB) at Los Alamos National Lab. While GSDB has since been supplanted by the worldwide collaboration that is the modern GenBank, up-to-date gene sequence information is still available from GSDB through the National Center for Genome Resources.

The European Molecular Biology Laboratory, the DNA Database of Japan, and the National Institutes of Health cooperate to make all publicly available sequence data accessible through GenBank. NCBI has developed a standard relational database format for sequence data presentation and storage, known as the ASN.1 format. While this format promises to make locating the right sequences of the right kind in GenBank easier, there are also various services providing access to nonredundant versions of the database. The DNA sequence database grew slowly through its first decade. In 1992, GenBank contained just 78,000 DNA sequences, a little more than 100 million base pairs of DNA. In 1995, the Human Genome Project, together with advances in sequencing technology, kicked GenBank's growth into high gear. GenBank currently doubles in size every 6 to 8 months, and its rate of increase is constantly growing.
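To get a feel for what a doubling time of 6 to 8 months implies, the growth factor over a given period can be computed directly. The short Python sketch below assumes a constant doubling time, which is a simplification (the text notes that the rate is itself increasing), and the function name is just an illustrative choice.

```python
def growth_factor(elapsed_months, doubling_time_months):
    """Factor by which a quantity grows over elapsed_months if it doubles
    every doubling_time_months (constant doubling time assumed)."""
    return 2 ** (elapsed_months / doubling_time_months)

for doubling_time in (6, 8):
    print(doubling_time, growth_factor(24, doubling_time))
```

Under this assumption, two years of growth corresponds to a 16-fold increase at a 6-month doubling time and an 8-fold increase at an 8-month doubling time.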

Genomic Data

In addition to the Human Genome Project, there are now separate genome project databases for a large number of model organisms. The sequence content of the genome project databases is represented in GenBank, but the genome project sites also provide everything from genome maps to supplementary resources for researchers working on that organism. As of October 2000, NCBI's Entrez Genome database contained the partial or complete genomes of over 900 species. Many of these are viruses. The remainder include bacteria; archaea; yeast; commonly studied plant model systems such as A. thaliana, rice, and maize; animal model systems such as C. elegans, fruit flies, mice, rats, and puffer fish; as well as organelle genomes. NCBI's web-based software tools for accessing these databases are constantly evolving and becoming more sophisticated.

Biochemical Pathway Data

The most important biological activities don't occur through the action of a single molecule, but rather as the orchestrated activities of multiple molecules. Since the mid-twentieth century, biochemists have analyzed these functional ensembles of enzymes and their substrates. A few research groups have begun work on intelligently organizing and storing these pathways in databases. A key example of a pathway database is the Kyoto Encyclopedia of Genes and Genomes (KEGG), which stores comparative information drawn from sequence, structure, and genetic linkage databases. The database is queryable through web interfaces and is curated by a combination of automation and human expertise. In addition to such whole-genome "parts catalogs," other, more specialized databases that focus on specific pathways (such as intercellular signaling or degradation of chemical compounds by microbes) have been developed.

Gene Expression Data

DNA microarrays (or gene chips) are miniaturized laboratories for the study of gene expression. Each chip contains a deliberately designed array of probe molecules that can bind specific pieces of DNA or mRNA. Labeling the DNA or RNA with fluorescent molecules allows the level of expression of any gene in a cellular preparation to be measured quantitatively. Microarrays also have other applications in molecular biology, but their use in studying gene expression has opened up a new way of measuring genome functions.

Since the development of DNA microarray technology in the late 1990s, it has become clear that the increase in available gene expression data will eventually parallel the growth of the sequence and structure databases. Raw microarray data has begun to be made available to the public in specialized databases, and a central repository for such data, the Gene Expression Omnibus, has been established.

Since a significant number of the early microarray experiments were performed at Stanford, their genome resources site has links to raw data and to databases that can be queried using gene names or functional descriptions. Furthermore, the European Bioinformatics Institute has been instrumental in setting up standards for deposition of microarray data in databases. Several databases also exist for the deposition of 2D gel electrophoresis results, including SWISS-2DPAGE and HSC-2DPAGE. 2D-PAGE is a technology that permits quantitative study of protein concentrations in the cell, for many proteins at the same time. The combination of these two techniques, microarrays and 2D-PAGE, is a powerful tool for understanding how genomes function.

Table 1 summarizes sources on the Web for some of the most important databases we've discussed in this section.

Table 1. Major Biological Data and Information Sources

Subject | Source | Link
Biomedical literature | PubMed | http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
Nucleic acid sequence | GenBank | http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Nucleotide
Nucleic acid sequence | SRS at EMBL/EBI | http://srs.ebi.ac.uk
Genome sequence | Entrez Genome | http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Genome
Genome sequence | TIGR databases | http://www.tigr.org/tdb/
Protein sequence | GenBank | http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Protein
Protein sequence | SWISS-PROT at ExPASy | http://www.expasy.ch/spro/
Protein sequence | PIR | http://www-nbrf.georgetown.edu
Protein structure | Protein Data Bank | http://www.rcsb.org/pdb/
Protein structure | Entrez Structure DB |
Protein and peptide mass spectroscopy | PROWL | http://prowl.rockefeller.edu
Post-translational modifications | RESID | http://www-nbrf.georgetown.edu/pirwww/search/textresid.html
Biochemical and biophysical information | ENZYME | http://www.expasy.ch/enzyme/
Biochemical and biophysical information | BIND | http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Structure
Biochemical pathways | PathDB | http://www.ncgr.org/software/pathdb/
Biochemical pathways | KEGG | http://www.genome.ad.jp/kegg/
Biochemical pathways | WIT | http://wit.mcs.anl.gov/WIT2/
Microarray | Gene Expression Links | http://industry.ebi.ac.uk/~alan/MicroArray/
2D-PAGE | SWISS-2DPAGE | http://www.expasy.ch/ch2d/ch2d-top.html
Web resources | The EBI Biocatalog | http://www.ebi.ac.uk/biocat/
Web resources | IUBio Archive | http://iubio.bio.indiana.edu

Searching Biological Databases

There are numerous biological databases, and many alternative web interfaces that provide access to the same sets of data. Which one you use depends on your needs, but you should be aware of which databases are the central repositories for each kind of data, and of how often the peripheral databases are synchronized with the central data sources.

The two most established databases are NCBI's GenBank, for DNA sequences; and the Protein Data Bank (PDB), for molecular structure data. Each database has its own deposition procedures. However, both NCBI and PDB have well developed, automated, web-based deposition systems that do not change often over time.

GenBank

NCBI, in cooperation with EMBL and other international organizations, provides the most complete collection of DNA sequence data in the world: the database known as GenBank.

NCBI maintains sequence data from every organism, every source, every type of DNA—from mRNA to cDNA clones to expressed sequence tags (ESTs) to high-throughput genome sequencing data and information about sequence polymorphisms. Users of the NCBI database need to be aware of the differences between these datatypes so that they can search the data set that's most appropriate for the work they're doing. The main sequence types that you'll encounter in a full GenBank search include:

mRNA

Messenger RNA, the product of transcription of genomic DNA. mRNA may be edited by the cell to remove introns (in eukaryotes) or in other ways that result in differences from the transcribed genomic DNA. May be "partial" or "complete"; an mRNA may not cover the complete coding sequence of a gene.

cDNA

A DNA sequence artificially generated by reverse transcription of mRNA. cDNA represents the coding components of the genomic DNA region that produced the mRNA. May be "partial" or "complete."

Genomic DNA

A DNA sequence from genome sequencing that contains both coding and noncoding DNA sequences. May contain introns, repeat regions, and others. Genomic DNA is generally "complete"; it's a result of multiple sequencing experiments over a single stretch of a genome, and can generally be relied upon as a fairly good representation of the real DNA sequence of that region.

EST

Short cDNA sequences prepared from mRNA extracted from a cell under particular conditions or in specific developmental phases. ESTs are used for quick identification of genes and don't cover the entire coding sequence of a gene.

GSS

Genome survey sequence. Single-pass sequence read directly from a genome project. Covers each region of sequence only once and may contain a relatively large percentage of sequencing errors. Include genome survey sequences in a search only when you are looking for very new, hypothetical gene annotations in a genome project that is still in progress.

There are two ways to search GenBank. The first is to use a text-based query to search the annotations associated with each DNA sequence entry in the database. The second is to use a method called BLAST to compare a query DNA (or protein) sequence to a sequence database. Each GenBank entry contains annotation (information about the gene's identity, the conditions under which it was characterized, and so on) in addition to the sequence itself. A sample GenBank record is shown in Fig. 4.

Fig. 4. GenBank record of Listeria monocytogenes superoxide dismutase gene

This sample GenBank record shows the types of fields that can be found in a record from the GenBank Nucleotide database. The record contains everything from the identity of the protein product, the sequence of the protein product, and its starting and ending points within the gene, to the authors who submitted the record and the journal references in which the experiment was described. The GenBank search interface is nearly identical to the PubMed search interface. The Advanced features for searching work the same way in the Protein, Nucleic Acid, and Genome databases as they do for PubMed, although the specific fields that can be searched and the limits that can be set differ somewhat.

Saving search results

Sequences can be downloaded from NCBI in several file formats: the simple FASTA format, which is readable by many sequence analysis programs but contains little information other than sequence; the GenBank flat file format, which is a legacy flat file format that was used at GenBank earlier in its history; and the modern ASN.1 (Abstract Syntax Notation One) format. ASN.1 is a generic data specification, designed to promote database interoperability, that is now used for storage and retrieval of all datatypes—sequences, genomes, structure, and literature—at NCBI. The NCBI Toolkit, a code library for developing molecular biology software, relies on the ASN.1 specification. NCBI, and increasingly, other organizations, rely on the NCBI Toolkit for software development.

The casual database user or depositor doesn't have to think too much about file formats, except if database files are to be exported and read by another piece of software. NCBI's forms-based interfaces convert user-entered data into the appropriate format for deposition, and the availability of GenBank files in FASTA format means that most sequence analysis software can handle sequence files you download from NCBI without complicated conversions.

When saving the results of a GenBank search, you can easily choose the format in which to save them. A particularly foolproof format in which to save your sequence files, if you're going to process them with other software, is the FASTA format. FASTA files have a simple format: a single comment line that begins with a > character, followed by the sequence in single-letter code (nucleotides or amino acids) on as many lines as needed to hold it, with no breaks. Of course, some information associated with the gene is lost when you save the data in FASTA format, but if the program can't read that extra data, it won't be useful to have it anyway.

Here's a sample of data in FASTA format:

> gene identifier and comments here
MATVQEIRNAQRADGPATVLAIGTATPAHSVNQADYPDYY
FRITKSEHMTELKEKFKRMCDKSMIKKRYMYLTEEILKEN
PNMCAYMAPSLDARQDIVVVEVPKLGKEAATKAIKEWGQP
KSKITHLIFCTTSGVDMPGADYQLTKLIGLRPSVKRFMMY
QQGCFAGGTVLRLAKDLAENNKGARVLVVCSEITAVTFRG
PADTHLDSLVGQALFGDGAAAVIVGADPDTSVERPLYQLV
STSQTILPDSDGAIDGHLREVGLTFHLLKDVPGLISKNIE
KSLSEAFAPLGISDWNSIFWIAHPGGPAILDQVESKLGLK
GEKLKATRQVLSEYGNMSSACVLFILDEMRKKSVEEAKAT
TGEGLDWGVLFGFGPGLTVETVVLHSVPIKA
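Because the format is so simple, reading it takes only a few lines of code. The following Python function is a minimal sketch of a FASTA reader, not a replacement for an established sequence library; the function name is an illustrative choice, and the example uses only the first two sequence lines of the sample above.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new comment line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = """> gene identifier and comments here
MATVQEIRNAQRADGPATVLAIGTATPAHSVNQADYPDYY
FRITKSEHMTELKEKFKRMCDKSMIKKRYMYLTEEILKEN
"""

for header, sequence in parse_fasta(example):
    print(header, len(sequence))
```

The sequence lines are joined into one string, so the line length used in the file does not matter to downstream code.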

To save your files in FASTA format, simply use the pulldown menu at the top of the results page. When you first see it, it will say "Summary," but you can change it to FASTA, ASN.1, and other formats. Once you've chosen your format, you can click the Save button to save all your sequences into one big FASTA-format file. Figure 5 shows you how to change the file formats when doing a GenBank search.

Figure 5. Selecting the file format to write out a GenBank search result

Saving large result sets

Modern bioinformatics studies increasingly deal with large amounts of sequence data. For example, gene finding programs are verified on hundreds or thousands of DNA sequences; comprehensive studies of protein families can involve analysis of up to thousands of protein sequences as well. In such cases it would be better to use an automated tool that can return a large number of sequences based on criteria you specify.

NCBI provides just such a tool in the form of Batch Entrez. Batch Entrez is one of the tools that allows the user to select sequences by source organism, by an Entrez query (using the query structure described in the section on PubMed), or by a list of accession numbers (provided by the user in the form of a text file). The results of a Batch Entrez search are then packaged in a file that is downloaded to the user's computer, where the complete result set can be edited manually or using a script.

At this time, all the public databases have at least FTP sites that allow you to download the entire database to your own computer. This can take up a lot of disk space, but a large result set is easier to handle locally than through the interactive web site. With a local copy of the databases of interest, you can write a script that processes the database, looks for a keyword of your choice, and writes the desired information out to a file.
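A script of the kind just described might look like the following Python sketch. It assumes a local file in the GenBank flat file layout, in which each record ends with a line containing only // and top-level fields such as DEFINITION and ACCESSION start in the first column. The two records in the sample text are invented, and the function names are illustrative; for a real download you would pass an open file instead of the in-memory sample.

```python
import io

def iter_records(handle):
    """Yield each record as a list of lines; a '//' line ends a record."""
    record = []
    for line in handle:
        line = line.rstrip("\n")
        if line.strip() == "//":
            if record:
                yield record
            record = []
        else:
            record.append(line)

def field(record, keyword):
    """Return the text of a top-level field, joining indented continuation lines."""
    parts = []
    collecting = False
    for line in record:
        if line.startswith(keyword):
            parts.append(line[len(keyword):].strip())
            collecting = True
        elif collecting and line.startswith(" "):
            parts.append(line.strip())
        elif collecting:
            break
    return " ".join(parts)

def search(handle, word):
    """Return (accession, definition) for records whose definition mentions word."""
    word = word.lower()
    hits = []
    for record in iter_records(handle):
        definition = field(record, "DEFINITION")
        if word in definition.lower():
            hits.append((field(record, "ACCESSION"), definition))
    return hits

# Invented two-record sample standing in for a downloaded database file.
sample = """LOCUS       EX000001
DEFINITION  Hypothetical example superoxide dismutase gene,
            complete cds.
ACCESSION   EX000001
//
LOCUS       EX000002
DEFINITION  Hypothetical example catalase gene.
ACCESSION   EX000002
//
"""

for accession, definition in search(io.StringIO(sample), "superoxide"):
    print(accession, definition)
```

Because the file is read one line at a time, the same approach works on files far too large to load into memory at once.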

PDB

Unlike NCBI, the Protein Data Bank (PDB) contains only one type of data: three-dimensional structures of molecules and, to a growing extent, the underlying raw data sets from which the structures were modeled. It offers a number of services for submitting and retrieving three-dimensional structure data. The home page of the RCSB site provides links to services for depositing three-dimensional structures, information on how to obtain the status of structures undergoing processing for submission, ways to download the PDB database, and links to other relevant sites and software.

Figure 6. PDB features

The main information stored in the PDB consists of coordinate files for biological molecules. These files list the atoms in each protein and their 3D locations in space. They are available in several formats (PDB, mmCIF, XML). A typical PDB file contains text that describes the protein, citation information, and the details of the structure solution, followed by the sequence and a list of the atoms and their coordinates. PDB files can be viewed directly using a text editor. Online tools, such as the ones on the RCSB PDB website, allow you to search and explore the information in the PDB header, including information on experimental methods and the chemistry and biology of the protein (Fig. 7).
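Since PDB-format files are plain text, the atom list can also be read with a short script. In the legacy PDB format, each atom is described by a fixed-width ATOM or HETATM line in which every field occupies set columns. The Python sketch below extracts a few of those fields; the sample line is invented (the coordinates are not taken from any real entry), and a production tool would normally rely on an established structure-parsing library instead.

```python
def parse_atom_line(line):
    """Extract a few fields from one fixed-width ATOM/HETATM line."""
    return {
        "atom_name": line[12:16].strip(),     # columns 13-16
        "residue_name": line[17:20].strip(),  # columns 18-20
        "chain_id": line[21],                 # column 22
        "residue_number": int(line[22:26]),   # columns 23-26
        "x": float(line[30:38]),              # columns 31-38
        "y": float(line[38:46]),              # columns 39-46
        "z": float(line[46:54]),              # columns 47-54
    }

def read_atoms(lines):
    """Parse every ATOM or HETATM line and skip all other record types."""
    return [parse_atom_line(line) for line in lines
            if line.startswith(("ATOM", "HETATM"))]

# An invented atom line, assembled piece by piece to show the column layout.
sample_line = (
    "ATOM  "     # columns 1-6: record type
    "    1"      # columns 7-11: atom serial number
    " "          # column 12: unused
    " N  "       # columns 13-16: atom name
    " "          # column 17: alternate location indicator
    "MET"        # columns 18-20: residue name
    " "          # column 21: unused
    "A"          # column 22: chain identifier
    "   1"       # columns 23-26: residue sequence number
    "    "       # columns 27-30: insertion code and padding
    "  27.340"   # columns 31-38: x coordinate
    "  24.430"   # columns 39-46: y coordinate
    "   2.614"   # columns 47-54: z coordinate
)

atoms = read_atoms(["HEADER    EXAMPLE", sample_line])
print(atoms[0])
```

Fixed columns, not whitespace splitting, are used deliberately: in crowded lines adjacent fields can run together without any space between them.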

Figure 7. Query results at the PDB

The structure files may be viewed using one of several free and open source computer programs, including Jmol, PyMOL, VMD, and RasMol. Other programs, not all of them free or open source, include ICM-Browser, MDL Chime, UCSF Chimera, Swiss-PDB Viewer, StarBiochem (a Java-based interactive molecular viewer with integrated search of the Protein Data Bank), Sirius, VisProt3DS (a tool for protein visualization in 3D stereoscopic view in anaglyph and other modes), and Discovery Studio. The RCSB PDB website contains an extensive list of both free and commercial molecule visualization programs and web browser plugins, as shown in Figure 8.

Figure 8. Viewing a PDB file using a browser plug-in

Depositing Data into the Public Databases

In addition to downloading information from the public databases, you may also submit your own results.

GenBank Deposition

Deposition of sequences to GenBank has been made extremely simple by NCBI. Users depositing only a few sequences can use the web-based BankIt tool, which is a self-explanatory form-based interface accessible from the GenBank main page at NCBI. NCBI has recently established two special submission paths: EST sequences should be submitted through dbEST, rather than to GenBank, and genome survey sequences through dbGSS.

PDB Deposition

Deposition of structures to the PDB is done using the wwPDB OneDep System, which integrates data validation software with the deposition process so that the user can receive feedback on data quality during deposition. The wwPDB OneDep System is tied in with the curation tools the PDB uses to prepare structure data for inclusion in the data bank.

Finding Software

Bioinformatics is a broad field, attracting researchers from many disciplines, and articles about new research developments in bioinformatics are widely distributed in the literature. If you're looking for cutting-edge developments, journals such as Bioinformatics, Nucleic Acids Research, Journal of Molecular Biology, and Protein Science often publish papers describing innovations in computational biology methods.

If you're looking for proven software for a particular application, there are a number of reliable web resource lists that link to computational biology software sites. Most of the major biological databases have software resource listings and the necessary motivation to keep their listings up-to-date. The PDB links to the best free software packages for macromolecular structure refinement, visualization, and dynamics. ExPASy and NCBI portals provide links to many tools for protein and DNA sequence analysis.

Judging the Quality of Information

The ability to judge the quality of information and software will improve as you continue to learn the field. One of the first things to consider when evaluating software, data, or information found on the Internet is the source. If you don't know the authors presenting the information by reputation, search for information about their affiliation and credentials available on the web site. Their expertise related to the topic or purpose of the web site is also important. An individual academic researcher's site doesn't always have the same need to be all-inclusive as a publicly funded database does. There is nothing inherently wrong with these offerings, but you should be aware of whether or not they are comprehensive, whether all their features are available to the casual user, and why.

Even data and software from national or international public sites are not necessarily entirely correct. It has been estimated that any given sequence in GenBank is likely to contain at least one error. While these errors generally don't render the data meaningless, it's always best to be aware of such issues even when using top-of-the-line public resources. Like any other software you find on the Web, software offered by public agencies such as NCBI and the PDB may still be under development. You can use this software, and much of it is of good quality. If you're basing your research on a beta version (a version still under development) of a software package, just read the documentation carefully so that you know what problems still remain to be worked out.

Funding

Disclaimer

The European Commission support for the production of this publication does not constitute an endorsement of the contents, which reflect the views only of the authors, and the Commission cannot be held responsible for any use which may be made of the information contained therein.