SureshKumar's Bioinformatics Blog

I am Suresh Kumar Sampathrajan. I completed my PhD in bioinformatics at the University of Vienna, Austria, in 2010. If you want to know more about me and my research, please click the menus at the top.

I started this bioinformatics blog mainly for undergraduate and postgraduate students of bioinformatics. It will serve as an open resource for students and for anyone who wishes to learn about bioinformatics. The blog contains video tutorials, tips, bioinformatics software downloads, articles on bioinformatics, and career opportunities.

Ace of (data)base

A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system. A simple database might be a single file containing many records, each of which includes the same set of information. For example, a record associated with a nucleotide sequence database typically contains information such as contact name; the input sequence with a description of the type of molecule; the scientific name of the source organism from which it was isolated; and, often, literature citations associated with the sequence.

For researchers to benefit from the data stored in a database, two additional requirements must be met:
1. Easy access to the information; and
2. A method for extracting only the information needed to answer a specific biological question.
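
The idea of a simple database as many records with the same set of fields, queried for only the information needed, can be sketched in a few lines of Python. This is only an illustration: the class name `NucleotideRecord`, its field names, the helper `find_by_organism`, and all sample values are hypothetical and do not correspond to any real database schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NucleotideRecord:
    """One record of a simple, hypothetical nucleotide sequence database."""
    contact_name: str
    sequence: str
    molecule_type: str   # e.g. "DNA" or "mRNA"
    organism: str        # scientific name of the source organism
    citations: List[str] = field(default_factory=list)


def find_by_organism(records, organism):
    """Return only the records isolated from the given organism."""
    return [r for r in records if r.organism == organism]


# A tiny in-memory "database": many records, each with the same fields.
# All values below are made up for illustration.
database = [
    NucleotideRecord("A. Researcher", "ATGGCCATTGTAATGGGCCGC", "DNA", "Homo sapiens"),
    NucleotideRecord("B. Scientist", "ATGAAACGCATTAGCACCACC", "DNA", "Escherichia coli"),
]

human_records = find_by_organism(database, "Homo sapiens")
print(len(human_records), human_records[0].contact_name)
```

A real database adds persistence, indexing, and update software on top of this, but the record-plus-query pattern is the same.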

The principal requirements on the public data services are:

* Data quality - data quality has to be of the highest priority. However, because the data services in most cases lack access to supporting data, the quality of the data must remain the primary responsibility of the submitter.
* Supporting data - database users will need to examine the primary experimental data, either in the database itself, or by following cross-references back to network-accessible laboratory databases.
* Deep annotation - deep, consistent annotation comprising supporting and ancillary information should be attached to each basic data object in the database.
* Timeliness - the basic data should be available on an Internet-accessible server within days (or hours) of publication or submission.
* Integration - each data object in the database should be cross-referenced to representation of the same or related biological entities in other databases. Data services should provide capabilities for following these links from one database or data service to another.
Primary databases (consisting of data derived experimentally)
a.) Sequence databases
DNA / nucleotide databases

GenBank (Genetic Sequence Databank) is one of the fastest-growing repositories of known genetic sequences. It has a flat-file structure: each entry is an ASCII text file, readable by both humans and computers. GenBank is part of the International Nucleotide Sequence Database Collaboration, which consists of the DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank (NCBI); the three exchange data on a daily basis. In addition to sequence data, GenBank files contain information such as accession numbers and gene names, phylogenetic classification, and references to published literature. As of 2006, GenBank contained publicly available DNA sequences for more than 170,000 different organisms, obtained primarily through submissions of sequence data from individual laboratories and batch submissions from large-scale sequencing projects.
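
To show what a flat file readable by both humans and computers means in practice, here is a minimal sketch that pulls a few fields out of one GenBank-style entry. The entry below is invented for illustration (it is not a real GenBank record), the function name `parse_genbank_entry` is hypothetical, and the parser deliberately ignores multi-line fields, features, and references.

```python
SAMPLE_ENTRY = """\
LOCUS       DEMO0001                  21 bp    DNA     linear   SYN 01-JAN-2006
DEFINITION  Hypothetical demo sequence (not a real GenBank entry).
ACCESSION   DEMO0001
SOURCE      synthetic construct
  ORGANISM  synthetic construct
ORIGIN
        1 atggccattg taatgggccg c
//
"""


def parse_genbank_entry(text):
    """Extract a few fields from one GenBank-style flat-file entry."""
    entry = {"sequence": ""}
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):        # end-of-entry marker
            break
        if in_sequence:
            # Sequence lines carry a position number and blocks of bases;
            # keep only the letters.
            entry["sequence"] += "".join(ch for ch in line if ch.isalpha())
            continue
        parts = line.split(None, 1)      # keyword, then the rest of the line
        if not parts:
            continue
        keyword = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if keyword == "ORIGIN":          # the sequence starts on the next line
            in_sequence = True
        elif keyword in ("LOCUS", "DEFINITION", "ACCESSION", "ORGANISM"):
            entry[keyword.lower()] = value
    return entry


entry = parse_genbank_entry(SAMPLE_ENTRY)
print(entry["accession"], entry["organism"], len(entry["sequence"]))
```

For real work, an established parser such as Biopython's `Bio.SeqIO` module is a better choice than a hand-written one like this.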


The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences collected from the scientific literature and patent applications and submitted directly by researchers and sequencing groups. Data collection is done in collaboration with GenBank (USA) and the DNA Data Bank of Japan (DDBJ). As of June 1994, the database was doubling in size every 18 months and contained nearly 2 million bases from 182,615 sequence entries.

DDBJ (DNA Data Bank of Japan)

DDBJ was established in 1986 at the National Institute of Genetics (NIG). It was reorganized as the Center for Information Biology and DNA Data Bank of Japan (CIB/DDBJ) in 2001.

Protein databases

Swiss-Prot was established in 1986. It is maintained collaboratively by the EMBL Outstation (EBI) and the Swiss Institute of Bioinformatics (SIB). It is a protein sequence database that provides a high level of integration with other databases and has a very low level of redundancy (that is, few identical sequences are present in the database).

TrEMBL (Translation of EMBL Nucleotide Sequence Databases)

It was created in 1996 as a supplement to Swiss-Prot. It makes new sequences available as quickly as possible through computer-annotated entries derived from the translation of all coding sequences (CDS) in EMBL.
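
The core step behind such entries, translating a coding sequence into a protein sequence, can be sketched as follows. This is a simplified illustration that assumes the standard genetic code and a CDS that starts in frame; it is not the actual TrEMBL pipeline. The function name `translate_cds` and the example sequence are made up.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position
# ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a DNA coding sequence, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # "X" = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate_cds("ATGGCCATTGTAATGGGCCGCTGA"))  # prints MAIVMGR
```

Some organisms and organelles use other genetic codes, so a real translation step has to pick the right table for each entry.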

PIR (Protein Information Resource)
PIR was established in 1984 by the National Biomedical Research Foundation (NBRF) and has been maintained by PIR-International since 1988. It is partitioned into four sections that differ in classification, annotation, redundancy, and cross-referencing to other biological databases.
b.) Structure databases

PDB (Protein Data Bank)
The single worldwide repository for the processing and distribution of 3-D biological macromolecular structure data.

NDB (Nucleic Acid Database)
The Nucleic Acid Database Project (NDB) assembles and distributes structural information about nucleic acids. The data available consist of coordinates, experimental details used to determine the structures, and derived information about the geometry of the structures.

CCDC / CSD (Cambridge Crystallographic Data Centre / Cambridge Structural Database)
A computerised database containing comprehensive data for organic and metal-organic compounds studied by X-ray and neutron diffraction.

Secondary databases (derived information)

A secondary database contains information derived from a primary database, such as conserved sequences, signature sequences, and active-site residues of protein families, obtained by multiple sequence alignment of a set of related proteins. A secondary structure database contains the entries of the PDB in an organized form (for instance, all PDB entries classified by structural elements such as alpha-helices or β-sheets) along with information on conserved secondary structure motifs of a particular protein.

ProSite (Database of Protein Families and Domains)
It contains patterns and profiles specific for more than a thousand protein families or domains and also background information on the structure and function of these proteins.
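
A ProSite pattern can be turned into a regular expression and scanned against a protein sequence. The sketch below handles only the basic pattern syntax (`x`, `[...]`, `{...}`, repeat counts in parentheses, and the `<` / `>` anchors) and uses the well-known N-glycosylation site pattern `N-{P}-[ST]-{P}` as its example. The function name `prosite_to_regex` and the test sequence are made up for illustration.

```python
import re

# Character-by-character translation from ProSite syntax to regex syntax.
_PROSITE_TO_REGEX = {
    "-": "",    # element separator
    "x": ".",   # any amino acid
    "{": "[^",  # {P} means any amino acid except P
    "}": "]",
    "(": "{",   # x(2,4) is a repeat count, written .{2,4} in a regex
    ")": "}",
    "<": "^",   # pattern anchored at the N-terminus
    ">": "$",   # pattern anchored at the C-terminus
}


def prosite_to_regex(pattern):
    """Convert a basic ProSite pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")   # patterns are written with a final period
    return "".join(_PROSITE_TO_REGEX.get(ch, ch) for ch in pattern)


regex = prosite_to_regex("N-{P}-[ST]-{P}.")
sequence = "MKNATSPNGSA"           # invented sequence, for illustration only
hits = [(m.start(), m.group()) for m in re.finditer(regex, sequence)]
print(regex, hits)
```

Note that `re.finditer` reports non-overlapping matches only, and that profiles (as opposed to patterns) need a scoring matrix rather than a regular expression.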

Pfam (Protein Families Database of Alignment and HMMs)
A large collection of multiple sequence alignments and hidden Markov models covering many protein domains and families. As of 2006, Pfam contained over 6,000 protein families and domains.

Enzyme (Enzyme Nomenclature Database)
A repository of information on the nomenclature of enzymes, based primarily on the recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB).

REBASE (Restriction Enzyme Database)
A collection of information about restriction enzymes and related proteins. As of 2006, REBASE stored over 4,000 enzymes and over 7,000 references.

Genome-related Information
OMIM (Online Mendelian Inheritance in Man)
OMIM is a catalog of human genes and genetic disorders. It includes information on genetic variation in humans and also contains textual information, pictures, and reference information. It is incorporated into NCBI's Entrez system and can be queried using the same approach as other Entrez databases such as PubMed and GenBank.

TransFac (Transcription Factor Database)
A database of eukaryotic cis-acting regulatory DNA elements and trans-acting factors. It covers the whole range from yeast to human.

Structure-related Information
HSSP (Homology-derived Secondary Structure of Proteins)
A database of homology-derived secondary structure of proteins, built by aligning to each protein of known structure all sequences deemed homologous on the basis of a threshold curve. For each known protein structure, the derived database contains the aligned sequences, secondary structure, sequence variability, and sequence profile. Tertiary structures of the aligned sequences are implied, but not modelled explicitly.

FSSP (Fold classification based on Structure-Structure alignment of Proteins)
Based on an exhaustive all-against-all 3D structure comparison of the protein structures currently in the Protein Data Bank (PDB).

Pathway Information

KEGG (Kyoto Encyclopedia of Genes and Genomes)
It is a suite of databases and associated software that integrates knowledge on molecular interaction networks in biological processes, information about the universe of genes and proteins, and information about the universe of chemical compounds and reactions. It serves as a bioinformatics resource for understanding the higher-order functional meanings and utilities of the cell or the organism from its genome information.

Composite databases

A composite database joins a variety of different primary database sources, which obviates the need to search multiple resources.
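
As a rough sketch of the idea, the code below merges records from two primary sources into a single index so that one lookup replaces two searches. The source names, accessions, and sequences are invented, the function name `build_composite` is hypothetical, and storing identical sequences only once is an assumption about how such a database might be organized, not a description of any particular resource.

```python
def build_composite(sources):
    """Merge several {accession: sequence} sources into one searchable index.

    Identical sequences are stored once, together with every
    (source name, accession) pair under which they were found.
    """
    composite = {}
    for source_name, records in sources.items():
        for accession, sequence in records.items():
            composite.setdefault(sequence, []).append((source_name, accession))
    return composite


# Two made-up primary sources that share one sequence.
sources = {
    "source_a": {"A1": "MKVLA", "A2": "MTTQP"},
    "source_b": {"B7": "MKVLA"},
}

composite = build_composite(sources)
print(composite["MKVLA"])  # every source entry holding this sequence
```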

For more database listings, see:

  • The Molecular Biology Database Collection: 2006 update. Nucleic Acids Research, 2006, Vol. 34, Database issue, D3-D5

Impact of the Human Genome Project

The Human Genome Project (HGP) is a project to map and sequence the 3 billion nucleotides contained in the human genome and to identify all the genes present in it. Two draft sequences of the human genome were generated, one by the HGP and one by Celera Genomics. The HGP used a hierarchical mapping and sequencing approach, which involved generating a series of overlapping clones that cover the entire genome and shotgun sequencing each clone. The genome sequence was reconstructed by assembling the fragments on the basis of sequence overlap, together with mapping and chromosomal position information on the clones. Celera Genomics used a whole-genome shotgun sequencing approach, without generating a series of overlapping clones, but also incorporated HGP information where available.
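
Assembling fragments on the basis of sequence overlap can be illustrated with a toy example. The greedy strategy below is a stand-in chosen for brevity, not the method used by the HGP or by Celera Genomics; real assemblers also have to cope with sequencing errors, repeats, both DNA strands, and mapping information. The function names and the fragments are made up.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more; the contigs stay separate
        merged = frags[best_i] + frags[best_j][best_size:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Three overlapping "reads" taken from one short invented sequence.
reads = ["ATGGCCATTG", "CATTGTAATG", "TAATGGGCCGC"]
contigs = greedy_assemble(reads)
print(contigs)  # prints ['ATGGCCATTGTAATGGGCCGC']
```

In the hierarchical approach this kind of assembly is done clone by clone, and the known positions of the clones are then used to order the results along each chromosome.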

Project goals
# Identify all the approximately 20,000-25,000 genes in human DNA,
# Determine the sequences of the 3 billion chemical base pairs that make up human DNA,
# Store this information in databases,
# Improve tools for data analysis,
# Transfer related technologies to the private sector, and
# Address the ethical, legal, and social issues (ELSI) that may arise from the project.

Facts after sequencing the human genome

The present (assembly number 35, May 2004) human DNA sequence contains ~3,100,000,000 bp (depending on the actual source of the assembled DNA sequence), covers most of the nonheterochromatic portions of the genome, and contains some 250 gaps.

We have ~20,000-25,000 genes (International Human Genome Sequencing Consortium 2004), somewhat fewer than estimates based on the preliminary reports of the human sequence (International Human Genome Sequencing Consortium 2001; Venter et al. 2001).

The sequence revealed the full extent to which human DNA is comprised of abundant interspersed repeats, extending and completing what was already known; fully 45% of our DNA consists of repetitive elements interspersed within nonrepetitive sequences. Interestingly, the extent and diversity of gene repetitions contained in low copy number repeats were greater than expected; very extensive duplications of regions of DNA both within and between chromosomes were identified by the International Human Genome Sequencing Consortium (2001) and Venter et al. (2001).

Challenges to bioinformatics research

The first challenge to bioinformatics research relates to the analysis of data posted on the Web in advance of publication without violating ethical standards.

The second challenge to bioinformatics research derives not from restrictions on data access but from restrictions on downstream use, such as incorporation into new or existing databases.

Download the free PDF booklet: Bioinformatics and the Human Genome Project

The human genome: Future research

Genomics to biology

# Comprehensively identify the structural and functional components encoded in the human genome
# Elucidate the organization of genetic networks and protein pathways and establish how they contribute to cellular and organismal phenotypes
# Develop a detailed understanding of the heritable variation in the human genome
# Understand evolutionary variation across species and the mechanisms underlying it
# Develop policy options that facilitate the widespread use of genome information in both research and clinical settings

Genomics to health

# Translating genome-based knowledge into health benefits
# Develop robust strategies for identifying the genetic contributions to disease and drug response
# Develop strategies to identify gene variants that contribute to good health and resistance to disease
# Develop genome-based approaches to prediction of disease susceptibility and drug response, early detection of illness, and molecular taxonomy of disease states
# Use new understanding of genes and pathways to develop powerful new therapeutic approaches to disease
# Investigate how genetic risk information is conveyed in clinical settings, how that information influences health strategies and behaviours, and how these affect health outcomes and costs
# Develop genome-based tools that improve the health of all

Genomics to society

# Promoting the use of genomics to maximize benefits and minimize harms
# Develop policy options for the uses of genomics in medical and non-medical settings
# Understand the relationships between genomics, race and ethnicity, and the consequences of uncovering these relationships
# Understand the consequences of uncovering the genomic contributions to human traits and behaviours
# Assess how to define the ethical boundaries for uses of genomics

What next?

HapMap Project

The International HapMap Project is a multi-country effort to identify and catalog genetic similarities and differences in human beings. Using the information in the HapMap, researchers will be able to find genes that affect health, disease, and individual responses to medications and environmental factors.

The Project compares the genetic sequences of different individuals to identify chromosomal regions where genetic variants are shared. By making this information freely available, the Project will help biomedical researchers find genes involved in disease and responses to therapeutic drugs.

ENCODE project

ENCODE, the Encyclopedia Of DNA Elements, was launched in September 2003 to carry out a project to identify all functional elements in the human genome sequence. The project is being conducted in three phases: a pilot project phase, a technology development phase, and a planned production phase.

Archon X PRIZE for Genomics - Create technology that can successfully map 100 human genomes in 10 days and win $10 million.

On October 4, 2006, the X PRIZE Foundation announced the launch of its second prize, the Archon X PRIZE for Genomics. The $10 million cash prize has been created to revolutionize the medical world. The Archon X PRIZE for Genomics challenges scientists and engineers to create better, cheaper, and faster ways to sequence genomes. The knowledge gained by compiling and comparing a library of human genomes will create a new era of preventive and personalized medicine, and transform medical care from reactive to proactive.

The Competition Guidelines

The purpose of this X PRIZE competition is to develop radically new technology that will dramatically reduce the time and cost of sequencing genomes, and accelerate a new era of predictive and personalized medicine. The X PRIZE Foundation aims to enable the development of low-cost diagnostic sequencing of human genomes.

The preliminary guidelines for the competition have been written with this intent and will be further developed and interpreted by the X PRIZE Foundation towards this end.

The $10 million X PRIZE for Genomics prize purse will be awarded to the first Team that can build a device and use it to sequence 100 human genomes within 10 days or less, with an accuracy of no more than one error in every 10,000 bases sequenced, with sequences accurately covering at least 98% of the genome, and at a recurring cost of no more than $10,000 per genome.

If more than one Team attempts the competition at the same time, and more than one Team fulfills all the criteria, then Teams will be ranked according to the time of completion. No more than three teams will be ranked and will share the purse in the following manner: $7.5 million to the winner and $2.5 million to the second place team if two teams are successful, or $7 million, $2 million and $1 million if three teams are successful.
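
The award criteria and the purse split described above are simple enough to write down as code. This is only a restatement of the numbers in the preliminary guidelines, not an official scoring tool; the function names are made up, and the 3-billion-base genome size is the round figure used earlier in this post.

```python
def meets_criteria(genomes, days, error_rate, coverage, cost_per_genome):
    """Check one attempt against the thresholds in the guidelines."""
    return (
        genomes >= 100
        and days <= 10
        and error_rate <= 1 / 10_000      # at most one error per 10,000 bases
        and coverage >= 0.98              # at least 98% of the genome
        and cost_per_genome <= 10_000     # dollars, recurring cost
    )


def split_purse(successful_teams):
    """Purse shares in millions of dollars, ordered by time of completion."""
    if successful_teams <= 0:
        return []
    shares = {1: [10.0], 2: [7.5, 2.5], 3: [7.0, 2.0, 1.0]}
    return shares[min(successful_teams, 3)]   # no more than three teams rank


# One error per 10,000 bases over roughly 3 billion bases.
max_errors_per_genome = 3_000_000_000 // 10_000

print(meets_criteria(100, 9.5, 1e-5, 0.985, 9_000))
print(split_purse(3), max_errors_per_genome)
```

Even a winning entry could therefore still contain up to about 300,000 errors per genome, which shows how demanding diagnostic-quality sequencing is.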

Actual competition events will take place twice a year with all eligible teams given the opportunity to make an attempt, starting at precisely the same time as the other teams.

For more information, please see:
  • A Vision for the Future of Genomics Research: A blueprint for the genomic era. Francis S. Collins, Eric D. Green, Alan E. Guttmacher, Mark S. Guyer. Nature, Apr 24 2003: 835
  • Bioinformatics--Trying to swim in a sea of data. David S. Roos
  • Computational comparison of two draft sequences of the human genome. John Aach, et al.
  • The Human Genome Project: Lessons from Large-Scale Biology. Francis S. Collins, Michael Morgan, Aristides Patrinos. Science, Apr 11 2003: 286
