SureshKumar's Bioinformatics Blog

I am Suresh Kumar Sampathrajan. I completed my PhD in bioinformatics at the University of Vienna, Austria, in 2010. If you want to know more about me and my research, please use the menus at the top.

I started this bioinformatics blog mainly for undergraduate and postgraduate students of bioinformatics. It will serve as an open resource for students and for anyone who wishes to learn about bioinformatics. The blog contains video tutorials, tips, bioinformatics software downloads, articles on bioinformatics and career opportunities.

Downloading human chromosome sequences

Sequencing of a genome often starts with a random shotgun sequencing strategy or with direct sequencing of genomic DNA. The DNA sequences of the clones or sequenced genome fragments often overlap, yielding longer assembled DNA sequences (contigs).

Genome Assembly

The genomic sequences are assembled into a series of genomic sequence contigs. These are then ordered, oriented with respect to each other, and placed along each chromosome with appropriately sized gaps inserted between adjacent contigs. The resulting genome assembly thus consists of a set of genomic sequence contigs and a specification for how to arrange the sequence contigs along each chromosome.

Finished Chromosomes

A chromosome sequence is considered finished when any gaps that remain cannot be closed using current cloning and sequencing technology. In practice, therefore, the sequence for a finished chromosome usually consists of a small number of genomic sequence contigs.

Unfinished Chromosomes

Genomic sequence contigs for unfinished chromosomes are assembled and laid out based largely on the clone tiling path. However, the tiling paths do not specify the orientation of the clone sequences or how they should be joined; therefore, data on the alignment of the input genomic sequences to each other and to other sequences are also used to guide the assembly. Genomic sequences that augment the initial set of genomic contigs based on the tiling path clones are also incorporated.

To download complete human chromosome sequences:

Each chromosome can be downloaded as a whole sequence in FASTA format from the NCBI FTP site, which maintains a section called assembled chromosomes. To download a chromosome sequence, click the file whose name starts with hs_ref.
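
Once a chromosome file has been downloaded, a short script can confirm that it was read as one whole sequence. The following is a minimal sketch of a FASTA reader in plain Python; the function name read_fasta and the toy record are illustrative assumptions, not part of any NCBI tool.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Toy input standing in for a downloaded chromosome file (hypothetical content).
example = io.StringIO(">chrToy example record\nACGTNNAC\nGGT\n")
records = list(read_fasta(example))
for header, seq in records:
    # Report the header, total length and number of unresolved bases (N).
    print(header, len(seq), seq.count("N"))
```

For a real file, the io.StringIO object would be replaced by open("your_file.fa"), where the file name is whatever you saved the download as.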

Manual Annotation:
The Vega site, maintained by the Sanger Institute, presents data from the manual annotation of the human genome.

High-quality annotated human chromosome sequences
To download all annotated human contigs in one FASTA sequence

Identification of genes

Genes are found using three complementary approaches: (a) known genes are placed primarily by aligning mRNAs to the assembled genomic contigs; (b) additional genes are located based on alignment of ESTs to the assembled genomic contigs; and (c) previously unknown genes are predicted using hints provided by protein homologies.

Identifying Paralogs and Orthologs via COGs and KOGs databases

Orthologs and paralogs are defined as follows:

Orthologs: similar sequences or genes in different species that arose through speciation and mutation and not from gene duplication.

Paralogs: related genes (or proteins) in the same genome that have arisen by gene duplication.

COG and KOG databases:

The COG (Clusters of Orthologous Groups) and KOG (euKaryotic Orthologous Groups) databases have been constructed using a careful analysis of BLAST hits.

First, low-complexity sequence regions and commonly occurring domains are masked to prevent spurious hits and also to improve the statistical score analysis (E-values).

All gene sequences from one genome are then scanned against all from another genome, noting the best-scoring BLAST hits for each gene, and this is repeated for all possible pairs.

Paralogous genes within a genome that result from gene duplication since divergence of two species are identified as those that give a better-scoring BLAST hit with each other than their BLAST hits with the other genome.

Orthologous genes are found as groups of genes from different genomes that are reciprocal best BLAST hits of each other.
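
The reciprocal-hit idea can be sketched in a few lines. This is a simplified two-genome illustration only: it assumes the best-scoring hit of every gene has already been extracted from BLAST output into two dictionaries (hypothetical gene names below), and it leaves out the masking, paralog handling and multi-genome grouping that real COG/KOG construction involves.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return gene pairs that are each other's best hit across two genomes."""
    pairs = []
    for gene_a, gene_b in best_a_to_b.items():
        # Keep the pair only if the best hit of gene_b points back to gene_a.
        if best_b_to_a.get(gene_b) == gene_a:
            pairs.append((gene_a, gene_b))
    return sorted(pairs)

# Hypothetical best hits: genome A genes -> genome B genes, and the reverse.
best_a_to_b = {"a1": "b1", "a2": "b2", "a3": "b2"}
best_b_to_a = {"b1": "a1", "b2": "a2", "b3": "a3"}

rbh = reciprocal_best_hits(best_a_to_b, best_b_to_a)
print(rbh)
```

Here a3 is not reported because its best hit, b2, has a2 as its own best hit.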

All sequences in a COG or a KOG are assumed to have a related function, and thus the method can be used to predict gene and protein function.

Antibody modeling

Antibodies are the soluble proteins of the immune system that bind to pathogens and their toxic products, preventing their harmful action.

Antibody, or immunoglobulin, molecules are roughly Y-shaped and consist of four protein chains linked by disulfide bonds.

Each antibody molecule has two identical antigen-binding sites, which are the parts that vary from antibody to antibody. The rest of the molecule is almost identical, in both sequence and structure, in all antibodies of a given class.

Each antibody molecule consists of four chains of two different types: two identical heavy chains, and two identical, smaller, light chains. Each chain is composed of a series of discrete domains of around 110 amino acids with a characteristic structure known as the immunoglobulin fold.

The structures of only a few hundred antibodies have been solved by X-ray crystallography, and there are even fewer structures of antibody-antigen complexes. There are, however, many more antibody sequences in the databases, and homology modeling can be a useful tool in extending the number of structures of antigen-binding sites. Being able to model the structures of antigen-binding sites can help in guiding the synthesis of novel antibody variable regions for potential therapeutics and laboratory reagents.

Current antibody modelling is generally done through the homology approach.

WAM (Web Antibody Modelling) uses an improved version of the algorithm used by Oxford Molecular's AbM. A sequence can be submitted through two methods:

*Manual method: This requires that you manually line up your sequence with the known antibody structures by inserting deletions in the CDRs.

*Autoalign method: attempts to automatically add deletions to the CDRs based on the positions of certain key residues; it will work provided certain sequence conditions are met.

Sequence Manipulation Suite for sequence analysis

The Sequence Manipulation Suite (SMS) is a collection of JavaScript programs for generating, formatting, and analyzing short DNA and protein sequences.
The SMS is divided into sections such as:
1. Format conversion - for combining sequences and converting from one format to another
2. Sequence analysis - a range of programmes for analysing sequences by pairwise alignment, translating, finding patterns etc.
3. Sequence figures - programmes for grouping proteins, colouring sequences etc.
4. Random sequences - programmes for mutating sequences, shuffling etc.
5. Miscellaneous - references for IUPAC codes and genetic codes.

It can be mirrored on your website or it can be run locally on your computer.

  • To mirror the Sequence Manipulation Suite: extract sms2.tar.gz and place the resulting sms2 directory into a directory from which your server will serve HTML files.
  • To run the Sequence Manipulation Suite locally: extract sms2.tar.gz and load index.html from the resulting sms2 directory into your web browser.

Modelling a protein through automatic comparative modelling servers

It is possible to model a protein if the alignment between the target and a template protein has more than 50% identical residues. The modelling process has the following steps:

>> Choosing suitable template sequences
>> Aligning template and target sequences
>> Building the backbone
>> Constructing loops and generating side chains

Finding a suitable template is usually done by performing a BLAST search against the sequences in the PDB, the repository for protein structures obtained by X-ray crystallography and NMR. All sequences in the PDB with a BLAST E-value below a certain threshold are considered candidate templates. The alignment between template and target sequences should contain at least 30% identities and, most importantly, conserved regions. A multiple sequence alignment with members of the same family as the template, or with several template sequences, may also be constructed. This step requires some manual correction of the alignment in order to obtain a reliable model. Several methods are used for loop building and side-chain reconstruction.

When no suitable template is available for comparative modeling, de novo modeling methods (also called ab initio modeling) may be used. The success rate with such modeling is considerably lower than that with comparative modeling. The accuracy of de novo models is too low for problems requiring high-resolution structure information.

Automatic Comparative modelling servers

Ready made Perl scripts to manipulate biological data

The Scriptome is a minimal-learning toolbox for manipulating biological data, aimed especially at biologists.
There are currently six tool categories: Calculate, Change, Choose, Fetch, Merge, and Sort. Popular tools include:
- Choosing lines where the value in a given column exceeds a certain threshold. This can be especially useful with files larger than the 65,536-row limit of older Excel versions.
- Merging files together based on shared values in certain columns. This tool essentially performs a SQL join.
- Changing FASTA files to a tabular format. The output can be viewed in Excel, or filtered with other Scriptome tools.

Each tool is a short Perl script embedded in a one-line shell command. These tools can be cut and pasted from the website onto the command line. The tools' simplicity makes it possible to develop tools rapidly, to keep up with biologists' changing needs.
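
To give a feel for what such a one-purpose tool does, here is a minimal Python sketch of the "choose lines where a column exceeds a threshold" idea. It is not the Scriptome's own Perl code; the function name choose_rows and the toy table are assumptions made for illustration.

```python
def choose_rows(lines, column, threshold, sep="\t"):
    """Keep lines whose numeric value in the given 0-based column exceeds threshold."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        try:
            value = float(fields[column])
        except (IndexError, ValueError):
            continue  # skip header lines and rows without a number in that column
        if value > threshold:
            kept.append(line.rstrip("\n"))
    return kept

# Toy tab-separated table (hypothetical gene names and scores).
table = ["gene\tscore", "g1\t0.5", "g2\t12.0", "g3\t7.25"]
selected = choose_rows(table, column=1, threshold=5)
print(selected)
```

Because the function only needs an iterable of lines, an open file handle can be passed in place of the list, so the size of the file is not limited by a spreadsheet program.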

The Scriptome requires no installation at all on UNIX and Macintosh, since Perl is standard on those platforms. Windows users need only a one-click installation of Perl from ActiveState. A few tools also require Bioperl.

Virtual Screening

Virtual screening is a technology that is gaining increasing use in drug discovery. It is seen as a complementary approach to experimental screening. There are many tools available for performing these computational analyses, and broadly speaking they can be categorized as being either ligand-based or receptor-based.

For ligand-based methods, the strategy is to use information provided by a compound or set of compounds that are known to bind to the desired target and to use this to identify other compounds in the corporate database or external databases with similar properties. This can be done by a variety of methods, including similarity and substructure searching, pharmacophore matching or 3D shape matching.
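
As a rough illustration of similarity searching, the sketch below ranks library compounds by the Tanimoto coefficient between fingerprints. It assumes each compound has already been reduced to a set of "on" fingerprint bit positions; the bit sets and compound names are made up, and a real workflow would compute fingerprints with a cheminformatics toolkit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)

def rank_by_similarity(query_fp, library):
    """Return (name, similarity) pairs sorted from most to least similar."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical fingerprints: a known binder and three database compounds.
query = {1, 4, 7, 9}
library = {"cmpd_A": {1, 4, 7, 9}, "cmpd_B": {1, 4, 8}, "cmpd_C": {2, 3}}

ranking = rank_by_similarity(query, library)
for name, score in ranking:
    print(name, round(score, 2))
```

Compounds at the top of the ranking would be the first candidates for experimental testing.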

Receptor-based methods involve explicit molecular docking of each ligand into the binding site of the target, producing a predicted binding mode for each database compound, together with a measure of the quality of the fit of the compound in the target binding site. This information is then used to rank the compounds with a view to selecting and experimentally testing a small subset for biological activity.

Computational tools for Glycomics Studies

Sugars are involved in almost every aspect of biology, from recognising pathogens to blood clotting. The glycome's basic building blocks are far more numerous and varied than the four letters of the DNA alphabet or the score of amino acids that make proteins. In the late 1980s, researchers isolated the first gene for a glycosyl transferase, an enzyme that adds sugars to fats and proteins. The discovery gave scientists the first opportunity to study this process, which is usually called glycosylation, by manipulating the activity of such enzymes.

Fig: Carbohydrate only (no protein) - PDB id:2HYA

Glycomics, or glycobiology, is a discipline of biology that deals with the structure and function of oligosaccharides (chains of sugars). The entirety of carbohydrates in an organism is collectively referred to as the glycome. The progressing glycomics projects will dramatically accelerate the understanding of the roles of carbohydrates in cell communication and hopefully lead to novel therapeutic approaches for the treatment of human disease.

The Functional Glycomics Gateway is a comprehensive and free online resource that is the result of a collaboration between the Consortium for Functional Glycomics (CFG) and Nature Publishing Group. It is aimed at keeping you abreast of developments in the emerging field of functional glycomics.

Annotated and cross-referenced carbohydrate-related data collections allow us to find important data for compounds of interest in a compact and well-structured representation.

Many PDB files contain carbohydrate structures. Since there is no standard nomenclature like the one that exists for amino acids, it is difficult to find the carbohydrate information. Sometimes entire oligosaccharides are encoded in one single residue. Information about carbohydrate linkages is often missing, and if it is present, it is not in a unique format and is therefore also difficult to find. pdb2linucs automatically extracts carbohydrate information from PDB files.

GlycoSuite comprises GlycoSuiteDB, the leading curated and annotated glycan database, and new bioinformatic tools which interface mass spectrometric data with the database.

A Complex Carbohydrate Structure Database, also known as CarbBank, is available, but due to lack of funding it is no longer updated.

Protein druggability prediction

The availability of structural data, especially of proteins complexed to small molecule ligands, has enabled numerous analyses that attempt to understand and predict the forces that govern molecular recognition and ligand binding. Successful drug development requires a disease target that plays a vital role in the causation and progression of the disease phenotype and that can be modulated with a drug molecule.

One approach for evaluating protein druggability is to analyze the genome on the basis of sequence homology to known therapeutic targets.

Another approach relies on the 3D structure of the protein target: protein surfaces are systematically analysed in search of binding pockets that have high potential to bind small drug-like compounds with high affinity.

Algorithms that identify all possible binding sites on a protein surface are broadly classified into

(i) Geometry-based

Tools based on geometry-based algorithms

Binding site prediction for malate dehydrogenase (PDB: 2cmd).

(ii) Energy-based

Tools based on energy-based algorithms

  • GRID
  • vdW-FT
  • Drugsite

Useful online resources for cancer target finding and analysis

Collection of cancer genes based on mutation data

Repository of microarray data from cancer genomics publication

Repository of cytogenetic abnormalities in human cancer

Validated SNPs in cancer genes

BioMed - Bioinformatics search engine

PubMed is the free public interface to MEDLINE. It provides access to bibliographic information in MEDLINE as well as additional life science journals.
Searching for articles on PubMed requires some skill; moreover, it does not support some of the search strings that popular search engines (Google and Yahoo) do.

Even though we can search through Google Scholar, it gives many false-positive results. However, Google lets us customise our own search engine through Google Co-op, so we can create a search engine for a topic of interest.

I have created a search engine for bioinformatics through Google Co-op. It searches only bioinformatics-related journals and gives fewer false-positive results. The BioMed search engine is still a beta version and requires further improvement.

You can access the BioMed - Search engine for bioinformatics by clicking here.

Guide to use of computational tools for finding transcriptional factor binding sites

Transcription factors are proteins that bind to DNA, typically upstream from and close to the transcription start site of a gene, and regulate the expression of the gene by activating or inhibiting the transcription machinery.

Transcription factors contain several functional regions:

* Activation domain: region that interacts with other parts of the transcription machinery (RNA polymerase or other transcription factors).

* DNA binding domain: amino acids in the protein that recognize specific bases near the start of transcription.

* Nuclear localization domain: region that serves as a signal for the protein to go to the nucleus after being synthesized in the cytoplasm.

* Dimerization domain: Many transcription factors work as dimers (two subunits). For these proteins, a region of the protein facilitates interaction with another subunit.

The figure shows several transcription factors (JUN, FOS, Sp1, and basal factors) that are necessary for transcription of some genes

Computational approaches to this problem have come in two flavors. One class of methods looks for overrepresented motifs in sequences that are believed to contain several binding sites for the same factor (such as promoters of co-regulated genes). The second class of methods identifies motifs that are significantly conserved in orthologous sequences, e.g., promoters of the same gene in different species. Yet the prediction of such regulatory elements remains a computationally challenging task.
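
The first class of methods can be illustrated with a very small sketch that counts in how many promoter sequences each k-mer occurs. This is only the counting step; real motif finders add background models, degenerate positions and significance statistics. The promoter strings below are made-up toy data, and the function name kmer_presence is an assumption.

```python
from collections import Counter

def kmer_presence(sequences, k):
    """Count in how many sequences each k-mer occurs at least once."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # A set ensures each k-mer is counted once per sequence.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(seen)
    return counts

# Toy "co-regulated promoters"; three of them share the 7-mer TGACTCA.
promoters = ["ttTGACTCAgg", "cTGACTCAtta", "ggcatTGACTCA", "acgtacgtacg"]

counts = kmer_presence(promoters, 7)
shared = [kmer for kmer, n in counts.items() if n >= 3]
print(shared)
```

A k-mer present in most of the input promoters is only a candidate; whether it is truly overrepresented depends on how often it would be expected by chance.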

Even though numerous tools are available for this task, they should be used with caution. According to the assessment cited below, how well each tool performs depends on the dataset.

Transcriptional factor databases

Transcription factors database

Eukaryotic transcription factors database

TRANSFAC - contains data on transcription factors, their experimentally proven binding sites, and regulated genes. Its broad compilation of binding sites allows the derivation of positional weight matrices.

Plant transcription factor database

Database of motifs found in plant cis-acting regulatory DNA elements, all from previously published reports. It covers vascular plants only.

PlantProm DB
Database with annotated, non-redundant collection of proximal promoter sequences for RNA polymerase II with experimentally determined transcription start site(s), TSS, from various plant species.

Database of plant cis-acting regulatory elements and a portal to tools for in silico analysis of promoter sequences.

DoOP: Databases of Orthologous Promoters
A database containing orthologous clusters of promoters from Homo sapiens, Arabidopsis thaliana and other organisms.

DATF: Database of Arabidopsis Transcription Factors

The Database of Arabidopsis Transcription Factors (DATF) contains known and predicted Arabidopsis transcription factors with sequences and many other features including 3D structure templates, EST expression information, transcription factor binding sites and Nuclear Location Signals.

The Arabidopsis thaliana promoter binding element database, an aid to find binding elements and check data against the primary literature.

A genome-wide map of putative transcription factor binding sites in Arabidopsis thaliana.

contains two databases, AtcisDB (Arabidopsis thaliana cis-regulatory database) and AtTFDB (Arabidopsis thaliana transcription factor database).

Prediction tools

Weeder - for all eukaryotic datasets

oligo/dyad analysis & ANN-Spec - for human datasets

SesiMCMC - performs better for fly datasets

MEME3 & YMF - perform better for mouse datasets

Motif sampler - performs better for real experimental datasets

PhyME - Good for comparative sequence analysis (also known as phylogenetic footprinting)

It is advised to use a few complementary tools in combination rather than relying on a single one.

Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005 Jan;23(1):137-44.

Database for prediction of entire proteomes

Large-scale genome sequencing has provided us with the building blocks of living organisms. However, to obtain new insights into physiological and biochemical processes, it is essential to analyse and catalogue the structural and functional features of each individual protein in the genome. Such predictions for entire proteomes suggest conclusions in context of comparative genomics and provide crucial information in the context of structural genomics.

PEP is a database of Predictions for Entire Proteomes. The database contains summaries of analyses of protein sequences from a range of organisms representing all three major kingdoms of life: eukaryotes, prokaryotes and archaea. The database contains analyses of structural and functional features, including:

• coiled-coil regions predicted by COILS
• 3-state secondary structure predicted by PROFsec
• percentage relative solvent accessibility predicted by PROFacc
• transmembrane helices assigned by PHDhtm
• low sequence complexity regions according to SEG
• long stretches of non-regular secondary structure (NORS)
• presence and location of signal peptide cleavage sites identified by SignalP
• PROSITE motifs
• nuclear localization signals
• cellular functional classes assigned by EUCLID

The PEP database can be accessed through SRS, PSI-BLAST and BlastP interfaces. It can also be downloaded as flat files.

Computational protein kinase substrate identification

Post-translational modification by phosphorylation is the most abundant type of cellular regulation, affecting essentially every cellular process including metabolism, growth, differentiation, motility, membrane transport, learning and memory. Defects in protein kinase function result in a variety of diseases and kinases are major targets for drug design.

The identification of protein kinase substrates requires understanding the peptide specificity of protein kinases. Understanding phosphorylation specificity will therefore contribute to understanding the roles of protein kinases in health and disease, and help identify new therapeutic targets and strategies of protein kinase inhibition and anti-kinase drug development.

In eukaryotes, protein kinases phosphorylate mainly Ser or Thr residues (protein Ser/Thr kinases) or Tyr residues (protein Tyr kinases), although phosphorylation of His residues, as well as other amino acids, also occurs.

The three-dimensional structures are known for a number of protein kinases, some with bound substrates and nucleotides. The characteristic fold consists of a smaller N-terminal “lobe”, comprising a five-stranded β-sheet and one or two α-helices, and a larger C-terminal lobe that usually contains six major α-helices and two small β-sheets (as shown in the figure below).

The peptide substrate is held in the groove between the two lobes. The phosphate group is extracted from an ATP molecule located close to the substrate towards the small lobe. A conserved Asp residue is essential for catalysis.


1. The Phospho.ELM database contains a collection of experimentally verified serine, threonine and tyrosine phosphorylation sites in eukaryotic proteins. The entries, manually annotated and based on scientific literature, provide information about the phosphorylated proteins and the exact position of known phosphorylated instances.

2. General databases on post-translational modifications

3. The RESID Database of Protein Modifications is a comprehensive collection of annotations and structures for protein modifications including amino-terminal, carboxyl-terminal and peptide chain cross-link post-translational modifications.

Prediction tools:

1. ELM is a resource for predicting functional sites in eukaryotic proteins.

2. Identification of phosphorylation sites
The NetPhos 2.0 server produces neural network predictions for serine, threonine and tyrosine phosphorylation sites in eukaryotic proteins.

3. Prediction of kinase-specific phosphorylation sites
NetPhosK produces neural network predictions of kinase-specific eukaryotic protein phosphorylation sites. It covers the following kinases: PKA, PKC, PKG, CKII, Cdc2, CaM-II, ATM, DNA PK, Cdk5, p38 MAPK, GSK3, CKI, PKB, RSK, INSR, EGFR and Src.

4. Scansite searches for motifs within proteins that are likely to be phosphorylated by specific protein kinases or bind to domains such as SH2 domains, 14-3-3 domains or PDZ domains.

5. PredPhospho predicts phosphorylation sites in protein sequences.

6. The AMS tool allows identification of PTM (post-translational modification) sites in proteins.

7. GPS - group-based phosphorylation predicting and scoring method
It covers a larger number of protein kinase families and has greater sensitivity and specificity than Scansite and PredPhospho.

8. A computer program that can be used to predict substrates for serine/threonine protein kinases.

9. Predikin can predict peptide specificities directly from the amino acid sequences and can therefore be used for most kinases, including hypothetical and uncharacterized ones.
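
The simplest form of motif-based site prediction can be sketched with a regular expression. The pattern below is a simplified textbook-style consensus for PKA-like sites (Arg-Arg-any residue-Ser/Thr) used here purely as an assumption for illustration; it is not the scoring scheme of any of the tools listed above, which use neural networks, matrices or structural information. The peptide is made up.

```python
import re

# Simplified PKA-like consensus R-R-x-[S/T]; an illustrative assumption,
# not the model used by NetPhosK, Scansite or any other tool named above.
# The lookahead lets overlapping motif occurrences be reported as well.
PKA_LIKE = re.compile(r"(?=(RR.[ST]))")

def find_candidate_sites(protein):
    """Return 1-based positions of the S/T residue in each motif match."""
    return [m.start() + 4 for m in PKA_LIKE.finditer(protein.upper())]

# Hypothetical peptide with two motif occurrences (RRAS and RRQT).
peptide = "MKRRASLLGRRQTAA"
sites = find_candidate_sites(peptide)
print(sites)
```

Plain pattern matching of this kind gives many false positives, which is one reason the dedicated prediction tools exist.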

Google Desktop tweak: searching description names in multiple genome files

Descriptions in FASTA files are a valuable source of details about particular sequences. For example:

>gi|30677876|ref|NP_849568.1| LHY (LATE ELONGATED HYPOCOTYL); DNA binding /transcription factor [Arabidopsis thaliana]

This header describes the protein as having DNA binding and transcription factor activity.
Suppose we have more than 100 files and want to search for a particular description; we can use Google Desktop search with a simple tweak. Of course, this can easily be done with a simple script. This is just to show the power of Google Desktop search.

1. I downloaded the genome sequences of Arabidopsis thaliana from the NCBI FTP site and renamed the file extension to .txt.

2. I placed the downloaded sequences in a folder named "arbi" inside My Documents.

3. Now download Google Desktop search.

4. Now go to the advanced search options in Google Desktop search and choose the following:
(i) In show results: choose files
(ii) In the file type: select text
(iii) In the location: My Documents\arbi
(iv) Has the words: phosphatidylinositol (or)

Type in search bar as: phosphatidylinositol filetype:txt under:"C:\Documents and Settings\..\My Documents\arbi" search desktop

The results now show the word "phosphatidylinositol" found in two files.
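
For comparison, the scripted route mentioned earlier might look like the following minimal Python sketch. The function name search_headers is an assumption, and the demo writes two small made-up files into a temporary folder so that it runs on its own; in practice the folder would be the one holding the downloaded sequence files.

```python
import glob
import os
import tempfile

def search_headers(folder, keyword, pattern="*.txt"):
    """Return (file name, header) pairs whose FASTA header contains keyword."""
    hits = []
    keyword = keyword.lower()
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                # Only description lines (starting with ">") are searched.
                if line.startswith(">") and keyword in line.lower():
                    hits.append((os.path.basename(path), line.strip()))
    return hits

# Self-contained demo with two small made-up files in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "chr1.txt"), "w", encoding="utf-8") as f:
        f.write(">p1 phosphatidylinositol kinase [toy]\nMKT\n")
    with open(os.path.join(folder, "chr2.txt"), "w", encoding="utf-8") as f:
        f.write(">p2 DNA binding protein [toy]\nMAA\n")
    hits = search_headers(folder, "phosphatidylinositol")

print(hits)
```

Unlike a desktop search, this looks only at header lines, so a keyword that happens to appear elsewhere in a file is not reported.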

Protein subcellular location prediction

One of the fundamental goals in cell biology and proteomics is to identify the functions of proteins in the context of compartments that organize them in the cellular environment. Knowledge of subcellular locations of proteins can provide key hints for revealing their functions and understanding how they interact with each other in cellular networking. Unfortunately, it is both time-consuming and expensive to determine the localization of an uncharacterized protein in a living cell purely based on experiments.

Location classification
According to their subcellular locations, proteins are classified into the following 12 discriminative groups: (1) chloroplast, (2) cytoplasm, (3) cytoskeleton, (4) endoplasmic reticulum, (5) extracellular, (6) Golgi apparatus, (7) lysosome, (8) mitochondria, (9) nucleus, (10) peroxisome, (11) plasma membrane and (12) vacuole.

Such a classification covers almost all the organelles in an animal or plant cell. With the rapid increase in new protein sequences entering into data banks, we are confronted with a challenge: is it possible to utilize a bioinformatic approach to help expedite the determination of protein subcellular locations? The enormous complexity of the protein sorting process, alternative means of transportation pathways, and lack of complete data for every organelle present great challenges to developers of prediction methods.

Categories of computational predictors
Computational methods for predicting protein subcellular localization can generally be divided into four categories: prediction methods based on
(i) the overall amino acid composition of the protein,
(ii) known targeting sequences,
(iii) sequence homology and/or motifs, and
(iv) a combination of several sources of information from the first three categories (hybrid methods).
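
The input to the first category of predictors is easy to compute. The sketch below turns a protein sequence into a 20-value amino acid composition vector; the function name and the toy peptide are assumptions, and the classifier that would map such a vector to a location is not shown.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aa_composition(protein):
    """Return the fraction of each standard amino acid in the sequence."""
    protein = protein.upper()
    # Non-standard symbols (X, B, Z, gaps) are ignored in the totals.
    counts = Counter(res for res in protein if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

# Made-up peptide: 1 M, 3 K, 2 L, 2 A.
composition = aa_composition("MKKLLAAK")
print(composition["K"], composition["L"], composition["M"])
```

Each protein becomes a point in a 20-dimensional space, which is the representation that composition-based methods then feed to a statistical or machine-learning classifier.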

Database of Protein subcellular localization

Online Prediction tools

Proteome Analyst

Dönnes, P. and Höglund, A. (2004). Predicting protein subcellular localization: past, present, and future. Genomics Proteomics Bioinformatics 2(4): 209-215.

Chou, K.-C. and Elrod, D. W. (1999). Protein subcellular location prediction. Protein Engineering 12(2): 107-118.

Bioinformatics for analysing metagenomes

Metagenomics is a new field of research in which scientists analyze the genomes of organisms recovered directly from the environment. Most naturally occurring bacteria cannot be cultured and therefore cannot be analyzed by traditional means. Metagenomic studies provide us with a mechanism for analyzing previously unknown organisms. At the same time we can examine the diversity of organisms present in specific environments as well as analyze the complex interactions between members of a specific environment. Scientists can study the smallest component of an environmental system by extracting DNA from organisms in the system and inserting it into a model organism.

The isolation, archiving and analysis of environmental DNA (or so-called 'metagenomes') has enabled us to mine microbial diversity, allowing us to access their genomes, identify protein coding sequences and even to reconstruct biochemical pathways, providing insights into the properties and functions of these organisms. The generation and analysis of (meta)genomic libraries is thus a powerful approach to harvest and archive environmental genetic resources. It will enable us to identify which organisms are present, what they do, and how their genetic information can be beneficial to mankind.

The mining of genomes and metagenomic libraries will not only provide new enzymes for biotechnological processes and a basis to study new protein structures and catalytic mechanisms, but will also enable the functional assignment of many proteins found in abundance in databases and currently designated as ‘hypothetical’ or ‘conserved hypothetical’ proteins. The identification of novel catalysts will both improve existing processes and will lead to the design of novel processes for making innovative products or high-value intermediates.

One of the main focuses in analysing metagenomes is the use of genomic analysis tools to find novel genes, discover novel pathways and functional groups, and carry out evolutionary studies.

Gene finding
Gene finding is a fundamental goal in virtually all metagenomics projects, regardless of whether complete genome sequences can be assembled or not.
>>Gene prediction can be done using GLIMMER which is trained on long open reading frames.
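
To make "long open reading frames" concrete, here is a minimal sketch that extracts ATG-to-stop ORFs from the forward strand. It is not GLIMMER's algorithm, which scores candidate genes with a statistical model; it only shows the kind of ORF extraction that can supply training sequences. The function name, the min_codons parameter and the toy sequence are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for ATG..stop ORFs on the forward strand.

    start/end are 0-based, end-exclusive; the length in codons includes
    the stop codon. The reverse strand is not scanned in this sketch.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens an ORF in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF after the stop
    return orfs

# Toy sequence with one ORF (ATG AAA TTT GGG TAA) in frame 2.
sequence = "CCATGAAATTTGGGTAACC"
orfs = find_orfs(sequence, min_codons=4)
print(orfs)
```

On real metagenomic reads the length cutoff would be far larger, and ORFs that run off the end of a read (partial genes) would need separate handling.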


Discovering novel pathways & functional groups
>>Predicted genes are searched with BLAST against the COGs or KEGG databases
>>Single-linkage hierarchical clustering is then performed (e.g. Cluster & Treeview)


Dealing with partial sequences
Many metagenomes contain partial sequences. The partial sequences create an obstacle in phylogenetic studies. However, the problem can be solved by aligning the partial sequences against the complete ones and performing the phylogenetic assignment by finding the closest sequences in the database.
>>Performing semi-global multiple alignment (i.e., terminal gaps are not penalized). The most widely used alignment tools are based on global or local alignments and do not correctly handle partial sequences.
>>Multiple alignment using the MUSCLE tool: although not optimized for partial sequences, MUSCLE does a reasonable job, as ascertained by several criteria: the number of internal gaps was small, sequences shorter than the read length had either no beginning gaps or no ending gaps (since the gene length is greater than the read length), and the total length was comparable to related proteins.
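
The effect of not penalizing terminal gaps can be shown with a small pairwise example. The sketch below computes a semi-global alignment score by dynamic programming; it is a pairwise illustration with made-up scoring values (match, mismatch, gap), not the multiple-alignment procedure used by MUSCLE or by any specific metagenomics pipeline.

```python
def semiglobal_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best alignment score of a and b when terminal gaps cost nothing."""
    rows, cols = len(a) + 1, len(b) + 1
    # First row and column stay 0, so leading gaps are free.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap     # gap in b (internal, penalized)
            left = score[i][j - 1] + gap   # gap in a (internal, penalized)
            score[i][j] = max(diag, up, left)
    # Taking the best cell in the last row or last column makes
    # trailing gaps free as well.
    return max(max(score[-1]), max(row[-1] for row in score))

# A partial read that matches the middle of a longer sequence.
full_length = "GATTACAGATTACA"
partial = "TTACAG"
best = semiglobal_score(full_length, partial)
print(best)
```

With a global alignment the same pair would be charged for eight gap positions at the ends, which is why global tools misjudge partial sequences.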


Drug-Target Database for Drug-Discovery

The BindingDB is a public, web-accessible database of measured binding affinities of small, drug-like molecules for proteins known to be drug-targets. BindingDB supports the discovery of new medications by enhancing the availability and utility of these critical data.

BindingDB currently contains 20 000 experimentally determined binding affinities of protein–ligand complexes, for 110 protein targets including isoforms and mutational variants, and 11 000 small molecule ligands.

The BindingDB website provides an increasingly rich set of tools for query, analysis and download of binding data. Search capabilities include queries by target name; ligand name; affinity range; chemical structure, substructure and similarity; and target sequence, via BLAST.

The website also provides web-accessible tools for virtual screening of candidate ligands. The user provides a training set of ligands active against a given target or class of targets, either by using queries to form a BindingDB data set or by uploading files of molecules not in the database, in order to compare them to inhibitors of a particular enzyme.

Go to

A tool for alignment of protein interaction networks

PathBLAST is a network alignment and search tool for comparing protein interaction networks across species to identify protein pathways and complexes that have been conserved by evolution. Target protein–protein interaction networks are currently available for Helicobacter pylori, Saccharomyces cerevisiae, Caenorhabditis elegans and Drosophila melanogaster. Partial protein–protein interaction networks are also available for Homo sapiens and Mus musculus.

PathBLAST searches for high-scoring pathway alignments between two paths, one from each network, in which proteins of the first path are paired with putative orthologs occurring in the same order in the second path. Pathway alignments are scored by the degree of protein sequence similarity at each pathway position and by the quality of the protein interactions they contain.

The PathBLAST front page prompts users to specify both the pathway query and the target network. The score of each pathway alignment is also reported with each textual and graphical alignment result.

Go to

Subscribe to bioinformatics Blog: Get Free Updates via Email or RSS

You can subscribe to bioinformatics blog via

1. Email subscription:
Subscribe to bioinformatics blog email and get updates delivered to your inbox. Your email address is 100% safe.

2. Subscription via RSS reader:
Subscribing to an RSS feed is easy.
* Click the orange button below and choose your favourite RSS reader (or) choose any of the readers from the drop-down list.
* Another way to subscribe is to copy the link by right-clicking the orange button and paste the address in your favourite RSS reader. Click here to see the list of free readers

Designing primer through computational approach

A primer is a short synthetic oligonucleotide which is used in many molecular techniques from PCR to DNA sequencing. These primers are designed to have a sequence which is the reverse complement of a region of template or target DNA to which we wish the primer to anneal.

A good primer should usually meet the following criteria:
1. Primers should be 18-38 bases in length;
2. Base composition should be 50-60% (G+C);
3. Primers should end (3') in a G or C, or CG or GC: this prevents "breathing" of ends and increases efficiency of priming;
4. Melting temperatures (Tm) of 55-80 degrees Celsius are preferred;
5. 3'-ends of primers should not be complementary (i.e. base pair), as otherwise primer dimers will be synthesised preferentially to any other product;
6. Primer self-complementarity should be avoided;
7. Runs of three or more Cs or Gs at the 3'-ends of primers may promote mispriming at G or C-rich sequences (because of stability of annealing), and should be avoided.
(adapted from Innis and Gelfand, 1991)
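
Several of these criteria are easy to check in a few lines of code. The sketch below is only an illustration: the function names are made up, and the melting temperature uses the simple Wallace rule, Tm = 2(A+T) + 4(G+C), which is an assumption chosen for brevity and is only a rough estimate. The complementarity criteria (5 and 6) are not covered here.

```python
def gc_percent(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)

def wallace_tm(primer):
    """Rough melting temperature in degrees Celsius by the Wallace rule (an assumption)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc

def check_primer(primer):
    """Return a pass/fail flag for criteria 1, 2, 3, 4 and 7 of the list above."""
    primer = primer.upper()
    return {
        "length_18_38": 18 <= len(primer) <= 38,
        "gc_50_60": 50.0 <= gc_percent(primer) <= 60.0,
        "gc_clamp": primer[-1] in "GC",
        "tm_55_80": 55 <= wallace_tm(primer) <= 80,
        # criterion 7: the last three bases should not all be G or C
        "no_3prime_gc_run": not all(base in "GC" for base in primer[-3:]),
    }

print(check_primer("ATGCATGCATGCATGCATGC"))
```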

Steps to follow when designing a primer through an in silico approach:
1. Calculating Properties of an Oligonucleotide
2. Finding All Potential Forward Primers
3. Finding All Potential Reverse Primers
4. Filtering Primers Based on GC Content
5. Filtering Primers Based on Their Melting Temperature
6. Filtering out the Primers With Self-Dimerization and Hairpin Formation
Self-dimerization and hairpin formation can prevent the primer from binding to the target sequence.
7. Filtering out Primers Without a GC Clamp
Strong base pairing at the 3' end of the primer helps in PCR. Find and remove all the primers that do not end in a G or C.
8. Filtering out Primers With Nucleotide Repeats
Primers that have stretches of repeated nucleotides can give poor PCR results.
9. Finding the Primers That Satisfy All the Criteria
10. Checking for Cross Dimerization
Cross dimerization can occur between the forward and reverse primer if they have a significant amount of complementarity. The primers will not function properly if they dimerize with each other. To check for dimerization, align every forward primer against every reverse primer.
11. Visualizing Potential Pairs of Primers in the Sequence Domain
An alternative way to present this information is to look at all potential combinations of primers in the sequence domain. Each dot in the plot represents a possible combination between the forward and reverse primers after filtering out all those cases with potential cross dimerization.
12. Finding Restriction Enzymes That Cut Inside the Primer
Find the restriction enzymes from the REBASE database that will cut a primer. These restriction enzymes can be used in the design of cloning experiments. For example, you can use this on the first pair of primers from the list of possible primers that you just calculated.
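
A few of the steps above can be sketched in plain Python. The helper names, the fixed primer length, the repeat length of 4 and the 4-base window for the dimer check are all assumptions for illustration; a real design tool would use proper alignment and thermodynamic scoring rather than the exact 3'-end match used here.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def forward_candidates(template, length):
    """Steps 2: every window of the template is a potential forward primer."""
    template = template.upper()
    return [template[i:i + length] for i in range(len(template) - length + 1)]

def reverse_candidates(template, length):
    """Step 3: reverse primers are reverse complements of template windows."""
    return [reverse_complement(p) for p in forward_candidates(template, length)]

def has_nucleotide_repeat(primer, run=4):
    """Step 8: True if any base is repeated `run` times in a row."""
    return any(base * run in primer.upper() for base in "ACGT")

def three_prime_dimer(p1, p2, n=4):
    """Step 10 (simplified): True if the last n bases of p1 and p2 can base pair."""
    return p1.upper()[-n:] == reverse_complement(p2[-n:])

print(forward_candidates("ATGCA", 3))
print(reverse_candidates("ATGCA", 3))
print(three_prime_dimer("CCCCAACC", "TTTTGGTT"))
```

Candidates that survive these filters would then be paired up, keeping only the forward and reverse combinations for which three_prime_dimer is False.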

PCR primers based upon protein sequence:
If you have the protein sequence and want the DNA sequence, you can reverse translate the protein using reverse translation tools.
(i) Reverse translate a protein:

(ii) Back translation tool: We can enter sequence motifs which will be included in or excluded from the resulting optimal sequence

For changing a specific amino acid
(i) Reverse translator:

Useful online tools for PCR design:
1.Manipulating the sequence

2.Analysing oligonucleotide properties (calculating GC content, molecular weight, melting temperature)

3.For dimer check, hairpin check, 5'GC content check, 3'GC content check
Primer design assistant:

4.Visualizing Potential Pairs of Primers

5. For restriction analysis

Online PCR design suite:
2.Interactive primer design
3.PCR Now

Free software tools:
1. Download Fast PCR (for windows)

Protein Quaternary structure file Server

The concept of quaternary structure was first put forward by Bernal in 1958. Quaternary structure is defined as the level of form in which units of tertiary structure aggregate to form homo- or hetero-multimers. Quaternary structure is the interaction of non-covalently bound monomeric protein subunits to form oligomers. Such complexes are involved in various biological processes, including metabolism, signal transduction and chromosome replication.

The Protein Quaternary Structure file server (PQS) is an internet resource that makes available coordinates for probable quaternary states of structures contained in the Protein Data Bank (PDB) that were determined by X-ray crystallography.

Uses of the PQS server
1. Gives a complete description of domain structures that involve more than one chain.
2. Residues that are buried at protein-protein interfaces can be conserved, and this information can be used in homology model building to improve the alignment.


Life inside a cell animation

This is an animation from a Molecular & Cellular Biology program that "transports Biology students into a three-dimensional journey through the microscopic world of a cell". The animation illustrates the mechanisms that allow a white blood cell to sense its surroundings and respond to an external stimulus.

click here to see the movie

Specialized Immunology databases

Immunology databases provide more detailed information on immunologically relevant molecules, systems and processes. They are typically annotated by experts and contain immunology-specific annotations.

The Kabat database contains entries of proteins of immunological interest: Ig, T cell receptors (TCR), major histocompatibility complex (MHC) molecules and other immunological proteins.
It is no longer freely available through the website. It is available for purchase for US $2250.

The IMGT databases contain high-quality annotations of DNA and protein sequences of Ig, TCR and MHC. They also contain IMGT-related genomic and structural data.

The FIMM database focuses on protein antigens, MHC molecules and structures, MHC-associated peptides and relevant disease associations.

The SYFPEITHI database contains entries of MHC ligands and peptide motifs.

The HIV molecular immunology database is an annotated searchable repository of HIV-1 T cell and B cell epitopes.

Get a free copy of the Human Genome Landmarks poster!

The Human Genome Landmarks poster is a 24" by 36" wall poster that lists selected genes, traits, and disorders associated with each of the 24 different human chromosomes.

Request a free print copy of this poster online.

Click here to order online

Motifs discovery in groups of related DNA/protein sequences

1. Go to the MEME website and click on Discover Motifs.
2. Fill in the following fields in the MEME input form:
a. E-mail address: Enter the E-mail address where results are to be sent.
b. Description (optional): Enter information describing the sequences and/or parameters of the MEME run. This information will be included in the subject of the E-mail message received from MEME and can be very useful if submitting many MEME runs.
c. Name of a file: Use the Browse button to enter the path to the training set file.
d. Number of motifs: Enter 2.
3. Click on the Start Search button. This will submit the search to the MEME Web-server at the SDSC. Within a few seconds, the browser should display a verification message.
4. Use an E-mail reader to receive the confirmation message MEME will send. If this message does not arrive, it is possible that the Email address was mistyped. In that case, resubmit the MEME run.
5. Save MEME results to a text file.

Toolbar for browsing biological data and databases

The biobar project is a bioinformatics power-browsing toolbar for Mozilla-based browsers including Firefox/Flock/Mozilla/Netscape and Seamonkey.

The primary advantage of this tool is that it allows a biologist to browse and retrieve data from Genomic, Proteomic, Functional, Literature, Taxonomic, Structural, Plant and Animal-specific databases. In addition to the browsing features, biobar also provides links to important bioinformatics sites and services including services at the European Bioinformatics Institute (EBI), National Center for Biotechnology Information (NCBI) and DNA Data Bank of Japan (DDBJ). The tool also provides links to major data deposition sites for nucleotide, protein and 3D-structure data. Finally, the menu also contains links to many sequence and structure alignment and analysis tools. Biobar provides browsing access to over 46 different databases (including Google Scholar, HubMed etc).

Install biobar toolbar

Don't have the Firefox browser? Get it now.

Keeping up with the Human genome - Tim Hubbard

Abstract from Tim Hubbard talk:
Thirty times bigger than the worm genome that we were only just getting to grips with and with far greater numbers of interested users. The Ensembl project was started from scratch to handle this data: a system to store the data in an RDBMS; a pipeline to generate a pre-computed set of analyses; an API to provide both web and programmatic access. Ensembl evolves continuously: a new release is made every 2 months and in nearly every release the schema is updated to handle new data types. It now integrates more than thirty large genomes and provides researchers with a resource of >300Gb of data, all of which is free to download. The website alone generates >1 million page impressions per week. However, with genome sequencing output per machine having recently jumped 300-fold and costs having dropped 10-fold, with more drops promised, what Ensembl deals with now is tiny compared to what is to come.

Despite all this data, we are far from understanding our genome. Given the complexity of the system it is probably only feasible to tackle it as a huge global collaborative project, making data integration and exchange critical. One of the most significant features of the genome sequence is that it provides a framework to organize other biological information. However, there's a limit to how much can be usefully imported into a single database, especially as new resources spring up continuously and frequently are of unknown scientific value. The web has been constructed on links, however it's hard to compare data unless it is easily aggregated. The Distributed Annotation System (DAS) is essentially a system of standardized web services: each provider runs a DAS server; DAS clients can aggregate data from as many servers as they wish around a single coordinate system, i.e. a genome sequence. Ensembl is both a DAS server and DAS client. There are analogies with layering data on Google Earth, except that here the servers of different layers are distributed. However, visual integration is only a first step: the genome is too big for researchers to explore manually. We are going to need to computationally guide researchers to the most interesting areas of the genome.

Computational docking

Computational docking is a technique with which one predicts the 3D structure of the complex between two or more molecules. Typically, its applications are confined to protein-protein complexes and to associations between proteins and small molecules. The 3D structure of the individual partners must be known and it is possible to consider computational docking as an extension of the modelling techniques, used to predict 3D structure of proteins. Computationally it is possible to determine the 3D structure of complexes that cannot be obtained through experimental techniques like crystallography or NMR spectroscopy.

A number of inter-molecular interactions are in fact transient, from a kinetics point of view, or weak, from a thermodynamics point of view. Consequently they cannot be studied experimentally, since the average concentration of the complex is too low. Computational docking is therefore the only possibility to determine the 3D features of these types of inter-molecular associations.

Membrane Proteins Structure database

A database specialized in membrane protein structures determined by X-ray and electron diffraction, with links to the Protein Data Bank and other useful sites. It is a typical example in which the sub-cellular location and, to some extent, the physiological function are the criteria for inclusion of the data. It provides several links to other sources of information.

Go to Membrane Proteins Known 3D Structure

Proteins that lack definite 3D structure

Not all proteins have a definite 3D structure. Partially and wholly unstructured proteins have been identified in all kingdoms of life, more commonly in eukaryotic organisms. These proteins are called intrinsically disordered proteins, and the phenomenon is known as protein disorder. The unstructured regions are gaining importance since they take part in functionally important pathways (e.g. signal transduction pathways) and are associated with various disease-related proteins. Their functions are classified into four categories: molecular recognition, molecular assembly/disassembly, protein modification and entropic chains.

Protein disorder can be directly studied by NMR or circular dichroism, or indirectly detected by a variety of experimental methods including stretches of missing electron density in X-ray crystallography maps, Raman spectra, hydrodynamic measurements or even limited, time-resolved proteolysis. Each of these methods detects different aspects of disorder, resulting in different operational definitions of protein disorder.

Visualize DNA structure through Music

DNA can be represented in a variety of ways, which can provide different visual perspectives of molecular structure. This Musical Atlas presents an aural representation of the B-DNA molecules without mismatches, drugs, or modifiers. For each structure, there is a "Plain Melody," which follows a simple algorithm to highlight the structure's sequence, and a "Composition," which follows a more complicated algorithm that features the base pairing of the structure.


In each melody, each base in the sequence is played for one beat.


For each composition, there are four measures in which every quarter note gets one beat.

The number of beats per measure is based upon the length of the nucleotide sequence: it is half the number of bases per strand.

The sequences used here all had an even number of base pairs. However, if a sequence contained an odd number of bases, the number of beats per measure would be half that number, rounded down (that is, half the amount minus the remainder).

Each base in the asymmetrical strand is an eighth note (as opposed to the quarter note used in the Plain Melody; an eighth note is half the length of a quarter note).
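
The timing rules above reduce to simple arithmetic, sketched here. The function names are made up for illustration, and reading "half that amount minus the remainder" as integer division is an assumption. The pitch assigned to each base is not described above, so it is left out.

```python
def plain_melody_beats(sequence):
    """Plain Melody: every base is played for one beat."""
    return len(sequence)

def beats_per_measure(strand_length):
    """Composition: half the number of bases per strand, rounded down for odd lengths."""
    return strand_length // 2

def composition_strand_beats(sequence):
    """Composition: each base is an eighth note, i.e. half a beat."""
    return len(sequence) * 0.5

strand = "CGCGAATTCGCG"   # a 12-base example strand chosen for illustration
print(plain_melody_beats(strand), beats_per_measure(len(strand)), composition_strand_beats(strand))
```

For an even-length strand, one pass through the sequence in eighth notes fills exactly one measure, which fits the four-measure layout of each composition.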

The compositions consist of two lines:

Melodic Line

The melodic line is the melody derived directly from the sequence of the molecules. If the asymmetric strand is self-complementary, the DNA molecule will have only one melody. If the strand(s) in the asymmetrical unit is(are) not self-complementary, both the asymmetrical strand and its symmetry related strand each have a separate melody.

In this algorithm, there are four measures to each melody. The melodic line consists of the asymmetrical sequence repeated four times.

Bass Line

The first measure is a full measure rest for the bass line while the full sequence is played on the melodic line.

The second measure begins with the complementary strand. This strand is read 3' to 5' (essentially, it base pairs with the melody).

The third measure slightly expands upon the base pairing concept of the second measure. Using notes from the a minor scale, the base pairing note in the bass line is followed by specifically assigned notes to create counterpoint while the melody is being

Go to Musical Atlas

Database for RNA structural classification

The Structural Classification of RNA (SCOR) is a database designed to provide a comprehensive perspective and understanding of RNA motif structure, function, tertiary interactions and their relationships.

The structural elements are organized in a directed acyclic graph (DAG) architecture, allowing multiple parent classes for a motif. Users can browse the database or search by PDB or NDB identifier, keyword or sequence. Descriptions and cartoon representations of each of the classes are available.

The SCOR database can be used for RNA functional prediction and for searching for functional RNAs in genomes; further, it can be used for RNA design and discovery of RNA protein.

Go to SCOR database

Human Protein Reference Database

The Human Protein Reference Database (HPRD) integrates information relevant to the function of human proteins in health and disease.

It contains data pertaining to thousands of protein-protein interactions, posttranslational modifications, enzyme/substrate relationships, disease associations, tissue expression, and subcellular localization for each protein in the human proteome.

Go to Human Protein Reference Database

Few suggestions for good biological database design

A few suggestions for good biological database design (from the NAR database issue 2007 editorial). Here is the summary:

1. The quality, quantity and originality of the data, as well as the quality of the web interface, are the most important.

2. A web database should be comprehensive (it should not be overspecialized) and should attribute original data sources.

3. Bulk data should be available as flat files.

4. The database web address should have a unique domain name that is easy to remember. The web interface and searching should be easy.

5. Provide help and examples wherever necessary.

6. The server should not be slow.

The 2007 database issue update includes 968 databases, 110 more than the previous one.

It can be viewed online

Nucleic Acids Research, 2007, Vol. 35, Database issue D1-D2

A database of incorrect Protein conformations

The Decoys 'R' Us database contains a wide variety of decoys generated by different methods with the aim of fooling scoring functions. Decoys are computer-generated conformations of protein sequences that possess some characteristics of native proteins, but are not biologically real.

Decoys have been based on discrete-state models, molecular dynamics trajectories, crystal structures of different resolutions, conformations with different loops, and amino acid sequences mounted on radically different folds.

In other words, this database provides incorrect conformation data in order to improve protein structure prediction.

Organisation of decoy sets

1.The multiple decoy sets
2.The single decoy sets
3.The loop decoy sets

The current version of the entire decoy set is only available as a single tar and gzipped file to download.

Go to Decoys 'R' Us database

Tips to use EBI new search interface-"EB-eye"

EMBL-EBI launched its new website interface with the "EB-eye", a powerful search engine allowing instant searches of all the EBI's databases from a single query.

EB-eye Search is developed on top of the Apache Lucene project framework, which is an open-source, high-performance, full-featured text search engine library written entirely in Java. It uses this technology to index EBI databases in various formats (e.g. flatfiles, XML dumps, OBO format, etc.) and provides very fast access to the EBI's data resources. The system allows the user to search globally across all EBI databases or individually in selected resources by using an Advanced search.

1. Simple search

(i)boolean operators
* AND - (default) meaning that term1 AND term2 must exist in the searched documents. e.g. cytochrome AND c
* OR - meaning that either term1 OR term2 must exist in the searched documents. e.g. cytochrome OR c
* NOT - meaning that term1 must not be present in any of the displayed documents (i.e. excludes documents containing term1). e.g. glutathione NOT transferase

* + '+term1' - The document must contain the term1.
* - '-term1' - Prohibit operator: The document must not contain term1.

At the bottom of any results page there is a 'Refine your search' box. It allows the user to add terms to the query and automatically appends additional AND operators to the search.

(ii)Term Modifiers:

* '*' - as in 'gluta*' (glutacin, glutamate, glutamic, etc.)
* '?' - as in 'b?nd' (bind, bond, band, etc.)

(iii) Grouping terms together using parentheses, e.g. (reductase OR transferase) AND glutathione
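
To get a feel for how the '*' and '?' term modifiers behave, here is a small sketch using Python's standard fnmatch module, where '*' stands for any run of characters and '?' for exactly one character. This only imitates the wildcard behaviour on a local word list; it does not query EB-eye, and the helper name matching_terms and the word lists are made up for illustration.

```python
from fnmatch import fnmatchcase

def matching_terms(pattern, terms):
    """Return the terms that match a wildcard pattern ('*' = any run, '?' = one character)."""
    return [term for term in terms if fnmatchcase(term, pattern)]

print(matching_terms("gluta*", ["glutamate", "glutamic", "glucose"]))
print(matching_terms("b?nd", ["bind", "bond", "band", "blind", "bnd"]))
```

Note that '?' must match exactly one character, so 'blind' and 'bnd' are not returned by the second query.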

2. Advanced Search

(i) Searches with all the words in a string
(ii) Searches of the exact phrase - quoted string
(iii) Searches with at least one of the words in the string
(iv) Searches that display results where none of the words in the input string are present

3. Domain Specific Search

Allows the user to narrow searches to specific databases


Build Protein Model from your sequence in an easy way

If your protein sequence shows significant homology to another protein of known three-dimensional structure, then a fairly accurate model of your protein 3D structure can be obtained via homology modelling.

The easiest way to do homology modelling automatically is through the SWISS-MODEL server's First Approach mode. SWISS-MODEL is a fully automated protein structure homology-modelling server.

1. Fill in your details: your email address (the PDB file of the homology model will be emailed to you), your name, and a title to identify your model.

2. Paste your protein sequence in the space provided. Sequences can be provided in RAW, SWISS-PROT, FASTA or GCG format.

3.Click Send request

1. It is possible to send in a protein sequence only.
2. Recommended: use this approach only if the degree of sequence identity between your query sequence and the template sequences is high (50% or greater), so as to get a good model. This can be checked by a similarity search of your query sequence against the PDB database using the BLAST tool.
3.Carefully read the header section of the files to know what templates and alignments were used during the model building process.
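
The 50% threshold mentioned above can be checked on any pairwise alignment, for example one copied from a BLAST result. The sketch below is an illustration only: the function name percent_identity is made up, and counting identity over aligned (non-gap) columns is an assumption, since different tools use different denominators.

```python
def percent_identity(aligned_query, aligned_template):
    """Percent identity over columns where neither aligned sequence has a gap ('-')."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(q, t) for q, t in zip(aligned_query.upper(), aligned_template.upper())
             if q != "-" and t != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for q, t in pairs if q == t)
    return 100.0 * matches / len(pairs)

identity = percent_identity("ACDE", "ACDQ")
print(identity, "suitable for automated modelling" if identity >= 50 else "identity too low")
```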

Go to Swiss-Model Server

Ligand searching in Protein Data Bank (PDB)

Partial string search

Ligand name searching supports partial string matches. For example, searching for 'benz' will return all structures that contain benzene as well as those containing benzamidine.

Exact match search

For an exact match, the complete name of the ligand must be entered. Ligand searches can also be performed using the three-character ligand ID in the PDB file (the "HET" record). For example, searching for 'HEM' returns all structures that have a heme ligand.

MarvinSketch search

The PDB can be searched for structures containing the same ligand by drawing a ligand in MarvinSketch (provided by ChemAxon).

SMILES string search

SMILES (Simplified Molecular Input Line Entry System) is a line notation for describing chemical structures.

e.g. SMILES string for benzene: C1=CC=CC=C1 or c1ccccc1

The SMILES search feature provides the ability to query for ligands using a SMILES string representation.

Go to PDB Ligand search

Aligning two sequences

To compare only two sequences that are already known to be homologous, for example from related species, the 'BLAST 2 Sequences' tool can be used.

'BLAST 2 Sequences' uses the BLAST algorithm to align two protein or two nucleotide sequences (i.e. DNA-DNA or protein-protein sequence comparison).

The resulting alignments are presented in both graphical and text form. A World Wide Web version of the program can be used interactively at the NCBI WWW site.

>>Strand option: forward strand, reverse strand or both strands
>>Parameters: reward for a match and penalty for a mismatch
>>View options: standard, mismatch highlight
>>Masking colour option: black, grey and red

Go to BLAST2 Sequences

PubMed Search Tips

PubMed was developed by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) as part of the Entrez retrieval system.
It provides free access to MEDLINE, the NLM database of indexed citations and abstracts to medical, nursing, dental, veterinary, health care, and preclinical sciences journal articles.
It includes additional selected life sciences journals not in MEDLINE. It adds new citations Tuesday through Saturday.

Basic Search Techniques

1. Type any key word or phrase into the search box. Use an asterisk (*) to retrieve variations on a word, e.g., bacter* retrieves bacteria, bacterium, bacteriophage, etc.

For a Subject Search: Enter one or more words (e.g., asthma drug therapy) in the query box and click on Go. PubMed automatically "ANDs" (combines) terms together so that all terms or concepts are present, and it translates your words into MeSH terms.

For an Author Search: Enter the author's name in the format of last name first followed by initials (e.g., byrnes ca).

Use Boolean operators (AND, OR, and NOT) to combine topics in the search box if desired.

2. Click the Go button to run your search.

3.Setting Limits

Click on 'Limits' on the Feature tabs. Choose the restrictions for your search, e.g. a specific language, article type, date, or subset of PubMed, e.g. nursing journals, cancer or bioethics.

Note: Limits remain in place until you change or remove them. Limits other than language or date will exclude NEW records that are "in process" or "supplied by Publisher."

4.Anatomy of a PubMed Search
PubMed employs a process called Automatic Term Mapping. This means that your search term is matched against (in the following order):

1. MeSH (Medical Subject Headings) Translation Table
2. Journals Translation Table
3. Phrase List
4. Author Index

For example:
Enter mad cow disease and PubMed will search for the mapped MeSH heading, Encephalopathy, Bovine Spongiform OR the text words mad cow disease.

Go to PubMed
