SureshKumar's Bioinformatics Blog

I am Suresh Kumar Sampathrajan. I completed my PhD in bioinformatics at the University of Vienna, Austria, in 2010. If you want to know more about me and my research, please use the menus at the top.

I started this bioinformatics blog mainly for undergraduate and postgraduate students of bioinformatics. It serves as an open resource for students and for anyone who wishes to learn about bioinformatics. The blog contains video tutorials, tips, bioinformatics software downloads, articles on bioinformatics and career opportunities.

Google Desktop tweak: Searching description names in multiple genome files

The description lines in FASTA files are a valuable source of details about particular sequences. For example:

>gi|30677876|ref|NP_849568.1| LHY (LATE ELONGATED HYPOCOTYL); DNA binding /transcription factor [Arabidopsis thaliana]

This header line tells us that the protein has DNA binding and transcription factor activity.
Suppose we have more than 100 files and want to search for a particular description. We can do this with Google Desktop search by using a simple tweak. Of course, the same thing can easily be done with a simple script; this is just to show the power of Google Desktop search.

1. I downloaded the genome sequences of Arabidopsis thaliana from the NCBI FTP site and renamed the file extension to .txt.

2. I placed the downloaded sequences in a folder named "arbi" inside My Documents.

3. Now download Google Desktop search.

4. Now go to the advanced search options in Google Desktop search and choose the following:
(i) In show results: choose files
(ii) In the file type: select text
(iii) In the location: My Documents\arbi
(iv) Has the words: phosphatidylinositol

Alternatively, type the following in the search bar and click Search Desktop:
phosphatidylinositol filetype:txt under:"C:\Documents and Settings\..\My Documents\arbi"

The results now list the files that contain the word "phosphatidylinositol"; here it was found in two files.
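As mentioned at the start of this post, the same search can also be done with a short script. Here is a minimal Python sketch that scans every .txt file in a folder and reports the description lines (those starting with ">") that contain a keyword. The function name search_descriptions and its parameters are my own choices for illustration, not part of any existing tool.

```python
from pathlib import Path


def search_descriptions(folder, keyword, pattern="*.txt"):
    """Return (file name, header line) pairs whose FASTA header contains keyword."""
    keyword = keyword.lower()
    hits = []
    for path in sorted(Path(folder).glob(pattern)):
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                # In FASTA format only description lines start with ">".
                if line.startswith(">") and keyword in line.lower():
                    hits.append((path.name, line.strip()))
    return hits
```

For the folder used above, a call such as search_descriptions("arbi", "phosphatidylinositol") would return one entry per matching header, assuming the script is run from the directory that contains the "arbi" folder. The match is case-insensitive and ignores the sequence lines themselves.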

Protein subcellular location prediction

One of the fundamental goals in cell biology and proteomics is to identify the functions of proteins in the context of compartments that organize them in the cellular environment. Knowledge of subcellular locations of proteins can provide key hints for revealing their functions and understanding how they interact with each other in cellular networking. Unfortunately, it is both time-consuming and expensive to determine the localization of an uncharacterized protein in a living cell purely based on experiments.

Location classification
According to their subcellular locations, proteins are classified into the following 12 discriminative groups: (1) chloroplast, (2) cytoplasm, (3) cytoskeleton, (4) endoplasmic reticulum, (5) extracell, (6) Golgi apparatus, (7) lysosome, (8) mitochondria, (9) nucleus, (10) peroxisome, (11) plasma membrane and (12) vacuole.

Such a classification covers almost all the organelles in an animal or plant cell. With the rapid increase in new protein sequences entering the data banks, we are confronted with a challenge: is it possible to use a bioinformatic approach to help expedite the determination of protein subcellular locations? The enormous complexity of the protein sorting process, the alternative transportation pathways, and the lack of complete data for every organelle present great challenges to developers of prediction methods.

Categories of computational predictors
Computational methods for predicting protein subcellular localization can generally be divided into four categories: prediction methods based on
(i) The overall protein amino acid composition,
(ii) Known targeting sequences,
(iii) Sequence homology and/or motifs, and
(iv) A combination of several sources of information from the first three categories (hybrid methods).
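To make category (i) a little more concrete, here is a small Python sketch that turns a protein sequence into its amino acid composition, i.e. the fraction of each of the 20 standard residues. This is only the feature-extraction step; the predictors themselves feed such vectors into a trained classifier, which is not shown here. The function name amino_acid_composition is my own and is not taken from any of the tools.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in a protein sequence."""
    sequence = sequence.upper()
    # Non-standard symbols (X, *, gaps) are ignored in this sketch.
    counts = Counter(residue for residue in sequence if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}
```

Each protein then becomes a point in a 20-dimensional space, and proteins from the same compartment are expected to lie closer together than proteins from different compartments.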

Database of Protein subcellular localization

Online Prediction tools

Proteome Analyst

Dönnes, P. and Höglund, A. (2004). Predicting protein subcellular localization: past, present, and future. Genomics Proteomics Bioinformatics 2(4): 209-215.

Chou, K.-C. and Elrod, D. W. (1999). Protein subcellular location prediction. Protein Engineering 12(2): 107-118.

Bioinformatics for analysing metagenomes

Metagenomics is a new field of research in which scientists analyze the genomes of organisms recovered directly from the environment. Most naturally occurring bacteria cannot be cultured and therefore cannot be analyzed by traditional means. Metagenomic studies provide us with a mechanism for analyzing previously unknown organisms. At the same time we can examine the diversity of organisms present in specific environments as well as analyze the complex interactions between members of a specific environment. Scientists can study the smallest component of an environmental system by extracting DNA from organisms in the system and inserting it into a model organism.

The isolation, archiving and analysis of environmental DNA (or so-called 'metagenomes') has enabled us to mine microbial diversity, allowing us to access their genomes, identify protein coding sequences and even to reconstruct biochemical pathways, providing insights into the properties and functions of these organisms. The generation and analysis of (meta)genomic libraries is thus a powerful approach to harvest and archive environmental genetic resources. It will enable us to identify which organisms are present, what they do, and how their genetic information can be beneficial to mankind.

The mining of genomes and metagenomic libraries will not only provide new enzymes for biotechnological processes and a basis to study new protein structures and catalytic mechanisms, but will also enable the functional assignment of many proteins found in abundance in databases and currently designated as ‘hypothetical’ or ‘conserved hypothetical’ proteins. The identification of novel catalysts will both improve existing processes and will lead to the design of novel processes for making innovative products or high-value intermediates.

One of the main aims in analysing metagenomes is to use genomic analysis tools to find novel genes, discover novel pathways and functional groups, and carry out evolutionary studies.

Gene finding
Gene finding is a fundamental goal in virtually all metagenomics projects, regardless of whether complete genome sequences can be assembled or not.
>>Gene prediction can be done using GLIMMER, which is trained on long open reading frames.
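
To show what "long open reading frames" means in practice, here is a rough Python sketch that collects ORFs (from an ATG to the next in-frame stop codon) on both strands and keeps only those above a minimum length. This is not GLIMMER's own procedure, which is more involved; the function names and the default cut-off of 300 nucleotides are assumptions chosen for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def long_orfs(dna, min_length=300):
    """Return ORFs (ATG through stop codon) of at least min_length nucleotides."""
    dna = dna.upper()
    orfs = []
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon in this frame opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_length:
                        orfs.append(strand[start:i + 3])
                    start = None  # look for the next ORF in the same frame
    return orfs
```

ORFs that run off the end of a fragment without a stop codon are dropped in this sketch, which matters for metagenomic reads, where many genes are incomplete.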


Discovering novel pathways & functional groups
>>Predicted genes are searched with BLAST against the COG or KEGG databases.
>>Single-linkage hierarchical clustering is performed (e.g. with Cluster & TreeView).
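
For readers who have not met single-linkage clustering before, here is a small pure-Python sketch of the idea: at each step the two clusters whose closest members are nearest to each other are merged. It is not how Cluster or TreeView are implemented; the function name single_linkage, the distance matrix input and the stopping threshold are assumptions made for illustration.

```python
def single_linkage(distances, threshold):
    """Merge clusters until the closest pair is farther apart than threshold.

    distances is a symmetric matrix given as a list of lists.
    Returns a list of clusters, each a sorted list of item indices.
    """
    clusters = [[i] for i in range(len(distances))]

    def linkage(a, b):
        # Single linkage: distance between the closest members of two clusters.
        return min(distances[i][j] for i in a for j in b)

    while len(clusters) > 1:
        dist, x, y = min(
            (linkage(a, b), x, y)
            for x, a in enumerate(clusters)
            for y, b in enumerate(clusters)
            if x < y
        )
        if dist > threshold:
            break
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return clusters
```

Stopping at a threshold corresponds to cutting the hierarchical tree at that height. How the distances are derived from the BLAST results (for example from similarity scores) is left open here.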


Dealing with partial sequences
Many metagenomes contain partial sequences, which are an obstacle in phylogenetic studies. However, the problem can be solved by aligning the partial sequences against complete ones and performing the phylogenetic assignment by finding the closest sequences in the database.
>>Perform a semi-global multiple alignment (i.e., terminal gaps are not penalized). The most widely used alignment tools are based on global or local alignments and do not handle partial sequences correctly.
>>Multiple alignment can be done with the MUSCLE tool. Although not optimized for partial sequences, MUSCLE does a reasonable job, as ascertained by several criteria: the number of internal gaps was small, sequences shorter than the read length had either no beginning gaps or no ending gaps (since the gene length is greater than the read length), and the total length was comparable to that of related proteins.
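
The meaning of "terminal gaps are not penalized" is easiest to see in the pairwise case. The Python sketch below scores a semi-global alignment of two sequences with a simple dynamic programming table: the first row and column start at zero (free leading gaps) and the result is taken from the best cell in the last row or last column (free trailing gaps). It is a pairwise illustration only, not a multiple alignment and not what MUSCLE does; the function name and the match, mismatch and gap values are assumptions.

```python
def semi_global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best alignment score of a and b when terminal gaps cost nothing."""
    rows, cols = len(a) + 1, len(b) + 1
    # First row and first column stay at 0: leading gaps are free.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    # Trailing gaps are free: take the best cell in the last row or column.
    last_row = max(score[rows - 1])
    last_col = max(row[cols - 1] for row in score)
    return max(last_row, last_col)
```

With these settings, the fragment "TTAC" aligned against "GATTACA" scores 4, because the unmatched ends of the longer sequence are not charged as gaps. A global alignment would penalize those three end positions.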


Drug-Target Database for Drug-Discovery

The BindingDB is a public, web-accessible database of measured binding affinities of small, drug-like molecules for proteins known to be drug-targets. BindingDB supports the discovery of new medications by enhancing the availability and utility of these critical data.

BindingDB currently contains 20 000 experimentally determined binding affinities of protein–ligand complexes, for 110 protein targets including isoforms and mutational variants, and 11 000 small molecule ligands.

The BindingDB website provides an increasingly rich set of tools for query, analysis and download of binding data. Search capabilities include queries by target name; ligand name; affinity range; chemical structure, substructure and similarity; and target sequence, via BLAST.

The website also provides web-accessible tools for virtual screening of candidate ligands. The user provides a training set of ligands active against a given target or class of targets by using queries to form a BindingDB data set. The user can also upload files of molecules not in the database to compare them to inhibitors of a particular enzyme.


A tool for alignment of protein interaction networks

PathBLAST is a network alignment and search tool for comparing protein interaction networks across species to identify protein pathways and complexes that have been conserved by evolution. Target protein–protein interaction networks are currently available for Helicobacter pylori, Saccharomyces cerevisiae, Caenorhabditis elegans and Drosophila melanogaster. Partial protein–protein interaction networks are also available for Homo sapiens and Mus musculus.

PathBLAST searches for high-scoring pathway alignments between two paths, one from each network, in which proteins of the first path are paired with putative orthologs occurring in the same order in the second path. Pathway alignments are scored by the degree of protein sequence similarity at each pathway position and by the quality of the protein interactions they contain.

The PathBLAST front page prompts users to specify both the pathway query and the target network. The score of each pathway alignment is also reported with each textual and graphical alignment result.

