SureshKumar's Bioinformatics Blog

I am Suresh Kumar Sampathrajan. I completed my PhD in bioinformatics at the University of Vienna, Austria, in 2010. If you want to know more about me and my research, please use the menus at the top.

I started this bioinformatics blog mainly for undergraduate and postgraduate students of bioinformatics. It serves as an open resource for students and for anyone who wishes to learn about bioinformatics. The blog contains video tutorials, tips, bioinformatics software downloads, articles on bioinformatics and career opportunities.

Guide to the use of computational tools for finding transcription factor binding sites

Transcription factors are proteins that bind to DNA, typically upstream of and close to the transcription start site of a gene, and regulate the expression of that gene by activating or inhibiting the transcription machinery.

Transcription factors contain several functional regions:

* Activation domain: region that interacts with other parts of the transcription machinery (RNA polymerase or other transcription factors).

* DNA binding domain: amino acids in the protein that recognize specific bases near the start of transcription.

* Nuclear localization domain: region that serves as a signal for the protein to go to the nucleus after being synthesized in the cytoplasm.

* Dimerization domain: Many transcription factors work as dimers (two subunits). For these proteins, a region of the protein facilitates interaction with another subunit.

The figure shows several transcription factors (JUN, FOS, Sp1, and basal factors) that are necessary for transcription of some genes.

Computational approaches to finding transcription factor binding sites come in two flavors. One class of methods looks for overrepresented motifs in sequences that are believed to contain several binding sites for the same factor (such as promoters of co-regulated genes). The second class of methods identifies motifs that are significantly conserved in orthologous sequences, e.g., promoters of the same gene in different species. Yet the prediction of such regulatory elements remains a computationally challenging task.
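To make the first class of methods concrete, here is a minimal sketch of an overrepresentation search. It is not the algorithm of any particular tool: it simply counts in how many promoters each k-mer occurs and ranks k-mers by the ratio of their frequency in a set of co-regulated promoters to their frequency in a background set. The function names, the pseudocount and the toy sequences are my own assumptions for illustration.

```python
from collections import Counter

def kmer_counts(sequences, k):
    """Count how many sequences contain each k-mer at least once."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(seen)
    return counts

def overrepresented_kmers(foreground, background, k=6, pseudocount=1.0):
    """Rank k-mers by foreground/background occurrence frequency (highest first)."""
    fg = kmer_counts(foreground, k)
    bg = kmer_counts(background, k)
    scores = {}
    for kmer, n_fg in fg.items():
        # Pseudocounts avoid division by zero for k-mers absent from the background.
        f_fg = (n_fg + pseudocount) / (len(foreground) + pseudocount)
        f_bg = (bg.get(kmer, 0) + pseudocount) / (len(background) + pseudocount)
        scores[kmer] = f_fg / f_bg
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data (made up): three "co-regulated promoters" sharing TGACTCA.
co_regulated = ["AAATGACTCAGG", "CCTGACTCATTT", "GTGACTCAAC"]
unrelated = ["AAACCCGGGTTT", "ACGTACGTACGT", "TTTTGGGGCCCC"]
ranked = overrepresented_kmers(co_regulated, unrelated, k=7)
print(ranked[0][0])  # TGACTCA
```

Real tools use proper statistical models of the background and allow for degenerate positions, so this ratio should be read only as a teaching aid.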

Even though numerous tools are available for this task, they should be used with caution. According to a published assessment, how well each tool performs depends on the dataset.

Transcription factor databases

Transcription factors database

Eukaryotic transcription factor databases

TRANSFAC - contains data on transcription factors, their experimentally proven binding sites, and regulated genes. Its broad compilation of binding sites allows the derivation of positional weight matrices.
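For readers who have not met positional weight matrices before, the following is a minimal sketch of how one can be derived from a set of aligned binding sites and then used to scan a sequence. The sites, the pseudocount and the uniform background of 0.25 per base are assumptions chosen for illustration; this is not TRANSFAC's own procedure or data.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=0.5, background=0.25):
    """Build a log-odds weight matrix from aligned, equal-length binding sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(length):
        column = [s[pos].upper() for s in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def best_hit(pwm, sequence):
    """Return (0-based start, score) of the best-scoring window, or None."""
    sequence = sequence.upper()
    width = len(pwm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(b not in BASES for b in window):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(pwm[i][b] for i, b in enumerate(window))
        if best is None or score > best[1]:
            best = (start, score)
    return best

# Hypothetical aligned sites, loosely resembling an AP-1-like element.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
pwm = build_pwm(sites)
hit = best_hit(pwm, "CCCCTGACTCAGGGG")
print(hit[0])  # 4
```

In practice a score threshold is also needed to decide which windows count as predicted sites; choosing it is one reason such predictions should be treated with caution.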

Plant transcription factor database

Database of motifs found in plant cis-acting regulatory DNA elements, all from previously published reports. It covers vascular plants only.

PlantProm DB
Database with annotated, non-redundant collection of proximal promoter sequences for RNA polymerase II with experimentally determined transcription start site(s), TSS, from various plant species.

Database of plant cis-acting regulatory elements and a portal to tools for in silico analysis of promoter sequences.

DoOP: Databases of Orthologous Promoters
A database containing orthologous clusters of promoters from Homo sapiens, Arabidopsis thaliana and other organisms.

DATF: Database of Arabidopsis Transcription Factors

The Database of Arabidopsis Transcription Factors (DATF) contains known and predicted Arabidopsis transcription factors with sequences and many other features including 3D structure templates, EST expression information, transcription factor binding sites and Nuclear Location Signals.

The Arabidopsis thaliana promoter binding element database, an aid to find binding elements and check data against the primary literature.

A genome-wide map of putative transcription factor binding sites in Arabidopsis thaliana.

A resource containing two databases, AtcisDB (Arabidopsis thaliana cis-regulatory database) and AtTFDB (Arabidopsis thaliana transcription factor database).

Prediction tools

Weeder - for all eukaryotic datasets

oligo/dyad analysis & ANN-Spec - perform better on the human dataset

SesiMCMC - performs better on the fly dataset

MEME3 & YMF - perform better on the mouse dataset

MotifSampler - performs better on real experimental datasets

PhyME - Good for comparative sequence analysis (also known as phylogenetic footprinting)

It is advised to use a few complementary tools in combination rather than relying on a single one.
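One simple way to combine tools is to keep only the positions that several of them agree on. The sketch below assumes that each tool's output has already been converted into half-open (start, end) intervals on the same promoter; the tool names and coordinates are hypothetical, and real tools each have their own output format that must be parsed first.

```python
from collections import Counter

def consensus_positions(predictions, min_tools=2):
    """Return sorted positions covered by predictions from at least min_tools tools."""
    support = Counter()
    for tool, intervals in predictions.items():
        covered = set()
        for start, end in intervals:
            covered.update(range(start, end))  # half-open interval [start, end)
        # Each tool votes at most once per position.
        support.update(covered)
    return sorted(p for p, n in support.items() if n >= min_tools)

# Hypothetical predictions on one promoter (0-based coordinates).
predictions = {
    "tool_a": [(10, 16)],
    "tool_b": [(12, 18)],
    "tool_c": [(40, 46)],
}
agreed = consensus_positions(predictions)
print(agreed)  # [12, 13, 14, 15]
```

Requiring agreement reduces false positives at the cost of missing sites that only one tool detects, so the threshold is a judgment call.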

other tools:






Reference: Tompa M, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005 Jan;23(1):137-44.

Database for prediction of entire proteomes

Large-scale genome sequencing has provided us with the building blocks of living organisms. However, to obtain new insights into physiological and biochemical processes, it is essential to analyse and catalogue the structural and functional features of each individual protein in the genome. Such predictions for entire proteomes support conclusions in the context of comparative genomics and provide crucial information in the context of structural genomics.

PEP is a database of Predictions for Entire Proteomes. The database contains summaries of analyses of protein sequences from a range of organisms representing all three major kingdoms of life: eukaryotes, bacteria and archaea. It covers structural and functional features, including:

• coiled-coil regions predicted by COILS
• 3-state secondary structure predicted by PROFsec
• percentage relative solvent accessibility predicted by PROFacc
• transmembrane helices assigned by PHDhtm
• low sequence complexity regions according to SEG
• long stretches of non-regular secondary structure (NORS)
• presence and location of signal peptide cleavage sites identified by SignalP
• PROSITE motifs
• nuclear localization signals
• cellular functional classes assigned by EUCLID

The PEP database can be accessed through SRS, PSI-BLAST and BlastP interfaces. It can also be downloaded as flat files.

Computational protein kinase substrate identification

Post-translational modification by phosphorylation is the most abundant type of cellular regulation, affecting essentially every cellular process including metabolism, growth, differentiation, motility, membrane transport, learning and memory. Defects in protein kinase function result in a variety of diseases and kinases are major targets for drug design.

The identification of protein kinase substrates requires understanding the peptide specificity of protein kinases. Understanding phosphorylation specificity will therefore contribute to understanding the roles of protein kinases in health and disease, and help identify new therapeutic targets and strategies for protein kinase inhibition and anti-kinase drug development.

In eukaryotes, protein kinases phosphorylate mainly Ser or Thr residues (protein Ser/Thr kinases) or Tyr residues (protein Tyr kinases), although phosphorylation of His residues, as well as of other amino acids, also occurs.
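As a starting point, the residues that could in principle be phosphorylated can be listed directly from the sequence. The small sketch below does only that: it reports every Ser, Thr and Tyr with its 1-based position and makes no prediction about which of them are real sites. The function name and the example peptide are made up.

```python
def candidate_phosphosites(protein):
    """Return (1-based position, residue) for every Ser, Thr or Tyr in the sequence."""
    return [
        (position, residue)
        for position, residue in enumerate(protein.upper(), start=1)
        if residue in "STY"
    ]

candidates = candidate_phosphosites("MASTKY")
print(candidates)  # [(3, 'S'), (4, 'T'), (6, 'Y')]
```

Only a fraction of such candidates are phosphorylated in the cell, which is why kinase specificity has to be taken into account.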

The three-dimensional structures of a number of protein kinases are known, some with bound substrates and nucleotides. The characteristic fold consists of a smaller N-terminal “lobe”, comprising a five-stranded β-sheet and one or two α-helices, and a larger C-terminal lobe that usually contains six major α-helices and two small β-sheets (as shown in the figure below).

The peptide substrate is held in the groove between the two lobes. The phosphate group is transferred from an ATP molecule bound close to the substrate, towards the small lobe. A conserved Asp residue is essential for catalysis.


1. The Phospho.ELM database contains a collection of experimentally verified serine, threonine and tyrosine phosphorylation sites in eukaryotic proteins. The entries, manually annotated and based on the scientific literature, provide information about the phosphorylated proteins and the exact positions of known phosphorylation sites.

2. General databases on post-translational modifications

3. The RESID Database of Protein Modifications is a comprehensive collection of annotations and structures for protein modifications, including amino-terminal, carboxyl-terminal and peptide chain cross-link post-translational modifications.

Prediction tools:

1. ELM is a resource for predicting functional sites in eukaryotic proteins.

2. Identification of phosphorylation sites
The NetPhos 2.0 server produces neural network predictions for serine, threonine and tyrosine phosphorylation sites in eukaryotic proteins.

3. Prediction of kinase-specific phosphorylation sites
NetPhosK produces neural network predictions of kinase-specific eukaryotic protein phosphorylation sites. It covers the following kinases: PKA, PKC, PKG, CKII, Cdc2, CaM-II, ATM, DNA PK, Cdk5, p38 MAPK, GSK3, CKI, PKB, RSK, INSR, EGFR and Src.

4. Scansite searches for motifs within proteins that are likely to be phosphorylated by specific protein kinases or to bind to domains such as SH2 domains, 14-3-3 domains or PDZ domains.

5. PredPhospho predicts phosphorylation sites in protein sequences.

6. The AMS tool allows for the identification of PTM (post-translational modification) sites in proteins.

7. GPS - group-based phosphorylation predicting and scoring method
It is reported to cover a larger number of protein kinase families and to have greater sensitivity and specificity than Scansite and PredPhospho.

8. Predikin is a computer program that can be used to predict substrates for serine/threonine protein kinases. It can predict peptide specificities directly from the amino acid sequences and can therefore be used for most kinases, including hypothetical and uncharacterized ones.
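Several of the tools above, Scansite for example, look for short sequence motifs recognised by particular kinases. The sketch below shows the general idea with plain regular expressions. It is a strong simplification: the two patterns are textbook-style consensus motifs that I have assumed for illustration, whereas the real servers use scoring matrices or trained models and none of their code or parameters is reproduced here.

```python
import re

# Simplified consensus patterns (assumptions for illustration only).
MOTIFS = {
    "PKA-like (R-R-x-S/T)": r"RR.[ST]",
    "Proline-directed (S/T-P)": r"[ST]P",
}

def scan_motifs(protein, motifs=MOTIFS):
    """Return (motif name, 1-based start, matched text) for every motif occurrence."""
    protein = protein.upper()
    hits = []
    for name, pattern in motifs.items():
        # The lookahead lets overlapping occurrences all be reported.
        for match in re.finditer(f"(?=({pattern}))", protein):
            hits.append((name, match.start() + 1, match.group(1)))
    return hits

# Kemptide, a well-known synthetic PKA substrate peptide.
hits = scan_motifs("LRRASLG")
print(hits)  # [('PKA-like (R-R-x-S/T)', 2, 'RRAS')]
```

A regular expression gives a yes/no answer only, so it cannot rank sites by how well they fit a kinase; that ranking is what the dedicated prediction tools add.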
