Bimatics: Guide to the use of computational tools for finding transcription factor binding sites

Transcription factors are proteins that bind to DNA, typically upstream of and close to the transcription start site of a gene, and regulate the expression of that gene by activating or inhibiting the transcription machinery.

Transcription factors contain several functional regions:

* Activation domain: region that interacts with other parts of the transcription machinery (RNA polymerase or other transcription factors).

* DNA binding domain: amino acids in the protein that recognize specific bases near the start of transcription.

* Nuclear localization domain: region that serves as a signal for the protein to go to the nucleus after being synthesized in the cytoplasm.

* Dimerization domain: Many transcription factors work as dimers (two subunits). For these proteins, a region of the protein facilitates interaction with another subunit.

The figure shows several transcription factors (JUN, FOS, Sp1, and basal factors) that are necessary for the transcription of some genes.

Computational approaches to finding transcription factor binding sites come in two flavors. One class of methods looks for overrepresented motifs in sequences that are believed to contain several binding sites for the same factor (such as promoters of co-regulated genes). The second class of methods identifies motifs that are significantly conserved in orthologous sequences, e.g., promoters of the same gene in different species. Yet the computational prediction of such regulatory elements remains a challenging task.
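To make the first class of methods concrete, here is a minimal sketch of overrepresentation scoring: it counts every k-mer in a set of promoters and compares the observed count with the count expected from the base composition alone. The function names, the toy sequences, and the simple independent-base background are assumptions made for illustration; real tools use richer background models and statistical tests.

```python
from collections import Counter
from math import prod

def count_kmers(seqs, k):
    """Count every k-mer made only of A/C/G/T across all sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts

def overrepresented_kmers(seqs, k=7, top=5):
    """Rank k-mers by observed/expected count under an independent-base background."""
    base_counts = Counter(b for s in seqs for b in s.upper() if b in "ACGT")
    total = sum(base_counts.values())
    freq = {b: base_counts[b] / total for b in "ACGT"}
    n_windows = sum(max(len(s) - k + 1, 0) for s in seqs)
    scored = []
    for kmer, obs in count_kmers(seqs, k).items():
        expected = n_windows * prod(freq[b] for b in kmer)
        scored.append((obs / expected, obs, kmer))
    scored.sort(reverse=True)
    return [(kmer, obs, ratio) for ratio, obs, kmer in scored[:top]]

# Hypothetical toy "promoters", each carrying one planted copy of TGACTCA.
toy_promoters = [
    "ACGTTGACTCAGGCTA",
    "TTGACTCACCGATGCA",
    "GGCATGACTCATTACG",
    "CATGCGTTGACTCAAG",
]

for kmer, obs, ratio in overrepresented_kmers(toy_promoters, k=7, top=3):
    print(kmer, obs, round(ratio, 1))
```

On real promoters the background should be estimated from a larger set of unrelated sequences, since the input set itself is enriched for the motif.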

Even though numerous tools are available for this task, they should be used with caution. According to the assessment cited in the reference at the end of this guide, how well each tool performs depends on the dataset.

Transcription factor databases

Transcription factors database

ftp://ftp.ncbi.nih.gov/repository/TFD/

Eukaryotic transcription factor database

TRANSFAC - contains data on transcription factors, their experimentally proven binding sites, and regulated genes. Its broad compilation of binding sites allows the derivation of positional weight matrices.

http://www.gene-regulation.com/pub/databases.html#transfac
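As an illustration of how a positional weight matrix can be derived from a compilation of binding sites, the following sketch builds a log-odds matrix from a handful of aligned sites and uses it to scan a sequence. The sites and the sequence are made-up toy data, not TRANSFAC entries, and the pseudocount and uniform background of 0.25 are assumptions chosen for simplicity.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(length):
        column = [s[pos].upper() for s in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return pwm

def score_window(pwm, window):
    """Sum the per-position log-odds scores for one window."""
    return sum(col[b] for col, b in zip(pwm, window.upper()))

def best_hit(pwm, seq):
    """Return (score, start) of the highest-scoring window in seq."""
    w = len(pwm)
    return max((score_window(pwm, seq[i:i + w]), i)
               for i in range(len(seq) - w + 1))

# Hypothetical aligned binding sites (toy data).
toy_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
pwm = build_pwm(toy_sites)
score, start = best_hit(pwm, "CCGGTGACTCATTGG")
print(start, round(score, 2))
```

In practice a score threshold is needed to decide which windows count as putative sites; choosing it is dataset dependent.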

Plant transcription factor database

http://plntfdb.bio.uni-potsdam.de/v1.0/

PLACE
Database of motifs found in plant cis-acting regulatory DNA elements, all from previously published reports. It covers vascular plants only.

http://www.dna.affrc.go.jp/PLACE/

PlantProm DB
Database with an annotated, non-redundant collection of proximal promoter sequences for RNA polymerase II with experimentally determined transcription start sites (TSS) from various plant species.

http://www.softberry.com

PlantCare
Database of plant cis-acting regulatory elements and a portal to tools for in silico analysis of promoter sequences.

http://bioinformatics.psb.ugent.be/webtools/plantcare/html/

DoOP: Databases of Orthologous Promoters
A database containing orthologous clusters of promoters from Homo sapiens, Arabidopsis thaliana and other organisms.

http://doop.abc.hu/

DATF: Database of Arabidopsis Transcription Factors

The Database of Arabidopsis Transcription Factors (DATF) contains known and predicted Arabidopsis transcription factors with sequences and many other features including 3D structure templates, EST expression information, transcription factor binding sites and Nuclear Location Signals.

http://datf.cbi.pku.edu.cn/

AtProbe
The Arabidopsis thaliana promoter binding element database, an aid to find binding elements and check data against the primary literature.

http://exon.cshl.org/cgi-bin/atprobe/atprobe.pl

AthaMap
A genome-wide map of putative transcription factor binding sites in Arabidopsis thaliana.

http://www.athamap.de/

AGRIS
Contains two databases, AtcisDB (Arabidopsis thaliana cis-regulatory database) and AtTFDB (Arabidopsis thaliana transcription factor database).

http://arabidopsis.med.ohio-state.edu/

Prediction tools

Weeder - For all eukaryotic datasets

http://159.149.109.16:8080/weederWeb/index2.html

oligo/dyad analysis & ANN-Spec - for human dataset

http://rsat.scmbb.ulb.ac.be/rsat/

SeSiMCMC - performs better for the fly dataset

http://favorov.imb.ac.ru/cgi-bin/gibbslfm/gibbslfm.pl?action=form

MEME3 & YMF - perform better for the mouse dataset

http://meme.sdsc.edu/meme/intro.html

http://wingless.cs.washington.edu/YMF/YMFWeb/YMFInput.pl

MotifSampler - performs better for real experimental datasets

http://homes.esat.kuleuven.be/~thijs/Work/MotifSampler.html

PhyME - Good for comparative sequence analysis (also known as phylogenetic footprinting)

http://edsc.rockefeller.edu/cgi-bin/phyme/download.pl
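The idea behind phylogenetic footprinting can be sketched in a few lines: look for short words that are preserved in the promoters of the same gene across species. This is a deliberately naive exact-match version and is not the algorithm used by PhyME, which relies on a probabilistic model; the species labels and sequences below are hypothetical toy data.

```python
def conserved_kmers(orthologs, k=7):
    """Return the k-mers present in every orthologous promoter sequence."""
    kmer_sets = []
    for seq in orthologs.values():
        s = seq.upper()
        kmer_sets.append({s[i:i + k] for i in range(len(s) - k + 1)})
    shared = set.intersection(*kmer_sets) if kmer_sets else set()
    return sorted(shared)

# Hypothetical promoters of the same gene in three species (toy data).
toy_orthologs = {
    "human": "AACCTGACTCAGGTT",
    "mouse": "GGTTGACTCACCAAT",
    "fly": "CTATGACTCATGCGA",
}

print(conserved_kmers(toy_orthologs, k=7))
```

Exact matching misses sites that have diverged by even one base, which is one reason dedicated tools weigh conservation against the evolutionary distance between species.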

It is advised to use a few complementary tools in combination rather than relying on a single one.
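One simple way to combine complementary tools is to keep only the positions that at least two of them agree on. The sketch below assumes each tool's output has already been converted to half-open (start, end) intervals on the same sequence; the tool labels and coordinates are hypothetical, and the voting rule is an illustrative choice rather than a recommendation from the cited assessment.

```python
from collections import Counter

def consensus_sites(predictions, min_tools=2):
    """Return merged intervals covered by at least min_tools different tools.

    predictions maps a tool name to a list of half-open (start, end) intervals.
    """
    votes = Counter()
    for intervals in predictions.values():
        covered = set()
        for start, end in intervals:
            covered.update(range(start, end))
        votes.update(covered)  # each tool votes at most once per position
    supported = sorted(p for p, v in votes.items() if v >= min_tools)
    merged = []
    for p in supported:
        if merged and p == merged[-1][1]:
            merged[-1][1] = p + 1  # extend the current interval
        else:
            merged.append([p, p + 1])  # start a new interval
    return [tuple(m) for m in merged]

# Hypothetical predictions from three tools on one promoter.
toy_predictions = {
    "tool_a": [(10, 17), (40, 47)],
    "tool_b": [(12, 19)],
    "tool_c": [(41, 46), (80, 87)],
}

print(consensus_sites(toy_predictions, min_tools=2))
```

Raising min_tools trades sensitivity for specificity: fewer sites survive, but each has broader support.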

Other tools:

AlignACE: http://atlas.med.harvard.edu/

Consensus: http://bifrost.wustl.edu/consensus

GLAM: http://zlab.bu.edu/glam

MITRA: http://www.calit2.net.combio/mitral

quickscore: http://aglo.inria.fr/dolley/quickscore

Reference:
Tompa M, et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005 Jan;23(1):137-44.