Bimatics: Genome projects and bioinformatics

Genome

A genome is all of the DNA in an organism, including its genes and a large amount of DNA that does not contribute to genes. Each animal or plant has its own unique genome. DNA is the molecular code that carries the information for making all the proteins required by a living organism. These proteins determine, among other things, how the organism looks, how well it adapts to its environment, and sometimes even how it behaves.
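
As a small illustration of DNA acting as a code for proteins, the sketch below translates a short coding sequence into amino acids using the standard genetic code. The function name `translate` and the example sequence are illustrative assumptions, not part of the original text, and real genes involve further steps (transcription, splicing, reading-frame selection) that are ignored here.

```python
# Standard genetic code, written in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every three-base codon to its one-letter amino acid ("*" marks a stop).
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical example: the codons ATG GCC AAG TAA.
print(translate("ATGGCCAAGTAA"))  # prints MAK
```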

Genome sequencing

There are essentially two ways to sequence a genome. The BAC-to-BAC method, the first to be employed in human genome studies, is slow but sure. Also referred to as the map-based method, it evolved from procedures developed by a number of researchers during the late 1980s and 1990s, and it continues to develop and change.

The other technique, known as whole genome shotgun sequencing, brings speed into the picture, enabling researchers to do the job in months to a year. The whole genome shotgun approach was championed by J. Craig Venter and colleagues in the mid-1990s.

BAC to BAC Sequencing

The BAC to BAC approach first creates a crude physical map of the whole genome before sequencing the DNA. Constructing a map requires cutting the chromosomes into large pieces and figuring out the order of these big chunks of DNA before taking a closer look and sequencing all the fragments.

1.Several copies of the genome are randomly cut into large pieces, typically on the order of 150,000 base pairs (bp) long.

2.Each of these fragments is inserted into a BAC, a bacterial artificial chromosome. A BAC is a man-made piece of DNA that can replicate inside a bacterial cell. The whole collection of BACs containing the entire human genome is called a BAC library, because each BAC is like a book in a library that can be accessed and copied.

3.These pieces are fingerprinted to give each piece a unique identification tag that determines the order of the fragments. Fingerprinting involves cutting each BAC fragment with a single enzyme and finding common sequence landmarks in overlapping fragments that determine the location of each BAC along the chromosome. Then overlapping BACs with markers every 100,000 bp form a map of each chromosome.

4.Each BAC is then broken randomly into 1,500 bp pieces, and each piece is placed in another artificial piece of DNA called M13. This collection is known as an M13 library.

5.All the M13 libraries are sequenced. 500 bp from one end of each fragment are sequenced, generating millions of sequences. These sequences are fed into a computer program called PHRAP that looks for common sequences that join two fragments together.
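
The idea of joining fragments through shared sequence can be sketched with a simple greedy overlap merge. This is only a toy illustration under stated assumptions: it is not PHRAP's actual algorithm, it assumes error-free reads from a single strand, and the names `overlap` and `greedy_assemble` and the example reads are hypothetical.

```python
def overlap(left: str, right: str, min_len: int) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            # No remaining pair overlaps by at least min_len bases.
            return contigs
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


# Hypothetical reads taken from the sequence ATGCGTACGTTAGC.
reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
print(greedy_assemble(reads))  # prints ['ATGCGTACGTTAGC']
```

Real assemblers also have to cope with sequencing errors, reads from both strands, and repeated sequences, which is one reason the physical map of ordered BACs is valuable: each BAC can be assembled separately.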

Whole Genome Shotgun Sequencing

The shotgun sequencing method goes straight to the job of decoding, bypassing the need for a physical map. Therefore, it is much faster.

1.Multiple copies of the genome are randomly shredded into pieces that are 2,000 base pairs (bp) long by squeezing the DNA through a pressurized syringe. This is done a second time to generate pieces that are 10,000 bp long.

2.Each 2,000 and 10,000 bp fragment is inserted into a plasmid, which is a piece of DNA that can replicate in bacteria. The two collections of plasmids containing 2,000 and 10,000 bp chunks of human DNA are known as plasmid libraries.

3.Both the 2,000 and the 10,000 bp plasmid libraries are sequenced. 500 bp from each end of each fragment are decoded, generating millions of sequences. Sequencing both ends of each insert is critical for assembling the entire chromosome.

4.Computer algorithms assemble the millions of sequenced fragments into a continuous stretch resembling each chromosome.
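
The shredding and end-sequencing steps above can be simulated in a few lines. The sketch below is a simplified model, not the procedure used in any real project: the genome is a random string, the function names are hypothetical, and both end reads are taken from the same strand (in practice the second read comes from the opposite strand). It also computes average coverage, the number of times each base is expected to be read, as number of reads times read length divided by genome length.

```python
import random


def shred(genome: str, insert_len: int, copies: int, rng: random.Random) -> list[str]:
    """Cut `copies` random inserts of length `insert_len` out of the genome."""
    last_start = len(genome) - insert_len
    starts = (rng.randint(0, last_start) for _ in range(copies))
    return [genome[start:start + insert_len] for start in starts]


def end_reads(insert: str, read_len: int) -> tuple[str, str]:
    """Return the first and last `read_len` bases of an insert."""
    return insert[:read_len], insert[-read_len:]


def coverage(num_reads: int, read_len: int, genome_len: int) -> float:
    """Average number of times each base is sequenced: N * L / G."""
    return num_reads * read_len / genome_len


rng = random.Random(0)  # fixed seed so the simulation is repeatable
genome = "".join(rng.choice("ACGT") for _ in range(100_000))  # toy genome

# Two libraries, as in the text: 2,000 bp and 10,000 bp inserts.
inserts = shred(genome, 2_000, 500, rng) + shred(genome, 10_000, 500, rng)

# Read 500 bp from each end of every insert.
pairs = [end_reads(insert, 500) for insert in inserts]

print(coverage(2 * len(pairs), 500, len(genome)))  # prints 10.0
```

Because the two reads of a pair are known to lie a fixed distance apart (about 2,000 or 10,000 bp), an assembler can use them to order and orient assembled stretches, which is why sequencing both ends matters.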

Genomic Projects and their importance

Genome projects are scientific endeavours that aim to map the genome of a living being or of a species (be it an animal, a plant, a fungus, a bacterium, an archaeon, a protist or a virus), that is, the complete set of genes carried by this living being or virus. The Human Genome Project was such a project.

In the mid-1980s, the United States Department of Energy (DoE) initiated a number of projects to construct detailed genetic and physical maps of the human genome, to determine its complete nucleotide sequence, and to localise its estimated 100,000 genes. Work on this scale required the development of new computational methods for analysing genetic map and DNA sequence data, and demanded the design of new techniques and instrumentation for detecting and analysing DNA. To benefit the public most effectively, the projects also necessitated the use of advanced means of information dissemination in order to make the results available as rapidly as possible to scientists and physicians. The international effort arising from this vast initiative became known as the Human Genome Project.

Similar research efforts were also launched to map and sequence the genomes of a variety of organisms used extensively in research laboratories as model systems: these included the bacterium Escherichia coli, the yeast Saccharomyces cerevisiae, the nematode worm Caenorhabditis elegans, the fruit fly Drosophila melanogaster, the common weed Arabidopsis thaliana, the domestic dog Canis familiaris and the mouse Mus musculus.

In April 1998, although the sequencing projects of only a small number of relatively small genomes had been completed, and the human genome was not expected to be complete until after the year 2000, the results of such projects were already beginning to pour into the public sequence databases in overwhelming numbers. We are now witnessing a dramatic change of focus towards sequence analysis, spurred on by the advent of the genome projects and the resultant sequence/structure deficit.

GOLD, the Genomes OnLine Database, is a World Wide Web resource for comprehensive access to information regarding complete and ongoing genome projects around the world.

Published complete genomes: 431
Metagenomes: 63
Ongoing genome projects:
1.Archaeal genomes: 57
2.Bacterial genomes: 994
3.Eukaryotic genomes: 634
------------------------------------
Total genome projects: 2179

Bioinformatics challenge:

The central challenge of bioinformatics is the rationalisation of the mass of sequence information, with a view not only to deriving more efficient means of data storage, but also to designing more incisive analysis tools. The imperative that drives this analytical process is the need to convert sequence information into biochemical and biophysical knowledge; to decipher the structural, functional and evolutionary clues encoded in the language of biological sequences.

References:

Attwood, T.K., and Parry-Smith, D.J. 1999. Introduction to Bioinformatics. Prentice Hall, London.
Burke, D.T. et al. Cloning of large segments of exogenous DNA into yeast by means of artificial chromosomal vectors. Science 236, 806-812 (1987).
Bernal, A., Ear, U., Kyrpides, N. (2001) Genomes OnLine Database (GOLD): a monitor of genome projects world-wide. NAR 29, 126-127.
Kyrpides, N. (1999) Genomes OnLine Database (GOLD): a monitor of complete and ongoing genome projects world wide. Bioinformatics 15, 773-774.
Liolios, K., Tavernarakis, N., Hugenholtz, P., Kyrpides, N.C. The Genomes On Line Database (GOLD) v.2: a monitor of genome projects worldwide. NAR 34, D332-334.
Smith, L.M. et al. Fluorescence detection in automated DNA sequencing analysis. Nature 321, 674-679 (1986).
Shizuya, H. et al. Cloning and stable integration of 300-kilobase-pair fragments of human DNA in Escherichia coli using an F-factor-based vector. Proc Natl Acad Sci USA 89, 8794-8797 (September 1992).
Venter, J.C. et al. A new strategy for genome sequencing. Nature 381, 364-366 (May 30, 1996).
Venter, J.C. et al. Shotgun sequencing of the human genome. Science 280, 1540-1542 (June 5, 1998).