Bioinformatics for analysing metagenomes


Metagenomics is a new field of research in which scientists analyze the genomes of organisms recovered directly from the environment. Most naturally occurring bacteria cannot be cultured and therefore cannot be analyzed by traditional means. Metagenomic studies provide a mechanism for analyzing these previously unknown organisms. At the same time, we can examine the diversity of organisms present in a specific environment and analyze the complex interactions between its members. Scientists can study the smallest component of an environmental system by extracting DNA from organisms in the system and inserting it into a model organism.

The isolation, archiving and analysis of environmental DNA (the so-called 'metagenomes') has enabled us to mine microbial diversity: we can access microbial genomes, identify protein-coding sequences and even reconstruct biochemical pathways, gaining insights into the properties and functions of these organisms. The generation and analysis of (meta)genomic libraries is thus a powerful approach to harvesting and archiving environmental genetic resources. It will enable us to identify which organisms are present, what they do, and how their genetic information can be beneficial to mankind.

The mining of genomes and metagenomic libraries will not only provide new enzymes for biotechnological processes and a basis for studying new protein structures and catalytic mechanisms, but will also enable the functional assignment of many proteins that are abundant in databases and currently designated as ‘hypothetical’ or ‘conserved hypothetical’ proteins. The identification of novel catalysts will both improve existing processes and lead to the design of novel processes for making innovative products or high-value intermediates.

One of the main aims in analysing metagenomes is to use genomic analysis tools to find novel genes, discover novel pathways and functional groups, and carry out evolutionary studies.

Gene finding
Gene finding is a fundamental goal in virtually all metagenomics projects, regardless of whether complete genome sequences can be assembled or not.
>>Gene prediction can be done with GLIMMER, which trains its gene model on long open reading frames.
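To illustrate what "long open reading frames" means here, the following is a minimal sketch that scans both strands of a nucleotide sequence in all six reading frames and reports ORFs above a length cutoff. It is not GLIMMER's implementation: the function names, the `min_len` default and the use of ATG as the only start codon are assumptions made for illustration.

```python
# Illustrative sketch only; names and thresholds are hypothetical,
# and this is not how GLIMMER itself is implemented.
START = "ATG"                      # assumed single start codon
STOPS = {"TAA", "TAG", "TGA"}      # standard stop codons
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def long_orfs(seq, min_len=300):
    """Find ORFs (start codon through stop codon) of at least min_len bases.

    Returns a list of (strand, start, end, orf_sequence) tuples.
    Coordinates are 0-based, end-exclusive, and relative to the
    strand that was scanned.
    """
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START:
                    start = i              # open a candidate ORF
                elif start is not None and codon in STOPS:
                    if i + 3 - start >= min_len:
                        found.append((strand, start, i + 3, s[start:i + 3]))
                    start = None           # close it and keep scanning
    return found


if __name__ == "__main__":
    for orf in long_orfs("CCATGAAATTTGGGTAACC", min_len=15):
        print(orf)
```

ORFs that run off the end of a fragment without reaching a stop codon are ignored in this sketch; with short metagenomic reads that case is common and would need separate handling.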

GLIMMER

Discovering novel pathways & functional groups
>>Predicted genes are searched with BLAST against the COGs or KEGG databases.
>>Single-linkage hierarchical clustering is then performed (e.g. with Cluster and Treeview).
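The clustering step can be sketched in a few lines. The example below is a naive single-linkage agglomerative clustering, not the Cluster program itself. The input format (one numeric profile per item, for example counts of hits per functional category), the Euclidean distance and all names are assumptions for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(profiles, distance=euclidean):
    """Naive single-linkage agglomerative clustering.

    profiles: dict mapping an item name to its numeric profile
              (hypothetical input format).
    Returns the merges in the order they happen, each as
    (members_of_cluster_a, members_of_cluster_b, linkage_distance).
    """
    clusters = [(name,) for name in sorted(profiles)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: the distance between two clusters is the
                # smallest distance between any pair of their members.
                d = min(distance(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


if __name__ == "__main__":
    example = {"A": [0, 0], "B": [0, 1], "C": [5, 5]}
    for left, right, dist in single_linkage(example):
        print(left, right, round(dist, 3))
```

The list of merges is the information a dendrogram viewer draws as a tree. This cubic-time version is meant only to make the linkage rule explicit; for real data sets a dedicated tool is the better choice.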

Cluster
Treeview

Dealing with partial sequences
Many metagenomes contain partial sequences, which are an obstacle in phylogenetic studies. However, the problem can be addressed by aligning the partial sequences against complete ones and performing the phylogenetic assignment by finding the closest sequences in the database.
>>Perform a semi-global multiple alignment (i.e., terminal gaps are not penalized). The most widely used alignment tools are based on global or local alignment and do not correctly handle partial sequences.
>>Multiple alignment can be done with MUSCLE. Although not optimized for partial sequences, MUSCLE does a reasonable job, as ascertained by several criteria: the number of internal gaps was small; sequences shorter than the read length had either no beginning gaps or no ending gaps (since the gene length is greater than the read length); and the total length was comparable to that of related proteins.
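The idea of not penalizing terminal gaps can be shown with a pairwise example. The sketch below is a minimal semi-global variant of the standard dynamic-programming alignment, not MUSCLE's algorithm and not a multiple aligner. The scoring values and the function name are assumptions chosen for illustration.

```python
def semiglobal_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best semi-global alignment of sequences a and b.

    Gaps at the start or end of either sequence cost nothing, so a
    partial sequence can sit anywhere inside a complete one without
    penalty. Internal gaps are charged `gap` per position.
    Scoring parameters are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # Row 0 and column 0 stay at zero: leading gaps are free.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)
    # Trailing gaps are free: the alignment may end anywhere in the
    # last row or the last column.
    return max(max(score[-1]), max(row[-1] for row in score))


if __name__ == "__main__":
    # A fragment scored against a longer sequence that contains it.
    print(semiglobal_score("GATTACA", "TTAC"))
```

With these example parameters, a purely global alignment would charge the fragment `TTAC` for the three unmatched flanking positions of `GATTACA`, while the semi-global score counts only the four matching positions. This is why global tools tend to mis-handle partial sequences.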

Muscle