Genomics and Bioinformatics-

Since 1910, geneticists have worked to identify and map genes in organisms of interest. The efforts were gradually centered on a few organisms, such as Drosophila, maize, mice, bacteria, and yeast. Mapping of genes involved two steps. The first step consisted of identifying spontaneous mutations or mutations induced by chemical or physical agents.

In the second step, once a set of mutations was available, they were used for linkage analysis and preparation of linkage maps. In some organisms, such as Drosophila, physical maps of gene locations on chromosomes were also created. This approach was efficient and widely used in genetics. Its drawback is that at least one mutation is required for each gene in the genome, but obtaining mutations is a difficult, labour-intensive endeavour. In addition, the mutation must produce a phenotypic effect, yet mutations often have a lethal effect, making it difficult or impossible to map the mutated gene.

From the mid-1980s, geneticists began to use recombinant DNA technology for genetic analysis. In this approach, a collection of clones, called a genomic library, is established. These clones are pieced together into overlapping sets and assembled into genetic and physical maps for the entire genome. In the final step, the clones are sequenced, and all the genes in the genome are identified from this sequence. Collectively, these methods are called genomics.
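The step of piecing clones together into overlapping sets can be illustrated with a toy example. The following is only a minimal sketch of a greedy overlap merge, assuming short, error-free fragments written as strings. The function names (overlap, greedy_assemble) and the fragments are hypothetical; real assembly software must also cope with sequencing errors, repeats, and both DNA strands.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more: leave separate contigs
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Hypothetical clone sequences that overlap end to end.
clones = ["ATGGCGT", "GCGTACC", "TACCTGA"]
contigs = greedy_assemble(clones)
print(contigs)
```

With these three made-up fragments, the merge yields a single contiguous sequence. If no overlap of at least min_len bases remains, the function returns several separate contigs instead.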

Definition of Genomics and Bioinformatics-

The term genome was introduced by H. Winkler in 1920 to denote the complete set of chromosomal and extrachromosomal genes present in an organism, including a virus. This term is used in the same sense even today. The term genomics was coined by T.H. Roderick sometime in 1987 to mean mapping and sequencing to analyze the structure and organization of genomes.

But today genomics includes sequencing of genomes, determination of the complete set of proteins encoded by an organism, and the functioning of genes and metabolic pathways in an organism. Thus genomics deals not only with the determination of the genetic information present in an organism, but also with understanding the mechanisms by which this information is used by the organism.

The information generated in genomics is enormous. Interpretation and management of this information requires the use of powerful computers and specific software. Bioinformatics is an emerging field concerned with the development and application of computer hardware and software to the acquisition, storage, analysis, and visualization of biological information.

Databases for the storage and analysis of genomic information are now essential tools for geneticists. Proteomics is the study of gene products encoded by a genome, including a list of which genes are expressed, the time of their expression, the type and extent of any post-translational modification of the gene product, the function of the encoded protein, and its location in various cellular compartments.

The discipline of genomics is often divided into the following two domains:
(1) structural genomics and
(2) functional genomics.

Structural genomics deals with the determination of the complete sequence of genomes or the complete set of proteins produced by an organism.

This has progressed in steps as follows:

(1) construction of high resolution genetic and physical maps,

(2) sequencing of the genome, and

(3) determination of the complete set of proteins in an organism.

Often it also includes determination of the three-dimensional structures of the proteins concerned. Functional genomics, on the other hand, studies the functioning of genes and metabolic pathways, i.e., the gene expression patterns in organisms.

Historical Genome Sequencing-

The idea of genome sequencing was discussed in the scientific community from 1984 onwards. In 1986, a proposal was prepared for sequencing of the human genome. The Human Genome Project officially began on Oct. 1, 1990. The first genome of a free-living organism to be sequenced was that of Haemophilus influenzae in 1995. The E. coli genome was completely sequenced soon after, in 1997.

Yeast (Saccharomyces cerevisiae) and worm (Caenorhabditis elegans) genomes were the first eukaryotic genomes to be sequenced, in 1996 and 1998, respectively. In 2000, the genomes of Drosophila melanogaster and Arabidopsis thaliana were sequenced. On June 26, 2000, the rough draft of the human genome was announced. This draft was prepared separately by the publicly funded Human Genome Sequencing Consortium and the private company Celera Genomics, established by Craig Venter. However, the draft sequence was announced jointly after intervention by the U.S. President.

Important Dates in Genome Sequencing-

Genome Sequence Compilation-

Genome sequencing projects necessitated the development of high-throughput technologies that generate data at a very fast pace. Computers were recruited to manage this flood of information, which gave birth to a new discipline called bioinformatics. Bioinformatics deals with storage, analysis, interpretation and utilization of the information about biological systems. For example, it includes activities like compiling genome sequences, identification of genes, assigning functions to the identified genes, preparation of databases, etc.

In order to ensure that the nucleotide sequence of a genome is complete and error free, the genome is sequenced more than once. For example, the genome of the bacterium Pseudomonas aeruginosa was sequenced seven times using the shotgun method to make sure that the sequence was accurate and free from errors. Even so, the assembler software recognized 1,604 regions that required further clarification. These regions were reanalyzed and resequenced to complete the genome sequence. The accuracy of the shotgun method was checked by comparing its result with the sequence derived by the clone-by-clone method for two widely separated genomic regions of P. aeruginosa. These two regions together were 81,843 nucleotides long.

The sequences obtained by the two methods were in perfect agreement. This test demonstrated the accuracy of the shotgun method of genome sequencing. It also exemplifies the precautions taken in genome sequencing projects. This level of care is not unusual, and similar precautions are used in all genome projects. The Human Genome Project sequenced the 3.2 billion base pairs of the human genome a total of 12 times. Celera Genomics (U.S.A.) used a strategy of sequencing from both ends of human DNA fragments; it sequenced the human genome 35.6 times. Although a draft of the human genome sequence is finished, several other tasks are yet to be completed.

These include obtaining the remaining sequence and correcting errors (proofreading the genome), filling sequence gaps and then sequencing the 7-15 per cent of the genome that contains heterochromatin. Heterochromatic regions of the genome were not sequenced initially because they contain long stretches of repetitive DNA sequences.

Further, it was initially considered that heterochromatin does not contain genes. But the genome sequence of Drosophila revealed that heterochromatic regions do contain a small number of genes (about 50 in Drosophila). As a result of this discovery, heterochromatic regions of the human genome have to be sequenced to ensure that all the genes in the human genome are identified. Once the genome of an organism is sequenced, compiled, and proofread, the next stage of genomics, viz., annotation, begins.
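Sequencing a genome "a total of 12 times" is usually expressed as coverage, the average number of times each base is read. The sketch below shows the arithmetic. The read length of 500 bases is a hypothetical figure chosen only for illustration, and the Poisson estimate of unsequenced bases is a standard idealization (random, uniform sampling) that is not taken from the text above; real genomes, with their repeats and cloning biases, deviate from it.

```python
import math


def coverage(num_reads, read_length, genome_size):
    """Average number of times each base is sequenced: c = N * L / G."""
    return num_reads * read_length / genome_size


def expected_uncovered_fraction(c):
    """Idealized Poisson model: probability that a given base is never read."""
    return math.exp(-c)


genome_size = 3_200_000_000  # human genome size quoted in the text (bp)
read_length = 500            # hypothetical read length (bp)

# Number of reads needed to reach 12-fold coverage at this read length.
reads_for_12x = 12 * genome_size // read_length
c = coverage(reads_for_12x, read_length, genome_size)

print(reads_for_12x, c)
for depth in (7, 12, 35.6):  # depths mentioned in the text
    print(depth, expected_uncovered_fraction(depth))
```

Under this idealized model, the fraction of bases missed falls off exponentially with depth, which is one reason why repeated sequencing is an effective safeguard against gaps.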

Genome Sequencing Projects-

The organisms selected for genome projects were mostly those already used in genetic and other scientific investigations. Thus these organisms may be regarded as model organisms. A model organism is an organism about which a large amount of scientific knowledge is already available. These organisms include both prokaryotic and eukaryotic microorganisms as well as animals and plants, e.g., E. coli, Bacillus subtilis, Archaeoglobus fulgidus, yeast (S. cerevisiae), A. thaliana, roundworm (C. elegans) and the fruit fly (D. melanogaster).

In addition, the Human Genome Project focussed on sequencing of the whole human genome. E. coli is, without any doubt, the best studied microorganism. Many of the technological tools available today were developed for the E. coli genome project. The E. coli genome sequencing was completed in 1997.

The genome is just over 4.64 x 10^6 bp in size and contains 4,408 genes. B. subtilis is a Gram-positive bacterium that colonizes leaf surfaces. It is much used in industrial processes for both enzyme production and food fermentation. It is Generally Regarded As Safe (GRAS). It has a genome of 4.21 x 10^6 bp that contains some 4,212 genes. A. fulgidus is a strictly anaerobic archaebacterium. Its genome sequence was published in 1997.

The genome size is 2.2 x 10^6 bp; it contains 2,493 genes. S. cerevisiae is perhaps the most important fungal species used in biotechnological processes. Its name derives from the fact that it can ferment saccharose (sugar). The S. cerevisiae genome sequencing project began in 1989, and the sequencing was completed in 1996. The yeast genome is 12.8 x 10^6 bp and is estimated to contain 6,548 genes.
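The genome sizes and gene counts quoted above can be compared more easily as gene densities. The short sketch below uses only the figures given in the text; the dictionary layout and the function name genes_per_mb are illustrative choices, not part of any standard tool.

```python
# Genome size (bp) and gene count, exactly as quoted in the text.
genomes = {
    "E. coli": (4.64e6, 4408),
    "B. subtilis": (4.21e6, 4212),
    "A. fulgidus": (2.2e6, 2493),
    "S. cerevisiae": (12.8e6, 6548),
}


def genes_per_mb(size_bp, n_genes):
    """Gene density in genes per million base pairs."""
    return n_genes / (size_bp / 1e6)


density = {
    name: round(genes_per_mb(size, genes))
    for name, (size, genes) in genomes.items()
}

for name, value in sorted(density.items(), key=lambda item: item[1]):
    print(f"{name}: about {value} genes per Mb")
```

On these figures, the three prokaryotes carry roughly one gene per thousand base pairs, while yeast, the only eukaryote in the list, carries about half as many genes per unit length.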

Genome Size and Number of Genes of Some Selected Organisms-

Human Genome Project-

This is an undertaking by many countries (currently administered jointly by the National Institutes of Health and the Department of Energy, USA) to acquire "complete knowledge of the organization, structure, and function of the human genome". The group carrying it out is called the International Human Genome Sequencing Consortium, and the project is regarded as the most ambitious ever undertaken by humans. The project officially began on October 1, 1990. A great benefit expected from the project is the ability to identify human genes.

The potential identification of genes that are mutated in the development of disease is a strong motivation for this project. Further, the complete genome sequence will enable researchers to gain an insight into the types of proteins encoded by these genes. The cloning and sequencing of disease-causing alleles is expected to greatly facilitate diagnosis and treatment of diseases. Several conclusions have been drawn from the human genome.

Advantages of Genome Sequencing Projects -

1. They enable the determination of the complete genetic information present in the genomes of various organisms.
2. The relationships between genes can be deduced with confidence.
3. They provide insights into genome organization and evolution, and the mechanisms involved therein.
4. They have opened up exciting areas for future research, e.g., functional genomics.
5. One of the most ambitious expectations is that genome sequences will allow biologists to work out the various molecular interactions that lead to the normal development of the organisms.
6. Information such as SNPs (single nucleotide polymorphisms) has become available.
7. A variety of tools and techniques were developed for the genome sequencing projects.
8. A better understanding of human genetic diseases should facilitate their management and cure.
9. They may provide an understanding of why different individuals respond differently to the same drugs (pharmacogenomics).
10. The pathogenicity of microorganisms would be better understood. This should facilitate protection from the diseases they cause.

Functional Genomics-

Functional genomics may be defined as determination of the function of, ultimately, all the gene products encoded by the genome of an organism. This includes answers to the questions, how is a gene expressed, how is its product related in sequence and structure to products of other genes of the same organism, and how does it interact with them? These questions can be answered by studying the following:

(i) when and where particular genes are expressed (expression profiling),
(ii) the functions of specific genes, by selectively mutating the desired genes, and
(iii) the interactions that take place among proteins and between proteins and other molecules.

These are the questions that molecular geneticists had been investigating all along. But while they were looking at one gene at a time, functional genomics attempts to examine all the genes present in the genome in one go. Therefore, the techniques used in functional genomics are high-throughput analyses that allow very rapid data accumulation.

Expression Profiling-

Determination of the cell types/tissues in which a gene is expressed, as well as when (e.g., at which developmental stage or in response to which external stimulus) the gene is expressed, is called expression profiling. In functional genomics, the aim is to study the expression pattern of (ideally) all the genes present in the genome at the same time; this is called global expression profiling.

This can be done either at the RNA level or at the protein level. At the RNA level, one could use either direct sequence sampling or DNA arrays; the latter are described in some detail here. At the protein level, one may use either two-dimensional electrophoresis followed by mass spectrometry, or protein arrays.

Global expression profiling provides insights into complex biological phenomena, including differentiation, response to stress, onset of a disease, etc. It also provides a new way to define cellular phenotypes; this, in turn, could reveal novel drug targets and help develop more effective drugs.
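A common way to summarize expression profiling data is to compare the level of each gene between two samples as a log2 ratio. The sketch below is a minimal illustration only: the gene names, tissues, signal values, and the two-fold threshold are all hypothetical, and real array analyses add normalization, replicates, and statistical testing.

```python
import math

# Hypothetical signal intensities (arbitrary units) for three genes in two tissues.
expression = {
    "geneA": {"brain": 800.0, "liver": 100.0},
    "geneB": {"brain": 50.0, "liver": 400.0},
    "geneC": {"brain": 200.0, "liver": 200.0},
}


def log2_fold_change(levels, tissue, reference):
    """log2 of the expression ratio between a tissue and a reference tissue."""
    return math.log2(levels[tissue] / levels[reference])


def differentially_expressed(table, tissue, reference, threshold=1.0):
    """Genes whose absolute log2 fold change is at least the threshold.

    A threshold of 1.0 corresponds to a two-fold difference in either direction.
    """
    hits = {}
    for gene, levels in table.items():
        lfc = log2_fold_change(levels, tissue, reference)
        if abs(lfc) >= threshold:
            hits[gene] = lfc
    return hits


hits = differentially_expressed(expression, "brain", "liver")
print(hits)
```

Positive values mark genes that are more strongly expressed in the first tissue, and negative values mark genes that are more strongly expressed in the reference tissue.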

Transcriptome-

The complete set of RNA molecules produced by the genome is usually referred to as the transcriptome. In eukaryotes, a single gene can produce more than one type of mature mRNA by a phenomenon called alternative splicing. Splicing describes the removal of introns from RNA transcripts and the linking together of all the exons in the correct order to yield a mature functional mRNA.

In alternative splicing, the splicing of a single primary RNA transcript occurs in two or more different but well-defined patterns. In each splicing pattern, a defined set of exons is joined together to yield a functional mRNA molecule. The net effect of alternative splicing is the generation of a large number of different proteins from a relatively small number of genes. If each human gene were alternatively processed to yield an average of 3 different proteins, the estimated 35,000 human genes would produce 105,000 different proteins. An extreme case of alternative processing is provided by the Drosophila gene Dscam.

This gene can generate nearly 40,000 different mRNAs, each of which could be translated into a distinct receptor protein. Thus the transcriptome is bound to be much more complex (i.e., variable) than the transcribed portion of the genome. In addition, none of the tissues of a multicellular organism will express all the genes, and genes expressed in one tissue will differ from those in another tissue.

In other words, the transcriptome obtained from one tissue will differ in some respects from that obtained from another tissue. Therefore, it is customary to refer to the transcriptomes as 'human brain transcriptome', 'mouse liver transcriptome', etc.
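The arithmetic behind the alternative splicing figures quoted earlier in this section can be written out briefly. The sketch below reproduces the estimate of 105,000 human proteins from the numbers given in the text. For Dscam it assumes the commonly cited arrangement of four clusters of mutually exclusive exons with 12, 48, 33, and 2 alternatives; these cluster sizes are not stated in the text and are included here as an assumption. The function names are illustrative.

```python
from math import prod


def total_proteins(num_genes, isoforms_per_gene):
    """Total proteins if every gene yields the same average number of isoforms."""
    return num_genes * isoforms_per_gene


def isoform_count(alternatives_per_cluster):
    """Distinct mRNAs when exactly one exon is chosen from each cluster independently."""
    return prod(alternatives_per_cluster)


# Figures from the text: 35,000 genes, an average of 3 proteins each.
human_estimate = total_proteins(35_000, 3)

# Assumed Dscam exon cluster sizes (commonly cited, not from the text).
dscam_clusters = [12, 48, 33, 2]
dscam_isoforms = isoform_count(dscam_clusters)

print(human_estimate)
print(dscam_isoforms)
```

Because the choices in each cluster multiply, a handful of variable exon clusters is enough to produce tens of thousands of possible mRNAs from a single gene.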

Protein Interactions-

At the most fundamental level, gene function reflects the behaviour of the proteins that genes encode. This behaviour may be seen as a series of interactions among various proteins, and between proteins and other molecules. For example, the drugs used to treat diseases act by modulating protein interactions in a beneficial manner. The logic of studying protein interactions is as follows. Suppose the function of protein A is unknown. Protein A is discovered to interact with proteins B and C, which participate in RNA splicing.

This interaction would suggest that protein A is also involved in RNA splicing. Protein interactions are studied using high-throughput techniques. A number of library-based protein interaction mapping methods allow hundreds or thousands of proteins to be screened at a time. These interactions may be assayed in vitro or in vivo. Protein interaction data from various sources are assimilated in databases.

Several bioinformatics tools have been developed to extract information from such databases. A major challenge is to find a simple way to present protein interaction data in a readily accessible and understandable format.
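The reasoning used for protein A above (a protein of unknown function is assigned the function of its interaction partners) can be sketched as a simple majority vote. This is only an illustration: the interaction table, the function labels, and the name predict_function are hypothetical, and real analyses weigh the reliability of each reported interaction.

```python
from collections import Counter

# Hypothetical interaction data: each protein maps to its observed partners.
interactions = {
    "A": ["B", "C", "D"],
    "E": [],
}

# Hypothetical annotations for proteins whose function is already known.
known_function = {
    "B": "RNA splicing",
    "C": "RNA splicing",
    "D": "DNA repair",
}


def predict_function(protein, interactions, known_function):
    """Return the most common known function among a protein's partners.

    Returns None when no partner has a known function.
    """
    votes = Counter(
        known_function[partner]
        for partner in interactions.get(protein, [])
        if partner in known_function
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


print(predict_function("A", interactions, known_function))
print(predict_function("E", interactions, known_function))
```

Here protein A is assigned to RNA splicing because two of its three partners carry that annotation, while protein E, which has no annotated partners, remains unassigned.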
