What does the majority of our DNA do? Hundreds of scientists have spent years examining these ‘junk’ sequences, which may hold the key to serious diseases – and much more.
The Human Genome Project – the sequencing of the human genome – was a major achievement of the past decade: it laid bare the human genetic blueprint, all three billion bases, but the story doesn’t stop there. Deciphering how this sequence is interpreted by our cells is essential to understanding how the genome works. Then, perhaps, we can apply this knowledge to biomedical research and healthcare.
automated DNA sequencing machine used by the Human Genome Project to determine the complete human DNA sequence. Each peak shows the presence of a particular base. The Human Genome Project identified the 3 billion letters making up our genome. ENCODE now provides details of how the genome works.
Image courtesy of Genome
One of the big surprises of the human genome was that only 2% of the genome contains genes, the instructions to make proteins. After accounting for additional bits of the genome such as non-coding RNAs, parts involved in controlling the activity of genes and introns (the sections of a gene’s sequence that are removed before the messenger RNA molecule is translated), a common view was that the rest of the genome had no biological function. As a result, it was often referred to as ‘junk’ DNA.
Once the human genome was sequenced, it was time to find out whether these sequences really were junk. In 2003, the ENCODE consortium was formed to characterise the non-coding but functional elements of the human genome. The consortium was supported by the National Human Genome Research Institute in the USA and led by the European Bioinformatics Institute (EBI; see box) in the UK. The ENCODE pilot phase ran from 2003 to 2007 and allowed a global network of researchers to test, compare and optimise experimental and computational methods for identifying the active parts in a 1% portion of the genome – essentially sifting through some of the genomic ‘junk’.
Image courtesy of Ian Dunham
analysed 147 different cell types to understand differences in genome regulation in different tissue types. This diagram pinpoints 47 of the 147 different cell types included in the study. Multiple cell types were used because although cells share the same genome, the way they use this information differs between cell types.
Image courtesy of Darryl Leja
Their initial results, published in June 2007 (The ENCODE Project Consortium, 2007), gave a tantalising insight into what the genome is doing. For example, the combined data from microarray (see Koutsos et al., 2009) and sequencing experiments showed that the majority of the genome is transcribed, including regions that had been thought to be transcriptionally silent (figure 2). Although the biological roles of most of the transcripts were still unknown, some were shown to be important regulators of gene expression. Overall, this genome snapshot showed that the interplay between genes, regions involved in regulating the activity of genes, and other types of DNA sequences was much more complex than anyone had thought. The data had already started to indicate that the genome contained many forms of active elements and consequently less unused sequence than had been believed.
After successfully testing their approach, the ENCODE researchers then began to examine the entire human genome. This was made easier by advances in DNA sequencing technology and the availability of more precise biochemical assays.
Their analysis systematically mapped features of the genome, just as a map describes a physical landscape and geographical features such as forests, rivers and mountains. In the genome, the ENCODE researchers were looking for features such as regions of the genome flagged with ‘shhhh’ signs (specific types of methyl groups) indicating gene silencing, ‘bind here’ signposts for transcription factors, booster regions to enhance transcription, and DNA modifications that control how the DNA is packaged (figure 3).
In September 2012, after 5 years of experiments and analysis by 442 researchers from 32 research institutes in the UK, US, Spain, Singapore and Japan, the ENCODE project announced the results of the most detailed analysis of the whole genome to date. The study used about 300 years of computer time to analyse 15 terabytes of data (15 × 10¹² bytes), all of which is publicly available. If the data were printed out at a density of 1000 base pairs per cm², the tower of paper would be 16 m high and more than 30 m long: the equivalent of 12 double-decker buses in volume.
The ENCODE project is an example of what can be achieved by large-scale projects building on the individual contributions of hundreds of researchers, each adding a piece of the jigsaw to produce a complete picture of the genome that could not be achieved by any single organisation.
One of the most exciting things that the ENCODE experiments showed is that rather than being predominantly non-functional sequence, our genome is alive with activity: 80% of the genome is actively doing something. Exactly what it is doing remains to be discovered, but certainly 9% of it (and probably much more) is involved in regulating gene expression, controlling when and where proteins are made. The active 80% of the genome contains more than 70 000 promoter regions – the ‘bind here’ sites for transcription factors – and nearly 40 000 enhancer regions – the boosters that control the expression of distant genes.
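To get a feel for the scale of these figures, the percentages quoted in this article can be converted into approximate numbers of bases. The short Python sketch below assumes a genome of exactly three billion bases (the rounded figure used in this article); the labels and the helper name bases_for are our own shorthand, and this is back-of-envelope arithmetic rather than a calculation made by ENCODE.

```python
# Rounded genome size quoted in the article; the true figure differs slightly.
GENOME_BASES = 3_000_000_000

# Percentages quoted in the article (labels are illustrative shorthand).
QUOTED_PERCENTAGES = {
    "contains protein-coding genes": 2,
    "biochemically active according to ENCODE": 80,
    "regulates gene expression (lower bound)": 9,
}


def bases_for(percent: int, genome_bases: int = GENOME_BASES) -> int:
    """Return the number of bases that a whole-number percentage represents."""
    return genome_bases * percent // 100


# Convert every quoted percentage into an approximate base count.
estimates = {label: bases_for(p) for label, p in QUOTED_PERCENTAGES.items()}

for label, bases in estimates.items():
    print(f"{label}: about {bases // 1_000_000:,} million bases")
```

On these assumptions, the 2% that contains genes corresponds to roughly 60 million bases, whereas the 80% described as active covers about 2400 million.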
Overall, ENCODE identified more than 4 million gene switches dispersed throughout the genome. You could picture the genome as a massive control panel, like a sound engineer’s mixing desk, with lots of switches that turn genes on and off. This information deepens our understanding of gene expression and opens up new opportunities for treating disease. For example, a small change in a gene switch called CARD9 is linked to a 20% increased risk of developing Crohn’s disease, an inflammatory bowel disease. What if you could reset gene switches back to normal, effectively turning off the causes of a disease?
The ENCODE results also shed light on how the genome is organised and the physical interactions occurring within it. The researchers found that these gene switches were in physical contact with the genes they controlled, even though they might be separated linearly by hundreds of kilobases. We tend to imagine the genome as a long, straight line of sequence but in reality it’s all tightly packed in the cell’s nucleus, bringing different parts of the genome in close contact with each other.
ENCODE provides a detailed map of the genome and opens up whole new areas of science to explore. As Ian Dunham from EBI and lead author on the ENCODE paper explains, “In many cases you may have a good idea of which genes are involved in a disease, but you might not know which switches are involved. Sometimes these switches are very surprising – their location might seem more logically connected to a completely different disease. ENCODE gives us a set of valuable leads to follow to discover key mechanisms at play in health and disease. Those can be exploited to create entirely new medicines, or to repurpose existing treatments.”
As well as knowing which genes are involved in a disease, researchers now know some of the switches involved in regulating how these genes are turned on and off. This will be especially valuable for interpreting the results of population-based studies that identify links between a gene and a disease. By combining ENCODE’s functional analysis of the genome with data from genome-wide association studies, researchers can map the genetic variations that have been linked to disease to the areas of regulatory function, including gene switches, identified by ENCODE. The ENCODE data will enable a better understanding of the genetic basis of disease and support the work of scientists for many years to come.
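The idea of mapping disease-linked variations onto regulatory regions can be sketched as a simple overlap test: for each variant position, list the annotated regions that contain it. The Python example below is a minimal illustration under stated assumptions. The region names, variant names and coordinates are all invented, the Region and Variant types and the annotate function are hypothetical, and real analyses use dedicated genomics tools and the actual ENCODE annotation files.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str   # chromosome name
    start: int   # first position covered (inclusive)
    end: int     # position just past the region (exclusive)
    label: str   # kind of regulatory element


class Variant(NamedTuple):
    name: str
    chrom: str
    pos: int


def annotate(variants, regions):
    """Map each variant name to the labels of the regions that contain it."""
    # Group regions by chromosome, sorted by start for stable output order.
    by_chrom = {}
    for region in sorted(regions, key=lambda r: r.start):
        by_chrom.setdefault(region.chrom, []).append(region)

    hits = {}
    for v in variants:
        candidates = by_chrom.get(v.chrom, [])
        # A simple linear scan; fine for a toy example, too slow for a genome.
        hits[v.name] = [r.label for r in candidates if r.start <= v.pos < r.end]
    return hits


# Invented example data (not real ENCODE regions or real disease variants).
REGIONS = [
    Region("chr1", 100, 200, "enhancer_A"),
    Region("chr1", 150, 300, "promoter_B"),
    Region("chr2", 50, 80, "enhancer_C"),
]
VARIANTS = [
    Variant("var1", "chr1", 160),  # falls inside two overlapping regions
    Variant("var2", "chr1", 500),  # falls outside every region
    Variant("var3", "chr2", 50),   # sits on the first base of a region
]

hits = annotate(VARIANTS, REGIONS)
for name, labels in hits.items():
    print(name, labels if labels else "no annotated region")
```

A variant that lands in a region with a known regulatory role becomes a lead worth following up; one that lands nowhere may still matter, since the annotation is incomplete.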
The European Molecular Biology Laboratory (EMBL; w1) is one of the world’s top research institutions, dedicated to basic research in the life sciences. EMBL is international, innovative and interdisciplinary. Its employees from 60 nations have backgrounds including biology, physics, chemistry and computer science, and collaborate on research that covers the full spectrum of molecular biology.
EBI (w2), based near Cambridge, UK, is part of EMBL. It provides data from life science experiments free to the global scientific community, and performs basic research in computational biology. EBI is committed to training researchers in academia and industry to make the most of the incredible amount of data being produced every day in life science experiments.
EMBL is a member of EIROforum (w3), the publisher of Science in School.
Kozlowski C (2010) Bioinformatics with pen and paper: building a phylogenetic tree. Science in School 17: 28-33.
Communication and Public Engagement team (2010) Can you spot a cancer mutation? Science in School 16: 39-44.