Atlas and biological significance of transcribed intergenic regions of the human genome

authors

De langen Pierre

keywords

RNA Polymerase II
Genomics
Bioinformatics
Genomic Regulation
Enhancer
ChIP-seq

document type

Thesis

abstract

According to the central dogma of biology, DNA is transcribed into RNA, which is then translated into proteins. However, only one to two percent of the genome appears to code for proteins, while a majority of the genome seems capable of being transcribed into RNA. Some of these non-coding RNAs, which are generally less abundant and less stable than coding transcripts, have been shown to be necessary for certain cellular identities, notably pluripotency. Aberrant expression of these non-coding transcripts has been observed in cancers, and a majority of the genetic variants associated with diseases or human traits are located in non-coding regions. In this thesis work, we reanalyzed over 900 experiments targeting RNA Polymerase II, the enzyme responsible for generating these transcripts. We identified more than 180,000 regions bound by RNA Polymerase II in the intergenic genome, which are therefore likely transcribed, and determined the tissues in which they are active. We also analyzed the transcriptional signal at these regions in nearly 29,000 RNA-seq experiments from ENCODE, GTEx, and TCGA. In the TCGA cancer data, this allowed for the identification of new genomic regions that could serve as markers, whose expression is associated with the tumor state of the tissue or with the patient’s survival. This work also led to the development of methods for analyzing genomic data that are suited to low signal and non-coding regions, which have been implemented in a Python package, Muffin.
