SOUTENANCE DE THÈSE - PHD DEFENSE – PIERRE DE LANGEN | TAGC - Theories and Approaches of Genomic Complexity

Event date:

Monday, 11 March, 2024 to Monday, 18 March, 2024

CARTOGRAPHIE ET SIGNIFICATION BIOLOGIQUE DES REGIONS INTERGENIQUES TRANSCRITES DU GENOME HUMAIN

"Mapping and Biological Significance of the Transcribed Intergenic Regions of the Human Genome"

Monday, March 18th at 9:00 am
Auditorium - Hexagone 172 Av. de Luminy, 13009 Marseille

Jury:

Sarah DJEBALI - Reviewer
Charles LECELLIER - Reviewer
Andrée DELAHAYE-DURIEZ - Examiner
Salvatore SPICUGLIA - Jury president
Benoît BALLESTER - Thesis supervisor
Lionel SPINELLI - Invited member

Abstract:
According to the central dogma of biology, DNA is transcribed into RNA and then
translated into proteins. However, only one to two percent of the genome appears to
code for proteins, while a majority of the genome seems capable of being transcribed
into RNA. It has been shown that some of these non-coding RNAs, generally being less
abundant and more unstable, are necessary for certain cellular identities, notably the
pluripotent identity. In cancers, aberrant expression of these non-coding transcripts
has been observed. A majority of the genetic variants associated with diseases or
human traits are located in these non-coding regions.
In my thesis work, we reanalyzed over 900 ChIP-seq experiments targeting RNA
Polymerase II, the enzyme responsible for generating these transcripts. I was able to
identify more than 180,000 regions bound by RNA polymerase II in the intergenic
genome, thus likely transcribed, and to determine in which tissues they were active. We
also analyzed the transcriptional signal at these regions in nearly 29,000 RNA-seq
experiments from ENCODE, GTEx, and TCGA. In cancer data from TCGA, this allowed
for the identification of new genomic regions that could serve as markers whose
expression is associated with the tumor state of the tissue or the patient’s survival.
This work also enabled the development of methods for analyzing genomic data that
work with a low signal and non-coding regions, which have been implemented in a
Python package, Muffin.
Keywords: RNA Polymerase II, Genomics, Bioinformatics, Genomic Regulation,
Enhancer, ChIP-seq

Résumé :

Selon le dogme central de la biologie, l’ADN est transcrit en ARN puis traduit en
protéines. Cependant, seulement un à deux pourcents du génome semblent coder
pour des protéines, alors qu’une majorité du génome semble pouvoir être transcrite
en ARN. Il a été montré que certains de ces ARNs non-codants, étant en général moins
abondants et plus instables, étaient nécessaires pour certaines identités cellulaires,
notamment l’identité pluripotente. Dans les cancers, une expression aberrante de ces
transcrits non-codants a été observée. Une majorité des variants génétiques associés
à des maladies ou traits humains sont situés dans ces régions non-codantes.
Dans mon travail de thèse, nous avons réanalysé plus de 900 expériences ChIP-seq
ciblant l’ARN Polymérase II, l’enzyme responsable de la génération de ces transcrits.
J’ai pu identifier, dans le génome intergénique, plus de 180 000 régions fixées par
l’ARN polymérase II, donc probablement transcrites, et identifier dans quels tissus
celles-ci étaient actives. Nous avons également analysé le signal transcriptionnel au
niveau de ces régions dans près de 29 000 expériences RNA-seq provenant d’ENCODE,
GTEx et TCGA. Dans les données de cancers issues de TCGA, cela a permis de mettre
en évidence de nouvelles régions génomiques pouvant servir de marqueur dont
l’expression est associée avec l’état tumoral du tissu ou à la survie du patient. Ce
travail a également permis le développement de méthodes d’analyses de données
génomiques fonctionnant avec un signal bas et des régions non-codantes, qui ont été
implémentées dans un package Python, Muffin.
Mots clés : ARN Polymérase II, génomique, Bio-informatique, Régulation génomique, enhancer, ChIP-seq