SOUTENANCE DE THÈSE - PHD DEFENSE – PIERRE DE LANGEN

CARTOGRAPHIE ET SIGNIFICATION BIOLOGIQUE DES REGIONS INTERGENIQUES TRANSCRITES DU GENOME HUMAIN

"Mapping and Biological Significance of the Transcribed Intergenic Regions of the Human Genome"

Monday, March 18th at 9:00 am
Auditorium - Hexagone 172 Av. de Luminy, 13009 Marseille

Jury:

Sarah DJEBALI - Reviewer
Charles LECELLIER - Reviewer
Andrée DELAHAYE-DURIEZ - Examiner
Salvatore SPICUGLIA - Jury president
Benoît BALLESTER - Thesis supervisor
Lionel SPINELLI - Invited member

Abstract:
According to the central dogma of biology, DNA is transcribed into RNA, which is then translated into proteins. However, only one to two percent of the genome appears to code for proteins, while a majority of the genome seems capable of being transcribed into RNA. Some of these non-coding RNAs, which are generally less abundant and less stable than coding transcripts, have been shown to be necessary for certain cellular identities, notably the pluripotent identity. In cancers, aberrant expression of these non-coding transcripts has been observed. A majority of the genetic variants associated with diseases or human traits are located in these non-coding regions.
In my thesis work, we reanalyzed over 900 ChIP-seq experiments targeting RNA Polymerase II, the enzyme responsible for generating these transcripts. I was able to identify more than 180,000 regions bound by RNA Polymerase II in the intergenic genome, and thus likely transcribed, and to determine in which tissues they were active. We also analyzed the transcriptional signal at these regions in nearly 29,000 RNA-seq experiments from ENCODE, GTEx, and TCGA. In cancer data from TCGA, this allowed for the identification of new genomic regions that could serve as markers, whose expression is associated with the tumor state of the tissue or with the patient's survival.
This work also enabled the development of methods for analyzing genomic data that work with low signal and non-coding regions, which have been implemented in a Python package, Muffin.
Keywords: RNA Polymerase II, Genomics, Bioinformatics, Genomic Regulation, Enhancer, ChIP-seq

Résumé :

Selon le dogme central de la biologie, l’ADN est transcrit en ARN puis traduit en protéines. Cependant, seulement un à deux pour cent du génome semblent coder pour des protéines, alors qu’une majorité du génome semble pouvoir être transcrite en ARN. Il a été montré que certains de ces ARN non codants, étant en général moins abondants et plus instables, étaient nécessaires pour certaines identités cellulaires, notamment l’identité pluripotente. Dans les cancers, une expression aberrante de ces transcrits non codants a été observée. Une majorité des variants génétiques associés à des maladies ou traits humains sont situés dans ces régions non codantes.
Dans mon travail de thèse, nous avons réanalysé plus de 900 expériences ChIP-seq ciblant l’ARN Polymérase II, l’enzyme responsable de la génération de ces transcrits. J’ai pu identifier, dans le génome intergénique, plus de 180 000 régions fixées par l’ARN Polymérase II, donc probablement transcrites, et identifier dans quels tissus celles-ci étaient actives. Nous avons également analysé le signal transcriptionnel au niveau de ces régions dans près de 29 000 expériences RNA-seq provenant d’ENCODE, GTEx et TCGA. Dans les données de cancers issues de TCGA, cela a permis de mettre en évidence de nouvelles régions génomiques pouvant servir de marqueurs dont l’expression est associée à l’état tumoral du tissu ou à la survie du patient. Ce travail a également permis le développement de méthodes d’analyse de données génomiques fonctionnant avec un signal bas et des régions non codantes, qui ont été implémentées dans un package Python, Muffin.
Mots clés : ARN Polymérase II, génomique, Bio-informatique, Régulation génomique, enhancer, ChIP-seq

TAGC/INSERM U1090

Parc Scientifique de Luminy case 928
163, avenue de Luminy
13288 MARSEILLE cedex 09 FRANCE