December 20th, 2018 -
In recent years, next-generation sequencing (NGS) technology has revolutionized how genomic studies are conducted. An important and widely used application of NGS is the study of the transcriptome through sequencing of cDNA obtained from RNA (RNA-seq). Compared with earlier technologies such as microarrays, RNA-seq offers many advantages, including a wider dynamic range of measurements, increased precision, higher throughput, and the discovery of novel RNA species and splice forms. RNA-seq has therefore become a suitable alternative to microarrays as the main platform for transcriptome studies. NGS technologies produce huge amounts of data, which calls for the development of effective multivariate analysis methods adapted to the particular nature of the data (discrete counts, huge dynamic range, outliers, …). In this dissertation, we focus on the use of machine learning methods for supervised classification, assigning samples to groups based on their RNA-seq gene expression profiles.
First, we briefly review the state of the art in genomics and in statistical methods for NGS data, in order to draw lessons from the latest developments in NGS data analysis and to assess what our research contributes to the multivariate analysis of NGS data.
We perform a comparative assessment of supervised classification methods, based on published data downloaded from the recount2 warehouse, which contains around 2000 RNA-seq experiments. From this database, we selected seven study cases that are representative of typical RNA-seq studies with different types of categories (classes): disease states (cancer types, leukemia, psoriasis) or cell types (nervous cells). We assessed the impact of pre-processing on classifiers: filtering procedures (discarding unsuited genes and/or samples), normalization, and PCA transformation. We also studied the impact of feature selection, used to circumvent the over-dimensionality of the feature space and to identify the subset of genes or components that optimizes classifier accuracy. Feature selection relied on variable ordering based either on differential expression analysis or on the variable importance returned by a Random Forest classifier.
We pay particular attention to the metadata and explore the structure of the datasets, in order to interpret the behavior of each tested classifier (Support Vector Machines, Random Forest, and K Nearest Neighbours) in light of the specificities of each study case (number of samples, number of classes, distribution of the count values, bulk or single-cell RNA-seq, …).
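As an illustration only, the evaluation workflow outlined above (gene filtering, normalization, feature selection, classification) could be sketched as follows. This is a minimal toy example under several assumptions and is not the implementation used in the dissertation: the counts are simulated rather than taken from recount2, normalization is a simple log2 counts-per-million transform, a Welch t statistic stands in for a proper differential expression analysis, only a K Nearest Neighbours classifier is shown, and all function names (`simulate`, `filter_genes`, `log_cpm`, `top_genes`, `knn_predict`, `loo_accuracy`) are hypothetical.

```python
import math
import random
from statistics import mean, variance

def simulate(n_per_class=20, n_genes=200, n_informative=10, seed=0):
    """Toy two-class count matrix (samples x genes); purely synthetic data."""
    rng = random.Random(seed)
    base = [rng.uniform(2, 8) for _ in range(n_genes)]  # per-gene log-mean
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            depth = rng.uniform(0.5, 2.0)  # sample-specific sequencing depth
            row = []
            for g in range(n_genes):
                # Only the first n_informative genes differ between classes.
                shift = 1.0 if (label == 1 and g < n_informative) else 0.0
                row.append(int(depth * rng.lognormvariate(base[g] + shift, 0.5)))
            X.append(row)
            y.append(label)
    return X, y

def filter_genes(X):
    """Discard genes whose count is identical in every sample (zero variance)."""
    keep = [g for g in range(len(X[0])) if len({row[g] for row in X}) > 1]
    return [[row[g] for g in keep] for row in X]

def log_cpm(counts):
    """Library-size normalization: log2(counts per million + 1) for one sample."""
    total = sum(counts) or 1
    return [math.log2(c * 1e6 / total + 1) for c in counts]

def welch_t(a, b):
    """Absolute Welch t statistic between two groups of values."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return abs(mean(a) - mean(b)) / se if se > 0 else 0.0

def top_genes(X, y, k):
    """Rank genes by |t| between the two classes and keep the k best."""
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, lab in zip(X, y) if lab == 0]
        b = [row[g] for row, lab in zip(X, y) if lab == 1]
        scores.append((welch_t(a, b), g))
    return [g for _, g in sorted(scores, reverse=True)[:k]]

def knn_predict(X_train, y_train, x, genes, k=3):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    dists = sorted(
        (math.dist([row[g] for g in genes], [x[g] for g in genes]), lab)
        for row, lab in zip(X_train, y_train)
    )
    votes = [lab for _, lab in dists[:k]]
    return max(set(votes), key=votes.count)

def loo_accuracy(X, y, n_genes):
    """Leave-one-out accuracy; genes are re-selected in each fold to avoid leakage."""
    hits = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = top_genes(X_tr, y_tr, n_genes)
        hits += knn_predict(X_tr, y_tr, X[i], genes) == y[i]
    return hits / len(X)

counts, labels = simulate()
X = [log_cpm(row) for row in filter_genes(counts)]
# Compare a small, a medium and the full feature set.
results = {k: loo_accuracy(X, labels, k) for k in (5, 20, len(X[0]))}
for k, acc in results.items():
    print(f"{k:>4} genes: leave-one-out accuracy = {acc:.2f}")
```

Gene ranking is repeated inside every cross-validation fold, so the held-out sample never influences which genes are selected; ranking once on the full dataset would tend to inflate the estimated accuracy.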