diff --git a/R/preprocess_coldata.R b/R/preprocess_coldata.R index 14a77d4..7a3fb25 100644 --- a/R/preprocess_coldata.R +++ b/R/preprocess_coldata.R @@ -124,11 +124,14 @@ filter_columns_with_na <- function(summarized_experiment, to_lower = T, unnest = columnData <- mmRmeta::drop_unused_levels(columnData) #drop unused factor levels if (!is.null(percentage)) { #drop columns with more than >percentage< NA obseervations - keepMatch <- vector(mode = "character") + keepMatch <- numeric() sumNA <- colSums(is.na(columnData))/nrow(columnData) %>% round(., digits = 2) isAbove <- sumNA >= percentage if (!is.null(keep)) { - keepMatch <- append(keepMatch, grep(pattern = keep[i], names(columnData))) #get index for colums to be saved, only if >keep< is not null + for (i in 1:length(keep)) { + if (!is.null(keep)) keepMatch <- append(keepMatch, grep(pattern = keep[i], names(columnData))) #get index for colums to be saved, only if >keep< is not null + keepMatch <- unique(keepMatch) + } keepMatch <- unique(keepMatch) isAbove <- isAbove %notin% keepMatch } diff --git a/R/use_glmnet.R b/R/use_glmnet.R index 3f783f4..7aa5c15 100644 --- a/R/use_glmnet.R +++ b/R/use_glmnet.R @@ -23,7 +23,7 @@ #' @import magrittr #' @export -use_glmnet <- function(data = NULL, assay_classification = NULL, classification = NULL, model_matrix = NULL, col_names = NULL, gene_names = NULL, use_glm = T, use_cv = T, fast_cv = T, stratify_class = F, type_measure = "class", type.multinomial = "grouped", +use_glmnet <- function(data = NULL, assay_classification = NULL, classification = NULL, model_matrix = NULL, col_names = NULL, gene_names = NULL, use_glm = T, use_cv = T, fast_cv = F, stratify_class = F, type_measure = "class", type.multinomial = "grouped", alpha = 0.5, verbose = T, parallel = T, no_cluster = NULL, seed = NULL, repeats = 1, k_outer_loop = 10, k_inner_loop = 5, standardize = F, return_both = T, output_genewise = T, ... ) { #test for mode: data = null ? --> then only classification and model_matrix oder ###################### 1. Mise en Place ########################################################### diff --git a/README.md b/README.md index 94a00da..be35ccb 100644 --- a/README.md +++ b/README.md @@ -8,4 +8,5 @@ The workflow of mmRmeta consists of three major steps (outlined in Figure 2). Fi This is followed by the identification of bimodally expressed genes. The package mmRmeta uses mclust for model-based clustering to obtain parameter estimates and the cluster assignment of the samples. The third step is the multi-omics analysis to find relations between bimodally expressed genes and the phenotypes of the patient population via modeling regularized logistic regressions and conducting survival analyses with Cox proportional hazard models. + ![Overview](/vignettes/overview.svg) diff --git a/vignettes/Using_mmRmeta.Rmd b/vignettes/Using_mmRmeta.Rmd index bc3ead6..132aa75 100644 --- a/vignettes/Using_mmRmeta.Rmd +++ b/vignettes/Using_mmRmeta.Rmd @@ -20,9 +20,17 @@ The developed R package mmRmeta is a framework to analyze multi-omics data such # Installation Before doing the data analysis you need to install and load the package. ``` {r eval = FALSE} - devtools::install.package("mmRmeta") # link to package + devtools::install.github("sebastianlieske/mmRmeta", host = "github.molgen.mpg.de/apu/v3") # link to package library(mmRmeta) ``` +There are 7 other packages that are needed that can be downloaded from Bioconductor. +``` {r eval = FALSE} +requiredBio <- c("Biobase", "biomaRt", "edgeR", "GenomicRanges", "SummarizedExperiment", "EnsDb.Hsapiens.v86", "ComplexHeatmap") +for (i in requiredBio) { + BiocManager::install(i) +} +``` + # Preprocessing Let us start with the preprocessing of the data. To proceed this vignette, make sure you have available data either in form of a SummarizedExperiment containing the gene expression data and clinical data or at least one gene expression count matrix and clinical data in form of a data frame in your R environment. In this example, the data was downloaded from NCI's [Genomic Data Commons Data Portal](https://portal.gdc.cancer.gov/) via [TCGAbiolinks](https://bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html). To see an example of how to download the data, please go to the following section [Data Acquisition](#anchor1). To combine RNA expression count matrix and additional data into a summarized experiment head to [Make Summarized Experiment](#anchor2). @@ -91,6 +99,7 @@ if (!exists("rnaCountLGG")) { Due to the data design, the colData of the summarized experiment may contain nested elements; like a list as a cell entry. Especially one column is creating trouble. ```{r eval = FALSE} #delete column "treatments" because its leading to errors and it contains no information +library(mmRmeta) if ("treatments" %in% colnames(colData((rnaCountLGG)))) { rnaCountLGG$treatments <- NULL } @@ -98,13 +107,13 @@ Due to the data design, the colData of the summarized experiment may contain nes The preprocessing of the summarized experiment consists of two parts: the first one handles the gene expression matrix in the assay and the second one deals with the colData containing the clinical data. Both need to be altered for downstream functions to properly work. ### Processing colData -Starting with the clinical data in the colData there are several things to pay attention to. We want to filter out columns with only missing values, change all categorical entries to factors and filter out single level columns. Additionally, you only want to have samples from primary solid tumors and filter out metastatic samples. +Starting with the clinical data in the colData, there are several things to pay attention to. We want to filter out columns with only missing values, change all categorical entries to factors and filter out single level columns. Additionally, you only want to have samples from primary solid tumours and filter out metastatic samples. ```{r eval = FALSE} #new object for applied changes editedCountLGG <- filter_for_factor(rnaCountLGG, column_value = "Primary solid Tumor") #only keep primary solid tumours editedCountLGG <- preprocess_coldata(editedCountLGG) -editedCountLGG <- filter_columns_with_na(editedCountLGG, unnest = T, col_char = "_id", percentage = 0.8, keep = "days_to_death") #drop columns with 80% missing values and drop columns with _id as part of their name +editedCountLGG <- filter_columns_with_na(editedCountLGG, unnest = T, percentage = 0.8, keep = "days_to_death") #drop columns with 80% missing values and drop columns with _id as part of their name editedCountLGG <- factorize_columns(editedCountLGG) editedCountLGG <- drop_unused_levels(editedCountLGG) editedCountLGG <- filter_small_binary(editedCountLGG, min_counts = 10) #drop binary factor columns where one level has less than 10 counts @@ -112,7 +121,7 @@ editedCountLGG <- filter_single_level(editedCountLGG) #drop factor columns with colnames(editedCountLGG) <- substr(colnames(editedCountLGG), 1, 12) ``` -The function _preprocess_coldata_ incorporates 3 single functions that also can be applied individually. When using the standard parameters all not numerical columns were set to factors, columns with only NAs or "not reported" are deleted and single factor level entries are discarded as well. +The function _preprocess_coldata_ incorporates 3 single functions that also can be applied individually. When using the standard parameters, all non-numerical columns were set to factors, columns with only NAs or "not reported" are deleted and single factor level entries are discarded as well. #### Additional Functions for Coldata These functions are called by _preprocess_coldata_setw_ @@ -142,7 +151,7 @@ editedCountLGG <- preprocess_assay(editedCountLGG) In this case we want to filter out genes located on the Y and X chromosome because they would be flagged as bimodal genes due to the gender. It is necessary that the summarized experiment has also a GRanges object. By applying *preprocess_assay* duplicated genes and genes with less than 5% of detected samples are discarded with a overall read count lower than 5% of samples having at least 5. Also at least 10% of all samples that have reads must have more than 5 reads for that gene because a gene is considered as expressed when it has at least a count of 5-10 in a library. -Some patients have multiple existing samples in the expression matrix. We want to have only one enty per patient so we need to merge those additional samples by either calculating the mean or median of all samples belonging to one patient. +Some patients have multiple existing samples in the expression matrix. We want to have only one entry per patient so we need to merge those additional samples by either calculating the mean or median of all samples belonging to one patient. ```{r eval = FALSE} editedCountLGG <- merge_duplicated(editedCountLGG, assay_index = 1, method = "median", shorten_names = T, original_names = T, parallel = T) #combine duplicated genes @@ -183,7 +192,7 @@ colnames(methylLGG) <- substr(colnames(methylall), 1, 12) #shorten IDs # Finding Bimodal Genes ## Using SIBERG -After preprocessing we are set to analyse the data, starting with finding bimodal genes. Our matrix has several thousand genes and with the help of [SIBERG](https://cran.r-project.org/web/packages/SIBERG/index.html) we can prefilter possible bimodal candidates relatively fast. +After preprocessing, we are set to analyse the data. Starting with the identification bimodal genes. Our matrix has several thousand genes and with the help of [SIBERG](https://cran.r-project.org/web/packages/SIBERG/index.html) we can prefilter possible bimodal candidates relatively fast. ```{r eval = FALSE} siberLGG <- use_siber(editedCountLGG, 1) #use SIBER in parallel @@ -218,8 +227,10 @@ Those flaws have to be corrected. ```{r eval = FALSE} correctedClust <- verify_mclust(resultClust, simple = T) #look for misclassification correctedClust <- verify_mclust(correctedClust, min_proportion = 0.10) #look for imbalanced models +correctedClust <- verify_mclust(resultClust, simple = T) #look for misclassification# again + ``` -The function *correctedClust* iterates through the list and searches for flawed results. It has two modi. The first one without the parameter *min_proportion* looks for misclassfied samples. With that parameter, it searches for results with imbalanced proportion of the modes. In this example we want to have a model where at least 10% of the samples are assigned to one mode. If the function encounters inconsistencies or flaws, if will call *use_mclust* again with different parameters. +The function *correctedClust* iterates through the list and searches for flawed results. It has two modes. The first one without the parameter *min_proportion* looks for misclassified samples. With that parameter, it searches for results with imbalanced proportion of the modes. In this example we want to have a model where at least 10% of the samples are assigned to one mode. If the function encounters inconsistencies or flaws, if will call *use_mclust* again with different parameters. Next up we filter results that returned an unimodal model and models with a low fold change. ```{r eval = FALSE} @@ -359,8 +370,37 @@ resMatrixCombined <- do.call(rbind, args = resMatrixCombined) ``` ## Multinomial Regression +Besides doing a binomial regression it is also possible to allow for multiple outcome levels. The function *use_glmnet* and the relevant functions can handle do multinomial regression as well. For example we can test, which genes and variables are related the most to the subtype of the patient. +```{r eval = FALSE} +exprModel <- assay_to_model(editedCountLGG, 2, zscore = T) #first turn gene expression into model matrix with scaled Values +rownames(editedCountLGG) <- editedCountLGG@rowRanges$external_gene_name #change gene names +variables <- c("age_at_diagnosis", "race", "gender", "primary_diagnosis", "paper_Mutation.Count", + "paper_TERT.expression..log2.", "paper_Percent.aneuploidy", "paper_ESTIMATE.combined.score", + "paper_ATRX.status", "paper_MGMT.promoter.status") +multiMatrix <- make_model_matrix(editedCountLGG, predictors = variables) +megedMatrix <- merge_model_matrices(multiMatrix, exprModel) +``` +Then you have to make sure, that the Matrix and the outcome variable have the same length and names. +```{r eval = FALSE} +multiclass <- as.vector(editedCountLGG$paper_IDH.codel.subtype) +names(multiclass) <- colnames(editedCountLGG) #give it names +multiclass <- multiclass[!is.na(multiclass)] #omit NAs +intersection <- dplyr::intersect(rownames(megedMatrix), names(multiclass)) +multimatrix <- megedMatrix[intersection,] #subset +multiclass <- factor(multiclass[intersection]) #subset +``` +Then you can apply glmnet to the data. +```{r eval = FALSE} +multimodelResult <- use_glmnet(classification = multiclass, + model_matrix = megedMatrix, type.multinomial = "grouped", fast_cv = F, + family = "multinomial", standardize = F, repeats = 10, stratify_class = T) +``` +From the result one can obtain the active coefficients. +```{r eval = FALSE} +activeCoef <- get_active_coef(multimodelResult) +``` ## Survival Analysis This package is able to conduct survival analysis. Within this implementation, glmnet and the included regularized Cox regression (Coxnet) is used. Coxnet fits a Cox regression model that is regularized by the elastic net penalty. The regression model describes the relation between an event (e.g. death) and a set of covariates as expressed by the hazard function. For that were looking at the gene expression and with addition of clinical variables. A nested cross-validation is used. @@ -370,7 +410,7 @@ joinedModel <- merge_model_matrices(modelMatrix, exprModel) #apply cox regression resultSurvival <- use_cv_coxglmnet(editedCountLGG, model_matrix = joinedModel, stratify = T, k_inner_loop = 5, k_outer_loop = 8, verbose = T, add_survival = F, seed = 15) ``` -The inner cross validation will be used for finding the optimal value for the penality factor lambda, while the outer loop is used to estimate the prediction accuracy with the C-index. Also a final model is build to calculate the hazard ratios/risk values and build risk groups. +The inner cross validation will be used for finding the optimal value for the penalty factor lambda, while the outer loop is used to estimate the prediction accuracy with the C-index. Also a final model is build to calculate the hazard ratios/risk values and build risk groups. As a result one obtains a list with 4 elements. First one is a vector of C (auc) statistics. Second, the whole glmnet fit. Third, risk values of all patients according to beta values oft the model. And fourth, the final survival or censoring times. To see which variables are active in the final model, one can extract them out of the results. ```{r eval = FALSE} diff --git a/vignettes/Using_mmRmeta.md b/vignettes/Using_mmRmeta.md index 43e3a4c..9d7df4e 100644 --- a/vignettes/Using_mmRmeta.md +++ b/vignettes/Using_mmRmeta.md @@ -28,10 +28,20 @@ hazard models. Before doing the data analysis you need to install and load the package. ``` r - devtools::install.package("mmRmeta") # link to package + devtools::install.github("sebastianlieske/mmRmeta", host = "github.molgen.mpg.de/apu/v3") # link to package library(mmRmeta) ``` +There are 7 other packages that are needed that can be downloaded from +Bioconductor. + +``` r +requiredBio <- c("Biobase", "biomaRt", "edgeR", "GenomicRanges", "SummarizedExperiment", "EnsDb.Hsapiens.v86", "ComplexHeatmap") +for (i in requiredBio) { + BiocManager::install(i) +} +``` + # Preprocessing Let us start with the preprocessing of the data. To proceed this @@ -164,6 +174,7 @@ column is creating trouble. ``` r #delete column "treatments" because its leading to errors and it contains no information +library(mmRmeta) if ("treatments" %in% colnames(colData((rnaCountLGG)))) { rnaCountLGG$treatments <- NULL } @@ -187,7 +198,7 @@ solid tumors and filter out metastatic samples. editedCountLGG <- filter_for_factor(rnaCountLGG, column_value = "Primary solid Tumor") #only keep primary solid tumours editedCountLGG <- preprocess_coldata(editedCountLGG) -editedCountLGG <- filter_columns_with_na(editedCountLGG, unnest = T, col_char = "_id", percentage = 0.8, keep = "days_to_death") #drop columns with 80% missing values and drop columns with _id as part of their name +editedCountLGG <- filter_columns_with_na(editedCountLGG, unnest = T, percentage = 0.8, keep = "days_to_death") #drop columns with 80% missing values and drop columns with _id as part of their name editedCountLGG <- factorize_columns(editedCountLGG) editedCountLGG <- drop_unused_levels(editedCountLGG) editedCountLGG <- filter_small_binary(editedCountLGG, min_counts = 10) #drop binary factor columns where one level has less than 10 counts @@ -382,6 +393,7 @@ Those flaws have to be corrected. ``` r correctedClust <- verify_mclust(resultClust, simple = T) #look for misclassification correctedClust <- verify_mclust(correctedClust, min_proportion = 0.10) #look for imbalanced models +correctedClust <- verify_mclust(resultClust, simple = T) #look for misclassification# again ``` The function *correctedClust* iterates through the list and searches for @@ -626,6 +638,48 @@ resMatrixCombined <- do.call(rbind, args = resMatrixCombined) ## Multinomial Regression +Besides doing a binomial regression it is also possible to allow for +multiple outcome levels. The function *use\_glmnet* and the relevant +functions can handle do multinomial regression as well. For example we +can test, which genes and variables are related the most to the subtype +of the patient. + +``` r +exprModel <- assay_to_model(editedCountLGG, 2, zscore = T) #first turn gene expression into model matrix with scaled Values +rownames(editedCountLGG) <- editedCountLGG@rowRanges$external_gene_name #change gene names +variables <- c("age_at_diagnosis", "race", "gender", "primary_diagnosis", "paper_Mutation.Count", + "paper_TERT.expression..log2.", "paper_Percent.aneuploidy", "paper_ESTIMATE.combined.score", + "paper_ATRX.status", "paper_MGMT.promoter.status") +multiMatrix <- make_model_matrix(editedCountLGG, predictors = variables) +megedMatrix <- merge_model_matrices(multiMatrix, exprModel) +``` + +Then you have to make sure, that the Matrix and the outcome variable +have the same length and names. + +``` r +multiclass <- as.vector(editedCountLGG$paper_IDH.codel.subtype) +names(multiclass) <- colnames(editedCountLGG) #give it names +multiclass <- multiclass[!is.na(multiclass)] #omit NAs +intersection <- dplyr::intersect(rownames(megedMatrix), names(multiclass)) +multimatrix <- megedMatrix[intersection,] #subset +multiclass <- factor(multiclass[intersection]) #subset +``` + +Then you can apply glmnet to the data. + +``` r +multimodelResult <- use_glmnet(classification = multiclass, + model_matrix = megedMatrix, type.multinomial = "grouped", fast_cv = F, + family = "multinomial", standardize = F, repeats = 10, stratify_class = T) +``` + +From the result one can obtain the active coefficients. + +``` r +activeCoef <- get_active_coef(multimodelResult) +``` + ## Survival Analysis This package is able to conduct survival analysis. Within this diff --git a/vignettes/classifiedLGG.RDS b/vignettes/classifiedLGG.RDS new file mode 100644 index 0000000..62ffc08 Binary files /dev/null and b/vignettes/classifiedLGG.RDS differ