---
title: "Using mmRmeta"
author: "Sebastian Lieske"
date: "30 January 2019"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
The goal of this small R package is to analyse the metadata associated with gene expression studies. It builds on the output of [multimodalR](https://github.molgen.mpg.de/loosolab/multimodalR), which detects multimodal genes in gene expression data sets.
## 1 Installing Packages
## 2 Loading and Preprocessing Data
You need to load two files:
* metadata
* filtered expression data
The metadata, a JSON file, can be obtained from the TCGA database or loaded from an existing file. The filtered expression data is generated by multimodalR and may be saved as an .RData or .RDS file.
```{r eval = FALSE}
metadata <- RJSONIO::fromJSON("clinical.cases_selection.2019-01-18.json", nullValue = NA, simplify = FALSE)
lung <- readRDS("lungFiltered.RDS")
```
#### 2.1 Filter Expression Data
The output of the filtering done by multimodalR is a large list for one cancer type with two elements: "Output", which contains information about the genes and their modality groups, and "Expressionmatrix", which holds the gene expression values for every gene and patient.
If this has not been done already, process the data further with functions from multimodalR.
```{r eval = FALSE}
lung <- multimodalR::updateGeneNames(filteredOutput = lung$Output, lung$Expressionmatrix)
lungY <- multimodalR::filterForYChromosomeGenes(output = lung$Output, expressionmatrix = lung$Expressionmatrix)
lungXY <- multimodalR::filterForXChromosomeGenes(output = lungY$Output, expressionmatrix = lungY$Expressionmatrix)
lungXY <- remove.x(lungXY) #remove the unnecessary "X" in front of case_id
```
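As a rough illustration of what the last step does, the sketch below strips a leading "X" from the column names of a toy expression matrix. `strip_x_prefix` is a hypothetical helper, not the package's `remove.x`, and the IDs and values are made up. One common source of such a prefix is R's name checking when a table is read in, which prepends "X" to names starting with a digit; whether that is the cause here is an assumption.

```r
# Hypothetical stand-in for remove.x(): strip a leading "X" from column names.
# It works on a plain matrix, not on the list returned by multimodalR.
strip_x_prefix <- function(expr) {
  colnames(expr) <- sub("^X", "", colnames(expr))
  expr
}

# Toy expression matrix: 2 genes x 2 patients (made-up IDs and values).
expr <- matrix(c(1.2, 3.4, 5.6, 7.8), nrow = 2,
               dimnames = list(c("geneA", "geneB"), c("X1a2b", "X3c4d")))

expr <- strip_x_prefix(expr)
colnames(expr)
```

The anchored pattern `^X` only touches the first character, so an "X" elsewhere in an identifier is left alone.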
#### 2.2 Filter Meta Data
The metadata is a large list that needs to be flattened into a data frame. Furthermore, we want to drop any columns that contain only NA values and select the columns of interest.
```{r eval = FALSE}
metadata <- plyr::ldply(metadata, data.frame) #flatten the list into a data frame
metadata <- filter.columns.as.na(metadata, "not reported") #drop columns consisting only of NA
metadata <- rename.columns(metadata) #shorten the column names
```
With filter.columns.as.na and rename.columns, any columns consisting only of NA values are dropped and the remaining column names are shortened. Shortening may produce several duplicated column names.
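The column filtering can be pictured with a small base R sketch. `drop_all_na_columns` is a hypothetical helper written for illustration; it assumes that the string "not reported" should count as missing, which is a guess at how `filter.columns.as.na` uses its second argument. The toy data is made up.

```r
# Hypothetical stand-in for filter.columns.as.na(): keep a column only if it
# has at least one value that is neither NA nor the placeholder string.
drop_all_na_columns <- function(df, na_string = "not reported") {
  keep <- vapply(df, function(col) {
    any(!is.na(col) & as.character(col) != na_string)
  }, logical(1))
  df[, keep, drop = FALSE]
}

# Toy metadata (made-up values).
toy <- data.frame(
  case_id = c("a1", "b2", "c3"),
  gender  = c("male", "not reported", "female"),
  weight  = c(NA, NA, NA),
  bmi     = c("not reported", NA, "not reported"),
  stringsAsFactors = FALSE
)

filtered <- drop_all_na_columns(toy)
names(filtered)  # "case_id" "gender"
```

A column such as `gender`, which mixes real values with the placeholder, is kept; only columns without any usable value are removed.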
From here, select the columns of interest that you want to keep. In this example, ten columns are selected.
```{r eval = FALSE}
metadataSelect <- subset(metadata, select = c(case_id, tumor_stage, primary_diagnosis, site_of_resection_or_biopsy, vital_status, days_to_death, age_at_diagnosis, gender, race, ethnicity))
```
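Because shortened column names may collide, it can help to check for duplicates before selecting columns by name. The following is a minimal base R sketch on a made-up data frame; the column name `updated_datetime` is only an illustrative assumption.

```r
# Toy data frame with a duplicated column name (made-up example).
toy <- data.frame(1:2, 3:4, 5:6)
names(toy) <- c("case_id", "updated_datetime", "updated_datetime")

# List every name that occurs more than once.
dup <- unique(names(toy)[duplicated(names(toy))])
dup  # "updated_datetime"

# One option: make the names unique by appending a running number.
names(toy) <- make.unique(names(toy), sep = "_")
names(toy)  # "case_id" "updated_datetime" "updated_datetime_1"
```

Whether to rename or drop the duplicates depends on which of the original columns carries the information you need.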
Next, we match the metadata with the gene expression data by the key "case_id", a unique identifier for a patient, to obtain the metadata specific to the cancer type. After subsetting, every factor level from the original data frame is retained, so the unused levels need to be dropped. Additionally, the column "tumor_stage" contains values such as "iiia", a Roman numeral with an optional subtype letter. For easier comparison, a new column "stage" is created that contains the Roman numeral of the tumor stage without the subtype.
```{r eval = FALSE}
lungMeta <- subset.metadata(metadataSelect, lungXY, key = "case_id") #match both objects
lungMeta <- drop.unused.levels(lungMeta) #drop unused factor levels in whole data frame
lungMeta <- add.stage.simple(lungMeta, tumor_stage = "tumor_stage", new_name = "stage") #adds a new column
```
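The three steps above can be sketched in base R on toy data. This is not the package's implementation: `simplify_stage` is a hypothetical helper, the IDs and stages are made up, and it is an assumption that stage values may carry an optional "stage " prefix and a subtype letter a, b or c.

```r
# Toy metadata; stringsAsFactors = TRUE mimics factor columns.
meta <- data.frame(
  case_id     = c("a1", "b2", "c3", "d4"),
  tumor_stage = c("stage ia", "stage iiia", "stage iv", "not reported"),
  stringsAsFactors = TRUE
)

# Toy expression matrix whose column names are case IDs.
expr <- matrix(0, nrow = 2, ncol = 3,
               dimnames = list(c("geneA", "geneB"), c("a1", "b2", "d4")))

# 1. Keep only patients that occur in the expression data.
matched <- meta[meta$case_id %in% colnames(expr), , drop = FALSE]

# 2. Drop factor levels that no longer occur after subsetting.
matched <- droplevels(matched)

# 3. Derive the stage without subtype; unrecognised values become NA.
simplify_stage <- function(x) {
  x <- sub("^stage\\s+", "", as.character(x))
  is_stage <- grepl("^[ivx]+[abc]?$", x)
  out <- rep(NA_character_, length(x))
  out[is_stage] <- toupper(sub("[abc]$", "", x[is_stage]))
  out
}
matched$stage <- simplify_stage(matched$tumor_stage)

matched$stage                # "I" "III" NA
levels(matched$tumor_stage)  # "stage iv" is gone
```

Mapping unrecognised values such as "not reported" to NA keeps them out of later stage comparisons instead of treating them as a stage of their own.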