---
title: "Using mmRmeta"
author: "Sebastian Lieske"
date: "30 January 2019"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
The goal of this small R package is to analyse the metadata associated with gene expression studies. It builds on the output of [multimodalR](https://github.molgen.mpg.de/loosolab/multimodalR), which detects multimodal genes in gene expression data sets.

## 1 Installing Packages

## 2 Loading and Preprocessing Data
You need to load two files:

* metadata
* filtered expression data

The metadata, a JSON file, can be obtained from the TCGA database or loaded from an existing file. The filtered expression data is generated by multimodalR and may be saved as a .RData or .RDS file.

```{r eval = FALSE}
metadata <- RJSONIO::fromJSON("clinical.cases_selection.2019-01-18.json", nullValue = NA, simplify = FALSE)
lung <- readRDS("lungFiltered.RDS")
```
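
The chunk above reads an .RDS file. Since the filtered expression data may also be stored as .RData, the following sketch contrasts the two base R mechanisms. The object `lung_toy` and the temporary files are made up for illustration and only imitate the two-element structure described in this document.

```r
# Toy stand-in for the multimodalR result (hypothetical, far smaller than real data).
lung_toy <- list(
  Output = data.frame(gene = c("geneA", "geneB"), stringsAsFactors = FALSE),
  Expressionmatrix = matrix(1:4, nrow = 2,
                            dimnames = list(c("geneA", "geneB"), c("p1", "p2")))
)

# .RDS stores a single object; readRDS() returns it, so you choose the name.
rds_file <- tempfile(fileext = ".RDS")
saveRDS(lung_toy, rds_file)
from_rds <- readRDS(rds_file)

# .RData stores objects together with their names; load() restores them into
# the given environment and returns the restored names.
rdata_file <- tempfile(fileext = ".RData")
save(lung_toy, file = rdata_file)
env <- new.env()
loaded_names <- load(rdata_file, envir = env)
from_rdata <- get(loaded_names[1], envir = env)
```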
#### 2.1 Filter Expression Data
The filtering output of multimodalR for one cancer type is a large list with two elements: "Output", which contains information about the genes and their modality groups, and "Expressionmatrix", which holds the gene expression values for every gene and patient.
If this has not been done already, process the data further with the following multimodalR functions.

```{r eval = FALSE}
lung <- multimodalR::updateGeneNames(filteredOutput = lung$Output, lung$Expressionmatrix)
lungY <- multimodalR::filterForYChromosomeGenes(output = lung$Output, expressionmatrix = lung$Expressionmatrix)
lungXY <- multimodalR::filterForXChromosomeGenes(output = lungY$Output, expressionmatrix = lungY$Expressionmatrix)
lungXY <- remove.x(lungXY) #remove the unnecessary "X" in front of each case_id
```
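
A common source of a leading "X" is R's conversion of column names into syntactic names, which prefixes names that start with a digit. The sketch below illustrates that effect and one way to undo it. It is not the implementation of `remove.x`; the helper `strip_x` and the shortened IDs are hypothetical.

```r
# Hypothetical, shortened IDs; real case_id values are longer.
ids <- c("0a1b2c3d", "f9e8d7c6")

# make.names() prefixes names that start with a digit with "X".
syntactic <- make.names(ids)

# Hypothetical helper: drop an "X" only when a digit follows it.
strip_x <- function(x) sub("^X(?=[0-9])", "", x, perl = TRUE)

# Apply the helper to the column names of a toy expression matrix.
expr_toy <- matrix(0, nrow = 1, ncol = 2, dimnames = list("geneA", syntactic))
colnames(expr_toy) <- strip_x(colnames(expr_toy))
```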

#### 2.2 Filter Meta Data
The metadata is a large list that needs to be flattened into a data frame. Furthermore, we want to drop any columns that contain only NA values and select the columns of interest.

```{r eval = FALSE}
metadata <- plyr::ldply(metadata, data.frame)                 #flatten the list into a data frame
metadata <- filter.columns.as.na(metadata, "not reported")    #drop any column consisting only of NA
metadata <- rename.columns(metadata)                          #shorten the column names
```
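
To make the column filter concrete, here is a minimal base R sketch of the idea, assuming that the placeholder string "not reported" should count as missing. It is an illustration on a made-up data frame, not the code behind `filter.columns.as.na`.

```r
# Hypothetical toy metadata.
meta_toy <- data.frame(
  case_id = c("a", "b", "c"),
  gender = c("male", "not reported", "female"),
  tumor_grade = c("not reported", "not reported", NA),
  stringsAsFactors = FALSE
)

# Recode the placeholder string as a real missing value in every column.
meta_toy[] <- lapply(meta_toy, function(col) {
  col[col %in% "not reported"] <- NA
  col
})

# Keep only columns that have at least one non-missing entry.
keep <- colSums(!is.na(meta_toy)) > 0
meta_clean <- meta_toy[, keep, drop = FALSE]
```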
Using filter.columns.as.na and rename.columns, any columns consisting only of NA values are dropped and the remaining column names are shortened. Shortening may produce duplicated column names.
From here, select the columns of interest that you want to keep. In this example, ten columns are selected.

```{r eval = FALSE}
metadataSelect <-  subset(metadata, select = c(case_id, tumor_stage, primary_diagnosis, site_of_resection_or_biopsy, vital_status, days_to_death, age_at_diagnosis, gender, race, ethnicity))
```
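
Because shortened column names may collide, it can help to check for duplicates before selecting columns. The following sketch uses a made-up data frame to show how to list duplicated names and, as one possible remedy, make them unique with base R.

```r
# Hypothetical toy data frame with a duplicated column name.
meta_toy <- data.frame(a = 1:2, b = 3:4, c = 5:6)
names(meta_toy) <- c("case_id", "gender", "gender")

# List every name that occurs more than once.
dup_names <- unique(names(meta_toy)[duplicated(names(meta_toy))])

# One option: append a suffix to repeated names so each column can be addressed.
names(meta_toy) <- make.unique(names(meta_toy), sep = "_")
```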
Next, we match the metadata with the gene expression data by the key "case_id", a unique identifier for each patient, to obtain the metadata specific to the cancer type. After subsetting, every factor still carries all levels from the original data frame, so the unused levels need to be dropped. Additionally, the column "tumor_stage" contains values such as "iiia", a Roman numeral with an optional subtype letter. For easier comparison, a new column "stage" is created that contains the Roman numeral of the tumor stage without the subtype.

```{r eval = FALSE}
metadataSelect <-  subset(metadata, select = c(case_id, tumor_stage, primary_diagnosis, site_of_resection_or_biopsy, vital_status, days_to_death, age_at_diagnosis, gender, race, ethnicity))
lungMeta <- subset.metadata(metadataSelect, lungXY, key = "case_id") #match both objects
lungMeta <- drop.unused.levels(lungMeta) #drop unused factor levels in whole data frame
lungMeta <- add.stage.simple(lungMeta, tumor_stage = "tumor_stage", new_name = "stage") #adds a new column
```
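
The three steps above can be sketched in base R on toy data. This is an illustration under stated assumptions, not the package's implementation: the helper `simplify_stage`, the optional "stage " prefix, and the subtype letters a to c are all hypothetical.

```r
# Hypothetical toy metadata and expression matrix (patients are columns).
meta_toy <- data.frame(
  case_id = c("a", "b", "c", "d"),
  tumor_stage = factor(c("stage ia", "stage iiia", "stage iv", "not reported")),
  stringsAsFactors = FALSE
)
expr_toy <- matrix(0, nrow = 1, ncol = 3,
                   dimnames = list("geneA", c("c", "a", "x")))

# Keep only patients that also occur in the expression matrix.
matched <- meta_toy[meta_toy$case_id %in% colnames(expr_toy), , drop = FALSE]

# Subsetting keeps all original factor levels; droplevels() removes unused ones.
matched <- droplevels(matched)

# Hypothetical helper: reduce values such as "iiia" to the Roman numeral "III".
simplify_stage <- function(x) {
  x <- tolower(as.character(x))
  x <- sub("^stage\\s*", "", x)  # drop an optional "stage " prefix (assumption)
  x <- sub("[a-c]$", "", x)      # drop a trailing subtype letter (assumption)
  out <- toupper(x)
  out[!grepl("^[ivx]+$", x)] <- NA_character_  # non-numerals become NA
  out
}
matched$stage <- simplify_stage(matched$tumor_stage)
```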