Permalink
Cannot retrieve contributors at this time
Name already in use
A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
multimodalR/vignettes/Using-multimodalR-on-real-data.Rmd
Go to fileThis commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
450 lines (339 sloc)
21.8 KB
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "Using multimodalR on TCGA data" | |
author: "Svenja K. Beck" | |
date: "`r Sys.Date()`" | |
#output: rmarkdown::word_document | |
output: | |
html_document: | |
toc: true | |
number_sections: true | |
theme: readable | |
highlight: haddock | |
#output: pdf_document | |
vignette: > | |
%\VignetteIndexEntry{Using multimodalR on TCGA data} | |
%\VignetteEngine{knitr::rmarkdown} | |
%\VignetteEncoding{UTF-8} | |
--- | |
# Introduction | |
Welcome to multimodalR! multimodalR is an R package to simulate | |
multimodal gene expression data, to evaluate different algorithms | |
detecting multimodality and to detect bimodality or multimodality | |
in gene expression data sets. | |
This vignette gives an overview to multimodalR's usage on real data with the example of TCGA data. For an overview and introduction to multimodalR's functionalities, please use the vignette 'multimodalR'. | |
It was written in R Markdown, using the knitr package for production. | |
See help(package="multimodalR") for further details (and references provided by citation("multimodalR").) | |
# Installation | |
To install the most recent development version from Github use: | |
```{r install-github, eval = FALSE} | |
# biocLite("loosolab/multimodalR", dependencies = TRUE,build_vignettes = TRUE) | |
``` | |
#About TCGA | |
The Cancer Genome Atlas (TCGA) is a collaboration between the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) that has generated comprehensive, multi-dimensional maps of the key genomic changes in 33 types of cancer. The TCGA dataset, comprising more than two petabytes of genomic data, has been made publically available, and this genomic information helps the cancer research community to improve the prevention, diagnosis, and treatment of cancer. | |
(https://cancergenome.nih.gov/abouttcga) | |
##About the data | |
TCGA collected and characterized high-quality tumor and matched normal samples from over 11,000 patients. The TCGA dataset, available in the Genomic Data Commons (GDC), contains: | |
* Clinical information about participants | |
* Metadata about the samples (e.g. the weight of a sample portion, etc.) | |
* Histopathology slide images from sample portions | |
* Molecular information derived from the samples (e.g. mRNA/miRNA expression, | |
protein expression, copy number, etc.) | |
(https://cancergenome.nih.gov/abouttcga/aboutdata) | |
##Downloading TCGA data | |
TCGA data can be downloaded from https://portal.gdc.cancer.gov/repository . | |
Select the facet 'Cases' and check the box of 'TCGA' at the 'Program' section. | |
At the 'Primary Site' section, check the boxes of cancer types you want to process with multimodalR and download the files in the JSON notation. | |
If only one cancer type is selected, the steps before 'processing single cancer groups' may be skipped. | |
# Quickstart | |
The following steps are an overview of the steps to use multimodalR on TCGA data: | |
1. Import your data | |
1. Clean up your data | |
1. Log-transform your data | |
1. Exclude 0-sum rows | |
1. Sort into groups | |
**Processing single groups:** | |
1. Collect data as data.frame | |
1. Exclude 0-sum rows | |
1. Selecting potential multimodal genes (with Silverman Test's for unimodality) | |
1. Use useMclust() (or any other algorithm producing the unified data output format) on data to find potential multimodal genes | |
1. Filter output | |
1. Get gene names from Ensembl | |
1. Exclude genes on allosomes | |
1. Create a plot of the filtered gene expression matrix & create survival curves plots | |
1. Check found multimodals against known bimodal genes | |
 | |
#Use multimodalR on TCGA data with few functions | |
## Import TCGA data | |
The data exported from TCGA comes in two big files: one contains only metadate, the other contains the FPKM expression values. | |
### Metadata | |
The metadata shows the following structure, which is the same for each of the dicts that are stored as a 'list of dicts' in JSON notation: | |
```{r metadata head, eval=FALSE} | |
[ | |
{ | |
"caseID": "028e99e9-5b9a-4954-bb6e-6d4709a3cea8", # uniquely identifies the patient | |
"fileID": "a347bee8-651f-4c8a-9e57-f21fcf955a1d", # uniquely identifies the data file | |
"fileName": "5d259206-1431-48f8-abc3-3001bab5243d.FPKM.txt.gz", # the actual name of the data file | |
"sampleType": "Solid Tissue Normal", # the sample type | |
"diagnosis": "c34.3", # The ICD10 classification of diagnosis (http://www.icd10data.com/ICD10CM/Codes/C00-D49) | |
"morphology": "8252/3", # Some weird code for cancer morphology | |
"age": null, # The age in days at data generation | |
"gender": "female", | |
"status": "alive", # Vitality status at end of study | |
"deathIn": null, # Days until death (NA if alive) | |
"followUpIn": 3261 # Days until last followUp (NA if dead) | |
}, # .. next entries ... | |
] | |
``` | |
###FPKM expression values | |
The expression values are stored in a h5 data store: expression-fpkm.h5 | |
Usage of the h5ls tool allows a peek into the files structure: | |
```{r TCGA FPKM file structure,eval=FALSE} | |
additional.metastatic Group | |
additional.new.primary Group | |
blood.derived Group | |
metastatic Group | |
primary.tumor Group | |
recurrent.tumor Group | |
solid.tissue.normal Group | |
``` | |
Each group contains an expression matrix. Row names correspond to gene IDs, | |
column names correspond to the case ID from the metadata. | |
For patients were the same tumour was measured multiple times, | |
a running number is added to avoud duplicated column names. | |
###Reading the data in R | |
First, the packages RJSONIO, plyr and rhdf5 have to be installed to load the data into R. | |
To read the metadata into a data.frame use | |
```{r load TCGA metadata, eval=FALSE} | |
library(RJSONIO) | |
library(plyr) | |
x <- fromJSON("pathToData/expression-fpkm.json", nullValue = NA, simplify = FALSE) | |
metadata <- ldply(x, data.frame) | |
``` | |
The expression data can be load directly from the h5 data store: | |
```{r load TCGA expression data, eval=FALSE} | |
library(rhdf5) | |
primary.tumors <- h5read("pathToData/expression-fpkm.h5", "/primary.tumor") | |
``` | |
The resulting object is a list with three elements: | |
* rownames: A 1D-Array with the Ensembl Gene ID (can be coerced to character) | |
* colnames: A 1D-Array with the Case ID (can be coerced to character), | |
with running counts in case of duplication, resembling R’s make.unique function. | |
* values: A high-dimensional expression matrix (has to be transposed before use!) | |
**Beware of R column names** | |
When creating data frames or matrices from the data, keep in mind that some Case IDs start with a numeric character. Those IDs will be preceded by an X, when put as colnames of an R object. | |
## Clean up your data | |
###Log-transform your data | |
The first step of cleaning up data usually is the log-transformation of data that are skewed. | |
For this, use: | |
```{r logTransformingData,eval = FALSE} | |
cancerData <- log2(data.frame(t(primary.tumors$values+1))) | |
rownames(cancerData) <- primary.tumors$rownames | |
colnames(cancerData) <- primary.tumors$colnames | |
``` | |
###Exclude 0-sum rows & sort expression values into groups | |
For processing a huge amount of genes the usage of parallel processing is advisable. | |
**Please check the size of the cancer data object, how many cores and how much memory you have available before specifying the number of cores to be used for parallel processing.** | |
With the toCancerDataGroups function 0sum Rows are excluded and the data are sorted into groups. | |
```{r toCancerGroups, eval=FALSE} | |
groupsData <- toCancerDataGroups(cancerDataExpression = cancerData, | |
metadata = metadata,coresToUse = 18) | |
``` | |
## Processing single groups | |
After using the 'toCancerDataGroups' function, groups that are interesting should be selected and isolated. The following steps will be executed for breast cancer patients, who are in diagnosisGroup "c50". All diagnosis groups can be looked up at http://www.icd10data.com/ICD10CM/Codes/C00-D49 . | |
###Collect data as data.frame | |
```{r breastCancerData,eval=FALSE} | |
groupsDataSets <- groupsData$groupsExpressionDataSets | |
diagnosisGroups <- groupsData$diagnosisGroups | |
breastCancerData <- data.frame(groupsDataSets[[which(diagnosisGroups=="c50")]]) | |
``` | |
###Use multimodalR on breast cancer data | |
The next steps are: | |
Processes one cancer group by | |
* filtering 0-sum rows to clear out 0sum genes | |
* using Silverman's test for unimodality to find potential multimodal genes | |
* filtering genes with <2% of patients that have expression values greater 0 to clear those genes out | |
* useMclust() on data to find potential multimodal genes | |
* cleaning the output of useMclust | |
* filtering for multimodal genes | |
* filtering for existing minimum mean of any mean of the multimodal genes | |
* filtering for existing minimum fold change between any adjacent means of multimodal genes | |
* getting gene names from Ensembl | |
* returning a list of the filtered output and the filtered expression matrix | |
All those steps are part of the 'getFilteredData' function, which will now be used on the breast cancer data. | |
```{r process Breast Cancer Data, eval=FALSE} | |
#process breast cancer data | |
filteredBreastCancerData <- getFilteredData(oneCancerGroupExpressionmatrix = breastCancerData,coresToUse = 1,updateToEnsembleGeneNames = TRUE) | |
#get filtered output | |
filteredBreastCancerOutput <- filteredBreastCancerData$Output | |
#get filtered expression matrix | |
filteredBreastCancerEm <- filteredBreastCancerData$Expressionmatrix | |
``` | |
The steps integrated in the 'getFilteredData' function can be used as functions via the Double Colon Operator ("::"). | |
Here is a list of the steps and corresponding functions: | |
* filtering 0-sum rows to clear out 0sum genes: filter0sumGenes() | |
* using Silverman's test for unimodality to find potential multimodal genes: useSilvermantest() | |
* filtering genes with <2% of patients that have expression values greater 0 to clear those genes out: filterRarelyExpressedGenes() | |
* process data with an algorithm (default: useMclust()) to find potential multimodal genes: useMclust(), useHdbscan(), useFlexmix() | |
* cleaning the algorithm's output: cleanOutput() | |
* filtering for multimodal genes: filterMultimodalGenes() | |
* filtering for existing minimum mean of any mean of the multimodal genes: filterForMean() | |
* filtering for existing minimum fold change between any adjacent means of multimodal genes: filterForFoldChange() | |
* getting gene names from Ensembl: updateGeneNames() | |
How to use these functions to create another workflow is described in the section "Use multimodalR filter functions" in detail. | |
### Create a plot of the filtered gene expression matrix | |
The next step of this fast workflow is to plot the filtered gene expression matrix. | |
```{r plotting the filtered gene expression matrix, eval=FALSE} | |
plotExpression(expression = filteredBreastCancerEm, | |
plotsPerPage = 1, | |
GenesToPlot = nrow(filteredBreastCancerEm), | |
plotName = "filteredBreastCancerData") | |
``` | |
### Create plots of the survival curves of the genes | |
The plotting of the survival curves of the filtered gene expression matrix is the last step of this fast workflow | |
and can be achieved with the plotSurvivalCurves() function. The gene names should be updated before using this function because then the correct gene names will be displayed above the plots. | |
```{r plotting the survival curves, eval=FALSE} | |
plotSurvivalCurves(metadata = metadata, | |
output = filteredBreastCancerOutput) | |
``` | |
## Use multimodalR filter functions | |
The filter functions described in this section are used in the getFilteredData() function which is a fast and straightforward way to analyse TCGA data. The usage of the separate functions allows the user to integrate self-written filtering functions and gives the user more control of the filtering process. These functions are specifically designed to be used after a single cancer group was collected as a data frame. | |
### Filter 0 sum rows | |
Genes that are not expressed in any of the patient can't be classified as multimodal. Therefore, they are useless for the further analysis and excluded from the data set with the 'filter0sumgenes()' function: | |
```{r filter0sumGenes, eval=FALSE} | |
clearedData <- filter0sumGenes(cancerData = oneGroupCancerData,parallel = TRUE) | |
``` | |
The result is a cleared gene expression data frame which only holds those genes were the sum of the gene expression was greater than 0. | |
### Filter genes with different tests for unimodality | |
The package multimodalR includes two different tests for unimodality: | |
Silverman's Test for Unimodality and Hartigans' Dip Test for Unimodality. | |
Both tests can be used to filter the genes for possible multimodal genes. | |
Genes that are classified as unimodal with the tests are excluded. | |
Silverman's test was used in the analysis of TCGA data because it has better recognized | |
multimodal genes in a simulated gene expression data set. Hartigans' Test for Unimodality may also be used by the user. | |
#### Use Silverman's Test for Unimodality | |
Silverman's Test for Unimodality tests the null hypothesis that an underlying density has at most 1 mode. | |
By rejecting this hypothesis (returned p-value is smaller than given level of significance SilvermanP), | |
one can verify that the underlying density has more than k modes. | |
Silverman's Test is used in the 'useSilvermantest()' function: | |
```{r useSilvermantest, eval=FALSE} | |
possiblyMultimodal <- useSilvermantest(clearedData =clearedData,parallel = TRUE,SilvermanP = 0.05) | |
``` | |
The result is a gene expression data frame which only holds those genes for that the null hypothesis of the Silverman test was rejected at the given level of significance. | |
#### Use Hartigans' Dip Test for Unimodality | |
Hartigans' Dip Test for Unimodality tests the null hypothesis that a given distribution is unimodal. | |
By rejecting this hypothesis, one can verify that the given distribution is not unimodal. | |
The null hypothesis is rejected if the returned p-value is smaller than the given level of significance DipTestP. | |
Hartigans' Dip Test for Unimodality is used in the 'useDiptest()' function: | |
```{r useDiptest, eval=FALSE} | |
possiblyMultimodal <- useDiptest(clearedData =clearedData,parallel = TRUE,DipTestP = 0.05) | |
``` | |
The result is a gene expression data frame which only holds those genes | |
for that the null hypothesis of Hartigans' Dip Test was rejected at the given level of significance. | |
### Filter rarely expressed genes | |
Genes that are only expressed in a few patients can't be classified as multimodal. Therefore, they are useless for the further analysis and excluded from the data set with the 'filterRarelyExpressedGenes()' function: | |
```{r filterRarelyExpressedGenes, eval=FALSE} | |
toBeProcessed <- filterRarelyExpressedGenes(possibles = possiblyMultimodal,minExpressedPercentage = 2,parallel = TRUE) | |
``` | |
The result is a gene expression data frame which only holds those genes for that more than the given minPercentage of patients showed a gene expression value greater than 0. | |
### Process data with an algorithm | |
The filtered gene expression data frame can be processed with any of the implemented algorithms: mclust, HDBSCAN and FlexMix: | |
```{r useAlgorithm, eval=FALSE} | |
#useMclust | |
output <- useMclust(expression = toBeProcessed,clusterNumbers = c(1:3),verbose = T,parallel = T) | |
#useHdbscan | |
output <- useHdbscan(expression = toBeProcessed,minClusterSize = 20,minSamples = NULL,verbose = T) | |
#useFlexmix | |
output <- useFlexmix(expression = toBeProcessed,maxModality = 3,verbose = T,parallel = T) | |
``` | |
The output is in the unified output data format which is explained in the `multimodalR` vignette in detail. | |
### Clean the algorithm's output | |
The output of the algorithms may need cleaning. Sometimes groups are found that only consist of very few patients, those groups will be sorted into the other groups. Also mistakes that might have occured (e.g. empty groups that falsify the modus) are cleared. | |
```{r cleanOutput, eval=FALSE} | |
cleanedOutput <- cleanOutput(cancerGroupOutput = output,cancerGroupExpressionmatrix = toBeProcessed, | |
minPercentage = 10,coresToUse = 6,parallel = FALSE) | |
``` | |
The returned output is the cleaned output in the unified output data format. | |
### Filter for multimodal genes | |
The cleaned output can be filtered for multimodal genes with the `filterMultimodalGenes()` function. With this function all unimodal genes are excluded: | |
```{r filterMultimodalGenes, eval=FALSE} | |
multimodalData <- filterMultimodalGenes(cleanedOutput = cleanedOutput, | |
processedExpressionmatrix = toBeProcessed) | |
multimodalOutput <- multimodalData$Output | |
multimodalEm <- multimodalData$Expressionmatrix | |
``` | |
The result is a list of the filtered expression matrix and output. | |
The output is in the unified output data format. | |
The filtered expression matrix and output only hold those genes which are classified as multimodal. | |
### Filter for minimum mean value | |
The multimodal output can be filtered for a minimum mean value with the `filterForMean()` function. | |
If any of the means of a gene is greater than `minMean`, the gene is kept. | |
All other genes are excluded from the output. | |
```{r filterForMean, eval=FALSE} | |
filteredMeanData <- filterForMean(multimodalExpressionmatrix = multimodalEm, | |
multimodalOutput = multimodalOutput, | |
minMean = 2) | |
filteredMeanOutput <- filteredMeanData$Output | |
filteredMeanEm <- filteredMeanData$Expressionmatrix | |
``` | |
The result is a list of the filtered expression matrix and output. | |
The output is in the unified output data format. | |
The filtered expression matrix and output consist of only those genes which have a mean greater than `minMean`. | |
### Filter for minimum fold change | |
The multimodal output can be filtered for a minimum fold change between the means with the `filterForFoldChange()` function. | |
If any of the differences between two neighbouring means of a gene is greater than `minFoldChange`, the gene is kept. | |
All other genes are excluded from the output. | |
```{r filterForFoldChange, eval=FALSE} | |
filteredFCData <- filterForFoldChange(filteredMeanExpressionmatrix =filteredMeanEm, | |
filteredMeanOutput = filteredMeanOutput, | |
minFoldChange = 2) | |
filteredFCOutput <- filteredFCData$Output | |
filteredFCEm <- filteredFCData$Expressionmatrix | |
``` | |
The result is a list of the filtered expression matrix and output. | |
The output is in the unified output data format. | |
The filtered expression matrix and output consist of only those genes which have a fold change greater than `minFoldChange` between any neighbouring means. | |
### Update gene names | |
The gene names that are specified in the TCGA data are Ensembl gene ids which may be converted to the real gene names. This can be done with the `updateGeneName()` function. | |
```{r updateGeneNames, eval=FALSE} | |
updatedNamesData <- updateGeneNames(filteredExpressionmatrix = filteredFCEm,filteredOutput=filteredFCOutput) | |
filteredOutput <- updatedNamesData$Output | |
filteredEm <- updatedNamesData$Expressionmatrix | |
``` | |
The result is a list of the filtered expression matrix and output with the updated gene names. | |
The output is in the unified output data format. | |
The filtered expression matrix has the updated gene names as row names. | |
### Filter for genes not on the X chromosome | |
The filtered output and expression matrix with the updated gene names can be further filtered to find the genes that are located on the X chromosome. Those genes will be excluded by the usage of the 'filterForXChromosomeGenes()' function. | |
```{r filterForXChromosomeGenes, eval=FALSE} | |
notOnXChromosomeData <- filterForXChromosomeGenes(output = filteredOutput,expressionmatrix = filteredEm) | |
notOnXChromosomeOutput <- notOnXChromosomeData$Output | |
notOnXChromosomeEm <- notOnXChromosomeData$Expressionmatrix | |
``` | |
The result is a list of the expression matrix and output with all genes that were not located on the X chromosome. | |
The output is in the unified output data format. | |
The filtered expression matrix has the updated gene names as row names. | |
### Filter for genes not on the Y chromosome | |
The filtered output and expression matrix with the updated gene names can be further filtered to find the genes that are located on the Y chromosome. Those genes will be excluded by the usage of the 'filterForYChromosomeGenes()' function. | |
```{r filterForYChromosomeGenes, eval=FALSE} | |
notOnYChromosomeData <- filterForYChromosomeGenes(output = notOnXChromosomeOutput,expressionmatrix = notOnXChromosomeEm) | |
notOnYChromosomeOutput <- notOnYChromosomeData$Output | |
notOnYChromosomeEm <- notOnYChromosomeData$Expressionmatrix | |
``` | |
The result is a list of the expression matrix and output with all genes that were not located on the Y chromosome. | |
The output is in the unified output data format. | |
The filtered expression matrix has the updated gene names as row names. | |
The filtered expression matrix can be plotted as described above. Of course, self-written filtering functions can be integrated in the described workflow. | |
### Filter for genes known in cancer pathways | |
The filtered ouput and expression matrix with updated gene names can be further filtered to find the genes that are known in cancer pathways. The gene names of the filtered output and expression matrix are compared to the genes in KEGG's Cancer Pathway. | |
This can be accomplished by the usage of the 'filterForKnownInCancer()' function. | |
```{r filterForKnownInCancer, eval=FALSE} | |
knownInCancerData <- filterForKnownInCancer(output = filteredOutput,expressionmatrix = filteredEm) | |
KnownInCancerOutput <- KnownInCancerData$Output | |
KnownInCancerEm <- KnownInCancerData$Expressionmatrix | |
``` | |
The result is a list of the expression matrix and output with all genes that were found in KEGG's Cancer Pathway. | |
The output is in the unified output data format. | |
The filtered expression matrix has the updated gene names as row names. | |
### Notes on the filtering functions | |
The functions filter0sumGenes(), useSilvermantest(), filterRarelyExpressedGenes(),and useMclust(), useHdbscan(), useFlexmix() are used in the `processOneCancerGroup()` function which itself is used in the `getFilteredData()` function. Therefore, in a smaller workflow the `processOneCancerGroup()` function can be used. | |
The filtering of the algorithm's output can then be done manually. |