Skip to content
Permalink
master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Go to file
 
 
Cannot retrieve contributors at this time
---
title: "Using multimodalR on TCGA data"
author: "Svenja K. Beck"
date: "`r Sys.Date()`"
#output: rmarkdown::word_document
output:
html_document:
toc: true
number_sections: true
theme: readable
highlight: haddock
#output: pdf_document
vignette: >
%\VignetteIndexEntry{Using multimodalR on TCGA data}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
# Introduction
Welcome to multimodalR! multimodalR is an R package to simulate
multimodal gene expression data, to evaluate different algorithms
detecting multimodality and to detect bimodality or multimodality
in gene expression data sets.
This vignette gives an overview to multimodalR's usage on real data with the example of TCGA data. For an overview and introduction to multimodalR's functionalities, please use the vignette 'multimodalR'.
It was written in R Markdown, using the knitr package for production.
See help(package="multimodalR") for further details (and references provided by citation("multimodalR").)
# Installation
To install the most recent development version from Github use:
```{r install-github, eval = FALSE}
# biocLite("loosolab/multimodalR", dependencies = TRUE,build_vignettes = TRUE)
```
#About TCGA
The Cancer Genome Atlas (TCGA) is a collaboration between the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) that has generated comprehensive, multi-dimensional maps of the key genomic changes in 33 types of cancer. The TCGA dataset, comprising more than two petabytes of genomic data, has been made publically available, and this genomic information helps the cancer research community to improve the prevention, diagnosis, and treatment of cancer.
(https://cancergenome.nih.gov/abouttcga)
##About the data
TCGA collected and characterized high-quality tumor and matched normal samples from over 11,000 patients. The TCGA dataset, available in the Genomic Data Commons (GDC), contains:
* Clinical information about participants
* Metadata about the samples (e.g. the weight of a sample portion, etc.)
* Histopathology slide images from sample portions
* Molecular information derived from the samples (e.g. mRNA/miRNA expression,
protein expression, copy number, etc.)
(https://cancergenome.nih.gov/abouttcga/aboutdata)
##Downloading TCGA data
TCGA data can be downloaded from https://portal.gdc.cancer.gov/repository .
Select the facet 'Cases' and check the box of 'TCGA' at the 'Program' section.
At the 'Primary Site' section, check the boxes of cancer types you want to process with multimodalR and download the files in the JSON notation.
If only one cancer type is selected, the steps before 'processing single cancer groups' may be skipped.
# Quickstart
The following steps are an overview of the steps to use multimodalR on TCGA data:
1. Import your data
1. Clean up your data
1. Log-transform your data
1. Exclude 0-sum rows
1. Sort into groups
**Processing single groups:**
1. Collect data as data.frame
1. Exclude 0-sum rows
1. Selecting potential multimodal genes (with Silverman Test's for unimodality)
1. Use useMclust() (or any other algorithm producing the unified data output format) on data to find potential multimodal genes
1. Filter output
1. Get gene names from Ensembl
1. Exclude genes on allosomes
1. Create a plot of the filtered gene expression matrix & create survival curves plots
1. Check found multimodals against known bimodal genes
![Workflow to process TCGA data](WorkflowTCGAData.png)
#Use multimodalR on TCGA data with few functions
## Import TCGA data
The data exported from TCGA comes in two big files: one contains only metadate, the other contains the FPKM expression values.
### Metadata
The metadata shows the following structure, which is the same for each of the dicts that are stored as a 'list of dicts' in JSON notation:
```{r metadata head, eval=FALSE}
[
{
"caseID": "028e99e9-5b9a-4954-bb6e-6d4709a3cea8", # uniquely identifies the patient
"fileID": "a347bee8-651f-4c8a-9e57-f21fcf955a1d", # uniquely identifies the data file
"fileName": "5d259206-1431-48f8-abc3-3001bab5243d.FPKM.txt.gz", # the actual name of the data file
"sampleType": "Solid Tissue Normal", # the sample type
"diagnosis": "c34.3", # The ICD10 classification of diagnosis (http://www.icd10data.com/ICD10CM/Codes/C00-D49)
"morphology": "8252/3", # Some weird code for cancer morphology
"age": null, # The age in days at data generation
"gender": "female",
"status": "alive", # Vitality status at end of study
"deathIn": null, # Days until death (NA if alive)
"followUpIn": 3261 # Days until last followUp (NA if dead)
}, # .. next entries ...
]
```
###FPKM expression values
The expression values are stored in a h5 data store: expression-fpkm.h5
Usage of the h5ls tool allows a peek into the files structure:
```{r TCGA FPKM file structure,eval=FALSE}
additional.metastatic Group
additional.new.primary Group
blood.derived Group
metastatic Group
primary.tumor Group
recurrent.tumor Group
solid.tissue.normal Group
```
Each group contains an expression matrix. Row names correspond to gene IDs,
column names correspond to the case ID from the metadata.
For patients were the same tumour was measured multiple times,
a running number is added to avoud duplicated column names.
###Reading the data in R
First, the packages RJSONIO, plyr and rhdf5 have to be installed to load the data into R.
To read the metadata into a data.frame use
```{r load TCGA metadata, eval=FALSE}
library(RJSONIO)
library(plyr)
x <- fromJSON("pathToData/expression-fpkm.json", nullValue = NA, simplify = FALSE)
metadata <- ldply(x, data.frame)
```
The expression data can be load directly from the h5 data store:
```{r load TCGA expression data, eval=FALSE}
library(rhdf5)
primary.tumors <- h5read("pathToData/expression-fpkm.h5", "/primary.tumor")
```
The resulting object is a list with three elements:
* rownames: A 1D-Array with the Ensembl Gene ID (can be coerced to character)
* colnames: A 1D-Array with the Case ID (can be coerced to character),
with running counts in case of duplication, resembling R’s make.unique function.
* values: A high-dimensional expression matrix (has to be transposed before use!)
**Beware of R column names**
When creating data frames or matrices from the data, keep in mind that some Case IDs start with a numeric character. Those IDs will be preceded by an X, when put as colnames of an R object.
## Clean up your data
###Log-transform your data
The first step of cleaning up data usually is the log-transformation of data that are skewed.
For this, use:
```{r logTransformingData,eval = FALSE}
cancerData <- log2(data.frame(t(primary.tumors$values+1)))
rownames(cancerData) <- primary.tumors$rownames
colnames(cancerData) <- primary.tumors$colnames
```
###Exclude 0-sum rows & sort expression values into groups
For processing a huge amount of genes the usage of parallel processing is advisable.
**Please check the size of the cancer data object, how many cores and how much memory you have available before specifying the number of cores to be used for parallel processing.**
With the toCancerDataGroups function 0sum Rows are excluded and the data are sorted into groups.
```{r toCancerGroups, eval=FALSE}
groupsData <- toCancerDataGroups(cancerDataExpression = cancerData,
metadata = metadata,coresToUse = 18)
```
## Processing single groups
After using the 'toCancerDataGroups' function, groups that are interesting should be selected and isolated. The following steps will be executed for breast cancer patients, who are in diagnosisGroup "c50". All diagnosis groups can be looked up at http://www.icd10data.com/ICD10CM/Codes/C00-D49 .
###Collect data as data.frame
```{r breastCancerData,eval=FALSE}
groupsDataSets <- groupsData$groupsExpressionDataSets
diagnosisGroups <- groupsData$diagnosisGroups
breastCancerData <- data.frame(groupsDataSets[[which(diagnosisGroups=="c50")]])
```
###Use multimodalR on breast cancer data
The next steps are:
Processes one cancer group by
* filtering 0-sum rows to clear out 0sum genes
* using Silverman's test for unimodality to find potential multimodal genes
* filtering genes with <2% of patients that have expression values greater 0 to clear those genes out
* useMclust() on data to find potential multimodal genes
* cleaning the output of useMclust
* filtering for multimodal genes
* filtering for existing minimum mean of any mean of the multimodal genes
* filtering for existing minimum fold change between any adjacent means of multimodal genes
* getting gene names from Ensembl
* returning a list of the filtered output and the filtered expression matrix
All those steps are part of the 'getFilteredData' function, which will now be used on the breast cancer data.
```{r process Breast Cancer Data, eval=FALSE}
#process breast cancer data
filteredBreastCancerData <- getFilteredData(oneCancerGroupExpressionmatrix = breastCancerData,coresToUse = 1,updateToEnsembleGeneNames = TRUE)
#get filtered output
filteredBreastCancerOutput <- filteredBreastCancerData$Output
#get filtered expression matrix
filteredBreastCancerEm <- filteredBreastCancerData$Expressionmatrix
```
The steps integrated in the 'getFilteredData' function can be used as functions via the Double Colon Operator ("::").
Here is a list of the steps and corresponding functions:
* filtering 0-sum rows to clear out 0sum genes: filter0sumGenes()
* using Silverman's test for unimodality to find potential multimodal genes: useSilvermantest()
* filtering genes with <2% of patients that have expression values greater 0 to clear those genes out: filterRarelyExpressedGenes()
* process data with an algorithm (default: useMclust()) to find potential multimodal genes: useMclust(), useHdbscan(), useFlexmix()
* cleaning the algorithm's output: cleanOutput()
* filtering for multimodal genes: filterMultimodalGenes()
* filtering for existing minimum mean of any mean of the multimodal genes: filterForMean()
* filtering for existing minimum fold change between any adjacent means of multimodal genes: filterForFoldChange()
* getting gene names from Ensembl: updateGeneNames()
How to use these functions to create another workflow is described in the section "Use multimodalR filter functions" in detail.
### Create a plot of the filtered gene expression matrix
The next step of this fast workflow is to plot the filtered gene expression matrix.
```{r plotting the filtered gene expression matrix, eval=FALSE}
plotExpression(expression = filteredBreastCancerEm,
plotsPerPage = 1,
GenesToPlot = nrow(filteredBreastCancerEm),
plotName = "filteredBreastCancerData")
```
### Create plots of the survival curves of the genes
The plotting of the survival curves of the filtered gene expression matrix is the last step of this fast workflow
and can be achieved with the plotSurvivalCurves() function. The gene names should be updated before using this function because then the correct gene names will be displayed above the plots.
```{r plotting the survival curves, eval=FALSE}
plotSurvivalCurves(metadata = metadata,
output = filteredBreastCancerOutput)
```
## Use multimodalR filter functions
The filter functions described in this section are used in the getFilteredData() function which is a fast and straightforward way to analyse TCGA data. The usage of the separate functions allows the user to integrate self-written filtering functions and gives the user more control of the filtering process. These functions are specifically designed to be used after a single cancer group was collected as a data frame.
### Filter 0 sum rows
Genes that are not expressed in any of the patient can't be classified as multimodal. Therefore, they are useless for the further analysis and excluded from the data set with the 'filter0sumgenes()' function:
```{r filter0sumGenes, eval=FALSE}
clearedData <- filter0sumGenes(cancerData = oneGroupCancerData,parallel = TRUE)
```
The result is a cleared gene expression data frame which only holds those genes were the sum of the gene expression was greater than 0.
### Filter genes with different tests for unimodality
The package multimodalR includes two different tests for unimodality:
Silverman's Test for Unimodality and Hartigans' Dip Test for Unimodality.
Both tests can be used to filter the genes for possible multimodal genes.
Genes that are classified as unimodal with the tests are excluded.
Silverman's test was used in the analysis of TCGA data because it has better recognized
multimodal genes in a simulated gene expression data set. Hartigans' Test for Unimodality may also be used by the user.
#### Use Silverman's Test for Unimodality
Silverman's Test for Unimodality tests the null hypothesis that an underlying density has at most 1 mode.
By rejecting this hypothesis (returned p-value is smaller than given level of significance SilvermanP),
one can verify that the underlying density has more than k modes.
Silverman's Test is used in the 'useSilvermantest()' function:
```{r useSilvermantest, eval=FALSE}
possiblyMultimodal <- useSilvermantest(clearedData =clearedData,parallel = TRUE,SilvermanP = 0.05)
```
The result is a gene expression data frame which only holds those genes for that the null hypothesis of the Silverman test was rejected at the given level of significance.
#### Use Hartigans' Dip Test for Unimodality
Hartigans' Dip Test for Unimodality tests the null hypothesis that a given distribution is unimodal.
By rejecting this hypothesis, one can verify that the given distribution is not unimodal.
The null hypothesis is rejected if the returned p-value is smaller than the given level of significance DipTestP.
Hartigans' Dip Test for Unimodality is used in the 'useDiptest()' function:
```{r useDiptest, eval=FALSE}
possiblyMultimodal <- useDiptest(clearedData =clearedData,parallel = TRUE,DipTestP = 0.05)
```
The result is a gene expression data frame which only holds those genes
for that the null hypothesis of Hartigans' Dip Test was rejected at the given level of significance.
### Filter rarely expressed genes
Genes that are only expressed in a few patients can't be classified as multimodal. Therefore, they are useless for the further analysis and excluded from the data set with the 'filterRarelyExpressedGenes()' function:
```{r filterRarelyExpressedGenes, eval=FALSE}
toBeProcessed <- filterRarelyExpressedGenes(possibles = possiblyMultimodal,minExpressedPercentage = 2,parallel = TRUE)
```
The result is a gene expression data frame which only holds those genes for that more than the given minPercentage of patients showed a gene expression value greater than 0.
### Process data with an algorithm
The filtered gene expression data frame can be processed with any of the implemented algorithms: mclust, HDBSCAN and FlexMix:
```{r useAlgorithm, eval=FALSE}
#useMclust
output <- useMclust(expression = toBeProcessed,clusterNumbers = c(1:3),verbose = T,parallel = T)
#useHdbscan
output <- useHdbscan(expression = toBeProcessed,minClusterSize = 20,minSamples = NULL,verbose = T)
#useFlexmix
output <- useFlexmix(expression = toBeProcessed,maxModality = 3,verbose = T,parallel = T)
```
The output is in the unified output data format which is explained in the `multimodalR` vignette in detail.
### Clean the algorithm's output
The output of the algorithms may need cleaning. Sometimes groups are found that only consist of very few patients, those groups will be sorted into the other groups. Also mistakes that might have occured (e.g. empty groups that falsify the modus) are cleared.
```{r cleanOutput, eval=FALSE}
cleanedOutput <- cleanOutput(cancerGroupOutput = output,cancerGroupExpressionmatrix = toBeProcessed,
minPercentage = 10,coresToUse = 6,parallel = FALSE)
```
The returned output is the cleaned output in the unified output data format.
### Filter for multimodal genes
The cleaned output can be filtered for multimodal genes with the `filterMultimodalGenes()` function. With this function all unimodal genes are excluded:
```{r filterMultimodalGenes, eval=FALSE}
multimodalData <- filterMultimodalGenes(cleanedOutput = cleanedOutput,
processedExpressionmatrix = toBeProcessed)
multimodalOutput <- multimodalData$Output
multimodalEm <- multimodalData$Expressionmatrix
```
The result is a list of the filtered expression matrix and output.
The output is in the unified output data format.
The filtered expression matrix and output only hold those genes which are classified as multimodal.
### Filter for minimum mean value
The multimodal output can be filtered for a minimum mean value with the `filterForMean()` function.
If any of the means of a gene is greater than `minMean`, the gene is kept.
All other genes are excluded from the output.
```{r filterForMean, eval=FALSE}
filteredMeanData <- filterForMean(multimodalExpressionmatrix = multimodalEm,
multimodalOutput = multimodalOutput,
minMean = 2)
filteredMeanOutput <- filteredMeanData$Output
filteredMeanEm <- filteredMeanData$Expressionmatrix
```
The result is a list of the filtered expression matrix and output.
The output is in the unified output data format.
The filtered expression matrix and output consist of only those genes which have a mean greater than `minMean`.
### Filter for minimum fold change
The multimodal output can be filtered for a minimum fold change between the means with the `filterForFoldChange()` function.
If any of the differences between two neighbouring means of a gene is greater than `minFoldChange`, the gene is kept.
All other genes are excluded from the output.
```{r filterForFoldChange, eval=FALSE}
filteredFCData <- filterForFoldChange(filteredMeanExpressionmatrix =filteredMeanEm,
filteredMeanOutput = filteredMeanOutput,
minFoldChange = 2)
filteredFCOutput <- filteredFCData$Output
filteredFCEm <- filteredFCData$Expressionmatrix
```
The result is a list of the filtered expression matrix and output.
The output is in the unified output data format.
The filtered expression matrix and output consist of only those genes which have a fold change greater than `minFoldChange` between any neighbouring means.
### Update gene names
The gene names that are specified in the TCGA data are Ensembl gene ids which may be converted to the real gene names. This can be done with the `updateGeneName()` function.
```{r updateGeneNames, eval=FALSE}
updatedNamesData <- updateGeneNames(filteredExpressionmatrix = filteredFCEm,filteredOutput=filteredFCOutput)
filteredOutput <- updatedNamesData$Output
filteredEm <- updatedNamesData$Expressionmatrix
```
The result is a list of the filtered expression matrix and output with the updated gene names.
The output is in the unified output data format.
The filtered expression matrix has the updated gene names as row names.
### Filter for genes not on the X chromosome
The filtered output and expression matrix with the updated gene names can be further filtered to find the genes that are located on the X chromosome. Those genes will be excluded by the usage of the 'filterForXChromosomeGenes()' function.
```{r filterForXChromosomeGenes, eval=FALSE}
notOnXChromosomeData <- filterForXChromosomeGenes(output = filteredOutput,expressionmatrix = filteredEm)
notOnXChromosomeOutput <- notOnXChromosomeData$Output
notOnXChromosomeEm <- notOnXChromosomeData$Expressionmatrix
```
The result is a list of the expression matrix and output with all genes that were not located on the X chromosome.
The output is in the unified output data format.
The filtered expression matrix has the updated gene names as row names.
### Filter for genes not on the Y chromosome
The filtered output and expression matrix with the updated gene names can be further filtered to find the genes that are located on the Y chromosome. Those genes will be excluded by the usage of the 'filterForYChromosomeGenes()' function.
```{r filterForYChromosomeGenes, eval=FALSE}
notOnYChromosomeData <- filterForYChromosomeGenes(output = notOnXChromosomeOutput,expressionmatrix = notOnXChromosomeEm)
notOnYChromosomeOutput <- notOnYChromosomeData$Output
notOnYChromosomeEm <- notOnYChromosomeData$Expressionmatrix
```
The result is a list of the expression matrix and output with all genes that were not located on the Y chromosome.
The output is in the unified output data format.
The filtered expression matrix has the updated gene names as row names.
The filtered expression matrix can be plotted as described above. Of course, self-written filtering functions can be integrated in the described workflow.
### Filter for genes known in cancer pathways
The filtered ouput and expression matrix with updated gene names can be further filtered to find the genes that are known in cancer pathways. The gene names of the filtered output and expression matrix are compared to the genes in KEGG's Cancer Pathway.
This can be accomplished by the usage of the 'filterForKnownInCancer()' function.
```{r filterForKnownInCancer, eval=FALSE}
knownInCancerData <- filterForKnownInCancer(output = filteredOutput,expressionmatrix = filteredEm)
KnownInCancerOutput <- KnownInCancerData$Output
KnownInCancerEm <- KnownInCancerData$Expressionmatrix
```
The result is a list of the expression matrix and output with all genes that were found in KEGG's Cancer Pathway.
The output is in the unified output data format.
The filtered expression matrix has the updated gene names as row names.
### Notes on the filtering functions
The functions filter0sumGenes(), useSilvermantest(), filterRarelyExpressedGenes(),and useMclust(), useHdbscan(), useFlexmix() are used in the `processOneCancerGroup()` function which itself is used in the `getFilteredData()` function. Therefore, in a smaller workflow the `processOneCancerGroup()` function can be used.
The filtering of the algorithm's output can then be done manually.