# XGB Survival Network
This is the code repository corresponding to ["A gradient tree boosting and network propagation derived pan-cancer survival network of the tumor microenvironment"](https://doi.org/10.1016/j.isci.2021.103617).
To identify a pan-cancer survival gene network, a two-step approach is applied:
1. Survival prediction with XGBoost based on gene expression data
2. Network propagation on the feature importance weights derived in step 1 and inference of a pan-cancer survival gene sub-network
This repository contains the Python code for training XGBoost models on a single cancer cohort or on pan-cancer gene expression data, together with Python and R code for downloading the required TCGA data and for creating the figures displayed in the paper. For the network propagation, the [NetCore](https://github.molgen.mpg.de/barel/NetCore) software was used.
## Dependencies
Identifying a survival gene network through survival prediction and network propagation requires the software and packages listed below. For installation instructions for the [Python](https://www.python.org/) and [R](https://www.r-project.org/) dependencies, see [Installation of Dependencies](#installation-of-dependencies).
The following software and packages are required for downloading and preprocessing TCGA data:
- Linux/Unix
- [R (3.6)](https://www.r-project.org/)
- [Bioconductor (3.10)](https://www.bioconductor.org/)
- [TCGAbiolinks (2.12.6)](https://bioconductor.org/packages/release/bioc/html/TCGAbiolinks.html)
- [optparse (1.6.6)](https://github.com/trevorld/r-optparse)
- [dplyr (1.0.0)](https://dplyr.tidyverse.org/)
The following software and packages are required for running the XGBoost survival prediction:
- Linux/Unix
- [Python (3.7)](https://www.python.org/)
- [XGBoost (0.90)](https://github.com/dmlc/xgboost/tree/master/python-package)
- [NumPy (1.18.5)](https://numpy.org/)
- [Pandas (1.1.5)](https://pandas.pydata.org/)
- [scikit-learn (0.22.2.post1)](https://scikit-learn.org)
- [tqdm (4.38.0)](https://github.com/tqdm/tqdm)
The following software and packages are required for performing network propagation:
- Linux/Unix
- [Python (3.7)](https://www.python.org/)
- [NumPy (1.18.5)](https://numpy.org/)
- [SciPy (1.2.1)](https://scipy.org/)
- [Pandas (1.1.5)](https://pandas.pydata.org/)
- [Matplotlib (3.1.1)](https://matplotlib.org/)
- [Seaborn (0.9.0)](https://seaborn.pydata.org/)
- [NetworkX (2.3)](https://networkx.org/)
- [NetCore](https://github.molgen.mpg.de/barel/NetCore)
The following software and packages are required for re-creating the figures displayed in ["A gradient tree boosting and network propagation derived pan-cancer survival network of the tumor microenvironment"](https://doi.org/10.1016/j.isci.2021.103617):
- Linux/Unix
- [Python (3.7)](https://www.python.org/)
- [NumPy (1.18.5)](https://numpy.org/)
- [Pandas (1.1.5)](https://pandas.pydata.org/)
- [MyGene (3.1.0)](https://pypi.org/project/mygene/)
- [matplotlib (3.1.1)](https://matplotlib.org/)
- [Matplotlib-Venn (0.11.6)](https://github.com/konstantint/matplotlib-venn)
- [R (3.6)](https://www.r-project.org/)
- [reshape2 (1.4.4)](https://cran.r-project.org/web/packages/reshape2/)
- [rjson (0.2.20)](https://cran.r-project.org/web/packages/rjson/)
- [ggplot2 (3.3.1)](https://cran.r-project.org/web/packages/ggplot2/)
- [ggpubr (0.2.5)](https://cran.r-project.org/web/packages/ggpubr/)
- [corrplot (0.84)](https://cran.r-project.org/web/packages/corrplot/)
- [plyr (1.8.6)](https://cran.r-project.org/web/packages/plyr/)
### Installation of Dependencies
You can install all required [Python](https://www.python.org/) packages from Unix shell as follows:
```
pip install numpy==1.18.5
pip install pandas==1.1.5
pip install tqdm==4.38.0
pip install scipy==1.2.1
pip install matplotlib==3.1.1
pip install scikit-learn==0.22.2.post1
pip install seaborn==0.9.0
pip install networkx==2.3
pip install xgboost==0.90
pip install mygene==3.1.0
pip install matplotlib-venn==0.11.6
```
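After installation, it can be useful to confirm that the environment matches the pinned versions. The following is a minimal sketch, not part of the repository: the pins are taken from the dependency lists above, and the helper names `installed_version` and `find_mismatches` are hypothetical. On Python 3.7 it assumes the `importlib_metadata` backport is installed, since `importlib.metadata` only ships with Python 3.8 and later.

```python
# Pinned versions taken from the dependency lists in this README.
PINNED = {
    "numpy": "1.18.5",
    "pandas": "1.1.5",
    "tqdm": "4.38.0",
    "scipy": "1.2.1",
    "matplotlib": "3.1.1",
    "scikit-learn": "0.22.2.post1",
    "seaborn": "0.9.0",
    "networkx": "2.3",
    "xgboost": "0.90",
    "mygene": "3.1.0",
    "matplotlib-venn": "0.11.6",
}


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python >= 3.8
    except ImportError:
        from importlib_metadata import PackageNotFoundError, version  # backport for 3.7
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs from its pin to (expected, found)."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_mismatches(PINNED).items():
        print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every listed package is installed at its pinned version.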
All [R](https://www.r-project.org/) dependencies can be installed by entering an [R](https://www.r-project.org/) session and typing:
```
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.10")
BiocManager::install("TCGAbiolinks")
if (!require("optparse"))
    install.packages("optparse")
if (!require("dplyr"))
    install.packages("dplyr")
if (!require("reshape2"))
    install.packages("reshape2")
if (!require("rjson"))
    install.packages("rjson")
if (!require("ggplot2"))
    install.packages("ggplot2")
if (!require("ggpubr"))
    install.packages("ggpubr")
if (!require("corrplot"))
    install.packages("corrplot")
if (!require("plyr"))
    install.packages("plyr")
```
## How to Run
### Download and Preprocessing of TCGA Data
To download and preprocess the TCGA data for survival prediction with XGBoost, you can execute the following R script:
```
Rscript downloadTCGAData.R
```
This downloads gene expression and clinical data for 25 different TCGA cohorts. To download specific cohorts only, add `-c` followed by the desired cohort(s) (e.g., `'TCGA-BRCA'` or `'TCGA-BRCA', 'TCGA-COAD', 'TCGA-LUAD'`) to the program call.
### Survival Prediction with XGBoost
To run model replications of pan-cancer XGBoost training, please run:
```
python run_xgb_survival_replications.py \
    -r <result_dir> \
    -f <feature_dir> \
    -s <first_replication> \
    -e <last_replication>
```
where\
`result_dir`: Survival prediction results will be written to this directory\
`feature_dir`: The selected features will be written to this directory\
`first_replication`: Number of the first model replication to be performed (e.g., 1)\
`last_replication`: Number of the last model replication to be performed (e.g., 1 to run only one replication, or 100 to run 100 model replications with different train-test splits)
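One detail the commands above do not show is how survival outcomes are handed to XGBoost. As a minimal sketch, assuming the models use XGBoost's built-in `survival:cox` objective (this README does not confirm that), survival time and censoring status are combined into a single signed label: positive for an observed event, negative for a right-censored patient. The helper `cox_labels` and the example values are hypothetical.

```python
def cox_labels(times, events):
    """Encode survival data as signed labels for XGBoost's survival:cox objective.

    Observed events keep a positive survival time; right-censored
    observations are marked by a negative survival time.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    labels = []
    for time, event in zip(times, events):
        if time <= 0:
            raise ValueError("survival times must be positive")
        labels.append(float(time) if event else -float(time))
    return labels


# Made-up follow-up times in days; 1 = death observed, 0 = censored.
labels = cox_labels([120, 365, 800], [1, 0, 1])

# The labels could then be attached to the expression matrix X, e.g.
#   dtrain = xgboost.DMatrix(X, label=labels)
#   model = xgboost.train({"objective": "survival:cox"}, dtrain)
```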
To run model replications of single-cohort XGBoost training for a selected cohort, please run instead:
```
python run_xgb_survival_replications.py \
    -r <result_dir> \
    -f <feature_dir> \
    -s <first_replication> \
    -e <last_replication> \
    -c <cohort>
```
where\
`cohort`: Name of the selected TCGA cancer cohort (e.g., 'TCGA-COAD')
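Because `-s` and `-e` select a range of replications, a long run can be split into several smaller calls, for example to distribute them over cluster jobs. The sketch below builds such calls using only the flags documented above. It assumes that the range is inclusive at both ends and that separate calls can safely share the same result and feature directories; neither is confirmed here. The function names are hypothetical.

```python
import sys


def replication_batches(first, last, batch_size):
    """Split the inclusive range [first, last] into consecutive inclusive batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batches = []
    start = first
    while start <= last:
        end = min(start + batch_size - 1, last)
        batches.append((start, end))
        start = end + 1
    return batches


def build_command(result_dir, feature_dir, first, last, cohort=None):
    """Assemble the argument list for one call of run_xgb_survival_replications.py."""
    command = [
        sys.executable, "run_xgb_survival_replications.py",
        "-r", str(result_dir),
        "-f", str(feature_dir),
        "-s", str(first),
        "-e", str(last),
    ]
    if cohort is not None:  # omit -c for pan-cancer training
        command += ["-c", cohort]
    return command


commands = [
    build_command("results", "features", start, end, cohort="TCGA-COAD")
    for start, end in replication_batches(1, 100, 25)
]
# Each entry can be launched with subprocess.run(command, check=True).
```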
To train a model on all data from the 25 TCGA cohorts with more than 20 uncensored patients and test the model on the remaining 8 TCGA cohorts, run:
```
python run_xgb_survial_test_new_cohorts.py \
    -r <result_dir> \
    -f <feature_dir>
```
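The cohort-level split described above follows a simple rule: cohorts with more than 20 uncensored patients are used for training, and the rest are held out for testing. The sketch below illustrates that rule on made-up counts. The function `split_cohorts`, the cohort names, and the numbers are hypothetical and are not taken from the TCGA data or from the script itself.

```python
def split_cohorts(uncensored_counts, min_uncensored=20):
    """Assign cohorts to training or testing by their number of uncensored patients.

    Cohorts with more than `min_uncensored` uncensored patients go to the
    training set; all remaining cohorts form the test set.
    """
    train = sorted(c for c, n in uncensored_counts.items() if n > min_uncensored)
    test = sorted(c for c, n in uncensored_counts.items() if n <= min_uncensored)
    return train, test


# Made-up counts for illustration only; these are not real TCGA numbers.
counts = {"COHORT-A": 150, "COHORT-B": 20, "COHORT-C": 21, "COHORT-D": 3}
train_cohorts, test_cohorts = split_cohorts(counts)
```

Here `COHORT-B` lands in the test set because the threshold is strict: exactly 20 uncensored patients is not "more than 20".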
To train a pan-cancer XGBoost model on a random subset of the training data with a specified size, run:
```
python run_xgb_survival_random_subsets.py \
    -r <result_dir> \
    -f <feature_dir> \
    -n <subsample_size> \
    -s <first_replication> \
    -e <last_replication>
```
where\
`subsample_size`: The number of patients that are randomly subsampled from the training data for model training (e.g., 500)
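Conceptually, the subsampling step draws a fixed number of patients without replacement from the training set. The following is a minimal sketch of that idea using only the standard library. It assumes the replication number is used as the random seed so that each replication draws a different but reproducible subset; the script may handle seeding differently. The function name and patient identifiers are hypothetical.

```python
import random


def subsample_patients(patient_ids, subsample_size, seed):
    """Draw `subsample_size` distinct patients, reproducibly for a given seed."""
    patient_ids = list(patient_ids)
    if subsample_size > len(patient_ids):
        raise ValueError("subsample_size exceeds the number of training patients")
    rng = random.Random(seed)  # local generator, leaves the global state untouched
    return rng.sample(patient_ids, subsample_size)


# Hypothetical training set of 2000 patients, subsampled for replication 1.
training_patients = [f"patient_{i}" for i in range(2000)]
subset = subsample_patients(training_patients, subsample_size=500, seed=1)
```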
### Preparation of Survival Prediction Outputs for Network Propagation
Run the following Python script to prepare the outputs from the XGBoost survival prediction for network propagation with NetCore:
```
python prepare_XGBoost_results_for_NetCore.py \
    --result_path <result_dir> \
    --num_replications <num_reps> \
    --output_path <out_dir>
```
where\
`result_dir`: The path to the directory containing the XGBoost survival prediction results\
`num_reps`: The number of model replications that have been performed for XGBoost survival prediction\
`out_dir`: The output of this script, which can then be used as input for network propagation, is written to this directory
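How the script turns per-replication results into gene weights is not described here. Purely as an illustration of what such an aggregation could look like, the sketch below averages feature importance scores over replications and treats a gene that is missing from a replication as having importance 0. This averaging scheme, the function name, and the example values are assumptions, not the method implemented in `prepare_XGBoost_results_for_NetCore.py`.

```python
from collections import defaultdict


def average_importance(replications):
    """Average per-gene importance over replications (missing genes count as 0)."""
    if not replications:
        raise ValueError("at least one replication is required")
    totals = defaultdict(float)
    for importances in replications:
        for gene, value in importances.items():
            totals[gene] += value
    n_replications = len(replications)
    return {gene: total / n_replications for gene, total in totals.items()}


# Made-up importance scores from two hypothetical replications.
replications = [
    {"GENE1": 0.5, "GENE2": 0.5},
    {"GENE1": 0.25, "GENE3": 0.75},
]
weights = average_importance(replications)
```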
### Network Propagation with NetCore
To perform network propagation with NetCore, run the following command:
```
python <path_to_netcore>/netcore/netcore.py \
    -e <network_file> \
    -w <weight_file> \
    -pd <permutation_dir> \
    -o <out_dir>
```
where\
`network_file`: File containing network in edge list format\
`weight_file`: Weight file containing the gene weights computed from survival prediction outputs\
`permutation_dir`: Path to directory containing permutation files of the network\
`out_dir`: Network propagation results will be written to this directory

Note that before running NetCore on a network for the first time, permutations of the network need to be constructed. For more information, please visit [https://github.molgen.mpg.de/barel/NetCore](https://github.molgen.mpg.de/barel/NetCore).
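For readers unfamiliar with network propagation, the sketch below shows a generic random walk with restart on a small undirected edge list, with initial gene weights as seeds. It is not NetCore's implementation: it uses plain degree normalization and omits the permutation-based steps, and the function name, restart value, and toy network are hypothetical. It is only meant to show how weights on a few genes spread to their network neighbors.

```python
def random_walk_with_restart(edges, seed_weights, restart=0.5, n_iter=100):
    """Propagate seed weights over an undirected network (degree-normalized)."""
    # Build the adjacency structure from the edge list.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # Normalize the seed weights of genes present in the network to sum to 1.
    total = sum(seed_weights.get(node, 0.0) for node in neighbors)
    if total <= 0:
        raise ValueError("no positive seed weight falls on a network node")
    start = {node: seed_weights.get(node, 0.0) / total for node in neighbors}

    scores = dict(start)
    for _ in range(n_iter):
        # Each step restarts at the seeds with probability `restart` ...
        updated = {node: restart * start[node] for node in neighbors}
        # ... and otherwise spreads a node's score evenly to its neighbors.
        for node, score in scores.items():
            share = (1.0 - restart) * score / len(neighbors[node])
            for neighbor in neighbors[node]:
                updated[neighbor] += share
        scores = updated
    return scores


# Toy path network A - B - C - D with all initial weight on gene A.
edges = [("A", "B"), ("B", "C"), ("C", "D")]
scores = random_walk_with_restart(edges, {"A": 1.0})
```

In this toy example the propagated score decreases with distance from the seed gene, while the scores still sum to 1.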