diff --git a/README.md b/README.md index 259deea..dadeb92 100644 --- a/README.md +++ b/README.md @@ -1,9 +1,16 @@ # LSTrAP -LSTrAP, shot for Large Scale Transcriptome Analysis Pipeline, greatly facilitates the construction of co-expression networks from -RNA Seq data. The various tools involved are seamlessly connected and CPU-intensive steps are submitted to a computer cluster +LSTrAP, short for Large Scale Transcriptome Analysis Pipeline, greatly facilitates the construction of co-expression networks from +RNA-Seq data. The various tools involved are seamlessly connected and CPU-intensive steps are submitted to a computer cluster automatically. +## Version 1.3 Changelog + + * Support for [PBS](https://en.wikipedia.org/wiki/Portable_Batch_System) / [Torque](http://www.adaptivecomputing.com/products/open-source/torque/) scheduler (note proper [configuration](./docs/configuration.md) is required) + * [HISAT2](https://ccb.jhu.edu/software/hisat2/index.shtml) can be used as an alternative to [BowTie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) and [TopHat 2](https://ccb.jhu.edu/software/tophat/index.shtml) + * Added [helper](./docs/helper.md) script to do PCA on samples + * **Parameter names in data.ini changed** + ## Workflow LSTrAP wraps multiple existing tools into a single workflow. To use LSTrAP the following tools need to be installed @@ -13,13 +20,10 @@ LSTrAP wraps multiple existing tools into a single workflow. To use LSTrAP the f Steps in bold are submitted to a cluster. Optional steps can be enabled by adding the flag *‑‑enable‑interpro* and/or *‑‑enable‑orthology*. -## Preparation - -LSTrAP is designed to run on an [Oracle Grid Engine](https://www.oracle.com/sun/index.html) computer cluster system and requires -all external tools to be installed on the compute nodes. The "module load" system is supported. A comprehensive list of all tools -necessary can be found [here](docs/preparation.md). 
Instructions to run LSTrAP on other systems are provided below. - ## Installation +Before installing make sure your system meets all requirements. A detailed list of supported systems and required +software can be found [here](docs/preparation.md). + Use git to obtain a copy of the LSTrAP code @@ -31,34 +35,35 @@ Next, move into the directory and copy **config.template.ini** and **data.templa cp config.template.ini config.ini cp data.template.ini data.ini -Configure config.ini and data.ini using the guidelines below - -## Configuration of LSTrAP +Configure config.ini and data.ini using these [guidelines](docs/configuration.md) -After copying the templates, **config.ini** needs to be set up to run on your system. It requires the path to Trimmomatic's jar and the -modules where Bowtie, Tophat ... are installed in (if the [modules](http://modules.sourceforge.net/) environment is used. -The location of the transcriptome data, the refrence genome and a few per-species options need to be defined in **data.ini**. +## Running LSTrAP -Detailed instruction how to set up both configuration files can be found [here](docs/configuration.md) +Once properly configured for your system and data, LSTrAP can be run using a single simple command (that should be +executed on the head node). -## Obtaining and preparing data + ./run.py config.ini data.ini -Scripts to download and prepare data from the [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra) are included in -LSTrAP in the folder **helper**. Furthermore, it is recommended to remove splice variants from the GFF3 files, a script -to do that is included there as well. 
Detailed instructions for each script provided to obtain and prepare data can be -found [here](docs/helper.md) +Run using [HISAT2](https://ccb.jhu.edu/software/hisat2/index.shtml) -## Running LSTrAP + ./run.py --use-hisat2 config.ini data.ini -Once properly configured for your system and data, LSTrAP can be run using a single simple command (that should be executed on the head node) +Run with InterProScan and/or OrthoFinder - ./run.py config.ini data.ini + ./run.py --enable-orthology --enable-interproscan config.ini data.ini -Options to enable InterProScan and/or OrthoFinder or to skip certain steps of the pipeline are included, use the command below for more info +Furthermore, steps can be skipped (to avoid re-running steps unnecessarily). Use the command below for more info. ./run.py -h +## Obtaining and preparing data + +Scripts to download and prepare data from the [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra) are included in +LSTrAP in the folder **helper**. Furthermore, it is recommended to remove splice variants from the GFF3 files; a script +to do that is included there as well. Detailed instructions for each script provided to obtain and prepare data can be +found [here](docs/helper.md). + ## Quality report After running LSTrAP a log file (*lstrap.log*) is written, in which samples which failed a quality measure @@ -92,11 +97,6 @@ for LSTrAP can be generated. python3 fasta_to_gff.py /path/to/transcript.cds.fasta > output.gff -## Adapting LSTrAP to other cluster managers - -LSTrAP is designed and tested on a cluster running the Oracle Grid Engine (default), PBS/Torque is also supported. - -Though due to differences in parameters ## Contact diff --git a/docs/helper.md b/docs/helper.md index 8db0ab7..3a01543 100644 --- a/docs/helper.md +++ b/docs/helper.md @@ -82,10 +82,17 @@ on a normalized expression matrix. 
![matrix example](images/matrix.png "Sample distance heatmap (with hierarchical clustering)") - + +### pca_plot.py + +Script to perform a PCA analysis on any expression matrix. + + python3 pca_plot.py ./data/sbi.expression.matrix.tpm.txt ### pca_powerlaw.py +*This script and the required data are included to recreate results from the manuscript (Proost et al., under review)* + Script to perform a PCA analysis on the *Sorghum bicolor* data (case study) and draw the node degree distribution. The required data is included here as well. Note that this script requires sklearn and seaborn. diff --git a/docs/preparation.md b/docs/preparation.md index 90bffbf..6b86d93 100644 --- a/docs/preparation.md +++ b/docs/preparation.md @@ -1,24 +1,29 @@ # Preparing your system -LSTrAP is designed to run on the head node of a Oracle Grid Engine cluster. Apart from a running compute cluster, the essential -tools need to be installed. A full list is provided below, tools can be installed on the grid nodes directly or inside modules. -When opting for the latter, the configuration file needs to contain the exact names of the modules containing the tools. +LSTrAP is designed with High Performance Computing in mind and requires a computer cluster running +[Oracle Grid Engine](https://www.oracle.com/sun/index.html) or [PBS](https://en.wikipedia.org/wiki/Portable_Batch_System) +/ [Torque](http://www.adaptivecomputing.com/products/open-source/torque/). Furthermore, the essential +tools (see below) need to be installed prior to running LSTrAP. +[Environment modules](http://modules.sourceforge.net/) are supported; in that case the configuration file +needs to contain the exact names of the modules containing the tools. 
+## Required tools * [Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) * [TopHat](https://ccb.jhu.edu/software/tophat/manual.shtml) - * [HISAT2] + * [HISAT2](http://ccb.jhu.edu/software/hisat2/index.shtml) * [Samtools](http://www.htslib.org/) * [SRAtools](http://ncbi.github.io/sra-tools/) - * [Python 2.7](https://www.python.org/download/releases/2.7/) + [HTSeq](http://www-huber.embl.de/users/anders/HTSeq/doc/index.html) + all dependencies (including [PySam](https://github.com/pysam-developers/pysam)) + * [HTSeq](http://www-huber.embl.de/users/anders/HTSeq/doc/index.html) + all dependencies (including [PySam](https://github.com/pysam-developers/pysam)) * [Python 3.5](https://www.python.org/download/releases/3.5.1/) + SciPy + [Numpy](http://www.numpy.org/) - * [InterProScan](https://www.ebi.ac.uk/interpro/) - * [OrthoFinder](https://github.com/davidemms/OrthoFinder) * [MCL](http://www.micans.org/mcl/index.html?sec_software) * [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic) -Optional tools +## Optional tools + * [InterProScan](https://www.ebi.ac.uk/interpro/) + * [OrthoFinder](https://github.com/davidemms/OrthoFinder) + * [Python 2.7](https://www.python.org/download/releases/2.7/) (for OrthoFinder) * [scikit-learn](http://scikit-learn.org/) for Python 3, required for PCA analysis (helper script) * [seaborn](https://stanford.edu/~mwaskom/software/seaborn/) for Python 3, required for PCA analysis (helper script) * [Aspera connect client](http://downloads.asperasoft.com/en/downloads/2), required for the *get_sra_ip.py* (helper script) \ No newline at end of file