Merge pull request #18 from proost/hisat2
Hisat2 support --> 1.3rc1
Sebastian Proost authored Jul 26, 2017
2 parents f25143a + 6c36e08 commit 375fc42
Showing 15 changed files with 481 additions and 208 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -58,5 +58,6 @@ target/
.idea/
.data/

tmp/
config.ini
data.ini
95 changes: 27 additions & 68 deletions README.md
@@ -1,9 +1,16 @@
# LSTrAP

LSTrAP, shot for Large Scale Transcriptome Analysis Pipeline, greatly facilitates the construction of co-expression networks from
RNA Seq data. The various tools involved are seamlessly connected and CPU-intensive steps are submitted to a computer cluster
LSTrAP, short for Large Scale Transcriptome Analysis Pipeline, greatly facilitates the construction of co-expression networks from
RNA-Seq data. The various tools involved are seamlessly connected and CPU-intensive steps are submitted to a computer cluster
automatically.

## Version 1.3 Changelog

* Support for [PBS](https://en.wikipedia.org/wiki/Portable_Batch_System) / [Torque](http://www.adaptivecomputing.com/products/open-source/torque/) scheduler (note proper [configuration](./docs/configuration.md) is required)
* [HISAT2](https://ccb.jhu.edu/software/hisat2/index.shtml) can be used as an alternative to [BowTie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) and [TopHat 2](https://ccb.jhu.edu/software/tophat/index.shtml)
* Added [helper](./docs/helper.md) script to do PCA on samples
* **Parameter names in data.ini changed**

## Workflow

LSTrAP wraps multiple existing tools into a single workflow. To use LSTrAP the following tools need to be installed
@@ -13,13 +20,10 @@ LSTrAP wraps multiple existing tools into a single workflow. To use LSTrAP the f
Steps in bold are submitted to a cluster. Optional steps can be enabled by adding the flag *‑‑enable‑interpro* and/or
*‑‑enable‑orthology*.

## Preparation

LSTrAP is designed to run on an [Oracle Grid Engine](https://www.oracle.com/sun/index.html) computer cluster system and requires
all external tools to be installed on the compute nodes. The "module load" system is supported. A comprehensive list of all tools
necessary can be found [here](docs/preparation.md). Instructions to run LSTrAP on other systems are provided below.

## Installation
Before installing make sure your system meets all requirements. A detailed list of supported systems and required
software can be found [here](docs/preparation.md).


Use git to obtain a copy of the LSTrAP code

@@ -31,84 +35,39 @@ Next, move into the directory and copy **config.template.ini** and **data.templa
cp config.template.ini config.ini
cp data.template.ini data.ini

Configure config.ini and data.ini using the guidelines below

## Configuration of LSTrAP

After copying the templates, **config.ini** needs to be set up to run on your system. It requires the path to Trimmomatic's jar and the
modules where Bowtie, Tophat ... are installed in.

The location of the transcriptome data, the refrence genome and a few per-species options need to be defined in **data.ini**.

Detailed instruction how to set up both configuration files can be found [here](docs/configuration.md)

## Obtaining and preparing data
Configure config.ini and data.ini using these [guidelines](docs/configuration.md)

Scripts to download and prepare data from the [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra) are included in
LSTrAP in the folder **helper**. Furthermore, it is recommended to remove splice variants from the GFF3 files, a script
to do that is included there as well. Detailed instructions for each script provided to obtain and prepare data can be
found [here](docs/helper.md)

## Running LSTrAP

Once properly configured for your system and data, LSTrAP can be run using a single simple command (that should be executed on the head node)
Once properly configured for your system and data, LSTrAP can be run using a single simple command (that should be
executed on the head node).

./run.py config.ini data.ini

Options to enable InterProScan and/or OrthoFinder or to skip certain steps of the pipeline are included, use the command below for more info
Run using [HISAT2](https://ccb.jhu.edu/software/hisat2/index.shtml)

./run.py -h

## Quality report

After running LSTrAP a log file (*lstrap.log*) is written, in which samples which failed a quality measure
are reported. Note that no samples are excluded from the final network. In case certain samples need to be excluded
from the final network remove the htseq file for the sample you which to exclude and re-run the pipeline skipping all
steps prior to building the network.

./run.py config.ini data.ini --skip-interpro --skip-orthology --skip-bowtie-build --skip-trim-fastq --skip-tophat --skip-htseq --skip-qc

More information on how the quality of samples is determined can be found [here](docs/quality.md).
./run.py --use-hisat2 config.ini data.ini

## Output
Run with InterProScan and/or OrthoFinder

Apart from the output all tools included generate, LSTrAP will generate raw and normalized expression matrices, a
co‑expression network and co‑expression clusters.
./run.py --enable-orthology --enable-interproscan config.ini data.ini

A detailed overview of files produces, including examples, can be found [here](docs/example_output.md).
Furthermore, steps can be skipped (to avoid re-running steps unnecessarily). Use the command below for more info.

## Helper Scripts

LSTrAP comes with a few additional scripts to assist users to download and process data from the [Sequence Read Archive](http://www.ncbi.nlm.nih.gov/sra),
repeat analyses and the case study reported in the manuscript (Proost et al., *under preparation*).

Details for each script can be found [here](docs/helper.md)
./run.py -h

## Running LSTrAP on transcriptome data
## Further reading

To use LSTrAP on a *de novo* assembled transcriptome a little pre-processing is required. Instead of the genome a fasta
file containing **coding** sequences can be used (remove UTRs). Using the helper script fasta_to_gff.py a gff file suited
for LSTrAP can be generated.
* [LSTrAP output](docs/example_output.md)
* [Quality statistics](docs/quality.md): How to check the quality of samples and remove problematic samples
* [Helper Scripts](docs/helper.md): To acquire data from the [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra)
and process results.

python3 fasta_to_gff.py /path/to/transcript.cds.fasta > output.gff

## Adapting LSTrAP to other cluster managers

LSTrAP is designed and tested on a cluster running the Oracle Grid Engine, though with minimal effort it can be adopted to run on PBS and Torque
based systems (and likely others). First, in the configuration file, check the qsub parameters (e.g. jobs that require multiple
CPUs to run *-pe cores 4*), that differ between systems are set up properly (the nodes and cores on Torque and PBS need to be
set using *-l nodes=4:ppn=2* to request 4 nodes with 2 processes per node).

Furthermore the submission script might differ, these are located in **./cluster/templates.py** . For PBS based systems some
settings need to be included by adding *#PBS ...*.

We strive to get LSTrAP running on as many systems as possible. Do not hesitate to contact us in case you experience difficulties
running LSTrAP on your system.


## Contact

LSTrAP was developed by [Sebastian Proost](mailto:proost@mpimp-golm.mpg.de) and [Marek Mutwil](mailto:mutwil@mpimp-golm.mpg.de) at the [Max-Planck Institute for Molecular Plant Physiology](http://www.mpimp-golm.mpg.de/2168/en)
LSTrAP was developed by [Sebastian Proost](mailto:proost@mpimp-golm.mpg.de) and [Marek Mutwil](mailto:mutwil@gmail.com) at the [Max-Planck Institute for Molecular Plant Physiology](http://www.mpimp-golm.mpg.de/2168/en)

## Acknowledgements and Funding

1 change: 1 addition & 0 deletions cluster/__init__.py
@@ -12,6 +12,7 @@ def detect_cluster_system():
:return: string "SGE", "PBS" or "other"
"""

try:
which_output = check_output(["which", "sge_qmaster"], stderr=DEVNULL).decode("utf-8")
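
The hunk above is cut off after the first probe. As a rough illustration only, the sketch below shows one way such a detection could be finished. The function name `guess_cluster_system`, the `pbs_server` probe and the use of `shutil.which` are assumptions made for this example and are not taken from the LSTrAP code; only the `sge_qmaster` probe appears in the diff.

```python
from shutil import which


def guess_cluster_system(lookup=which):
    """Return "SGE", "PBS" or "other" depending on which scheduler binary is found.

    Hypothetical helper: `lookup` takes a program name and returns its path,
    or None when the program is not on PATH (the behaviour of shutil.which).
    """
    # sge_qmaster is the probe shown in the diff above
    if lookup("sge_qmaster") is not None:
        return "SGE"
    # pbs_server is an assumed probe for PBS/Torque; the real check may differ
    if lookup("pbs_server") is not None:
        return "PBS"
    return "other"


if __name__ == "__main__":
    print(guess_cluster_system())
```

Passing the lookup function as a parameter keeps the logic testable on a machine without any scheduler installed.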

90 changes: 70 additions & 20 deletions config.template.ini
@@ -1,52 +1,102 @@
[TOOLS]
; In case there is no module load system on the system set the module name to None
; Tool Configuration
;
; Some tools require additional files or might require a hard coded path to the script.
; Please make sure these are set up correctly.


; Trimmomatic Path
; ADJUST THIS
trimmomatic_path=/home/sepro/tools/Trimmomatic-0.36/trimmomatic-0.36.jar

; Module names
bowtie_module=biotools/bowtie2-2.2.6
samtools_module=biotools/samtools-1.3
sratoolkit_module=biotools/sratoolkit-2.5.7
tophat_module=biotools/tophat-2.1.0

interproscan_module=biotools/interproscan-5.16-55.0

blast_module=biotools/ncbi-blast-2.3.0+
mcl_module=biotools/mcl-14.137
; COMMANDS to run tools
;
; Here the commands used to start different steps are defined, ${name} are variables that will be set by LSTrAP for
; each job.

python_module=devel/Python-2.7.10
python3_module=devel/Python-3.5.1

; commands to run tools
; Note that in some cases hard coded paths were required, adjust these to match the location of these files on
; your system
bowtie_cmd=bowtie2-build ${in} ${out}
hisat2_build_cmd=hisat2-build ${in} ${out}

; ADJUST PATHS TO ADAPTERS
trimmomatic_se_command=java -jar ${jar} SE -threads 1 ${in} ${out} ILLUMINACLIP:/home/sepro/tools/Trimmomatic-0.36/adapters/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
trimmomatic_pe_command=java -jar ${jar} PE -threads 1 ${ina} ${inb} ${outap} ${outau} ${outbp} ${outbu} ILLUMINACLIP:/home/sepro/tools/Trimmomatic-0.36/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

tophat_se_cmd=tophat -p 3 -o ${out} ${genome} ${fq}
tophat_pe_cmd=tophat -p 3 -o ${out} ${genome} ${forward},${reverse}

htseq_count_cmd=htseq-count -s no -f bam -t ${feature} -i ${field} ${bam} ${gff} > ${out}
hisat2_se_cmd=hisat2 -p 3 -x ${genome} -U ${fq} -S ${out} 2> ${stats}
hisat2_pe_cmd=hisat2 -p 3 -x ${genome} -1 ${forward} -2 ${reverse} -S ${out} 2> ${stats}

htseq_count_cmd=htseq-count -s no -f ${itype} -t ${feature} -i ${field} ${bam} ${gff} > ${out}

interproscan_cmd=interproscan.sh -i ${in_dir}/${in_prefix}${SGE_TASK_ID} -o ${out_dir}/${out_prefix}${SGE_TASK_ID} -f tsv -dp -iprlookup -goterms --tempdir /tmp

pcc_cmd=python3 ./scripts/pcc.py ${in} ${out} ${mcl_out}
mcl_cmd=mcl ${in} --abc -o ${out} -te 4

; ADJUST THIS
mcxdeblast_cmd=perl /apps/biotools/mcl-14.137/bin/mcxdeblast --m9 --line-mode=abc ${blast_in} > ${abc_out}

; ADJUST THIS
orthofinder_cmd=python /home/sepro/OrthoFinder-0.4/orthofinder.py -f ${fasta_dir} -t 8

; qsub parameters (OGE)

; qsub parameters

qsub_bowtie=''
qsub_indexing=''
qsub_trimmomatic=''
qsub_tophat='-pe cores 4'
qsub_htseq_count=''
qsub_interproscan='-pe cores 5'
qsub_pcc=''
qsub_mcl='-pe cores 4'
qsub_orthofinder='-pe cores 8'
qsub_mcxdeblast=''
qsub_mcxdeblast=''

; qsub parameters (PBS/Torque)

; qsub_indexing=''
; qsub_trimmomatic=''
; qsub_tophat='-l nodes=1,ppn=4'
; qsub_htseq_count=''
; qsub_interproscan='-l nodes=1,ppn=5'
; qsub_pcc=''
; qsub_mcl='-l nodes=1,ppn=4'
; qsub_orthofinder='-l nodes=1,ppn=8'
; qsub_mcxdeblast=''

; qsub parameters (PBS/Torque with walltimes)

; qsub_indexing='-l walltime=00:10:00'
; qsub_trimmomatic='-l walltime=00:10:00'
; qsub_tophat='-l nodes=1,ppn=4 -l walltime=00:10:00'
; qsub_htseq_count=' -l walltime=00:02:00'
; qsub_interproscan='-l nodes=1,ppn=5 -l walltime=00:10:00'
; qsub_pcc=' -l walltime=00:10:00'
; qsub_mcl='-l nodes=1,ppn=4 -l walltime=00:10:00'
; qsub_orthofinder='-l nodes=1,ppn=8 -l walltime=01:00:00'
; qsub_mcxdeblast='-l walltime=00:10:00'

; Module names
; These need to be configured if the required tools are installed in the environment modules.
; You can find the modules installed on your system using
;
; module avail
;
; In case there is no module load system on the system set the module name to None

bowtie_module=biotools/bowtie2-2.2.6
samtools_module=biotools/samtools-1.3
sratoolkit_module=biotools/sratoolkit-2.5.7
tophat_module=biotools/tophat-2.1.0

hisat2_module=

interproscan_module=biotools/interproscan-5.16-55.0

blast_module=biotools/ncbi-blast-2.3.0+
mcl_module=biotools/mcl-14.137

python_module=devel/Python-2.7.10
python3_module=devel/Python-3.5.1
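
The template comments say that `${name}` entries in the command strings are variables LSTrAP sets for each job. The sketch below illustrates how such placeholders can be filled, assuming Python's `string.Template`, whose `${name}` syntax happens to match; whether LSTrAP uses this class internally is not shown in the diff. The helper name `render_command` and the file names are made up for the example.

```python
from string import Template


def render_command(command_template, values):
    """Fill ${name} placeholders in a command string (hypothetical helper).

    safe_substitute leaves unknown placeholders untouched, so a shell
    variable such as ${SGE_TASK_ID} survives for the scheduler to expand.
    """
    return Template(command_template).safe_substitute(values)


if __name__ == "__main__":
    # "in" is a Python keyword, so values are passed as a dict, not as kwargs
    print(render_command("hisat2-build ${in} ${out}",
                         {"in": "genome.fa", "out": "genome_index"}))
    print(render_command("interproscan.sh -i ${in_dir}/${in_prefix}${SGE_TASK_ID}",
                         {"in_dir": "./proteins", "in_prefix": "part_"}))
```

A strict `substitute` call would instead raise `KeyError` on a missing value, which can be preferable when every placeholder is expected to be filled.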
5 changes: 2 additions & 3 deletions data.template.ini
@@ -23,10 +23,9 @@ fastq_dir=./data/zma/fastq
tophat_cutoff=65
htseq_cutoff=40

bowtie_output=./output/bowtie-build/zma
indexing_output=./output/bowtie-build/zma
trimmomatic_output=./output/trimmed_fastq/zma
tophat_output=./output/tophat/zma
samtools_output=./output/samtools/zma
alignment_output=./tmp/tophat/zma
htseq_output=./output/htseq/zma

exp_matrix_output=./output/zma/exp_matrix.txt
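
Because parameter names in data.ini changed in this release, an existing data.ini may still carry the old keys. The sketch below is one way to spot them with Python's `configparser`. The old-to-new mapping is inferred from the lines in this hunk (the file header reports 2 additions and 3 deletions) and is an assumption, as are the helper name `find_legacy_keys` and the `[zma]` section name.

```python
from configparser import ConfigParser

# Assumed mapping, inferred from the hunk above; not confirmed by the project docs
RENAMED = {
    "bowtie_output": "indexing_output",
    "tophat_output": "alignment_output",
}


def find_legacy_keys(ini_text):
    """Return (section, old_key, new_key) for each old key lacking its replacement."""
    parser = ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    problems = []
    for section in parser.sections():
        for old_key, new_key in RENAMED.items():
            if parser.has_option(section, old_key) and not parser.has_option(section, new_key):
                problems.append((section, old_key, new_key))
    return problems


if __name__ == "__main__":
    sample = "[zma]\nbowtie_output=./output/bowtie-build/zma\nalignment_output=./tmp/tophat/zma\n"
    for section, old_key, new_key in find_legacy_keys(sample):
        print(f"[{section}] rename {old_key} to {new_key}")
```

`samtools_output` also disappears in this hunk without an obvious replacement, so it is left out of the mapping.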