Merge pull request #18 from proost/hisat2

Hisat2 support --> 1.3rc1
proost · Jul 26, 2017 · 375fc42
2 parents f25143a + 6c36e08
commit 375fc42
Showing 15 changed files with 481 additions and 208 deletions.
diff --git a/.gitignore b/.gitignore
@@ -58,5 +58,6 @@ target/
 .idea/
 .data/
 
+tmp/
 config.ini
 data.ini
diff --git a/README.md b/README.md
@@ -1,9 +1,16 @@
 # LSTrAP
 
-LSTrAP, shot for Large Scale Transcriptome Analysis Pipeline, greatly facilitates the construction of co-expression networks from
-RNA Seq data. The various tools involved are seamlessly connected and  CPU-intensive steps are submitted to a computer cluster 
+LSTrAP, short for Large Scale Transcriptome Analysis Pipeline, greatly facilitates the construction of co-expression networks from
+RNA-Seq data. The various tools involved are seamlessly connected and  CPU-intensive steps are submitted to a computer cluster 
 automatically. 
 
+## Version 1.3 Changelog
+
+  * Support for [PBS](https://en.wikipedia.org/wiki/Portable_Batch_System) / [Torque](http://www.adaptivecomputing.com/products/open-source/torque/) scheduler (note proper [configuration](./docs/configuration.md) is required)
+  * [HISAT2](https://ccb.jhu.edu/software/hisat2/index.shtml) can be used as an alternative to [BowTie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) and [TopHat 2](https://ccb.jhu.edu/software/tophat/index.shtml)
+  * Added [helper](./docs/helper.md) script to do PCA on samples
+  * **Parameter names in data.ini changed**
+
 ## Workflow
 
 LSTrAP wraps multiple existing tools into a single workflow. To use LSTrAP the following tools need to be installed
@@ -13,13 +20,10 @@ LSTrAP wraps multiple existing tools into a single workflow. To use LSTrAP the f
 Steps in bold are submitted to a cluster. Optional steps can be enabled by adding the flag *&#8209;&#8209;enable&#8209;interpro* and/or 
 *&#8209;&#8209;enable&#8209;orthology*.
 
-## Preparation
-
-LSTrAP is designed to run on an [Oracle Grid Engine](https://www.oracle.com/sun/index.html) computer cluster system and requires 
-all external tools to be installed on the compute nodes. The "module load" system is supported. A comprehensive list of all tools 
-necessary can be found  [here](docs/preparation.md). Instructions to run LSTrAP on other systems are provided below.
-
 ## Installation
+Before installing make sure your system meets all requirements. A detailed list of supported systems and required 
+software can be found [here](docs/preparation.md).
+
 
 Use git to obtain a copy of the LSTrAP code
 
@@ -31,84 +35,39 @@ Next, move into the directory and copy **config.template.ini** and **data.templa
     cp config.template.ini config.ini
     cp data.template.ini data.ini
 
-Configure config.ini and data.ini using the guidelines below
-
-## Configuration of LSTrAP
-
-After copying the templates, **config.ini** needs to be set up to run on your system. It requires the path to Trimmomatic's jar and the
-modules where Bowtie, Tophat ... are installed in.
-
-The location of the transcriptome data, the refrence genome and a few per-species options need to be defined in **data.ini**. 
-
-Detailed instruction how to set up both configuration files can be found [here](docs/configuration.md)
-
-## Obtaining and preparing data
+Configure config.ini and data.ini using these [guidelines](docs/configuration.md)
 
-Scripts to download and prepare data from the [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra) are included in
-LSTrAP in the folder **helper**. Furthermore, it is recommended to remove splice variants from the GFF3 files, a script
-to do that is included there as well. Detailed instructions for each script provided to obtain and prepare data can be
-found [here](docs/helper.md)
 
 ## Running LSTrAP
 
-Once properly configured for your system and data, LSTrAP can be run using a single simple command (that should be executed on the head node)
+Once properly configured for your system and data, LSTrAP can be run using a single simple command (that should be 
+executed on the head node).
 
     ./run.py config.ini data.ini
 
-Options to enable InterProScan and/or OrthoFinder or to skip certain steps of the pipeline are included, use the command below for more info
+Run using [HISAT2](https://ccb.jhu.edu/software/hisat2/index.shtml)
 
-    ./run.py -h
-
-## Quality report
-
-After running LSTrAP a log file (*lstrap.log*) is written, in which samples which failed a quality measure
-are reported. Note that no samples are excluded from the final network. In case certain samples need to be excluded
-from the final network remove the htseq file for the sample you which to exclude and re-run the pipeline skipping all
-steps prior to building the network.
-
-    ./run.py config.ini data.ini --skip-interpro --skip-orthology --skip-bowtie-build --skip-trim-fastq --skip-tophat --skip-htseq --skip-qc
-
-More information on how the quality of samples is determined can be found [here](docs/quality.md).
+    ./run.py --use-hisat2 config.ini data.ini
 
-## Output
+Run with InterProScan and/or OrthoFinder 
 
-Apart from the output all tools included generate, LSTrAP will generate raw and normalized expression matrices, a 
-co&#8209;expression network and co&#8209;expression clusters.
+    ./run.py --enable-orthology --enable-interproscan config.ini data.ini
 
-A detailed overview of files produces, including examples, can be found [here](docs/example_output.md).
+Furthermore, steps can be skipped (to avoid re-running steps unnecessarily). Use the command below for more info.
 
-## Helper Scripts
-
-LSTrAP comes with a few additional scripts to assist users to download and process data from the [Sequence Read Archive](http://www.ncbi.nlm.nih.gov/sra),
-repeat analyses and the case study reported in the manuscript (Proost et al., *under preparation*).
-
-Details for each script can be found [here](docs/helper.md)
+    ./run.py -h
 
-## Running LSTrAP on transcriptome data
+## Further reading
 
-To use LSTrAP on a *de novo* assembled transcriptome a little pre-processing is required. Instead of the genome a fasta 
-file containing **coding** sequences can be used (remove UTRs). Using the helper script fasta_to_gff.py a gff file suited
-for LSTrAP can be generated.
+  * [LSTrAP output](docs/example_output.md)
+  * [Quality statistics](docs/quality.md): How to check the quality of samples and remove problematic samples
+  * [Helper Scripts](docs/helper.md): To acquire data from the [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra)
+  and process results.
 
-    python3 fasta_to_gff.py /path/to/transcript.cds.fasta > output.gff
-
-## Adapting LSTrAP to other cluster managers
-
-LSTrAP is designed and tested on a cluster running the Oracle Grid Engine, though with minimal effort it can be adopted to run on PBS and Torque
-based systems (and likely others). First, in the configuration file, check the qsub parameters (e.g. jobs that require multiple
-CPUs to run *-pe cores 4*), that differ between systems are set up properly (the nodes and cores on Torque and PBS need to be 
-set using *-l nodes=4:ppn=2* to request 4 nodes with 2 processes per node). 
-
-Furthermore the submission script might differ, these are located in **./cluster/templates.py** . For PBS based systems some
-settings need to be included by adding *#PBS ...*. 
-
-We strive to get LSTrAP running on as many systems as possible. Do not hesitate to contact us in case you experience difficulties 
-running LSTrAP on your system.
-
 
 ## Contact
 
-LSTrAP was developed by [Sebastian Proost](mailto:proost@mpimp-golm.mpg.de) and [Marek Mutwil](mailto:mutwil@mpimp-golm.mpg.de) at the [Max-Planck Institute for Molecular Plant Physiology](http://www.mpimp-golm.mpg.de/2168/en)
+LSTrAP was developed by [Sebastian Proost](mailto:proost@mpimp-golm.mpg.de) and [Marek Mutwil](mailto:mutwil@gmail.com) at the [Max-Planck Institute for Molecular Plant Physiology](http://www.mpimp-golm.mpg.de/2168/en)
 
 ## Acknowledgements and Funding
 

diff --git a/cluster/__init__.py b/cluster/__init__.py
@@ -12,6 +12,7 @@ def detect_cluster_system():
 
     :return: string "SBE", "PBS" or "other"
     """
+
     try:
         which_output = check_output(["which", "sge_qmaster"], stderr=DEVNULL).decode("utf-8")
 

diff --git a/config.template.ini b/config.template.ini
@@ -1,52 +1,102 @@
 [TOOLS]
-; In case there is no module load system on the system set the module name to None
+; Tool Configuration
+;
+; Some tools require additional files or might require a hard coded path to the script.
+; Please make sure these are set up correctly.
+
 
 ; Trimmomatic Path
+; ADJUST THIS
 trimmomatic_path=/home/sepro/tools/Trimmomatic-0.36/trimmomatic-0.36.jar
 
-; Module names
-bowtie_module=biotools/bowtie2-2.2.6
-samtools_module=biotools/samtools-1.3
-sratoolkit_module=biotools/sratoolkit-2.5.7
-tophat_module=biotools/tophat-2.1.0
-
-interproscan_module=biotools/interproscan-5.16-55.0
-
-blast_module=biotools/ncbi-blast-2.3.0+
-mcl_module=biotools/mcl-14.137
+; COMMANDS to run tools
+;
+; Here the commands used to start different steps are defined, ${name} are variables that will be set by LSTrAP for
+; each job.
 
-python_module=devel/Python-2.7.10
-python3_module=devel/Python-3.5.1
-
-; commands to run tools
+; Note that in some cases hard coded paths were required, adjust these to match the location of these files on
+; your system
 bowtie_cmd=bowtie2-build ${in} ${out}
+hisat2_build_cmd=hisat2-build ${in} ${out}
 
+; ADJUST PATHS TO ADAPTERS
 trimmomatic_se_command=java -jar ${jar} SE -threads 1  ${in} ${out}  ILLUMINACLIP:/home/sepro/tools/Trimmomatic-0.36/adapters/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
 trimmomatic_pe_command=java -jar ${jar} PE -threads 1  ${ina} ${inb} ${outap} ${outau} ${outbp} ${outbu} ILLUMINACLIP:/home/sepro/tools/Trimmomatic-0.36/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
 
 tophat_se_cmd=tophat -p 3 -o ${out} ${genome} ${fq}
 tophat_pe_cmd=tophat -p 3 -o ${out} ${genome} ${forward},${reverse}
 
-htseq_count_cmd=htseq-count -s no -f bam -t ${feature} -i ${field} ${bam} ${gff} > ${out}
+hisat2_se_cmd=hisat2 -p 3 -x ${genome} -U ${fq} -S ${out} 2> ${stats}
+hisat2_pe_cmd=hisat2 -p 3 -x ${genome} -1 ${forward} -2 ${reverse} -S ${out} 2> ${stats}
+
+htseq_count_cmd=htseq-count -s no -f ${itype} -t ${feature} -i ${field} ${bam} ${gff} > ${out}
 
 interproscan_cmd=interproscan.sh -i ${in_dir}/${in_prefix}${SGE_TASK_ID} -o ${out_dir}/${out_prefix}${SGE_TASK_ID} -f tsv -dp -iprlookup -goterms --tempdir /tmp
 
 pcc_cmd=python3 ./scripts/pcc.py ${in} ${out} ${mcl_out}
 mcl_cmd=mcl ${in} --abc -o ${out} -te 4
 
+; ADJUST THIS
 mcxdeblast_cmd=perl /apps/biotools/mcl-14.137/bin/mcxdeblast --m9 --line-mode=abc ${blast_in} > ${abc_out}
 
+; ADJUST THIS
 orthofinder_cmd=python /home/sepro/OrthoFinder-0.4/orthofinder.py -f ${fasta_dir} -t 8
 
+; qsub parameters (OGE)
 
-; qsub parameters
-
-qsub_bowtie=''
+qsub_indexing=''
 qsub_trimmomatic=''
 qsub_tophat='-pe cores 4'
 qsub_htseq_count=''
 qsub_interproscan='-pe cores 5'
 qsub_pcc=''
 qsub_mcl='-pe cores 4'
 qsub_orthofinder='-pe cores 8'
-qsub_mcxdeblast=''
+qsub_mcxdeblast=''
+
+; qsub parameters (PBS/Torque)
+
+; qsub_indexing=''
+; qsub_trimmomatic=''
+; qsub_tophat='-l nodes=1:ppn=4'
+; qsub_htseq_count=''
+; qsub_interproscan='-l nodes=1:ppn=5'
+; qsub_pcc=''
+; qsub_mcl='-l nodes=1:ppn=4'
+; qsub_orthofinder='-l nodes=1:ppn=8'
+; qsub_mcxdeblast=''
+
+; qsub parameters (PBS/Torque with walltimes)
+
+; qsub_indexing='-l walltime=00:10:00'
+; qsub_trimmomatic='-l walltime=00:10:00'
+; qsub_tophat='-l nodes=1:ppn=4  -l walltime=00:10:00'
+; qsub_htseq_count=' -l walltime=00:02:00'
+; qsub_interproscan='-l nodes=1:ppn=5  -l walltime=00:10:00'
+; qsub_pcc=' -l walltime=00:10:00'
+; qsub_mcl='-l nodes=1:ppn=4  -l walltime=00:10:00'
+; qsub_orthofinder='-l nodes=1:ppn=8  -l walltime=01:00:00'
+; qsub_mcxdeblast='-l walltime=00:10:00'
+
+; Module names
+; These need to be configured if the required tools are installed in the environment modules.
+; You can find the modules installed on your system using
+;
+;       module avail
+;
+; In case there is no module load system on the system set the module name to None
+
+bowtie_module=biotools/bowtie2-2.2.6
+samtools_module=biotools/samtools-1.3
+sratoolkit_module=biotools/sratoolkit-2.5.7
+tophat_module=biotools/tophat-2.1.0
+
+hisat2_module=
+
+interproscan_module=biotools/interproscan-5.16-55.0
+
+blast_module=biotools/ncbi-blast-2.3.0+
+mcl_module=biotools/mcl-14.137
+
+python_module=devel/Python-2.7.10
+python3_module=devel/Python-3.5.1
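
Reviewer note: the `${name}` placeholders in the command templates above are filled in per job. This diff does not show how LSTrAP performs the substitution, so the sketch below only assumes it behaves like Python's `string.Template`. The helper name `fill_command` and the sample paths are hypothetical; only the `hisat2_se_cmd` template is copied from the hunk.

```python
from string import Template

# Command template copied verbatim from config.template.ini (hisat2_se_cmd).
HISAT2_SE_CMD = "hisat2 -p 3 -x ${genome} -U ${fq} -S ${out} 2> ${stats}"


def fill_command(template, values):
    """Fill every ${name} placeholder in a command template.

    Hypothetical helper for illustration; substitute() raises KeyError when a
    placeholder has no value, which surfaces typos in a hand-edited template.
    """
    return Template(template).substitute(values)


# Hypothetical per-job values; in LSTrAP these would be derived from data.ini.
job_values = {
    "genome": "./output/bowtie-build/zma",
    "fq": "./output/trimmed_fastq/zma/sample1.fq.gz",
    "out": "./tmp/tophat/zma/sample1.sam",
    "stats": "./tmp/tophat/zma/sample1.stats",
}

command = fill_command(HISAT2_SE_CMD, job_values)
print(command)
```

Printing the filled-in command for one sample before submitting a full run is a cheap way to check that an edited template still contains every placeholder it needs.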
diff --git a/data.template.ini b/data.template.ini
@@ -23,10 +23,9 @@ fastq_dir=./data/zma/fastq
 tophat_cutoff=65
 htseq_cutoff=40
 
-bowtie_output=./output/bowtie-build/zma
+indexing_output=./output/bowtie-build/zma
 trimmomatic_output=./output/trimmed_fastq/zma
-tophat_output=./output/tophat/zma
-samtools_output=./output/samtools/zma
+alignment_output=./tmp/tophat/zma
 htseq_output=./output/htseq/zma
 
 exp_matrix_output=./output/zma/exp_matrix.txt
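
Reviewer note: since the changelog flags that parameter names in data.ini changed, existing data.ini files need updating. The sketch below shows one way to apply the renames visible in this hunk (`bowtie_output` to `indexing_output`, `tophat_output` to `alignment_output`, `samtools_output` dropped) using the standard-library `configparser`. It is not part of the commit; the function name `migrate_data_ini` and the `[zma]` section name are assumptions made for illustration.

```python
import configparser

# Renames taken from the data.template.ini hunk above.
RENAMED = {"bowtie_output": "indexing_output", "tophat_output": "alignment_output"}
# Removed in the hunk without a replacement key.
REMOVED = ("samtools_output",)


def migrate_data_ini(text):
    """Return (parser, changes) for an old-style data.ini passed as a string."""
    # interpolation=None leaves any '%' or '$' characters in paths untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    changes = []
    for section in parser.sections():
        for old, new in RENAMED.items():
            # Never overwrite a value the user already set under the new name.
            if parser.has_option(section, old) and not parser.has_option(section, new):
                parser.set(section, new, parser.get(section, old))
                parser.remove_option(section, old)
                changes.append((section, old, new))
        for old in REMOVED:
            if parser.remove_option(section, old):
                changes.append((section, old, None))
    return parser, changes


# Hypothetical old-style file; the section name is an assumption.
old_ini = """[zma]
bowtie_output=./output/bowtie-build/zma
tophat_output=./output/tophat/zma
samtools_output=./output/samtools/zma
htseq_output=./output/htseq/zma
"""

parser, changes = migrate_data_ini(old_ini)
for section, old, new in changes:
    print(section, old, "->", new if new else "(removed)")
```

The sketch keeps the old paths as they were; the template in this commit also moves the alignment output to `./tmp/tophat/zma`, which a migration would have to apply separately if desired. Writing the result back with `parser.write()` discards `;` comments, so editing the file by hand may be preferable for heavily annotated configs.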