diff --git a/.gitignore b/.gitignore index 563f048..73a6536 100644 --- a/.gitignore +++ b/.gitignore @@ -58,5 +58,6 @@ target/ .idea/ .data/ +tmp/ config.ini data.ini \ No newline at end of file diff --git a/README.md b/README.md index 4135c52..cac9d3e 100644 --- a/README.md +++ b/README.md @@ -1,9 +1,16 @@ # LSTrAP -LSTrAP, shot for Large Scale Transcriptome Analysis Pipeline, greatly facilitates the construction of co-expression networks from -RNA Seq data. The various tools involved are seamlessly connected and CPU-intensive steps are submitted to a computer cluster +LSTrAP, short for Large Scale Transcriptome Analysis Pipeline, greatly facilitates the construction of co-expression networks from +RNA-Seq data. The various tools involved are seamlessly connected and CPU-intensive steps are submitted to a computer cluster automatically. +## Version 1.3 Changelog + + * Support for [PBS](https://en.wikipedia.org/wiki/Portable_Batch_System) / [Torque](http://www.adaptivecomputing.com/products/open-source/torque/) scheduler (note proper [configuration](./docs/configuration.md) is required) + * [HISAT2](https://ccb.jhu.edu/software/hisat2/index.shtml) can be used as an alternative to [BowTie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) and [TopHat 2](https://ccb.jhu.edu/software/tophat/index.shtml) + * Added [helper](./docs/helper.md) script to do PCA on samples + * **Parameter names in data.ini changed** + ## Workflow LSTrAP wraps multiple existing tools into a single workflow. To use LSTrAP the following tools need to be installed @@ -13,13 +20,10 @@ LSTrAP wraps multiple existing tools into a single workflow. To use LSTrAP the f Steps in bold are submitted to a cluster. Optional steps can be enabled by adding the flag *‑‑enable‑interpro* and/or *‑‑enable‑orthology*. 
-## Preparation - -LSTrAP is designed to run on an [Oracle Grid Engine](https://www.oracle.com/sun/index.html) computer cluster system and requires -all external tools to be installed on the compute nodes. The "module load" system is supported. A comprehensive list of all tools -necessary can be found [here](docs/preparation.md). Instructions to run LSTrAP on other systems are provided below. - ## Installation +Before installing make sure your system meets all requirements. A detailed list of supported systems and required +software can be found [here](docs/preparation.md). + Use git to obtain a copy of the LSTrAP code @@ -31,84 +35,39 @@ Next, move into the directory and copy **config.template.ini** and **data.templa cp config.template.ini config.ini cp data.template.ini data.ini -Configure config.ini and data.ini using the guidelines below - -## Configuration of LSTrAP - -After copying the templates, **config.ini** needs to be set up to run on your system. It requires the path to Trimmomatic's jar and the -modules where Bowtie, Tophat ... are installed in. - -The location of the transcriptome data, the refrence genome and a few per-species options need to be defined in **data.ini**. - -Detailed instruction how to set up both configuration files can be found [here](docs/configuration.md) - -## Obtaining and preparing data +Configure config.ini and data.ini using these [guidelines](docs/configuration.md) -Scripts to download and prepare data from the [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra) are included in -LSTrAP in the folder **helper**. Furthermore, it is recommended to remove splice variants from the GFF3 files, a script -to do that is included there as well. 
Detailed instructions for each script provided to obtain and prepare data can be -found [here](docs/helper.md) ## Running LSTrAP -Once properly configured for your system and data, LSTrAP can be run using a single simple command (that should be executed on the head node) +Once properly configured for your system and data, LSTrAP can be run using a single simple command (that should be +executed on the head node). ./run.py config.ini data.ini -Options to enable InterProScan and/or OrthoFinder or to skip certain steps of the pipeline are included, use the command below for more info +Run using [HISAT2](https://ccb.jhu.edu/software/hisat2/index.shtml) - ./run.py -h - -## Quality report - -After running LSTrAP a log file (*lstrap.log*) is written, in which samples which failed a quality measure -are reported. Note that no samples are excluded from the final network. In case certain samples need to be excluded -from the final network remove the htseq file for the sample you which to exclude and re-run the pipeline skipping all -steps prior to building the network. - - ./run.py config.ini data.ini --skip-interpro --skip-orthology --skip-bowtie-build --skip-trim-fastq --skip-tophat --skip-htseq --skip-qc - -More information on how the quality of samples is determined can be found [here](docs/quality.md). + ./run.py --use-hisat2 config.ini data.ini -## Output +Run with InterProScan and/or OrthoFinder -Apart from the output all tools included generate, LSTrAP will generate raw and normalized expression matrices, a -co‑expression network and co‑expression clusters. + ./run.py --enable-orthology --enable-interproscan config.ini data.ini -A detailed overview of files produces, including examples, can be found [here](docs/example_output.md). +Furthermore, steps can be skipped (to avoid re-running steps unnecessarily). Use the command below for more info. 
-## Helper Scripts - -LSTrAP comes with a few additional scripts to assist users to download and process data from the [Sequence Read Archive](http://www.ncbi.nlm.nih.gov/sra), -repeat analyses and the case study reported in the manuscript (Proost et al., *under preparation*). - -Details for each script can be found [here](docs/helper.md) + ./run.py -h -## Running LSTrAP on transcriptome data +## Further reading -To use LSTrAP on a *de novo* assembled transcriptome a little pre-processing is required. Instead of the genome a fasta -file containing **coding** sequences can be used (remove UTRs). Using the helper script fasta_to_gff.py a gff file suited -for LSTrAP can be generated. + * [LSTrAP output](docs/example_output.md) + * [Quality statistics](docs/quality.md): How to check the quality of samples and remove problematic samples + * [Helper Scripts](docs/helper.md): To acquire data from the [Sequence Read Archive](https://www.ncbi.nlm.nih.gov/sra) + and process results. - python3 fasta_to_gff.py /path/to/transcript.cds.fasta > output.gff - -## Adapting LSTrAP to other cluster managers - -LSTrAP is designed and tested on a cluster running the Oracle Grid Engine, though with minimal effort it can be adopted to run on PBS and Torque -based systems (and likely others). First, in the configuration file, check the qsub parameters (e.g. jobs that require multiple -CPUs to run *-pe cores 4*), that differ between systems are set up properly (the nodes and cores on Torque and PBS need to be -set using *-l nodes=4:ppn=2* to request 4 nodes with 2 processes per node). - -Furthermore the submission script might differ, these are located in **./cluster/templates.py** . For PBS based systems some -settings need to be included by adding *#PBS ...*. - -We strive to get LSTrAP running on as many systems as possible. Do not hesitate to contact us in case you experience difficulties -running LSTrAP on your system. 
- ## Contact -LSTrAP was developed by [Sebastian Proost](mailto:proost@mpimp-golm.mpg.de) and [Marek Mutwil](mailto:mutwil@mpimp-golm.mpg.de) at the [Max-Planck Institute for Molecular Plant Physiology](http://www.mpimp-golm.mpg.de/2168/en) +LSTrAP was developed by [Sebastian Proost](mailto:proost@mpimp-golm.mpg.de) and [Marek Mutwil](mailto:mutwil@gmail.com) at the [Max-Planck Institute for Molecular Plant Physiology](http://www.mpimp-golm.mpg.de/2168/en) ## Acknowledgements and Funding diff --git a/cluster/__init__.py b/cluster/__init__.py index bb8bc36..302f089 100644 --- a/cluster/__init__.py +++ b/cluster/__init__.py @@ -12,6 +12,7 @@ def detect_cluster_system(): :return: string "SBE", "PBS" or "other" """ + try: which_output = check_output(["which", "sge_qmaster"], stderr=DEVNULL).decode("utf-8") diff --git a/config.template.ini b/config.template.ini index 302d7dc..955a2ab 100644 --- a/config.template.ini +++ b/config.template.ini @@ -1,47 +1,50 @@ [TOOLS] -; In case there is no module load system on the system set the module name to None +; Tool Configuration +; +; Some tools require additional files or might require a hard coded path to the script. +; Please make sure these are set up correctly. + ; Trimmomatic Path +; ADJUST THIS trimmomatic_path=/home/sepro/tools/Trimmomatic-0.36/trimmomatic-0.36.jar -; Module names -bowtie_module=biotools/bowtie2-2.2.6 -samtools_module=biotools/samtools-1.3 -sratoolkit_module=biotools/sratoolkit-2.5.7 -tophat_module=biotools/tophat-2.1.0 - -interproscan_module=biotools/interproscan-5.16-55.0 - -blast_module=biotools/ncbi-blast-2.3.0+ -mcl_module=biotools/mcl-14.137 +; COMMANDS to run tools +; +; Here the commands used to start different steps are defined, ${name} are variables that will be set by LSTrAP for +; each job. 
-python_module=devel/Python-2.7.10 -python3_module=devel/Python-3.5.1 - -; commands to run tools +; Note that in some cases hard coded paths were required, adjust these to match the location of these files on +; your system bowtie_cmd=bowtie2-build ${in} ${out} +hisat2_build_cmd=hisat2-build ${in} ${out} +; ADJUST PATHS TO ADAPTERS trimmomatic_se_command=java -jar ${jar} SE -threads 1 ${in} ${out} ILLUMINACLIP:/home/sepro/tools/Trimmomatic-0.36/adapters/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 trimmomatic_pe_command=java -jar ${jar} PE -threads 1 ${ina} ${inb} ${outap} ${outau} ${outbp} ${outbu} ILLUMINACLIP:/home/sepro/tools/Trimmomatic-0.36/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 tophat_se_cmd=tophat -p 3 -o ${out} ${genome} ${fq} tophat_pe_cmd=tophat -p 3 -o ${out} ${genome} ${forward},${reverse} -htseq_count_cmd=htseq-count -s no -f bam -t ${feature} -i ${field} ${bam} ${gff} > ${out} +hisat2_se_cmd=hisat2 -p 3 -x ${genome} -U ${fq} -S ${out} 2> ${stats} +hisat2_pe_cmd=hisat2 -p 3 -x ${genome} -1 ${forward} -2 ${reverse} -S ${out} 2> ${stats} + +htseq_count_cmd=htseq-count -s no -f ${itype} -t ${feature} -i ${field} ${bam} ${gff} > ${out} interproscan_cmd=interproscan.sh -i ${in_dir}/${in_prefix}${SGE_TASK_ID} -o ${out_dir}/${out_prefix}${SGE_TASK_ID} -f tsv -dp -iprlookup -goterms --tempdir /tmp pcc_cmd=python3 ./scripts/pcc.py ${in} ${out} ${mcl_out} mcl_cmd=mcl ${in} --abc -o ${out} -te 4 +; ADJUST THIS mcxdeblast_cmd=perl /apps/biotools/mcl-14.137/bin/mcxdeblast --m9 --line-mode=abc ${blast_in} > ${abc_out} +; ADJUST THIS orthofinder_cmd=python /home/sepro/OrthoFinder-0.4/orthofinder.py -f ${fasta_dir} -t 8 +; qsub parameters (OGE) -; qsub parameters - -qsub_bowtie='' +qsub_indexing='' qsub_trimmomatic='' qsub_tophat='-pe cores 4' qsub_htseq_count='' @@ -49,4 +52,51 @@ qsub_interproscan='-pe cores 5' qsub_pcc='' qsub_mcl='-pe cores 4' qsub_orthofinder='-pe cores 8' -qsub_mcxdeblast='' \ 
No newline at end of file +qsub_mcxdeblast='' + +; qsub parameters (PBS/Torque) + +; qsub_indexing='' +; qsub_trimmomatic='' +; qsub_tophat='-l nodes=1:ppn=4' +; qsub_htseq_count='' +; qsub_interproscan='-l nodes=1:ppn=5' +; qsub_pcc='' +; qsub_mcl='-l nodes=1:ppn=4' +; qsub_orthofinder='-l nodes=1:ppn=8' +; qsub_mcxdeblast='' + +; qsub parameters (PBS/Torque with walltimes) + +; qsub_indexing='-l walltime=00:10:00' +; qsub_trimmomatic='-l walltime=00:10:00' +; qsub_tophat='-l nodes=1:ppn=4 -l walltime=00:10:00' +; qsub_htseq_count=' -l walltime=00:02:00' +; qsub_interproscan='-l nodes=1:ppn=5 -l walltime=00:10:00' +; qsub_pcc=' -l walltime=00:10:00' +; qsub_mcl='-l nodes=1:ppn=4 -l walltime=00:10:00' +; qsub_orthofinder='-l nodes=1:ppn=8 -l walltime=01:00:00' +; qsub_mcxdeblast='-l walltime=00:10:00' + +; Module names +; These need to be configured if the required tools are installed as environment modules. +; You can find the modules installed on your system using +; +; module avail +; +; In case there is no module load system available, set the module name to None + +bowtie_module=biotools/bowtie2-2.2.6 +samtools_module=biotools/samtools-1.3 +sratoolkit_module=biotools/sratoolkit-2.5.7 +tophat_module=biotools/tophat-2.1.0 + +hisat2_module= + +interproscan_module=biotools/interproscan-5.16-55.0 + +blast_module=biotools/ncbi-blast-2.3.0+ +mcl_module=biotools/mcl-14.137 + +python_module=devel/Python-2.7.10 +python3_module=devel/Python-3.5.1 \ No newline at end of file diff --git a/data.template.ini b/data.template.ini index 0b27fc6..e7e7597 100644 --- a/data.template.ini +++ b/data.template.ini @@ -23,10 +23,9 @@ fastq_dir=./data/zma/fastq tophat_cutoff=65 htseq_cutoff=40 -bowtie_output=./output/bowtie-build/zma +indexing_output=./output/bowtie-build/zma trimmomatic_output=./output/trimmed_fastq/zma -tophat_output=./output/tophat/zma -samtools_output=./output/samtools/zma +alignment_output=./tmp/tophat/zma htseq_output=./output/htseq/zma 
exp_matrix_output=./output/zma/exp_matrix.txt diff --git a/docs/configuration.md b/docs/configuration.md index cb5efee..f8637c1 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -18,7 +18,9 @@ Additional parameters can be added to the qsub commands at the bottom, this allows users to submit jobs to specific queues, with specific options, ... Furthermore, while the template is designed for Oracle/Sun Grid Engine this can be set up to work with other job management systems -such as PBS and Torque. +such as PBS and Torque. At the bottom section there is an example on how to +set the number of nodes/cores on PBS/Torque and how to add a walltime (if +required). **Match the number of cores** to the number of cores the job needs. When starting TopHat with **-p 3** the job will require 4 cores (3 worker @@ -29,49 +31,52 @@ Example config.ini: ```ini [TOOLS] -; In case there is no module load system on the system set the module name to None - -trimmomatic_path=/home/sepro/tools/Trimmomatic-0.36/trimmomatic-0.36.jar - -; Module names -bowtie_module=biotools/bowtie2-2.2.6 -samtools_module=biotools/samtools-1.3 -sratoolkit_module=biotools/sratoolkit-2.5.7 -tophat_module=biotools/tophat-2.1.0 +; Tool Configuration +; +; Some tools require additional files or might require a hard coded path to the script. +; Please make sure these are set up correctly. -interproscan_module=biotools/interproscan-5.16-55.0 -blast_module=biotools/ncbi-blast-2.3.0+ -mcl_module=biotools/mcl-14.137 +; Trimmomatic Path +; ADJUST THIS +trimmomatic_path=/home/sepro/tools/Trimmomatic-0.36/trimmomatic-0.36.jar -python_module=devel/Python-2.7.10 -python3_module=devel/Python-3.5.1 +; COMMANDS to run tools +; +; Here the commands used to start different steps are defined, ${name} are variables that will be set by LSTrAP for +; each job. 
-; commands to run tools +; Note that in some cases hard coded paths were required, adjust these to match the location of these files on +; your system bowtie_cmd=bowtie2-build ${in} ${out} +hisat2_build_cmd=hisat2-build ${in} ${out} +; ADJUST PATHS TO ADAPTERS trimmomatic_se_command=java -jar ${jar} SE -threads 1 ${in} ${out} ILLUMINACLIP:/home/sepro/tools/Trimmomatic-0.36/adapters/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 trimmomatic_pe_command=java -jar ${jar} PE -threads 1 ${ina} ${inb} ${outap} ${outau} ${outbp} ${outbu} ILLUMINACLIP:/home/sepro/tools/Trimmomatic-0.36/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 tophat_se_cmd=tophat -p 3 -o ${out} ${genome} ${fq} tophat_pe_cmd=tophat -p 3 -o ${out} ${genome} ${forward},${reverse} -samtools_cmd=samtools view -h -o ${out} ${bam} -htseq_count_cmd=htseq-count -s no -t ${feature} -i ${field} ${sam} ${gff} > ${out} +hisat2_se_cmd=hisat2 -p 3 -x ${genome} -U ${fq} -S ${out} 2> ${stats} +hisat2_pe_cmd=hisat2 -p 3 -x ${genome} -1 ${forward} -2 ${reverse} -S ${out} 2> ${stats} + +htseq_count_cmd=htseq-count -s no -f ${itype} -t ${feature} -i ${field} ${bam} ${gff} > ${out} interproscan_cmd=interproscan.sh -i ${in_dir}/${in_prefix}${SGE_TASK_ID} -o ${out_dir}/${out_prefix}${SGE_TASK_ID} -f tsv -dp -iprlookup -goterms --tempdir /tmp pcc_cmd=python3 ./scripts/pcc.py ${in} ${out} ${mcl_out} mcl_cmd=mcl ${in} --abc -o ${out} -te 4 +; ADJUST THIS mcxdeblast_cmd=perl /apps/biotools/mcl-14.137/bin/mcxdeblast --m9 --line-mode=abc ${blast_in} > ${abc_out} +; ADJUST THIS orthofinder_cmd=python /home/sepro/OrthoFinder-0.4/orthofinder.py -f ${fasta_dir} -t 8 +; qsub parameters (OGE) -; qsub parameters - -qsub_bowtie='' +qsub_indexing='' qsub_trimmomatic='' qsub_tophat='-pe cores 4' qsub_htseq_count='' @@ -81,6 +86,53 @@ qsub_mcl='-pe cores 4' qsub_orthofinder='-pe cores 8' qsub_mcxdeblast='' +; qsub parameters (PBS/Torque) + +; qsub_indexing='' +; qsub_trimmomatic='' 
+; qsub_tophat='-l nodes=1:ppn=4' +; qsub_htseq_count='' +; qsub_interproscan='-l nodes=1:ppn=5' +; qsub_pcc='' +; qsub_mcl='-l nodes=1:ppn=4' +; qsub_orthofinder='-l nodes=1:ppn=8' +; qsub_mcxdeblast='' + +; qsub parameters (PBS/Torque with walltimes) + +; qsub_indexing='-l walltime=00:10:00' +; qsub_trimmomatic='-l walltime=00:10:00' +; qsub_tophat='-l nodes=1:ppn=4 -l walltime=00:10:00' +; qsub_htseq_count=' -l walltime=00:02:00' +; qsub_interproscan='-l nodes=1:ppn=5 -l walltime=00:10:00' +; qsub_pcc=' -l walltime=00:10:00' +; qsub_mcl='-l nodes=1:ppn=4 -l walltime=00:10:00' +; qsub_orthofinder='-l nodes=1:ppn=8 -l walltime=01:00:00' +; qsub_mcxdeblast='-l walltime=00:10:00' + +; Module names +; These need to be configured if the required tools are installed as environment modules. +; You can find the modules installed on your system using +; +; module avail +; +; In case there is no module load system available, set the module name to None + +bowtie_module=biotools/bowtie2-2.2.6 +samtools_module=biotools/samtools-1.3 +sratoolkit_module=biotools/sratoolkit-2.5.7 +tophat_module=biotools/tophat-2.1.0 + +hisat2_module= + +interproscan_module=biotools/interproscan-5.16-55.0 + +blast_module=biotools/ncbi-blast-2.3.0+ +mcl_module=biotools/mcl-14.137 + +python_module=devel/Python-2.7.10 +python3_module=devel/Python-3.5.1 + ``` ## data.ini @@ -90,7 +142,7 @@ Example data.ini file: ```ini [GLOBAL] ; add all genomes, use semi-colons to separate multiple e.g. 
zma;ath -genomes=zma;ath +genomes=zma ; enter email to receive status updates from the cluster ; setting the email to None will disable this @@ -100,10 +152,10 @@ email=None orthofinder_output=./output/orthofinder [zma] -cds_fasta=./data/zma/cds.fasta -protein_fasta=./data/zma/pep.fasta -genome_fasta=./data/zma/genome.fasta -gff_file=./data/zma/genes.gff3 +cds_fasta= +protein_fasta= +genome_fasta= +gff_file= gff_feature=CDS gff_id=Parent @@ -113,10 +165,9 @@ fastq_dir=./data/zma/fastq tophat_cutoff=65 htseq_cutoff=40 -bowtie_output=./output/bowtie-build/zma +indexing_output=./output/bowtie-build/zma trimmomatic_output=./output/trimmed_fastq/zma -tophat_output=./output/tophat/zma -samtools_output=./output/samtools/zma +alignment_output=./tmp/tophat/zma htseq_output=./output/htseq/zma exp_matrix_output=./output/zma/exp_matrix.txt @@ -126,32 +177,5 @@ interpro_output=./output/interpro/zma pcc_output=./output/zma/pcc.std.txt pcc_mcl_output=./output/zma/pcc.mcl.txt - -[ath] -cds_fasta=./data/ath/cds.fasta -protein_fasta=./data/ath/pep.fasta -genome_fasta=./data/ath/genome.fasta -gff_file=./data/ath/genes.gff3 - -gff_feature=CDS -gff_id=Parent - -fastq_dir=./data/ath/fastq - -tophat_cutoff=65 -htseq_cutoff=40 - -bowtie_output=./output/bowtie-build/ath -trimmomatic_output=./output/trimmed_fastq/ath -tophat_output=./output/tophat/ath -samtools_output=./output/samtools/ath -htseq_output=./output/htseq/ath - -exp_matrix_output=./output/ath/exp_matrix.txt -exp_matrix_tpm_output=./output/ath/exp_matrix.tpm.txt -exp_matrix_rpkm_output=./output/ath/exp_matrix.rpkm.txt -interpro_output=./output/interpro/ath - -pcc_output=./output/ath/pcc.std.txt -pcc_mcl_output=./output/ath/pcc.mcl.txt +mcl_cluster_output=./output/zma/mcl.clusters.txt ``` diff --git a/docs/helper.md b/docs/helper.md index 8db0ab7..5d854fb 100644 --- a/docs/helper.md +++ b/docs/helper.md @@ -18,6 +18,11 @@ Script to convert sra files into fastq. Sratools is required. 
python3 sra_to_fastq.py /sra/files/directory /fastq/output/directory +## Running LSTrAP on transcriptome data + +To use LSTrAP on a *de novo* assembled transcriptome a little pre-processing is required. Instead of the genome a fasta +file containing **coding** sequences can be used (remove UTRs). Using the helper script fasta_to_gff.py a gff file suited +for LSTrAP can be generated. ### parse_gff.py @@ -82,10 +87,17 @@ on a normalized expression matrix. ![matrix example](images/matrix.png "Sample distance heatmap (with hierarchical clustering)") - + +### pca_plot.py + +Script to perform a PCA analysis on any expression matrix. + + python3 pca_plot.py ./data/sbi.expression.matrix.tpm.txt ### pca_powerlaw.py +*This script and the required data are included to recreate results from the manuscript (Proost et al., under review)* + Script to perform a PCA analysis on the *Sorghum bicolor* data (case study) and draw the node degree distribution. The required data is included here as well. Note that this script requires sklearn and seaborn. diff --git a/docs/images/LSTrAP_workflow.png b/docs/images/LSTrAP_workflow.png index 38d38c1..8fde6e7 100644 Binary files a/docs/images/LSTrAP_workflow.png and b/docs/images/LSTrAP_workflow.png differ diff --git a/docs/preparation.md b/docs/preparation.md index 0915462..6b86d93 100644 --- a/docs/preparation.md +++ b/docs/preparation.md @@ -1,23 +1,29 @@ # Preparing your system -LSTrAP is designed to run on the head node of a Oracle Grid Engine cluster. Apart from a running compute cluster, the essential -tools need to be installed. A full list is provided below, tools can be installed on the grid nodes directly or inside modules. -When opting for the latter, the configuration file needs to contain the exact names of the modules containing the tools. 
+LSTrAP is designed with High Performance Computing in mind and requires a computer cluster running +[Oracle Grid Engine](https://www.oracle.com/sun/index.html) or [PBS](https://en.wikipedia.org/wiki/Portable_Batch_System) +/ [Torque](http://www.adaptivecomputing.com/products/open-source/torque/). Furthermore, the essential +tools (see below) need to be installed prior to running LSTrAP. +Use of [Environment modules](http://modules.sourceforge.net/) is supported; in that case the configuration file +needs to contain the exact names of the modules containing the tools. +## Required Tools * [Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) * [TopHat](https://ccb.jhu.edu/software/tophat/manual.shtml) + * [HISAT2](http://ccb.jhu.edu/software/hisat2/index.shtml) * [Samtools](http://www.htslib.org/) * [SRAtools](http://ncbi.github.io/sra-tools/) - * [Python 2.7](https://www.python.org/download/releases/2.7/) + [HTSeq](http://www-huber.embl.de/users/anders/HTSeq/doc/index.html) + all dependencies (including [PySam](https://github.com/pysam-developers/pysam)) + * [HTSeq](http://www-huber.embl.de/users/anders/HTSeq/doc/index.html) + all dependencies (including [PySam](https://github.com/pysam-developers/pysam)) * [Python 3.5](https://www.python.org/download/releases/3.5.1/) + SciPy + [Numpy](http://www.numpy.org/) - * [InterProScan](https://www.ebi.ac.uk/interpro/) - * [OrthoFinder](https://github.com/davidemms/OrthoFinder) * [MCL](http://www.micans.org/mcl/index.html?sec_software) * [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic) -Optional tools +## Optional tools + * [InterProScan](https://www.ebi.ac.uk/interpro/) + * [OrthoFinder](https://github.com/davidemms/OrthoFinder) * [Python 2.7](https://www.python.org/download/releases/2.7/) (for OrthoFinder) * [scikit-learn](http://scikit-learn.org/) for Python 3, required for PCA analysis (helper script) * [seaborn](https://stanford.edu/~mwaskom/software/seaborn/) for Python 3, required for 
PCA analysis (helper script) * [Aspera connect client](http://downloads.asperasoft.com/en/downloads/2), required for the *get_sra_ip.py* (helper script) \ No newline at end of file diff --git a/helper/pca_plot.py b/helper/pca_plot.py index 3444afd..21d5a7e 100644 --- a/helper/pca_plot.py +++ b/helper/pca_plot.py @@ -16,6 +16,7 @@ def run_pca(expression): # Load Expression data + df = pd.read_table(expression, header=0, index_col=0) run_ids = list(df.columns.values) dataMatrix = np.transpose(np.array(df)) diff --git a/pipeline/base.py b/pipeline/base.py index 2da5288..e707d3f 100644 --- a/pipeline/base.py +++ b/pipeline/base.py @@ -7,11 +7,11 @@ class PipelineBase: - def __init__(self, config, data, enable_log=False): + def __init__(self, config, data, enable_log=False, use_hisat2=False): """ Constructor run with path to ini file with settings - :param config: path to setttings ini file + :param config: path to settings ini file """ self.cp = configparser.ConfigParser() self.cp.read(config) @@ -24,6 +24,7 @@ def __init__(self, config, data, enable_log=False): self.blast_module = None if self.cp['TOOLS']['blast_module'] is 'None' else self.cp['TOOLS']['blast_module'] self.bowtie_module = None if self.cp['TOOLS']['bowtie_module'] is 'None' else self.cp['TOOLS']['bowtie_module'] self.tophat_module = '' if self.cp['TOOLS']['tophat_module'] is 'None' else self.cp['TOOLS']['tophat_module'] + self.hisat2_module = '' if self.cp['TOOLS']['hisat2_module'] == 'None' else self.cp['TOOLS']['hisat2_module'] self.samtools_module = None if self.cp['TOOLS']['samtools_module'] is 'None' else self.cp['TOOLS']['samtools_module'] self.python_module = None if self.cp['TOOLS']['python_module'] is 'None' else self.cp['TOOLS']['python_module'] self.python3_module = None if self.cp['TOOLS']['python3_module'] is 'None' else self.cp['TOOLS']['python3_module'] @@ -31,19 +32,26 @@ def __init__(self, config, data, enable_log=False): self.mcl_module = None if self.cp['TOOLS']['mcl_module'] is 
'None' else self.cp['TOOLS']['mcl_module'] self.bowtie_build_cmd = self.cp['TOOLS']['bowtie_cmd'] + self.hisat2_build_cmd = self.cp['TOOLS']['hisat2_build_cmd'] + self.trimmomatic_se_cmd = self.cp['TOOLS']['trimmomatic_se_command'] self.trimmomatic_pe_cmd = self.cp['TOOLS']['trimmomatic_pe_command'] + self.tophat_se_cmd = self.cp['TOOLS']['tophat_se_cmd'] self.tophat_pe_cmd = self.cp['TOOLS']['tophat_pe_cmd'] + self.hisat2_se_cmd = self.cp['TOOLS']['hisat2_se_cmd'] + self.hisat2_pe_cmd = self.cp['TOOLS']['hisat2_pe_cmd'] + self.htseq_count_cmd = self.cp['TOOLS']['htseq_count_cmd'] - self.interproscan_cmd = self.cp['TOOLS']['interproscan_cmd'] - self.orthofinder_cmd = self.cp['TOOLS']['orthofinder_cmd'] self.pcc_cmd = self.cp['TOOLS']['pcc_cmd'] self.mcl_cmd = self.cp['TOOLS']['mcl_cmd'] self.mcxdeblast_cmd = self.cp['TOOLS']['mcxdeblast_cmd'] - self.qsub_bowtie = shlex.split(self.cp['TOOLS']['qsub_bowtie'].strip('\'')) + self.interproscan_cmd = self.cp['TOOLS']['interproscan_cmd'] + self.orthofinder_cmd = self.cp['TOOLS']['orthofinder_cmd'] + + self.qsub_indexing = shlex.split(self.cp['TOOLS']['qsub_indexing'].strip('\'')) self.qsub_trimmomatic = shlex.split(self.cp['TOOLS']['qsub_trimmomatic'].strip('\'')) self.qsub_tophat = shlex.split(self.cp['TOOLS']['qsub_tophat'].strip('\'')) self.qsub_htseq_count = shlex.split(self.cp['TOOLS']['qsub_htseq_count'].strip('\'')) @@ -57,6 +65,7 @@ def __init__(self, config, data, enable_log=False): self.email = None if self.dp['GLOBAL']['email'] == 'None' else self.cp['DEFAULT']['email'] self.enable_log = enable_log + self.use_hisat2 = use_hisat2 if self.enable_log: self.log = open('lstrap.log', 'w') diff --git a/pipeline/check/quality.py b/pipeline/check/quality.py index 3f6eff8..1097238 100644 --- a/pipeline/check/quality.py +++ b/pipeline/check/quality.py @@ -25,7 +25,34 @@ def check_tophat(filename, cutoff=0, log=None): return True else: if log is not None: - print('WARNING:', filename, 'didn\'t pass TopHat Quality check!', 
value, 'reads mapped. Cutoff,', + print('WARNING:', filename, 'didn\'t pass alignment check!', value, 'reads mapped. Cutoff,', + cutoff, file=log) + + return False + + +def check_hisat2(filename, cutoff=0, log=None): + """ + Checks the alignment summary of HISAT2's output, if it passes it returns true, else false + Optionally information can be written to a log file + + :param filename: .stats file (HISAT2's alignment summary) to check + :param cutoff: If the percentage of mapped reads is below this the sample won't pass (default= 0, no check) + :param log: filehandle to write log to, set to None for no log + :return: True if the sample passed, false otherwise + """ + re_mapped = re.compile('\t(.*)% overall alignment rate') + + with open(filename, 'r') as f: + lines = '\t'.join(f.readlines()) + hits = re_mapped.search(lines) + if hits: + value = float(hits.group(1)) + if value >= cutoff: + return True + else: + if log is not None: + print('WARNING:', filename, 'didn\'t pass alignment check!', value, 'reads mapped. 
Cutoff,', cutoff, file=log)
        return False
diff --git a/pipeline/check/sanity.py b/pipeline/check/sanity.py
index 14f7e1f..83f1871 100644
--- a/pipeline/check/sanity.py
+++ b/pipeline/check/sanity.py
@@ -21,7 +21,7 @@ def check_sanity_data(filename):
     genomes = cp['GLOBAL']['genomes'].split(';')
     # For each genome test that section
     required_keys = ['cds_fasta', 'protein_fasta', 'genome_fasta', 'gff_file', 'gff_feature', 'gff_id',
-                     'fastq_dir', 'bowtie_output', 'trimmomatic_output', 'tophat_output',
+                     'fastq_dir', 'indexing_output', 'trimmomatic_output', 'alignment_output',
                      'htseq_output', 'exp_matrix_output', 'exp_matrix_tpm_output', 'exp_matrix_rpkm_output',
                      'interpro_output', 'pcc_output', 'pcc_mcl_output', 'mcl_cluster_output']
     required_paths = ['cds_fasta', 'protein_fasta', 'genome_fasta', 'gff_file', 'fastq_dir']
@@ -69,8 +69,9 @@ def check_sanity_config(filename):
                      'blast_module', 'mcl_module', 'python_module', 'python3_module', 'bowtie_cmd',
                      'trimmomatic_se_command', 'trimmomatic_pe_command', 'tophat_se_cmd', 'tophat_pe_cmd',
                      'htseq_count_cmd', 'interproscan_cmd', 'pcc_cmd', 'mcl_cmd', 'orthofinder_cmd', 'mcxdeblast_cmd',
-                     'trimmomatic_path', 'qsub_bowtie', 'qsub_trimmomatic', 'qsub_tophat', 'qsub_htseq_count',
-                     'qsub_interproscan', 'qsub_pcc', 'qsub_mcl', 'qsub_orthofinder', 'qsub_mcxdeblast']
+                     'trimmomatic_path', 'qsub_indexing', 'qsub_trimmomatic', 'qsub_tophat', 'qsub_htseq_count',
+                     'qsub_interproscan', 'qsub_pcc', 'qsub_mcl', 'qsub_orthofinder', 'qsub_mcxdeblast',
+                     'hisat2_se_cmd', 'hisat2_pe_cmd']
     required_paths = ['trimmomatic_path']

     if 'TOOLS' in cp:
diff --git a/pipeline/transcriptome.py b/pipeline/transcriptome.py
index a1dd12f..edccb99 100644
--- a/pipeline/transcriptome.py
+++ b/pipeline/transcriptome.py
@@ -6,7 +6,7 @@
 from cluster import wait_for_job
 from utils.matrix import read_matrix, write_matrix, normalize_matrix_counts, normalize_matrix_length
 from .base import PipelineBase
-from .check.quality import check_tophat, check_htseq
+from .check.quality import check_tophat, check_hisat2, check_htseq

 class TranscriptomePipeline(PipelineBase):
@@ -17,19 +17,25 @@ class TranscriptomePipeline(PipelineBase):
     def prepare_genome(self):
         """
         Runs bowtie-build for each genome on the cluster. All settings are obtained from the settings fasta file
         """
-        filename, jobname = self.write_submission_script("bowtie_build_%d",
-                                                         self.bowtie_module,
-                                                         self.bowtie_build_cmd,
-                                                         "bowtie_build_%d.sh")
+        if self.use_hisat2:
+            filename, jobname = self.write_submission_script("build_index_%d",
+                                                             self.hisat2_module,
+                                                             self.hisat2_build_cmd,
+                                                             "build_index_%d.sh")
+        else:
+            filename, jobname = self.write_submission_script("build_index_%d",
+                                                             self.bowtie_module,
+                                                             self.bowtie_build_cmd,
+                                                             "build_index_%d.sh")

         for g in self.genomes:
             con_file = self.dp[g]['genome_fasta']
-            output = self.dp[g]['bowtie_output']
+            output = self.dp[g]['indexing_output']
             os.makedirs(os.path.dirname(output), exist_ok=True)
             shutil.copy(con_file, output + '.fa')
-            command = ["qsub"] + self.qsub_bowtie + ["-v", "in=" + con_file + ",out=" + output, filename]
+            command = ["qsub"] + self.qsub_indexing + ["-v", "in=" + con_file + ",out=" + output, filename]
             subprocess.call(command)
@@ -135,7 +141,7 @@
         print("Done\n\n")

-    def run_tophat(self, overwrite=False, keep_previous=False):
+    def __run_tophat(self, overwrite=False, keep_previous=False):
         """
         Maps the reads from the trimmed fastq files to the bowtie-indexed genome
@@ -155,8 +161,8 @@
         print('Mapping reads with tophat...')

         for g in self.genomes:
-            tophat_output = self.dp[g]['tophat_output']
-            bowtie_output = self.dp[g]['bowtie_output']
+            tophat_output = self.dp[g]['alignment_output']
+            bowtie_output = self.dp[g]['indexing_output']
             trimmed_fastq_dir = self.dp[g]['trimmomatic_output']
             os.makedirs(tophat_output, exist_ok=True)
@@ -215,9 +221,107 @@
         # remove OUT_ files
         PipelineBase.clean_out_files(jobname)

+    def __run_hisat2(self, overwrite=False, keep_previous=False):
+        """
+        Maps the reads from the trimmed fastq files to the HISAT2-indexed genome
+
+        :param overwrite: when true the pipeline will start HISAT2 even if the output exists
+        :param keep_previous: when true trimmed fastq files will not be removed after HISAT2 completes
+        """
+        filename_se, jobname = self.write_submission_script("hisat2_%d",
+                                                            self.hisat2_module,
+                                                            self.hisat2_se_cmd,
+                                                            "hisat2_se_%d.sh")
+
+        filename_pe, jobname = self.write_submission_script("hisat2_%d",
+                                                            self.hisat2_module,
+                                                            self.hisat2_pe_cmd,
+                                                            "hisat2_pe_%d.sh")
+
+        print('Mapping reads with HISAT2...')
+
+        for g in self.genomes:
+            alignment_output = self.dp[g]['alignment_output']
+            indexing_output = self.dp[g]['indexing_output']
+            trimmed_fastq_dir = self.dp[g]['trimmomatic_output']
+            os.makedirs(alignment_output, exist_ok=True)
+
+            pe_files = []
+            se_files = []
+
+            for file in os.listdir(trimmed_fastq_dir):
+                if file.endswith('.paired.fq.gz') or file.endswith('.paired.fastq.gz'):
+                    pe_files.append(file)
+                elif not (file.endswith('.unpaired.fq.gz') or file.endswith('.unpaired.fastq.gz')):
+                    se_files.append(file)
+
+            # sort required to make sure _1 files are before _2
+            pe_files.sort()
+            se_files.sort()
+
+            for pe_file in pe_files:
+                if '_1.trimmed.paired.' in pe_file:
+                    pair_file = pe_file.replace('_1.trimmed.paired.', '_2.trimmed.paired.')
+
+                    output_sam = pe_file.replace('_1.trimmed.paired.fq.gz', '').replace('_1.trimmed.paired.fastq.gz', '') + '.sam'
+                    output_stats = pe_file.replace('_1.trimmed.paired.fq.gz', '').replace('_1.trimmed.paired.fastq.gz', '') + '.stats'
+                    output_sam = os.path.join(alignment_output, output_sam)
+                    output_stats = os.path.join(alignment_output, output_stats)
+
+                    forward = os.path.join(trimmed_fastq_dir, pe_file)
+                    reverse = os.path.join(trimmed_fastq_dir, pair_file)
+
+                    if overwrite or not os.path.exists(output_sam):
+                        print('Submitting pair %s, %s' % (pe_file, pair_file))
+                        command = ["qsub"] + self.qsub_tophat + \
+                                  ["-v", "out=%s,genome=%s,forward=%s,reverse=%s,stats=%s" %
+                                   (output_sam, indexing_output, forward, reverse, output_stats), filename_pe]
+                        subprocess.call(command)
+                    else:
+                        print('Output exists, skipping', pe_file)
+
+            for se_file in se_files:
+                output_sam = se_file.replace('.trimmed.fq.gz', '').replace('.trimmed.fastq.gz', '') + '.sam'
+                output_sam = os.path.join(alignment_output, output_sam)
+
+                output_stats = se_file.replace('.trimmed.fq.gz', '').replace('.trimmed.fastq.gz', '') + '.stats'
+                output_stats = os.path.join(alignment_output, output_stats)
+
+                if overwrite or not os.path.exists(output_sam):
+                    print('Submitting single %s' % se_file)
+                    command = ["qsub"] + self.qsub_tophat + ["-v",
+                                                             "out=%s,genome=%s,fq=%s,stats=%s" %
+                                                             (output_sam, indexing_output,
+                                                              os.path.join(trimmed_fastq_dir, se_file), output_stats),
+                                                             filename_se]
+                    subprocess.call(command)
+                else:
+                    print('Output exists, skipping', se_file)
+
+        # wait for all jobs to complete
+        wait_for_job(jobname, sleep_time=1)
+
+        # remove the submission script
+        os.remove(filename_se)
+        os.remove(filename_pe)
+
+        # remove OUT_ files
+        PipelineBase.clean_out_files(jobname)
+
+    def run_alignment(self, overwrite=False, keep_previous=False):
+        """
+        Determine which aligner to use and align reads to the indexed genome
+
+        :param overwrite: will overwrite existing data when True, otherwise existing runs will be skipped
+        :param keep_previous: will keep trimmed reads upon completion when true,
+        otherwise the trimmed reads will be deleted
+        """
+        if self.use_hisat2:
+            self.__run_hisat2(overwrite=overwrite, keep_previous=keep_previous)
+        else:
+            self.__run_tophat(overwrite=overwrite, keep_previous=keep_previous)

         print("Done\n\n")

-    def run_htseq_count(self, keep_previous=False):
+    def __run_htseq_count_tophat(self, keep_previous=False):
         """
         Based on the gff file and sam file counts the number of reads that map to a given gene
@@ -229,7 +333,7 @@
                                                          "htseq_count_%d.sh")

         for g in self.genomes:
-            tophat_output = self.dp[g]['tophat_output']
+            tophat_output = self.dp[g]['alignment_output']
             htseq_output = self.dp[g]['htseq_output']
             os.makedirs(htseq_output, exist_ok=True)
@@ -248,7 +352,9 @@
                 htseq_out = os.path.join(htseq_output, d + '.htseq')
                 print(d, bam_file, htseq_out)

-                command = ["qsub"] + self.qsub_htseq_count + ["-v", "feature=%s,field=%s,bam=%s,gff=%s,out=%s" % (gff_feature, gff_id, bam_file, gff_file, htseq_out), filename]
+                command = ["qsub"] + self.qsub_htseq_count + ["-v", "itype=bam,feature=%s,field=%s,bam=%s,gff=%s,out=%s"
+                                                              % (gff_feature, gff_id, bam_file, gff_file, htseq_out),
+                                                              filename]
                 subprocess.call(command)

         # wait for all jobs to complete
@@ -258,7 +364,7 @@
         # NOTE: only the large bam file is removed (for now)
         if not keep_previous:
             for g in self.genomes:
-                tophat_output = self.dp[g]['tophat_output']
+                tophat_output = self.dp[g]['alignment_output']
                 dirs = [o for o in os.listdir(tophat_output) if os.path.isdir(os.path.join(tophat_output, o))]
                 for d in dirs:
                     bam_file = os.path.join(tophat_output, d, 'accepted_hits.bam')
@@ -271,6 +377,65 @@
         # remove OUT_ files
         PipelineBase.clean_out_files(jobname)

+    def __run_htseq_count_hisat2(self, keep_previous=False):
+        filename, jobname = self.write_submission_script("htseq_count_%d",
+                                                         (self.samtools_module + '\t' + self.python_module),
+                                                         self.htseq_count_cmd,
+                                                         "htseq_count_%d.sh")
+
+        for g in self.genomes:
+            alignment_output = self.dp[g]['alignment_output']
+            htseq_output = self.dp[g]['htseq_output']
+            os.makedirs(htseq_output, exist_ok=True)
+
+            gff_file = self.dp[g]['gff_file']
+            gff_feature = self.dp[g]['gff_feature']
+            gff_id = self.dp[g]['gff_id']
+
+            sam_files = [o for o in os.listdir(alignment_output) if os.path.isfile(os.path.join(alignment_output, o)) and
+                         o.endswith('.sam')]
+
+            for sam_file in sam_files:
+                htseq_out = os.path.join(htseq_output, sam_file.replace('.sam', '.htseq'))
+                print(sam_file, htseq_out)
+
+                command = ["qsub"] + self.qsub_htseq_count + ["-v", "itype=sam,feature=%s,field=%s,bam=%s,gff=%s,out=%s"
+                                                              % (gff_feature, gff_id,
+                                                                 os.path.join(alignment_output, sam_file),
+                                                                 gff_file, htseq_out),
+                                                              filename]
+                subprocess.call(command)
+
+        # wait for all jobs to complete
+        wait_for_job(jobname, sleep_time=1)
+
+        if not keep_previous:
+            for g in self.genomes:
+                alignment_output = self.dp[g]['alignment_output']
+                sam_files = [os.path.join(alignment_output, o) for o in os.listdir(alignment_output) if
+                             os.path.isfile(os.path.join(alignment_output, o)) and
+                             o.endswith('.sam')]
+                for sam_file in sam_files:
+                    if os.path.exists(sam_file):
+                        os.remove(sam_file)
+
+        # remove the submission script
+        os.remove(filename)
+
+        # remove OUT_ files
+        PipelineBase.clean_out_files(jobname)
+
+    def run_htseq_count(self, keep_previous=False):
+        """
+        Depending on which aligner was used, run htseq-count to determine expression levels.
+
+        :param keep_previous: when true sam files will not be removed after htseq-count completes
+        """
+        if self.use_hisat2:
+            self.__run_htseq_count_hisat2(keep_previous=keep_previous)
+        else:
+            self.__run_htseq_count_tophat(keep_previous=keep_previous)

         print("Done\n\n")

     def check_quality(self):
@@ -278,28 +443,40 @@
         Function that checks tophat and htseq quality and throws warnings if insufficient reads map. If the log file
         is enabled it writes more detailed statistics there.
         """
-        print("Checking quality of samples based on TopHat and HTSEQ-Count mapping statistics")
+        print("Checking quality of samples based on TopHat 2/HISAT2 and HTSEQ-Count mapping statistics")

         for g in self.genomes:
-            tophat_output = self.dp[g]['tophat_output']
+            alignment_output = self.dp[g]['alignment_output']
             htseq_output = self.dp[g]['htseq_output']

-            dirs = [o for o in os.listdir(tophat_output) if os.path.isdir(os.path.join(tophat_output, o))]
-            summary_files = []
-            for d in dirs:
-                summary_file = os.path.join(tophat_output, d, 'align_summary.txt')
-                if os.path.exists(summary_file):
-                    summary_files.append((d, summary_file))
-
-            htseq_files = [os.path.join(htseq_output, f) for f in os.listdir(htseq_output) if f.endswith('.htseq')]
+            if self.use_hisat2:
+                stats_files = [os.path.join(alignment_output, o) for o in os.listdir(alignment_output) if
+                               os.path.isfile(os.path.join(alignment_output, o)) and
+                               o.endswith('.stats')]
+
+                for stats_file in stats_files:
+                    cutoff = int(self.dp[g]['tophat_cutoff']) if 'tophat_cutoff' in self.dp[g] else 0
+                    passed = check_hisat2(stats_file, cutoff=cutoff, log=self.log)
+
+                    if not passed:
+                        print('WARNING: sample with insufficient quality (HISAT2) detected:', stats_file, file=sys.stderr)
+                        print('WARNING: check the log for additional information', file=sys.stderr)
+            else:
+                dirs = [o for o in os.listdir(alignment_output) if os.path.isdir(os.path.join(alignment_output, o))]
+                summary_files = []
+                for d in dirs:
+                    summary_file = os.path.join(alignment_output, d, 'align_summary.txt')
+                    if os.path.exists(summary_file):
+                        summary_files.append((d, summary_file))

-            for (d, s) in summary_files:
-                cutoff = int(self.dp[g]['tophat_cutoff']) if 'tophat_cutoff' in self.dp[g] else 0
-                passed = check_tophat(s, cutoff=cutoff, log=self.log)
+                for (d, s) in summary_files:
+                    cutoff = int(self.dp[g]['tophat_cutoff']) if 'tophat_cutoff' in self.dp[g] else 0
+                    passed = check_tophat(s, cutoff=cutoff, log=self.log)

-                if not passed:
-                    print('WARNING: sample with insufficient quality (TopHat) detected:', d, file=sys.stderr)
-                    print('WARNING: check the log for additional information', file=sys.stderr)
+                    if not passed:
+                        print('WARNING: sample with insufficient quality (TopHat) detected:', d, file=sys.stderr)
+                        print('WARNING: check the log for additional information', file=sys.stderr)

+            # Check HTSeq-Counts
+            htseq_files = [os.path.join(htseq_output, f) for f in os.listdir(htseq_output) if f.endswith('.htseq')]
             for h in htseq_files:
                 cutoff = int(self.dp[g]['htseq_cutoff']) if 'htseq_cutoff' in self.dp[g] else 0
                 passed = check_htseq(h, cutoff=cutoff, log=self.log)
@@ -426,9 +603,8 @@
                                                          "cluster_pcc_%d.sh")

         for g in self.genomes:
-            # TODO naming convention here is confusing, improve this !
-            mcl_out = self.dp[g]['pcc_mcl_output']
-            mcl_clusters = self.dp[g]['mcl_cluster_output']
+            mcl_out = self.dp[g]['pcc_mcl_output']  # This is the PCC table in mcl format
+            mcl_clusters = self.dp[g]['mcl_cluster_output']  # Desired path for the clusters

             command = ["qsub"] + self.qsub_mcl + ["-v", "in=%s,out=%s" % (mcl_out, mcl_clusters), filename]
             subprocess.call(command)
diff --git a/run.py b/run.py
index e73a4ec..474e040 100755
--- a/run.py
+++ b/run.py
@@ -16,22 +16,25 @@ def run_pipeline(args):
     """
     if check_sanity_config(args.config) and check_sanity_data(args.data):
         if args.transcriptomics:
-            tp = TranscriptomePipeline(args.config, args.data, enable_log=args.enable_log)
+            tp = TranscriptomePipeline(args.config,
+                                       args.data,
+                                       enable_log=args.enable_log,
+                                       use_hisat2=args.use_hisat2)

-            if args.bowtie_build:
+            if args.indexing:
                 tp.prepare_genome()
             else:
-                print("Skipping Bowtie-build", file=sys.stderr)
+                print("Skipping Indexing", file=sys.stderr)

             if args.trim_fastq:
                 tp.trim_fastq()
             else:
                 print("Skipping Trimmomatic", file=sys.stderr)

-            if args.tophat:
-                tp.run_tophat(keep_previous=args.keep_intermediate)
+            if args.alignment:
+                tp.run_alignment(keep_previous=args.keep_intermediate)
             else:
-                print("Skipping Tophat", file=sys.stderr)
+                print("Skipping Alignment", file=sys.stderr)

             if args.htseq:
                 tp.run_htseq_count(keep_previous=args.keep_intermediate)
@@ -95,9 +98,11 @@
     parser.add_argument('--enable-interpro', dest='interpro', action='store_true', help='Runs InterProScan to detect protein domains and provide functional annotation')
     parser.add_argument('--enable-orthology', dest='orthology', action='store_true', help='Runs OrthoFinder and MCL to detect orthogroups, gene families and orthologs')

-    parser.add_argument('--skip-bowtie-build', dest='bowtie_build', action='store_false', help='add --skip-bowtie-build to skip the step that indexes the genomes using bowtie-build')
+    parser.add_argument('--use-hisat2', dest='use_hisat2', action='store_true', help='Use HISAT2 to build the index and align reads instead of BowTie2 and TopHat2')
+
+    parser.add_argument('--skip-indexing', dest='indexing', action='store_false', help='add --skip-indexing to skip building an index (for read alignment) on the genome (BowTie2 or HISAT2)')
     parser.add_argument('--skip-trim-fastq', dest='trim_fastq', action='store_false', help='add --skip-trim-fastq to skip trimming fastq files using trimmomatic')
-    parser.add_argument('--skip-tophat', dest='tophat', action='store_false', help='add --skip-tophat to skip read mapping with tophat')
+    parser.add_argument('--skip-alignment', dest='alignment', action='store_false', help='add --skip-alignment to skip the read alignment step (TopHat 2 or HISAT2)')
     parser.add_argument('--skip-htseq', dest='htseq', action='store_false', help='add --skip-htseq to skip counting reads per gene with htseq-count')
     parser.add_argument('--skip-qc', dest='qc', action='store_false', help='add --skip-qc to skip quality control of tophat and htseq output')
     parser.add_argument('--skip-exp-matrix', dest='exp_matrix', action='store_false', help='add --skip-exp-matrix to skip converting htseq files to an expression matrix')
@@ -116,10 +121,12 @@
     parser.set_defaults(interpro=False)
     parser.set_defaults(orthology=False)
+    parser.set_defaults(use_hisat2=False)
+
     # Flags for individual tools for transcriptomics
-    parser.set_defaults(bowtie_build=True)
+    parser.set_defaults(indexing=True)
     parser.set_defaults(trim_fastq=True)
-    parser.set_defaults(tophat=True)
+    parser.set_defaults(alignment=True)
     parser.set_defaults(htseq=True)
     parser.set_defaults(qc=True)
     parser.set_defaults(exp_matrix=True)
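The paired/single-end classification and mate pairing used in `__run_hisat2` above can be sketched standalone. This is a minimal illustration of the suffix checks and the `_1`/`_2` replacement trick; the accession names below are hypothetical examples, not pipeline output:

```python
def classify_trimmed(files):
    """Split trimmed fastq filenames into paired-end and single-end lists,
    mirroring the suffix checks in __run_hisat2."""
    pe, se = [], []
    for f in files:
        if f.endswith(('.paired.fq.gz', '.paired.fastq.gz')):
            pe.append(f)
        elif not f.endswith(('.unpaired.fq.gz', '.unpaired.fastq.gz')):
            se.append(f)
    # sorting guarantees _1 mates precede their _2 mates
    return sorted(pe), sorted(se)


def mate_of(forward):
    """Derive the reverse-mate filename from a forward read file."""
    return forward.replace('_1.trimmed.paired.', '_2.trimmed.paired.')


files = ['SRR001_1.trimmed.paired.fq.gz', 'SRR001_2.trimmed.paired.fq.gz',
         'SRR002.trimmed.fq.gz', 'SRR003.trimmed.unpaired.fq.gz']
pe, se = classify_trimmed(files)
# pe keeps both mates of SRR001; SRR002 is single-end; unpaired leftovers are dropped
```

Note that `.unpaired.fq.gz` does not match the `.paired.fq.gz` suffix (the preceding character is `n`, not `.`), so the order of the two checks is safe.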
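One subtlety in the SAM cleanup of `__run_htseq_count_hisat2`: the list comprehension's expression must yield the file paths (`os.path.join(...)`), not the booleans returned by `os.path.isfile(...)`, or the subsequent `os.remove` calls have nothing to delete. A minimal demonstration in a throwaway directory:

```python
import os
import tempfile

# Create a temporary directory with a mix of alignment outputs
tmp = tempfile.mkdtemp()
for name in ('a.sam', 'b.sam', 'c.stats'):
    open(os.path.join(tmp, name), 'w').close()

# Buggy form: the expression yields True/False, not filenames
booleans = [os.path.isfile(os.path.join(tmp, o)) for o in os.listdir(tmp)
            if os.path.isfile(os.path.join(tmp, o)) and o.endswith('.sam')]

# Correct form: collect full paths so they can be passed to os.remove later
sam_files = [os.path.join(tmp, o) for o in os.listdir(tmp)
             if os.path.isfile(os.path.join(tmp, o)) and o.endswith('.sam')]
```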