TOuCAN: Targeted chrOmatin Capture ANalysis - A Nextflow Pipeline

TOuCAN is a Nextflow pipeline for analysing Targeted Chromatin Capture (T2C) and High-throughput Chromatin Capture (HiC) experiments. The basic analysis steps are taken from the original pipeline (Petros Kolovos et al.). TOuCAN combines these steps into an easy-to-use pipeline and adds further features for the analysis.

Features

T2C Analysis:

- MultiPlot
  - Interaction matrix
  - TAD Boundary score
  - Gene annotation
- bed file with all interactions
  - raw/normalized
- bed file with all interactions inside the target region
  - with and without uropa annotation
- easy to read viewpoint format for annotated interactions (see Results for explanation)
- restriction maps for different runs on the genome

HiC Analysis:

- Plots
  - Interaction matrix
  - more will be implemented

MultiPlot Example

Note: Green marks values that exceed the maximum of the score scale.

Installation and Command-line usage

Dependencies

- Linux
- Nextflow version >= 0.30
- R version 3.4.4
  - ggplot2
  - plyr
  - gridExtra
  - gtable
  - RColorBrewer
  - getopt
  - ggbio
  - optparse
- Python version 2.7.8
  - pysam
  - getopt
- HiCExplorer version 2.1
- Conda:
  - bowtie2 version 2.3.3.1*
  - bwa version 0.7.15*
  - SAMtools version 1.3.1*
  - BEDtools version 2.27.1*
  - uropa version 2.0.2 alpha*

* Installed automatically through the conda environment.

Installation

To install Nextflow, follow the instructions on the official Nextflow website. After installing all dependencies, download TOuCAN from the TOuCAN GitHub page. Then add all required parameters to the configuration file. Please check the following link for detailed information about the configuration file setup.

Usage

To run the pipeline, use the following command line. Parameters with default values are optional.

Usage: nextflow run TOuCAN.nf --in [Input Path] --out [Output Path] --mode [Mode] [options]

--mode help, h					- For showing this help message

--mode plot					- Plot data [currently only T2C plots]
	parameters:
		--path_matrix [PATH]	   	- Path to directory with *.normalized.bed files.
		--chr [chr1,chr2,...,chrY]	- On which chromosome is the target region.
		--start [INT]			- Start of target region.
		--end [INT]			- End of target region.
		--score_min [INT]		- Score range: minimum. [default: 0]
		--score_max [INT]	        - Score range: maximum. [default: autoscale]
        	--pn [STRING]               	- Name of the Project [default: 'Project']

--mode T2C				        - Full T2C analysis
	parameters:
		--in [PATH]		        - Path to directory with fastq / fastq.gz files.
        	--bam [PATH]                	- Path to directory with bam files. [if given --in [PATH] is ignored]
		--out [PATH]			- Path to output directory.
		--safe_all_files [0|1]	   	- If 1, saves all temporary files into "OUTPUT/02_analysis/". [default: 0]
		--check_res_maps [0|1]	    	- If 1 prints first 5 lines of every file from restriction maps. [def.: 0]
		--chr [chr1,chr2,...,chrY]	- On which chromosome is the target region.
		--start [INT]			- Start of target region.
		--end [INT]			- End of target region.
		--score_min [INT]		- Score range: minimum. [default: 0]
		--score_max [INT]		- Score range: maximum. [default: autoscale]
        	--pn [STRING]               	- Name of the Project [default: 'Project']
		--organsim [mm10,mm9,hg19]      - Type of the genome

--mode uropa			   	        - Uropa annotation [T2C]
	parameters:
		--in [PATH]			- Path to directory with *.normalized.bed
		--out [PATH]			- Path to output directory.
		--chr [chr1,chr2,...,chrY]	- On which chromosome is the target region.
		--start [INT]			- Start of target region.
		--end [INT]			- End of target region.
        	--pn [STRING]               	- Name of the Project [default: 'Project']

--mode multiplot			        - creating a plot with interaction map, TAD graph and gene annotation
	parameters:
		--in [PATH]			- Path to directory with *.normalized.bed
		--out [PATH]			- Path to output directory.
		--chr [chr1,chr2,...,chrY]	- On which chromosome is the target region.
		--start [INT]			- Start of target region.
		--end [INT]			- End of target region.
       		--score_min [INT]		- Score range: minimum. [default: 0]
       		--score_max [INT]		- Score range: maximum. [default: autoscale]
        	--pn [STRING]              	- Name of the Project [default: 'Project']
		--organsim [mm10,mm9,hg19]      - Type of the genome

--mode HiC 					- Full HiC analysis
	parameters:
		--in [PATH]			- Path to directory with fastq / fastq.gz files.
		--out [PATH]			- Path to output directory.
		--aln [bwa|bowtie2]		- Choose alignment tool. [default: bwa]
		--bin [INT]			- Binsize [default: 10000]

Skip alignment -> BAM files as input:
The BAM files must have been produced by this pipeline and must have the
following file extension: "[NAME].(normalized|matrix).bam" !

Skip creating restriction maps:
After creating the restriction maps, write their path into the config file to skip
creating the restriction maps again. [path_T2C_restriction_maps]
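
As a rough sketch of the BAM shortcut described above, the following script checks that the BAM files in a directory follow the required naming scheme and then prints a T2C command that uses `--bam` instead of `--in`. The directory `/example/bam`, the target region, and the helper `is_pipeline_bam` are assumptions made for illustration; they are not part of TOuCAN.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a file name follows
# "[NAME].(normalized|matrix).bam", fails otherwise.
is_pipeline_bam() {
  case "$(basename "$1")" in
    ?*.normalized.bam|?*.matrix.bam) return 0 ;;
    *) return 1 ;;
  esac
}

# Assumed location of BAM files from an earlier TOuCAN run.
bam_dir="/example/bam"

for f in "$bam_dir"/*.bam; do
  [ -e "$f" ] || continue   # directory empty or missing: nothing to check
  is_pipeline_bam "$f" || echo "unexpected BAM name: $f" >&2
done

# Dry run: prints the command; remove 'echo' to execute it.
echo nextflow run TOuCAN.nf --mode T2C --bam "$bam_dir" --out /out/ \
  --chr chr1 --start 123000000 --end 125000000
```

Files reported as unexpected would need to be regenerated by the pipeline or left out, since the help text requires BAM input to come from TOuCAN itself.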

Input

HiC and T2C experiments produce two fastq files for each sample: a forward and a reverse read file. Both files must share the same basename. To tell them apart, each basename has to end with a 'sample extension'.
For example:
sample1_R1.fastq
sample1_R2.fastq
You have to pass the extension as a parameter, either on the command line or in the configuration file. In this case, it would look like this:

params{
	sample_extension = "_R[12]" // in the configuration file
}

nextflow run TOuCAN.nf ... --sample_extension '_R[12]' # as command line parameter (quoted so the shell does not expand the brackets)
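
To illustrate how a sample extension ties the two read files of a sample together, here is a minimal sketch that strips the extension and the fastq suffix from a file name. It assumes that the extension can be read as a regular expression; how TOuCAN interprets the pattern internally is not documented here, and `sample_base` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: derive the shared basename of a read file.
# $1 = fastq file, $2 = sample extension, treated as a regular
# expression (an assumption for this sketch).
sample_base() {
  basename "$1" | sed -E "s/$2\.fastq(\.gz)?\$//"
}

sample_base /example/fastq/sample1_R1.fastq '_R[12]'      # prints sample1
sample_base /example/fastq/sample1_R2.fastq.gz '_R[12]'   # prints sample1
```

Both files reduce to the same basename, which is what marks them as one sample.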

Simple Example

The fastq files are stored in '/example/fastq/'. The target region is on chromosome 1 from base 123000000 to base 125000000. The command for the T2C analysis would look like this:

nextflow run TOuCAN.nf --mode T2C --in /example/fastq/ --out /out/ --chr chr1 --start 123000000 --end 125000000
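
The same call can be wrapped in a small script that checks the target region before a long run starts and adds some of the optional parameters from the help text. This is only a sketch: the project name `Example` and the score maximum `100` are illustrative values, not recommendations.

```shell
#!/bin/sh
# Target region from the example above.
chr="chr1"
start=123000000
end=125000000

# Basic sanity check before starting a long run.
if [ "$start" -ge "$end" ]; then
  echo "start must be smaller than end" >&2
  exit 1
fi

region_size=$((end - start))
echo "target region: $chr:$start-$end ($region_size bp)"

# Dry run: prints the command; remove 'echo' to execute it.
echo nextflow run TOuCAN.nf --mode T2C \
  --in /example/fastq/ --out /out/ \
  --chr "$chr" --start "$start" --end "$end" \
  --pn Example --score_max 100
```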

Results

For a detailed explanation of all Results, follow this link.