masterJLU2018

De novo motif discovery and evaluation based on footprints identified by TOBIAS.

For further information read the documentation.

Dependencies

conda

Installation

Start by installing conda and downloading all files from the GitHub repository.
All other dependencies are installed automatically by conda. To do so, create a conda environment from the yaml file provided in this repository and activate it before running the pipeline. This can be done with the following commands:

conda env create -f masterenv.yml
conda activate masterenv

Set the wd parameter in the nextflow.config file to the path where the repository is saved, for example '~/masterJLU2018/'.

Important Notes:

The pipeline requires the jellyfish package from the bioconda channel, NOT the jellyfish package from the conda-forge channel. Please make sure that the correct jellyfish package is installed.
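One way to check which jellyfish build ended up in the environment is to inspect the output of `conda list`. The helper below is a minimal sketch, not part of the pipeline: the function name `jellyfish_from_bioconda` is hypothetical, and it assumes that the installed conda version prints the source channel as the last column of each package line.

```shell
# Hypothetical helper (not part of the pipeline): reads `conda list`-style
# output on stdin and succeeds only if a jellyfish entry names bioconda
# as its channel in the last column.
jellyfish_from_bioconda() {
    awk '$1 == "jellyfish" { found = 1; ok = ($NF == "bioconda") }
         END { exit !(found && ok) }'
}

# Usage, inside the activated environment:
#   conda list jellyfish | jellyfish_from_bioconda \
#       && echo "jellyfish comes from bioconda" \
#       || echo "jellyfish is missing or comes from another channel"
```

If the check fails, reinstalling the package with an explicit channel (for example `conda install -c bioconda jellyfish`) is the usual remedy.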

Quick Start

nextflow run pipeline.nf --bigwig [BigWig-file] --bed [BED-file] --genome_fasta [FASTA-file] --motif_db [MEME-file] --organism [mm10|mm9|hg19|hg38] --gtf_annotation [GTF-file]

Demo run

Files for a demo run are provided in ./demo/. Go to the main directory and run the following command:

nextflow run pipeline.nf --bigwig ./demo/buenrostro50k_chr1_fp.bw --bed ./demo/buenrostro50k_chr1_peaks.bed --genome_fasta ./demo/hg38_chr1.fa --motif_db ./demo/jaspar_vertebrates.meme --out ./demo/buenrostro50k_chr1_out/ --organism hg38 --gtf_annotation ./demo/homo_sapiens.94.mainChr.gtf
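Because the command takes several input paths, it can help to confirm that they all exist before starting a run. The following is a small sketch under that assumption; `check_inputs` is a hypothetical helper and not something the pipeline provides.

```shell
# Hypothetical pre-flight helper: reports every path that does not exist
# and returns a non-zero status if at least one is missing.
check_inputs() {
    missing=0
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "missing input: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage with the demo files, run from the main directory:
#   check_inputs ./demo/buenrostro50k_chr1_fp.bw \
#                ./demo/buenrostro50k_chr1_peaks.bed \
#                ./demo/hg38_chr1.fa \
#                ./demo/jaspar_vertebrates.meme \
#                ./demo/homo_sapiens.94.mainChr.gtf \
#       && echo "all demo inputs found"
```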

Parameters

For a detailed overview of all parameters follow this link.

Required arguments:
	--bigwig			Path to BigWig-file
	--bed				Path to BED-file
	--genome_fasta			Path to genome in FASTA-format
	--motif_db			Path to motif-database in MEME-format
	--config			Path to UROPA configuration file
	--gtf_annotation		Path to gtf annotation file
	--organism 			Input organism [hg38 | hg19 | mm9 | mm10]
	--out				Output Directory (Default: './out/')

Optional arguments:

	--help [0|1]			1 to show this help message. (Default: 0)
	--gtf_merged			Path to gtf-file. If the path is set, the process which creates a gtf-file is skipped.
	--tfbs_path 			Path to directory with tfbsscan output. If given, tfbsscan will be skipped.

	Footprint extraction:
	--window_length INT		This parameter sets the length of a sliding window. (Default: 200)
	--step INT			This parameter sets the number of positions to slide the window forward. (Default: 100)
	--percentage INT		Threshold in percent (Default: 0)
	--min_gap INT			If footprints are less than X bases apart the footprints will be merged (Default: 6)

	Filter motifs:
	--min_size_fp INT		Minimum sequence length threshold. Smaller sequences are discarded. (Default: 10)
	--max_size_fp INT		Maximum sequence length threshold. Discards all sequences longer than this value. (Default: 200)
	--tfbsscan_method [moods|fimo] 	Method used by tfbsscan. (Default: moods)

	Cluster:
	Sequence preparation/ reduction:
	--kmer INT			K-mer length (Default: 10)
	--aprox_motif_len INT		Motif length (Default: 10)
	--motif_occurrence FLOAT	Percentage of motifs over all sequences. Use 1 (Default) to assume every sequence contains a motif.
	--min_seq_length INT		Remove all sequences below this value. (Default: 10)
	Clustering:
	--global INT			Global (=1) or local (=0) alignment. (Default: 0)
	--identity FLOAT		Identity threshold. (Default: 0.8)
	--sequence_coverage INT		Minimum aligned nucleotides on both sequences. (Default: 8)
	--memory INT			Memory limit in MB. 0 for unlimited. (Default: 800)
	--throw_away_seq INT		Remove all sequences equal or below this length before clustering. (Default: 9)
	--strand INT			Align +/+ & +/- (= 1). Or align only +/+ (= 0). (Default: 0)

	Motif estimation:
	--min_seq INT 			Sets the minimum number of sequences required for the FASTA-files given to GLAM2. (Default: 100)
	--motif_min_key INT		Minimum number of key positions (aligned columns) in the alignment done by GLAM2. (Default: 8)
	--motif_max_key INT		Maximum number of key positions (aligned columns) in the alignment done by GLAM2. (Default: 20)
	--iteration INT			Number of iterations done by GLAM2. More iterations: better results, higher runtime. (Default: 10000)
	--tomtom_treshold FLOAT		Threshold for similarity score. (Default: 0.01)
	--best_motif INT		Get the best X motifs per cluster. (Default: 3)
	--gap_penalty INT		Set penalty for gaps in GLAM2 (Default: 1000)
	--seed String			Set seed for GLAM2 (Default: 123456789)
	Motif clustering:
	--cluster_motif Boolean		If 1, the pipeline clusters motifs. If 0, it does not. (Default: 0)
	--edge_weight INT		Minimum weight of edges in motif-cluster-graph (Default: 5)
	--motif_similarity_thresh FLOAT	Threshold for motif similarity score (Default: 0.00001)

	Creating GTF:
	--tissues List/String 		List of one or more keywords for tissue-/category-activity, categories must be specified as in JSON config
	Evaluation:
	--max_uropa_runs INT	 	Maximum number of UROPA runs executed in parallel (Default: 10)
All arguments can be set in the configuration files.
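
To make the footprint extraction options --window_length and --step more concrete, the sketch below enumerates the windows a generic sliding-window scan would visit over a region. It is illustrative only: `sliding_windows` is a hypothetical function, the 0-based half-open coordinates and the clipping of the last windows are assumptions, and the pipeline's own boundary handling may differ.

```shell
# Hypothetical illustration of a sliding window; not the pipeline's code.
# Prints "start end" (0-based, half-open) for each window over a region.
# The body runs in a subshell so its variables do not leak to the caller.
sliding_windows() (
    region_len=$1
    window_len=${2:-200}   # mirrors the --window_length default
    step=${3:-100}         # mirrors the --step default
    start=0
    while [ "$start" -lt "$region_len" ]; do
        end=$((start + window_len))
        # Assumption: clip windows that would run past the region end.
        if [ "$end" -gt "$region_len" ]; then
            end=$region_len
        fi
        echo "$start $end"
        start=$((start + step))
    done
)

# Usage: windows over a 450 bp region with the default settings
#   sliding_windows 450
```

With the defaults, consecutive windows overlap by half their length, so each position inside the region is covered by up to two windows.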