# masterJLU2018

De novo motif discovery and evaluation based on footprints identified by TOBIAS.

For further information read the [documentation](https://github.molgen.mpg.de/loosolab/masterJLU2018/wiki).

## Dependencies
* [conda](https://conda.io/docs/user-guide/install/linux.html)

## Installation
1. Install conda and download all files from the [GitHub repository](https://github.molgen.mpg.de/loosolab/masterJLU2018).

2. All other dependencies are installed automatically by conda. To do so, create a conda environment from the yaml-file provided in this repository and activate it before running the pipeline.
This can be done with the following commands:
```console
conda env create -f masterenv.yml
conda activate masterenv
```

3. Set the `wd` parameter in the `nextflow.config` file to the path where the repository is saved, for example '~/masterJLU2018/'.


**Important Notes:**
1. For the pipeline the package jellyfish from the channel bioconda is needed and **NOT** the jellyfish package from the channel conda-forge! Please make sure that the right jellyfish package is installed.
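
One way to check which channel the installed jellyfish came from is to look at the channel column of `conda list`. The following is a minimal sketch and not part of the pipeline: the helper name `jellyfish_channel` is made up for illustration, and it assumes the usual four-column `conda list` layout (name, version, build, channel).

```shell
# Hypothetical helper: read `conda list` output on stdin and print the
# channel column (4th field) of the package named exactly "jellyfish".
# Header lines start with "#" and never match the name test.
jellyfish_channel() {
    awk '$1 == "jellyfish" { print $4 }'
}

# Usage (with the masterenv environment active):
#   conda list jellyfish | jellyfish_channel
```

If the environment was created from `masterenv.yml` as described above, the printed channel should be `bioconda`; an empty result means no package named jellyfish is installed in the active environment.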


## Quick Start
```console
nextflow run pipeline.nf --bigwig [BigWig-file] --bed [BED-file] --genome_fasta [FASTA-file] --motif_db [MEME-file] --organism [mm10|mm9|hg19|hg38] --gtf_annotation [GTF-file]
```
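
Because the command takes several input files, it can help to confirm that they all exist before starting a run. The sketch below is an optional convenience, not something the pipeline provides: the function name `check_inputs` and the file names in the usage comment are placeholders.

```shell
# Hypothetical pre-flight check: return 0 if every argument is a readable
# regular file, otherwise report each problem on stderr and return 1.
check_inputs() {
    rc=0
    for f in "$@"; do
        if [ ! -f "$f" ] || [ ! -r "$f" ]; then
            echo "missing or unreadable input: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Usage (placeholder file names):
#   check_inputs footprints.bw peaks.bed genome.fa motifs.meme annotation.gtf \
#     && nextflow run pipeline.nf --bigwig footprints.bw --bed peaks.bed \
#          --genome_fasta genome.fa --motif_db motifs.meme \
#          --organism hg38 --gtf_annotation annotation.gtf
```

Chaining with `&&` means `nextflow` is only started when all listed files were found.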

### Demo run
Files for a demo run are provided in `./demo/`.
Go to the main directory and run the following command:
```console
nextflow run pipeline.nf --bigwig ./demo/buenrostro50k_chr1_fp.bw --bed ./demo/buenrostro50k_chr1_peaks.bed --genome_fasta ./demo/hg38_chr1.fa --motif_db ./demo/jaspar_vertebrates.meme --out ./demo/buenrostro50k_chr1_out/ --organism hg38 --gtf_annotation ./demo/homo_sapiens.94.mainChr.gtf
```

## Parameters
For a detailed overview of all parameters follow this [link](https://github.molgen.mpg.de/loosolab/masterJLU2018/wiki/Configuration).
```
Required arguments:
	--bigwig		 Path to BigWig-file
	--bed			 Path to BED-file
	--genome_fasta		 Path to genome in FASTA-format
	--motif_db		 Path to motif-database in MEME-format
	--config		 Path to UROPA configuration file
	--gtf_annotation	 Path to GTF annotation file
	--organism 		 Input organism [hg38 | hg19 | mm9 | mm10]
	--out			 Output Directory (Default: './out/')

Optional arguments:

	--help [0|1]		1 to show this help message. (Default: 0)
	--gtf_merged		Path to gtf-file. If path is set the process which creates a gtf-file is skipped.
	--tfbs_path 		Path to directory with tfbsscan output. If given tfbsscan will be skipped.

	Footprint extraction:
	--window_length INT	This parameter sets the length of a sliding window. (Default: 200)
	--step INT		This parameter sets the number of positions to slide the window forward. (Default: 100)
	--percentage INT	Threshold in percent (Default: 0)
	--min_gap INT		If footprints are less than X bases apart the footprints will be merged (Default: 6)

	Filter motifs:
	--min_size_fp INT	Minimum sequence length threshold. Smaller sequences are discarded. (Default: 10)
	--max_size_fp INT	Maximum sequence length threshold. Discards all sequences longer than this value. (Default: 200)
	--tfbsscan_method [moods|fimo] Method used by tfbsscan. (Default: moods)

	Cluster:
	Sequence preparation/ reduction:
	--kmer INT		K-mer length (Default: 10)
	--aprox_motif_len INT	Motif length (Default: 10)
	--motif_occurrence FLOAT	Percentage of motifs over all sequences. Use 1 (Default) to assume every sequence contains a motif.
	--min_seq_length INT	Remove all sequences below this value. (Default: 10)
	Clustering:
	--global INT		Global (=1) or local (=0) alignment. (Default: 0)
	--identity FLOAT	Identity threshold. (Default: 0.8)
	--sequence_coverage INT	Minimum aligned nucleotides on both sequences. (Default: 8)
	--memory INT		Memory limit in MB. 0 for unlimited. (Default: 800)
	--throw_away_seq INT	Remove all sequences equal to or below this length before clustering. (Default: 9)
	--strand INT		Align +/+ & +/- (= 1). Or align only +/+ (= 0). (Default: 0)

	Motif estimation:
	--min_seq INT 		Sets the minimum number of sequences required for the FASTA-files given to GLAM2. (Default: 100)
	--motif_min_key INT	Minimum number of key positions (aligned columns) in the alignment done by GLAM2. (Default: 8)
	--motif_max_key INT	Maximum number of key positions (aligned columns) in the alignment done by GLAM2. (Default: 20)
	--iteration INT		Number of iterations done by GLAM2. More Iterations: better results, higher runtime. (Default: 10000)
	--tomtom_treshold FLOAT	Threshold for similarity score. (Default: 0.01)
	--best_motif INT	Get the best X motifs per cluster. (Default: 3)
	--gap_penalty INT	Set penalty for gaps in GLAM2 (Default: 1000)
	--seed INT		Set seed for GLAM2 (Default: 123456789)
	Motif clustering:
	--cluster_motif	Boolean	If 1, the pipeline clusters motifs. If 0, it does not. (Default: 0)
	--edge_weight INT	Minimum weight of edges in motif-cluster-graph (Default: 5)
	--motif_similarity_thresh FLOAT	Threshold for motif similarity score (Default: 0.00001)

	Creating GTF:
	--tissues List/String 	List of one or more keywords for tissue-/category-activity. Categories must be specified as in the JSON config.
	Evaluation:
	--max_uropa_runs INT	 Maximum number of UROPA runs executed in parallel (Default: 10)
All arguments can be set in the configuration files.
 ```
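
To make the effect of `--min_gap` more concrete, the sketch below merges neighbouring intervals that lie closer together than a given gap. It is an illustration only and not the pipeline's implementation: the function name `merge_close_intervals` is made up, the input is assumed to be BED-like lines (chromosome, start, end) sorted by chromosome and start, and the gap is assumed to be the next start minus the previous end.

```shell
# Hypothetical illustration of gap-based merging.
# $1: gap threshold; stdin: sorted lines of "chrom start end".
# Intervals on the same chromosome are merged when
# (next start - current end) is smaller than the threshold.
merge_close_intervals() {
    awk -v gap="$1" 'BEGIN { OFS = "\t" }
        # Close enough to the current interval: extend it and move on.
        NR > 1 && $1 == cur_chrom && $2 - cur_end < gap {
            if ($3 > cur_end) cur_end = $3
            next
        }
        # Otherwise emit the finished interval before starting a new one.
        NR > 1 { print cur_chrom, cur_start, cur_end }
        { cur_chrom = $1; cur_start = $2; cur_end = $3 }
        END { if (NR > 0) print cur_chrom, cur_start, cur_end }'
}

# Usage (made-up coordinates, gap threshold of 6 as in the default):
#   printf 'chr1\t100\t120\nchr1\t123\t140\nchr1\t150\t160\n' | merge_close_intervals 6
```

In this made-up input the first two intervals are 3 bases apart and are merged, while the third is 10 bases away from the merged interval and stays separate.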
