# masterJLU2018

De novo motif discovery and evaluation based on footprints identified by TOBIAS

## Dependencies
* [conda](https://conda.io/docs/user-guide/install/linux.html)
* [Nextflow](https://www.nextflow.io/)
* [MEME-Suite](http://meme-suite.org/doc/install.html?man_type=web)

## Installation
Start by installing all dependencies listed above. You must also set the [environment paths for MEME-Suite](http://meme-suite.org/doc/install.html?man_type=web#installingtar).
This can be done with the following commands:
```
export PATH=[meme-suite installation path]/libexec/meme-[meme-suite version]:$PATH
export PATH=[meme-suite installation path]/bin:$PATH
```
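
To confirm that the exports took effect, a quick check like the following can help. This is a minimal sketch: `check_tools` is a hypothetical helper, and the tool names `glam2` and `tomtom` are assumed from the parameter descriptions in this README, not from a list of everything the pipeline calls.

```shell
# Report which of the given commands can be found on PATH.
# Returns 0 if all are present, 1 if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tool names are an assumption; extend the list as needed.
check_tools glam2 tomtom || echo "Check the MEME-Suite PATH exports."
```

If a tool is reported as missing, re-check the installation path and version used in the `export` lines.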


Download all files from the [GitHub repository](https://github.molgen.mpg.de/loosolab/masterJLU2018). 
The Nextflow script needs a conda environment to run. Nextflow can create the required environment from the provided YAML file.
On some systems Nextflow aborts the run with the following error:
```
Caused by:
  Failed to create Conda environment
  command: conda env create --prefix  --file env.yml
  status : 143
  message:
```
If this error occurs, you have to create the environment yourself before starting the pipeline.
To do so, you need the yml file from the repository.
Run the following commands to create the environment:
```console
# Use a variable name other than "path": in zsh, "path" is tied to PATH.
env_file=[Path to given masterenv.yml file]
conda env create --name masterenv --file "$env_file"
```
Once the environment is created, set the variable 'path_env' in the configuration file to the path of the environment.
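
The path of the new environment can be looked up with `conda env list`. The sketch below extracts it for a given environment name; `env_path_from_list` is a hypothetical helper, and it assumes that the path is the last whitespace-separated field of the matching line (so paths containing spaces are not handled).

```shell
# Print the path of a named environment, reading `conda env list`
# output from stdin. Assumes the path is the last field on the line.
env_path_from_list() {
    awk -v name="$1" '$1 == name { print $NF }'
}

# Only query conda when it is actually installed.
if command -v conda >/dev/null 2>&1; then
    conda env list | env_path_from_list masterenv
fi
```

The printed path is the value to use for 'path_env'.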

## Quick Start
```console
nextflow run pipeline.nf --input [BigWig-file] --bed [BED-file] --genome_fasta [FASTA-file] --jaspar_db [MEME-file]
```
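
Before launching a run, it can be worth checking that all four required input files exist. The following is a minimal sketch, not part of the pipeline: `require_files` is a hypothetical helper and the file names are placeholders.

```shell
# Return 0 only if every given path is a regular, non-empty file.
require_files() {
    status=0
    for f in "$@"; do
        if [ ! -f "$f" ] || [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Placeholder file names; replace them with your own data.
if require_files footprints.bw peaks.bed genome.fa motifs.meme; then
    nextflow run pipeline.nf --input footprints.bw --bed peaks.bed \
        --genome_fasta genome.fa --jaspar_db motifs.meme
fi
```

The check only tests that the files are present and non-empty; it does not validate their formats.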

## Parameters
```
Required arguments:
	--input Path to BigWig-file
	--bed Path to BED-file
	--genome_fasta Path to genome in FASTA-format
	--jaspar_db Path to motif-database in MEME-format


Optional arguments:
	Footprint extraction:
	--window_length INT (Default: 200)
	--step INT (Default: 100)
	--percentage INT (Default: 0)

	Filter unknown motifs:
	--min_size_fp INT (Default: 10)
	--max_size_fp INT (Default: 100)

	Cluster:
	Sequence preparation/reduction:
	--kmer INT Kmer length (Default: 10)
	--aprox_motif_len INT Motif length (Default: 10)
	--motif_occurence FLOAT Fraction of sequences expected to contain a motif. Use 1 (Default) to assume every sequence contains a motif.
	--min_seq_length INT Remove all sequences below this value. (Default: 10)
	Clustering:
	--global INT Global (=1) or local (=0) alignment. (Default: 0)
	--identity FLOAT Identity threshold. (Default: 0.8)
	--sequence_coverage INT Minimum aligned nucleotides on both sequences. (Default: 8)
	--memory INT Memory limit in MB. 0 for unlimited. (Default: 800)
	--throw_away_seq INT Remove all sequences with a length equal to or below this value before clustering. (Default: 9)
	--strand INT Align +/+ & +/- (= 1). Or align only +/+ (= 0). (Default: 0)

	Motif estimation:
	--motif_min_len INT	Minimum length of Motif (Default: 8)
	--motif_max_len INT	Maximum length of Motif (Default: 20)
	--interation INT	Number of iterations done by glam2. More iterations: better results, higher runtime. (Default: 10000)
	--tomtom_treshold FLOAT	Threshold for similarity score. (Default: 0.01)

	Creating GTF:
	--organism [homo_sapiens | mus_musculus]
	--tissues
  
```

All arguments can be set in the configuration files.



For further information, read the [documentation](https://github.molgen.mpg.de/loosolab/masterJLU2018/wiki).