# masterJLU2018

De novo motif discovery and evaluation based on footprints identified by TOBIAS.

For further information read the [documentation](https://github.molgen.mpg.de/loosolab/masterJLU2018/wiki).

## Dependencies
* [conda](https://conda.io/docs/user-guide/install/linux.html)
* [Nextflow](https://www.nextflow.io/)

## Installation
Start by installing all dependencies listed above (Nextflow, conda) and downloading all files from the [GitHub repository](https://github.molgen.mpg.de/loosolab/masterJLU2018).

All other dependencies are installed automatically by Nextflow using conda. For this, Nextflow creates a new conda environment, which can be found in the Nextflow work directory after the first pipeline run.
It is **not** required to create and activate the environment from the yaml-file beforehand.

**Important Note:** The conda channel bioconda must be set to the highest priority! This is required because two different packages with the same name exist in different channels. The pipeline needs the package jellyfish from the channel bioconda and **NOT** the jellyfish package from the channel conda-forge!
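
One way to set this priority is sketched below. It assumes that your channel list is managed through `conda config`, and it changes your user-level conda configuration:

```console
# --add puts the channel at the top of the list, which gives it the highest priority.
conda config --add channels bioconda
# Print the resulting channel order; bioconda should be listed first.
conda config --show channels
```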


## Quick Start
```console
nextflow run pipeline.nf --bigwig [BigWig-file] --bed [BED-file] --genome_fasta [FASTA-file] --motif_db [MEME-file] --config [UROPA-config-file]
```
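
As a concrete illustration, the call could be assembled as follows. All file paths are hypothetical placeholders, and adding `--organism` and `--out` is an assumption based on the required arguments listed under Parameters:

```console
# Hypothetical input paths; replace them with your own files.
bigwig=data/footprints.bw
bed=data/peaks.bed
genome=data/genome.fa
motifs=data/motifs.meme
uropa_config=data/uropa_config.json

# Assemble the call from the placeholders above.
cmd="nextflow run pipeline.nf --bigwig $bigwig --bed $bed --genome_fasta $genome --motif_db $motifs --config $uropa_config --organism hg38 --out ./out/"

# Print the command for review before running it.
echo "$cmd"
```

The sketch only prints the command. Once the paths point to real files, paste the printed line into your shell to start the pipeline.
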
## Parameters
For a detailed overview of all parameters, follow this [link](https://github.molgen.mpg.de/loosolab/masterJLU2018/wiki/Configuration).
```
Required arguments:
	--bigwig		 Path to BigWig-file
	--bed			 Path to BED-file
	--genome_fasta		 Path to genome in FASTA-format
	--motif_db		 Path to motif-database in MEME-format
	--config		 Path to UROPA configuration file
	--organism		 Input organism [hg38 | hg19 | mm9 | mm10]
	--out			 Output Directory (Default: './out/')

Optional arguments:

	--help [0|1]		1 to show this help message. (Default: 0)
	--tfbs_path 		Path to directory with output from tfbsscan. If given, tfbsscan will not be run.
	--create_known_tfbs_path Path to directory where output from tfbsscan (known motifs) is stored.
				 Path can be set as tfbs_path in the next run. (Default: './')
	--gtf_path			Path to gtf-file. If the path is set, the process which creates a gtf-file is skipped.

	Footprint extraction:
	--window_length INT	This parameter sets the length of a sliding window. (Default: 200)
	--step INT		This parameter sets the number of positions to slide the window forward. (Default: 100)
	--percentage INT	Threshold in percent (Default: 0)

	Filter unknown motifs:
	--min_size_fp INT	Minimum sequence length threshold. Smaller sequences are discarded. (Default: 10)
	--max_size_fp INT	Maximum sequence length threshold. Discards all sequences longer than this value. (Default: 100)

	Clustering:
	Sequence preparation/reduction:
	--kmer INT		Kmer length (Default: 10)
	--aprox_motif_len INT	Motif length (Default: 10)
	--motif_occurence FLOAT	Percentage of motifs over all sequences. Use 1 (Default) to assume every sequence contains a motif.
	--min_seq_length INT	Remove all sequences below this value. (Default: 10)

	Clustering:
	--global INT		Global (=1) or local (=0) alignment. (Default: 0)
	--identity FLOAT	Identity threshold. (Default: 0.8)
	--sequence_coverage INT	Minimum aligned nucleotides on both sequences. (Default: 8)
	--memory INT		Memory limit in MB. 0 for unlimited. (Default: 800)
	--throw_away_seq INT	Remove all sequences equal to or below this length before clustering. (Default: 9)
	--strand INT		Align +/+ & +/- (= 1). Or align only +/+ (= 0). (Default: 0)

	Motif estimation:
	--min_seq INT 		Sets the minimum number of sequences required for the FASTA-files given to GLAM2. (Default: 100)
	--motif_min_key INT	Minimum number of key positions (aligned columns) in the alignment done by GLAM2. (Default: 8)
	--motif_max_key INT	Maximum number of key positions (aligned columns) in the alignment done by GLAM2. (Default: 20)
	--iteration INT		Number of iterations done by GLAM2. More iterations: better results, higher runtime. (Default: 10000)
	--tomtom_treshold FLOAT	Threshold for similarity score. (Default: 0.01)
	--best_motif INT	Get the best X motifs per cluster. (Default: 3)

	Motif clustering:
	--cluster_motif	Boolean	If 1, the pipeline clusters motifs. If 0, it does not. (Default: 0)
	--edge_weight INT	Minimum weight of edges in motif-cluster-graph (Default: 5)
	--motif_similarity_thresh FLOAT	Threshold for motif similarity score (Default: 0.00001)

	Creating GTF:
	--tissues List/String 	List of one or more keywords for tissue-/category-activity. Categories must be specified as in the JSON
				config.
All arguments can be set in the configuration files.
 ```
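
To illustrate how `--window_length` and `--step` interact, here is a minimal sketch. It assumes that windows start at the beginning of a region and advance by `step` until a full window no longer fits; the pipeline's actual handling of region ends may differ. The region bounds are made-up values:

```console
window_length=200   # default of --window_length
step=100            # default of --step
region_start=0      # hypothetical region [0, 500)
region_end=500

# Collect every window as start-end while a full window still fits into the region.
windows=""
start=$region_start
while [ $((start + window_length)) -le "$region_end" ]; do
  windows="${windows:+$windows }$start-$((start + window_length))"
  start=$((start + step))
done

echo "$windows"   # prints: 0-200 100-300 200-400 300-500
```

With the default values, consecutive windows overlap by 100 positions.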

For further information read the [documentation](https://github.molgen.mpg.de/loosolab/masterJLU2018/wiki).

## Known issues
The Nextflow script needs a conda environment to run. Nextflow creates the needed environment from the given yaml-file.
On some systems Nextflow exits the run with the following error:
```
Caused by:
  Failed to create Conda environment
  command: conda env create --prefix  --file env.yml
  status : 143
  message:
```
If this error occurs, you have to create the environment before starting the pipeline.
To create this environment you need the yml-file from the repository.
Run the following commands to create the environment:
```console
path=[Path to given masterenv.yml file]
conda env create --name masterenv -f "$path"
```
When the environment is created, set the variable 'path_env' in the configuration file to its path.
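
If you are unsure where conda placed the environment, its path can be read from the output of `conda env list`. The helper below is a small sketch, not part of the pipeline; the function name `env_path` and the sample paths are made up for illustration, and it assumes the path contains no spaces:

```console
# Reads `conda env list` output on stdin and prints the path of the named environment.
env_path() {
  awk -v name="$1" '$1 == name { print $NF }'
}

# Demonstration with sample output in the layout conda prints (paths are made up).
printf 'base  *  /opt/conda\nmasterenv     /opt/conda/envs/masterenv\n' | env_path masterenv
```

On a real system, run `conda env list | env_path masterenv` and use the printed path as the value of 'path_env'.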