loosolab · renewiegandt · Jan 13, 2019 · Jan 10, 2019 · Jan 10, 2019 · Jan 12, 2019
diff --git a/README.md b/README.md
@@ -10,28 +10,35 @@ For further information read the [documentation](https://github.molgen.mpg.de/lo
 * [MEME-Suite](http://meme-suite.org/doc/install.html?man_type=web)
 
 ## Installation
-Start with installing all dependencies listed above (Nextflow, conda, MEME-Suite) and downloading all files from the [GitHub repository](https://github.molgen.mpg.de/loosolab/masterJLU2018).
-It is required to set the [enviroment paths for meme-suite](http://meme-suite.org/doc/install.html?man_type=web#installingtar).
+1. Start by installing all dependencies listed above (Nextflow, conda, MEME-Suite) and downloading all files from the [GitHub repository](https://github.molgen.mpg.de/loosolab/masterJLU2018).
+2. It is required to set the [environment paths for meme-suite](http://meme-suite.org/doc/install.html?man_type=web#installingtar).
 this can be done with following commands:
 ```
 export PATH=[meme-suite instalation path]/libexec/meme-[meme-suite version]:$PATH
 export PATH=[meme-suite instalation path]/bin:$PATH
 ```
 
-Every other dependency will be automatically  installed by Nextflow using conda. For that a new conda enviroment will be created, which can be found in the from Nextflow created work directory after the first pipeline run.
-It is **not** required to create and activate the enviroment from the yaml-file beforehand.
+3. All other dependencies are installed with conda, using a conda environment built from the yaml-file given in this repository.
+The environment must be created and activated before the pipeline is started.
+This can be done with the following commands:
+```console
+conda env create -f masterenv.yml
+conda activate masterenv
+```
+
 
 **Important Note:** For conda the channel bioconda needs to be set as highest priority! This is required due to two different packages with the same name in different channels. For the pipeline the package jellyfish from the channel bioconda is needed and **NOT** the jellyfish package from the channel conda-forge!
 
 
+
 ## Quick Start
 ```console
-nextflow run pipeline.nf --bigwig [BigWig-file] --bed [BED-file] --genome_fasta [FASTA-file] --motif_db [MEME-file] --config [UROPA-config-file]
+nextflow run pipeline.nf --bigwig [BigWig-file] --bed [BED-file] --genome_fasta [FASTA-file] --motif_db [MEME-file] --organism [mm10|mm9|hg19|hg38]
 ```
 ## Parameters
 For a detailed overview for all parameters follow this [link](https://github.molgen.mpg.de/loosolab/masterJLU2018/wiki/Configuration).
 ```
 Required arguments:
 	--bigwig		 Path to BigWig-file
 	--bed			 Path to BED-file
 	--genome_fasta		 Path to genome in FASTA-format
@@ -52,18 +59,19 @@ Optional arguments:
 	--window_length INT	This parameter sets the length of a sliding window. (Default: 200)
 	--step INT		This parameter sets the number of positions to slide the window forward. (Default: 100)
 	--percentage INT	Threshold in percent (Default: 0)
+	--max_bp_between INT	If footprints are less than X bases apart, they will be merged. (Default: 6)
 
-	Filter unknown motifs:
+	Filter motifs:
 	--min_size_fp INT	Minimum sequence length threshold. Smaller sequences are discarded. (Default: 10)
-	--max_size_fp INT	Maximum sequence length threshold. Discards all sequences longer than this value. (Default: 100)
+	--max_size_fp INT	Maximum sequence length threshold. Discards all sequences longer than this value. (Default: 200)
+	--tfbsscan_method [moods|fimo] Method used by tfbsscan. (Default: moods)
 
-	Clustering:
+	Cluster:
 	Sequence preparation/ reduction:
-	--kmer INT		Kmer length (Default: 10)
+	--kmer INT		K-mer length (Default: 10)
 	--aprox_motif_len INT	Motif length (Default: 10)
 	--motif_occurence FLOAT	Percentage of motifs over all sequences. Use 1 (Default) to assume every sequence contains a motif.
 	--min_seq_length Interations	Remove all sequences below this value. (Default: 10)
-
 	Clustering:
 	--global INT		Global (=1) or local (=0) alignment. (Default: 0)
 	--identity FLOAT	Identity threshold. (Default: 0.8)
@@ -75,11 +83,10 @@ Optional arguments:
 	Motif estimation:
 	--min_seq INT 		Sets the minimum number of sequences required for the FASTA-files given to GLAM2. (Default: 100)
 	--motif_min_key INT	Minimum number of key positions (aligned columns) in the alignment done by GLAM2. (Default: 8)
-	--motif_max_key INT	Maximum number of key positions (aligned columns) in the alignment done by GLAM2.f (Default: 20)
-	--iteration INT		Number of iterations done by glam2. More Iterations: better results, higher runtime. (Default: 10000)
-	--tomtom_treshold float	Threshold for similarity score. (Default: 0.01)
+	--motif_max_key INT	Maximum number of key positions (aligned columns) in the alignment done by GLAM2. (Default: 20)
+	--iteration INT		Number of iterations done by GLAM2. More Iterations: better results, higher runtime. (Default: 10000)
+	--tomtom_treshold FLOAT	Threshold for similarity score. (Default: 0.01)
 	--best_motif INT	Get the best X motifs per cluster. (Default: 3)
-
 	Moitf clustering:
 	--cluster_motif	Boolean	If 1 pipeline clusters motifs. If its 0 it does not. (Defaul: 0)
 	--edge_weight INT	Minimum weight of edges in motif-cluster-graph (Default: 5)
@@ -94,20 +101,11 @@ All arguments can be set in the configuration files
 For further information read the [documentation](https://github.molgen.mpg.de/loosolab/masterJLU2018/wiki).
 
 ## Known issues
-The Nextflow-script needs a conda environment to run. Nextflow creates the needed environment from the given yaml-file.
-On some systems Nextflow exits the run with following error:
-```
-Caused by:
-  Failed to create Conda environment
-  command: conda env create --prefix  --file env.yml
-  status : 143
-  message:
+
+For unknown reasons, the tool [MOODS](https://www.cs.helsinki.fi/group/pssmfind/), which is called by tfbsscan, occasionally returns empty BED files. The problem is probably in the function _pfm_to_log_odds_. If MOODS fails in this way, you will see the following error message:
 ```
-If this error occurs you have to create the environment before starting the pipeline.
-To create this environment you need the yml-file from the repository.
-Run the following commands to create the environment:
-```console
-path=[Path to given masterenv.yml file]
-conda env create --name masterenv -f $path
+ERROR
+All motiffiles have less than 2 lines!
+Fix motiffiles and try again.
 ```
-When the environment is created, set the variable 'path_env' in the configuration file as the path to it.
+There is no known fix so far. As a workaround, either restart the pipeline after a few hours with the same parameters, or set the parameter tfbsscan_method to _fimo_, which forces tfbsscan to use [fimo](http://meme-suite.org/doc/fimo.html). This method takes longer but is not known to produce empty BED files.
diff --git a/config/create_gtf.config b/config/create_gtf.config
@@ -1,4 +1,3 @@
 params{
-  organism="hg38"
   tissue=""
 }
diff --git a/config/filter_unknown_motifs.config b/config/filter_motifs.config
@@ -1,4 +1,5 @@
 params{
   min_size_fp=10
   max_size_fp=100
+  tfbsscan_method = "moods"
 }
diff --git a/config/peak_calling.config b/config/footprint_extraction.config
@@ -2,4 +2,5 @@ params{
 	window_length = 200
 	step = 100
 	percentage = 0
+	max_bp_between = 6
 }
diff --git a/masterenv.yml b/masterenv.yml
@@ -1,9 +1,9 @@
 
 name: masterenv
 dependencies:
-  - python >=3
+  - python >=3.6.7
   - r-seqinr
-  - numpy
+  - numpy=1.15.4
   - pybigWig
   - cd-hit
   - jellyfish

diff --git a/nextflow.config b/nextflow.config
@@ -1,4 +1,4 @@
-wd = "/mnt/agnerds/Rene.Wiegandt/10_Master/masterJLU2018"
+wd = ""
 createTimeout = 40
 params.threads=60 //Parameter for for scripts! Not for nextflow processes.
 params.config="${wd}/config/uropa.config"
@@ -8,8 +8,8 @@ env {
 	path_env = "${wd}/masterenv.yml"
 }
 
-includeConfig "${wd}/config/peak_calling.config"
-includeConfig "${wd}/config/filter_unknown_motifs.config"
+includeConfig "${wd}/config/footprint_extraction.config"
+includeConfig "${wd}/config/filter_motifs.config"
 includeConfig "${wd}/config/cluster.config"
 includeConfig "${wd}/config/motif_estimation.config"
 includeConfig "${wd}/config/create_gtf.config"
diff --git a/pipeline.nf b/pipeline.nf
@@ -12,14 +12,15 @@
 	params.gtf_path=""
 	params.out = "./out/"
 
-//peak_calling
+//footprint_extraction
 	params.window_length = 200
 	params.step = 100
 	params.percentage = 0
+	params.max_bp_between = 6
 
 //filter_unknown_motifs
 	params.min_size_fp=10
-	params.max_size_fp=100
+	params.max_size_fp=200
 	params.tfbsscan_method = "moods"
 
 //clustering
@@ -52,7 +53,7 @@
 	//cluster motifs
 	params.cluster_motif = 0 // Boolean if 1 motifs are clustered else they are not
 	params.edge_weight = 50 // Minimum weight of edges in motif-cluster-graph
-	motif_similarity_thresh = 0.00001 // threshold for motif similarity score
+	params.motif_similarity_thresh = 0.00001 // threshold for motif similarity score
 
 	params.best_motif = 3 // Top n motifs per cluster
 
@@ -85,19 +86,19 @@ Optional arguments:
 	--window_length INT	This parameter sets the length of a sliding window. (Default: 200)
 	--step INT		This parameter sets the number of positions to slide the window forward. (Default: 100)
 	--percentage INT	Threshold in percent (Default: 0)
+	--max_bp_between INT	If footprints are less than X bases apart, they will be merged. (Default: 6)
 
-	Filter unknown motifs:
+	Filter motifs:
 	--min_size_fp INT	Minimum sequence length threshold. Smaller sequences are discarded. (Default: 10)
-	--max_size_fp INT	Maximum sequence length threshold. Discards all sequences longer than this value. (Default: 100)
+	--max_size_fp INT	Maximum sequence length threshold. Discards all sequences longer than this value. (Default: 200)
 	--tfbsscan_method [moods|fimo] Method used by tfbsscan. (Default: moods)
 
-	Clustering:
+	Cluster:
 	Sequence preparation/ reduction:
-	--kmer INT		Kmer length (Default: 10)
+	--kmer INT		K-mer length (Default: 10)
 	--aprox_motif_len INT	Motif length (Default: 10)
 	--motif_occurence FLOAT	Percentage of motifs over all sequences. Use 1 (Default) to assume every sequence contains a motif.
 	--min_seq_length Interations	Remove all sequences below this value. (Default: 10)
-
 	Clustering:
 	--global INT		Global (=1) or local (=0) alignment. (Default: 0)
 	--identity FLOAT	Identity threshold. (Default: 0.8)
@@ -109,11 +110,10 @@ Optional arguments:
 	Motif estimation:
 	--min_seq INT 		Sets the minimum number of sequences required for the FASTA-files given to GLAM2. (Default: 100)
 	--motif_min_key INT	Minimum number of key positions (aligned columns) in the alignment done by GLAM2. (Default: 8)
-	--motif_max_key INT	Maximum number of key positions (aligned columns) in the alignment done by GLAM2.f (Default: 20)
-	--iteration INT		Number of iterations done by glam2. More Iterations: better results, higher runtime. (Default: 10000)
-	--tomtom_treshold float	Threshold for similarity score. (Default: 0.01)
+	--motif_max_key INT	Maximum number of key positions (aligned columns) in the alignment done by GLAM2. (Default: 20)
+	--iteration INT		Number of iterations done by GLAM2. More Iterations: better results, higher runtime. (Default: 10000)
+	--tomtom_treshold FLOAT	Threshold for similarity score. (Default: 0.01)
 	--best_motif INT	Get the best X motifs per cluster. (Default: 3)
-
 	Moitf clustering:
 	--cluster_motif	Boolean	If 1 pipeline clusters motifs. If its 0 it does not. (Defaul: 0)
 	--edge_weight INT	Minimum weight of edges in motif-cluster-graph (Default: 5)
@@ -212,7 +212,7 @@ process footprint_extraction {
 
 	script:
 	"""
-	python ${path_bin}/1.1_footprint_extraction/footprints_extraction.py --bigwig ${bigWig} --bed ${bed} --output_file ${name}_called_peaks.bed --window_length ${params.window_length} --step ${params.step} --percentage ${params.percentage}
+	python ${path_bin}/1.1_footprint_extraction/footprints_extraction.py --bigwig ${bigWig} --bed ${bed} --output_file ${name}_called_peaks.bed --window_length ${params.window_length} --step ${params.step} --percentage ${params.percentage} --max_bp_between ${params.max_bp_between}
 	"""
 }
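
Note on `--max_bp_between`: the help text says that footprints less than X bases apart are merged. The actual logic lives in `footprints_extraction.py`, which is not part of this diff, so the following is only a minimal sketch of what such a merge could look like. The function name `merge_close_footprints`, the tuple representation, and the assumption that all intervals lie on the same chromosome with exclusive end coordinates are hypothetical and chosen for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end) with exclusive end,
# all intervals assumed to be on the same chromosome.
Interval = Tuple[int, int]


def merge_close_footprints(
    footprints: List[Interval], max_bp_between: int = 6
) -> List[Interval]:
    """Merge footprints whose gap is smaller than max_bp_between bases."""
    merged: List[Interval] = []
    for start, end in sorted(footprints):
        # Gap between this footprint and the last merged one.
        if merged and start - merged[-1][1] < max_bp_between:
            prev_start, prev_end = merged[-1]
            # Extend the previous footprint instead of adding a new one.
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    example = [(100, 120), (123, 140), (150, 160), (400, 420)]
    print(merge_close_footprints(example))
```

With the default of 6, the first two footprints in the example (gap of 3 bases) would be merged, while a gap of exactly 6 bases would leave two footprints separate, since the sketch reads "less than" as a strict comparison.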