loosolab · anastasiia · Jan 4, 2019 · Dec 18, 2018 · Dec 18, 2018 · Dec 18, 2018
diff --git a/README.md b/README.md
@@ -1,44 +1,22 @@
 # masterJLU2018
 
-De novo motif discovery and evaluation based on footprints identified by TOBIAS
+De novo motif discovery and evaluation based on footprints identified by TOBIAS.
 
-For further information read the [documentation](https://github.molgen.mpg.de/loosolab/masterJLU2018/wiki)
+For further information read the [documentation](https://github.molgen.mpg.de/loosolab/masterJLU2018/wiki).
 
 ## Dependencies
 * [conda](https://conda.io/docs/user-guide/install/linux.html)
 * [Nextflow](https://www.nextflow.io/)
-* [MEME-Suite](http://meme-suite.org/doc/install.html?man_type=web)
 
 ## Installation
-Start with installing all dependencies listed above. It is required to set the [enviroment paths for meme-suite](http://meme-suite.org/doc/install.html?man_type=web#installingtar).
-this can be done with following commands:
-```
-export PATH=[meme-suite instalation path]/libexec/meme-[meme-suite version]:$PATH
-export PATH=[meme-suite instalation path]/bin:$PATH
-```
+Start with installing all dependencies listed above (Nextflow, conda) and downloading all files from the [GitHub repository](https://github.molgen.mpg.de/loosolab/masterJLU2018).
 
-
-Download all files from the [GitHub repository](https://github.molgen.mpg.de/loosolab/masterJLU2018).
-The Nextflow-script needs a conda enviroment to run. Nextflow can create the needed enviroment from the given yaml-file.
-On some systems Nextflow exits the run with following error:
-```
-Caused by:
-  Failed to create Conda environment
-  command: conda env create --prefix  --file env.yml
-  status : 143
-  message:
-```
-If this error occurs you have to create the enviroment before starting the pipeline.
-To create this enviroment you need the yml-file from the repository.
-Run the following commands to create the enviroment:
-```console
-path=[Path to given masterenv.yml file]
-conda env create --name masterenv -f=$path
-```
-When the enviroment is created, set the variable 'path_env' in the configuration file as the path to it.
+Every other dependency will be automatically  installed by Nextflow using conda. For that a new conda enviroment will be created, which can be found in the from Nextflow created work directory after the first pipeline run.
+It is **not** required to create and activate the enviroment from the yaml-file beforehand.
 
 **Important Note:** For conda the channel bioconda needs to be set as highest priority! This is required due to two differnt packages with the same name in different channels. For the pipeline the package jellyfish from the channel bioconda is needed and **NOT** the jellyfisch package from the channel conda-forge!
 
+
 ## Quick Start
 ```console
 nextflow run pipeline.nf --bigwig [BigWig-file] --bed [BED-file] --genome_fasta [FASTA-file] --motif_db [MEME-file] --config [UROPA-config-file]
@@ -52,14 +30,16 @@ Required arguments:
 	--genome_fasta		 Path to genome in FASTA-format
 	--motif_db		 Path to motif-database in MEME-format
 	--config		 Path to UROPA configuration file
-	--create_known_tfbs_path Path to directory where output from tfbsscan (known motifs) are stored.
-				 Path can be set as tfbs_path in next run. (Default: './')
-	--out			 Output Directory (Default: './out/')	
-
+ 	--organism 		 Input organism [hg38 | hg19 | mm9 | mm10]
+	--out			 Output Directory (Default: './out/')
+
 Optional arguments:
-	
+
 	--help [0|1]		1 to show this help message. (Default: 0)
 	--tfbs_path 		Path to directory with output from tfbsscan. If given tfbsscan will not be run.
+	--create_known_tfbs_path Path to directory where output from tfbsscan (known motifs) are stored.
+				 Path can be set as tfbs_path in next run. (Default: './')
+	--gtf_path			Path to gtf-file. If path is set the process which creats a gtf-file is skipped.
 
 	Footprint extraction:
 	--window_length INT	This parameter sets the length of a sliding window. (Default: 200)
@@ -99,12 +79,28 @@ Optional arguments:
 	--motif_similarity_thresh FLOAT	Threshold for motif similarity score (Default: 0.00001)
 
 	Creating GTF:
-	--organism [hg38 | hg19 | mm9 | mm10]	Input organism
 	--tissues List/String 	List of one or more keywords for tissue-/category-activity, categories must be specified as in JSON
 				config
 All arguments can be set in the configuration files
  ```
 
+For further information read the [documentation](https://github.molgen.mpg.de/loosolab/masterJLU2018/wiki).
 
-
-For further information read the [documentation](https://github.molgen.mpg.de/loosolab/masterJLU2018/wiki)
+## Known issues
+The Nextflow-script needs a conda enviroment to run. Nextflow creates the needed enviroment from the given yaml-file.
+On some systems Nextflow exits the run with following error:
+```
+Caused by:
+  Failed to create Conda environment
+  command: conda env create --prefix  --file env.yml
+  status : 143
+  message:
+```
+If this error occurs you have to create the enviroment before starting the pipeline.
+To create this enviroment you need the yml-file from the repository.
+Run the following commands to create the enviroment:
+```console
+path=[Path to given masterenv.yml file]
+conda env create --name masterenv -f $path
+```
+When the enviroment is created, set the variable 'path_env' in the configuration file as the path to it.
diff --git a/bin/footprints_extraction.py → ...print_extraction/footprints_extraction.py b/bin/footprints_extraction.py → ...print_extraction/footprints_extraction.py
diff --git a/bin/compareBed.sh → bin/1.2_filter_motifs/compareBed.sh b/bin/compareBed.sh → bin/1.2_filter_motifs/compareBed.sh
diff --git a/bin/maxScore.R → bin/1.2_filter_motifs/maxScore.R b/bin/maxScore.R → bin/1.2_filter_motifs/maxScore.R
diff --git a/bin/merge.R → bin/1.2_filter_motifs/merge.R b/bin/merge.R → bin/1.2_filter_motifs/merge.R
diff --git a/bin/tfbsscan.py → bin/1.2_filter_motifs/tfbsscan.py b/bin/tfbsscan.py → bin/1.2_filter_motifs/tfbsscan.py
diff --git a/bin/cdhit_wrapper.R → bin/2.1_clustering/cdhit_wrapper.R b/bin/cdhit_wrapper.R → bin/2.1_clustering/cdhit_wrapper.R
@@ -1,10 +1,10 @@
 #! /bin/Rscript
-library("optparse")
+if (!require(optparse)) install.packages("optparse"); library(optparse)
 
 option_list <- list(
   make_option(opt_str = c("-i", "--input"), default = NULL, help = "Input bed-file. Fourth column is expected to contain names, last column must be sequences.", metavar = "character"),
   make_option(opt_str = c("-c", "--identity"), default = 0.8, help = "Identity threshold. Default = %default (CD-HIT parameter)", metavar = "double >= 0.8"),
-  make_option(opt_str = c("-A", "--coverage"), default = 8, help = "Minimal alignment length for both sequences in nucelotides. Default = %default (CD-HIT parameter)", metavar = "integer"),
+  make_option(opt_str = c("-A", "--coverage"), default = 8, help = "Minimal alignment length for both sequences in nucleotides. Default = %default (CD-HIT parameter)", metavar = "integer"),
   make_option(opt_str = c("-o", "--output"), default = "cluster.bed", help = "Output file same as input but with appended column of cluster numbers. Default = %default", metavar = "character"),
   make_option(opt_str = c("--clean"), default = TRUE, help = "Delete all temporary files. Default = %default", metavar = "logical"),
   make_option(opt_str = c("-G", "--global"), default = 0, help = "Global sequence identity = 1, local = 0. Default = %default (CD-HIT parameter)", metavar = "integer"),
@@ -14,15 +14,15 @@ option_list <- list(
   make_option(opt_str = c("-n", "--word_length"), default = 3, help = "Default = %default (CD-HIT parameter)", metavar = "integer"),
   make_option(opt_str = c("-l", "--throw_away_sequences"), default = 5, help = "Maximum length of sequences thrown away. Default = %default (CD-HIT parameter)", metavar = "integer"),
   # make_option(opt_str = c("-d", "--description")), # can not produce bed if this is != 0
-  make_option(opt_str = c("-s", "--length_dif_cutoff_shorter_p"), default = 0.0, help = "Shorter sequences length must be at least x percent of longer sequenecs length. Default = %default (CD-HIT parameter)", metavar = "double"),
+  make_option(opt_str = c("-s", "--length_dif_cutoff_shorter_p"), default = 0.0, help = "Shorter sequences length must be at least x percent of longer sequences length. Default = %default (CD-HIT parameter)", metavar = "double"),
   make_option(opt_str = c("-S", "--length_dif_cutoff_shorter_n"), default = 999999, help = "Length difference between sequences can not be higher than x nucleotides. Default = %default (CD-HIT parameter)", metavar = "integer"),
   make_option(opt_str = c("-e", "--alignment_coverage_longer_p"), default = 0.0, help = "Alignment must cover x percent of longer sequence. Default = %default (CD-HIT parameter: -aL)", metavar = "double"),
   make_option(opt_str = c("-E", "--alignment_coverage_longer_n"), default = 99999999, help = "There can not be more than x unaligned nucleotides on the longer sequence. Default = %default (CD-HIT parameter: -AL)", metavar = "integer"),
   make_option(opt_str = c("-f", "--alignment_coverage_shorter_p"), default = 0.0, help = "Alignment must cover x percent of shorter sequence. Default = %default (CD-HIT parameter: -aS)", metavar = "double"),
   make_option(opt_str = c("-F", "--alignment_coverage_shorter_n"), default = 99999999, help = "There can not be more than x unaligned nucleotides on the longer sequence. Default = %default (CD-HIT parameter: -AS)", metavar = "integer"),
   make_option(opt_str = c("-w", "--max_unmatched_longer_p"), default = 1.0, help = "Maximum unmatched percentage on longer sequence (excludes leading tailing gap). Default = %default (CD-HIT parameter: -uL)", metavar = "double"),
   make_option(opt_str = c("-W", "--max_unmatched_shorter_p"), default = 1.0, help = "Maximum unmatched percentage on shorter sequence (excludes leading tailing gap). Default = %default (CD-HIT parameter: -uS)", metavar = "double"),
-  make_option(opt_str = c("-U", "--max_unmatched_both_n"), default = 99999999, help = "Maximum unmatched nucleotied on both sequences (excludes leading tailing gap). Default = %default (CD-HIT parameter)", metavar = "integer"),
+  make_option(opt_str = c("-U", "--max_unmatched_both_n"), default = 99999999, help = "Maximum unmatched nucleotide on both sequences (excludes leading tailing gap). Default = %default (CD-HIT parameter)", metavar = "integer"),
   make_option(opt_str = c("-g", "--fast_cluster"), default = 1, help = "Cluster sequence in first cluster that meets threshold = 0 (fast). Or cluster in best Cluster = 1 (accurate). Default = %default (CD-HIT parameter)", metavar = "integer"),
   make_option(opt_str = c("-r", "--strand"), default = 0, help = "Align +/+ & +/- (= 1). Or align only +/+ (= 0). Default = %default (CD-HIT parameter)", metavar = "integer"),
   make_option(opt_str = c("--match"), default = 2, help = "Matching score. Default = %default (CD-HIT parameter)", metavar = "integer"),
@@ -33,7 +33,8 @@ option_list <- list(
 )
 
 opt_parser <- OptionParser(option_list = option_list, 
-                           description = "CD-HIT-EST Wrapper function. See https://github.com/weizhongli/cdhit/wiki for more information regarding CD-HITs parameters.")
+                           description = "CD-HIT-EST Wrapper function. See https://github.com/weizhongli/cdhit/wiki for more information regarding CD-HITs parameters.",
+                           epilogue = "Author: Hendrik Schultheis <Hendrik.Schultheis@mpi-bn.mpg.de>")
 
 opt <- parse_args(opt_parser)
 
@@ -42,50 +43,61 @@ opt <- parse_args(opt_parser)
 #' 
 #' @param input Data.table or file in bed format (requires names in fourth column and sequences in last column).
 #' @param identity Identity threshold. Default = 0.8 (CD-HIT parameter)
-#' @param coverage Minimal alignment length for both sequences in nucelotides. Default = 8 (CD-HIT parameter)
-#' @param output Clustered bedfile. Adds cluster number in last column (numbering depend on sort_by_cluster_size parameter). Default = cluster.bed
+#' @param coverage Minimal alignment length for both sequences in nucleotides. Default = 8 (CD-HIT parameter)
+#' @param output Clustered bed-file. Adds cluster number in last column (numbering depend on sort_by_cluster_size parameter). Default = cluster.bed
 #' @param clean Clean up after run. Default = TRUE
 #' @param threads Number of threads to use (0 = all cores). Default = 1 (CD-HIT parameter)
 #' @param global Global sequence identity = 1, local = 0. Default = 0 (CD-HIT parameter)
 #' @param band_width "Alignment band width. Default = 20 (CD-HIT parameter)"
 #' @param memory Memory limit in MB. 0 for unlimited. Default = 800 (CD-HIT parameter)
 #' @param word_length Default = 3 (CD-HIT parameter)
 #' @param throw_away_sequences Maximum length of sequences thrown away. Default = %default (CD-HIT parameter)
-#' @param length_dif_cutoff_shorter_p Shorter sequences length must be at least x percent of longer sequenecs length. Default = 0 (CD-HIT parameter)
+#' @param length_dif_cutoff_shorter_p Shorter sequences length must be at least x percent of longer sequences length. Default = 0 (CD-HIT parameter)
 #' @param length_dif_cutoff_shorter_n Length difference between sequences can not be higher than x nucleotides. Default = 999999 (CD-HIT parameter)
 #' @param alignment_coverage_longer_p Alignment must cover x percent of longer sequence. Default = 0 (CD-HIT parameter)
 #' @param alignment_coverage_longer_n There can not be more than x unaligned nucleotides on the longer sequence. Default = 99999999 (CD-HIT parameter)
 #' @param alignment_coverage_shorter_p Alignment must cover x percent of shorter sequence. Default = 0 (CD-HIT parameter)
 #' @param alignment_coverage_shorter_n There can not be more than x unaligned nucleotides on the longer sequence. Default = 99999999 (CD-HIT parameter)
 #' @param max_unmatched_longer_p Maximum unmatched percentage on longer sequence (excludes leading tailing gap). Default = 1 (CD-HIT parameter)
 #' @param max_unmatched_shorter_p Maximum unmatched percentage on shorter sequence (excludes leading tailing gap). Default = 1 (CD-HIT parameter)
-#' @param max_unmatched_both_n Maximum unmatched nucleotied on both sequences (excludes leading tailing gap). Default = 99999999 (CD-HIT parameter)
+#' @param max_unmatched_both_n Maximum unmatched nucleotide on both sequences (excludes leading tailing gap). Default = 99999999 (CD-HIT parameter)
 #' @param fast_cluster Cluster sequence in first cluster that meets threshold = 0 (fast). Or cluster in best Cluster = 1 (accurate). Default = 1 (CD-HIT parameter)
 #' @param strand Align +/+ & +/- (= 1). Or align only +/+ (= 0). Default = 0 (CD-HIT parameter)
 #' @param match Matching score. Default = 2 (CD-HIT parameter)
 #' @param mismatch Mismatch score. Default = -2 (CD-HIT parameter)
 #' @param gap Gap score. Default = -6 (CD-HIT parameter)
-#' @param gat_ext Gap extension score. Default = -1 (CD-HIT parameter)
+#' @param gap_ext Gap extension score. Default = -1 (CD-HIT parameter)
 #' @param sort_cluster_by_size Either sort cluster by decreasing length (= 0) or by decreasing size (= 1). Default = 1 (CD-HIT parameter)
 #' 
-#' TODO check whether cdhit is installed
+#' @details If there is a header supplied other then the default data.table naming scheme ('V1', 'V2', etc.) it will be kept and extended.
+#' 
+#' @author Hendrik Schultheis <Hendrik.Schultheis@@mpi-bn.mpg.de>
+#' 
 cdhitest <- function(input, identity = 0.8, coverage = 8, output = "cluster.bed", clean = TRUE, threads = 1, global = 0, band_width = 20, memory = 800, word_length = 3, throw_away_sequences = 5, length_dif_cutoff_shorter_p = 0, length_dif_cutoff_shorter_n = 999999, alignment_coverage_longer_p = 0, alignment_coverage_longer_n = 99999999, alignment_coverage_shorter_p = 0, alignment_coverage_shorter_n = 99999999, max_unmatched_longer_p = 1, max_unmatched_shorter_p = 1, max_unmatched_both_n = 99999999, fast_cluster = 1, strand = 0, match = 2, mismatch = -2, gap = -6, gap_ext = -1, sort_cluster_by_size = 1) {
+  if (system("which cd-hit-est", ignore.stdout = FALSE) != 0) {
+    stop("Required program CD-HIT not found! Please check whether it is installed.")
+  }
+
   if (missing(input)) {
-    stop("Input parameter missing!")
+    stop("No input specified! Please forward a valid bed-file.")
   }
 
   message("Loading bed.")
-  # load bed if neccessary
+  # load bed if necessary
   if (!data.table::is.data.table(input)) {
-    bed_table <- data.table::fread(input = input, header = FALSE)
+    bed_table <- data.table::fread(input = input)
   } else {
     bed_table <- input
   }
 
+  # check if there are column names to keep them
+  default_col_names <- grepl(pattern = "^V+\\d$", names(bed_table), perl = TRUE)
+  keep_col_names <- ifelse(any(default_col_names), FALSE, TRUE)
+
   # check for duplicated names
-  if (anyDuplicated(bed_table[, "V4"])) {
+  if (anyDuplicated(bed_table[, 4])) {
     warning("Found duplicated names. Making names unique.")
-    bed_table[, V4 := make.unique(V4)]
+    bed_table[, 4 := make.unique(names(bed_table)[4])]
   }
 
   message("Convert to fasta.")
@@ -137,17 +149,18 @@ cdhitest <- function(input, identity = 0.8, coverage = 8, output = "cluster.bed"
   system(command = cluster_call, wait = TRUE)
 
   # load reformated file
-  cluster <- data.table::fread(cluster_file)
+  cluster <- data.table::fread(cluster_file)[, c("id", "clstr")]
+  names(cluster) <- c("id", "cluster")
 
   ### add cluster to bed_table
-  result <- merge(x = bed_table, y = cluster[, c("id", "clstr")], by.x = "V4", by.y = "id", sort = FALSE)[, union(names(bed_table), names(cluster)[2]), with = FALSE]
+  result <- merge(x = bed_table, y = cluster, by.x = names(bed_table)[4], by.y = "id", sort = FALSE)[, union(names(bed_table), names(cluster)[2]), with = FALSE]
 
   # delete files
   if (clean) {
     file.remove(fasta_file, paste0(cdhit_output, ".clstr"), cdhit_output, cluster_file)
   }
 
-  data.table::fwrite(x = result, file = output, sep = "\t", col.names = FALSE)
+  data.table::fwrite(x = result, file = output, sep = "\t", col.names = keep_col_names)
 }
 
 # call function with given parameter if not in interactive context (e.g. run from shell)