Skip to content

Dev to my branch #37

Merged
merged 44 commits into from
Jan 4, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
44 commits
Select commit Hold shift + click to select a range
58c8478
Merge pull request #9 from loosolab/dev
renewiegandt Dec 18, 2018
cc23250
Bugfix in bed_to_fasta.R: Get last and second last instead of fixed i…
renewiegandt Dec 18, 2018
62f6f3f
bed_to_fasta.R: Improved documentation
renewiegandt Dec 18, 2018
cf9dcd8
bed_to_fasta.R: Imporved parametercalling with optparse
renewiegandt Dec 19, 2018
6d5c604
adaption of pipeline.nf to changes in bed_to_fasta.R
renewiegandt Dec 19, 2018
ce52871
Refactoring
renewiegandt Dec 19, 2018
98985d1
refactoring; renamed reduce_bed to reduce_sequence
HendrikSchultheis Dec 19, 2018
8faf399
Merge pull request #15 from loosolab/peak_calling
renewiegandt Dec 19, 2018
e0b9d38
check whether jellyfish is installed
HendrikSchultheis Dec 19, 2018
1730868
reduce_bed renamed to reduce_sequence
HendrikSchultheis Dec 19, 2018
3c4f733
get_best_motif.py: fixed bug which caused to print motif header as la…
renewiegandt Dec 19, 2018
88fa298
check whether jellyfish is installed
HendrikSchultheis Dec 19, 2018
e17d1db
check whether cdhit is installed
HendrikSchultheis Dec 19, 2018
dcd185e
omit TODO
HendrikSchultheis Dec 19, 2018
4c16f6f
check for header and forward it if provided
HendrikSchultheis Dec 19, 2018
5a7c84e
automatically detect and keep column names if provided
HendrikSchultheis Dec 19, 2018
97464ca
added author; better missing input error
HendrikSchultheis Dec 19, 2018
cc532bf
added author
HendrikSchultheis Dec 19, 2018
8389226
Fixed typos in get_best_motif.py
renewiegandt Dec 19, 2018
2fca158
Reads BED-files with or without header
renewiegandt Dec 19, 2018
4dea8e4
Imporved description for installation in README.md
renewiegandt Dec 20, 2018
1a7a812
Removed snakemake from yaml-file
renewiegandt Dec 20, 2018
4844609
Set parameter organism as required wihtout an default value
renewiegandt Dec 20, 2018
d60faa7
spell check
HendrikSchultheis Dec 20, 2018
46cfc59
Added Parameter gtf_path. If path is set process create_gtf will be s…
renewiegandt Dec 20, 2018
6507643
spell check
HendrikSchultheis Dec 20, 2018
d86f788
fixed more typos
HendrikSchultheis Dec 20, 2018
756e98f
process description for reduce_sequence and clustering
HendrikSchultheis Dec 20, 2018
5e46266
Fixed typo in bed_to_fasta.R
renewiegandt Dec 20, 2018
80963a3
Merge pull request #12 from loosolab/motif_estiamtion
renewiegandt Dec 20, 2018
1c6bcf1
Added new parameter list to README.mf
renewiegandt Dec 20, 2018
d70610e
Merge branch 'motif_estiamtion' of https://github.molgen.mpg.de/looso…
renewiegandt Dec 20, 2018
935ba3f
Merge pull request #17 from loosolab/cluster
HendrikSchultheis Dec 21, 2018
13bccda
Fixed bug in pipeline.nf: parameter gtf_path is now working
renewiegandt Dec 21, 2018
181fc68
Merge pull request #21 from loosolab/motif_estiamtion
renewiegandt Dec 22, 2018
e29ad65
Merge pull request #24 from loosolab/peak_calling
renewiegandt Jan 3, 2019
b7c80c8
sorting scripts depending on their function
renewiegandt Jan 3, 2019
83460e9
Renaming output paths
renewiegandt Jan 3, 2019
fde8a8b
install optparse if not yet installed; added missing author to docume…
HendrikSchultheis Jan 3, 2019
1c392da
Merge pull request #30 from loosolab/cluster
HendrikSchultheis Jan 3, 2019
ab6f883
Merge branch 'dev' into estimation_motifs
renewiegandt Jan 3, 2019
8993670
Merge pull request #28 from loosolab/estimation_motifs
renewiegandt Jan 3, 2019
e629131
missing points in readme
anastasiia Jan 3, 2019
c9e7c82
Merge pull request #31 from loosolab/anastasiia-patch-1
renewiegandt Jan 3, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
68 changes: 32 additions & 36 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,44 +1,22 @@
# masterJLU2018

De novo motif discovery and evaluation based on footprints identified by TOBIAS
De novo motif discovery and evaluation based on footprints identified by TOBIAS.

For further information read the [documentation](https://github.molgen.mpg.de/loosolab/masterJLU2018/wiki)
For further information read the [documentation](https://github.molgen.mpg.de/loosolab/masterJLU2018/wiki).

## Dependencies
* [conda](https://conda.io/docs/user-guide/install/linux.html)
* [Nextflow](https://www.nextflow.io/)
* [MEME-Suite](http://meme-suite.org/doc/install.html?man_type=web)

## Installation
Start with installing all dependencies listed above. It is required to set the [enviroment paths for meme-suite](http://meme-suite.org/doc/install.html?man_type=web#installingtar).
this can be done with following commands:
```
export PATH=[meme-suite instalation path]/libexec/meme-[meme-suite version]:$PATH
export PATH=[meme-suite instalation path]/bin:$PATH
```
Start with installing all dependencies listed above (Nextflow, conda) and downloading all files from the [GitHub repository](https://github.molgen.mpg.de/loosolab/masterJLU2018).


Download all files from the [GitHub repository](https://github.molgen.mpg.de/loosolab/masterJLU2018).
The Nextflow-script needs a conda enviroment to run. Nextflow can create the needed enviroment from the given yaml-file.
On some systems Nextflow exits the run with following error:
```
Caused by:
Failed to create Conda environment
command: conda env create --prefix --file env.yml
status : 143
message:
```
If this error occurs you have to create the enviroment before starting the pipeline.
To create this enviroment you need the yml-file from the repository.
Run the following commands to create the enviroment:
```console
path=[Path to given masterenv.yml file]
conda env create --name masterenv -f=$path
```
When the enviroment is created, set the variable 'path_env' in the configuration file as the path to it.
Every other dependency will be automatically installed by Nextflow using conda. For that a new conda enviroment will be created, which can be found in the from Nextflow created work directory after the first pipeline run.
It is **not** required to create and activate the enviroment from the yaml-file beforehand.

**Important Note:** For conda the channel bioconda needs to be set as highest priority! This is required due to two differnt packages with the same name in different channels. For the pipeline the package jellyfish from the channel bioconda is needed and **NOT** the jellyfisch package from the channel conda-forge!


## Quick Start
```console
nextflow run pipeline.nf --bigwig [BigWig-file] --bed [BED-file] --genome_fasta [FASTA-file] --motif_db [MEME-file] --config [UROPA-config-file]
Expand All @@ -52,14 +30,16 @@ Required arguments:
--genome_fasta Path to genome in FASTA-format
--motif_db Path to motif-database in MEME-format
--config Path to UROPA configuration file
--create_known_tfbs_path Path to directory where output from tfbsscan (known motifs) are stored.
Path can be set as tfbs_path in next run. (Default: './')
--out Output Directory (Default: './out/')

--organism Input organism [hg38 | hg19 | mm9 | mm10]
--out Output Directory (Default: './out/')

Optional arguments:

--help [0|1] 1 to show this help message. (Default: 0)
--tfbs_path Path to directory with output from tfbsscan. If given tfbsscan will not be run.
--create_known_tfbs_path Path to directory where output from tfbsscan (known motifs) are stored.
Path can be set as tfbs_path in next run. (Default: './')
--gtf_path Path to gtf-file. If path is set the process which creats a gtf-file is skipped.

Footprint extraction:
--window_length INT This parameter sets the length of a sliding window. (Default: 200)
Expand Down Expand Up @@ -99,12 +79,28 @@ Optional arguments:
--motif_similarity_thresh FLOAT Threshold for motif similarity score (Default: 0.00001)

Creating GTF:
--organism [hg38 | hg19 | mm9 | mm10] Input organism
--tissues List/String List of one or more keywords for tissue-/category-activity, categories must be specified as in JSON
config
All arguments can be set in the configuration files
```

For further information read the [documentation](https://github.molgen.mpg.de/loosolab/masterJLU2018/wiki).


For further information read the [documentation](https://github.molgen.mpg.de/loosolab/masterJLU2018/wiki)
## Known issues
The Nextflow-script needs a conda enviroment to run. Nextflow creates the needed enviroment from the given yaml-file.
On some systems Nextflow exits the run with following error:
```
Caused by:
Failed to create Conda environment
command: conda env create --prefix --file env.yml
status : 143
message:
```
If this error occurs you have to create the enviroment before starting the pipeline.
To create this enviroment you need the yml-file from the repository.
Run the following commands to create the enviroment:
```console
path=[Path to given masterenv.yml file]
conda env create --name masterenv -f $path
```
When the enviroment is created, set the variable 'path_env' in the configuration file as the path to it.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
51 changes: 32 additions & 19 deletions bin/cdhit_wrapper.R → bin/2.1_clustering/cdhit_wrapper.R
Original file line number Diff line number Diff line change
@@ -1,10 +1,10 @@
#! /bin/Rscript
library("optparse")
if (!require(optparse)) install.packages("optparse"); library(optparse)

option_list <- list(
make_option(opt_str = c("-i", "--input"), default = NULL, help = "Input bed-file. Fourth column is expected to contain names, last column must be sequences.", metavar = "character"),
make_option(opt_str = c("-c", "--identity"), default = 0.8, help = "Identity threshold. Default = %default (CD-HIT parameter)", metavar = "double >= 0.8"),
make_option(opt_str = c("-A", "--coverage"), default = 8, help = "Minimal alignment length for both sequences in nucelotides. Default = %default (CD-HIT parameter)", metavar = "integer"),
make_option(opt_str = c("-A", "--coverage"), default = 8, help = "Minimal alignment length for both sequences in nucleotides. Default = %default (CD-HIT parameter)", metavar = "integer"),
make_option(opt_str = c("-o", "--output"), default = "cluster.bed", help = "Output file same as input but with appended column of cluster numbers. Default = %default", metavar = "character"),
make_option(opt_str = c("--clean"), default = TRUE, help = "Delete all temporary files. Default = %default", metavar = "logical"),
make_option(opt_str = c("-G", "--global"), default = 0, help = "Global sequence identity = 1, local = 0. Default = %default (CD-HIT parameter)", metavar = "integer"),
Expand All @@ -14,15 +14,15 @@ option_list <- list(
make_option(opt_str = c("-n", "--word_length"), default = 3, help = "Default = %default (CD-HIT parameter)", metavar = "integer"),
make_option(opt_str = c("-l", "--throw_away_sequences"), default = 5, help = "Maximum length of sequences thrown away. Default = %default (CD-HIT parameter)", metavar = "integer"),
# make_option(opt_str = c("-d", "--description")), # can not produce bed if this is != 0
make_option(opt_str = c("-s", "--length_dif_cutoff_shorter_p"), default = 0.0, help = "Shorter sequences length must be at least x percent of longer sequenecs length. Default = %default (CD-HIT parameter)", metavar = "double"),
make_option(opt_str = c("-s", "--length_dif_cutoff_shorter_p"), default = 0.0, help = "Shorter sequences length must be at least x percent of longer sequences length. Default = %default (CD-HIT parameter)", metavar = "double"),
make_option(opt_str = c("-S", "--length_dif_cutoff_shorter_n"), default = 999999, help = "Length difference between sequences can not be higher than x nucleotides. Default = %default (CD-HIT parameter)", metavar = "integer"),
make_option(opt_str = c("-e", "--alignment_coverage_longer_p"), default = 0.0, help = "Alignment must cover x percent of longer sequence. Default = %default (CD-HIT parameter: -aL)", metavar = "double"),
make_option(opt_str = c("-E", "--alignment_coverage_longer_n"), default = 99999999, help = "There can not be more than x unaligned nucleotides on the longer sequence. Default = %default (CD-HIT parameter: -AL)", metavar = "integer"),
make_option(opt_str = c("-f", "--alignment_coverage_shorter_p"), default = 0.0, help = "Alignment must cover x percent of shorter sequence. Default = %default (CD-HIT parameter: -aS)", metavar = "double"),
make_option(opt_str = c("-F", "--alignment_coverage_shorter_n"), default = 99999999, help = "There can not be more than x unaligned nucleotides on the longer sequence. Default = %default (CD-HIT parameter: -AS)", metavar = "integer"),
make_option(opt_str = c("-w", "--max_unmatched_longer_p"), default = 1.0, help = "Maximum unmatched percentage on longer sequence (excludes leading tailing gap). Default = %default (CD-HIT parameter: -uL)", metavar = "double"),
make_option(opt_str = c("-W", "--max_unmatched_shorter_p"), default = 1.0, help = "Maximum unmatched percentage on shorter sequence (excludes leading tailing gap). Default = %default (CD-HIT parameter: -uS)", metavar = "double"),
make_option(opt_str = c("-U", "--max_unmatched_both_n"), default = 99999999, help = "Maximum unmatched nucleotied on both sequences (excludes leading tailing gap). Default = %default (CD-HIT parameter)", metavar = "integer"),
make_option(opt_str = c("-U", "--max_unmatched_both_n"), default = 99999999, help = "Maximum unmatched nucleotide on both sequences (excludes leading tailing gap). Default = %default (CD-HIT parameter)", metavar = "integer"),
make_option(opt_str = c("-g", "--fast_cluster"), default = 1, help = "Cluster sequence in first cluster that meets threshold = 0 (fast). Or cluster in best Cluster = 1 (accurate). Default = %default (CD-HIT parameter)", metavar = "integer"),
make_option(opt_str = c("-r", "--strand"), default = 0, help = "Align +/+ & +/- (= 1). Or align only +/+ (= 0). Default = %default (CD-HIT parameter)", metavar = "integer"),
make_option(opt_str = c("--match"), default = 2, help = "Matching score. Default = %default (CD-HIT parameter)", metavar = "integer"),
Expand All @@ -33,7 +33,8 @@ option_list <- list(
)

opt_parser <- OptionParser(option_list = option_list,
description = "CD-HIT-EST Wrapper function. See https://github.com/weizhongli/cdhit/wiki for more information regarding CD-HITs parameters.")
description = "CD-HIT-EST Wrapper function. See https://github.com/weizhongli/cdhit/wiki for more information regarding CD-HITs parameters.",
epilogue = "Author: Hendrik Schultheis <Hendrik.Schultheis@mpi-bn.mpg.de>")

opt <- parse_args(opt_parser)

Expand All @@ -42,50 +43,61 @@ opt <- parse_args(opt_parser)
#'
#' @param input Data.table or file in bed format (requires names in fourth column and sequences in last column).
#' @param identity Identity threshold. Default = 0.8 (CD-HIT parameter)
#' @param coverage Minimal alignment length for both sequences in nucelotides. Default = 8 (CD-HIT parameter)
#' @param output Clustered bedfile. Adds cluster number in last column (numbering depend on sort_by_cluster_size parameter). Default = cluster.bed
#' @param coverage Minimal alignment length for both sequences in nucleotides. Default = 8 (CD-HIT parameter)
#' @param output Clustered bed-file. Adds cluster number in last column (numbering depend on sort_by_cluster_size parameter). Default = cluster.bed
#' @param clean Clean up after run. Default = TRUE
#' @param threads Number of threads to use (0 = all cores). Default = 1 (CD-HIT parameter)
#' @param global Global sequence identity = 1, local = 0. Default = 0 (CD-HIT parameter)
#' @param band_width "Alignment band width. Default = 20 (CD-HIT parameter)"
#' @param memory Memory limit in MB. 0 for unlimited. Default = 800 (CD-HIT parameter)
#' @param word_length Default = 3 (CD-HIT parameter)
#' @param throw_away_sequences Maximum length of sequences thrown away. Default = %default (CD-HIT parameter)
#' @param length_dif_cutoff_shorter_p Shorter sequences length must be at least x percent of longer sequenecs length. Default = 0 (CD-HIT parameter)
#' @param length_dif_cutoff_shorter_p Shorter sequences length must be at least x percent of longer sequences length. Default = 0 (CD-HIT parameter)
#' @param length_dif_cutoff_shorter_n Length difference between sequences can not be higher than x nucleotides. Default = 999999 (CD-HIT parameter)
#' @param alignment_coverage_longer_p Alignment must cover x percent of longer sequence. Default = 0 (CD-HIT parameter)
#' @param alignment_coverage_longer_n There can not be more than x unaligned nucleotides on the longer sequence. Default = 99999999 (CD-HIT parameter)
#' @param alignment_coverage_shorter_p Alignment must cover x percent of shorter sequence. Default = 0 (CD-HIT parameter)
#' @param alignment_coverage_shorter_n There can not be more than x unaligned nucleotides on the longer sequence. Default = 99999999 (CD-HIT parameter)
#' @param max_unmatched_longer_p Maximum unmatched percentage on longer sequence (excludes leading tailing gap). Default = 1 (CD-HIT parameter)
#' @param max_unmatched_shorter_p Maximum unmatched percentage on shorter sequence (excludes leading tailing gap). Default = 1 (CD-HIT parameter)
#' @param max_unmatched_both_n Maximum unmatched nucleotied on both sequences (excludes leading tailing gap). Default = 99999999 (CD-HIT parameter)
#' @param max_unmatched_both_n Maximum unmatched nucleotide on both sequences (excludes leading tailing gap). Default = 99999999 (CD-HIT parameter)
#' @param fast_cluster Cluster sequence in first cluster that meets threshold = 0 (fast). Or cluster in best Cluster = 1 (accurate). Default = 1 (CD-HIT parameter)
#' @param strand Align +/+ & +/- (= 1). Or align only +/+ (= 0). Default = 0 (CD-HIT parameter)
#' @param match Matching score. Default = 2 (CD-HIT parameter)
#' @param mismatch Mismatch score. Default = -2 (CD-HIT parameter)
#' @param gap Gap score. Default = -6 (CD-HIT parameter)
#' @param gat_ext Gap extension score. Default = -1 (CD-HIT parameter)
#' @param gap_ext Gap extension score. Default = -1 (CD-HIT parameter)
#' @param sort_cluster_by_size Either sort cluster by decreasing length (= 0) or by decreasing size (= 1). Default = 1 (CD-HIT parameter)
#'
#' TODO check whether cdhit is installed
#' @details If there is a header supplied other then the default data.table naming scheme ('V1', 'V2', etc.) it will be kept and extended.
#'
#' @author Hendrik Schultheis <Hendrik.Schultheis@@mpi-bn.mpg.de>
#'
cdhitest <- function(input, identity = 0.8, coverage = 8, output = "cluster.bed", clean = TRUE, threads = 1, global = 0, band_width = 20, memory = 800, word_length = 3, throw_away_sequences = 5, length_dif_cutoff_shorter_p = 0, length_dif_cutoff_shorter_n = 999999, alignment_coverage_longer_p = 0, alignment_coverage_longer_n = 99999999, alignment_coverage_shorter_p = 0, alignment_coverage_shorter_n = 99999999, max_unmatched_longer_p = 1, max_unmatched_shorter_p = 1, max_unmatched_both_n = 99999999, fast_cluster = 1, strand = 0, match = 2, mismatch = -2, gap = -6, gap_ext = -1, sort_cluster_by_size = 1) {
if (system("which cd-hit-est", ignore.stdout = FALSE) != 0) {
stop("Required program CD-HIT not found! Please check whether it is installed.")
}

if (missing(input)) {
stop("Input parameter missing!")
stop("No input specified! Please forward a valid bed-file.")
}

message("Loading bed.")
# load bed if neccessary
# load bed if necessary
if (!data.table::is.data.table(input)) {
bed_table <- data.table::fread(input = input, header = FALSE)
bed_table <- data.table::fread(input = input)
} else {
bed_table <- input
}

# check if there are column names to keep them
default_col_names <- grepl(pattern = "^V+\\d$", names(bed_table), perl = TRUE)
keep_col_names <- ifelse(any(default_col_names), FALSE, TRUE)

# check for duplicated names
if (anyDuplicated(bed_table[, "V4"])) {
if (anyDuplicated(bed_table[, 4])) {
warning("Found duplicated names. Making names unique.")
bed_table[, V4 := make.unique(V4)]
bed_table[, 4 := make.unique(names(bed_table)[4])]
}

message("Convert to fasta.")
Expand Down Expand Up @@ -137,17 +149,18 @@ cdhitest <- function(input, identity = 0.8, coverage = 8, output = "cluster.bed"
system(command = cluster_call, wait = TRUE)

# load reformated file
cluster <- data.table::fread(cluster_file)
cluster <- data.table::fread(cluster_file)[, c("id", "clstr")]
names(cluster) <- c("id", "cluster")

### add cluster to bed_table
result <- merge(x = bed_table, y = cluster[, c("id", "clstr")], by.x = "V4", by.y = "id", sort = FALSE)[, union(names(bed_table), names(cluster)[2]), with = FALSE]
result <- merge(x = bed_table, y = cluster, by.x = names(bed_table)[4], by.y = "id", sort = FALSE)[, union(names(bed_table), names(cluster)[2]), with = FALSE]

# delete files
if (clean) {
file.remove(fasta_file, paste0(cdhit_output, ".clstr"), cdhit_output, cluster_file)
}

data.table::fwrite(x = result, file = output, sep = "\t", col.names = FALSE)
data.table::fwrite(x = result, file = output, sep = "\t", col.names = keep_col_names)
}

# call function with given parameter if not in interactive context (e.g. run from shell)
Expand Down