Improved: Motif clustering #39

Merged
merged 7 commits into from Jan 6, 2019

renewiegandt
Collaborator

Improved motif clustering by comparing the motifs of each cluster separately with the merged motif file.
Added a new R script that labels the TSV files with the corresponding cluster ID.

@renewiegandt added the enhancement label Jan 5, 2019
@HendrikSchultheis changed the title from "Imporved: Motif clustering" to "Improved: Motif clustering" Jan 5, 2019
Collaborator

@HendrikSchultheis left a comment

I found some things you should change, but nothing major. There are also some spelling errors that should be fixed; you can use the hunspell package to check your scripts for typos.

@@ -1,13 +1,13 @@
#!/usr/bin/env Rscript
library("optparse")
if (!require(optparse)) install.packages("optparse"); library(optparse)

option_list <- list(
make_option(opt_str = c("-i", "--input"), default = NULL, help = "Input bed-file. Second last column must be sequences and last column must be the cluster_id.", metavar = "character"),
Collaborator

cluster_id -> cluster id


option_list <- list(
make_option(opt_str = c("-i", "--input"), default = NULL, help = "Input bed-file. Second last column must be sequences and last column must be the cluster_id.", metavar = "character"),
make_option(opt_str = c("-p", "--prefix"), default = "" , help = "Prefix for file names. Default = '%default'", metavar = "character"),
make_option(opt_str = c("-m", "--min_seq"), default = 100, help = "Minimum amount of sequences in clusters. Default = %default", metavar = "integer")
)

opt_parser <- OptionParser(option_list = option_list,
description = "Convert BED-file to one FASTA-file per cluster")
Collaborator

...cluster.

Collaborator

Author and email are missing.


opt <- parse_args(opt_parser)
Collaborator

The sequences of each cluster are written as a FASTA-file.

if (is.null(bedInput)) {
stop("ERROR: Input parameter cannot be null! Please specify the input parameter.")
}

bed <- data.table::fread(bedInput, sep = "\t")


# Get last column of data.table, which refers to the cluster, as a vector.
cluster_no <- as.vector(bed[[ncol(bed)]])
Collaborator

You can remove as.vector. Using [[]] already returns a vector.
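
A minimal sketch of this point, assuming a small base-R data.frame as a stand-in for the data.table read by the script (the column names and values are made up for illustration). `[[` extracts a single column as an atomic vector, so wrapping it in `as.vector` changes nothing:

```r
# Hypothetical stand-in for the BED table; the last column is the cluster id.
bed <- data.frame(
  chr      = c("chr1", "chr1", "chr2"),
  sequence = c("ACGT", "TTGA", "GGCA"),
  cluster  = c(1L, 2L, 1L),
  stringsAsFactors = FALSE
)

# `[[` returns the column itself, which is already an atomic vector.
cluster_no <- bed[[ncol(bed)]]

is.vector(cluster_no)                         # TRUE
identical(cluster_no, as.vector(cluster_no))  # TRUE
```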

# Split data.table bed on its last column (cluster_no) into list of data.frames
clusters <- split(bed, cluster_no, sorted = TRUE, flatten = FALSE)

# For each data.frame(cluster) in list clusters:
discard <- lapply(1:length(clusters), function(i){
Collaborator

It's nicer to use seq_len instead of 1:x.
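
The likely reason behind this suggestion is the empty-input case: `1:length(x)` counts down from 1 to 0 when `x` has no elements, so the loop body runs twice with invalid indices. A short sketch, assuming a hypothetical empty `clusters` list:

```r
# Hypothetical edge case: no clusters were produced.
clusters <- list()

# 1:length(x) becomes 1:0, i.e. the two indices 1 and 0.
bad_idx <- 1:length(clusters)

# seq_len() yields a zero-length index, so lapply() would do nothing.
good_idx <- seq_len(length(clusters))

bad_idx         # 1 0
length(good_idx)  # 0
```

`seq_along(clusters)` is an equivalent shorthand for `seq_len(length(clusters))`.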

#' @contact rene.wiegandt(at)mpi-bn.mpg.de
merge_similar <- function(tsv_in, file_list, min_weight){

files <- unlist(as.list(strsplit(file_list, ",")))
Collaborator

The as.list is redundant; strsplit already returns a list.
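
A quick check of this, assuming a made-up comma-separated `file_list` value (the file names are hypothetical):

```r
# Hypothetical input string in the format the script expects.
file_list <- "cluster_1.fasta,cluster_2.fasta,cluster_3.fasta"

parts <- strsplit(file_list, ",")
is.list(parts)  # TRUE

# With and without as.list() the result is the same character vector.
files_old <- unlist(as.list(parts))
files_new <- unlist(parts)
identical(files_old, files_new)  # TRUE
```

For a single input string, `strsplit(file_list, ",")[[1]]` gives the same vector without `unlist`.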


# split the string on the character "." in the first to columns and safe the last value each, to get the number of the cluster.
tsv <- data.table::fread(tsv_in, header = TRUE, sep = "\t",colClasses = 'character')
query_cluster <- unlist(lapply(strsplit(tsv[["Query_ID"]],"\\."), function(l){
Collaborator

Use vapply instead; it already returns a vector.
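
One way this could look, as a sketch only. The `query_ids` values below are invented; in the script they would come from `tsv[["Query_ID"]]`:

```r
# Hypothetical Query_ID values; the cluster number is the part after the last ".".
query_ids <- c("motif.a.3", "motif.b.12", "motif.c.7")

# vapply() checks that every result is a single character string and
# returns an atomic vector directly, so no unlist() is needed.
query_cluster <- vapply(
  strsplit(query_ids, "\\."),
  function(l) tail(l, n = 1),
  FUN.VALUE = character(1)
)

query_cluster  # "3" "12" "7"
```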

tail(l,n=1)
}))

# create data.table with only the cluster-numbers
Collaborator

You don't have to convert later if you create your data.table with numeric columns right away.
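
A sketch of what this might mean in practice, using a base-R data.frame as a stand-in for the data.table and a hypothetical helper `last_field_as_number`; the IDs are invented. The conversion happens while the columns are built, so no later `as.numeric` step is needed:

```r
# Hypothetical IDs; the cluster number is the part after the last ".".
query_ids  <- c("motif.a.3", "motif.b.12")
target_ids <- c("motif.c.12", "motif.d.3")

# Hypothetical helper: extract the last "."-separated field and convert it at once.
last_field_as_number <- function(ids) {
  vapply(
    strsplit(ids, "\\."),
    function(l) as.numeric(tail(l, n = 1)),
    FUN.VALUE = numeric(1)
  )
}

# The columns are numeric from the start.
sim_not_unique <- data.frame(
  query_cluster  = last_field_as_number(query_ids),
  target_cluster = last_field_as_number(target_ids)
)

str(sim_not_unique)
```

`data.table::data.table()` accepts the same column arguments, so the idea should carry over to the script unchanged.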

sim_not_unique[, query_cluster := as.numeric(query_cluster)]
sim_not_unique[, target_cluster := as.numeric(target_cluster)]

# remove rows if column 1 is idential to column 2
Collaborator

*identical

system(paste("cat",f,">",basename(f)))
})
}
# merge FASTA-files depending on the clustered graphs
Collaborator

One or two more comments would be nice in this lapply.

@renewiegandt
Collaborator Author

Link #16

Collaborator

@HendrikSchultheis left a comment

Great work!

@renewiegandt merged commit eb36ef7 into dev Jan 6, 2019