Skip to content

updated script. Now the path of subscripts is automatically aquired. … #33

Merged
merged 21 commits into from
Jan 9, 2019
Merged
Show file tree
Hide file tree
Changes from 7 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
35 changes: 26 additions & 9 deletions bin/1.2_filter_motifs/compareBed.sh
Original file line number Diff line number Diff line change
@@ -1,11 +1,11 @@
#!/bin/bash
#data
#motifs
#workdir
#fasta
#min
#max
#output

# This script utilizes bedtools to gain non-overlapping sequence parts between bed-files
# merge.R and maxScore.R are needed to be saved in the same directory than this to make it work
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...same directory as ...

# For more information read the wiki or run ./compareBed.sh without parameters
# author Jannik Hamp
# jannik.hamp@googlemail.com

wrong=()
da=false
mo=false
Expand All @@ -17,11 +17,13 @@ ou=false
pa=false
he=false

#no parameters chosen means help is going to be displayed
if [ $# -eq 0 ]
then
he=true
fi

#parsing the parameters, if a parameter is chosen, a value must be provided aswell. No standalone parameters
while [[ $# -gt 0 ]]
HendrikSchultheis marked this conversation as resolved.
Show resolved Hide resolved
do
key="$1"
Expand Down Expand Up @@ -87,6 +89,7 @@ case $key in
esac
done

# stores unknown selected parameters for error report
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nicer to write something like "check for unknown arguments".

count=${#wrong[@]}
if [ $count -gt 0 ]
then
Expand All @@ -99,6 +102,7 @@ then
exit 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing indentation

fi

# the help message
if [ $he == true ]
then
echo "This script utilies bedtools to select new footprints from data."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utilizes

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both in bed-format and the (no comma)

Expand All @@ -119,13 +123,14 @@ then
echo " -w --workdir the path to directory where temporary files will be created"
echo " default is the current directory"
echo " -min --min minimum size of footprints; default is 10"
echo " -max --max maximum size of footprints; default is 80"
echo " -max --max maximum size of footprints; default is 200"
echo " -o --output output path/file ; default dir is workdir and filename is newMotifs.bed and newMotifs.bed.fasta"
echo " -h --help shows this help message"
echo " -p --path the path where the required scripts merge.R and maxScore.R are stored. Default: same path as this scripts path"
exit 0
fi

# prints which parameters have been selected
echo selected parameters
echo -------------------
echo data: $da \(required\)
Expand All @@ -138,19 +143,22 @@ echo output: $ou
echo help: $he
echo path of scripts: $pa

#checks if the 3 required parameters have been selected
if [ $da == false ] || [ $mo == false ] || [ $fa == false ]
then
echo required parameters not given.
echo required are: --data \<path/data.bed\> --motifs \<path/motifs.bed\> --fasta \<path/file.fasta\>
exit 1
fi

#workdir is set to the current directory if it was not specified with a parameter
if [ $wo == false ]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should set $workdir in the beginning to './' and overwrite it if necessary. This would save you this whole if statement + an additional $wo variable.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking a bit further I think that most if not all of your if statements in this part are overcomplicated. You should set default values at the beginning of your script. This would enable you to discard most of the if statements and leave you only with the required parameters. These should be checked for their existence or in other words whether they were set (https://stackoverflow.com/questions/3601515/how-to-check-if-a-variable-is-set-in-bash).

then
wo=true
workdir=$PWD
fi

#if output was not specified it will be at "$workdir"/newMotifs.bed
if [ $ou == false ]
then
output=${workdir}/"newMotifs.bed"
Expand All @@ -165,19 +173,25 @@ fi

if [ $ma == false ]
then
max=80
max=200
ma=true
fi

#default path of the scripts merge.R and maxScore.R is the same path as compareBed.sh has
# the `echo $0 | sed 's/\/[^\/]*$/\//g'` command extracts the path from $0, where the command itself is stored
if [ $pa == false ]
then
path=`echo $0 | sed 's/\/[^\/]*$/\//g'`
pa=true
fi

#1. first filter. no overlap vs. overlap
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Filter for sequences without overlap" would be a nicer comment.

# This is done with the data of new footprints and each motif file subsequently.
# The output of one iteration is the input data for the new footprints of the next iteration, in which a new motif is checked.
echo get sequences with no overlap
cat $data > "$workdir"/pass1Tr.bed
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please rename 'pass1Tr.bed' ('pass1Trhelp.bed' aswell) to something more descriptive.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename pass1Fa.bed, pass1FaHelp.bed, etc.

# help variable is needed, because in bash I cannot write to the same file from which I am also reading the data.
# Thus, there are 2 files which are alternating in each iteration
help=true

if [ -d "$motifs" ]
Expand All @@ -194,6 +208,7 @@ do
fi
echo $i
done
# if the -m parameter is not a directory, it is expected to be a comma separated list of files
else
declare -a motiffiles=(`echo $motifs | sed 's/,/ /g'`)
for i in ${motiffiles[@]}
Expand All @@ -210,6 +225,7 @@ do
done
fi

# After the final iteration, the last output is written to pass1Tr.bed if it is not already there
if [ $help == false ]
then
cat "$workdir"/pass1TrHelp.bed > "$workdir"/pass1Tr.bed
Expand Down Expand Up @@ -256,6 +272,7 @@ do
done
fi

#The output of the last iteration of the subtract loop is written to pass2Tr.bed
if [ $help == false ]
then
cat "$workdir"/pass1FaHelp.bed > "$workdir"/pass2Tr.bed
Expand Down
8 changes: 7 additions & 1 deletion bin/1.2_filter_motifs/maxScore.R
Original file line number Diff line number Diff line change
@@ -1,10 +1,16 @@
#!/bin/Rscript

# The script is used with the script: compareBed.sh
# This calculates the absolute maxpos values of a bed-file and overwrites them
# author: Jannik Hamp
# email: jannik.hamp@googlemail.com

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please write something about your input parameter.

library(data.table)
args = commandArgs(TRUE)
file = args[1]

tab = fread(file)
colnames(tab) = c("chromosome", "start", "stop", "id", "score", "length", "maxpos", "info")
colnames(tab) = c("chromosome", "start", "stop", "id", "score", "strand", "length", "maxpos", "info")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This script fails if your table has not exactly 9 columns! Please change it to be more robust.


tab$maxpos = tab$start + tab$maxpos
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the data.frame way of combining columns. As you already use a data.table please try using the := function.


Expand Down
27 changes: 24 additions & 3 deletions bin/1.2_filter_motifs/merge.R
Original file line number Diff line number Diff line change
@@ -1,29 +1,50 @@
#!/bin/Rscript

# The script is used in the script: compareBed.sh
# This script merges the non-overlapping regions and the non-overlapping parts of overlapping regions
# It also removes sequences that are smaller than the --min paramter or bigger than the --max parameter
# Additionally, information of the comparison is written into an output file
# author Jannik Hamp
# email jannik.hamp@googlemail.com

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Describe input parameters and outputs.

library(data.table)
args=commandArgs(TRUE)
min=as.numeric(args[1])
max=as.numeric(args[2])
folder=args[3]

# reading the first dataframe: called splitted (contains splitted regions because of partial overlap)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a data.table.

splitted = fread(paste(folder, "/pass2Tr.bed", sep=''))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use file.path instead of paste.

colnames(splitted) = c("chromosome", "start", "stop", "id", "score", "length", "maxpos", "info")
colnames(splitted) = c("chromosome", "start", "stop", "id", "score", "strand", "length", "maxpos", "info")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not robust see comment above.


# reading the second dataframe: called p1 (all sequences with zero overlap)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still a data.table.

p1 = fread(paste(folder, "/pass1Tr.bed", sep=''))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

paste -> file.path

colnames(p1) = c("chromosome", "start", "stop", "id", "score", "length", "maxpos", "info")
colnames(p1) = c("chromosome", "start", "stop", "id", "score", "strand", "length", "maxpos", "info")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not robust.


# calculates the absolute position of maximum signal of each footprint
p1$maxpos = p1$start + p1$maxpos

# the new combined dataframe is now called splitted
splitted=rbind(splitted, p1)

# only keep entries with a larger sequence than min and a smaller sequence than max
splitted=splitted[which(splitted$stop - splitted$start >= min),]
splitted=splitted[which(splitted$stop - splitted$start <= max),]

# make the ids unique (because of duplicated ids of some footprints that got spliited in 2)
splitted$id=make.unique(as.character(splitted$id))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make.unique only appends numbers to duplicates leaving the first (of the duplicates) as is but we want every duplicate to get a number. (I think that already was discussed at some point.)

# calculate new length values (because of the splitted footprints)
splitted$length=splitted$stop - splitted$start

# add column containsMaxpos (0 means that maxpos has any overlap and 1 means that maxpos has no overlap with any motif)
splitted=cbind(splitted, containsMaxpos=0)
splitted$containsMaxpos[intersect(which(splitted$start <= splitted$maxpos), which(splitted$stop > splitted$maxpos))] = 1

#calculate relative maxpos values
splitted$maxpos = splitted$maxpos - splitted$start
data.table::fwrite(splitted, paste(folder, "/merged.bed", sep=''), row.names=FALSE, col.names=FALSE, quote=FALSE, sep='\t')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you call library(data.table) data.table:: is not needed. Also file.path instead of paste.


#aditional information of the comparison of the unknown footprints and the known motifs are computed and written to FilterMotifs.stats
before = fread(args[4], header=FALSE)

sumb=sum(before$V3-before$V2)
Expand All @@ -33,4 +54,4 @@ loss = formatC(1 - suma/sumb, digits=2)
lengthb = formatC(mean(before$V3-before$V2), digits=4)
lengtha = formatC(mean(splitted$length), digits=4)
stats=data.frame(sum_nt_input=sumb, sum_nt_filtered=suma, factor=difference, loss=loss, mean_length_input=lengthb, mean_length_filtered=lengtha, flag_1_ratio=length(which(splitted$containsMaxpos == 1))/dim(splitted)[1])
write.table(stats, "../FilterMotifs.stats", row.names=FALSE, quote=FALSE, sep='\t')
write.table(stats, "./FilterMotifs.stats", row.names=FALSE, quote=FALSE, sep='\t')