Skip to content

updated script. Now the path of subscripts is automatically aquired. … #33

Merged
merged 21 commits into from
Jan 9, 2019

Conversation

JannikHamp
Copy link
Collaborator

…Fixed error which wrote output to the path of subscripts instead of the working directory

…Fixed error which wrote output to the path of subscripts instead of the working directory
@renewiegandt
Copy link
Collaborator

  1. There is close to no documentation. Please document your code!

  2. please add your name and email address to the script

  3. See Critical bug in compareBed.sh #23
    1

  4. Still not working with the nextflow script! What is the -p parameter for now?

  5. I need the location of the output file! Where is it saved? in './' or somewhere else?

Copy link
Collaborator

@renewiegandt renewiegandt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment.

@renewiegandt
Copy link
Collaborator

Maybe you could check if there are new/unknown motif and if not you should throw an error message.
What happens right now if the output is empty?

@renewiegandt renewiegandt added the urgent Needs to be done as fast as possible label Jan 4, 2019
@JannikHamp
Copy link
Collaborator Author

  1. I want to write the documentation later, at home
  2. same
  3. paths with double // still work with bash. even ///bla/file.fasta I am not sure if this is the source of the error but I ll implement a check for tailing /
  4. -p is the path where the two subscripts merge.R and maxScore.R are located the parameter can be skipped now since the path is aquired automatically
  5. output directory is at . if no path is given

@renewiegandt
Copy link
Collaborator

./compareBed.sh -d /mnt/agnerds/masterjlu2018/out5/footprint_extraction/buenrostro50k_chr1_fp_called_peaks.bed -m /mnt/workspace1/rene.wiegandt/tmp6/ --fasta /mnt/agnerds/masterjlu2018/testdata/presentation/mm10_chr.fa -o test.beb

Gets following:

Warning message:
In fread(paste(folder, "/pass2Tr.bed", sep = "")) :
  File '/mnt/agnerds/Rene.Wiegandt/repo/masterJLU2018/bin/1.2_filter_motifs/pass2Tr.bed' has size 0. Returning a NULL data.table.
Error in setnames(x, value) :
  Can't assign 8 names to a 0 column data.table
Calls: colnames<- -> names<- -> names<-.data.table -> setnames
Execution halted
Error: The requested file (/mnt/agnerds/Rene.Wiegandt/repo/masterJLU2018/bin/1.2_filter_motifs/merged.bed) could not be opened. Error message: (No such file or directory). Exiting!

The // paths are working but It should be changed.

Copy link
Collaborator

@HendrikSchultheis HendrikSchultheis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I completely agree with @renewiegandt. Right now it is really hard for someone else to unterstand what the code is doing, please add a lot more comments!

@HendrikSchultheis
Copy link
Collaborator

Please refer to our guidelines while doing so.

@JannikHamp
Copy link
Collaborator Author

Anastasiia hat doch ein neues bed-format mit 9 spalten. in den daten /mnt/agnerds/masterjlu2018/out5/footprint_extraction/buenrostro50k_chr1_fp_called_peaks.bed sind nur 8 Spalten enthalten. Es fehlt strand nach score. Das wäre das erste was mir auffällt

@JannikHamp
Copy link
Collaborator Author

Edit: Diese version arbeitet jedoch auch noch mit 8 spalten glaube ich. Ist ein anderer Fehler

@renewiegandt
Copy link
Collaborator

Please write your comments in English, thanks.
I tried it with new data. It is still not working.

@JannikHamp
Copy link
Collaborator Author

the path /mnt/workspace/rene.wiegandt/tmp6/ does not exist.
So it runs in an error when this is the -m parameter, when I try to reproduce the error

@renewiegandt
Copy link
Collaborator

The path exists its /mnt/workspace1/rene.wiegandt/tmp6/. You can't reach it from your VM.
I'm going to run it on the full dataset. Going to take a while it could be possible that everything is filtered in this dataset! You should definitely catch this case with an error message.

@anastasiia
Copy link
Collaborator

anastasiia commented Jan 4, 2019

I have just made a pull request #38 with my script that always produces the same number of columns, if I receive no information from original input bed file, there will be a "." at the last column. Please make sure @JannikHamp that your script is working now.

@renewiegandt
Copy link
Collaborator

You should also add documentation to the two subscripts.

I'm not sure why you are writting different files depending on teh help parameter?
Could you explain it to me?

@JannikHamp
Copy link
Collaborator Author

The help parameter at this point is just introduced at that point and has nothing to do with the help of the command compareBed.sh
It is just a memory which file was used in the last iteration.
Bash can only write to file if it is not reading from it at the same time. But the bedtools command are reading and writing at the same time, so i need to write the output to another file. Example:
I have data with new footprints and motiffile1, motiffile2, and motiffile3:
I write data to pass1Tr.bed
then i call the bedtools command with pass1Tr.bed and motiffile1 > pass1TrHelp.bed
Next I compare pass1TrHelp.bed with motiffile2 > pass1Tr.bed
Last I compare pass1Tr.bed with motiffile3 > pass1TrHelp.bed

So pass1TrHelp.bed now contains the information of all comparisons

@renewiegandt
Copy link
Collaborator

Ah than I got it mixed up. Maybe you could rename the variable?

@renewiegandt
Copy link
Collaborator

I tried it on the full mDux data. I still got the same error.
Where is your test data I would like to run the script with your data?

@JannikHamp
Copy link
Collaborator Author

-d /mnt/agnerdsJannik.Hamp/call_peaks_output_to_check.bed
-m /mnt/agnerds/Jannik.Hamp/TFBSscanOutmm10/test/
--fasta /mnt/agnerds/masterjlu2018/fasta/mm10_upper.fasta

but I corrected the script now for 9 columns bed-files. this will not work with the 8 column file /mnt/agnerdsJannik.Hamp/call_peaks_output_to_check.bed

@renewiegandt
Copy link
Collaborator

If an error occurs in your script prints an error message to stdout but does not stop the script.

Copy link
Collaborator

@HendrikSchultheis HendrikSchultheis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall try to be more consistent with your spacing. And please check your code for typos.

#output

# This script utilizes bedtools to gain non-overlapping sequence parts between bed-files
# merge.R and maxScore.R are needed to be saved in the same directory than this to make it work
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...same directory as ...

bin/1.2_filter_motifs/compareBed.sh Outdated Show resolved Hide resolved
@@ -87,6 +89,7 @@ case $key in
esac
done

# stores unknown selected parameters for error report
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nicer to write something like "check for unknown arguments".

@@ -99,6 +102,7 @@ then
exit 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing indentation

@@ -99,6 +102,7 @@ then
exit 1
fi

# the help message
if [ $he == true ]
then
echo "This script utilies bedtools to select new footprints from data."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utilizes

colnames(splitted) = c("chromosome", "start", "stop", "id", "score", "length", "maxpos", "info")
colnames(splitted) = c("chromosome", "start", "stop", "id", "score", "strand", "length", "maxpos", "info")

# reading the second dataframe: called p1 (all sequences with zero overlap)
p1 = fread(paste(folder, "/pass1Tr.bed", sep=''))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

paste -> file.path

p1 = fread(paste(folder, "/pass1Tr.bed", sep=''))
colnames(p1) = c("chromosome", "start", "stop", "id", "score", "length", "maxpos", "info")
colnames(p1) = c("chromosome", "start", "stop", "id", "score", "strand", "length", "maxpos", "info")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not robust.

colnames(splitted) = c("chromosome", "start", "stop", "id", "score", "length", "maxpos", "info")
colnames(splitted) = c("chromosome", "start", "stop", "id", "score", "strand", "length", "maxpos", "info")

# reading the second dataframe: called p1 (all sequences with zero overlap)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still a data.table.

splitted=splitted[which(splitted$stop - splitted$start >= min),]
splitted=splitted[which(splitted$stop - splitted$start <= max),]

# make the ids unique (because of duplicated ids of some footprints that got spliited in 2)
splitted$id=make.unique(as.character(splitted$id))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make.unique only appends numbers to duplicates leaving the first (of the duplicates) as is but we want every duplicate to get a number. (I think that already was discussed at some point.)

splitted=cbind(splitted, containsMaxpos=0)
splitted$containsMaxpos[intersect(which(splitted$start <= splitted$maxpos), which(splitted$stop > splitted$maxpos))] = 1

#calculate relative maxpos values
splitted$maxpos = splitted$maxpos - splitted$start
data.table::fwrite(splitted, paste(folder, "/merged.bed", sep=''), row.names=FALSE, col.names=FALSE, quote=FALSE, sep='\t')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you call library(data.table) data.table:: is not needed. Also file.path instead of paste.

@renewiegandt
Copy link
Collaborator

critical bug is fixed by cca8ac4 and f983d22 .

@renewiegandt
Copy link
Collaborator

review_comparebed
review_comparebed_2

@renewiegandt
Copy link
Collaborator

I get following error with the mDux data:

Command error:
  Error in setnames(x, value) :
    Can't assign 9 names to a 10 column data.table
  Calls: colnames<- -> names<- -> names<-.data.table -> setnames
  Execution halted
  Error in setnames(x, value) :
    Can't assign 9 names to a 10 column data.table
  Calls: colnames<- -> names<- -> names<-.data.table -> setnames
  Execution halted
  index file mm10_chr.fa.fai not found, generating...
  Error: The requested file (merged.bed) could not be opened. Error message: (No such file or directory). Exiting!

No idea where the 10th column comes from but this proves that you need to make the naming more robust. You should just use the column names from the bed file you get as input.

@renewiegandt
Copy link
Collaborator

@JannikHamp I committed a few fixes to your branch. Please pull before doing further changes!

Copy link
Collaborator

@renewiegandt renewiegandt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few typos, wording.
R: Please use '<-' instead of '=' to assign a variable. In rare cases '=' can lead to errors in the script.

You should also add more comments to the statistics.
For each line what information is generated.
For example:
sum_data = sum(data[[3]]-data[[2]])
No idea what information is stored in sum_data.

path=`echo $0 | sed 's/\/[^\/]*$/\//g'`
help=false

# display help when no parameters chosen
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.. no parameters are given

fi

if [ $ma == false ]
# motiffiles either from a directory OR comma separated list
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get motif files either...

ma=true
fi
# creates an array of all files with bed in its name in the directory $motifs
declare -a motiffiles=(`ls $motifs | grep bed | sed "s|^|$motifs\/|g" | tr '\n' ' ' | sed "s|//|/|g"`)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't you check for the file ending? I would use: ' grep "*.bed" '

if [ ! -d $workdir ]
then
mkdir $workdir
# the else case means, that the motiffiles were passed comma separated with no whitespace.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

motif files

for i in ${motiffiles[@]}
do
if [ $help == true ]
# remove trailing tabs in motiffile
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

motif file


Rscript --vanilla $path/merge.R $min $max $workdir $data
# check if header existed. If so, final output also has a header.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... exists.

stop("footprint file has less than 9 columns. exiting.")
}

# remove sequences that are smaller than minimum (parameter)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(parameter: min)
(parameter: max)

else
help=true
bedtools intersect -v -a "$workdir"/pass1TrHelp.bed -b $i > "$workdir"/pass1Tr.bed
echo file $i does not exist
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would write ERROR: or Error: in front of the error message. So the user can identify the error quicker.

@HendrikSchultheis HendrikSchultheis mentioned this pull request Jan 9, 2019
35 tasks
Copy link
Collaborator

@HendrikSchultheis HendrikSchultheis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow this finally looks good!

@renewiegandt renewiegandt merged commit 81c6fcd into dev Jan 9, 2019
@HendrikSchultheis HendrikSchultheis mentioned this pull request Jan 10, 2019
Sign in to join this conversation on GitHub.
Labels
urgent Needs to be done as fast as possible
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants