<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="http://deep.mpi-inf.mpg.de/DAC/files/style/deep_process_style.css"?>
<process>
<name>SXP</name>
<version>2</version>
<author>
<name>Filippos Klironomos</name>
<email>filippos.klironomos@mdc-berlin.de</email>
</author>
<description>
The miRDeep2 pipeline involves:
*) mapping reads to the genome and keeping those that map uniquely
*) extracting the genomic DNA bracketing the uniquely mapped reads
*) folding the extracted sequences with RNAfold and keeping those that form unbifurcated hairpins
*) scoring putative precursors:
   *) most reads are expected to map to either the -5p or the -3p arm, with very few mapping elsewhere on the hairpin
   *) a short 3&apos; duplex overhang, characteristic of Drosha/Dicer processing, adds to the score
   *) relative and absolute stabilities contribute to the score
   *) a 5&apos; end of the mature sequence identical to that of a known mature sequence adds to the score
*) randomly permuting read signatures with putative precursor sequences in order to determine the FPR
Internally miRDeep2 uses the following packages:
   RNAfold version 2.1.7
   RANDFOLD version 2
</description>
<inputs>
<filetype>
<identifier>config</identifier>
<format>TSV</format>
<quantity>single</quantity>
<comment>
configuration file that miRDeep2 uses to locate the FASTQ library and to assign a 3-character identifier to it
</comment>
</filetype>
</inputs>
<!-- Following section: list reference files [e.g. reference genomes] used in this process -->
<references>
<filetype>
<identifier>genome</identifier>
<format>fasta</format>
<quantity>single</quantity>
<comment>
<![CDATA[
The hs37d5 and GRCm38mm10 genomes are modified as follows:
*) sequence IDs are simplified: everything from the first whitespace onward is removed,
*) all ambiguity-coded nucleotides [URYSWKMBDHV] (upper or lower case) are masked to 'N'.
The following commands perform both modifications:
sed -e 's/^>\(\S\+\)\s.*$/>\1/' -e '/^[^>]/s/[UuRrYySsWwKkMmBbDdHhVv]/N/g' hs37d5.fa > hs37d5_simple.fa
sed -e 's/^>\(\S\+\)\s.*$/>\1/' -e '/^[^>]/s/[UuRrYySsWwKkMmBbDdHhVv]/N/g' GRCm38mm10.fa > GRCm38mm10_simple.fa
]]>
</comment>
</filetype>
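<!--
The two sed expressions above can be read as a per-line rule. The following is a
minimal Python sketch of that rule, given only as an illustration: the function name
simplify_fasta_line is hypothetical and is not part of the pipeline, which uses sed.

```python
import re

# Same character class as the sed command: ambiguity codes in either case.
AMBIGUOUS = re.compile(r"[UuRrYySsWwKkMmBbDdHhVv]")
# Same pattern as the first sed expression: keep the ID up to the first whitespace.
HEADER = re.compile(r"^>(\S+)\s.*$")


def simplify_fasta_line(line: str) -> str:
    """Apply the header-simplification or masking rule to one FASTA line.

    The line is expected without its trailing newline, as sed sees it.
    """
    if line.startswith(">"):
        # Header line: drop everything from the first whitespace onward.
        return HEADER.sub(r">\1", line)
    # Sequence line: mask every ambiguity code to 'N'.
    return AMBIGUOUS.sub("N", line)
```

For example, a header such as ">chr1 dna:chromosome" becomes ">chr1", and a sequence
line such as "ACGTRYacgtn" becomes "ACGTNNacgtn".
-->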
<filetype>
<identifier>genome_index</identifier>
<format>bowtie-index</format>
<quantity>collection</quantity>
<comment>
bowtie version 1.1.1 index of hs37d5_simple.fa and GRCm38mm10_simple.fa generated as follows:
bowtie-build -f hs37d5_simple.fa hs37d5_simple.fa
bowtie-build -f GRCm38mm10_simple.fa GRCm38mm10_simple.fa
</comment>
</filetype>
<filetype>
<identifier>miRBase_mature</identifier>
<format>fasta</format>
<quantity>single</quantity>
<comment>mature known miRNA reference from miRBase Release 20 uploaded to ASPERA</comment>
</filetype>
<filetype>
<identifier>miRBase_hairpin</identifier>
<format>fasta</format>
<quantity>single</quantity>
<comment>precursor (hairpin) known miRNA reference from miRBase Release 20 uploaded to ASPERA</comment>
</filetype>
</references>
<!-- Following section: list output files of process [e.g. bed files, wiggle tracks] -->
<outputs>
<filetype>
<identifier>SampleID.SXPv2.DATE.known.csv</identifier>
<format>csv</format>
<quantity>single</quantity>
<comment>
expression of known miRNAs quantified by miRDeep2
</comment>
</filetype>
<filetype>
<identifier>SampleID.SXPv2.DATE.known.bed</identifier>
<format>bed</format>
<quantity>single</quantity>
<comment>
BED track of expression of known miRNAs quantified by miRDeep2
</comment>
</filetype>
<filetype>
<identifier>SampleID.SXPv2.DATE.known.bedGraph</identifier>
<format>bedGraph</format>
<quantity>single</quantity>
<comment>
bedGraph track of expression of known miRNAs quantified by miRDeep2
</comment>
</filetype>
<filetype>
<identifier>SampleID.SXPv2.DATE.novel.bed</identifier>
<format>bed</format>
<quantity>single</quantity>
<comment>
BED track of expression of novel miRNAs predicted by miRDeep2
</comment>
</filetype>
<filetype>
<identifier>SampleID.SXPv2.DATE.novel.bedGraph</identifier>
<format>bedGraph</format>
<quantity>single</quantity>
<comment>
bedGraph track of expression of novel miRNAs predicted by miRDeep2
</comment>
</filetype>
</outputs>
<!-- Precise description of what this process does, what output is generated and what statistics are computed -->
<software>
<tool>
<name>generate_config</name>
<version>missing</version>
<command_line>
<![CDATA[
echo -ne "{SampleID.fastq}\tID1\n" > config
]]>
</command_line>
<loop>no looping</loop>
<comment>
This command creates the configuration file that miRDeep2 uses to locate the FASTQ library {SampleID.fastq} and to assign
a 3-character internal ID to it, in this case ID1.
</comment>
</tool>
<tool>
<name>mapper.pl</name>
<version>miRDeep2.0.0.7</version>
<command_line>
<![CDATA[
mapper.pl config -d -e -h -j -k {Adaptor} -l 18 -m -p {genome_index} -s reads_collapsed.fa -t reads_vs_genome.arf -v -o 12 &> mapper_summary.log
]]>
</command_line>
<loop>no looping</loop>
<comment>
Use the configuration file to locate the library; remove the adaptor given by {Adaptor};
collapse the reads into the file "reads_collapsed.fa";
map to the reference and write the alignments to the file "reads_vs_genome.arf";
write the summary to "mapper_summary.log".
ARF is a text-based format consisting of the following columns, in this order:
readID        # ID of the read
readLength    # length of the read
readStart     # start position of the alignment relative to the read
readEnd       # end position of the alignment relative to the read
readSeq       # sequence of the read
chr           # chromosome of the reference where the read maps
refLength     # length of the reference sequence the read maps to
refStart      # start position of the reference sequence the read maps to
refEnd        # end position of the reference sequence the read maps to
referenceSeq  # reference sequence the read maps to
strand        # strand of the reference
mm            # number of mismatches in the alignment
editString    # one character per aligned position: m==match, M==mismatch
</comment>
</tool>
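<!--
The ARF column list in the mapper.pl comment can be turned into a small record type.
The following is a minimal Python sketch, assuming that the columns are tab-separated
and appear in the order listed there. ArfRecord and parse_arf_line are hypothetical
names used for illustration; they are not part of miRDeep2.

```python
from typing import NamedTuple


class ArfRecord(NamedTuple):
    read_id: str
    read_length: int
    read_start: int
    read_end: int
    read_seq: str
    chrom: str
    ref_length: int
    ref_start: int
    ref_end: int
    ref_seq: str
    strand: str
    mismatches: int
    edit_string: str


# Positions of the columns that hold integers.
_INT_COLUMNS = {1, 2, 3, 6, 7, 8, 11}


def parse_arf_line(line: str) -> ArfRecord:
    """Parse one ARF line (assumed tab-separated, 13 columns) into a record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 13:
        raise ValueError(f"expected 13 ARF columns, found {len(fields)}")
    values = [int(f) if i in _INT_COLUMNS else f for i, f in enumerate(fields)]
    return ArfRecord(*values)
```

A record parsed this way exposes, for instance, record.chrom and record.mismatches,
which is enough to filter "reads_vs_genome.arf" for perfect matches on one chromosome.
-->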
<tool>
<name>miRDeep2</name>
<version>miRDeep2.0.0.7</version>
<command_line>
<![CDATA[
miRDeep2.pl reads_collapsed.fa {genome} reads_vs_genome.arf {miRBase_mature} none {miRBase_hairpin} -t {Species} -P -d -v 2> miRDeep2.report.log
]]>
</command_line>
<loop>no looping</loop>
<comment>quantify known miRNAs and predict putative novel miRNAs across samples</comment>
</tool>
<tool>
<name>rename_according_to_metadata_standards</name>
<version>missing</version>
<command_line>
<![CDATA[
cp miRNAs_expressed_all_samples_DATE_t_TIME.csv {SampleID}.SXPv2.{DATE}.known.csv
]]>
</command_line>
<loop>no looping</loop>
<comment>copy the output data file to a name that conforms to the metadata naming standards</comment>
</tool>
<tool>
<name>mirdeep2_csv2bed.pl</name>
<version>missing</version>
<command_line>
<![CDATA[
mirdeep2_csv2bed.pl -r result_DATE_t_TIME.csv -p -T {SampleID}
cp known_pres_DATE_t_TIME_score-50_to_na.bed {SampleID}.SXPv2.{DATE}.known.bed
echo "track name=\"{SampleID}.novel_miRNAs\" description=\"novel miRNAs detected by miRDeep2 for {SampleID}\" visibility=2 itemRgb=\"On\"" > "{SampleID}.SXPv2.{DATE}.novel.bed"
cat "novel_pres_DATE_t_TIME_score-50_to_na.bed" >> "{SampleID}.SXPv2.{DATE}.novel.bed"
]]>
</command_line>
<loop>no looping</loop>
<comment>
Generate BED tracks from the total precursor read counts of known and novel miRNAs and rename them according to metadata standards.
This tool has been uploaded to ASPERA.
</comment>
</tool>
<tool>
<name>bed_to_bedGraph</name>
<version>missing</version>
<command_line>
<![CDATA[
gawk 'NR==3 {print "track type=bedGraph description=\"miRDeep2 known miRNAs\" visibility=2 color=0,0,255 altColor=255,0,0" > FILENAME"Graph"; print $1,$2,$3,$5 >> FILENAME"Graph"} NR>3 {print $1,$2,$3,$5 >> FILENAME"Graph"}' "{SampleID}.SXPv2.{DATE}.known.bed"
gawk 'NR==1 {print "track type=bedGraph description=\"miRDeep2 novel miRNAs\" visibility=2 color=0,0,255 altColor=255,0,0" > FILENAME"Graph"; print $1,$2,$3,$5 >> FILENAME"Graph"} NR>1 {print $1,$2,$3,$5 >> FILENAME"Graph"}' "{SampleID}.SXPv2.{DATE}.novel.bed"
]]>
</command_line>
<loop>no looping</loop>
<comment>convert BED tracks to bedGraph</comment>
</tool>
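<!--
The BED to bedGraph conversion keeps columns 1, 2, 3 and 5 of every data line and
writes a bedGraph track line first. The following is a minimal Python sketch of the
same idea, with one deliberate difference: the gawk commands skip a fixed number of
leading lines, whereas this sketch skips any line that starts with "track", "browser"
or "#". That header detection is an assumption, and bed_to_bedgraph is a hypothetical
name, not a tool used by the process.

```python
from typing import Iterable, Iterator


def bed_to_bedgraph(lines: Iterable[str], description: str) -> Iterator[str]:
    """Yield bedGraph lines (chrom, start, end, score) from BED lines."""
    # Track line with the same attributes as in the gawk commands.
    yield (
        f'track type=bedGraph description="{description}" '
        "visibility=2 color=0,0,255 altColor=255,0,0"
    )
    for line in lines:
        line = line.strip()
        # Skip blank lines and header lines (assumed prefixes, see above).
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()
        if len(fields) < 5:
            raise ValueError(f"BED line has fewer than 5 columns: {line!r}")
        # Columns 1, 2, 3 and 5, space-separated like the default gawk output.
        yield " ".join((fields[0], fields[1], fields[2], fields[4]))
```

Reading a BED file line by line and writing each yielded string to a file with the
suffix "Graph" appended to the BED file name would mirror the naming used above.
-->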
</software>
</process>