docs/quantification/transcriptome/SXPv2.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="http://deep.mpi-inf.mpg.de/DAC/files/style/deep_process_style.css"?>
<process>
	<name>SXP</name>
	<version>2</version>
	<author>
		<name>Filippos Klironomos</name>
		<email>filippos.klironomos@mdc-berlin.de</email>
	</author>
    <description>
    *) The miRDeep2 pipeline involves:

      *) mapping reads to the genome and keeping those that map uniquely
      *) extracting the genomic DNA flanking the uniquely mapped reads
      *) folding the extracted sequences with RNAfold and keeping those that form unbifurcated hairpins
      *) scoring putative precursors:
         *) most reads are expected to map to either the -5p or the -3p arm, with very few mapping to the rest of the hairpin
         *) a short 3&apos; duplex overhang, characteristic of Drosha/Dicer processing, adds to the score
         *) relative and absolute stabilities contribute to the score
         *) a mature sequence whose 5&apos; end is identical to that of a known mature sequence adds to the score
      *) randomly permuting read signatures with putative precursor sequences in order to estimate the false positive rate (FPR)

    Internally miRDeep2 uses the following packages:

    RNAfold version 2.1.7
    RANDFOLD version 2
	</description>
	<inputs>
		<filetype>
      <identifier>config</identifier>
			<format>TSV</format>
      <quantity>single</quantity>
      <comment>
        configuration file that miRDeep2 uses to locate the FASTQ library and to assign a 3-character identifier to it
      </comment>
		</filetype>
	</inputs>
	<!-- Following section: list reference files [e.g. reference genomes] used in this process -->
	<references>
		<filetype>
			<identifier>genome</identifier>
			<format>fasta</format>
			<quantity>single</quantity>
      <comment>
<![CDATA[
        The hs37d5 and GRCm38mm10 genomes are modified as follows:

          *) sequence IDs are simplified: everything from the first white space onwards is removed,

          *) all ambiguously called nucleotides [URYSWKMBDHV], in upper or lower case, are masked to 'N'.

        The following commands apply both modifications:
        
          sed -e 's/^>\(\S\+\)\s.*$/>\1/' -e '/^[^>]/s/[UuRrYySsWwKkMmBbDdHhVv]/N/g' hs37d5.fa > hs37d5_simple.fa
          sed -e 's/^>\(\S\+\)\s.*$/>\1/' -e '/^[^>]/s/[UuRrYySsWwKkMmBbDdHhVv]/N/g' GRCm38mm10.fa > GRCm38mm10_simple.fa
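
        As a quick sanity check, the same two sed expressions can be run on a toy record. This is only a sketch: the function name simplify_fasta and the example record are hypothetical, and GNU sed is assumed (for \S, \s and \+).

```shell
# Hypothetical helper wrapping the two sed expressions used above (GNU sed assumed).
simplify_fasta() {
  sed -e 's/^>\(\S\+\)\s.*$/>\1/' -e '/^[^>]/s/[UuRrYySsWwKkMmBbDdHhVv]/N/g'
}

# Toy record: the header carries a description, the sequence carries IUPAC codes.
printf '>1 dna:chromosome example\nACGTRYacgtn\n' | simplify_fasta
```

        The header is cut at the first white space and every listed ambiguity code becomes an upper-case 'N'; the bases A, C, G, T and N are left untouched in either case.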
]]>
      </comment>
		</filetype>
		<filetype>
			<identifier>genome_index</identifier>
			<format>bowtie-index</format>
			<quantity>collection</quantity>
      <comment>
  	    bowtie version 1.1.1 index of hs37d5_simple.fa and GRCm38mm10_simple.fa generated as follows:

          bowtie-build -f hs37d5_simple.fa hs37d5_simple.fa
          bowtie-build -f GRCm38mm10_simple.fa GRCm38mm10_simple.fa
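<![CDATA[

        A minimal sketch for checking that an index build completed, assuming the small-index layout of bowtie 1 (six files ending in .ebwt). The function name check_bowtie_index is hypothetical, and builds that produce a large index use a different suffix.

```shell
# Hypothetical helper: succeed only if all six bowtie 1 index files
# exist and are non-empty for the given index prefix.
check_bowtie_index() {
  prefix=$1
  for suffix in 1 2 3 4 rev.1 rev.2; do
    if [ ! -s "${prefix}.${suffix}.ebwt" ]; then
      echo "missing index file: ${prefix}.${suffix}.ebwt" >&2
      return 1
    fi
  done
  return 0
}

# Usage with the prefix passed to bowtie-build:
# check_bowtie_index hs37d5_simple.fa && echo "index complete"
```
]]>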
      </comment>
		</filetype>
		<filetype>
			<identifier>miRBase_mature</identifier>
			<format>fasta</format>
			<quantity>single</quantity>
			<comment>mature known miRNA reference from miRBase Release 20 uploaded to ASPERA</comment>
		</filetype>
		<filetype>
			<identifier>miRBase_hairpin</identifier>
			<format>fasta</format>
			<quantity>single</quantity>
			<comment>precursor (hairpin) known miRNA reference from miRBase Release 20 uploaded to ASPERA</comment>
		</filetype>
	</references>
	<!-- Following section: list output files of process [e.g. bed files, wiggle tracks] -->
	<outputs>
		<filetype>
      <identifier>SampleID.SXPv2.DATE.known.csv</identifier>
			<format>csv</format>
			<quantity>single</quantity>
      <comment>
        expression of known miRNAs quantified by miRDeep2
      </comment>
		</filetype>
		<filetype>
      <identifier>SampleID.SXPv2.DATE.known.bed</identifier>
			<format>bed</format>
			<quantity>single</quantity>
      <comment>
        BED track of expression of known miRNAs quantified by miRDeep2
      </comment>
		</filetype>
		<filetype>
      <identifier>SampleID.SXPv2.DATE.known.bedGraph</identifier>
			<format>bedGraph</format>
			<quantity>single</quantity>
      <comment>
        bedGraph track of expression of known miRNAs quantified by miRDeep2
      </comment>
		</filetype>
		<filetype>
      <identifier>SampleID.SXPv2.DATE.novel.bed</identifier>
			<format>bed</format>
			<quantity>single</quantity>
      <comment>
        BED track of expression of novel miRNAs predicted by miRDeep2
      </comment>
		</filetype>
		<filetype>
      <identifier>SampleID.SXPv2.DATE.novel.bedGraph</identifier>
			<format>bedGraph</format>
			<quantity>single</quantity>
      <comment>
        bedGraph track of expression of novel miRNAs predicted by miRDeep2
      </comment>
		</filetype>
	</outputs>
	<!-- Precise description of what this process does, what output is generated and what statistics are computed -->

	<software>
		<tool>
      <name>generate_config</name>
			<version>missing</version>
      <command_line>
            <![CDATA[
                echo -ne "{SampleID.fastq}\tID1\n" > config
            ]]>
      </command_line>
      <loop>no looping</loop>
      <comment>
        this command creates the configuration file that miRDeep2 uses to locate the FASTQ library {SampleID.fastq} and to assign
        a 3-character internal ID to it, in this case ID1
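<![CDATA[

        A small sketch of the same step with a guard on the ID length. The function name write_config and the file names in the usage line are hypothetical.

```shell
# Hypothetical helper: print one config line (FASTQ path, tab, 3-character ID).
write_config() {
  fastq=$1
  id=$2
  # reject IDs that are not exactly three characters long
  if [ "${#id}" -ne 3 ]; then
    echo "ID must be exactly 3 characters: '${id}'" >&2
    return 1
  fi
  printf '%s\t%s\n' "$fastq" "$id"
}

# Usage: write_config SampleID.fastq ID1 > config
```
]]>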
      </comment>
		</tool>
		<tool>
			<name>mapper.pl</name>
			<version>miRDeep2.0.0.7</version>
      <command_line>
        <![CDATA[
            mapper.pl config -d -e -h -j -k {Adaptor} -l 18 -m -p {genome_index} -s reads_collapsed.fa -t reads_vs_genome.arf -v -o 12  &> mapper_summary.log
        ]]>
      </command_line>
      <loop>no looping</loop>
      <comment>
        use the configuration file (-d) to locate the FASTQ library (-e); clip the 3&apos; adaptor given by {Adaptor} (-k);
        discard reads shorter than 18 nt (-l 18); collapse identical reads (-m) and write them to "reads_collapsed.fa" (-s);
        map to the reference index (-p) and write the alignments to "reads_vs_genome.arf" (-t);
        write the run summary to "mapper_summary.log"

        The ARF is a text-based format consisting of the following columns:

          readID  #  the ID of the read 
          readLength  #  length of the read
          start  #  start position of the alignment relative to the read
          end  #  end position of the alignment relative to the read
          readSeq  #  sequence of the read
          chr  #  chromosome of reference where read maps
          refLength  #  length of the reference sequence where read maps to
          start  #  start position of reference sequence where read maps to 
          end  #  end position of reference sequence where read maps to 
          referenceSeq  #  reference sequence where read maps to 
          strand  #  strand of reference
          mm  #  number of mismatches in the alignment
          MAPQ-like-string  #  m==perfect match, M==mismatch
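<![CDATA[

        A sketch of reading these columns with awk, assuming the ARF file is tab-separated and has no header line. The function name perfect_hits_per_chr is hypothetical.

```shell
# Hypothetical helper: count alignments per reference (column 6), keeping only
# those with zero mismatches (column 12). Assumes tab-separated ARF input.
perfect_hits_per_chr() {
  awk -F '\t' '$12 == 0 { n[$6]++ } END { for (c in n) print c "\t" n[c] }' "$@" | sort
}

# Usage: perfect_hits_per_chr reads_vs_genome.arf
```

        Each output line holds a reference name and the number of mismatch-free alignments on it; these are alignments of collapsed reads, not raw read counts.
]]>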
      </comment>
		</tool>
		<tool>
			<name>miRDeep2</name>
			<version>miRDeep2.0.0.7</version>
      <command_line>
<![CDATA[
        miRDeep2.pl reads_collapsed.fa {genome} reads_vs_genome.arf {miRBase_mature} none {miRBase_hairpin} -t {Species} -P -d -v 2> miRDeep2.report.log
]]>
      </command_line>
      <loop>no looping</loop>
			<comment>quantify known miRNAs and predict putative novel miRNAs across samples</comment>
		</tool>
		<tool>
			<name>rename_according_to_metadata_standards</name>
			<version>missing</version>
      <command_line>
<![CDATA[
        cp miRNAs_expressed_all_samples_DATE_t_TIME.csv {SampleID}.SXPv2.{DATE}.known.csv
]]>
      </command_line>
      <loop>no looping</loop>
      <comment>copy the miRDeep2 expression table to a file name that conforms to metadata naming standards</comment>
		</tool>
		<tool>
			<name>mirdeep2_csv2bed.pl</name>
			<version>missing</version>
      <command_line>
<![CDATA[
        mirdeep2_csv2bed.pl -r result_DATE_t_TIME.csv -p -T {SampleID} 
        cp known_pres_DATE_t_TIME_score-50_to_na.bed {SampleID}.SXPv2.{DATE}.known.bed
        echo "track name=\"{SampleID}.novel_miRNAs\" description=\"novel miRNAs detected by miRDeep2 for {SampleID}\" visibility=2 itemRgb=\"On\"" > "{SampleID}.SXPv2.{DATE}.novel.bed"
        cat "novel_pres_DATE_t_TIME_score-50_to_na.bed" >> "{SampleID}.SXPv2.{DATE}.novel.bed"
]]>
      </command_line>
      <loop>no looping</loop>
      <comment>
        Generate BED tracks from the total precursor read counts of known and novel miRNAs and rename them according to metadata standards.
        This tool has been uploaded to ASPERA.
      </comment>
		</tool>
		<tool>
			<name>bed_to_bedGraph</name>
			<version>missing</version>
      <command_line>
<![CDATA[
        gawk 'NR==3 {print "track type=bedGraph description=\"miRDeep2 known miRNAs\" visibility=2 color=0,0,255 altColor=255,0,0" > FILENAME"Graph";  print $1,$2,$3,$5 >> FILENAME"Graph"} NR>3 {print $1,$2,$3,$5 >> FILENAME"Graph"}' "{SampleID}.SXPv2.{DATE}.known.bed"
        gawk 'NR==1 {print "track type=bedGraph description=\"miRDeep2 novel miRNAs\" visibility=2 color=0,0,255 altColor=255,0,0" > FILENAME"Graph";  print $1,$2,$3,$5 >> FILENAME"Graph"} NR>1 {print $1,$2,$3,$5 >> FILENAME"Graph"}' "{SampleID}.SXPv2.{DATE}.novel.bed"
]]>
      </command_line>
      <loop>no looping</loop>
			<comment>convert BED tracks to bedGraph</comment>
		</tool>
	</software>
</process>