<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="http://deep.mpi-inf.mpg.de/DAC/files/style/deep_process_style.css"?>
<process>
  <author>
    <name>Filippos Klironomos</name>
    <email>filippos.klironomos@mdc-berlin.de</email>
  </author>

  <description>
    *) miRDeep2 pipeline involves:
      *) mapping reads to the genome and keeping those that map uniquely
      *) extracting the DNA bracketing the uniquely mapped reads
      *) folding the extracted sequences with RNAfold and keeping those that form unbifurcated hairpins
      *) scoring putative precursors:
         *) more reads are expected to map to either the -5p or -3p strand, and very few to the rest of the hairpin
         *) a short 3&apos; duplex overhang, characteristic of Drosha/Dicer processing, adds to the score
         *) relative and absolute stabilities contribute to the score
         *) a 5&apos; end of the mature sequence identical to that of a known mature sequence adds to the score
      *) randomly permuting read signatures with putative precursor sequences to estimate the false positive rate (FPR)
    Internally miRDeep2 uses the following packages:
      RNAfold version 2.1.7
      RANDFOLD version 2
  </description>

  <inputs>
    <filetype>
      <identifier>config</identifier>
      <quantity>single</quantity>
      <comment>
        This is the configuration file that miRDeep2 uses to locate the FASTQ library and assign a 3-character identifier to it.
      </comment>
    </filetype>
  </inputs>

  <references>
    <filetype>
      <identifier>genome</identifier>
      <format>fasta</format>
      <quantity>single</quantity>
      <comment>
        The hs37d5 and GRCm38mm10 genomes are modified as follows:
          *) IDs are simplified: everything from the first white space onward is removed,
          *) all ambiguously called nucleotides [URYSWKMBDHV] are masked to &quot;N&quot;.
        The following script does all this:
        <![CDATA[
          sed -e 's/^>\(\S\+\)\s.*$/>\1/' -e '/^[^>]/s/[UuRrYySsWwKkMmBbDdHhVv]/N/g' hs37d5.fa > hs37d5_simple.fa
          sed -e 's/^>\(\S\+\)\s.*$/>\1/' -e '/^[^>]/s/[UuRrYySsWwKkMmBbDdHhVv]/N/g' GRCm38mm10.fa > GRCm38mm10_simple.fa
        ]]>
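        For readers who want to check the two transformations on a small file, they can be sketched in Python as follows.
        This is an illustrative approximation of the sed script, not part of the pipeline; the helper name simplify_fasta is hypothetical.
        <![CDATA[
```python
import re

# Ambiguity codes masked by the sed script above, in both cases.
AMBIGUOUS = re.compile(r"[UuRrYySsWwKkMmBbDdHhVv]")


def simplify_fasta(lines):
    """Yield FASTA lines with simplified IDs and ambiguous bases masked to N.

    Hypothetical helper that approximates the two sed expressions.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Header line: keep only the ID up to the first white space.
            fields = line[1:].split(None, 1)
            yield ">" + (fields[0] if fields else "")
        else:
            # Sequence line: replace every ambiguity code with N.
            yield AMBIGUOUS.sub("N", line)
```
        ]]>
        Note that, like the sed script, this leaves A, C, G, T and N untouched in either case and converts masked positions to an upper-case N.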

      </comment>
    </filetype>
    <filetype>
      <identifier>genome_index</identifier>
      <format>bowtie-index</format>
      <quantity>collection</quantity>
      <comment>
        bowtie version 0.12.7 index of hs37d5_simple.fa and GRCm38mm10_simple.fa generated as follows:
        bowtie-build -f hs37d5_simple.fa hs37d5_simple.fa
        bowtie-build -f GRCm38mm10_simple.fa GRCm38mm10_simple.fa
      </comment>
    </filetype>
    <filetype>
      <identifier>miRBase_mature</identifier>
      <format>fasta</format>
      <quantity>single</quantity>
      <comment>mature known miRNA reference from miRBase Release 20 uploaded to ASPERA</comment>
    </filetype>
    <filetype>
      <identifier>miRBase_hairpin</identifier>
      <format>fasta</format>
      <quantity>single</quantity>
      <comment>precursor (hairpin) known miRNA reference from miRBase Release 20 uploaded to ASPERA</comment>
    </filetype>
  </references>

  <outputs>
    <filetype>
      <identifier>SampleID.SXPv1.DATE.known.csv</identifier>
      <quantity>single</quantity>
      <comment>
        expression of known miRNAs quantified by miRDeep2
      </comment>
    </filetype>
    <filetype>
      <identifier>SampleID.SXPv1.DATE.known.bed</identifier>
      <quantity>single</quantity>
      <comment>
        BED track of expression of known miRNAs quantified by miRDeep2
      </comment>
    </filetype>
    <filetype>
      <identifier>SampleID.SXPv1.DATE.known.bedGraph</identifier>
      <format>bedGraph</format>
      <quantity>single</quantity>
      <comment>
        bedGraph track of expression of known miRNAs quantified by miRDeep2
      </comment>
    </filetype>
    <filetype>
      <identifier>SampleID.SXPv1.DATE.novel.bed</identifier>
      <quantity>single</quantity>
      <comment>
        BED track of expression of novel miRNAs predicted by miRDeep2
      </comment>
    </filetype>
    <filetype>
      <identifier>SampleID.SXPv1.DATE.novel.bedGraph</identifier>
      <format>bedGraph</format>
      <quantity>single</quantity>
      <comment>
        bedGraph track of expression of novel miRNAs predicted by miRDeep2
      </comment>
    </filetype>
  </outputs>

  <software>
    <tool>
      <name>generate_config</name>
      <version>missing</version>
      <command_line>
        <![CDATA[ echo -ne "{SampleID.fastq}\tID1\n" > config ]]>
      </command_line>
      <loop>no looping</loop>
      <comment>
        This command creates the configuration file that miRDeep2 uses to locate the FASTQ library {SampleID.fastq} and assign
        a 3-character internal ID to it, in this case ID1.
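        As a minimal sketch of the same step, the line written by the echo command can be built in Python as shown here.
        The function name config_line and the length check are assumptions added for illustration; they are not part of miRDeep2.
        <![CDATA[
```python
def config_line(fastq_path, sample_code="ID1"):
    """Return one tab-separated config line: FASTQ path, then the sample code.

    Hypothetical helper; the 3-character requirement follows the description
    of the config file given in this document.
    """
    if len(sample_code) != 3:
        raise ValueError("the sample code must be exactly 3 characters long")
    return f"{fastq_path}\t{sample_code}\n"
```
        ]]>
        Writing the returned string to a file named config would give the same content as the echo command above for a single library.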

      </comment>
    </tool>
    <tool>
      <name>mapper.pl</name>
      <version>miRDeep2.0.0.6</version>
      <command_line>
        <![CDATA[ mapper.pl config -d -e -h -j -k {Adaptor} -l 18 -m -p {genome_index} -s reads_collapsed.fa -t reads_vs_genome.arf -v -o 12 &> mapper_summary.log ]]>
      </command_line>
      <loop>no looping</loop>
      <comment>
        Use the configuration file to locate the library; remove the adaptor provided by {Adaptor};
        collapse the reads into the file &quot;reads_collapsed.fa&quot;;
        map to the reference and write the alignments to the file &quot;reads_vs_genome.arf&quot;;
        print a summary to &quot;mapper_summary.log&quot;.

        The ARF is a text-based format consisting of the following columns:

          readID            # the ID of the read
          readLength        # length of the read
          start             # start position of the alignment relative to the read
          end               # end position of the alignment relative to the read
          readSeq           # sequence of the read
          chr               # chromosome of the reference where the read maps
          refLength         # length of the reference sequence the read maps to
          start             # start position on the reference sequence the read maps to
          end               # end position on the reference sequence the read maps to
          referenceSeq      # reference sequence the read maps to
          strand            # strand of the reference
          mm                # number of mismatches in the alignment
          MAPQ-like-string  # m==perfect match, M==mismatch
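
        The column list above can be turned into a small reader, sketched here in Python.
        It assumes the 13 columns are tab-separated and appear in the listed order; the names ArfRecord and parse_arf_line,
        and the field names, are hypothetical and chosen only for this illustration.
        <![CDATA[
```python
from typing import NamedTuple


class ArfRecord(NamedTuple):
    """One ARF alignment, with fields in the column order listed above."""
    read_id: str
    read_length: int
    read_start: int
    read_end: int
    read_seq: str
    chrom: str
    ref_length: int
    ref_start: int
    ref_end: int
    ref_seq: str
    strand: str
    mismatches: int
    edit_string: str


# Positions of the columns that hold integers.
_INT_COLUMNS = {1, 2, 3, 6, 7, 8, 11}


def parse_arf_line(line):
    """Parse one ARF line (assumed tab-separated) into an ArfRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 13:
        raise ValueError(f"expected 13 columns, found {len(fields)}")
    values = [int(f) if i in _INT_COLUMNS else f for i, f in enumerate(fields)]
    return ArfRecord(*values)
```
        ]]>
        A record parsed this way exposes, for example, rec.chrom, rec.ref_start and rec.mismatches, which is enough to filter alignments by position or mismatch count.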

      </comment>
    </tool>
    <tool>
      <name>miRDeep2</name>
      <version>miRDeep2.0.0.6</version>
      <command_line>
        <![CDATA[ miRDeep2.pl reads_collapsed.fa {genome} reads_vs_genome.arf {miRBase_mature} none {miRBase_hairpin} -t {Species} -P 2> miRDeep2.report.log ]]>
      </command_line>
      <loop>no looping</loop>
      <comment>quantify known miRNAs and predict putative novel miRNAs across samples</comment>
    </tool>
    <tool>
      <name>rename_according_to_metadata_standards</name>
      <version>missing</version>
      <command_line>
        <![CDATA[ cp miRNAs_expressed_all_samples_DATE_t_TIME.csv {SampleID}.SXPv1.{DATE}.known.csv ]]>
      </command_line>
      <loop>no looping</loop>
      <comment>rename output data file to conform to metadata naming standards</comment>
    </tool>

    <tool>
      <name>mirdeep2_csv2bed.pl</name>
      <version>missing</version>
      <command_line>
        <![CDATA[
          mirdeep2_csv2bed.pl -r result_DATE_t_TIME.csv -p -T {SampleID}
          cp known_pres_DATE_t_TIME_score-50_to_na.bed {SampleID}.SXPv1.{DATE}.known.bed
          echo "track name=\"{SampleID}.novel_miRNAs\" description=\"novel miRNAs detected by miRDeep2 for {SampleID}\" visibility=2 itemRgb=\"On\"" > "{SampleID}.SXPv1.{DATE}.novel.bed"
          cat "novel_pres_DATE_t_TIME_score-50_to_na.bed" >> "{SampleID}.SXPv1.{DATE}.novel.bed"
        ]]>
      </command_line>
      <loop>no looping</loop>
      <comment>
        Generate BED tracks from the total precursor read counts of known and novel miRNAs and rename them according to metadata standards.
        This tool has been uploaded to ASPERA.
      </comment>
    </tool>
    <tool>
      <name>bed_to_bedGraph</name>
      <version>missing</version>
      <command_line>
        <![CDATA[
          gawk 'NR==3 {print "track type=bedGraph description=\"miRDeep2 known miRNAs\" visibility=2 color=0,0,255 altColor=255,0,0" > FILENAME"Graph";  print $1,$2,$3,$5 >> FILENAME"Graph"} NR>3 {print $1,$2,$3,$5 >> FILENAME"Graph"}' "{SampleID}.SXPv1.{DATE}.known.bed"
          gawk 'NR==1 {print "track type=bedGraph description=\"miRDeep2 novel miRNAs\" visibility=2 color=0,0,255 altColor=255,0,0" > FILENAME"Graph";  print $1,$2,$3,$5 >> FILENAME"Graph"} NR>1 {print $1,$2,$3,$5 >> FILENAME"Graph"}' "{SampleID}.SXPv1.{DATE}.novel.bed"
        ]]>
      </command_line>
      <loop>no looping</loop>
      <comment>convert BED tracks to bedGraph</comment>
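<!--
The BED to bedGraph conversion can be sketched in Python as follows. This is an illustrative
sketch under stated assumptions, not the pipeline's implementation: instead of counting lines
as the gawk commands do, it skips any line that starts with "track" or "browser", and it
assumes the value to plot is BED column 5 (the score). The function name bed_to_bedgraph is
hypothetical.

```python
def bed_to_bedgraph(bed_lines, description):
    """Return bedGraph lines (track header first) built from BED lines.

    Each data line is reduced to chrom, start, end and BED column 5.
    """
    out = [
        f'track type=bedGraph description="{description}" '
        "visibility=2 color=0,0,255 altColor=255,0,0"
    ]
    for line in bed_lines:
        fields = line.split()
        # Skip header lines and lines too short to carry a score column.
        if len(fields) < 5 or fields[0] in ("track", "browser"):
            continue
        out.append(" ".join((fields[0], fields[1], fields[2], fields[4])))
    return out
```

Filtering on the header keywords rather than on line numbers keeps the sketch independent of how
many header lines a given BED file carries.
-->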

    </tool>
  </software>
</process>