From 9094b4eb44120ef7c5df0804feb853b3f6d9006e Mon Sep 17 00:00:00 2001 From: Peter Ebert Date: Fri, 30 Dec 2016 15:52:40 +0100 Subject: [PATCH] ADD: track hub conversion processes (THB) --- docs/misc/THBv1.xml | 81 +++++++++++++++++++++++++ docs/misc/THBv2.xml | 125 ++++++++++++++++++++++++++++++++++++++ docs/misc/THBv3.xml | 142 ++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 348 insertions(+) create mode 100644 docs/misc/THBv1.xml create mode 100644 docs/misc/THBv2.xml create mode 100644 docs/misc/THBv3.xml diff --git a/docs/misc/THBv1.xml b/docs/misc/THBv1.xml new file mode 100644 index 0000000..213c675 --- /dev/null +++ b/docs/misc/THBv1.xml @@ -0,0 +1,81 @@ + + + + THB + 1 + + Peter Ebert + pebert@mpi-inf.mpg.de + + + The trackhub_conv.py Python3 script adds the 'chr' prefix to the chromosome names and filters + for the chromosomes 1-22 / 1-19 and X,Y to keep genomic coordinates compatible between assemblies. + Note that the script just reads the folder contents and converts every file in the folder that appears + to be the output of a DEEP process and to be a peak or bigwig file (based on file naming). + The converted files are put in the same folder. + Important: MACS2 outputs narrowPeak/broadPeak files that are not fully compliant with ENCODE standards: + the score column (index 5) has to be between 0-1000, so the conversion script rescales these values. + Please note that the peak name still refers to the original (unconverted) file. + Approximately 1 out of 10 files is chosen at random and checked for consistency by reversing the conversion + (except for the scaling of the score column in case of peak files) and computing the MD5 checksum, + which is then compared to the MD5 checksum of the original file after filtering + for the appropriate chromosomes as explained above. 
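The conversion described above can be sketched as follows. This is a minimal illustration and not the code of trackhub_conv.py: the function name `convert_peaks` and the constant `HUMAN_CHROMS` are hypothetical, and the linear min-max rescaling is an assumption, since the description does not say how the scores are rescaled. "Index 5" is read here as the fifth column of the narrowPeak/broadPeak format, i.e. index 4 when counting from zero.

```python
# Assumption: input uses chromosome names without prefix ("1", "X"),
# output uses UCSC-style names ("chr1", "chrX").
HUMAN_CHROMS = {str(i) for i in range(1, 23)} | {"X", "Y"}


def convert_peaks(lines, keep=HUMAN_CHROMS, score_col=4):
    """Filter peak records, add the 'chr' prefix and rescale scores to 0-1000."""
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] in keep:  # drop scaffolds, MT, etc.
            records.append(fields)
    if not records:
        return []
    scores = [float(rec[score_col]) for rec in records]
    low, high = min(scores), max(scores)
    span = (high - low) or 1.0  # avoid division by zero if all scores are equal
    converted = []
    for rec, score in zip(records, scores):
        rec = list(rec)
        rec[0] = "chr" + rec[0]
        # Assumed rescaling: linear min-max mapping onto the integer range 0-1000
        rec[score_col] = str(int(round(1000 * (score - low) / span)))
        converted.append("\t".join(rec))
    return converted


example = [
    "1\t100\t200\tpeak1\t50\t.\n",
    "GL000191.1\t5\t10\tpeak2\t999\t.\n",
    "X\t10\t20\tpeak3\t150\t.\n",
    "2\t30\t40\tpeak4\t100\t.\n",
]
for row in convert_peaks(example):
    print(row)
```

The peak names are left untouched, which matches the note that they still refer to the original file. For mouse data, a set covering 1-19 plus X and Y would be passed as `keep`.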
+ + + + CHP_peaks + narrowPeak + collection + Standard output of MACS2 in ENCODE narrowPeak format + + + CHP_peaks + broadPeak + collection + Standard output of MACS2 in ENCODE broadPeak format + + + DEEP_bigwig + bigwig + collection + Any bigwig output of a standardized DEEP process + + + + + chrom_sizes + table + single + File holding information on chromosome sizes for the UCSC assembly (e.g. hg19, mm10) + + + field_names + AutoSQL + collection + Field_names is a folder containing files in AutoSQL format necessary for conversion of narrowPeak and broadPeak format into bigbed + + + + + THB_peaks + bigbed + collection + Converted peak files + + + THB_bigwig + bigwig + collection + Converted bigwig files + + + + + trackhub_conv.py + 0.1 + + CHP_peaks, DEEP_bigwig + Simple Python3 script to handle the batch conversion of files + + + diff --git a/docs/misc/THBv2.xml b/docs/misc/THBv2.xml new file mode 100644 index 0000000..1770803 --- /dev/null +++ b/docs/misc/THBv2.xml @@ -0,0 +1,125 @@ + + + + THB + 2 + + Peter Ebert + pebert@mpi-inf.mpg.de + + + This process merely describes the conversion - not production - of DEEP data files into an IHEC-compatible format. + If you have any questions about the actual data, please refer to the process XML files describing + the data production and contact the author named in the respective file. The trackhub conversion process + describes the conversion of standardized DEEP process output files into one of the BIG formats + needed to submit the data as an IHEC track hub. 
Since the reference assemblies used by IHEC are different + from the ones used by DEEP, the conversion consists of the following steps: + (i) filter data files for chromosomes 1-22 (hsa)/1-19 (mmu) and X, + (ii) add "chr" prefix to chromosome names and + (iii) for all BED or BED-like files, ensure that these represent a regular BED6+ file; in particular, + the "score" column is adjusted by default to be in the range 0-1000 (for details about the + formats used, please refer to https://genome.ucsc.edu/FAQ/FAQformat.html). + The adjustment works as follows: + select one meaningful column (e.g. coverage, signal enrichment or similar), bin the data according + to the gray shading schema used by the UCSC genome browser (see link above) and then assign fixed score + values according to the binning. + + + + DEEP_bigwig + bigWig + collection + bigWig output of a standardized DEEP process (libraries: histone, DNase, NOMe, WGBS; only raw/unfiltered signal tracks for histone and DNase libs) + + + DEEP_bed + BED or BED-like + collection + BED or BED-like output of a standardized DEEP process; comprises histone, DNase and NOMe peaks and expressed small/long RNAs + + + + + chrom_sizes + table + collection + Common files containing information about the chromosome sizes for the respective assemblies + + + field_names + AutoSQL + collection + AutoSQL files describing the different BED files: narrowPeak, broadPeak, gNOMePeak, snRNAexpr, longRNAexpr + + + + + DEEPID.PROC.DATE.bigBed + bigBed + collection + Converted BED or BED-like files + + + DEEPID.PROC.DATE.bigWig + bigWig + collection + Converted bigWig files + + + + + bigWigToBedGraph, egrep, sort, sed + 4, 2.12, 8.13, 4.2.1 + + temp_signal.bg ]]> + + DEEP_bigwig + Filter all signal tracks and add prefix, make sure that output is sorted (should be by construction) + + + bedGraphToBigWig + 4 + + + + temp_signal.bg + Create final signal tracks + + + egrep, sort, sed + 2.12, 8.13, 4.2.1 + + temp_region.bed ]]> + + 
DEEP_bed + Filter all uncompressed BED files and add prefix, make sure that output is sorted + + + gzip, egrep, sort, sed + 1.5, 2.12, 8.13, 4.2.1 + + temp_region.bed ]]> + + DEEP_bed (gzipped) + Filter all gzipped BED files and add prefix, make sure that output is sorted + + + python3, numpy + 3.2.3, 1.6.2 + + + + temp_region.bed + Python3 function to adjust score column is implemented as part of the pipeline code and executed for all BED files by default + + + bedToBigBed + 2.6 + + + + temp_region.bed + Create final region files. n==1 for snRNA; n==3 for NOMe and broad peaks; n==4 for narrow peaks and long RNAs + + + diff --git a/docs/misc/THBv3.xml b/docs/misc/THBv3.xml new file mode 100644 index 0000000..19ccf30 --- /dev/null +++ b/docs/misc/THBv3.xml @@ -0,0 +1,142 @@ + + + + THB + 3 + + Peter Ebert + pebert@mpi-inf.mpg.de + + + This process merely describes the conversion - not production - of DEEP data files into an IHEC-compatible format. + If you have any questions about the actual data, please refer to the process XML files describing + the data production and contact the author named in the respective file. The trackhub conversion process + describes the conversion of standardized DEEP process output files into one of the BIG formats + needed to submit the data as an IHEC track hub. Since the reference assemblies used by IHEC are different + from the ones used by DEEP, the conversion consists of the following steps: + (i) filter data files for chromosomes 1-22 (hsa)/1-19 (mmu) and X, + (ii) add "chr" prefix to chromosome names and + (iii) for all BED or BED-like files, ensure that these represent a regular BED6+ file; in particular, + the "score" column is adjusted by default to be in the range 0-1000 (for details about the + formats used, please refer to https://genome.ucsc.edu/FAQ/FAQformat.html). + The adjustment works as follows: + select one meaningful column (e.g. 
coverage, signal enrichment or similar), bin the data according + to the gray shading schema used by the UCSC genome browser (see link above) and then assign fixed score + values according to the binning. + Version 3 of the THB process also creates a mapping between filename and track property + (~ what does this data represent?) as required by the updated IHEC trackhub specification (JSON format). + + + + DEEP_signal + bigWig or bedGraph + collection + bigWig output of a standardized DEEP process (libraries: histone, DNase, NOMe, WGBS; only raw/unfiltered signal tracks for histone and DNase libs) + + + DEEP_bed + BED or BED-like + collection + BED or BED-like output of a standardized DEEP process; comprises histone, DNase and NOMe peaks and expressed small/long RNAs + + + + + chrom_sizes + table + collection + Common files containing information about the chromosome sizes for the respective assemblies + + + field_names + AutoSQL + collection + AutoSQL files describing the different BED files: narrowPeak, broadPeak, gNOMePeak, snRNAexpr, longRNAexpr + + + + + DEEPID.PROC.DATE.bigBed + bigBed + collection + Converted BED or BED-like files + + + DEEPID.PROC.DATE.bigWig + bigWig + collection + Converted bigWig files + + + DACID.PROC.DATE.prop.tsv + tab separated table + single + trackhub property mapping + + + + + bigWigToBedGraph, egrep, sort, sed + 4, 2.12, 8.13, 4.2.1 + + temp_signal.bg ]]> + + DEEP_bigwig + Filter all signal tracks and add prefix, make sure that output is sorted (should be by construction) + + + bedGraphToBigWig + 4 + + + + temp_signal.bg + Create final signal tracks + + + egrep, sort, sed + 2.12, 8.13, 4.2.1 + + temp_region.bed ]]> + + DEEP_bed + Filter all uncompressed BED files and add prefix, make sure that output is sorted + + + gzip, egrep, sort, sed + 1.5, 2.12, 8.13, 4.2.1 + + temp_region.bed ]]> + + DEEP_bed (gzipped) + Filter all gzipped BED files and add prefix, make sure that output is sorted + + + python3, 
1.6.2 + + + + temp_region.bed + Python3 function to adjust score column is implemented as part of the pipeline code and executed for all BED files by default + + + bedToBigBed + 2.6 + + + + temp_region.bed + Create final region files. N==1 for snRNA; N==3 for NOMe and broad peaks; N==4 for narrow peaks and long RNAs + + + python3 + 3.2.3 + + + + no looping + Python3 function to write the track property mapping (filename to property) to a text file + + +
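The last step, writing the filename-to-property mapping, might look roughly like the sketch below. It is an illustration only and not the pipeline's own function: the name `write_property_mapping`, the header row, and the example file names and property labels are hypothetical placeholders, and the actual layout of the `.prop.tsv` file is not given in the process description.

```python
import csv
import io


def write_property_mapping(mapping, handle):
    """Write one 'filename<TAB>property' row per converted track file."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["filename", "property"])  # hypothetical header row
    for filename in sorted(mapping):  # sorted for reproducible output
        writer.writerow([filename, mapping[filename]])


# Placeholder entries; real names follow the DEEPID.PROC.DATE pattern.
tracks = {
    "sampleB.PROC.DATE.bigWig": "signal_unstranded",
    "sampleA.PROC.DATE.bigBed": "peak_calls",
}
buffer = io.StringIO()
write_property_mapping(tracks, buffer)
print(buffer.getvalue(), end="")
```

Writing to a file handle instead of a path keeps the function easy to test. In the pipeline one would presumably pass an opened `.prop.tsv` file.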