-
Notifications
You must be signed in to change notification settings - Fork 1
Home
To run the snakemake pipeline, we need to set up a configfile containing information on the input data and parameters for the analysis in yaml-format. Motifs can be downloaded from JASPAR first. The required keys/value pairs are:
data:
Bcell: [data/Bcell_chr4.bam] #list of .bam-files
Tcell: [data/Tcell_chr4_1.bam, data/Tcell_chr4_2.bam] #list of .bam-files
run_info:
organism: human #mouse/human
fasta: data/genome_chr4.fa.gz #.fasta-file containing organism genome
blacklist: data/blacklist_chr4.bed #.bed-file containing blacklisted regions
gtf: data/genes_chr4.gtf #.gtf-file for annotation of peaks
motifs: data/individual_motifs #directory containing motifs (single files in MEME/JASPAR/PFM format)
output: test_output #output directory
Additionally, it is possible to set parameters for single methods, which will be appended to the command-line call for the given tool. Examples are:
macs: "--nomodel --shift -100 --extsize 200 --broad"
atacorrect: ""
footprinting: ""
bindetect: "--prefix immune"
If the additional key/values for tools are not set, the default is always "" (empty string).
After setting all necessary paths in the configfile, Snakemake can be run using:
$ conda activate TOBIAS_ENV
$ snakemake --configfile TOBIAS_example.config --cores [number of cores] --keep-going
The pipeline extracts all needed information from its config file. In the preprocessing part, peaks are called for each sample. To ensure that a condition comparison is possible, these peaks are merged across conditions. Afterwards, peaks are annotated using UROPA. Additionally, the condition .bam files are merged to be used in the subsequent analysis. In the analysis part, the condition .bam file(s) are be corrected by ATACorrect and investigated for binding affinity by FootprintScores. Using the PWM motifs as specified in the config, motif occurrences are determined and assigned with scores related to the FootprintScores by BINDetect. Beside textual results, the pipeline creates various visualizations such as aggregated footprint plots for comparison within and between conditions.
The pipeline run will create a folder-structure containing results from each part of the analysis, with the upper folder being the output_folder given in the configfile:
- mapping/
- Joined bamfile (per condition)
- motifs/
- all_motifs.txt: All input motifs joined to pfm format
- peak_calling/
- Peak calling results for each replicate + the joined peak sets
- peak_annotation/
- Annotation of merged peaks using UROPA.
- bias_correction/
- output from TOBIAS ATACorrect (per condition)
- footprinting/
- Output .bigwigs from TOBIAS FootprintScores (per condition)
- TFBS/
- Output from TOBIAS BINDetect
- overview/
- Overview files (such as joined pdf files and bindetect overview)
- logs/
- Individual logfiles for each tool
This directory includes results of the annotation with UROPA. Besides the original output files of the annotation tool, there is the all_merged_annotated.bed file. This is the processed output of UROPAs finalhits, including only a subset of the original columns.
This directory includes results of TOBIAS BINDetect, PlotAggregate and PlotHeatmap. An overview in .txt and .xlsx format, e.g. NFATC2_MA0152.1_overview.txt, is located in the top directory of each TF. All overview tables are build up like this:
TFBS_chr | TFBS_start | TFBS_end | TFBS_name | TFBS_score | TFBS_strand | peak_chr | peak_start | peak_end | peak_id | feature | feat_start | feat_end | feat_strand | feat_anchor | distance | gene_biotype | gene_id | gene_name | Tcell_score | Bcell_score | Tcell_bound | Bcell_bound | Tcell_Bcell_log2fc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
chr4 | 49511177 | 49511184 | NFATC2_MA0152.1 | 8,51223 | - | chr4 | 49510270 | 49511264 | Tcell,Bcell | gene | 49507764 | 49508806 | - | start | 1961 | processed_pseudogene | ENSG00000250769 | AC119751.5 | 18,09185 | 1,89763 | 1 | 1 | 3,16718 |
chr4 | 99507937 | 99507944 | NFATC2_MA0152.1 | 8,51223 | - | chr4 | 99507736 | 99508445 | Tcell,Bcell | gene | 99511004 | 99542303 | + | start | 2914 | protein_coding | ENSG00000138813 | C4orf17 | 11,99971 | 1,27645 | 1 | 1 | 3,10753 |
chr4 | 40114339 | 40114346 | NFATC2_MA0152.1 | 8,51223 | + | chr4 | 40114197 | 40114397 | Tcell | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1,75915 | 0,08972 | 1 | 0 | 3,09833 |
chr4 | 120549506 | 120549513 | NFATC2_MA0152.1 | 8,51223 | - | chr4 | 120549407 | 120549732 | Tcell | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1,77557 | 0,10178 | 1 | 0 | 3,03405 |
chr4 | 122456529 | 122456536 | NFATC2_MA0152.1 | 8,51223 | - | chr4 | 122456311 | 122456554 | Tcell | gene | 122451470 | 122456725 | - | start | 293 | protein_coding | ENSG00000109471 | IL2 | 3,31647 | 0,29002 | 1 | 0 | 3,03364 |
The first six columns represent TFBScan results including the genomic location information, the name consisting of the TF name and id, the sequence match score and the strand on which the match was found. The middle columns represent the peak annotation information including their genomic location and annotation information, e.g. gene name, id and biotype. Sites without annotation are included but denoted with "NA". The last columns (dependent on the number of conditions) represent footprinting information. For the example data set, there were two conditions. Thus, there are five columns including one footprint score and one bound information for each condition, and one log2 fold change column. In the bound columns, a 1 indicates that this binding site was identified as bound for this condition. The log2 fold change is calculated in the direction as the conditions are specified in the config. In the example config, the order of conditions was [Tcell, Bcell], and therefore, a positive fold change indicates an increase in activity towards Tcell and vice versa.
Further textual and visual results are partitioned into a beds and plots sub directories. The beds directory includes the individual transcription factor binding sites separated for each condition to bound and unbound sites. As an example, given the example config file, the /bed/-directory for the transcription factor NFATC2 has the following files:
- NFATC2_MA0152.1_all.bed - all binding sites within peaks
- NFATC2_MA0152.1_Bcell_bound.bed - bound NFATC2 sites in Bcell condition
- NFATC2_MA0152.1_Bcell_unbound.bed - unbound NFATC2 sites in Bcell condition
- NFATC2_MA0152.1_Tcell_bound.bed - bound NFATC2 sites in Tcell condition
- NFATC2_MA0152.1_Tcell_unbound.bed - bound NFATC2 sites in Tcell condition
These files are all in standard .bed-format and can be visualized with any genome browser, like IGV. But the format is similar to the overview files. The _all.bed file represents the TFBScan results plus the peak annotation details. The bound/unbound files show additionally the binding score of the represented condition.
The plots directory includes the PlotAggregate and PlotHeatmap results. There are two major visualizations per TF, a heatmap and an aggregated signal plot. The heatmap plot shows the signal of all TFBS and the aggregated plot summarizes these observations. If there are multiple conditions, there are additionally comparison plots for both, the heatmaps and the aggregated plots. Example files from this folder are:
- NFATC2_MA0152.1_aggregate_comparison_all.pdf - Aggregated signal across all sites for all conditions
- NFATC2_MA0152.1_aggregate_comparison_bound.pdf - Aggregated signal across bound sites for all conditions
- NFATC2_MA0152.1_heatmap_comparison.pdf - Heatmaps of signals across bound/unbound sites for all conditions
- NFATC2_MA0152.1_Tcell_aggregate.pdf - Aggregated signals across all/bound/unbound for uncorrected/expected/corrected signals within the T cell condition
- NFATC2_MA0152.1_Bcell_aggregate.pdf - Aggregated signals across all/bound/unbound for uncorrected/expected/corrected signals within the B cell condition