Home

Configfile

To run the snakemake pipeline, we need to set up a configfile containing information on the input data and parameters for the analysis in yaml-format. Motifs can be downloaded from JASPAR first. The required keys/value pairs are:

data:
  Bcell: [data/Bcell_chr4.bam]                                   #list of .bam-files
  Tcell: [data/Tcell_chr4_1.bam, data/Tcell_chr4_2.bam]  #list of .bam-files
  
run_info:
  organism: human                        #mouse/human
  fasta: data/genome_chr4.fa.gz          #.fasta-file containing organism genome
  blacklist: data/blacklist_chr4.bed     #.bed-file containing blacklisted regions
  gtf: data/genes_chr4.gtf               #.gtf-file for annotation of peaks
  motifs: data/individual_motifs         #directory containing motifs (single files in MEME/JASPAR/PFM format)
  output: test_output                    #output directory

Additionally, it is possible to set parameters for single methods, which will be appended to the command-line call for the given tool. Examples are:

macs: "--nomodel --shift -100 --extsize 200 --broad"
atacorrect: ""
footprinting: ""
bindetect: "--prefix immune"

If the additional key/values for tools are not set, the default is always "" (empty string).

Running the pipeline

After setting all necessary paths in the configfile, Snakemake can be run using:

$ conda activate TOBIAS_ENV
$ snakemake --configfile TOBIAS_example.config --cores [number of cores] --keep-going

Pipeline workflow

The pipeline extracts all needed information from its config file. In the preprocessing part, peaks are called for each sample. To ensure that a condition comparison is possible, these peaks are merged across conditions. Afterwards, peaks are annotated using UROPA. Additionally, the condition .bam files are merged to be used in the subsequent analysis. In the analysis part, the condition .bam file(s) are be corrected by ATACorrect and investigated for binding affinity by FootprintScores. Using the PWM motifs as specified in the config, motif occurrences are determined and assigned with scores related to the FootprintScores by BINDetect. Beside textual results, the pipeline creates various visualizations such as aggregated footprint plots for comparison within and between conditions.

Output

The pipeline run will create a folder-structure containing results from each part of the analysis, with the upper folder being the output_folder given in the configfile:

<Output_folder>/

mapping/
- Joined bamfile (per condition)
motifs/
- all_motifs.txt: All input motifs joined to pfm format
peak_calling/
- Peak calling results for each replicate + the joined peak sets
peak_annotation/
- Annotation of merged peaks using UROPA.
bias_correction/
- output from TOBIAS ATACorrect (per condition)
footprinting/
- Output .bigwigs from TOBIAS FootprintScores (per condition)
TFBS/
- Output from TOBIAS BINDetect
overview/
- Overview files (such as joined pdf files and bindetect overview)
logs/
- Individual logfiles for each tool

Detailed output description

peak_annotation

This directory includes results of the annotation with UROPA. Besides the original output files of the annotation tool, there is the all_merged_annotated.bed file. This is the processed output of UROPAs finalhits, including only a subset of the original columns.

TFBS directory

This directory includes results of TOBIAS BINDetect, PlotAggregate and PlotHeatmap. An overview in .txt and .xlsx format, e.g. NFATC2_MA0152.1_overview.txt, is located in the top directory of each TF. All overview tables are build up like this:

TFBS_chr	TFBS_start	TFBS_end	TFBS_name	TFBS_score	TFBS_strand	peak_chr	peak_start	peak_end	peak_id	feature	feat_start	feat_end	feat_strand	feat_anchor	distance	gene_biotype	gene_id	gene_name	Tcell_score	Bcell_score	Tcell_bound	Bcell_bound	Tcell_Bcell_log2fc
chr4	49511177	49511184	NFATC2_MA0152.1	8,51223	-	chr4	49510270	49511264	Tcell,Bcell	gene	49507764	49508806	-	start	1961	processed_pseudogene	ENSG00000250769	AC119751.5	18,09185	1,89763	1	1	3,16718
chr4	99507937	99507944	NFATC2_MA0152.1	8,51223	-	chr4	99507736	99508445	Tcell,Bcell	gene	99511004	99542303	+	start	2914	protein_coding	ENSG00000138813	C4orf17	11,99971	1,27645	1	1	3,10753
chr4	40114339	40114346	NFATC2_MA0152.1	8,51223	+	chr4	40114197	40114397	Tcell	NA	NA	NA	NA	NA	NA	NA	NA	NA	1,75915	0,08972	1	0	3,09833
chr4	120549506	120549513	NFATC2_MA0152.1	8,51223	-	chr4	120549407	120549732	Tcell	NA	NA	NA	NA	NA	NA	NA	NA	NA	1,77557	0,10178	1	0	3,03405
chr4	122456529	122456536	NFATC2_MA0152.1	8,51223	-	chr4	122456311	122456554	Tcell	gene	122451470	122456725	-	start	293	protein_coding	ENSG00000109471	IL2	3,31647	0,29002	1	0	3,03364

The first six columns represent TFBScan results including the genomic location information, the name consisting of the TF name and id, the sequence match score and the strand on which the match was found. The middle columns represent the peak annotation information including their genomic location and annotation information, e.g. gene name, id and biotype. Sites without annotation are included but denoted with "NA". The last columns (dependent on the number of conditions) represent footprinting information. For the example data set, there were two conditions. Thus, there are five columns including one footprint score and one bound information for each condition, and one log2 fold change column. In the bound columns, a 1 indicates that this binding site was identified as bound for this condition. The log2 fold change is calculated in the direction as the conditions are specified in the config. In the example config, the order of conditions was [Tcell, Bcell], and therefore, a positive fold change indicates an increase in activity towards Tcell and vice versa.

Further textual and visual results are partitioned into a beds and plots sub directories. The beds directory includes the individual transcription factor binding sites separated for each condition to bound and unbound sites. As an example, given the example config file, the /bed/-directory for the transcription factor NFATC2 has the following files:

NFATC2_MA0152.1_all.bed - all binding sites within peaks
NFATC2_MA0152.1_Bcell_bound.bed - bound NFATC2 sites in Bcell condition
NFATC2_MA0152.1_Bcell_unbound.bed - unbound NFATC2 sites in Bcell condition
NFATC2_MA0152.1_Tcell_bound.bed - bound NFATC2 sites in Tcell condition
NFATC2_MA0152.1_Tcell_unbound.bed - bound NFATC2 sites in Tcell condition

These files are all in standard .bed-format and can be visualized with any genome browser, like IGV. But the format is similar to the overview files. The _all.bed file represents the TFBScan results plus the peak annotation details. The bound/unbound files show additionally the binding score of the represented condition.

The plots directory includes the PlotAggregate and PlotHeatmap results. There are two major visualizations per TF, a heatmap and an aggregated signal plot. The heatmap plot shows the signal of all TFBS and the aggregated plot summarizes these observations. If there are multiple conditions, there are additionally comparison plots for both, the heatmaps and the aggregated plots. Example files from this folder are:

NFATC2_MA0152.1_aggregate_comparison_all.pdf - Aggregated signal across all sites for all conditions
NFATC2_MA0152.1_aggregate_comparison_bound.pdf - Aggregated signal across bound sites for all conditions
NFATC2_MA0152.1_heatmap_comparison.pdf - Heatmaps of signals across bound/unbound sites for all conditions
NFATC2_MA0152.1_Tcell_aggregate.pdf - Aggregated signals across all/bound/unbound for uncorrected/expected/corrected signals within the T cell condition
NFATC2_MA0152.1_Bcell_aggregate.pdf - Aggregated signals across all/bound/unbound for uncorrected/expected/corrected signals within the B cell condition