How it Works

1.1 Peak calling

The script uses the uncontinuous score from a bigWig file to estimate footprints within the peaks of interest. A sliding window algorithm is used to look through the peaks. A region with significant high signal will be saved as a footprint. To estimate the significance of a footprint the signal is compared with the mean of the signals within the window. After the run the footprints are saved in a .bed file.

Required input

There are two required input files for the peak calling. They are:

--bigwig a bigWig file with the signal;
--bed a corresponding .bed file with peaks of interest.

Output files

The user can set the name for the output .bed file using the parameter --output_file. By default the output file is called call_peaks_output.bed and is saved to the working directory. For each footprint an individual score is found. The score shows the quality of the footprint and is found as the mean of the signals from bigWig file for the corresponding positions. The output .bed file has 8 columns:

chr for the chromosom;
start for the start position of the footprint;
end for the end position of the footprint;
name for the unique name of the footprint;
score for the quality of the footprint;
len for the length of the footprint;
max_pos for the position where the highest signal from the bigWig file is (if there are several positions with the highest score the position in the middle of those will be set as the max_pos);
bonus_info for the additional information from the input .bed file.

During the run, a log-file is written. The log-file is called call_peaks_log.log and is saved to the working directory as well. If the user does not want to receive the messages about the run in the terminal, the parameter --silent can be used to force the script write the messages only to the log-file.

Optional parameters

There are also optional parameters that could be set to refine the search:

--window_length. This parameter sets the length of a sliding window. By default the length 200 bp was taken as the smallest peak of the test data was 200 bp long.
--step. This parameter sets the number of positions to slide the window forward. By default the step of a length 100 bp was taken.
--percentage. By default each signal from the bigWig file is compared to the threshold which is the mean of signals within a window (or within a peak, if it is smaller than the chosen window_length). Though the user has a possibility to move this threshold using the parameter --percentage. For example --percentage 10 will add 10% of the found threshold within a window and set it as a new threshold to compare the signal to. The default percentage is set to 0.

Changing the optional parameters can lead to varying the number of found footprints and their length.

1.2 Motif Filtering

Motivation

Footprints detected in the peak calling step 1.1 can belong to already annotated or to new, unknown motifs. This step (1.2) removes sequences in the footprints that are present in a set of known motifs. The calculation of the filtering is done with the bedtools commands intersect and subtract. First, non overlapping sequences between the footprints and the known motifs pass the filter. Second, no noverlapping parts of the overlapping sequences also pass the filter, if they are longer than a minimum length. The known motifs can either be calculated within this pipeline or the path to the directory where the known motifs bed-files are stored can be submitted.

Required Input

The 3 required input parameters are:

Path to the fasta genome
Path to the bed file from output of 1.1 (8 column bed-file required)
Path to the directory of known motifs bed-file

Optional Parameter

min size of sequences (default is 10). Don't set this to 0 since very small sequences can be produced with bedtools subtract
max size of sequences (default is 80). Removes sequences greater than max. Can be set even higher since the clustering in 2.2 uses a k-mer approach which can basicly handle any length of sequences.
workdir is by default the current directory.
output path/file is by default at ./NewMotifs.bed and ./NewMotifs.bed.fasta

The ouput bed-file contains the same information as the output file from 1.1 but with 2 added columns:

a flag with possible values 1 or 0 which say if the point of maximum score of the footprint has no overlap(1) or has overlap(0) with the known motifs.
the DNA sequence

2.1 Clustering

2.2 Motif Estimation

BED_to_FASTA

The process BED_to_FASTA calls the R-script bed_to_fasta.R, which extracts the sequences from a given BED-file and creates a FASTA-file per cluster. The cluster is given by an additional column in the BED-file.

GLAM2

The process GLAM2 uses the FASTA-files created from BED_to_FASTA as input and calls GLAM2. GLAM2 is a tool provided inside the MEME Suite. "GLAM2 is a program for finding motifs in sequences, typically amino-acid or nucleotide sequences. The main innovation of GLAM2 is that it allows insertions and deletions in motifs." GLAM2 uses Dirichlet processes to align the sequences and find Motifs inside the sequences. The output is a MEME-file for each cluster.

Motig clustering

Because the previous done clustering step does not create perfekt clusters. The motifs created from GLAM2 are clustered. To cluster the motifs all MEME-files are merged with meme2meme, a tool provided inside the MEME Suite. The merged MEME-file is compared with itself by Tomtom. Tomtom is a tool provided inside the MEME Suite and "compares one or more motifs against a database of motifs (e.g., JASPAR). Tomtom will rank the motifs in the database and produce an alignment for each significant match". The similarity between the motifs gives the information about the cluster similarity.

From the TSV-file created by Tomtom, an edgelist is build.

Columns of the TSV-file:

Query_ID	Target_ID	Optimal_offset	p-value	E-value	q-value	Overlap	Query_consensus	Target_consensus	Orientation

The edgelist is transformed to an weighted adjacency matrix. The Graph shows clusters of similar clusters.

The FASTA-files from those clustered clusters are merged and given to GLAM2 to find the clustered motifs.

Tomtom

The process Tomtom gets the MEME-files from the Motif clustering and a database with known motifs as input. It compares the Motifs to an Motif-database with Tomtom.

filter_Motif

filter_Motifs is the last step of the motif estimation. It filters the motifs found by GLAM2 on the already known motifs found by Tomtom. All known motifs are removed from the channel of motifs. The other motifs are given to Part 3.2. The filtering step is implemented with groovy inside the nextflow skript. If no unknonw motifs are found, nextflow runs a process called check_for_unknown_motifs. The process stops the script with an message for the user.