De novo motif discovery and evaluation based on footprints identified by TOBIAS.
For further information read the documentation.
- Start by installing all dependencies listed above (Nextflow, conda, MEME-Suite) and downloading all files from the GitHub repository.
- The environment paths for MEME-Suite must be set. This can be done with the following commands:
export PATH=[meme-suite installation path]/libexec/meme-[meme-suite version]:$PATH
export PATH=[meme-suite installation path]/bin:$PATH
- Every other dependency is installed automatically by conda. To do so, create and activate a conda environment from the yaml-file provided in this repository before running the pipeline:
conda env create -f masterenv.yml
conda activate masterenv
Important Note: For conda the channel bioconda needs to be set as highest priority! This is required due to two different packages with the same name in different channels. For the pipeline the package jellyfish from the channel bioconda is needed and NOT the jellyfish package from the channel conda-forge!
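Before the first run, it can help to confirm that the setup steps above took effect. The following is a minimal sketch and not part of the pipeline: `missing_tools` and `top_channel` are hypothetical helper names, and checking for `glam2` and `tomtom` assumes these are the MEME-Suite executables that need to be on the PATH. `conda config --show channels` lists the channels in priority order, and `conda config --add channels bioconda` places bioconda at the top of that list.

```shell
# Print every given tool name that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Read `conda config --show channels` output on stdin and print the
# first listed channel, i.e. the one with the highest priority.
top_channel() {
  awk '$1 == "-" { print $2; exit }'
}

# Assumption: glam2 and tomtom are the MEME-Suite tools needed on PATH.
missing=$(missing_tools glam2 tomtom)
if [ -n "$missing" ]; then
  echo "Not on PATH (check the export lines):" $missing
fi

# Only inspect the channel order if conda is installed; nothing is modified.
if command -v conda >/dev/null 2>&1; then
  first=$(conda config --show channels | top_channel)
  if [ "$first" != "bioconda" ]; then
    echo "Top channel is '$first'; consider: conda config --add channels bioconda"
  fi
fi
```

The check only reports problems and leaves the conda configuration unchanged, so it is safe to run repeatedly.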
nextflow run pipeline.nf --bigwig [BigWig-file] --bed [BED-file] --genome_fasta [FASTA-file] --motif_db [MEME-file] --config [UROPA-config-file]
For a detailed overview of all parameters, follow this link.
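A concrete call might look like the sketch below. All file names are hypothetical placeholders, and `check_inputs` is an illustrative helper, not something the pipeline provides. It only verifies that each input is a readable file before Nextflow is started, so that a mistyped path fails immediately rather than partway through a run.

```shell
# Return 0 only if every given path is a readable regular file;
# report the first one that is not and return 1.
check_inputs() {
  for f in "$@"; do
    if [ ! -f "$f" ] || [ ! -r "$f" ]; then
      echo "Missing or unreadable input: $f" >&2
      return 1
    fi
  done
}

# Hypothetical input files; replace with your own paths.
BIGWIG=footprints.bw
BED=peaks.bed
GENOME=genome.fa
MOTIF_DB=motifs.meme
UROPA_CONFIG=uropa_config.json

# Start the pipeline only when all inputs are present.
if check_inputs "$BIGWIG" "$BED" "$GENOME" "$MOTIF_DB" "$UROPA_CONFIG"; then
  nextflow run pipeline.nf --bigwig "$BIGWIG" --bed "$BED" \
    --genome_fasta "$GENOME" --motif_db "$MOTIF_DB" --config "$UROPA_CONFIG"
fi
```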
Required arguments:
--bigwig Path to BigWig-file
--bed Path to BED-file
--genome_fasta Path to genome in FASTA-format
--motif_db Path to motif-database in MEME-format
--config Path to UROPA configuration file
--organism Input organism [hg38 | hg19 | mm9 | mm10]
--out Output Directory (Default: './out/')
Optional arguments:
--help [0|1] 1 to show this help message. (Default: 0)
--tfbs_path Path to directory with output from tfbsscan. If given, tfbsscan will not be run.
--create_known_tfbs_path Path to directory where output from tfbsscan (known motifs) is stored. Path can be set as tfbs_path in next run. (Default: './')
--gtf_path Path to gtf-file. If path is set, the process which creates a gtf-file is skipped.
Footprint extraction:
--window_length INT This parameter sets the length of a sliding window. (Default: 200)
--step INT This parameter sets the number of positions to slide the window forward. (Default: 100)
--percentage INT Threshold in percent (Default: 0)
Filter unknown motifs:
--min_size_fp INT Minimum sequence length threshold. Smaller sequences are discarded. (Default: 10)
--max_size_fp INT Maximum sequence length threshold. Discards all sequences longer than this value. (Default: 100)
--tfbsscan_method [moods|fimo] Method used by tfbsscan. (Default: moods)
Clustering:
Sequence preparation/ reduction:
--kmer INT K-mer length (Default: 10)
--aprox_motif_len INT Motif length (Default: 10)
--motif_occurence FLOAT Percentage of motifs over all sequences. Use 1 (Default) to assume every sequence contains a motif.
--min_seq_length INT Remove all sequences below this value. (Default: 10)
Clustering:
--global INT Global (=1) or local (=0) alignment. (Default: 0)
--identity FLOAT Identity threshold. (Default: 0.8)
--sequence_coverage INT Minimum aligned nucleotides on both sequences. (Default: 8)
--memory INT Memory limit in MB. 0 for unlimited. (Default: 800)
--throw_away_seq INT Remove all sequences equal to or below this length before clustering. (Default: 9)
--strand INT Align +/+ & +/- (= 1). Or align only +/+ (= 0). (Default: 0)
Motif estimation:
--min_seq INT Sets the minimum number of sequences required for the FASTA-files given to GLAM2. (Default: 100)
--motif_min_key INT Minimum number of key positions (aligned columns) in the alignment done by GLAM2. (Default: 8)
--motif_max_key INT Maximum number of key positions (aligned columns) in the alignment done by GLAM2. (Default: 20)
--iteration INT Number of iterations done by GLAM2. More Iterations: better results, higher runtime. (Default: 10000)
--tomtom_treshold FLOAT Threshold for similarity score. (Default: 0.01)
--best_motif INT Get the best X motifs per cluster. (Default: 3)
Motif clustering:
--cluster_motif Boolean If 1, the pipeline clusters motifs; if 0, it does not. (Default: 0)
--edge_weight INT Minimum weight of edges in motif-cluster-graph (Default: 5)
--motif_similarity_thresh FLOAT Threshold for motif similarity score (Default: 0.00001)
Creating GTF:
--tissues List/String List of one or more keywords for tissue-/category-activity. Categories must be specified as in the JSON config.
All arguments can be set in the configuration files
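Independently of the repository's own configuration files, Nextflow can also read parameters from a YAML or JSON file through its `-params-file` option. The sketch below writes such a file with a few of the parameters listed above, set to their documented defaults. The file name `params.yml` is a hypothetical choice, and whether this pipeline accepts every parameter this way has not been verified here.

```shell
# Write a small YAML parameter file (hypothetical name: params.yml).
# Keys are the parameter names without the leading "--".
cat > params.yml <<'EOF'
window_length: 200
step: 100
tfbsscan_method: "moods"
EOF

# Pass the file with Nextflow's -params-file option (single dash), e.g.:
# nextflow run pipeline.nf -params-file params.yml --bigwig [BigWig-file] ...
```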
For further information read the documentation.
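To make the footprint extraction and size filter parameters listed above more concrete, here is a small illustration of what `--window_length`, `--step`, `--min_size_fp` and `--max_size_fp` describe. This is a sketch under assumptions, not the pipeline's code: `windows` and `filter_by_size` are hypothetical helpers, windows are assumed to start at position 0 and to be emitted only when they fit completely inside the region, and the size limits are treated as inclusive.

```shell
# Print "start end" for each window of length $2, moved forward by $3,
# over a region of length $1. Assumption: partial windows are dropped.
windows() {
  awk -v len="$1" -v win="$2" -v step="$3" 'BEGIN {
    if (step <= 0) exit 1   # avoid an endless loop
    for (s = 0; s + win <= len; s += step) print s, s + win
  }'
}

# Keep BED lines (stdin) whose length (end - start) lies within [min, max].
filter_by_size() {
  awk -v min="$1" -v max="$2" '{ n = $3 - $2; if (n >= min && n <= max) print }'
}

# A region of 500 positions with the default window length and step.
windows 500 200 100

# Three example intervals of length 5, 30 and 150 with the default limits.
printf 'chr1\t100\t105\nchr1\t200\t230\nchr1\t300\t450\n' | filter_by_size 10 100
```

With the defaults, consecutive windows overlap by half their length, and only the 30 bp interval passes the size filter.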
For unknown reasons, the tool MOODS, which is called by tfbsscan, occasionally returns empty BED files. The problem probably lies in the function pfm_to_log_odds. If MOODS fails in this way, you will see the following error message:
ERROR
All motiffiles have less than 2 lines!
Fix motiffiles and try again.
There is no known fix so far. As a workaround, either restart the pipeline after a few hours with the same parameters, or set the parameter tfbsscan_method to fimo, which forces tfbsscan to use FIMO. This method takes longer but is not known to produce empty BED files.
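The second workaround can be automated with a small wrapper. This is a minimal sketch with a hypothetical helper name, `run_with_fallback`, and it assumes that the pipeline exits with a non-zero status when the error above occurs. Note that it retries with FIMO after any failure of the first run, not only after the empty-file error.

```shell
# Run the given command with MOODS first; if it exits non-zero,
# run it again with FIMO. "$@" is the command without --tfbsscan_method.
run_with_fallback() {
  "$@" --tfbsscan_method moods && return 0
  echo "Run with moods failed, retrying with fimo" >&2
  "$@" --tfbsscan_method fimo
}

# Example call (placeholders as in the usage line):
# run_with_fallback nextflow run pipeline.nf --bigwig [BigWig-file] --bed [BED-file] ...
```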