De novo motif discovery and evaluation based on footprints identified by TOBIAS
Start by installing all dependencies listed above. The environment paths for meme-suite must be set, which can be done with the following commands:
export PATH=[meme-suite installation path]/libexec/meme-[meme-suite version]:$PATH
export PATH=[meme-suite installation path]/bin:$PATH
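To confirm that the exports took effect, a quick check along the following lines may help. This is a minimal sketch: `missing_tools` is a hypothetical helper, and the tool list (`glam2`, `tomtom`) is an assumption based on the programs named in the optional arguments of this README.

```shell
# Hypothetical helper: print every given command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# glam2 and tomtom are the MEME Suite programs mentioned in this README;
# adjust the list to match your installation.
missing=$(missing_tools glam2 tomtom)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing >&2
else
    echo "MEME Suite tools found."
fi
```

If a tool is reported as missing, re-check the installation path and version used in the export commands.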
Download all files from the GitHub repository. The Nextflow script needs a conda environment to run. Nextflow can create the needed environment from the given YAML file. On some systems Nextflow exits the run with the following error:
Caused by:
Failed to create Conda environment
command: conda env create --prefix --file env.yml
status : 143
message:
If this error occurs, you have to create the environment before starting the pipeline. To create it, you need the yml-file from the repository. Run the following commands to create the environment:
path=[Path to given masterenv.yml file]
conda env create --name masterenv -f=$path
Once the environment is created, set the variable 'path_env' in the configuration file to the path of the environment.
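If the location of the new environment is not known, it can be looked up from the output of `conda env list`. The following is a sketch only: `env_path` is a hypothetical helper, and it assumes the environment path contains no spaces.

```shell
# Hypothetical helper: read `conda env list` output on stdin and print
# the path (last column) of the environment with the given name.
env_path() {
    awk -v name="$1" '$1 == name { print $NF }'
}

# Only query conda if it is available on this system.
if command -v conda >/dev/null 2>&1; then
    conda env list | env_path masterenv
fi
```

The printed path is the value to use for 'path_env'.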
nextflow run pipeline.nf --input [BigWig-file] --bed [BED-file] --genome_fasta [FASTA-file] --jaspar_db [MEME-file]
Required arguments:
--input Path to BigWig-file
--bed Path to BED-file
--genome_fasta Path to genome in FASTA-format
--jaspar_db Path to motif-database in MEME-format
Optional arguments:
Footprint extraction:
--window_length INT (Default: 200)
--step INT (Default: 100)
--percentage INT (Default: 0)
Filter unknown motifs:
--min_size_fp INT (Default: 10)
--max_size_fp INT (Default: 100)
Cluster:
Sequence preparation/reduction:
--kmer INT Kmer length (Default: 10)
--aprox_motif_len INT Motif length (Default: 10)
--motif_occurence FLOAT Percentage of motifs over all sequences. Use 1 (Default) to assume every sequence contains a motif.
--min_seq_length INT Remove all sequences below this value. (Default: 10)
Clustering:
--global INT Global (=1) or local (=0) alignment. (Default: 0)
--identity FLOAT Identity threshold. (Default: 0.8)
--sequence_coverage INT Minimum aligned nucleotides on both sequences. (Default: 8)
--memory INT Memory limit in MB. 0 for unlimited. (Default: 800)
--throw_away_seq INT Remove all sequences equal to or below this length before clustering. (Default: 9)
--strand INT Align +/+ & +/- (= 1). Or align only +/+ (= 0). (Default: 0)
Motif estimation:
--motif_min_len INT Minimum length of Motif (Default: 8)
--motif_max_len INT Maximum length of Motif (Default: 20)
--interation INT Number of iterations done by glam2. More iterations: better results, higher runtime. (Default: 10000)
--tomtom_treshold FLOAT Threshold for similarity score. (Default: 0.01)
Creating GTF:
--organism [homo_sapiens | mus_musculus]
--tissues
All arguments can be set in the configuration files.
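As an alternative to the configuration files, optional arguments can be passed on the command line. The wrapper below is a minimal sketch: the file names are placeholders, `require_file` is a hypothetical helper, and the override values (150, 50, 0.9) are illustrative, not recommendations.

```shell
#!/bin/sh
# Placeholder file names (assumptions): replace with your own data.
input="input.bw"
bed="regions.bed"
genome="genome.fa"
motifs="motifs.meme"

# Hypothetical helper: succeed only if the file exists and is non-empty.
require_file() {
    if [ ! -s "$1" ]; then
        echo "Missing or empty input: $1" >&2
        return 1
    fi
}

# Start the pipeline only when all four required inputs are present.
if require_file "$input" && require_file "$bed" &&
   require_file "$genome" && require_file "$motifs"; then
    # Override three optional defaults; all other arguments keep their defaults.
    nextflow run pipeline.nf \
        --input "$input" --bed "$bed" \
        --genome_fasta "$genome" --jaspar_db "$motifs" \
        --window_length 150 --step 50 --identity 0.9
fi
```

Checking the inputs first avoids starting a run that would stop at the first process for want of a file.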
For further information, read the documentation.