De novo motif discovery and evaluation based on footprints identified by TOBIAS
For further information read the documentation
Start by installing all dependencies listed above. The environment paths for the MEME Suite must be set, which can be done with the following commands:
export PATH=[meme-suite installation path]/libexec/meme-[meme-suite version]:$PATH
export PATH=[meme-suite installation path]/bin:$PATH
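To confirm that the exports took effect, a quick check like the following can help. It is only a sketch: the tool names glam2 and tomtom are taken from the parameter descriptions in this README, and the helper name missing_tools is made up for illustration.

```shell
# Print every command from the argument list that cannot be found on PATH.
# missing_tools is a hypothetical helper, not part of the pipeline.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Run after the two exports: empty output means both tools were found.
missing_tools glam2 tomtom
```

Any name that is printed is not reachable through the current PATH, which usually means one of the two export lines points to the wrong directory.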
Download all files from the GitHub repository. The Nextflow script needs a conda environment to run. Nextflow can create the needed environment from the given YAML file. On some systems, Nextflow exits the run with the following error:
Caused by:
Failed to create Conda environment
command: conda env create --prefix --file env.yml
status : 143
message:
If this error occurs, you have to create the environment before starting the pipeline. To create this environment, you need the yml-file from the repository. Run the following commands to create the environment:
env_file=[Path to given masterenv.yml file]
conda env create --name masterenv --file "$env_file"
Once the environment is created, set the variable 'path_env' in the configuration file to its path.
Important note: For conda, the channel bioconda needs to be set as the highest priority! This is required because two different packages with the same name exist in different channels. The pipeline needs the package jellyfish from the channel bioconda and NOT the jellyfish package from the channel conda-forge!
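One way to meet this requirement is sketched below. `conda config --add channels` puts a channel at the top of the list, so the channel added last ranks highest. The first_channel helper is hypothetical and assumes the list format printed by `conda config --show channels`.

```shell
# Run these once by hand; --add prepends, so the channel added last ranks highest:
#   conda config --add channels conda-forge
#   conda config --add channels bioconda

# Hypothetical helper: read a YAML-style channel list on stdin and
# print the first (highest-priority) entry.
first_channel() {
    awk '$1 == "-" { print $2; exit }'
}

# Usage, assuming conda is installed:
#   conda config --show channels | first_channel
```

If the helper prints anything other than bioconda, the jellyfish package may be resolved from the wrong channel.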
nextflow run pipeline.nf --bigwig [BigWig-file] --bed [BED-file] --genome_fasta [FASTA-file] --motif_db [MEME-file]
For a detailed overview of all parameters, follow this link.
Required arguments:
--bigwig Path to BigWig-file with scores on the peaks of interest
--bed Path to BED-file with peaks of interest corresponding to the BigWig file
--genome_fasta Path to genome in FASTA-format
--motif_db Path to motif-database in MEME-format
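Because all four inputs are required, checking that the files exist before launching can save a failed run. The following is a minimal sketch; require_files is a hypothetical helper and the file names in the usage lines are placeholders.

```shell
# Hypothetical pre-flight check: report each missing input file on stderr
# and return non-zero if any file is missing.
require_files() {
    status=0
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            printf 'missing input: %s\n' "$f" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage (placeholder file names):
# require_files scores.bw peaks.bed genome.fa motifs.meme &&
#     nextflow run pipeline.nf --bigwig scores.bw --bed peaks.bed \
#         --genome_fasta genome.fa --motif_db motifs.meme
```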
Optional arguments:
--tfbs_path Path to directory with output BED-files from tfbsscan. If given, tfbsscan will not be run.
Footprint extraction:
--window_length INT (Default: 200) Length of the sliding window
--step INT (Default: 100) Interval by which the window slides
--percentage INT (Default: 0) Percentage to be added to the background while searching for footprints
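To make the interplay of --window_length and --step concrete, the sketch below lists the windows a plain sliding-window scan would visit. It assumes that windows start at the region start and advance by the step size, with the last window clipped to the region end. This is an illustration and not necessarily how the pipeline treats region boundaries; list_windows is a hypothetical helper.

```shell
# Print "start end" for each window over a region of the given length.
# Arguments: region length, window length, step.
list_windows() {
    region_len=$1 window_len=$2 step=$3
    start=0
    while [ "$start" -lt "$region_len" ]; do
        end=$((start + window_len))
        # Clip the last window to the end of the region.
        [ "$end" -gt "$region_len" ] && end=$region_len
        printf '%d %d\n' "$start" "$end"
        start=$((start + step))
    done
}

# With the defaults (window 200, step 100) consecutive windows overlap by 100 bp.
list_windows 500 200 100
```

With a step smaller than the window length, every position is covered by more than one window, whereas a step equal to the window length gives non-overlapping windows.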
Filter unknown motifs:
--min_size_fp INT (Default: 10)
--max_size_fp INT (Default: 100)
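The two size limits can be pictured as a length filter on BED intervals. The sketch below assumes that footprint size means end minus start and that both limits are inclusive. Neither assumption is confirmed by this README, and filter_by_size is a hypothetical helper.

```shell
# Keep BED lines (stdin) whose interval length lies within [min, max].
# BED columns: 1 = chromosome, 2 = start, 3 = end.
filter_by_size() {
    awk -v min="$1" -v max="$2" '{ len = $3 - $2 } len >= min && len <= max'
}

# Example with the default limits 10 and 100: only the 50 bp interval is printed.
printf 'chr1\t100\t105\nchr1\t200\t250\nchr1\t300\t500\n' | filter_by_size 10 100
```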
Cluster:
Sequence preparation/reduction:
--kmer INT Kmer length (Default: 10)
--aprox_motif_len INT Motif length (Default: 10)
--motif_occurence FLOAT Percentage of motifs over all sequences. Use 1 (Default) to assume every sequence contains a motif.
--min_seq_length INT Remove all sequences below this value. (Default: 10)
Clustering:
--global INT Global (=1) or local (=0) alignment. (Default: 0)
--identity FLOAT Identity threshold. (Default: 0.8)
--sequence_coverage INT Minimum aligned nucleotides on both sequences. (Default: 8)
--memory INT Memory limit in MB. 0 for unlimited. (Default: 800)
--throw_away_seq INT Remove all sequences equal to or below this length before clustering. (Default: 9)
--strand INT Align +/+ & +/- (= 1). Or align only +/+ (= 0). (Default: 0)
Motif estimation:
--min_seq INT Minimum number of sequences required in the FASTA-files for GLAM2 (Default: 100)
--motif_min_key INT Minimum number of key positions (aligned columns) (Default: 8)
--motif_max_key INT Maximum number of key positions (aligned columns) (Default: 20)
--iteration INT Number of iterations done by GLAM2. More iterations: better results, higher runtime. (Default: 10000)
--tomtom_treshold FLOAT Threshold for similarity score. (Default: 0.01)
Motif clustering:
--edge_weight INT Minimum weight of edges in motif-cluster-graph (Default: 50)
--motif_similarity_thresh FLOAT Threshold for motif similarity score (Default: 0.00001)
Creating GTF:
--tissue STRING Filter for one or more tissue/category activity, categories as in JSON config (Default: None)
--organism STRING Source organism: [ hg19 | hg38 | mm9 | mm10 ] (Default: hg38)
All arguments can be set in the configuration files.