De novo motif discovery and evaluation based on footprints identified by TOBIAS.
For further information read the documentation.
-
Start with installing conda and downloading all files from the GitHub repository.
-
Every other dependency will be automatically installed using conda. For that a conda environment has to be created from the yaml-file given in this repository. It is required to create and activate the environment from the yaml-file beforehand. This can be done with following commands:
conda env create -f masterenv.yml
conda activate masterenv
- Set the wd parameter in the nextflow.config file as path where the repository is saved. For example: '~/masterJLU2018/'.
Important Notes:
- For the pipeline the package jellyfish from the channel bioconda is needed and NOT the jellyfish package from the channel conda-forge! Please make sure that the right jellyfish package is installed.
nextflow run pipeline.nf --bigwig [BigWig-file] --bed [BED-file] --genome_fasta [FASTA-file] --motif_db [MEME-file] --organism [mm10|mm9|hg19|hg38] --gtf_annotation [GTF-file]
There are files provided inside ./demo/ for a demo run. Go to the main directory and run following command:
nextflow run pipeline.nf --bigwig ./demo/buenrostro50k_chr1_fp.bw --bed ./demo/buenrostro50k_chr1_peaks.bed --genome_fasta ./demo/hg38_chr1.fa --motif_db ./demo/jaspar_vertebrates.meme --out ./demo/buenrostro50k_chr1_out/ --organism hg38 --gtf_annotation ./demo/homo_sapiens.94.mainChr.gtf
For a detailed overview for all parameters follow this link.
Required arguments:
--bigwig Path to BigWig-file
--bed Path to BED-file
--genome_fasta Path to genome in FASTA-format
--motif_db Path to motif-database in MEME-format
--config Path to UROPA configuration file
--gtf_annotation Path to gtf annotation file
--organism Input organism [hg38 | hg19 | mm9 | mm10]
--out Output Directory (Default: './out/')
Optional arguments:
--help [0|1] 1 to show this help message. (Default: 0)
--gtf_merged Path to gtf-file. If path is set the process which creates a gtf-file is skipped.
--tfbs_path Path to directory with tfbsscan output. If given tfbsscan will be skipped.
Footprint extraction:
--window_length INT This parameter sets the length of a sliding window. (Default: 200)
--step INT This parameter sets the number of positions to slide the window forward. (Default: 100)
--percentage INT Threshold in percent (Default: 0)
--min_gap INT If footprints are less than X bases apart the footprints will be merged (Default: 6)
Filter motifs:
--min_size_fp INT Minimum sequence length threshold. Smaller sequences are discarded. (Default: 10)
--max_size_fp INT Maximum sequence length threshold. Discards all sequences longer than this value. (Default: 200)
--tfbsscan_method [moods|fimo] Method used by tfbsscan. (Default: moods)
Cluster:
Sequence preparation/ reduction:
--kmer INT K-mer length (Default: 10)
--aprox_motif_len INT Motif length (Default: 10)
--motif_occurrence FLOAT Percentage of motifs over all sequences. Use 1 (Default) to assume every sequence contains a motif.
--min_seq_length Interations Remove all sequences below this value. (Default: 10)
Clustering:
--global INT Global (=1) or local (=0) alignment. (Default: 0)
--identity FLOAT Identity threshold. (Default: 0.8)
--sequence_coverage INT Minimum aligned nucleotides on both sequences. (Default: 8)
--memory INT Memory limit in MB. 0 for unlimited. (Default: 800)
--throw_away_seq INT Remove all sequences equal or below this length before clustering. (Default: 9)
--strand INT Align +/+ & +/- (= 1). Or align only +/+ (= 0). (Default: 0)
Motif estimation:
--min_seq INT Sets the minimum number of sequences required for the FASTA-files given to GLAM2. (Default: 100)
--motif_min_key INT Minimum number of key positions (aligned columns) in the alignment done by GLAM2. (Default: 8)
--motif_max_key INT Maximum number of key positions (aligned columns) in the alignment done by GLAM2. (Default: 20)
--iteration INT Number of iterations done by GLAM2. More Iterations: better results, higher runtime. (Default: 10000)
--tomtom_treshold FLOAT T hreshold for similarity score. (Default: 0.01)
--best_motif INT Get the best X motifs per cluster. (Default: 3)
--gap_penalty INT Set penalty for gaps in GLAM2 (Default: 1000)
--seed String Set seed for GLAM2 (Default: 123456789)
Moitf clustering:
--cluster_motif Boolean If 1 pipeline clusters motifs. If its 0 it does not. (Defaul: 0)
--edge_weight INT Minimum weight of edges in motif-cluster-graph (Default: 5)
--motif_similarity_thresh FLOAT Threshold for motif similarity score (Default: 0.00001)
Creating GTF:
--tissues List/String List of one or more keywords for tissue-/category-activity, categories must be specified as in JSON config
Evaluation:
--max_uropa_runs INT Maximum number UROPA runs running parallelized (Default: 10)
All arguments can be set in the configuration files
For further information read the documentation.