# masterJLU2018

De novo motif discovery and evaluation based on footprints identified by TOBIAS

## Dependencies

* [conda](https://conda.io/docs/user-guide/install/linux.html)
* [Nextflow](https://www.nextflow.io/)
* [MEME-Suite](http://meme-suite.org/doc/install.html?man_type=web)

## Installation

Start by installing all dependencies listed above. You must then set the [environment paths for MEME-Suite](http://meme-suite.org/doc/install.html?man_type=web#installingtar). This can be done with the following commands:

```
export PATH=[meme-suite installation path]/libexec/meme-[meme-suite version]:$PATH
export PATH=[meme-suite installation path]/bin:$PATH
```

Download all files from the [GitHub repository](https://github.molgen.mpg.de/loosolab/masterJLU2018).

The Nextflow script needs a conda environment to run. Nextflow can create the required environment from the given YAML file. On some systems Nextflow aborts the run with the following error:

```
Caused by:
  Failed to create Conda environment
  command: conda env create --prefix --file env.yml
  status : 143
  message:
```

If this error occurs, you have to create the environment yourself before starting the pipeline. To do so, you need the `masterenv.yml` file from the repository. Run the following commands to create the environment:

```console
path=[Path to given masterenv.yml file]
conda env create --name masterenv --file "$path"
```

Once the environment has been created, set the variable `path_env` in the configuration file to the path of the environment.
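The two `PATH` exports for MEME-Suite shown above can be wrapped in a small shell function so that the installation path and version are passed only once. This is a minimal sketch: the function name `add_meme_to_path` is hypothetical and not part of the repository, and it assumes the MEME-Suite directory layout implied by the exports above.

```shell
# Hypothetical helper (not part of the repository): prepend the two
# MEME-Suite directories to PATH, mirroring the export commands above.
#   $1 = MEME-Suite installation path
#   $2 = MEME-Suite version
add_meme_to_path() {
    if [ "$#" -ne 2 ]; then
        echo "usage: add_meme_to_path <install_path> <version>" >&2
        return 1
    fi
    meme_root=$1
    meme_version=$2
    # Same order as the two export lines: libexec first, then bin.
    PATH="$meme_root/libexec/meme-$meme_version:$PATH"
    PATH="$meme_root/bin:$PATH"
    export PATH
}
```

It could be called as `add_meme_to_path [meme-suite installation path] [meme-suite version]`, for instance from a shell startup file, so the paths are set in every new session.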

## Quick Start

```console
nextflow run pipeline.nf --input [BigWig-file] --bed [BED-file] --genome_fasta [FASTA-file] --jaspar_db [MEME-file]
```

## Parameters

```
Required arguments:
  --input                   Path to BigWig-file
  --bed                     Path to BED-file
  --genome_fasta            Path to genome in FASTA-format
  --jaspar_db               Path to motif-database in MEME-format

Optional arguments:

  Footprint extraction:
  --window_length INT       (Default: 200)
  --step INT                (Default: 100)
  --percentage INT          (Default: 0)

  Filter unknown motifs:
  --min_size_fp INT         (Default: 10)
  --max_size_fp INT         (Default: 100)

  Cluster:
  Sequence preparation/reduction:
  --kmer INT                Kmer length (Default: 10)
  --aprox_motif_len INT     Motif length (Default: 10)
  --motif_occurence FLOAT   Percentage of motifs over all sequences. Use 1 (Default) to assume every sequence contains a motif.
  --min_seq_length INT      Remove all sequences below this value. (Default: 10)

  Clustering:
  --global INT              Global (= 1) or local (= 0) alignment. (Default: 0)
  --identity FLOAT          Identity threshold. (Default: 0.8)
  --sequence_coverage INT   Minimum aligned nucleotides on both sequences. (Default: 8)
  --memory INT              Memory limit in MB. 0 for unlimited. (Default: 800)
  --throw_away_seq INT      Remove all sequences equal or below this length before clustering. (Default: 9)
  --strand INT              Align +/+ & +/- (= 1). Or align only +/+ (= 0). (Default: 0)

  Motif estimation:
  --motif_min_len INT       Minimum length of motif (Default: 8)
  --motif_max_len INT       Maximum length of motif (Default: 20)
  --interation INT          Number of iterations done by glam2. More iterations: better results, higher runtime. (Default: 10000)
  --tomtom_treshold FLOAT   Threshold for similarity score. (Default: 0.01)

  Creating GTF:
  --organism [homo_sapiens | mus_musculus]
  --tissues

All arguments can be set in the configuration files.
```

For further information, read the [documentation](https://github.molgen.mpg.de/loosolab/masterJLU2018/wiki).
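
Because the Quick Start command takes four required files, it can help to check that they exist before starting a run. The following is a sketch under stated assumptions: `print_pipeline_cmd` is a hypothetical helper that is not part of the repository, and it only prints the command (a dry run) rather than invoking Nextflow. It uses only the four required flags listed under Parameters.

```shell
# Hypothetical helper (not part of the repository): verify that the four
# required input files exist, then print the Quick Start command.
#   $1 = BigWig file, $2 = BED file, $3 = genome FASTA, $4 = motif database (MEME format)
print_pipeline_cmd() {
    if [ "$#" -ne 4 ]; then
        echo "usage: print_pipeline_cmd <bigwig> <bed> <genome_fasta> <jaspar_db>" >&2
        return 1
    fi
    # Stop at the first argument that is not an existing regular file.
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing input file: $f" >&2
            return 1
        fi
    done
    # Paths are printed unquoted; paths containing spaces need manual quoting.
    printf 'nextflow run pipeline.nf --input %s --bed %s --genome_fasta %s --jaspar_db %s\n' \
        "$1" "$2" "$3" "$4"
}
```

The printed line can be inspected and then pasted into the shell, with any optional arguments from the list above appended.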