Moved snakemake from TOBIAS main to separate repository; Added Wilson visualization rules
msbentsen committed Mar 22, 2019
0 parents commit bf02df0
Showing 111 changed files with 3,794 additions and 0 deletions.
7 changes: 7 additions & 0 deletions .gitignore
@@ -0,0 +1,7 @@
*.pyc
*.c
.snakemake/
build/
dist/
*.egg
*.egg-info
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2017 MPI for Heart and Lung Research

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
28 changes: 28 additions & 0 deletions README.md
@@ -0,0 +1,28 @@
TOBIAS Snakemake pipeline
=======================================

Introduction
------------

ATAC-seq (Assay for Transposase-Accessible Chromatin using high-throughput sequencing) is a sequencing assay for investigating genome-wide chromatin accessibility. The assay applies a Tn5 Transposase to insert sequencing adapters into accessible chromatin, enabling mapping of regulatory regions across the genome. Additionally, the local distribution of Tn5 insertions contains information about transcription factor binding due to the visible depletion of insertions around sites bound by protein - known as _footprints_.

**TOBIAS** is a collection of command-line bioinformatics tools for performing footprinting analysis on ATAC-seq data. Please see the [TOBIAS github repository](https://github.molgen.mpg.de/loosolab/TOBIAS/) for details about the individual tools.

Snakemake how-to:
-----------------

To use the snakemake pipeline, make sure the included conda environments are installed:
```bash
$ conda env create -f environments/tobias.yaml
$ conda env create -f environments/macs.yaml
```

You can use the example config (TOBIAS_example.config) or adjust it to your own data by replacing the values for each key. Pass the config file you want to use to `--configfile` and run:
```bash
$ conda activate TOBIAS_ENV
$ snakemake --configfile example_config.yaml --cores [number of cores]
```
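
As a rough guide to what the config needs to contain, the sketch below lists the keys that the Snakefile checks for (`data`, `run_info` and the six `run_info` sub-keys) and tests a config against them. It is an illustration only and not part of the pipeline: the function name `missing_keys`, the condition names `Bcell` and `Tcell`, and every placeholder path are assumptions made for this example. Only the bam and blacklist paths correspond to the test data shipped in `data/`.

```python
# Illustrative sketch, not part of the pipeline: mirrors the key check done in the Snakefile.
REQUIRED_KEYS = [
    ("data",),
    ("run_info",),
    ("run_info", "organism"),
    ("run_info", "fasta"),
    ("run_info", "blacklist"),
    ("run_info", "gtf"),
    ("run_info", "motifs"),
    ("run_info", "output"),
]


def missing_keys(config):
    """Return the required key paths that are absent or empty in config."""
    missing = []
    for key_path in REQUIRED_KEYS:
        node = config
        for key in key_path:
            if not isinstance(node, dict) or node.get(key) is None:
                missing.append(":".join(key_path))
                break
            node = node[key]
    return missing


# Hypothetical config: condition names and placeholder paths are assumptions.
example_config = {
    "data": {
        "Bcell": ["data/Bcell_chr4.bam"],
        "Tcell": ["data/Tcell_chr4_1.bam", "data/Tcell_chr4_2.bam"],
    },
    "run_info": {
        "organism": "your_organism",        # placeholder
        "fasta": "path/to/genome.fa",       # placeholder
        "blacklist": "data/blacklist_chr4.bed",
        "gtf": "path/to/genes.gtf",         # placeholder
        "motifs": "path/to/motif_dir",      # placeholder
        "output": "path/to/output_dir",     # placeholder
    },
}

print(missing_keys(example_config))  # prints []
```

Each condition under `data` maps to one or more bam files. A single path given without a list is wrapped into a list by the Snakefile.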

Contact
------------
Mette Bentsen (mette.bentsen (at) mpi-bn.mpg.de)
171 changes: 171 additions & 0 deletions Snakefile
@@ -0,0 +1,171 @@
"""
Upper level TOBIAS snake
"""

import os
import sys
import subprocess
import itertools

#Set config
if workflow.overwrite_configfile != None:
    configfile: str(workflow.overwrite_configfile)
    CONFIGFILE = str(workflow.overwrite_configfile)
else:
    configfile: 'TOBIAS.config'
    CONFIGFILE = 'TOBIAS.config'    #name the default file in error messages instead of "None"

#Snake modules used to setup run
include: "snakefiles/helper.snake"

#shell.prefix("")

#-------------------------------------------------------------------------------#
#------------------------- CHECK FORMAT OF CONFIG FILE -------------------------#
#-------------------------------------------------------------------------------#

required = [("data",),
            ("run_info",),
            ("run_info", "organism"),
            ("run_info", "fasta"),
            ("run_info", "blacklist"),
            ("run_info", "gtf"),
            ("run_info", "motifs"),
            ("run_info", "output"),
            ]

#Check that all keys exist and contain information
for key_list in required:
    lookup_dict = config
    for key in key_list:
        try:
            lookup_dict = lookup_dict[key]
            if lookup_dict is None:
                print("ERROR: Missing input for key {0}".format(key_list))
        except (KeyError, TypeError):    #key is absent, or the parent entry is empty
            print("ERROR: Could not find key(s) \"{0}\" in configfile {1}. Please check that your configfile has right format for TOBIAS.".format(":".join(key_list), CONFIGFILE))
            sys.exit(1)

#Check if there is at least one condition with bamfiles
if len(config["data"]) > 0:
    for condition in config["data"]:
        if len(config["data"][condition]) == 0:
            print("ERROR: Could not find any bamfiles in \"{0}\" in configfile {1}".format(":".join(("data", condition)), CONFIGFILE))
else:
    #Literal braces must be doubled inside a str.format template
    print("ERROR: Could not find any conditions (\"data:{{condition}}\") in configfile {0}".format(CONFIGFILE))
    sys.exit(1)


#-------------------------------------------------------------------------------#
#------------------------- WHICH FILES/INFO WERE INPUT? ------------------------#
#-------------------------------------------------------------------------------#

input_files = []

#Files related to experimental data (bam)
CONDITION_IDS = list(config["data"].keys())
for condition in CONDITION_IDS:
    if not isinstance(config["data"][condition], list):
        config['data'][condition] = [config['data'][condition]]    #wrap a single bam path in a list
    input_files.extend(config['data'][condition])


#Flatfiles independent from experimental data (run_info)
FASTA = config['run_info']['fasta']
BLACKLIST = config['run_info']['blacklist']
GTF = config['run_info']['gtf']
OUTPUTDIR = config['run_info']["output"]
MOTIFDIR = config['run_info']['motifs']

input_files.extend([FASTA, BLACKLIST, GTF])


#---------- Test that input files exist -----------#
for file in input_files:
    if file is not None:
        full_path = os.path.abspath(file)
        if not os.path.exists(full_path):
            sys.exit("ERROR: The following file given in config does not exist: {0}".format(full_path))


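#--------------------------------- MOTIFS --------------------------------------#
#Identify IDS of motifs
files = os.listdir(MOTIFDIR)
MOTIF_FILES = {}
for file in files:
    full_file = os.path.join(MOTIFDIR, file)
    ID = None    #reset per file so an ID from a previous file is never reused
    with open(full_file) as f:
        for line in f:
            if line.startswith("MOTIF"):    #header of the form "MOTIF <id> <name>" (e.g. MEME format)
                columns = line.rstrip().split()
                ID = columns[2] + "_" + columns[1]
                ID = filafy(ID)
            elif line.startswith(">"):      #header of the form "><id> <name>" (e.g. JASPAR format)
                columns = line.replace(">", "").rstrip().split()
                ID = columns[1] + "_" + columns[0]
                ID = filafy(ID)
    #Register the file under the last ID found (one motif per file is assumed)
    if ID is not None:
        MOTIF_FILES[ID] = full_file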

TF_IDS = list(MOTIF_FILES.keys())


#-------------------------------------------------------------------------------#
#------------------------ WHICH FILES SHOULD BE CREATED? -----------------------#
#-------------------------------------------------------------------------------#

output_files = []


id2bam = {condition:{} for condition in CONDITION_IDS}
for condition in CONDITION_IDS:
    config_bams = config['data'][condition]
    sampleids = [os.path.splitext(os.path.basename(bam))[0] for bam in config_bams]
    id2bam[condition] = {sampleids[i]:config_bams[i] for i in range(len(sampleids))}    # Link sample ids to bams

PLOTNAMES = expand("{condition}_{plotname}", condition=CONDITION_IDS, plotname=["aggregate"])
if len(CONDITION_IDS) > 1:
    PLOTNAMES.extend(["heatmap_comparison", "aggregate_comparison_all", "aggregate_comparison_bound"])

output_files.extend(expand(os.path.join(OUTPUTDIR, "footprinting", "{condition}_footprints.bw"), condition=CONDITION_IDS))

#output_files.append(os.path.join(OUTPUTDIR, "TFBS", "bindetect_results.txt"))
#output_files.append(os.path.join(OUTPUTDIR, "overview", "bindetect_results.txt"))

#Visualization
output_files.extend(expand(os.path.join(OUTPUTDIR, "TFBS", "{TF}", "plots", "{TF}_{plotname}.pdf"), TF=TF_IDS, plotname=PLOTNAMES))
output_files.extend(expand(os.path.join(OUTPUTDIR, "overview", "all_{plotname}.pdf"), plotname=PLOTNAMES))

#Wilson
output_files.extend(expand(os.path.join(OUTPUTDIR, "wilson", "data", "{TF}_overview.clarion"), TF=TF_IDS))
output_files.append(os.path.join(OUTPUTDIR, "wilson", "HOW_TO_WILSON.txt"))

#-------------------------------------------------------------------------------#
#------------------------ DEAL WITH SPECIAL ENVIRONMENTS -----------------------#
#-------------------------------------------------------------------------------#

sys_env = subprocess.check_output(['conda', 'env', 'list'], universal_newlines=True)
env_list = [line.split()[0] for line in sys_env.split("\n") if len(line.split()) > 0]

# default TOBIAS environment
if "TOBIAS_ENV" not in env_list:
    print("Creating TOBIAS environment for the first time")
    subprocess.call(["conda", "env", "create", "--file", "environments/tobias.yaml"])

# python 2 related envs
if "MACS_ENV" not in env_list:
    print("Creating macs environment for the first time")
    subprocess.call(["conda", "env", "create", "--file", "environments/macs.yaml"])


#-------------------------------------------------------------------------------#
#---------------------------------- RUN :-) ------------------------------------#
#-------------------------------------------------------------------------------#

include: "snakefiles/preprocessing.snake"
include: "snakefiles/footprinting.snake"
include: "snakefiles/visualization.snake"
include: "snakefiles/wilson.snake"

rule all:
    input:
        output_files
    message: "Rule all"
Binary file added data/Bcell_chr4.bam
Binary file not shown.
Binary file added data/Bcell_chr4.bam.bai
Binary file not shown.
Binary file added data/Tcell_chr4_1.bam
Binary file not shown.
Binary file added data/Tcell_chr4_1.bam.bai
Binary file not shown.
Binary file added data/Tcell_chr4_2.bam
Binary file not shown.
Binary file added data/Tcell_chr4_2.bam.bai
Binary file not shown.
2 changes: 2 additions & 0 deletions data/blacklist_chr4.bed
@@ -0,0 +1,2 @@
chr4 49118760 49119010
chr4 49120790 49121130
