Initial commit

loosolab · Dec 11, 2018 · 05e2dbd
commit 05e2dbd
Showing 82 changed files with 1,075,298 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,7 @@
+*.pyc
+*.c
+.snakemake/
+build/
+dist/
+*.egg
+*.egg-info
diff --git a/LICENSE b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2017 MPI for Heart and Lung Research
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/MANIFEST.in b/MANIFEST.in
@@ -0,0 +1,2 @@
+include README.md
+include LICENSE
diff --git a/README.md b/README.md
@@ -0,0 +1,76 @@
+TOBIAS - Transcription factor Occupancy prediction By Investigation of ATAC-seq Signal 
+=======================================
+
+Introduction 
+------------
+
+ATAC-seq (Assay for Transposase-Accessible Chromatin using high-throughput sequencing) is a sequencing assay for investigating genome-wide chromatin accessibility. The assay applies a Tn5 transposase to insert sequencing adapters into accessible chromatin, enabling mapping of regulatory regions across the genome. Additionally, the local distribution of Tn5 insertions contains information about transcription factor binding: insertions are visibly depleted around sites bound by protein, leaving so-called _footprints_.
+
+**TOBIAS** is a collection of command-line bioinformatics tools for performing footprinting analysis on ATAC-seq data, and includes:
+
+<img align="right" width=150 src="/figures/tobias.png">
+
+- Correction of Tn5 insertion bias
+- Calculation of footprint scores within regulatory regions
+- Estimation of bound/unbound transcription factor binding sites
+- Visualization of footprints within and across different conditions
+
+For information on each tool, please see the [wiki](https://github.molgen.mpg.de/loosolab/TOBIAS/wiki/).
+
+Installation
+------------
+TOBIAS is written as a Python package and can be installed within a conda environment using:
+```bash
+$ git clone https://github.molgen.mpg.de/loosolab/TOBIAS
+$ cd TOBIAS
+$ conda env create -f snakemake_pipeline/environments/tobias.yaml
+$ conda activate TOBIAS_ENV
+$ python setup.py install
+```
+Please see the [installation](https://github.molgen.mpg.de/loosolab/TOBIAS/wiki/installation) page for more info.
+
+Usage
+------------
+All tools are available on the command line as ```TOBIAS <TOOLNAME>```, for example:
+``` 
+$ TOBIAS ATACorrect
+__________________________________________________________________________________________
+
+                                   TOBIAS ~ ATACorrect
+__________________________________________________________________________________________
+
+ATACorrect corrects the cutsite-signal from ATAC-seq with regard to the underlying
+sequence preference of Tn5 transposase.
+
+Usage:
+TOBIAS ATACorrect --bam <reads.bam> --genome <genome.fa> --peaks <peaks.bed>
+
+Output files:
+- <outdir>/<prefix>_uncorrected.bw
+- <outdir>/<prefix>_bias.bw
+- <outdir>/<prefix>_expected.bw
+- <outdir>/<prefix>_corrected.bw
+- <outdir>/<prefix>_atacorrect.pdf
+
+(...)
+```
+
+Snakemake pipeline
+------------
+
+You can run each TOBIAS tool independently or as part of a pipeline using the included Snakemake workflow. Set the paths to the required data in snakemake_pipeline/TOBIAS.config and run:
+```bash
+$ cd snakemake_pipeline
+$ conda activate TOBIAS_ENV
+$ snakemake --snakefile TOBIAS.snake --configfile TOBIAS.config --cores [number of cores] --keep-going
+```
+For further info on setup, configfile and output, please consult the [wiki](https://github.molgen.mpg.de/loosolab/TOBIAS/wiki/snakemake-pipeline).
+
+License
+------------
+This project is licensed under the [MIT license](LICENSE). 
+
+
+Contact
+------------
+Mette Bentsen (mette.bentsen (at) mpi-bn.mpg.de)
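
Note on the README usage section: the ATACorrect usage line can also be assembled from Python, for example when scripting several samples. The following is a minimal sketch, not part of the commit. The helper name `build_atacorrect_command` is hypothetical, and only the three flags printed in the README usage line (`--bam`, `--genome`, `--peaks`) are used.

```python
import shlex


def build_atacorrect_command(bam, genome, peaks):
    """Return the argument list for the ATACorrect call shown in the README.

    Hypothetical helper: it only mirrors the usage line
    'TOBIAS ATACorrect --bam <reads.bam> --genome <genome.fa> --peaks <peaks.bed>'.
    """
    return ["TOBIAS", "ATACorrect",
            "--bam", bam,
            "--genome", genome,
            "--peaks", peaks]


cmd = build_atacorrect_command("reads.bam", "genome.fa", "peaks.bed")

# Print a shell-safe rendering of the command for inspection.
print(" ".join(shlex.quote(part) for part in cmd))

# To execute it, TOBIAS must be installed and on PATH:
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids quoting problems when file paths contain spaces.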
diff --git a/figures/Thumbs.db b/figures/Thumbs.db
diff --git a/figures/atacorrect.png b/figures/atacorrect.png
diff --git a/figures/bindetect.png b/figures/bindetect.png
diff --git a/figures/footprinting.png b/figures/footprinting.png
diff --git a/figures/tobias.png b/figures/tobias.png
diff --git a/setup.py b/setup.py
@@ -0,0 +1,49 @@
+from setuptools import setup, Extension
+import numpy as np
+
+def readme():
+    with open('README.md') as f:
+        return f.read()
+
+ext_modules = [Extension("tobias.utils.ngs", ["tobias/utils/ngs.pyx"], include_dirs=[np.get_include()]),
+               Extension("tobias.utils.sequences", ["tobias/utils/sequences.pyx"], include_dirs=[np.get_include()]),
+               Extension("tobias.utils.signals", ["tobias/utils/signals.pyx"], include_dirs=[np.get_include()])]
+
+setup(name='tobias',
+      version='1.0.0',
+      description='Transcription factor Occupancy prediction By Investigation of ATAC-seq Signal',
+      long_description=readme(),
+      url='https://github.molgen.mpg.de/loosolab/TOBIAS',
+      author='Mette Bentsen',
+      author_email='mette.bentsen@mpi-bn.mpg.de',
+      license='MIT',
+      packages=['tobias', 'tobias.footprinting', 'tobias.utils', 'tobias.plotting', 'tobias.motifs'],
+      entry_points = {
+        'console_scripts': ['TOBIAS=tobias.TOBIAS:main']
+      },
+      install_requires=[
+        'setuptools_cython',
+        'numpy',
+        'scipy',
+        'pyBigWig',
+        'pysam',
+        'pybedtools',
+        'matplotlib>=2',
+        'scikit-learn',
+        'pandas',
+        'pypdf2',
+        'xlsxwriter',
+        'adjustText',
+      ],
+      #dependency_links=['https://github.com/jhkorhonen/MOODS/tarball/master'],
+      classifiers = [
+        'License :: OSI Approved :: MIT License',
+        'Intended Audience :: Science/Research',
+        'Topic :: Scientific/Engineering :: Bio-Informatics',
+        'Programming Language :: Python :: 3'
+      ],
+      zip_safe=False,
+      include_package_data=True,
+      ext_modules = ext_modules,
+      scripts=["tobias/utils/peak_annotation.sh"]
+      )
diff --git a/snakemake_pipeline/TOBIAS.config b/snakemake_pipeline/TOBIAS.config
@@ -0,0 +1,33 @@
+#-------------------------------------------------------------------------#
+#-------------------------- TOBIAS input data ----------------------------#
+#-------------------------------------------------------------------------#
+
+data:
+    control: [test_data/control_s1.bam, test_data/control_s2.bam]         #list of bam files
+    treatment: [test_data/treatment_s1.bam]   #list of bam files
+
+run_info:
+  organism: human                             #mouse/human
+  fasta: test_data/genome.fa                  #.fasta-file containing organism genome
+  blacklist: test_data/blacklist.bed          #.bed-file containing blacklisted regions
+  gtf: test_data/genes.gtf                    #.gtf-file for annotation of peaks
+  motifs: test_data/motifs/                   #directory containing motifs (single files in meme or JASPAR pfm format)  
+  output: test_output/                        #output directory 
+
+
+
+#-------------------------------------------------------------------------#
+#----------------------- Default module parameters -----------------------#
+#-------------------------------------------------------------------------#
+
+macs: "--nomodel --shift -100 --extsize 200 --broad"
+
+# for parameter description see uropa manual: http://uropa-manual.readthedocs.io/config.html
+# adjust filter attribute for the given gtf: Ensembl gene_biotype / GENCODE gene_type
+# other optional parameters: --filter_attribute gene_biotype --attribute_value protein_coding
+uropa: "--feature gene --feature_anchor start --distance [10000,1000] --show_attribute gene_name,gene_id,gene_biotype" 
+
+atacorrect: ""
+footprinting: ""
+bindetect: ""
+plotting: ""
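
Note on the config layout: once parsed, this file is a nested mapping with a `data` block (condition name to list of BAM files) and a `run_info` block. The following is a minimal sketch of checking that structure before starting a run, assuming the config has already been loaded into a plain `dict`. The function `find_config_problems` is hypothetical and not part of TOBIAS. The key list mirrors the `required` list in `TOBIAS.snake`.

```python
# Key paths expected in the config (mirrors the 'required' list in TOBIAS.snake).
REQUIRED_KEYS = [
    ("data",),
    ("run_info", "organism"),
    ("run_info", "fasta"),
    ("run_info", "blacklist"),
    ("run_info", "gtf"),
    ("run_info", "motifs"),
    ("run_info", "output"),
]


def find_config_problems(config, required=REQUIRED_KEYS):
    """Return a list of human-readable problems; an empty list means the layout is fine."""
    problems = []
    for key_path in required:
        node = config
        for key in key_path:
            if not isinstance(node, dict) or key not in node:
                problems.append("missing key: " + ":".join(key_path))
                break
            node = node[key]
        else:
            # All keys were found; the value itself must not be empty (YAML null).
            if node is None:
                problems.append("empty value: " + ":".join(key_path))

    # Every condition under 'data' needs at least one BAM file.
    data = config.get("data")
    if isinstance(data, dict):
        if not data:
            problems.append("no conditions under data")
        for condition, bams in data.items():
            if not bams:
                problems.append("no bam files: data:" + condition)
    return problems


example = {
    "data": {
        "control": ["test_data/control_s1.bam", "test_data/control_s2.bam"],
        "treatment": ["test_data/treatment_s1.bam"],
    },
    "run_info": {
        "organism": "human",
        "fasta": "test_data/genome.fa",
        "blacklist": "test_data/blacklist.bed",
        "gtf": "test_data/genes.gtf",
        "motifs": "test_data/motifs/",
        "output": "test_output/",
    },
}

print(find_config_problems(example))
```

Collecting all problems in a list, instead of exiting at the first one, lets a user fix the whole config in one pass.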
diff --git a/snakemake_pipeline/TOBIAS.snake b/snakemake_pipeline/TOBIAS.snake
@@ -0,0 +1,195 @@
+"""
+Upper level TOBIAS snake
+"""
+
+import os
+import subprocess
+import itertools
+
+#Set config
+if workflow.overwrite_configfile != None:
+	configfile: str(workflow.overwrite_configfile)
+else:
+	configfile: 'TOBIAS.config'
+CONFIGFILE = str(workflow.overwrite_configfile or 'TOBIAS.config')	#fall back to the default name so error messages do not print "None"
+
+include: "snakefiles/helper.snake"
+#shell.prefix("")
+
+#-------------------------------------------------------------------------------#
+#------------------------- CHECK FORMAT OF CONFIG FILE -------------------------#
+#-------------------------------------------------------------------------------#
+
+required = [("data",),
+			("run_info",),
+				("run_info", "organism"),
+				("run_info", "fasta"),
+				("run_info", "blacklist"),
+				("run_info", "gtf"),
+				("run_info", "motifs"),
+				("run_info", "output"),
+			]
+
+#Check that all keys exist and contain information
+for key_list in required:
+	lookup_dict = config
+	for key in key_list:
+		try:
+			lookup_dict = lookup_dict[key]
+			if lookup_dict is None:
+				print("ERROR: Missing input for key {0}".format(key_list))
+		except (KeyError, TypeError):
+			print("ERROR: Could not find key(s) \"{0}\" in configfile {1}. Please check that your configfile has the right format for TOBIAS.".format(":".join(key_list), CONFIGFILE))
+			sys.exit(1)
+
+#Check if there is at least one condition with bamfiles
+if len(config["data"]) > 0:
+	for condition in config["data"]:
+		if len(config["data"][condition]) == 0:
+			print("ERROR: Could not find any bamfiles in \"{0}\" in configfile {1}".format(":".join(("data", condition)), CONFIGFILE))
+else:
+	print("ERROR: Could not find any conditions (\"data:{{condition}}\") in configfile {0}".format(CONFIGFILE))
+	sys.exit()
+
+
+#-------------------------------------------------------------------------------#
+#------------------------- WHICH FILES/INFO WERE INPUT? ------------------------#
+#-------------------------------------------------------------------------------#
+
+input_files = []
+
+#Files related to experimental data (bam)
+CONDITION_IDS = list(config["data"].keys())
+for condition in CONDITION_IDS:
+	if not isinstance(config["data"][condition], list):
+		config['data'][condition] = [config['data'][condition]]
+	input_files.extend(config['data'][condition])
+
+
+#Flatfiles independent from experimental data (run_info)
+FASTA = config['run_info']['fasta']
+BLACKLIST = config['run_info']['blacklist']
+GTF = config['run_info']['gtf']
+OUTPUTDIR = config['run_info']["output"]
+BLACKLIST = config['run_info']['blacklist']
+MOTIFDIR = config['run_info']['motifs']
+
+input_files.extend([FASTA, BLACKLIST, GTF])
+
+
+#---------- Test that input files exist -----------#
+for file in input_files:
+	if file != None:
+		full_path = os.path.abspath(file) 
+		if not os.path.exists(full_path):
+			exit("ERROR: The following file given in config does not exist: {0}".format(full_path))
+
+
+
+#-------------------------------------------------------------------------------#
+#------------------------ WHICH FILES SHOULD BE CREATED? -----------------------#
+#-------------------------------------------------------------------------------#
+
+output_files = []
+
+#--------------------------------- MOTIFS --------------------------------------#
+#Identify IDS of motifs
+files = os.listdir(MOTIFDIR)
+MOTIF_FILES = {}
+for file in files:
+	full_file = os.path.join(MOTIFDIR, file)
+	with open(full_file) as f:
+		for line in f:
+			if line.startswith("MOTIF"):
+				columns = line.rstrip().split()
+				ID = columns[2] + "_" + columns[1]
+				ID = filafy(ID)
+			elif line.startswith(">"):
+				columns = line.replace(">", "").rstrip().split()
+				ID = columns[1] + "_" + columns[0]
+				ID = filafy(ID)	
+		MOTIF_FILES[ID] = full_file
+
+TF_IDS = list(MOTIF_FILES.keys())
+
+
+#---------------------------- OUTPUT PER CONDITION -----------------------------#
+
+id2bam = {condition:{} for condition in CONDITION_IDS}
+
+for condition in CONDITION_IDS:
+
+	config_bams = config['data'][condition]
+	sampleids = [os.path.splitext(os.path.basename(bam))[0] for bam in config_bams]
+	id2bam[condition] = {sampleids[i]:config_bams[i] for i in range(len(sampleids))}	# Link sample ids to bams
+
+
+PLOTNAMES = expand("{condition}_{plotname}", condition=CONDITION_IDS, plotname=["heatmap", "aggregate"])
+if len(CONDITION_IDS) > 1:
+	PLOTNAMES.extend(["heatmap_comparison", "aggregate_comparison"]) 
+
+output_files.append(expand(os.path.join(OUTPUTDIR, "footprinting", "{condition}_footprints.bw"), condition=CONDITION_IDS))
+
+#output_files.append(os.path.join(OUTPUTDIR, "overview", "TFBS_distance.txt"))
+output_files.append(os.path.join(OUTPUTDIR, "TFBS", "bindetect_results.txt"))
+output_files.append(os.path.join(OUTPUTDIR, "overview", "bindetect_results.txt"))
+
+#Visualization
+output_files.extend(expand(os.path.join(OUTPUTDIR, "TFBS", "{TF}", "plots", "{TF}_{plotname}.pdf"), TF=TF_IDS, plotname=PLOTNAMES))
+output_files.extend(expand(os.path.join(OUTPUTDIR, "overview", "all_{plotname}.pdf"), plotname=PLOTNAMES))
+
+
+#-------------------------- OUTPUT ACROSS CONDITIONS ---------------------------#
+
+"""
+COMPARE_COND = 0
+if len(CONDITION_IDS) > 1:
+	COMPARE_COND = 1 	# flag
+	output_files.extend(expand(os.path.join(OUTPUTDIR, "TFBS", "{TF}", "plots", "{TF}_heatmap_comparison.pdf"), TF=TF_IDS))
+	output_files.extend(expand(os.path.join(OUTPUTDIR, "TFBS", "{TF}", "plots", "{TF}_aggregate_comparison.pdf"), TF=TF_IDS))
+	#output_files.extend([os.path.join(OUTPUTDIR, "overview", "diff_bind_plot.pdf")])
+
+"""
+#-------------------------------- OTHER OUTPUT ---------------------------------#
+
+
+
+
+#-------------------------------------------------------------------------------#
+#--------------------- WHICH SNAKE MODULES SHOULD BE USED? ---------------------#
+#-------------------------------------------------------------------------------#
+
+include: "snakefiles/preprocessing.snake"
+include: "snakefiles/footprinting.snake"
+include: "snakefiles/visualization.snake"
+
+
+
+#-------------------------------------------------------------------------------#
+#------------------------ DEAL WITH SPECIAL ENVIRONMENTS -----------------------#
+#-------------------------------------------------------------------------------#
+
+"""
+sys_env = subprocess.check_output(['conda', 'env', 'list'], universal_newlines=True)
+env_list = [line.split()[0] for line in sys_env.split("\n") if len(line.split()) > 0]
+
+# default TOBIAS environment
+if "TOBIAS_ENV" not in env_list:
+	print("Creating TOBIAS environment for the first time")
+	subprocess.call(["conda", "env", "create", "--file", "environments/tobias.yaml"])
+
+# python 2 related envs 
+if "MACS_ENV" not in env_list:
+	print("Creating macs environment for the first time")
+	subprocess.call(["conda", "env", "create", "--file", "environments/macs.yaml"])
+
+"""
+#-------------------------------------------------------------------------------#
+#---------------------------------- RUN :-) ------------------------------------#
+#-------------------------------------------------------------------------------#
+
+rule all:
+	input: 
+		output_files
+	message: "Rule all"
+
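
Note on the motif scan in `TOBIAS.snake`: the loop derives one ID per motif file from either a MEME `MOTIF` line or a JASPAR `>` header, and it assumes every file contains such a line with both an ID and a name. The following is a minimal sketch of a more defensive variant, not the code in the commit. It returns all header IDs found, tolerates headers without a name, and returns nothing for files without a header. `sanitize` is a hypothetical stand-in for `filafy`, whose definition lives in `snakefiles/helper.snake` and is not shown in this part of the diff.

```python
import re


def sanitize(name):
    """Hypothetical stand-in for filafy(): replace characters unsafe in file names."""
    return re.sub(r"[^A-Za-z0-9_.-]", "_", name)


def motif_ids(lines):
    """Return '<name>_<id>' for each MEME 'MOTIF' or JASPAR '>' header line.

    If a header carries only an ID, the ID alone is returned.
    """
    ids = []
    for line in lines:
        if line.startswith("MOTIF"):
            columns = line.split()[1:]      # drop the 'MOTIF' keyword
        elif line.startswith(">"):
            columns = line[1:].split()      # drop the leading '>'
        else:
            continue
        if not columns:
            continue                        # header without any ID: skip it
        motif_id = columns[0]
        name = columns[1] if len(columns) > 1 else None
        ids.append(sanitize(name + "_" + motif_id if name else motif_id))
    return ids


meme_lines = ["MEME version 4", "MOTIF MA0139.1 CTCF", "letter-probability matrix:"]
jaspar_lines = [">MA0139.1\tCTCF", "A [ 1 2 3 ]"]

print(motif_ids(meme_lines))    # ['CTCF_MA0139.1']
print(motif_ids(jaspar_lines))  # ['CTCF_MA0139.1']
```

Returning a list per file makes the "no header found" case explicit, so the caller can skip or report such files instead of reusing an ID left over from a previous file.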
diff --git a/snakemake_pipeline/environments/macs.yaml b/snakemake_pipeline/environments/macs.yaml
@@ -0,0 +1,8 @@
+name: MACS_ENV
+
+channels:
+  - bioconda
+  - conda-forge
+
+dependencies:
+  - macs2