This repository has been archived by the owner. It is now read-only.

Initial commit
msbentsen committed Dec 11, 2018
0 parents commit 05e2dbd
Showing 82 changed files with 1,075,298 additions and 0 deletions.
7 changes: 7 additions & 0 deletions .gitignore
@@ -0,0 +1,7 @@
*.pyc
*.c
.snakemake/
build/
dist/
*.egg
*.egg-info
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2017 MPI for Heart and Lung Research

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
2 changes: 2 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,2 @@
include README.md
include LICENSE
76 changes: 76 additions & 0 deletions README.md
@@ -0,0 +1,76 @@
TOBIAS - Transcription factor Occupancy prediction By Investigation of ATAC-seq Signal
=======================================

Introduction
------------

ATAC-seq (Assay for Transposase-Accessible Chromatin using high-throughput sequencing) is a sequencing assay for investigating genome-wide chromatin accessibility. The assay applies a Tn5 Transposase to insert sequencing adapters into accessible chromatin, enabling mapping of regulatory regions across the genome. Additionally, the local distribution of Tn5 insertions contains information about transcription factor binding due to the visible depletion of insertions around sites bound by protein - known as _footprints_.
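
To make the idea of a footprint concrete, here is a minimal sketch. It is not the TOBIAS scoring method: it assumes a hypothetical per-base list of Tn5 insertion counts and scores a site as the mean flank signal minus the mean signal over the site. The function name `footprint_depth` and the window sizes are illustrative assumptions.

```python
from statistics import mean

def footprint_depth(insertions, center, half_width=5, flank=10):
    """Mean flank signal minus mean signal over the putative binding site.

    Positive values mean fewer insertions at the site than around it.
    All names and window sizes here are illustrative assumptions.
    """
    lo, hi = center - half_width, center + half_width + 1
    if lo - flank < 0 or hi + flank > len(insertions):
        raise ValueError("window extends past the signal")
    site = insertions[lo:hi]
    flanks = insertions[lo - flank:lo] + insertions[hi:hi + flank]
    return mean(flanks) - mean(site)

# Toy signal: 8 insertions per base, dropping to 1 across an 11 bp protected site.
signal = [8] * 20 + [1] * 11 + [8] * 20
depth = footprint_depth(signal, center=25)
print(depth)
```

A flat signal gives a depth of zero, while a protected site surrounded by accessible chromatin gives a positive depth.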

**TOBIAS** is a collection of command-line bioinformatics tools for performing footprinting analysis on ATAC-seq data, and includes:

<img align="right" width=150 src="/figures/tobias.png">

- Correction of Tn5 insertion bias
- Calculation of footprint scores within regulatory regions
- Estimation of bound/unbound transcription factor binding sites
- Visualization of footprints within and across different conditions

For information on each tool, please see the [wiki](https://github.molgen.mpg.de/loosolab/TOBIAS/wiki/).

Installation
------------
TOBIAS is written as a Python package and can be installed within a conda environment using:
```bash
$ git clone https://github.molgen.mpg.de/loosolab/TOBIAS
$ cd TOBIAS
$ conda env create -f snakemake_pipeline/environments/tobias.yaml
$ conda activate TOBIAS_ENV
$ python setup.py install
```
Please see the [installation](https://github.molgen.mpg.de/loosolab/TOBIAS/wiki/installation) page for more info.

Usage
------------
All tools are available through the command-line as ```TOBIAS <TOOLNAME>```, for example:
```
$ TOBIAS ATACorrect
__________________________________________________________________________________________
TOBIAS ~ ATACorrect
__________________________________________________________________________________________
ATACorrect corrects the cutsite-signal from ATAC-seq with regard to the underlying
sequence preference of Tn5 transposase.
Usage:
TOBIAS ATACorrect --bam <reads.bam> --genome <genome.fa> --peaks <peaks.bed>
Output files:
- <outdir>/<prefix>_uncorrected.bw
- <outdir>/<prefix>_bias.bw
- <outdir>/<prefix>_expected.bw
- <outdir>/<prefix>_corrected.bw
- <outdir>/<prefix>_atacorrect.pdf
(...)
```
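
If you prefer to drive the tools from a script, the call above can be assembled as an argument list. This is a minimal sketch that uses only the three flags shown in the usage text. The helper name `atacorrect_command` is hypothetical, and the file names are placeholders.

```python
import shlex

def atacorrect_command(bam, genome, peaks):
    """Return the argument list for the ATACorrect call shown in the usage text."""
    return ["TOBIAS", "ATACorrect", "--bam", bam, "--genome", genome, "--peaks", peaks]

# Placeholder file names; replace with real paths.
cmd = atacorrect_command("reads.bam", "genome.fa", "peaks.bed")
print(shlex.join(cmd))

# To actually run it (requires TOBIAS on the PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when paths contain spaces.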

Snakemake pipeline
------------

You can run each TOBIAS tool independently or as part of a pipeline using the included Snakemake workflow. Set the paths to the required data in `snakemake_pipeline/TOBIAS.config` and run:
```bash
$ cd snakemake_pipeline
$ conda activate TOBIAS_ENV
$ snakemake --snakefile TOBIAS.snake --configfile TOBIAS.config --cores [number of cores] --keep-going
```
For further info on setup, configfile and output, please consult the [wiki](https://github.molgen.mpg.de/loosolab/TOBIAS/wiki/snakemake-pipeline).
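
The workflow checks the config for a fixed set of keys before it runs. Below is a minimal sketch of such a check, assuming the config has already been loaded into a plain Python dict. The function name `missing_config_keys` is hypothetical and not part of TOBIAS; the key list mirrors the `required` list in `TOBIAS.snake`.

```python
# Key paths the workflow expects; mirrors the `required` list in TOBIAS.snake.
REQUIRED = [
    ("data",),
    ("run_info", "organism"),
    ("run_info", "fasta"),
    ("run_info", "blacklist"),
    ("run_info", "gtf"),
    ("run_info", "motifs"),
    ("run_info", "output"),
]

def missing_config_keys(config):
    """Return the 'a:b' style paths that are absent or empty in a config dict."""
    missing = []
    for path in REQUIRED:
        node = config
        for key in path:
            if not isinstance(node, dict) or node.get(key) is None:
                missing.append(":".join(path))
                break
            node = node[key]
    return missing

# Hypothetical, incomplete config for illustration.
example = {
    "data": {"control": ["control_s1.bam"]},
    "run_info": {"organism": "human", "fasta": "genome.fa"},
}
print(missing_config_keys(example))
```

Running a check like this before starting Snakemake reports every missing entry at once instead of stopping at the first one.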

License
------------
This project is licensed under the [MIT license](LICENSE).


Contact
------------
Mette Bentsen (mette.bentsen (at) mpi-bn.mpg.de)
Binary file added figures/Thumbs.db
Binary file not shown.
Binary file added figures/atacorrect.png
Binary file not shown.
Binary file added figures/bindetect.png
Binary file not shown.
Binary file added figures/footprinting.png
Binary file not shown.
Binary file added figures/tobias.png
Binary file not shown.
49 changes: 49 additions & 0 deletions setup.py
@@ -0,0 +1,49 @@
from setuptools import setup, Extension
import numpy as np

def readme():
    with open('README.md') as f:
        return f.read()

ext_modules = [Extension("tobias.utils.ngs", ["tobias/utils/ngs.pyx"], include_dirs=[np.get_include()]),
               Extension("tobias.utils.sequences", ["tobias/utils/sequences.pyx"], include_dirs=[np.get_include()]),
               Extension("tobias.utils.signals", ["tobias/utils/signals.pyx"], include_dirs=[np.get_include()])]

setup(name='tobias',
      version='1.0.0',
      description='Transcription factor Occupancy prediction By Investigation of ATAC-seq Signal',
      long_description=readme(),
      url='https://github.molgen.mpg.de/loosolab/TOBIAS',
      author='Mette Bentsen',
      author_email='mette.bentsen@mpi-bn.mpg.de',
      license='MIT',
      packages=['tobias', 'tobias.footprinting', 'tobias.utils', 'tobias.plotting', 'tobias.motifs'],
      entry_points = {
          'console_scripts': ['TOBIAS=tobias.TOBIAS:main']
      },
      install_requires=[
          'setuptools_cython',
          'numpy',
          'scipy',
          'pyBigWig',
          'pysam',
          'pybedtools',
          'matplotlib>=2',
          'scikit-learn',
          'pandas',
          'pypdf2',
          'xlsxwriter',
          'adjustText',
      ],
      #dependency_links=['https://github.com/jhkorhonen/MOODS/tarball/master'],
      classifiers = [
          'License :: OSI Approved :: MIT License',
          'Intended Audience :: Science/Research',
          'Topic :: Scientific/Engineering :: Bio-Informatics',
          'Programming Language :: Python :: 3'
      ],
      zip_safe=False,
      include_package_data=True,
      ext_modules = ext_modules,
      scripts=["tobias/utils/peak_annotation.sh"]
      )
33 changes: 33 additions & 0 deletions snakemake_pipeline/TOBIAS.config
@@ -0,0 +1,33 @@
#-------------------------------------------------------------------------#
#-------------------------- TOBIAS input data ----------------------------#
#-------------------------------------------------------------------------#

data:
  control: [test_data/control_s1.bam, test_data/control_s2.bam] #list of bam files
  treatment: [test_data/treatment_s1.bam] #list of bam files

run_info:
  organism: human #mouse/human
  fasta: test_data/genome.fa #.fasta-file containing organism genome
  blacklist: test_data/blacklist.bed #.bed-file containing blacklisted regions
  gtf: test_data/genes.gtf #.gtf-file for annotation of peaks
  motifs: test_data/motifs/ #directory containing motifs (single files in meme or JASPAR pfm format)
  output: test_output/ #output directory



#-------------------------------------------------------------------------#
#----------------------- Default module parameters -----------------------#
#-------------------------------------------------------------------------#

macs: "--nomodel --shift -100 --extsize 200 --broad"

# for parameter description see uropa manual: http://uropa-manual.readthedocs.io/config.html
# adjust filter attribute for given gtf: ensembl gene_biotype / gencode gene_type
# other optional parameters: --filter_attribute gene_biotype --attribute_value protein_coding
uropa: "--feature gene --feature_anchor start --distance [10000,1000] --show_attribute gene_name,gene_id,gene_biotype"

atacorrect: ""
footprinting: ""
bindetect: ""
plotting: ""
195 changes: 195 additions & 0 deletions snakemake_pipeline/TOBIAS.snake
@@ -0,0 +1,195 @@
"""
Upper level TOBIAS snake
"""

import os
import subprocess
import itertools

#Set config
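if workflow.overwrite_configfile != None:
    configfile: str(workflow.overwrite_configfile)
else:
    configfile: 'TOBIAS.config'
#Name of the config actually in use (falls back to the default), used in error messages
CONFIGFILE = str(workflow.overwrite_configfile) if workflow.overwrite_configfile != None else 'TOBIAS.config'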

include: "snakefiles/helper.snake"
#shell.prefix("")

#-------------------------------------------------------------------------------#
#------------------------- CHECK FORMAT OF CONFIG FILE -------------------------#
#-------------------------------------------------------------------------------#

required = [("data",),
("run_info",),
("run_info", "organism"),
("run_info", "fasta"),
("run_info", "blacklist"),
("run_info", "gtf"),
("run_info", "motifs"),
("run_info", "output"),
]

#Check if all keys are existing and contain information
for key_list in required:
    lookup_dict = config
    for key in key_list:
        try:
            lookup_dict = lookup_dict[key]
            if lookup_dict == None:
                print("ERROR: Missing input for key {0}".format(key_list))
        except:
            print("ERROR: Could not find key(s) \"{0}\" in configfile {1}. Please check that your configfile has right format for TOBIAS.".format(":".join(key_list), CONFIGFILE))
            sys.exit()

#Check if there is at least one condition with bamfiles
if len(config["data"]) > 0:
    for condition in config["data"]:
        if len(config["data"][condition]) == 0:
            print("ERROR: Could not find any bamfiles in \"{0}\" in configfile {1}".format(":".join(("data", condition)), CONFIGFILE))
else:
    #Literal braces must be doubled inside a str.format() template
    print("ERROR: Could not find any conditions (\"data:{{condition}}\") in configfile {0}".format(CONFIGFILE))
    sys.exit()


#-------------------------------------------------------------------------------#
#------------------------- WHICH FILES/INFO WERE INPUT? ------------------------#
#-------------------------------------------------------------------------------#

input_files = []

#Files related to experimental data (bam)
CONDITION_IDS = list(config["data"].keys())
for condition in CONDITION_IDS:
    if not isinstance(config["data"][condition], list):
        config['data'][condition] = [config['data'][condition]]
    input_files.extend(config['data'][condition])


#Flatfiles independent from experimental data (run_info)
FASTA = config['run_info']['fasta']
BLACKLIST = config['run_info']['blacklist']
GTF = config['run_info']['gtf']
OUTPUTDIR = config['run_info']["output"]
BLACKLIST = config['run_info']['blacklist']
MOTIFDIR = config['run_info']['motifs']

input_files.extend([FASTA, BLACKLIST, GTF])


#---------- Test that input files exist -----------#
for file in input_files:
    if file != None:
        full_path = os.path.abspath(file)
        if not os.path.exists(full_path):
            exit("ERROR: The following file given in config does not exist: {0}".format(full_path))



#-------------------------------------------------------------------------------#
#------------------------ WHICH FILES SHOULD BE CREATED? -----------------------#
#-------------------------------------------------------------------------------#

output_files = []

#--------------------------------- MOTIFS --------------------------------------#
#Identify IDS of motifs
files = os.listdir(MOTIFDIR)
MOTIF_FILES = {}
for file in files:
    full_file = os.path.join(MOTIFDIR, file)
    with open(full_file) as f:
        for line in f:
            if line.startswith("MOTIF"):
                columns = line.rstrip().split()
                ID = columns[2] + "_" + columns[1]
                ID = filafy(ID)
            elif line.startswith(">"):
                columns = line.replace(">", "").rstrip().split()
                ID = columns[1] + "_" + columns[0]
                ID = filafy(ID)
    MOTIF_FILES[ID] = full_file

TF_IDS = list(MOTIF_FILES.keys())


#---------------------------- OUTPUT PER CONDITION -----------------------------#

id2bam = {condition:{} for condition in CONDITION_IDS}

for condition in CONDITION_IDS:

    config_bams = config['data'][condition]
    sampleids = [os.path.splitext(os.path.basename(bam))[0] for bam in config_bams]
    id2bam[condition] = {sampleids[i]:config_bams[i] for i in range(len(sampleids))} # Link sample ids to bams


PLOTNAMES = expand("{condition}_{plotname}", condition=CONDITION_IDS, plotname=["heatmap", "aggregate"])
if len(CONDITION_IDS) > 1:
    PLOTNAMES.extend(["heatmap_comparison", "aggregate_comparison"])

output_files.append(expand(os.path.join(OUTPUTDIR, "footprinting", "{condition}_footprints.bw"), condition=CONDITION_IDS))

#output_files.append(os.path.join(OUTPUTDIR, "overview", "TFBS_distance.txt"))
output_files.append(os.path.join(OUTPUTDIR, "TFBS", "bindetect_results.txt"))
output_files.append(os.path.join(OUTPUTDIR, "overview", "bindetect_results.txt"))

#Visualization
output_files.extend(expand(os.path.join(OUTPUTDIR, "TFBS", "{TF}", "plots", "{TF}_{plotname}.pdf"), TF=TF_IDS, plotname=PLOTNAMES))
output_files.extend(expand(os.path.join(OUTPUTDIR, "overview", "all_{plotname}.pdf"), plotname=PLOTNAMES))


#-------------------------- OUTPUT ACROSS CONDITIONS ---------------------------#

"""
COMPARE_COND = 0
if len(CONDITION_IDS) > 1:
    COMPARE_COND = 1 # flag
    output_files.extend(expand(os.path.join(OUTPUTDIR, "TFBS", "{TF}", "plots", "{TF}_heatmap_comparison.pdf"), TF=TF_IDS))
    output_files.extend(expand(os.path.join(OUTPUTDIR, "TFBS", "{TF}", "plots", "{TF}_aggregate_comparison.pdf"), TF=TF_IDS))
    #output_files.extend([os.path.join(OUTPUTDIR, "overview", "diff_bind_plot.pdf")])

"""
#-------------------------------- OTHER OUTPUT ---------------------------------#




#-------------------------------------------------------------------------------#
#--------------------- WHICH SNAKE MODULES SHOULD BE USED? ---------------------#
#-------------------------------------------------------------------------------#

include: "snakefiles/preprocessing.snake"
include: "snakefiles/footprinting.snake"
include: "snakefiles/visualization.snake"



#-------------------------------------------------------------------------------#
#------------------------ DEAL WITH SPECIAL ENVIRONMENTS -----------------------#
#-------------------------------------------------------------------------------#

"""
sys_env = subprocess.check_output(['conda', 'env', 'list'], universal_newlines=True)
env_list = [line.split()[0] for line in sys_env.split("\n") if len(line.split()) > 0]

# default TOBIAS environment
if "TOBIAS_ENV" not in env_list:
    print("Creating TOBIAS environment for the first time")
    subprocess.call(["conda", "env", "create", "--file", "environments/tobias.yaml"])

# python 2 related envs
if "MACS_ENV" not in env_list:
    print("Creating macs environment for the first time")
    subprocess.call(["conda", "env", "create", "--file", "environments/macs.yaml"])

"""
#-------------------------------------------------------------------------------#
#---------------------------------- RUN :-) ------------------------------------#
#-------------------------------------------------------------------------------#

rule all:
    input:
        output_files
    message: "Rule all"

8 changes: 8 additions & 0 deletions snakemake_pipeline/environments/macs.yaml
@@ -0,0 +1,8 @@
name: MACS_ENV

channels:
- bioconda
- conda-forge

dependencies:
- macs2
