Predicting co-occurring transcription factors on cell-type specific accessible chromatin regions
Pipeline
- Derive the cell-type specific DNase hypersensitive sites (CTS-DHSs) and ubiquitous DNase hypersensitive sites (ubiq-DHSs) from DNase-seq experiments. List the names (including paths) of all input files (read counts in 200 bp windows, repeats excluded) together with their corresponding cell types in data/all_files.csv. Output files are saved as BED files in output_directory/top_regions/, one file per cell type.
Usage:
Rscript scripts/calculate.cts-dhs.R -c "count.directory" -t "data/all_files.csv" -w "data/ranges_hg19_200bp_masked_sorted.bed" -o "output.dir" -tpr 10000 -m 1
The CTS-DHSs and ubiq-DHSs derived for 90 ENCODE cell types (genome hg19) are stored in data/top_regions/
- Get the FASTA files for all top_regions (each cell type in a separate folder) using:
scripts/Get_fasta_tissues.sh
Before using Get_fasta_tissues.sh, bedtools must be installed and the corresponding genome must be downloaded.
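The two prerequisites can be checked before running the script. The Python sketch below is only an illustration: it assumes that sequences are extracted with `bedtools getfasta` (the contents of Get_fasta_tissues.sh are not shown here), and the genome and region file names in the usage lines are hypothetical.

```python
import os
import shutil


def bedtools_available():
    """Return True if a bedtools executable is found on PATH."""
    return shutil.which("bedtools") is not None


def getfasta_command(genome_fa, bed_file, out_fa):
    """Build a 'bedtools getfasta' call for one BED file.

    -fi: genome FASTA, -bed: regions, -fo: output FASTA.
    The command is returned as a list and is not executed here.
    """
    return ["bedtools", "getfasta",
            "-fi", genome_fa,
            "-bed", bed_file,
            "-fo", out_fa]


def missing_prerequisites(genome_fa):
    """List human-readable problems that would block the FASTA step."""
    problems = []
    if not bedtools_available():
        problems.append("bedtools is not installed or not on PATH")
    if not os.path.isfile(genome_fa):
        problems.append("genome FASTA not found: " + genome_fa)
    return problems


# Hypothetical paths, for illustration only.
cmd = getfasta_command("genome/hg19.fa",
                       "data/top_regions/example_cell_type.bed",
                       "data/top_regions/fasta/example_cell_type.fa")
print(" ".join(cmd))
print(missing_prerequisites("genome/hg19.fa"))
```

To actually run the command, the list can be passed to `subprocess.run(cmd, check=True)` once `missing_prerequisites` returns an empty list.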
- Calculate binding affinities for PWMs of interest with TRAP using calculate.affinity.R:
Rscript scripts/calculate.affinity.R -m file.with.matrices -f format.matrices -s "data/top_regions/fasta" -t "data/cell_types.dat" -o "output.folder"
The pre-calculated affinities for TRANSFAC matrices are stored in: data/affinity/ (separate folder for each cell type and separate file for each PWM).
- Calculate the TF enrichment for all PWMs in a cell-type specific way and plot heatmaps of the p-values and odds ratios for all matrices and all cell types:
Rscript scripts/calculate.tf.enrichment.R -l "data/list.of.matrices" -a "data/affinity" -t "data/cell_types_test.dat" -k 500 -n 5000 -o "results/enrichment" -p TRUE -d "results/plots"
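As a rough illustration of how a p-value and an odds ratio can be obtained for one PWM in one cell type, the Python sketch below builds a 2x2 table (regions with or without a motif hit, in cell-type specific versus background regions) and applies a one-sided Fisher's exact test. The choice of test, the function names, and the counts are assumptions made for this example; they do not describe what calculate.tf.enrichment.R computes internally.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Sample odds ratio of the 2x2 table [[a, b], [c, d]].

    a: foreground regions with a hit,  b: foreground regions without
    c: background regions with a hit,  d: background regions without
    """
    if b * c == 0:
        # Undefined or infinite when an off-diagonal cell is empty.
        return float("inf") if a * d > 0 else float("nan")
    return (a * d) / (b * c)


def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact test p-value, P(X >= a).

    X is the number of hits falling in the foreground under the
    hypergeometric distribution with all table margins fixed.
    """
    row1 = a + b          # number of foreground regions
    col1 = a + c          # total number of regions with a hit
    n = a + b + c + d     # all regions
    upper = min(row1, col1)
    tail = sum(comb(row1, x) * comb(n - row1, col1 - x)
               for x in range(a, upper + 1))
    return tail / comb(n, col1)


# Hypothetical counts for one PWM in one cell type.
table = (8, 2, 1, 5)
print(odds_ratio(*table))
print(fisher_greater(*table))
```

A heatmap over all matrices and cell types would then hold one such p-value (or odds ratio) per PWM and cell type.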
- Calculate the TF co-occurrence in a cell-type specific way for all possible pairs of TFs:
Rscript scripts/calculate.tf.pairs.R -l "data/list.of.matrices" -a "data/affinity" -t "data/cell_types_test.dat" -k 500 -n 5000 -o "results/tf.pairs"
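Testing all possible pairs means the number of comparisons grows quadratically: n matrices give n(n-1)/2 unordered pairs. The Python sketch below enumerates the pairs and counts the regions in which both TFs have a hit. It only illustrates the bookkeeping; the TF names, region identifiers, and the notion of a "hit" are hypothetical, and the statistic actually used by calculate.tf.pairs.R is not shown here.

```python
from itertools import combinations


def n_pairs(n_tfs):
    """Number of unordered TF pairs among n_tfs matrices."""
    return n_tfs * (n_tfs - 1) // 2


def cooccurrence_counts(hits):
    """Count shared regions for every unordered pair of TFs.

    hits maps a TF name to the set of region identifiers in which
    that TF has a hit. Returns {(tf_a, tf_b): number_of_shared_regions}
    with the names in each pair sorted alphabetically.
    """
    return {(tf_a, tf_b): len(hits[tf_a] & hits[tf_b])
            for tf_a, tf_b in combinations(sorted(hits), 2)}


# Hypothetical hit sets for three TFs in one cell type.
hits = {
    "TF_A": {1, 2, 3},
    "TF_B": {2, 3, 4},
    "TF_C": {5},
}
for pair, shared in cooccurrence_counts(hits).items():
    print(pair, shared)
```

Each shared count, together with the individual hit counts of the two TFs, is enough to fill a 2x2 table for a pairwise association test.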
Testing scripts in: tests/call_functions.sh