From fcd33b9d3afe199bee34d4dde55680f7212cae14 Mon Sep 17 00:00:00 2001
From: David Heller
Date: Thu, 9 Feb 2017 11:49:44 +0100
Subject: [PATCH] Shorten readme; reference documentation

---
 README.md | 213 +++--------------------------------------------------
 1 file changed, 9 insertions(+), 204 deletions(-)

diff --git a/README.md b/README.md
index 73722e3..7ddd342 100755
--- a/README.md
+++ b/README.md
@@ -1,214 +1,19 @@
 ## ssHMM - Sequence-structure hidden Markov model
 
-A motif finder for sequence-structure binding preferences of RNA-binding proteins.
+ssHMM is an RNA motif finder. It recovers sequence-structure motifs from RNA-binding protein data, such as CLIP-Seq data.
 
-RNA-binding proteins (RBPs) play a vital role in the post-transcriptional control of RNAs. They are known to recognize RNA molecules by their nucleotide sequence as well as their three-dimensional structure. ssHMM is an RNA motif finder that combines a hidden Markov model (HMM) with Gibbs sampling to learn the joint sequence and structure binding preferences of RBPs from high-throughput RNA-binding experiments, such as CLIP-Seq. The model can be visualized as an intuitive graph illustrating the interplay between RNA sequence and structure.
+### Background
 
-### Overview
+RNA-binding proteins (RBPs) play a vital role in the post-transcriptional control of RNAs. They are known to recognize RNA molecules by their nucleotide sequence as well as their three-dimensional structure. ssHMM combines a hidden Markov model (HMM) with Gibbs sampling to learn the joint sequence and structure binding preferences of RBPs from high-throughput RNA-binding experiments, such as CLIP-Seq. The model can be visualized as an intuitive graph illustrating the interplay between RNA sequence and structure.
 
-ssHMM consists of 3 main scripts:
-- **preprocess_dataset**: Prepares a CLIP-Seq dataset in BED format for ssHMM. It filters the BED file, fetches the genomic sequences, and predicts RNA secondary structures.
-- **train_seqstructhmm**: Trains ssHMM on a given CLIP-Seq dataset and produces an intuitive visualization of the recovered motif.
-- **batch_seqstructhmm**: Trains ssHMM on several different CLIP-Seq datasets and recovers one motif for each dataset.
+### Scope
 
-### Installation
+ssHMM was developed for the analysis of data from RNA-binding assays. Its aim is to help biologists to derive a binding motif for one or a number of RNA-binding proteins. ssHMM was written in Python and is a pure command-line tool.
 
-We distribute ssHMM both as a Python package and as a Docker image (similar to a virtual machine). The Docker image has already all dependencies installed and is much easier to use. When using the Python package, all dependencies have to be installed manually.
+### Documentation
 
-#### Installation via Docker
+Check out our documentation: http://sshmm.readthedocs.io
 
-1. Install Docker on your platform as described on https://docs.docker.com/engine/getstarted/
-2. Run Docker and check whether everything is working: ```docker version```
-3. Run ```docker run -t -i -v hellerd/sshmm```. This will download the Docker image containing ssHMM (if you have not yet downloaded it) and will run it. Once the image has been started, you can access it via a command line interface. You can now run ssHMM, e.g. by typing ```train_seqstructhmm --help```. You can exit the image with ```exit```.
-4. To access data located on your machine (e.g. in ```/home/someuser/data```) from inside the image, use the ```-v``` option: ```docker run -t -i -v /home/someuser/data:/data hellerd/sshmm```. This will make your available inside the image in the ```/data``` directory. You can now run ssHMM on this data, e.g. ```train_seqstructhmm /data/sequences.fasta /data/shapes.txt```.
+### Citation
 
-#### Installation as a Python package
-
-##### Prerequisites:
-- GHMM (http://ghmm.org/)
-- GraphViz (http://www.graphviz.org/)
-
-For preprocessing BED files with preprocess_dataset (tested only on Linux and macOS):
-- bedtools (https://github.com/arq5x/bedtools2)
-- awk
-- RNAshapes (http://bibiserv.techfak.uni-bielefeld.de/rnashapes)
-- RNAstructure (http://rna.urmc.rochester.edu/rnastructure.html)
-
-##### Installation of prerequisites:
-
-1. Install prerequisites for GHMM as described on http://ghmm.sourceforge.net/installation.html. The commands for Ubuntu are:
-
- ```bash
- sudo apt-get update
- sudo apt-get install build-essential automake autoconf libtool
- sudo apt-get install python-dev
- sudo apt-get install libxml++2.6-dev
- sudo apt-get install swig
- ```
-2. Download and unpack GHMM from https://sourceforge.net/projects/ghmm/
-3. Install GHMM as described on http://ghmm.sourceforge.net/installation.html. The commands for Ubuntu are:
-
- ```bash
- cd ghmm
- sh autogen.sh
- sudo ./configure
- sudo make
- sudo make install
- sudo ldconfig
- ```
-4. Install GraphViz. On Ubuntu:
-
- ```bash
- sudo apt-get install graphviz
- sudo apt-get install libgraphviz-dev
- ```
-5. Install pip if not already installed. On Ubuntu:
-
- ```bash
- sudo apt-get install python-pip
- ```
-6. Install PyGraphViz:
-
- ```bash
- sudo PKG_CONFIG_ALLOW_SYSTEM_LIBS=OHYESPLEASE pip install pygraphviz
- ```
-7. Install bedtools as described on http://bedtools.readthedocs.io/en/latest/content/installation.html
-8. Download and install RNAshapes as described on http://bibiserv.techfak.uni-bielefeld.de/rnashapes?id=rnashapes_view_download.
-9. Download and install RNAstructure from http://rna.urmc.rochester.edu/register.html.
-
-##### Installation of ssHMM:
-
-1. Download ssHMM from this page
-2. Install ssHMM:
-
- ```bash
- #as root
- sudo python setup.py install
-
- #as non-root
- python setup.py install --user
- ```
- This will install ssHMM and the following python package dependencies: numpy, graphviz, pygraphviz, weblogo, forgi. If setuptools fails to install any of the dependencies, try to install it separately (e.g. with `sudo pip install numpy`).
-
-
-### Preprocessing a CLIP-Seq dataset: *preprocess_dataset*
-
-**usage**: preprocess_dataset [-h] [--genome GENOME] [--min_length MIN_LENGTH]
- [--max_length MAX_LENGTH]
- directory dataset_name jump_to min_score
-
-**positional arguments**:
- * directory: root directory for data
- * dataset_name: dataset name
- * jump_to: preprocessing step to jump to (as integer): 1 - filter bed, 2 - shuffle bed, 3 - enlongate bed, 4 - fetch sequences, 5 - format FASTA, 6 - calculate RNA shapes, 7 - calculate RNA structures
- * min_score: minimum score for binding site (default: 0.0)
-
-**optional arguments**:
- * -h, --help: show this help message and exit
- * --genome GENOME: genome version to use (default: hg19)
- * --min_length MIN_LENGTH: minimum binding site length (default: 8)
- * --max_length MAX_LENGTH: maximum binding site length (default: 75)
-
-
-This script prepares a CLIP-Seq dataset in BED format for the training of ssHMM. The following preprocessing steps are taken:
-1 - Filter (positive) BED file
-2 - Shuffle (positive) BED file to generate negative dataset
-3 - Enlongate positive and negative BED files for later structure prediction
-4 - Fetch genomic sequences for elongated BED files
-5 - Produce FASTA files with genomic sequences in viewpoint format
-6 - Calculate RNA shapes
-7 - Calculate RNA structures
-
-A root directory for the datasets and a dataset name (e.g., the protein name) has to be given. The following files will be created in the root directory and its subdirectories:
-- ``/bed//positive_raw.bed`` - positive BED file from CLIP-Seq experiment
-- ``/bed//positive.bed`` - filtered positive BED file
-- ``/bed//negative.bed`` - filtered negative BED file
-- ``/bed//positive_long.bed`` - elongated positive BED file
-- ``/bed//negative_long.bed`` - elongated negative BED file
-- ``/temp//positive_long.fasta`` - genomic sequences of elongated positive BED file
-- ``/temp//negative_long.fasta`` - genomic sequences of elongated negative BED file
-- ``/fasta//positive.fasta`` - positive genomic sequences in viewpoint format
-- ``/fasta//negative.fasta`` - negative genomic sequences in viewpoint format
-- ``/shapes//positive.txt`` - secondary structures of positive genomic sequence (predicted by RNAshapes)
-- ``/shapes//negative.txt`` - secondary structures of negative genomic sequence (predicted by RNAshapes)
-- ``/structures//positive.txt`` - secondary structures of positive genomic sequence (predicted by RNAstructures)
-- ``/structures//negative.txt`` - secondary structures of negative genomic sequence (predicted by RNAstructures)
-
-The preprocessing step to begin with can be chosen. For each step, the files generated by the previous step need to be present. To execute all steps, only the positive_raw.bed must be present. For the filtering step, the minimum score and binding site lengths can be defined with parameters.
-
-**IMPORTANT**: To fetch genomic sequences in step 4, the following file must be present in the genomes/ subdirectory:
-
-- ``/genomes/[version]/[version].genome`` -> BED file defining the size of the chromosomes
-- ``/genomes/[version]/UCSCGenesTrack.bed`` -> BED file defining the gene intervals
-- ``/genomes/[version]/[version].fa`` -> FASTA file containing the human genome
-
-The version of the genome can be given as an optional parameter. It defaults to 'hg19'. The files for the ``genomes/`` directory can be obtained from UCSC:
-
-``/genomes/[version]/[version].genome``:
--> download from ``http://hgdownload.soe.ucsc.edu/downloads.html#human`` (Full data set), e.g. ``http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes``
-
-``/genomes/[version]/UCSCGenesTrack.bed``:
--> download in table browser (http://genome.ucsc.edu/cgi-bin/hgTables); choose most recent GENCODE track (currently GENCODE Gene V24lift37->Basic (for hg19) and All GENCODE V24->Basic (for hg38)) and 'BED' as output format
-
-``/genomes/[version]/[version].fa``:
--> download chromosomes from ``http://hgdownload.soe.ucsc.edu/downloads.html``; e.g. ``wget --timestamping 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/*'``; concatenate chromosomes with cat and print into .fa file (e.g. with ``zcat chr* > hg19.fa``)
-
-### Train ssHMM on a CLIP-Seq dataset: *train_seqstructhmm*
-
-**usage**: train_seqstructhmm [-h] [--motif_length MOTIF_LENGTH] [--baum_welch]
- [--flexibility FLEXIBILITY]
- [--block_size BLOCK_SIZE] [--threshold THRESHOLD]
- [--job_name JOB_NAME]
- [--output_directory OUTPUT_DIRECTORY]
- [--termination_interval TERMINATION_INTERVAL]
- [--write_model_state] [--only_best_shape]
- training_sequences training_structures
-
-**positional arguments**:
- * training_sequences: FASTA file storing the training sequences
- * training_shapes: FASTA file storing the training RNA shapes
-
-**optional arguments**:
- * -h, --help: show this help message and exit
- * --motif_length MOTIF_LENGTH, -n MOTIF_LENGTH: length of the motif that shall be found (default: 6)
- * --baum_welch, -b: should the model be initialized with a Baum-Welch optimized sequence motif (default: yes)
- * --flexibility FLEXIBILITY, -f FLEXIBILITY: greedyness of Gibbs sampler: model parameters are sampled from among the top f configurations (default: f=10), set f to 0 in order to include all possible configurations
- * --block_size BLOCK_SIZE, -s BLOCK_SIZE: number of sequences to be held-out in each iteration (default: 1)
- * --threshold THRESHOLD, -t THRESHOLD: the iterative algorithm is terminated if this reduction in sequence structure loglikelihood is not reached for any of the 3 last measurements (default: 10)
- * --job_name JOB_NAME, -j JOB_NAME: name of the job (default: "job")
- * --output_directory OUTPUT_DIRECTORY, -o OUTPUT_DIRECTORY: directory to write output files to (default: current directory)
- * --termination_interval TERMINATION_INTERVAL, -i TERMINATION_INTERVAL: produce output every i iterations (default: i=100)
- * --write_model_state, -w: write model state every i iterations
- * --only_best_shape: train only using best structure for each sequence (default: use all structures)
-
-This script trains an hidden Markov model for the sequence-structure binding preferences of an RNA-binding protein. The model is trained on sequences and structures from a CLIP-seq experiment given in two FASTA-like files.
-During the training process, statistics about the model are printed on stdout. In every iteration, the current model and a visualization of the model can be stored in the output directory.
-The training process terminates when no significant progress has been made for three iterations.
-
-
-### Train ssHMM on a batch of CLIP-Seq datasets: *batch_seqstructhmm*
-
-**usage**: batch_seqstructhmm [-h] [--cores CORES] [--motif_length MOTIF_LENGTH]
- [--baum_welch] [--flexibility FLEXIBILITY]
- [--block_size BLOCK_SIZE] [--threshold THRESHOLD]
- [--termination_interval TERMINATION_INTERVAL]
- data_directory proteins batch_directory
-
-**positional arguments**:
- * data_directory: data directory (must have the following subdirectories: fasta/, shapes/, structures/
- * proteins: list of RNA-binding proteins to analyze (surrounded by quotation marks, separated by whitespace)
- * batch_directory: directory for batch output
-
-**optional arguments**:
- * -h, --help: show this help message and exit
- * --cores CORES: number of cores to use (if not given, all cores are used)
- * --motif_length MOTIF_LENGTH, -n MOTIF_LENGTH: length of the motifs that shall be found (default: 6)
- * --baum_welch, -b: should the models be initialized with a Baum-Welch optimized sequence motif (default: yes)
- * --flexibility FLEXIBILITY, -f FLEXIBILITY: greedyness of Gibbs sampler: model parameters are sampled from among the top f configurations (default: f=10), set f to 0 in order to include all possible configurations
- * --block_size BLOCK_SIZE, -s BLOCK_SIZE: number of sequences to be held-out in each iteration (default: 1)
- * --threshold THRESHOLD, -t THRESHOLD: the iterative algorithm is terminated if this reduction in sequence structure loglikelihood is not reached for any of the 3 last measurements (default: 10)
- * --termination_interval TERMINATION_INTERVAL, -i TERMINATION_INTERVAL: produce output every iterations (default: i=100)
-
-This script trains multiple Hidden Markov models for the sequence-structure binding preferences of a given set of RNA-binding protein. The models are trained on sequences and structures in FASTA format located in a given data directory.
-During the training process, statistics about the models are printed on stdout. In every iteration, the current model and a visualization of the model are stored in the batch directory.
-The training processes terminate when no significant progress has been made for three iterations.
+Heller, D., Krestel, R., Ohler, U., Vingron, M., & Marsico, A. (2016). ssHMM: [Extracting intuitive sequence-structure motifs from high-throughput RNA-binding protein data](http://dx.doi.org/10.1101/076034). bioRxiv, 076034.