Tutorial

This tutorial familiarizes you with ssHMM and explains its usage.

0. Installation

To begin this tutorial, first install ssHMM as described in the :ref:`installation` section. To make sure that the installation was successful, check whether you can execute the ssHMM scripts:

preprocess_dataset -h
train_seqstructhmm -h
batch_seqstructhmm -h

If all goes well, you should see the help messages of the ssHMM scripts.
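The same check can be scripted. The following is a minimal sketch, assuming only that the three scripts named above are expected on your PATH; the function name find_missing_scripts is made up for illustration and is not part of ssHMM.

```python
import shutil

# Script names taken from the tutorial text above.
SSHMM_SCRIPTS = ("preprocess_dataset", "train_seqstructhmm", "batch_seqstructhmm")


def find_missing_scripts(names=SSHMM_SCRIPTS):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_scripts()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All ssHMM scripts were found on PATH.")
```

An empty result only shows that the scripts can be located; running them with -h as shown above remains the real test.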

1. Download a CLIP-Seq dataset

Now we need a CLIP-Seq dataset to work with. On GitHub we provide a repository of 25 CLIP-Seq and 24 synthetic datasets (https://github.molgen.mpg.de/heller/ssHMM_data). For this tutorial, we use the PUM2 CLIP-Seq dataset and download the pre-processed files for sequence and structure:

cd /home/myuser
mkdir clipseq
cd clipseq
wget https://github.molgen.mpg.de/raw/heller/ssHMM_data/master/clip-seq/fasta/PUM2/positive.fasta
wget https://github.molgen.mpg.de/raw/heller/ssHMM_data/master/clip-seq/shapes/PUM2/positive.txt

2. Start Docker image (if installed with Docker)

If ssHMM is installed as a Docker image, you first have to start a container from the image:

docker run -t -i -v /home/myuser/:/home/myuser/ hellerd/sshmm

This starts a container from the ssHMM image and opens a command line inside it. The -v option makes the home directory (containing the clipseq directory) available from within the container. Continue with the tutorial by running all commands in the container.

3. Inspect the dataset

Let's have a look at the two files we downloaded:

head /home/myuser/clipseq/positive.fasta

>chr6:89794035-89794147(+)
aaaaaattacatacaaacagCTTGTATTATATTTTATATTTTGTAAATACTGTATACCATGTATTATGTGTATATTGTTCATACTTGAGAGGtatattatagttttgttatg
>chr10:102767488-102767578(+)
cacccaggtttatggcctcgTTTTCACTTGTATATTTTTCACACTGTAAATTTCTTGTACAAACCCAAAGaaaaaattaaaaaaaatttt
>chr2:99234790-99234904(+)
taactgtgtcaacagtattgTGAAGTGATCATTTCTTGTAAAACTTGTAAATAAACTATCATCTTTGTAGATATCTTAAAGGTGTAAAGTTTGCaaatttgaagaaatatatat
>chr12:49521563-49521638(-)
gtgatcatgtcttttccatgTGTACCTGTAATATTTTTCCATCATATCTCAAAGTaaagtcattaacatcaaaag

The FASTA file contains the nucleotide sequence of the PUM2 binding sites as determined by a CLIP-Seq experiment. Every two lines of the FASTA file hold one binding site. The first line (beginning with >) specifies the genomic location of the site while the second line contains the genomic sequence.
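If you want to read such a file programmatically, the following is a minimal sketch in Python. It is not part of ssHMM; the function name read_fasta is made up for illustration, and the sketch assumes only the layout described above (a header line starting with >, followed by the sequence).

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# One record copied from the file excerpt shown above.
example = [
    ">chr12:49521563-49521638(-)",
    "gtgatcatgtcttttccatgTGTACCTGTAATATTTTTCCATCATATCTCAAAGTaaagtcattaacatcaaaag",
]
for header, sequence in read_fasta(example):
    print(header, len(sequence))
```

To read the downloaded file instead of the inline example, pass an open file object, for example read_fasta(open("/home/myuser/clipseq/positive.fasta")).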

head /home/myuser/clipseq/positive.txt

>chr6:89794035-89794147(+)
EEEEEEEEESSSSSSSSIISSISSSSISSSSSSSSIIISSIISSSIIISSISSSSSSSSSSHHHSSSSISSSSSSISSIIISSSIISSSSSSSSSSISSSSSSSSSSSISSS 0.008824
>chr10:102767488-102767578(+)
EEEEESSSSHHHHHSSSSMMMMMMMMMSSSSSSSHHHHHHHHHHHHHHHHHHHHHSSSSSSSEEEEEEEEEEEEEEEEEEEEEEEEEEEE 0.0312072
EEEEESSSSHHHHHSSSSMMMMMMMMMSSSSSSSHHHHHHHHHHHHHHHHHHHHHSSSSSSSMMMMMMMMMMSSSSSSHHHHHHSSSSSS 0.0163077
>chr2:99234790-99234904(+)
EESSSSSHHHHSSSSSMMMMMMMMMMMMMMSSSSSSSIIISSSISSSSSSSSIIIIIIISSSSSSSSISSHHHHSSSSSSSSSSIIIISSSSSSSSISSSISSSSSSSEEEEEE 0.0677326
EESSSSSHHHHSSSSSMMMSSSSHHHHHSSSSSSSSSIIISSSISSSSSSSSIIIIIIISSSSSSSSISSHHHHSSSSSSSSSSIIIISSSSSSSSISSSISSSSSEEEEEEEE 0.0031042
>chr12:49521563-49521638(-)
SSSSSIISSISSSSIIISSSSSHHHHHHHHHHHHHHHHHHHHSSSSSIIISSSSSSIISSSSSEEEEEEEEEEEE 0.098404

The structure file contains the predicted secondary structures of the binding sites. The predictions were performed with the RNAshapes tool. Again, the lines starting with > specify the genomic location of a binding site. The subsequent lines contain the predicted structural context of each nucleotide in the FASTA file; a site can have more than one predicted structure, as the second and third sites above do. Note that these structure sequences have the same length as the nucleotide sequences from the FASTA file we have looked at before.
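The length property can be checked with a short script. The following is a minimal sketch, not part of ssHMM: the function names read_structures and mismatched_sites are made up for illustration, and the trailing number on each structure line is simply stored as a float without interpreting it.

```python
def read_structures(lines):
    """Map each header to a list of (structure, value) tuples."""
    records = {}
    header = None
    for raw in lines:
        parts = raw.split()
        if not parts:
            continue  # skip blank lines
        if parts[0].startswith(">"):
            header = parts[0][1:]
            records[header] = []
        elif header is not None:
            # Structure string, optionally followed by a number.
            value = float(parts[1]) if len(parts) > 1 else None
            records[header].append((parts[0], value))
    return records


def mismatched_sites(sequences, structures):
    """Return headers with a structure whose length differs from the sequence.

    `sequences` maps header -> nucleotide sequence. Sites without any
    structure entry are not reported by this sketch.
    """
    bad = []
    for header, sequence in sequences.items():
        for structure, _value in structures.get(header, []):
            if len(structure) != len(sequence):
                bad.append(header)
                break
    return bad


# Toy data (hypothetical, not taken from the PUM2 dataset).
toy_structures = read_structures([">s1", "EESS 0.5", "SSEE 0.25", ">s2", "HHH 0.1"])
toy_sequences = {"s1": "acgt", "s2": "acgt"}
print(mismatched_sites(toy_sequences, toy_structures))
```

In the toy data, site s2 has a three-character structure for a four-nucleotide sequence, so it is the only site reported.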

4. Training ssHMM on the dataset

Now we can train ssHMM on the CLIP-Seq dataset we downloaded:

cd /home/myuser
mkdir results
train_seqstructhmm clipseq/positive.fasta clipseq/positive.txt -o results/

This creates a new directory results and starts the training of ssHMM using the train_seqstructhmm script. train_seqstructhmm has two mandatory arguments: the sequence and the structure file. We use the files that we downloaded and additionally tell ssHMM to write its output into the new results directory. For a description of all arguments of train_seqstructhmm see its :ref:`reference <trainseqstructhmm>`.

While the train_seqstructhmm script runs, it writes information to the standard output. For more information on the output, refer to the :ref:`output` section. When the script finishes, it prints messages on standard output that look similar to:

2017-02-23 11:45:33,381 - main_logger - INFO - Terminate model after 7000 iterations.
2017-02-23 11:45:33,381 - main_logger - INFO - Completed training. Write sequence logos..
2017-02-23 11:45:35,675 - main_logger - INFO - Completed writing sequence logos. Print model graph..
2017-02-23 11:45:35,987 - main_logger - INFO - Printed model graph: ./job_170223_114151/final_graph.png. Write model file..
2017-02-23 11:45:35,992 - main_logger - INFO - Wrote model file: ./job_170223_114151/final_model.xml
2017-02-23 11:45:35,992 - main_logger - INFO - Finished ssHMM successfully.

The lines tell you how many iterations the training took (7000) and where you can find the graph and the XML file of the trained model.
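If you capture the output in a file, these values can be extracted automatically. The following is a minimal sketch that assumes the messages are worded exactly as in the example above; the wording may differ between ssHMM versions, and the function name summarize_log is made up for illustration.

```python
import re

# Patterns based solely on the example messages shown above (an assumption).
ITERATIONS = re.compile(r"Terminate model after (\d+) iterations")
GRAPH = re.compile(r"Printed model graph: (\S+)\. ")
MODEL = re.compile(r"Wrote model file: (\S+)")


def summarize_log(lines):
    """Collect the iteration count, graph path and model path from log lines."""
    summary = {}
    for line in lines:
        match = ITERATIONS.search(line)
        if match:
            summary["iterations"] = int(match.group(1))
        match = GRAPH.search(line)
        if match:
            summary["graph"] = match.group(1)
        match = MODEL.search(line)
        if match:
            summary["model"] = match.group(1)
    return summary


# Three of the log lines from the example output above.
log = [
    "2017-02-23 11:45:33,381 - main_logger - INFO - Terminate model after 7000 iterations.",
    "2017-02-23 11:45:35,987 - main_logger - INFO - Printed model graph: ./job_170223_114151/final_graph.png. Write model file..",
    "2017-02-23 11:45:35,992 - main_logger - INFO - Wrote model file: ./job_170223_114151/final_model.xml",
]
print(summarize_log(log))
```

Keys are only present when the corresponding message was found, so a missing key indicates that the run did not reach that step or that the message format differs.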

5. Inspect the trained model

Hint

If you ran ssHMM in the Docker container, it is now time to exit from the container. As the results directory is a subdirectory of /home/myuser, the training results can also be found on your host machine. Exiting from the Docker container is easy:

exit

We can now have a look at the model graph. See :ref:`trainingoutput` for an explanation of what the model graph shows.

Congratulations, you finished our tutorial! Check out the :ref:`reference` section for more information about the ssHMM scripts.