Preprocessing

This tutorial guides you through preprocessing your own CLIP-Seq dataset for subsequent analysis with ssHMM. All you need is a file in BED format (https://genome.ucsc.edu/FAQ/FAQformat.html#format1) that contains the genomic regions bound by a specific RBP, as determined in a CLIP-Seq experiment. In principle, however, you can use any BED file containing RNA regions for which you want to find a sequence-structure motif.

1. Preparation

For the preprocessing, you need the reference genome on which the genomic regions in the BED file are defined. If you performed alignment and peak calling on hg19, for example, you must use the same reference genome for preprocessing. You need two files for the genome:

the genome sequence in FASTA format
the chromosome sizes of the genome (e.g. from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes)
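Before running the preprocessing, it can help to confirm that the BED file and the chromosome sizes file really belong to the same genome build. The following is a minimal sketch of such a check and is not part of ssHMM; the function names ``read_chrom_sizes`` and ``find_invalid_regions`` and the toy data are assumptions made for illustration.

```python
import io


def read_chrom_sizes(handle):
    """Parse a two-column chrom.sizes file into {chromosome: length}."""
    sizes = {}
    for line in handle:
        fields = line.split()
        if len(fields) >= 2:
            sizes[fields[0]] = int(fields[1])
    return sizes


def find_invalid_regions(bed_handle, sizes):
    """Return (line_number, reason) for BED lines that do not fit the genome."""
    problems = []
    for number, line in enumerate(bed_handle, start=1):
        # Skip blank lines and optional BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            problems.append((number, "fewer than 3 columns"))
            continue
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if chrom not in sizes:
            problems.append((number, "unknown chromosome " + chrom))
        elif not 0 <= start < end <= sizes[chrom]:
            # BED coordinates are 0-based, half-open: end may equal the length.
            problems.append((number, "coordinates outside chromosome"))
    return problems


# Toy data standing in for a real chrom.sizes file and BED file.
sizes = read_chrom_sizes(io.StringIO("chr1\t1000\nchr2\t500\n"))
bed = io.StringIO(
    "chr1\t10\t50\tpeak1\t0\t+\n"
    "chr2\t450\t600\tpeak2\t0\t-\n"
    "chrUn\t1\t5\tpeak3\t0\t+\n"
)
problems = find_invalid_regions(bed, sizes)
for number, reason in problems:
    print("line", number, "-", reason)
```

With real data, the two ``io.StringIO`` objects would be replaced by opened file handles. An empty ``problems`` list suggests that the coordinates are at least compatible with the chosen genome build.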

2. Run the preprocessing script

Now, we can start the preprocessing. You need to pass the following parameters:

a working directory for the output
a dataset name
the input BED file
the genome sequence file
the chromosome sizes file

preprocess_dataset WORKING_DIR DATASET_NAME INPUT_BED GENOME_SEQ GENOME_SIZES

You can also set optional parameters (see preprocess_dataset --help).
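If you want to start the preprocessing from a Python script, for example to process several datasets in a loop, the call can be assembled as an argument list. This is only a sketch: the helper names and all file and directory names below are hypothetical placeholders, and only the order of the five positional parameters is taken from the command shown above.

```python
import subprocess


def build_preprocess_command(working_dir, dataset_name, input_bed,
                             genome_seq, genome_sizes):
    """Assemble the preprocess_dataset call in the documented argument order."""
    return ["preprocess_dataset", working_dir, dataset_name,
            input_bed, genome_seq, genome_sizes]


def run_preprocessing(command):
    """Run the command; check=True raises CalledProcessError on failure."""
    subprocess.run(command, check=True)


# Hypothetical paths; replace them with your own files.
command = build_preprocess_command(
    "work", "my_rbp", "peaks.bed", "hg19.fa", "hg19.chrom.sizes"
)
print(" ".join(command))
```

Calling ``run_preprocessing(command)`` would then execute the script, provided that ``preprocess_dataset`` is available on your ``PATH``.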

The script will now perform the following steps:

Filter BED file
Elongate BED file for later structure prediction
Fetch genomic sequences for elongated BED file
Produce FASTA file with genomic sequences in viewpoint format
Secondary structure prediction with RNAshapes
Secondary structure prediction with RNAstructure
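To illustrate the elongation and viewpoint steps, here is a small sketch that does not reproduce the script's actual implementation. It assumes a GraphProt-style viewpoint convention, in which the original region is written in uppercase and the added flanking context in lowercase. The flank length of 5 nt and the function names are chosen purely for illustration, and strand handling is ignored.

```python
def elongate(start, end, flank, chrom_length):
    """Extend a BED interval by `flank` nt per side, clipped to the chromosome."""
    return max(0, start - flank), min(chrom_length, end + flank)


def to_viewpoint(sequence, offset_start, offset_end):
    """Uppercase sequence[offset_start:offset_end] and lowercase the rest."""
    return (sequence[:offset_start].lower()
            + sequence[offset_start:offset_end].upper()
            + sequence[offset_end:].lower())


chrom = "ACGTACGTACGTACGTACGT"   # toy 20 nt "chromosome"
start, end, flank = 8, 12, 5     # original region and assumed flank length

long_start, long_end = elongate(start, end, flank, len(chrom))
fetched = chrom[long_start:long_end]

# Offsets of the original region inside the elongated sequence.
viewpoint = to_viewpoint(fetched, start - long_start, end - long_start)
print(viewpoint)  # tacgtACGTacgta
```

The flanking context is there so that structure prediction can take the surrounding sequence into account, while the uppercase part still marks the region that was originally bound.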

3. Results

Once the script has finished, we can inspect the results:

WORKING_DIR/fasta/DATASET_NAME/positive.fasta - genomic sequences in viewpoint format
WORKING_DIR/shapes/DATASET_NAME/positive.txt - secondary structures of the genomic sequences (predicted by RNAshapes)
WORKING_DIR/structures/DATASET_NAME/positive.txt - secondary structures of the genomic sequences (predicted by RNAstructure)
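A quick sanity check of the results could look like the sketch below. The directory layout is taken from the list above; the helper names, the dataset name ``my_rbp`` and the toy FASTA content are assumptions, and the example writes its own small file so that it runs without real output.

```python
import os
import tempfile


def output_paths(working_dir, dataset_name):
    """Build the three result paths from the layout described above."""
    return {
        "sequences": os.path.join(working_dir, "fasta", dataset_name,
                                  "positive.fasta"),
        "shapes": os.path.join(working_dir, "shapes", dataset_name,
                               "positive.txt"),
        "structures": os.path.join(working_dir, "structures", dataset_name,
                                   "positive.txt"),
    }


def count_fasta_records(path):
    """Count the sequences in a FASTA file by counting its header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Demonstration on a temporary directory with a toy FASTA file.
with tempfile.TemporaryDirectory() as working_dir:
    paths = output_paths(working_dir, "my_rbp")
    os.makedirs(os.path.dirname(paths["sequences"]))
    with open(paths["sequences"], "w") as handle:
        handle.write(">region1\nacgtACGTacgt\n>region2\nttgcGGCAttgc\n")
    n_records = count_fasta_records(paths["sequences"])

print(n_records, "sequences found")
```

On real output, you would pass your own WORKING_DIR and DATASET_NAME to ``output_paths`` and check that the number of sequences roughly matches the number of regions that passed the filtering step.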

You can now proceed to analyze these genomic sequences and secondary structures for a common sequence-structure motif with the train_seqstructhmm script (see :ref:`train`).
