Preprocessing

This tutorial leads you through the preprocessing of your own CLIP-Seq dataset for subsequent analysis with ssHMM. All you need is a file in BED format (https://genome.ucsc.edu/FAQ/FAQformat.html#format1). This file contains the genomic regions bound by a specific RBP, as determined in a CLIP-Seq experiment. In principle, however, you can use any BED file containing RNA regions for which you want to find a sequence-structure motif.

1. Preparation

For the preprocessing, you need the reference genome on which the genomic regions in the BED file are defined. If you performed alignment and peak calling on hg19, for example, the same reference genome must be used for the preprocessing. You need two files for the genome: the genome sequence file and the chromosome sizes file, both of which are passed to the preprocessing script in the next step.
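Before running the preprocessing, it can help to confirm that the BED file and the genome files really refer to the same assembly. The following is a minimal sketch of such a check, not part of ssHMM. It assumes the chromosome sizes file uses the common two-column layout (chromosome name, length), and the function names and toy data are hypothetical.

```python
import io


def read_chrom_sizes(handle):
    """Parse a chromosome sizes file, assumed to hold 'name length' per line."""
    sizes = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue
        name, length = line.split()[:2]
        sizes[name] = int(length)
    return sizes


def find_invalid_regions(bed_handle, sizes):
    """Return BED lines with an unknown chromosome or out-of-range coordinates."""
    invalid = []
    for line in bed_handle:
        line = line.rstrip("\n")
        # Skip blank lines and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if chrom not in sizes or start < 0 or start >= end or end > sizes[chrom]:
            invalid.append(line)
    return invalid


# Toy in-memory data (made up for illustration only).
sizes = read_chrom_sizes(io.StringIO("chrA\t1000\nchrB\t500\n"))
bed = io.StringIO(
    "chrA\t10\t50\tpeak1\t0\t+\n"
    "chrB\t450\t600\tpeak2\t0\t-\n"
    "chrC\t5\t20\tpeak3\t0\t+\n"
)
invalid = find_invalid_regions(bed, sizes)
for line in invalid:
    print("not compatible with the genome:", line)
```

Regions reported by such a check usually indicate that the BED file was produced on a different assembly or with a different chromosome naming scheme than the genome files.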

2. Run the preprocessing script

Now, we can start the preprocessing. You need to pass the following parameters:

  • a working directory for the output
  • a dataset name
  • the input BED file
  • the genome sequence file
  • the chromosome sizes file

preprocess_dataset WORKING_DIR DATASET_NAME INPUT_BED GENOME_SEQ GENOME_SIZES

You can also set optional parameters (see preprocess_dataset --help).
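If you want to start the preprocessing from a Python script, for example to process several datasets in a loop, the call can be assembled as an argument list. This is only a sketch: the helper name and all file and directory names are placeholders, and only the five positional parameters listed above are used.

```python
import shlex


def build_preprocess_command(working_dir, dataset_name, input_bed,
                             genome_seq, genome_sizes):
    """Assemble the preprocess_dataset call with its five positional arguments."""
    return ["preprocess_dataset", working_dir, dataset_name,
            input_bed, genome_seq, genome_sizes]


# Placeholder names; replace them with your own paths.
cmd = build_preprocess_command(
    "work", "my_rbp", "peaks.bed", "genome.fa", "genome.chrom.sizes"
)
print(shlex.join(cmd))

# To actually run the script (requires ssHMM to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids quoting problems when paths contain spaces.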

The script will now perform the following steps:

  1. Filter BED file
  2. Elongate BED file for later structure prediction
  3. Fetch genomic sequences for elongated BED file
  4. Produce FASTA file with genomic sequences in viewpoint format
  5. Secondary structure prediction with RNAshapes
  6. Secondary structure prediction with RNAstructure
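Steps 2 to 4 can be pictured with a small sketch. It is an illustration under assumptions, not the code used by preprocess_dataset: the flank length is arbitrary, strand handling is omitted, and "viewpoint format" is assumed here to mean that the original region is written in upper case and the added context in lower case.

```python
def elongate(start, end, chrom_length, flank):
    """Extend a 0-based, half-open interval by `flank` nt per side, clipped to the chromosome."""
    return max(0, start - flank), min(chrom_length, end + flank)


def to_viewpoint(sequence, region_start, region_end, new_start):
    """Upper-case the original region and lower-case the added context (assumed convention)."""
    left = region_start - new_start
    right = region_end - new_start
    return (sequence[:left].lower()
            + sequence[left:right].upper()
            + sequence[right:].lower())


# Toy chromosome of 20 nt and a region covering positions 8 to 12.
chrom_seq = "ACGTACGTACGTACGTACGT"
start, end = 8, 12

new_start, new_end = elongate(start, end, len(chrom_seq), flank=5)
fetched = chrom_seq[new_start:new_end]          # step 3: fetch the elongated sequence
viewpoint = to_viewpoint(fetched, start, end, new_start)
print(viewpoint)  # tacgtACGTacgta
```

The added context matters because structure prediction on the binding site alone would ignore base pairs formed with neighboring nucleotides.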

3. Results

Once the script has finished, we can inspect the results:

  • WORKING_DIR/fasta/DATASET_NAME/positive.fasta - genomic sequences in viewpoint format
  • WORKING_DIR/shapes/DATASET_NAME/positive.txt - secondary structures of the genomic sequences (predicted by RNAshapes)
  • WORKING_DIR/structures/DATASET_NAME/positive.txt - secondary structures of the genomic sequences (predicted by RNAstructure)
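A quick way to confirm that the run completed is to check that all three result files exist. The sketch below builds the paths from the layout listed above. The helper names and the example values "work" and "my_rbp" are placeholders.

```python
from pathlib import Path


def expected_outputs(working_dir, dataset_name):
    """Build the three result paths from the directory layout described above."""
    base = Path(working_dir)
    return {
        "fasta": base / "fasta" / dataset_name / "positive.fasta",
        "shapes": base / "shapes" / dataset_name / "positive.txt",
        "structures": base / "structures" / dataset_name / "positive.txt",
    }


def missing_outputs(working_dir, dataset_name):
    """Return the expected result files that are not present on disk."""
    outputs = expected_outputs(working_dir, dataset_name)
    return [str(path) for path in outputs.values() if not path.is_file()]


outputs = expected_outputs("work", "my_rbp")
missing = missing_outputs("work", "my_rbp")
if missing:
    print("missing result files:", ", ".join(missing))
else:
    print("all result files are present")
```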

You can now proceed to analyze these genomic sequences and secondary structures for a common sequence-structure motif with the train_seqstructhmm script (see :ref:`train`).