Synthetic and biological datasets used to evaluate the ssHMM motif finder

ssHMM_data

This repository contains the synthetic and biological sequences that were used to evaluate the ssHMM motif finder, which is available at https://github.molgen.mpg.de/heller/ssHMM.

Directory structure:

  • clip-seq: This directory contains all 25 CLIP-Seq datasets from the paper. Each dataset consists of a positive set (binding sites of the protein in question) and two negative sets. The first negative set (negative_shuffle) was produced with bedtools shuffle by moving the positive binding sites to random locations in the same or a different gene. The second negative set (negative_clip) consists of the positive binding sites of the 24 other CLIP-Seq datasets. The subdirectories fasta/ and shapes/ additionally contain training and test sets for the positive set and the negative_clip set. The training and test sets of the positive set were generated by randomly splitting it at a 90% to 10% ratio. For negative_clip, sequences of similar length to those in the positive set were randomly selected from the negative_clip set and then split at the same 90% to 10% ratio.

    • bed: .bed files with genomic coordinates of the positive set and the negative_shuffle set
    • fasta: .fasta files with genomic sequences of the positive and both negative sets
    • shapes: secondary structures of the positive and both negative sets, as predicted by RNAshapes
    • structures: secondary structures of the positive set and the negative_shuffle set, as predicted by RNAstructure
  • synthetic: This directory contains all 24 synthetic datasets that were generated (12 per structural context).

    • hairpin: Synthetic datasets with motifs implanted into a hairpin context.
      • A_IC1_SF1_random: Synthetic sequences with information content 1.0, hairpin fraction 100%, and random background sequences
        • fasta: .fasta files with synthetic sequences containing an implanted motif
        • pwms: images and numerical representations of implanted motifs
        • shapes: secondary structures as predicted by RNAshapes
        • structures: secondary structures as predicted by RNAstructure
      • B_IC1_SF05_random
      • C_IC1_SF01_random
      • D_IC05_SF1_random
      • E_IC05_SF05_random
      • F_IC05_SF01_random
      • G_IC1_SF1_utr
      • H_IC1_SF05_utr
      • I_IC1_SF01_utr
      • K_IC05_SF1_utr
      • L_IC05_SF05_utr
      • M_IC05_SF01_utr
    • stem: Synthetic datasets with motifs implanted into a stem context.
      • A_IC1_SF1_random: Synthetic sequences with information content 1.0, stem fraction 100%, and random background sequences
        • fasta: .fasta files with synthetic sequences containing an implanted motif
        • pwms: images and numerical representations of implanted motifs
        • shapes: secondary structures as predicted by RNAshapes
        • structures: secondary structures as predicted by RNAstructure
      • B_IC1_SF05_random
      • C_IC1_SF01_random
      • D_IC05_SF1_random
      • E_IC05_SF05_random
      • F_IC05_SF01_random
      • G_IC1_SF1_utr
      • H_IC1_SF05_utr
      • I_IC1_SF01_utr
      • K_IC05_SF1_utr
      • L_IC05_SF05_utr
      • M_IC05_SF01_utr
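
The synthetic dataset names above appear to encode their parameters. Only the A_IC1_SF1_random entries are described explicitly (information content 1.0, structure fraction 100%, random background), so the following Python sketch rests on assumptions: that IC05 means an information content of 0.5, that SF05 and SF01 mean structure fractions of 50% and 10%, and that the last field names the background sequence type. The function and class names are hypothetical and are not part of ssHMM.

```python
import re
from typing import NamedTuple


class SyntheticDataset(NamedTuple):
    label: str                  # leading letter, e.g. "A"
    information_content: float  # assumed: IC1 -> 1.0, IC05 -> 0.5
    structure_fraction: float   # assumed: SF1 -> 1.0, SF05 -> 0.5, SF01 -> 0.1
    background: str             # "random" or "utr"


# Pattern for names such as "A_IC1_SF1_random" or "M_IC05_SF01_utr".
_NAME = re.compile(r"^([A-Z])_IC(\d+)_SF(\d+)_(random|utr)$")


def _decode_number(digits: str) -> float:
    """Decode the assumed numeric encoding of the IC and SF fields.

    Assumption: a leading zero marks a decimal fraction,
    so "1" -> 1.0, "05" -> 0.5 and "01" -> 0.1.
    """
    if digits.startswith("0"):
        return int(digits) / 10 ** (len(digits) - 1)
    return float(digits)


def parse_dataset_name(name: str) -> SyntheticDataset:
    """Split a synthetic dataset directory name into its parameters."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised dataset name: {name!r}")
    label, ic, sf, background = match.groups()
    return SyntheticDataset(
        label=label,
        information_content=_decode_number(ic),
        structure_fraction=_decode_number(sf),
        background=background,
    )
```

Under these assumptions, parse_dataset_name("E_IC05_SF05_random") yields an information content of 0.5, a structure fraction of 0.5, and a random background. Because the same names occur under both hairpin and stem, the structural context has to be taken from the parent directory.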