ssHMM_data

This repository contains the synthetic and biological sequences that were used to evaluate the ssHMM motif finder, which is available at https://github.molgen.mpg.de/heller/ssHMM.

Directory structure:

  • clip-seq: This directory contains all 25 CLIP-Seq datasets from the paper. Each dataset consists of a positive set (binding sites of the protein in question) and two negative sets. The first negative set (negative_shuffle) was produced with bedtools shuffle by moving the positive binding sites to random locations in the same or a different gene. The second negative set (negative_clip) consists of the positive binding sites of the 24 other CLIP-Seq datasets. The subdirectories fasta/ and shapes/ additionally contain training and test sets for the positive dataset and the negative_clip dataset. The training and test sets of the positive dataset were generated by randomly splitting the positive dataset in a 90% to 10% ratio. The training and test sets of the negative_clip dataset were produced by randomly selecting sequences from the negative_clip dataset that were similar in length to the sequences of the positive dataset; the selected sequences were then split in the same 90% to 10% ratio.

    • bed: .bed files with genomic coordinates of the positive set and the negative_shuffle set
    • fasta: .fasta files with genomic sequences of the positive and both negative sets
    • shapes: secondary structures of the positive and both negative sets as predicted by RNAshapes
    • structures: secondary structures of the positive set and the negative_shuffle set as predicted by RNAstructure
  • synthetic: This directory contains all 24 synthetic datasets that were generated.

    • hairpin: Synthetic datasets with motifs implanted into a hairpin context.
      • A_IC1_SF1_random: Synthetic sequences with information content 1.0, hairpin fraction 100%, and random background sequences
        • fasta: .fasta files with synthetic sequences containing an implanted motif
        • pwms: images and numerical representations of implanted motifs
        • shapes: secondary structures as predicted by RNAshapes
        • structures: secondary structures as predicted by RNAstructure
      • B_IC1_SF05_random
      • C_IC1_SF01_random
      • D_IC05_SF1_random
      • E_IC05_SF05_random
      • F_IC05_SF01_random
      • G_IC1_SF1_utr
      • H_IC1_SF05_utr
      • I_IC1_SF01_utr
      • K_IC05_SF1_utr
      • L_IC05_SF05_utr
      • M_IC05_SF01_utr
    • stem: Synthetic datasets with motifs implanted into a stem context.
      • A_IC1_SF1_random: Synthetic sequences with information content 1.0, stem fraction 100%, and random background sequences
        • fasta: .fasta files with synthetic sequences containing an implanted motif
        • pwms: images and numerical representations of implanted motifs
        • shapes: secondary structures as predicted by RNAshapes
        • structures: secondary structures as predicted by RNAstructure
      • B_IC1_SF05_random
      • C_IC1_SF01_random
      • D_IC05_SF1_random
      • E_IC05_SF05_random
      • F_IC05_SF01_random
      • G_IC1_SF1_utr
      • H_IC1_SF05_utr
      • I_IC1_SF01_utr
      • K_IC05_SF1_utr
      • L_IC05_SF05_utr
      • M_IC05_SF01_utr
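
The 90% to 10% training/test split described for the clip-seq datasets can be illustrated with a short Python sketch. This is not the script that produced the files in this repository; the function names parse_fasta and split_records, the fixed random seed, and the rounding of the cut point are assumptions made for illustration only.

```python
import random


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_records(records, train_fraction=0.9, seed=0):
    """Shuffle a copy of the records and split it into (train, test).

    The seed is hypothetical; it only makes this sketch reproducible.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    example = ">site1\nACGUACGU\n>site2\nGGCCAUAU\n>site3\nUUUUACGC\n"
    train, test = split_records(parse_fasta(example))
    print(len(train), "training records,", len(test), "test records")
```

For the negative_clip dataset, the same split would be applied after first selecting sequences of similar length to the positive sequences; that selection step is not shown here because its exact criterion is not specified above.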
