De-TE-ctor
Tool to detect potential transposable elements in a FASTA file. Some classes of transposable elements (TEs) encode proteins. Without prior knowledge, these often look exactly like protein-coding genes. Some genome annotations include these TEs while others filter them out. This is a problem for comparative genomics because it creates differences that stem from a technical bias rather than a valid biological trait (e.g. a simple gene count is not comparable between genomes with and without TEs included).
To solve this issue, these TEs should be detected and removed from all genomes before doing a comparative analysis (e.g. counting genes, constructing gene families, ...). De-TE-ctor is a small pipeline that quickly checks for putative TEs in the set of protein-coding genes of a genome and reports them so they can be removed.
TODO
- Collect sequences for known protein-coding transposable elements
- Script to initiate the pipeline (build the BLAST database)
- Script to BLAST FASTA files (with support for a cluster)
- Script to parse the BLAST output and report potential TEs
- Script to remove TEs from the initial input, producing a final clean FASTA file
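
The last two steps could look roughly like the sketch below. None of this exists in the repository yet; it is only an illustration under several assumptions: that BLAST is run with tabular output (`-outfmt 6`, default 12 columns), that a hit with an e-value at or below `1e-5` is enough to flag a gene, and that the sequence ID is the first whitespace-separated word of the FASTA header. The function names `parse_blast_hits` and `filter_fasta` are hypothetical.

```python
import io


def parse_blast_hits(handle, max_evalue=1e-5):
    """Return the set of query IDs with at least one hit at or below max_evalue.

    Assumes BLAST tabular output (-outfmt 6) with the default 12 columns:
    column 1 is the query ID and column 11 is the e-value.
    """
    flagged = set()
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a complete tabular record
        if float(fields[10]) <= max_evalue:
            flagged.add(fields[0])
    return flagged


def record_id(header):
    """Return the first word of a FASTA header (without '>'), or '' if empty."""
    parts = header.split()
    return parts[0] if parts else ""


def filter_fasta(handle, flagged_ids):
    """Yield (header, sequence) for every record whose ID is not flagged."""
    header, seq = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new record starts: emit the previous one if it is clean.
            if header is not None and record_id(header) not in flagged_ids:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line.strip())
    # Emit the final record.
    if header is not None and record_id(header) not in flagged_ids:
        yield header, "".join(seq)


if __name__ == "__main__":
    # Made-up example data: gene1 has a strong hit against a TE protein,
    # gene2 only a weak one, gene3 has no hit at all.
    blast = io.StringIO(
        "gene1\tTE_gag\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t300\n"
        "gene2\tTE_pol\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t20\n"
    )
    fasta = io.StringIO(">gene1 putative TE\nMKV\nLLA\n>gene2\nMAA\n>gene3\nMTT\n")

    flagged = parse_blast_hits(blast)
    print("Potential TEs:", sorted(flagged))
    for header, sequence in filter_fasta(fasta, flagged):
        print(f">{header}\n{sequence}")
```

A real script would likely also want thresholds on identity or alignment coverage, since a single e-value cutoff can flag host genes that merely share a domain with a TE protein.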