Review RegGTFExtractor.py #29
Comments
Most of your comments should be addressed by the commits above.
If the data folder is incomplete, the script throws an error instead of downloading the missing files. You have to delete the whole data folder and download everything again. The script should either look for the missing files and download them, or print an error message telling the user to delete the whole data folder and restart the script.
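One possible way to handle an incomplete data folder is sketched below. It is not taken from RegGTFExtractor.py: the function names `find_missing` and `ensure_data`, the list of expected file names, and the `download` callable are all hypothetical placeholders for whatever the script actually uses.

```python
from pathlib import Path


def find_missing(data_dir, expected_names):
    """Return the expected file names that are absent or empty in data_dir."""
    data_dir = Path(data_dir)
    missing = []
    for name in expected_names:
        path = data_dir / name
        # Treat zero-byte files as missing, since an interrupted download
        # may leave an empty file behind.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


def ensure_data(data_dir, expected_names, download):
    """Fetch only the missing files; fail with a readable message otherwise.

    `download` is a hypothetical callable taking (name, target_path).
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    for name in find_missing(data_dir, expected_names):
        download(name, data_dir / name)
    still_missing = find_missing(data_dir, expected_names)
    if still_missing:
        raise RuntimeError(
            "Could not obtain: " + ", ".join(still_missing)
            + ". Try deleting the whole data folder and restarting the script."
        )
```

With something like this, a second run would only re-download what is actually missing, and the fallback error tells the user what to do.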
Run 1: worked
Run 3:
I get the following error after running the script a second time. I have to delete the data folder to get it working again.
The comments are good! The problem I have now is with the validation of the GTF files. I used the tool designed by Keibler and Brent (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC270064/), named validate GTF. Here are the errors I receive for each of the organisms.
hg19 Warnings encountered: Statistics:
hg38 Warnings encountered: Statistics:
mm10 Warnings encountered: Statistics:
mm9 Warnings encountered: Statistics:
Most of these are not real issues. GTF for regulatory features has a different format / annotation than GTF for coding regions, etc.
-> There are no CDS in the dataset because most features are regulatory and not coding. The transcript_id is therefore missing too.
The bad-terminator check does not detect that the activity data is in one giant field rather than in multiple fields with a wrong terminator. Maybe a ";" is missing at the end; I will check for that.
1320 Inconsistent value across gene_id. -> Because both UCSC and RefSeq gene_ids are used, this is by design and not a bug.
The non-numerical error for the start / stop fields could be a bug; I will check for that. Sometimes this data has a "version" appended as a decimal part (e.g. "23215214.0"). Most programs should detect the right number, but I can change them to plain integers.
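The two possible fixes mentioned here (plain integer coordinates and a closing ";" on the attribute field) could look roughly like the sketch below. It assumes the standard nine tab-separated GTF columns with start and end in columns 4 and 5; the helper names `to_int_coordinate` and `normalize_gtf_line` are made up for illustration and do not come from RegGTFExtractor.py.

```python
def to_int_coordinate(value):
    """Convert '23215214' or '23215214.0' to an int; reject anything else."""
    text = value.strip()
    whole, _, fraction = text.partition(".")
    # Accept only digits before the dot and nothing but zeros after it.
    if not whole.isdigit() or fraction.strip("0"):
        raise ValueError(f"not an integral coordinate: {value!r}")
    return int(whole)


def normalize_gtf_line(line):
    """Return the line with integer start/end and a ';'-terminated attribute field."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    fields[3] = str(to_int_coordinate(fields[3]))  # start
    fields[4] = str(to_int_coordinate(fields[4]))  # end
    attributes = fields[8].strip()
    if attributes and not attributes.endswith(";"):
        attributes += ";"
    fields[8] = attributes
    return "\t".join(fields)
```

Raising on a non-integral value such as "12.5" instead of rounding it is a deliberate choice in this sketch, so that real data problems are not silently hidden.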
@SebastianBeyvers, I am not sure if I should test your script once more. Do you want to change something, or are these not real issues and your script is already finished?
To run the script, I activated the masterenv conda environment and used the command
python RegGTFExtractor.py mm10
The run finished within 30 minutes, and I found the messages about the run printed on the terminal really helpful. The output file is produced. Here are some issues:
I tried to validate the output GTF file produced by RegGTFExtractor.py, but validation was not possible.