
Review RegGTFExtractor.py #29

Closed
anastasiia opened this issue Jan 3, 2019 · 6 comments
Labels
bug Something isn't working

Comments

@anastasiia
Collaborator

To run the script, I activated the masterenv conda environment and ran the command python RegGTFExtractor.py mm10. The run finished within 30 minutes, and I found the progress messages printed to the terminal really helpful. The output file was produced.

Here are some issues:

  1. Please consider the guidelines in ToDo List #10. There is no explanation of the input parameters in the functions.
  2. The output file is not in the standard GTF format, for these reasons:
  • According to http://genome.ucsc.edu/FAQ/FAQformat.html#format4 the ninth column in a GTF file should start with "gene_id". In the output file from RegGTFExtractor.py, the ninth column starts with "ID".
  • According to https://www.ensembl.org/info/website/upload/gff.html the words in the third column should be joined with "_"; for example, "promoter flanking region" should be "promoter_flanking_region". I think this will make reading the GTF file easier afterwards, because the name will not be parsed as three separate columns. Please correct me if I am wrong.

I tried to validate the output GTF file produced by RegGTFExtractor.py, but validation was not possible.
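A minimal sketch of the two format points above, assuming a hypothetical helper `format_gtf_line` (it is not part of RegGTFExtractor.py, and the argument names are made up for illustration). It joins the words of the feature name with underscores and writes `gene_id` as the first attribute of the ninth column:

```python
def format_gtf_line(chrom, source, feature, start, end, gene_id, extra=None):
    """Build one tab-separated GTF line (hypothetical helper for illustration)."""
    # Feature names should not contain whitespace; join the words with "_".
    feature = "_".join(feature.split())
    # gene_id comes first in the attribute column; every attribute is
    # written as: name "value";
    attributes = [("gene_id", gene_id)] + list((extra or {}).items())
    attr_text = " ".join('{} "{}";'.format(key, value) for key, value in attributes)
    # Score, strand and frame are unknown here, so use the "." placeholder.
    fields = [chrom, source, feature, str(int(start)), str(int(end)),
              ".", ".", ".", attr_text]
    return "\t".join(fields)


line = format_gtf_line("chr1", "ensembl", "promoter flanking region",
                       100, 200, "ENSR00000105157")
print(line)
```

The coordinates and the `ensembl` source value in the example call are placeholders, not values taken from the real output file.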

@SebastianBeyvers
Collaborator

Most of your comments should be resolved by the commits above.

@renewiegandt
Collaborator

If the data folder is incomplete, the script throws an error instead of downloading the missing files. The only workaround is to delete the whole data folder and download everything again. The script should either detect the missing files and download them, or print an error message telling the user to delete the data folder and restart the script.
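One way to handle this, sketched under assumptions: the script knows the list of files it expects (`expected_files`) and has some download routine (`download`). Both names, and the two helper functions, are hypothetical and do not come from RegGTFExtractor.py.

```python
import os


def find_missing_files(data_dir, expected_files):
    """Return the expected relative paths that are absent or empty in data_dir."""
    missing = []
    for rel_path in expected_files:
        full_path = os.path.join(data_dir, rel_path)
        # Treat zero-byte files as missing: they are usually aborted downloads.
        if not os.path.isfile(full_path) or os.path.getsize(full_path) == 0:
            missing.append(rel_path)
    return missing


def ensure_data(data_dir, expected_files, download):
    """Download only the missing files; raise a readable error if any remain missing."""
    for rel_path in find_missing_files(data_dir, expected_files):
        # download is a caller-supplied function: download(rel_path, destination).
        download(rel_path, os.path.join(data_dir, rel_path))
    still_missing = find_missing_files(data_dir, expected_files)
    if still_missing:
        raise RuntimeError(
            "Data folder is incomplete ({}). Try deleting '{}' and restarting "
            "the script.".format(", ".join(still_missing), data_dir)
        )
```

This covers both suggestions: missing files are fetched individually, and the user only sees the "delete the data folder" message if that fails.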

@renewiegandt added the bug label on Jan 4, 2019
@renewiegandt
Collaborator

Run 1: worked
Run 2:

ERROR ~ Error executing process > 'create_GTF'

Caused by:
  Process `create_GTF` terminated with an error exit status (1)

Command executed:

  python /mnt/agnerds/Rene.Wiegandt/repo/masterJLU2018/bin/3.1_create_gtf/RegGTFExtractor.py hg38 --tissue null --wd /mnt/agnerds/Rene.Wiegandt/repo/masterJLU2018/bin/3.1_create_gtf

Command exit status:
  1

Command output:
  Working Dir: /mnt/agnerds/Rene.Wiegandt/repo/masterJLU2018/bin/3.1_create_gtf
  Filter not detected !
  Getting UCSC Data
  Path to Bin: /mnt/agnerds/Rene.Wiegandt/repo/masterJLU2018/bin/3.1_create_gtf/Modules/ucsc/bigBedToBed
  UCSC finished !
  Starting Ensembl
  Current release is release-94
  Local Version found: release-94
  Newest Version installed, no update needed.
  No ActivityTable for A549 found, generating new one.
  New ActivityTable generated in: /mnt/agnerds/Rene.Wiegandt/repo/masterJLU2018/bin/3.1_create_gtf/data/EnsemblData/release-94/homo_sapiens/activity/A549
  No ActivityTable for Aorta found, generating new one.

Command error:
  Traceback (most recent call last):
    File "/mnt/agnerds/Rene.Wiegandt/repo/masterJLU2018/bin/3.1_create_gtf/RegGTFExtractor.py", line 142, in <module>
      main_script(args["organism"], args["wd"], args["dir"],  args["out"], args["tissue"])
    File "/mnt/agnerds/Rene.Wiegandt/repo/masterJLU2018/bin/3.1_create_gtf/RegGTFExtractor.py", line 112, in main_script
      ense = Ensembl(org, wd, data_dir)
    File "/mnt/agnerds/Rene.Wiegandt/repo/masterJLU2018/bin/3.1_create_gtf/Modules/Ensembl/Ensembl.py", line 19, in __init__
      self.acttable.check_and_generate_activity_table()
    File "/mnt/agnerds/Rene.Wiegandt/repo/masterJLU2018/bin/3.1_create_gtf/Modules/Ensembl/ActivityTable.py", line 37, in check_and_generate_activity_table
      self.generate_table(folder_link)
    File "/mnt/agnerds/Rene.Wiegandt/repo/masterJLU2018/bin/3.1_create_gtf/Modules/Ensembl/ActivityTable.py", line 48, in generate_table
      f.write(self.generator.read_table(originpath))
    File "/mnt/agnerds/Rene.Wiegandt/repo/masterJLU2018/bin/3.1_create_gtf/Modules/Ensembl/ActivityTableGenerator.py", line 17, in read_table
      for line in f:
    File "/mnt/workspace1/rene.wiegandt/work/conda/masterenv-f17511358e7a5c5bfa9e7b862c4f9523/lib/python3.5/gzip.py", line 372, in readline
      return self._buffer.readline(size)
    File "/mnt/workspace1/rene.wiegandt/work/conda/masterenv-f17511358e7a5c5bfa9e7b862c4f9523/lib/python3.5/_compression.py", line 68, in readinto
      data = self.read(len(byte_view))
    File "/mnt/workspace1/rene.wiegandt/work/conda/masterenv-f17511358e7a5c5bfa9e7b862c4f9523/lib/python3.5/gzip.py", line 480, in read
      raise EOFError("Compressed file ended before the "
  EOFError: Compressed file ended before the end-of-stream marker was reached

Run 3:

ERROR ~ Error executing process > 'create_GTF'

Caused by:
  Process `create_GTF` terminated with an error exit status (1)

Command executed:

  python /mnt/agnerds/Rene.Wiegandt/repo/masterJLU2018/bin/3.1_create_gtf/RegGTFExtractor.py hg38 --tissue null --wd /mnt/agnerds/Rene.Wiegandt/repo/masterJLU2018/bin/3.1_create_gtf

Command exit status:
  1

Command output:
  Working Dir: /mnt/agnerds/Rene.Wiegandt/repo/masterJLU2018/bin/3.1_create_gtf
  Filter not detected !
  Getting UCSC Data
  Path to Bin: /mnt/agnerds/Rene.Wiegandt/repo/masterJLU2018/bin/3.1_create_gtf/Modules/ucsc/bigBedToBed
  UCSC finished !
  Starting Ensembl
  Current release is release-94
  Local Version found: release-94
  Newest Version installed, no update needed.
  All ActivityTables found, proceeding

Command error:
  Traceback (most recent call last):
    File "/mnt/agnerds/Rene.Wiegandt/repo/masterJLU2018/bin/3.1_create_gtf/RegGTFExtractor.py", line 142, in <module>
      main_script(args["organism"], args["wd"], args["dir"],  args["out"], args["tissue"])
    File "/mnt/agnerds/Rene.Wiegandt/repo/masterJLU2018/bin/3.1_create_gtf/RegGTFExtractor.py", line 112, in main_script
      ense = Ensembl(org, wd, data_dir)
    File "/mnt/agnerds/Rene.Wiegandt/repo/masterJLU2018/bin/3.1_create_gtf/Modules/Ensembl/Ensembl.py", line 20, in __init__
      self.categorizer = ActivityCategorizer(self.release, organism, wd, data_dir)
    File "/mnt/agnerds/Rene.Wiegandt/repo/masterJLU2018/bin/3.1_create_gtf/Modules/Ensembl/ActivityCategorizer.py", line 21, in __init__
      self.get_activity_data(release, organism, wd, data_dir)
    File "/mnt/agnerds/Rene.Wiegandt/repo/masterJLU2018/bin/3.1_create_gtf/Modules/Ensembl/ActivityCategorizer.py", line 52, in get_activity_data
      with open(file, "rb") as tables:
  FileNotFoundError: [Errno 2] No such file or directory: '/mnt/agnerds/Rene.Wiegandt/repo/masterJLU2018/bin/3.1_create_gtf/data/EnsemblData/release-94/homo_sapiens/activity/Fetal_Thymus/table.bin'

Work dir:
  /mnt/workspace1/rene.wiegandt/work/7f/24a725bf8dcf328c35a1125a119d43

I get the errors above when I run the script again after the first run. I have to delete the data folder to get it working again.
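The EOFError in Run 2 is what Python's gzip module raises when a compressed file is truncated, for example after an interrupted download. A minimal sketch of a check that could run before a cached file is reused, assuming a hypothetical helper `is_complete_gzip` (not part of the script):

```python
import gzip
import zlib


def is_complete_gzip(path, chunk_size=1024 * 1024):
    """Return True if the gzip file at path can be read to its end-of-stream marker."""
    try:
        with gzip.open(path, "rb") as handle:
            # Reading in chunks keeps memory use low for large files.
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # EOFError: truncated stream; OSError: missing file or not gzip data;
        # zlib.error: corrupt compressed data.
        return False
    return True
```

A file that fails this check could be deleted and downloaded again, rather than requiring the whole data folder to be removed.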

@anastasiia
Collaborator Author

The comments are good!

The problem I have now is with the validation of the GTF files. I used the validate_gtf.pl tool by Keibler and Brent (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC270064/). Here are the errors I get for each organism.

hg19
./validate_gtf.pl ../gtf_creation/bin/3.1_create_gtf/hg19.gtf
Incorrect value for field, "open_chromatin", on line 1.
Non-numerical value for field, "74142100.0", on line 1.
Bad terminator, ",", after name-value pair, activity "A549>INACTIVE,, on line 1. Should be ";".
field missing quotes around values on line 1.
Bad terminator, ",", after name-value pair, Aorta>NA, Thymus>NA,, on line 1. Should be ";".
field missing quotes around values on line 1.
Bad terminator, ",", after name-value pair, B-Cells>INACTIVE, T-Cell>NA,, on line 1. Should be ";".
field missing quotes around values on line 1.
Bad terminator, ",", after name-value pair, Monocyte>INACTIVE, Neutrophil>NA,, on line 1. Should be ";".
field missing quotes around values on line 1.
Bad terminator, ",", after name-value pair, Eosinophil>NA, Macrophage>NA,, on line 1. Should be ";".
field missing quotes around values on line 1.
transcript_id not set in field on line 1.
Incorrect value for field, "open_chromatin", on line 2.
Non-numerical value for field, "67318737.0", on line 2.
transcript_id not set in field on line 2.
Incorrect value for field, "ctcf_binding_site", on line 3.
Non-numerical value for field, "62050318.0", on line 3.
transcript_id not set in field on line 3.
Incorrect value for field, "promoter", on line 4.
Non-numerical value for field, "40596252.0", on line 4.
transcript_id not set in field on line 4.
Incorrect value for field, "ctcf_binding_site", on line 5.
Non-numerical value for field, "96742904.0", on line 5.
transcript_id not set in field on line 5.
Non-numerical value for field, "180850074.0", on line 142.
Inconsistent value across gene_id = ENSR00000105157.
Non-numerical value for field, "44883556.0", on line 203.
Inconsistent value across gene_id = ENSR00000105157.
Non-numerical value for field, "44917448.0", on line 489.
Inconsistent value across gene_id = ENSR00000105157.
Non-numerical value for field, "11070057.0", on line 963.
Inconsistent value across gene_id = ENSR00000105157.
Non-numerical value for field, "48402509.0", on line 990.
Inconsistent value across gene_id = ENSR00000105157.
Transcript missing contains no CDS features.

Warnings encountered:
Count Description
342203 Illegal value for field. Should be 'CDS', 'exon', 'start_codon', or 'stop_codon'.
1542 Illegal value for field. Should be numerical.
342060 Illegal value for field. Should be numerical.
5817451 Missing ';' terminator after attribute in field.
5817451 Missing '"'s around attribute values in field.
342203 Missing 'transcript_id' attribute in field.
1 Transcript contains no CDS features.
1727 Inconsistent value across gene_id.

Statistics:
1 genes with 1 transcripts containing 0 cds.

hg38
./validate_gtf.pl ../gtf_creation/bin/3.1_create_gtf/hg38.gtf
Incorrect value for field, "open_chromatin", on line 1.
Bad terminator, ",", after name-value pair, activity "A549>INACTIVE,, on line 1. Should be ";".
field missing quotes around values on line 1.
Bad terminator, ",", after name-value pair, Aorta>NA, Thymus>NA,, on line 1. Should be ";".
field missing quotes around values on line 1.
Bad terminator, ",", after name-value pair, B-Cells>INACTIVE, T-Cell>NA,, on line 1. Should be ";".
field missing quotes around values on line 1.
Bad terminator, ",", after name-value pair, Monocyte>INACTIVE, Neutrophil>NA,, on line 1. Should be ";".
field missing quotes around values on line 1.
Bad terminator, ",", after name-value pair, Eosinophil>NA, Macrophage>NA,, on line 1. Should be ";".
field missing quotes around values on line 1.
transcript_id not set in field on line 1.
Incorrect value for field, "open_chromatin", on line 2.
transcript_id not set in field on line 2.
Incorrect value for field, "ctcf_binding_site", on line 3.
transcript_id not set in field on line 3.
Incorrect value for field, "promoter", on line 4.
transcript_id not set in field on line 4.
Incorrect value for field, "ctcf_binding_site", on line 5.
transcript_id not set in field on line 5.
Inconsistent value across gene_id = ENSR00000105157.
Inconsistent value across gene_id = ENSR00000105157.
Inconsistent value across gene_id = ENSR00000105157.
Inconsistent value across gene_id = ENSR00000105157.
Inconsistent value across gene_id = ENSR00000105157.
Transcript missing contains no CDS features.

Warnings encountered:
Count Description
347685 Illegal value for field. Should be 'CDS', 'exon', 'start_codon', or 'stop_codon'.
5910645 Missing ';' terminator after attribute in field.
5910645 Missing '"'s around attribute values in field.
347685 Missing 'transcript_id' attribute in field.
1 Transcript contains no CDS features.
136 Inconsistent value across gene_id.

Statistics:
1 genes with 1 transcripts containing 0 cds.

mm10
./validate_gtf.pl ../gtf_creation/bin/3.1_create_gtf/mm10.gtf
Incorrect value for field, "promoter_flanking_region", on line 1.
Bad terminator, ",", after name-value pair, activity "Heart>ACTIVE,, on line 1. Should be ";".
field missing quotes around values on line 1.
Bad terminator, ",", after name-value pair, Facial_Prominence>POISED, Forebrain>INACTIVE,, on line 1. Should be ";".
field missing quotes around values on line 1.
Bad terminator, ",", after name-value pair, Hindbrain>INACTIVE, Midbrain>POISED,, on line 1. Should be ";".
field missing quotes around values on line 1.
Bad terminator, ",", after name-value pair, NeuralTube>INACTIVE, Intestine>INACTIVE,, on line 1. Should be ";".
field missing quotes around values on line 1.
Bad terminator, ",", after name-value pair, Kidney>INACTIVE, Liver>INACTIVE,, on line 1. Should be ";".
field missing quotes around values on line 1.
transcript_id not set in field on line 1.
Incorrect value for field, "promoter_flanking_region", on line 2.
transcript_id not set in field on line 2.
Incorrect value for field, "ctcf_binding_site", on line 3.
transcript_id not set in field on line 3.
Incorrect value for field, "ctcf_binding_site", on line 4.
transcript_id not set in field on line 4.
Incorrect value for field, "enhancer", on line 5.
transcript_id not set in field on line 5.
Inconsistent value across gene_id = ENSMUSR00000727517.
Inconsistent value across gene_id = ENSMUSR00000727517.
Transcript missing contains no CDS features.

Warnings encountered:
Count Description
420968 Illegal value for field. Should be 'CDS', 'exon', 'start_codon', or 'stop_codon'.
3788712 Missing ';' terminator after attribute in field.
3788712 Missing '"'s around attribute values in field.
420968 Missing 'transcript_id' attribute in field.
1 Transcript contains no CDS features.
2 Inconsistent value across gene_id.

Statistics:
1 genes with 1 transcripts containing 0 cds.

mm9
./validate_gtf.pl ../gtf_creation/bin/3.1_create_gtf/mm9.gtf
Incorrect value for field, "promoter_flanking_region", on line 1.
Non-numerical value for field, "8422039.0", on line 1.
Bad terminator, ",", after name-value pair, activity "Heart>ACTIVE,, on line 1. Should be ";".
field missing quotes around values on line 1.
Bad terminator, ",", after name-value pair, Facial_Prominence>POISED, Forebrain>INACTIVE,, on line 1. Should be ";".
field missing quotes around values on line 1.
Bad terminator, ",", after name-value pair, Hindbrain>INACTIVE, Midbrain>POISED,, on line 1. Should be ";".
field missing quotes around values on line 1.
Bad terminator, ",", after name-value pair, NeuralTube>INACTIVE, Intestine>INACTIVE,, on line 1. Should be ";".
field missing quotes around values on line 1.
Bad terminator, ",", after name-value pair, Kidney>INACTIVE, Liver>INACTIVE,, on line 1. Should be ";".
field missing quotes around values on line 1.
transcript_id not set in field on line 1.
Incorrect value for field, "promoter_flanking_region", on line 2.
Non-numerical value for field, "103797019.0", on line 2.
transcript_id not set in field on line 2.
Incorrect value for field, "ctcf_binding_site", on line 3.
Non-numerical value for field, "152122105.0", on line 3.
transcript_id not set in field on line 3.
Incorrect value for field, "ctcf_binding_site", on line 4.
Non-numerical value for field, "24687490.0", on line 4.
transcript_id not set in field on line 4.
Incorrect value for field, "enhancer", on line 5.
Non-numerical value for field, "128404900.0", on line 5.
transcript_id not set in field on line 5.
Non-numerical value for field, "6644141.0", on line 205.
Inconsistent value across gene_id = ENSMUSR00000727517.
Non-numerical value for field, "23215214.0", on line 960.
Inconsistent value across gene_id = ENSMUSR00000727517.
Non-numerical value for field, "6410941.0", on line 1124.
Inconsistent value across gene_id = ENSMUSR00000727517.
Non-numerical value for field, "6142740.0", on line 1186.
Inconsistent value across gene_id = ENSMUSR00000727517.
Non-numerical value for field, "4897748.0", on line 1259.
Inconsistent value across gene_id = ENSMUSR00000727517.
Transcript missing contains no CDS features.

Warnings encountered:
Count Description
419428 Illegal value for field. Should be 'CDS', 'exon', 'start_codon', or 'stop_codon'.
1289 Illegal value for field. Should be numerical.
419332 Illegal value for field. Should be numerical.
3774852 Missing ';' terminator after attribute in field.
3774852 Missing '"'s around attribute values in field.
419428 Missing 'transcript_id' attribute in field.
1 Transcript contains no CDS features.
1320 Inconsistent value across gene_id.

Statistics:
1 genes with 1 transcripts containing 0 cds.

@SebastianBeyvers
Collaborator

Most of these are not real issues. A GTF for regulatory features has a different format and annotation than a GTF for coding regions etc. There are no CDS in the dataset because most features are regulatory rather than coding, so transcript_id is missing too.

The "Bad terminator" check does not detect that the activity data is one giant field rather than multiple fields with the wrong terminator. Maybe a ";" is missing at the end; I will check for that.

"1320 Inconsistent value across gene_id": both UCSC and RefSeq gene_ids are used, so this is by design and not a bug.

The non-numerical error for the start/stop fields may be a bug; I will check for that. Sometimes this data has a "version" as a decimal number appended (e.g. "23215214.0"). Most programs should detect the right number, but I can change them to plain integers.
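Whatever causes the trailing ".0", the start/stop values can be normalized when the line is written. A minimal sketch, assuming a hypothetical helper `to_gtf_coordinate` that is not part of the current script:

```python
def to_gtf_coordinate(value):
    """Convert a coordinate such as "23215214.0" or 23215214.0 to the string "23215214"."""
    number = float(value)
    # Refuse values with a real fractional part instead of silently truncating.
    if not number.is_integer():
        raise ValueError("Coordinate is not a whole number: {!r}".format(value))
    return str(int(number))
```

Raising on a value like "1.5" is a design choice here: it surfaces bad input early instead of writing a rounded coordinate into the GTF file.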

@anastasiia
Collaborator Author

@SebastianBeyvers, I am not sure whether I should test your script once more. Do you plan to change anything, or are these issues irrelevant and your script already finished?
