mcall
analyses CpG in a quite fixed pattern: xxCGx
, where x
is one of [ACGT]
.
Chromosomes starting with 5-'xCGxx
, 5'-CGxxx
or ending with a terminal xxxCG
-3' will make mcall
fail if these regions are covered by sequencing reads and if quality is enough to analyse CpG
methylation ..
As a consequence, mcall
writes a corrupted BED file for the given chromosome and analysis stops ("successfully finished") :-)
A set of 6 WGBS libraries (custom genome reference, here dr11
, but the same happened with killifish
genome), 4/6 seem to be okay at a first glance and 2/6 looked broken.
- one library had calls for all chromosomes except
chr9
- another library had only calls for
chr1
,chr10
andchr11
Simply checking the logs for a failing grep command actually shows that all libraries seem to have issues:
€ grep -A 1 grep *.log
AM-WGBS-325.WGBS.mxqout.log:grep: (standard input): binary file matches
AM-WGBS-325.WGBS.mxqout.log-Program successfully finished
--
AM-WGBS-326.WGBS.mxqout.log:grep: (standard input): binary file matches
AM-WGBS-326.WGBS.mxqout.log-Program successfully finished
--
AM-WGBS-327.WGBS.mxqout.log:grep: (standard input): binary file matches
AM-WGBS-327.WGBS.mxqout.log-Program successfully finished
--
AM-WGBS-328.WGBS.mxqout.log:grep: (standard input): binary file matches
AM-WGBS-328.WGBS.mxqout.log-Program successfully finished
--
AM-WGBS-329.WGBS.mxqout.log:grep: (standard input): binary file matches
AM-WGBS-329.WGBS.mxqout.log-Program successfully finished
--
AM-WGBS-330.WGBS.mxqout.log:grep: (standard input): binary file matches
AM-WGBS-330.WGBS.mxqout.log-Program successfully finished
- coincidence which chromosomes are affected first due to multi-threading?
- regions with zero-coverage are not called, so no issues here
For testing I took out these libs, with 327
being a "good one".
mpimg_L27321-1_AM-WGBS-325
mpimg_L27323-1_AM-WGBS-327
mpimg_L27326-1_AM-WGBS-330
I ran these datasets again through our pipeline, directing mcall
to keep all temporary BED files for a more detailed inspection.
Searching for NULL-Bytes in the temporary BED data reveals the problematic chromosomes and "stop points".
grep -Pa '\x00' wgbs_AM-WGBS-3*/*.G.bed
The output of the command has been shortened and formatted for better understanding:
Here we can see, that the first bases are covered by reads, analysed by mcall
and that the localSeq
starts with a NULL byte, caused by the CG
being to close to the 5'-end (one extra base is needed to fulfill the required pattern as stated at the very beginning "tl;dr").
Such a NULL byte in an otherwise simple text file (here BED) makes many programs assume that this is a binary file!
In the case of mcall
this results in an error of an internally called grep
command grep: (standard input): binary file matches
and stops further analysis of the dataset with mcall
.
But as mcall
reports "Success" even with this error occurring, the WGBS pipeline finishes without errors.
- none of the libraries of the original "corrupted" (in terms of uncomplete) results files has calls, except for "AM-WGBS-328":
€ fgrep -c chrM wgbs_AM-*/*.CpG.bed
wgbs_AM-WGBS-325_dr11_PE_20221106/AM-WGBS-325_dr11.bsmap.srt.rd.bam.CpG.bed:0
wgbs_AM-WGBS-326_dr11_PE_20221103/AM-WGBS-326_dr11.bsmap.srt.rd.bam.CpG.bed:0
wgbs_AM-WGBS-327_dr11_PE_20221103/AM-WGBS-327_dr11.bsmap.srt.rd.bam.CpG.bed:0
wgbs_AM-WGBS-328_dr11_PE_20221103/AM-WGBS-328_dr11.bsmap.srt.rd.bam.CpG.bed:400
wgbs_AM-WGBS-329_dr11_PE_20221103/AM-WGBS-329_dr11.bsmap.srt.rd.bam.CpG.bed:0
wgbs_AM-WGBS-330_dr11_PE_20221104/AM-WGBS-330_dr11.bsmap.srt.rd.bam.CpG.bed:0
Why? Because the very first CG
seems not to be covered or alignment quality is bad, so that mcall
starts analysis with the second CG
(pos 5-7):
€ fgrep chrM AM-WGBS-328_dr11.bsmap.srt.rd.bam.CpG.bed|head -n 2
chrM 5 7 0 10 0 B G + 4 0 - 6 0 GCCGG
chrM 8 10 0 18 0 B G + 9 0 - 9 0 GGCGA
Just to confirm that there a potentially problematic CpG
in the reference genome fasta file (output shortened for better overview):
€ seqkit locate -p "^[ACGT]CG[ACGT]{6}" -j 8 -r -i dr11_mod.fa_single-lines
seqID pattern str beg end matched
chr1_KZ115000v1_alt ^[ACGT]CG[ACGT]{6} + 1 9 CcgAACCTT
chr4_KZ115085v1_alt ^[ACGT]CG[ACGT]{6} + 1 9 CcgTTCTGC
chr5_KZ115129v1_alt ^[ACGT]CG[ACGT]{6} + 1 9 GcgTTTCTG
chr6_KZ115189v1_alt ^[ACGT]CG[ACGT]{6} + 1 9 GcgGTAAAA
chr7_KZ115244v1_alt ^[ACGT]CG[ACGT]{6} + 1 9 TcgACATCT
chr8_KZ115260v1_alt ^[ACGT]CG[ACGT]{6} + 1 9 TcgTGGTCA
chr9_KZ115297v1_alt ^[ACGT]CG[ACGT]{6} + 1 9 CcgCCTCAA
chr11_KZ115352v1_alt ^[ACGT]CG[ACGT]{6} + 1 9 GcgCAGGTG
chr18_KZ115538v1_alt ^[ACGT]CG[ACGT]{6} + 1 9 TcgAGTTTC
chr24_KZ115705v1_alt ^[ACGT]CG[ACGT]{6} + 1 9 TcgGGGCCC
chr24_KZ115714v1_alt ^[ACGT]CG[ACGT]{6} + 1 9 GcgTGTATT
chr11_KZ114924v1_alt ^[ACGT]CG[ACGT]{6} + 1 9 GcgCAGGTG
chrUn_KN148038v2 ^[ACGT]CG[ACGT]{6} + 1 9 GcgGAATGA
chrUn_KN149803v1 ^[ACGT]CG[ACGT]{6} + 1 9 GcgCTCGAT
chrUn_KN149897v1 ^[ACGT]CG[ACGT]{6} + 1 9 TcgCCACAG
chrUn_KN149900v1 ^[ACGT]CG[ACGT]{6} + 1 9 GcgTCACGC
chrUn_KN149935v1 ^[ACGT]CG[ACGT]{6} + 1 9 GcgTTTACA
chrUn_KN150141v1 ^[ACGT]CG[ACGT]{6} + 1 9 AcgCAGACA
chrUn_KN150338v1 ^[ACGT]CG[ACGT]{6} + 1 9 GcgCCTCAG
chrUn_KN150440v1 ^[ACGT]CG[ACGT]{6} + 1 9 AcgTGTGTG
chrUn_KN150501v1 ^[ACGT]CG[ACGT]{6} + 1 9 AcgCGGTGT
chrUn_KN150534v1 ^[ACGT]CG[ACGT]{6} + 1 9 GcgAGCTTT
chrUn_KN150541v1 ^[ACGT]CG[ACGT]{6} + 1 9 GcgCTATAG
chrUn_KN150621v1 ^[ACGT]CG[ACGT]{6} + 1 9 CcgCCGTAC
chrUn_KN150643v1 ^[ACGT]CG[ACGT]{6} + 1 9 GcgAGAGCG
chrUn_KZ115992v1 ^[ACGT]CG[ACGT]{6} + 1 9 GcgCGCGCG
chrM ^[ACGT]CG[ACGT]{6} + 1 9 AcgGCCGGC
- in hg19 all chr start with NNN<..> except for chrM (GATCACAGGT) and chr17 (AAGCTTCTCAC). So there is no (almost-)terminal CG pattern.
Methylation analysis by mcall
is strictly done along xxCGx
.
Reference sequences must have:
- at least two leading bases before the first
CG
at the 5' end - at least one trailing base base after the last
CG
at the 3' end
The 5' end can be masked with N
(prefixing with N
would shift coordinates and thus is not desired).
seqkit
is a very versatile tool .. see https://bioinf.shenwei.me/seqkit/
Original fasta - here dr11_mod.fa
- has a line width of 75 bases, so in this example I'll stick to this value.
If you prefer a single-line file, than just use a width of 0
(zero).
seqkit replace \
--by-seq \
--ignore-case \
--pattern "^.{5}" \
--replacement 'nnnnn' \
--line-width 75 \
--threads 8 \
input.fa > output_mod.fa
.. replaces the first five arbitrary chars at the beginning of a fasta record (->sequence) with 'nnnnn'.
seqkit replace \
--by-seq \
--ignore-case \
--pattern ".{5}$" \
--replacement 'NNNNN' \
--line-width 75 \
--threads 8 \
input.fa > output_mod.fa
.. replaces the last five arbitrary chars at the very end of a fasta record (->sequence) with 'NNNNN'.
pattern/replacement lengths in these examples are arbitrary, just modify to fit your needs.
--ignore-case
is of no use with a pattern like .
(arbitrary char). But it becomes important if you modify the pattern to something more specific like [ACGT]
.
-
AM-WGBS-325 to AM-WGBS-330 (zebrafish), 5' end, 11/22
-
AM-WGBS-062 and some others (killifish), 3' end, 10/22
-- The End