WGBS/RRBS: Potential issues with custom genomes

tl;dr

mcall analyses CpG in a quite fixed pattern: xxCGx, where x is one of [ACGT].

Chromosomes starting with 5-'xCGxx, 5'-CGxxx or ending with a terminal xxxCG-3' will make mcall fail if these regions are covered by sequencing reads and if quality is enough to analyse CpG methylation ..

As a consequence, mcall writes a corrupted BED file for the given chromosome and analysis stops ("successfully finished") :-)

The longer story ... problem symptoms

A set of 6 WGBS libraries (custom genome reference, here dr11, but the same happened with killifish genome), 4/6 seem to be okay at a first glance and 2/6 looked broken.

one library had calls for all chromosomes except chr9
another library had only calls for chr1, chr10 and chr11

Simply checking the logs for a failing grep command actually shows that all libraries seem to have issues:

€ grep -A 1 grep *.log
AM-WGBS-325.WGBS.mxqout.log:grep: (standard input): binary file matches
AM-WGBS-325.WGBS.mxqout.log-Program successfully finished
--
AM-WGBS-326.WGBS.mxqout.log:grep: (standard input): binary file matches
AM-WGBS-326.WGBS.mxqout.log-Program successfully finished
--
AM-WGBS-327.WGBS.mxqout.log:grep: (standard input): binary file matches
AM-WGBS-327.WGBS.mxqout.log-Program successfully finished
--
AM-WGBS-328.WGBS.mxqout.log:grep: (standard input): binary file matches
AM-WGBS-328.WGBS.mxqout.log-Program successfully finished
--
AM-WGBS-329.WGBS.mxqout.log:grep: (standard input): binary file matches
AM-WGBS-329.WGBS.mxqout.log-Program successfully finished
--
AM-WGBS-330.WGBS.mxqout.log:grep: (standard input): binary file matches
AM-WGBS-330.WGBS.mxqout.log-Program successfully finished

coincidence which chromosomes are affected first due to multi-threading?
regions with zero-coverage are not called, so no issues here

The workflow datasets

For testing I took out these libs, with 327being a "good one".

mpimg_L27321-1_AM-WGBS-325
mpimg_L27323-1_AM-WGBS-327
mpimg_L27326-1_AM-WGBS-330

I ran these datasets again through our pipeline, directing mcall to keep all temporary BED files for a more detailed inspection.

Searching for NULL-Bytes in the temporary BED data reveals the problematic chromosomes and "stop points".

grep -Pa '\x00' wgbs_AM-WGBS-3*/*.G.bed

The output of the command has been shortened and formatted for better understanding:

Here we can see, that the first bases are covered by reads, analysed by mcall and that the localSeq starts with a NULL byte, caused by the CGbeing to close to the 5'-end (one extra base is needed to fulfill the required pattern as stated at the very beginning "tl;dr").

Such a NULL byte in an otherwise simple text file (here BED) makes many programs assume that this is a binary file! In the case of mcall this results in an error of an internally called grepcommand grep: (standard input): binary file matches and stops further analysis of the dataset with mcall. But as mcall reports "Success" even with this error occurring, the WGBS pipeline finishes without errors.

Example chrM

none of the libraries of the original "corrupted" (in terms of uncomplete) results files has calls, except for "AM-WGBS-328":

€ fgrep -c chrM  wgbs_AM-*/*.CpG.bed
wgbs_AM-WGBS-325_dr11_PE_20221106/AM-WGBS-325_dr11.bsmap.srt.rd.bam.CpG.bed:0
wgbs_AM-WGBS-326_dr11_PE_20221103/AM-WGBS-326_dr11.bsmap.srt.rd.bam.CpG.bed:0
wgbs_AM-WGBS-327_dr11_PE_20221103/AM-WGBS-327_dr11.bsmap.srt.rd.bam.CpG.bed:0
wgbs_AM-WGBS-328_dr11_PE_20221103/AM-WGBS-328_dr11.bsmap.srt.rd.bam.CpG.bed:400
wgbs_AM-WGBS-329_dr11_PE_20221103/AM-WGBS-329_dr11.bsmap.srt.rd.bam.CpG.bed:0
wgbs_AM-WGBS-330_dr11_PE_20221104/AM-WGBS-330_dr11.bsmap.srt.rd.bam.CpG.bed:0

Why? Because the very first CG seems not to be covered or alignment quality is bad, so that mcall starts analysis with the second CG (pos 5-7):

€ fgrep chrM AM-WGBS-328_dr11.bsmap.srt.rd.bam.CpG.bed|head -n 2
chrM    5   7   0   10  0   B   G   +   4   0   -   6   0   GCCGG
chrM    8   10  0   18  0   B   G   +   9   0   -   9   0   GGCGA

The reference genome sequence file

Just to confirm that there a potentially problematic CpGin the reference genome fasta file (output shortened for better overview):

€ seqkit locate -p "^[ACGT]CG[ACGT]{6}" -j 8 -r -i dr11_mod.fa_single-lines

seqID                  pattern            str beg end   matched
chr1_KZ115000v1_alt    ^[ACGT]CG[ACGT]{6}  +   1   9    CcgAACCTT
chr4_KZ115085v1_alt    ^[ACGT]CG[ACGT]{6}  +   1   9    CcgTTCTGC
chr5_KZ115129v1_alt    ^[ACGT]CG[ACGT]{6}  +   1   9    GcgTTTCTG
chr6_KZ115189v1_alt    ^[ACGT]CG[ACGT]{6}  +   1   9    GcgGTAAAA
chr7_KZ115244v1_alt    ^[ACGT]CG[ACGT]{6}  +   1   9    TcgACATCT
chr8_KZ115260v1_alt    ^[ACGT]CG[ACGT]{6}  +   1   9    TcgTGGTCA
chr9_KZ115297v1_alt    ^[ACGT]CG[ACGT]{6}  +   1   9    CcgCCTCAA
chr11_KZ115352v1_alt   ^[ACGT]CG[ACGT]{6}  +   1   9    GcgCAGGTG
chr18_KZ115538v1_alt   ^[ACGT]CG[ACGT]{6}  +   1   9    TcgAGTTTC
chr24_KZ115705v1_alt   ^[ACGT]CG[ACGT]{6}  +   1   9    TcgGGGCCC
chr24_KZ115714v1_alt   ^[ACGT]CG[ACGT]{6}  +   1   9    GcgTGTATT
chr11_KZ114924v1_alt   ^[ACGT]CG[ACGT]{6}  +   1   9    GcgCAGGTG
chrUn_KN148038v2       ^[ACGT]CG[ACGT]{6}  +   1   9    GcgGAATGA
chrUn_KN149803v1       ^[ACGT]CG[ACGT]{6}  +   1   9    GcgCTCGAT
chrUn_KN149897v1       ^[ACGT]CG[ACGT]{6}  +   1   9    TcgCCACAG
chrUn_KN149900v1       ^[ACGT]CG[ACGT]{6}  +   1   9    GcgTCACGC
chrUn_KN149935v1       ^[ACGT]CG[ACGT]{6}  +   1   9    GcgTTTACA
chrUn_KN150141v1       ^[ACGT]CG[ACGT]{6}  +   1   9    AcgCAGACA
chrUn_KN150338v1       ^[ACGT]CG[ACGT]{6}  +   1   9    GcgCCTCAG
chrUn_KN150440v1       ^[ACGT]CG[ACGT]{6}  +   1   9    AcgTGTGTG
chrUn_KN150501v1       ^[ACGT]CG[ACGT]{6}  +   1   9    AcgCGGTGT
chrUn_KN150534v1       ^[ACGT]CG[ACGT]{6}  +   1   9    GcgAGCTTT
chrUn_KN150541v1       ^[ACGT]CG[ACGT]{6}  +   1   9    GcgCTATAG
chrUn_KN150621v1       ^[ACGT]CG[ACGT]{6}  +   1   9    CcgCCGTAC
chrUn_KN150643v1       ^[ACGT]CG[ACGT]{6}  +   1   9    GcgAGAGCG
chrUn_KZ115992v1       ^[ACGT]CG[ACGT]{6}  +   1   9    GcgCGCGCG
chrM                   ^[ACGT]CG[ACGT]{6}  +   1   9    AcgGCCGGC

in hg19 all chr start with NNN<..> except for chrM (GATCACAGGT) and chr17 (AAGCTTCTCAC). So there is no (almost-)terminal CG pattern.

What to do?

Methylation analysis by mcall is strictly done along xxCGx.

Reference sequences must have:

at least two leading bases before the first CG at the 5' end
at least one trailing base base after the last CG at the 3' end

The 5' end can be masked with N (prefixing with N would shift coordinates and thus is not desired).

Masking with `seqkit`

seqkit is a very versatile tool .. see https://bioinf.shenwei.me/seqkit/

Original fasta - here dr11_mod.fa - has a line width of 75 bases, so in this example I'll stick to this value. If you prefer a single-line file, than just use a width of 0 (zero).

modifying the 5' end

seqkit replace \
  --by-seq \
  --ignore-case \
  --pattern "^.{5}" \
  --replacement 'nnnnn' \
  --line-width 75 \
  --threads 8 \
  input.fa > output_mod.fa

.. replaces the first five arbitrary chars at the beginning of a fasta record (->sequence) with 'nnnnn'.

modifying the 3' end

seqkit replace \
  --by-seq \
  --ignore-case \
  --pattern ".{5}$" \
  --replacement 'NNNNN' \
  --line-width 75 \
  --threads 8 \
  input.fa > output_mod.fa

.. replaces the last five arbitrary chars at the very end of a fasta record (->sequence) with 'NNNNN'.

pattern/replacement lengths in these examples are arbitrary, just modify to fit your needs.

--ignore-case is of no use with a pattern like . (arbitrary char). But it becomes important if you modify the pattern to something more specific like [ACGT].

History

AM-WGBS-325 to AM-WGBS-330 (zebrafish), 5' end, 11/22
AM-WGBS-062 and some others (killifish), 3' end, 10/22

-- The End

data_workflows/misc/WGBS-RRBS_Potential_issues_with_custom_genomes.md