misc/WGBS-RRBS_Potential_issues_with_custom_genomes.md

# WGBS/RRBS: Potential issues with custom genomes

## tl;dr
`mcall` analyses CpG in a quite fixed pattern: **`xxCGx`**, where `x` is one of `[ACGT]`.

Chromosomes starting with 5-'**`xCGxx`**, 5'-**`CGxxx`** or ending with a terminal **`xxxCG`**-3' will make `mcall` fail if these regions are covered by sequencing reads and if quality is enough to analyse `CpG` methylation ..

As a consequence, `mcall` writes a corrupted BED file for the given chromosome and analysis stops ("successfully finished") :-)

## The longer story ... problem symptoms

A set of 6 WGBS libraries (custom genome reference, here `dr11`, but the same happened with `killifish` genome), 4/6 seem to be okay at a first glance and 2/6 looked broken.
* one library had calls for all chromosomes except `chr9`
* another library had only calls for `chr1`, `chr10` and `chr11`

Simply checking the logs for a failing *grep* command actually shows that **all** libraries seem to have issues:

```
€ grep -A 1 grep *.log
AM-WGBS-325.WGBS.mxqout.log:grep: (standard input): binary file matches
AM-WGBS-325.WGBS.mxqout.log-Program successfully finished
--
AM-WGBS-326.WGBS.mxqout.log:grep: (standard input): binary file matches
AM-WGBS-326.WGBS.mxqout.log-Program successfully finished
--
AM-WGBS-327.WGBS.mxqout.log:grep: (standard input): binary file matches
AM-WGBS-327.WGBS.mxqout.log-Program successfully finished
--
AM-WGBS-328.WGBS.mxqout.log:grep: (standard input): binary file matches
AM-WGBS-328.WGBS.mxqout.log-Program successfully finished
--
AM-WGBS-329.WGBS.mxqout.log:grep: (standard input): binary file matches
AM-WGBS-329.WGBS.mxqout.log-Program successfully finished
--
AM-WGBS-330.WGBS.mxqout.log:grep: (standard input): binary file matches
AM-WGBS-330.WGBS.mxqout.log-Program successfully finished
```

* coincidence which chromosomes are affected first due to multi-threading?
* regions with zero-coverage are not called, so no issues here

## The workflow datasets

For testing I took out these libs, with `327`being a "good one".
```
mpimg_L27321-1_AM-WGBS-325
mpimg_L27323-1_AM-WGBS-327
mpimg_L27326-1_AM-WGBS-330
```

I ran these datasets again through our pipeline, directing `mcall` to keep all temporary BED files for a more detailed inspection.

Searching for NULL-Bytes in the temporary BED data reveals the problematic chromosomes and "stop points".

```
grep -Pa '\x00' wgbs_AM-WGBS-3*/*.G.bed
```
The output of the command has been shortened and formatted for better understanding:

![](media/mcall01.jpg)

Here we can see, that the first bases are covered by reads, analysed by `mcall` and that the `localSeq` starts with a NULL byte, caused by the `CG`being to close to the 5'-end (one extra base is needed to fulfill the required pattern as stated at the very beginning "tl;dr").

Such a NULL byte in an otherwise simple text file (here BED) makes many programs assume that this is a binary file!
In the case of `mcall` this results in an error of an internally called `grep`command `grep: (standard input): binary file matches` and stops further analysis of the dataset with `mcall`. 
**But as `mcall` reports "Success" even with this error occurring, the WGBS pipeline finishes without errors.**  

### Example chrM
* none of the libraries of the original "corrupted" (in terms of uncomplete) results files has calls, except for "AM-WGBS-328":
```
€ fgrep -c chrM  wgbs_AM-*/*.CpG.bed
wgbs_AM-WGBS-325_dr11_PE_20221106/AM-WGBS-325_dr11.bsmap.srt.rd.bam.CpG.bed:0
wgbs_AM-WGBS-326_dr11_PE_20221103/AM-WGBS-326_dr11.bsmap.srt.rd.bam.CpG.bed:0
wgbs_AM-WGBS-327_dr11_PE_20221103/AM-WGBS-327_dr11.bsmap.srt.rd.bam.CpG.bed:0
wgbs_AM-WGBS-328_dr11_PE_20221103/AM-WGBS-328_dr11.bsmap.srt.rd.bam.CpG.bed:400
wgbs_AM-WGBS-329_dr11_PE_20221103/AM-WGBS-329_dr11.bsmap.srt.rd.bam.CpG.bed:0
wgbs_AM-WGBS-330_dr11_PE_20221104/AM-WGBS-330_dr11.bsmap.srt.rd.bam.CpG.bed:0
```
Why? Because the very first `CG` seems not to be covered or alignment quality is bad, so that `mcall` starts analysis with the second `CG` (pos 5-7):
```
€ fgrep chrM AM-WGBS-328_dr11.bsmap.srt.rd.bam.CpG.bed|head -n 2
chrM    5   7   0   10  0   B   G   +   4   0   -   6   0   GCCGG
chrM    8   10  0   18  0   B   G   +   9   0   -   9   0   GGCGA
```

## The reference genome sequence file

Just to confirm that there a potentially problematic `CpG`in the reference genome fasta file (output shortened for better overview):
```
€ seqkit locate -p "^[ACGT]CG[ACGT]{6}" -j 8 -r -i dr11_mod.fa_single-lines

seqID                  pattern            str beg end   matched
chr1_KZ115000v1_alt    ^[ACGT]CG[ACGT]{6}  +   1   9    CcgAACCTT
chr4_KZ115085v1_alt    ^[ACGT]CG[ACGT]{6}  +   1   9    CcgTTCTGC
chr5_KZ115129v1_alt    ^[ACGT]CG[ACGT]{6}  +   1   9    GcgTTTCTG
chr6_KZ115189v1_alt    ^[ACGT]CG[ACGT]{6}  +   1   9    GcgGTAAAA
chr7_KZ115244v1_alt    ^[ACGT]CG[ACGT]{6}  +   1   9    TcgACATCT
chr8_KZ115260v1_alt    ^[ACGT]CG[ACGT]{6}  +   1   9    TcgTGGTCA
chr9_KZ115297v1_alt    ^[ACGT]CG[ACGT]{6}  +   1   9    CcgCCTCAA
chr11_KZ115352v1_alt   ^[ACGT]CG[ACGT]{6}  +   1   9    GcgCAGGTG
chr18_KZ115538v1_alt   ^[ACGT]CG[ACGT]{6}  +   1   9    TcgAGTTTC
chr24_KZ115705v1_alt   ^[ACGT]CG[ACGT]{6}  +   1   9    TcgGGGCCC
chr24_KZ115714v1_alt   ^[ACGT]CG[ACGT]{6}  +   1   9    GcgTGTATT
chr11_KZ114924v1_alt   ^[ACGT]CG[ACGT]{6}  +   1   9    GcgCAGGTG
chrUn_KN148038v2       ^[ACGT]CG[ACGT]{6}  +   1   9    GcgGAATGA
chrUn_KN149803v1       ^[ACGT]CG[ACGT]{6}  +   1   9    GcgCTCGAT
chrUn_KN149897v1       ^[ACGT]CG[ACGT]{6}  +   1   9    TcgCCACAG
chrUn_KN149900v1       ^[ACGT]CG[ACGT]{6}  +   1   9    GcgTCACGC
chrUn_KN149935v1       ^[ACGT]CG[ACGT]{6}  +   1   9    GcgTTTACA
chrUn_KN150141v1       ^[ACGT]CG[ACGT]{6}  +   1   9    AcgCAGACA
chrUn_KN150338v1       ^[ACGT]CG[ACGT]{6}  +   1   9    GcgCCTCAG
chrUn_KN150440v1       ^[ACGT]CG[ACGT]{6}  +   1   9    AcgTGTGTG
chrUn_KN150501v1       ^[ACGT]CG[ACGT]{6}  +   1   9    AcgCGGTGT
chrUn_KN150534v1       ^[ACGT]CG[ACGT]{6}  +   1   9    GcgAGCTTT
chrUn_KN150541v1       ^[ACGT]CG[ACGT]{6}  +   1   9    GcgCTATAG
chrUn_KN150621v1       ^[ACGT]CG[ACGT]{6}  +   1   9    CcgCCGTAC
chrUn_KN150643v1       ^[ACGT]CG[ACGT]{6}  +   1   9    GcgAGAGCG
chrUn_KZ115992v1       ^[ACGT]CG[ACGT]{6}  +   1   9    GcgCGCGCG
chrM                   ^[ACGT]CG[ACGT]{6}  +   1   9    AcgGCCGGC
```

* in hg19 all chr start with NNN<..> except for chrM (GATCACAGGT) and chr17 (AAGCTTCTCAC). So there is no (almost-)terminal CG pattern. 

## What to do?

Methylation analysis by `mcall` is strictly done along **`xxCGx`**. 

Reference sequences must have:
* at least two leading bases before the first `CG` at the 5' end
* at least one trailing base base after the last `CG` at the 3' end

The 5' end can be masked with `N` (*prefixing* with `N` would shift coordinates and thus is not desired).

### Masking with `seqkit`
`seqkit` is a very versatile tool .. see https://bioinf.shenwei.me/seqkit/

Original fasta - here `dr11_mod.fa` - has a line width of 75 bases, so in this example I'll stick to this value. 
If you prefer a single-line file, than just use a width of `0` (zero).

#### modifying the 5' end
```
seqkit replace \
  --by-seq \
  --ignore-case \
  --pattern "^.{5}" \
  --replacement 'nnnnn' \
  --line-width 75 \
  --threads 8 \
  input.fa > output_mod.fa
```
.. replaces the first five arbitrary chars at the beginning of a fasta record (->sequence) with 'nnnnn'.

#### modifying the 3' end
```
seqkit replace \
  --by-seq \
  --ignore-case \
  --pattern ".{5}$" \
  --replacement 'NNNNN' \
  --line-width 75 \
  --threads 8 \
  input.fa > output_mod.fa
```
.. replaces the last five arbitrary chars at the very end of a fasta record (->sequence) with 'NNNNN'.

pattern/replacement lengths in these examples are arbitrary, just modify to fit your needs.

`--ignore-case` is of no use with a pattern like `.` (arbitrary char). But it becomes important if you modify the pattern to something more specific like `[ACGT]`. 

## History
* AM-WGBS-325 to AM-WGBS-330 (zebrafish), 5' end, 11/22
* AM-WGBS-062 and some others (killifish), 3' end, 10/22


   -- The End