Permalink
Cannot retrieve contributors at this time
Name already in use
A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
data_workflows/misc/WGBS-RRBS_Potential_issues_with_custom_genomes.md
Go to fileThis commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
176 lines (141 sloc)
7.92 KB
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# WGBS/RRBS: Potential issues with custom genomes | |
## tl;dr | |
`mcall` analyses CpG in a quite fixed pattern: **`xxCGx`**, where `x` is one of `[ACGT]`. | |
Chromosomes starting with 5-'**`xCGxx`**, 5'-**`CGxxx`** or ending with a terminal **`xxxCG`**-3' will make `mcall` fail if these regions are covered by sequencing reads and if quality is enough to analyse `CpG` methylation .. | |
As a consequence, `mcall` writes a corrupted BED file for the given chromosome and analysis stops ("successfully finished") :-) | |
## The longer story ... problem symptoms | |
A set of 6 WGBS libraries (custom genome reference, here `dr11`, but the same happened with `killifish` genome), 4/6 seem to be okay at a first glance and 2/6 looked broken. | |
* one library had calls for all chromosomes except `chr9` | |
* another library had only calls for `chr1`, `chr10` and `chr11` | |
Simply checking the logs for a failing *grep* command actually shows that **all** libraries seem to have issues: | |
``` | |
€ grep -A 1 grep *.log | |
AM-WGBS-325.WGBS.mxqout.log:grep: (standard input): binary file matches | |
AM-WGBS-325.WGBS.mxqout.log-Program successfully finished | |
-- | |
AM-WGBS-326.WGBS.mxqout.log:grep: (standard input): binary file matches | |
AM-WGBS-326.WGBS.mxqout.log-Program successfully finished | |
-- | |
AM-WGBS-327.WGBS.mxqout.log:grep: (standard input): binary file matches | |
AM-WGBS-327.WGBS.mxqout.log-Program successfully finished | |
-- | |
AM-WGBS-328.WGBS.mxqout.log:grep: (standard input): binary file matches | |
AM-WGBS-328.WGBS.mxqout.log-Program successfully finished | |
-- | |
AM-WGBS-329.WGBS.mxqout.log:grep: (standard input): binary file matches | |
AM-WGBS-329.WGBS.mxqout.log-Program successfully finished | |
-- | |
AM-WGBS-330.WGBS.mxqout.log:grep: (standard input): binary file matches | |
AM-WGBS-330.WGBS.mxqout.log-Program successfully finished | |
``` | |
* coincidence which chromosomes are affected first due to multi-threading? | |
* regions with zero-coverage are not called, so no issues here | |
## The workflow datasets | |
For testing I took out these libs, with `327`being a "good one". | |
``` | |
mpimg_L27321-1_AM-WGBS-325 | |
mpimg_L27323-1_AM-WGBS-327 | |
mpimg_L27326-1_AM-WGBS-330 | |
``` | |
I ran these datasets again through our pipeline, directing `mcall` to keep all temporary BED files for a more detailed inspection. | |
Searching for NULL-Bytes in the temporary BED data reveals the problematic chromosomes and "stop points". | |
``` | |
grep -Pa '\x00' wgbs_AM-WGBS-3*/*.G.bed | |
``` | |
The output of the command has been shortened and formatted for better understanding: | |
![](media/mcall01.jpg) | |
Here we can see, that the first bases are covered by reads, analysed by `mcall` and that the `localSeq` starts with a NULL byte, caused by the `CG`being to close to the 5'-end (one extra base is needed to fulfill the required pattern as stated at the very beginning "tl;dr"). | |
Such a NULL byte in an otherwise simple text file (here BED) makes many programs assume that this is a binary file! | |
In the case of `mcall` this results in an error of an internally called `grep`command `grep: (standard input): binary file matches` and stops further analysis of the dataset with `mcall`. | |
**But as `mcall` reports "Success" even with this error occurring, the WGBS pipeline finishes without errors.** | |
### Example chrM | |
* none of the libraries of the original "corrupted" (in terms of uncomplete) results files has calls, except for "AM-WGBS-328": | |
``` | |
€ fgrep -c chrM wgbs_AM-*/*.CpG.bed | |
wgbs_AM-WGBS-325_dr11_PE_20221106/AM-WGBS-325_dr11.bsmap.srt.rd.bam.CpG.bed:0 | |
wgbs_AM-WGBS-326_dr11_PE_20221103/AM-WGBS-326_dr11.bsmap.srt.rd.bam.CpG.bed:0 | |
wgbs_AM-WGBS-327_dr11_PE_20221103/AM-WGBS-327_dr11.bsmap.srt.rd.bam.CpG.bed:0 | |
wgbs_AM-WGBS-328_dr11_PE_20221103/AM-WGBS-328_dr11.bsmap.srt.rd.bam.CpG.bed:400 | |
wgbs_AM-WGBS-329_dr11_PE_20221103/AM-WGBS-329_dr11.bsmap.srt.rd.bam.CpG.bed:0 | |
wgbs_AM-WGBS-330_dr11_PE_20221104/AM-WGBS-330_dr11.bsmap.srt.rd.bam.CpG.bed:0 | |
``` | |
Why? Because the very first `CG` seems not to be covered or alignment quality is bad, so that `mcall` starts analysis with the second `CG` (pos 5-7): | |
``` | |
€ fgrep chrM AM-WGBS-328_dr11.bsmap.srt.rd.bam.CpG.bed|head -n 2 | |
chrM 5 7 0 10 0 B G + 4 0 - 6 0 GCCGG | |
chrM 8 10 0 18 0 B G + 9 0 - 9 0 GGCGA | |
``` | |
## The reference genome sequence file | |
Just to confirm that there a potentially problematic `CpG`in the reference genome fasta file (output shortened for better overview): | |
``` | |
€ seqkit locate -p "^[ACGT]CG[ACGT]{6}" -j 8 -r -i dr11_mod.fa_single-lines | |
seqID pattern str beg end matched | |
chr1_KZ115000v1_alt ^[ACGT]CG[ACGT]{6} + 1 9 CcgAACCTT | |
chr4_KZ115085v1_alt ^[ACGT]CG[ACGT]{6} + 1 9 CcgTTCTGC | |
chr5_KZ115129v1_alt ^[ACGT]CG[ACGT]{6} + 1 9 GcgTTTCTG | |
chr6_KZ115189v1_alt ^[ACGT]CG[ACGT]{6} + 1 9 GcgGTAAAA | |
chr7_KZ115244v1_alt ^[ACGT]CG[ACGT]{6} + 1 9 TcgACATCT | |
chr8_KZ115260v1_alt ^[ACGT]CG[ACGT]{6} + 1 9 TcgTGGTCA | |
chr9_KZ115297v1_alt ^[ACGT]CG[ACGT]{6} + 1 9 CcgCCTCAA | |
chr11_KZ115352v1_alt ^[ACGT]CG[ACGT]{6} + 1 9 GcgCAGGTG | |
chr18_KZ115538v1_alt ^[ACGT]CG[ACGT]{6} + 1 9 TcgAGTTTC | |
chr24_KZ115705v1_alt ^[ACGT]CG[ACGT]{6} + 1 9 TcgGGGCCC | |
chr24_KZ115714v1_alt ^[ACGT]CG[ACGT]{6} + 1 9 GcgTGTATT | |
chr11_KZ114924v1_alt ^[ACGT]CG[ACGT]{6} + 1 9 GcgCAGGTG | |
chrUn_KN148038v2 ^[ACGT]CG[ACGT]{6} + 1 9 GcgGAATGA | |
chrUn_KN149803v1 ^[ACGT]CG[ACGT]{6} + 1 9 GcgCTCGAT | |
chrUn_KN149897v1 ^[ACGT]CG[ACGT]{6} + 1 9 TcgCCACAG | |
chrUn_KN149900v1 ^[ACGT]CG[ACGT]{6} + 1 9 GcgTCACGC | |
chrUn_KN149935v1 ^[ACGT]CG[ACGT]{6} + 1 9 GcgTTTACA | |
chrUn_KN150141v1 ^[ACGT]CG[ACGT]{6} + 1 9 AcgCAGACA | |
chrUn_KN150338v1 ^[ACGT]CG[ACGT]{6} + 1 9 GcgCCTCAG | |
chrUn_KN150440v1 ^[ACGT]CG[ACGT]{6} + 1 9 AcgTGTGTG | |
chrUn_KN150501v1 ^[ACGT]CG[ACGT]{6} + 1 9 AcgCGGTGT | |
chrUn_KN150534v1 ^[ACGT]CG[ACGT]{6} + 1 9 GcgAGCTTT | |
chrUn_KN150541v1 ^[ACGT]CG[ACGT]{6} + 1 9 GcgCTATAG | |
chrUn_KN150621v1 ^[ACGT]CG[ACGT]{6} + 1 9 CcgCCGTAC | |
chrUn_KN150643v1 ^[ACGT]CG[ACGT]{6} + 1 9 GcgAGAGCG | |
chrUn_KZ115992v1 ^[ACGT]CG[ACGT]{6} + 1 9 GcgCGCGCG | |
chrM ^[ACGT]CG[ACGT]{6} + 1 9 AcgGCCGGC | |
``` | |
* in hg19 all chr start with NNN<..> except for chrM (GATCACAGGT) and chr17 (AAGCTTCTCAC). So there is no (almost-)terminal CG pattern. | |
## What to do? | |
Methylation analysis by `mcall` is strictly done along **`xxCGx`**. | |
Reference sequences must have: | |
* at least two leading bases before the first `CG` at the 5' end | |
* at least one trailing base base after the last `CG` at the 3' end | |
The 5' end can be masked with `N` (*prefixing* with `N` would shift coordinates and thus is not desired). | |
### Masking with `seqkit` | |
`seqkit` is a very versatile tool .. see https://bioinf.shenwei.me/seqkit/ | |
Original fasta - here `dr11_mod.fa` - has a line width of 75 bases, so in this example I'll stick to this value. | |
If you prefer a single-line file, than just use a width of `0` (zero). | |
#### modifying the 5' end | |
``` | |
seqkit replace \ | |
--by-seq \ | |
--ignore-case \ | |
--pattern "^.{5}" \ | |
--replacement 'nnnnn' \ | |
--line-width 75 \ | |
--threads 8 \ | |
input.fa > output_mod.fa | |
``` | |
.. replaces the first five arbitrary chars at the beginning of a fasta record (->sequence) with 'nnnnn'. | |
#### modifying the 3' end | |
``` | |
seqkit replace \ | |
--by-seq \ | |
--ignore-case \ | |
--pattern ".{5}$" \ | |
--replacement 'NNNNN' \ | |
--line-width 75 \ | |
--threads 8 \ | |
input.fa > output_mod.fa | |
``` | |
.. replaces the last five arbitrary chars at the very end of a fasta record (->sequence) with 'NNNNN'. | |
pattern/replacement lengths in these examples are arbitrary, just modify to fit your needs. | |
`--ignore-case` is of no use with a pattern like `.` (arbitrary char). But it becomes important if you modify the pattern to something more specific like `[ACGT]`. | |
## History | |
* AM-WGBS-325 to AM-WGBS-330 (zebrafish), 5' end, 11/22 | |
* AM-WGBS-062 and some others (killifish), 3' end, 10/22 | |
-- The End | |