HAM-TBS project title: TBS_Mouse_Pilot_repeat
Responsible, Involved: Susann, Tobias, Simone, Darina, Mathias Schmidt, Elisabeth
Background/Aim:
Evaluate linearity of amplicons based on control
Assess methylation levels of CpGs in FKBP5 Murine DNA and look for regulation
Setup: Murine DNA Hippocampus 20 + Murine DNA Amygdala 20 + Murine DNA PVN 20
Tissue (H=Hippocampus/A=Amygdala/ P=PVN)
Run ID: 190523_M02431_0022_000000000-CGG57
We observe a significantly lower alignment rate than we usually get from the human samples. This is due to some issues with the primers, specifically of one amplicon 2_fk5_gre_pE, which results in an unalignable fraction of reads (~ up to 40% of our unaligned reads). Potentially excluding or re-designing this PCR would improve the performance. However, our linearity assessment for this amplicon showed very good results, the methylation data derived from it is absolutely usable. It has to be noted that the PCR itself does not seem to be over-represented in the pool. But still, for some samples the fraction of reads that cannot be aligned presenting with primers derived from this PCR is up to 40% suggesting that this artifact probably clusters better than our target. We know this behavior from index dimers that tend to overtake flow cells, however, this artifact is not short enough to exibit this behavior (I would have thought).
For more detail on the on the distribution of over-represented sequences, see section “QC - alignment efficiency”.
I would recommend to redesign this PCR 2_fk5_gre_pE if we are likely to use it more often. It could signifiicantly increase the yield. Additionally, this amplicon has an overlapping target sequence to amplicon 13_fk5_ctcf_5UTR. We usually try to avoid overalapping designs - is there a special reason for this?
Note on artefact status: Some amplicons show very high rates of artefacts. However, other features as linearity assessment of these amplicons are among the best of the panel!
The amplicons show good overall linearity. Only two amplicons may have linearity issues, but the evidence is not strong (failed replicates, atypical pattern). Compared to the amplicons assessed for the human FKBP5 panel, the amplicons here more often show mild bias (but no reason to exclude!). The amplicons that failed linearity in the human FKBP5 panel displayed a much stronger and pronounced bias than found anywhere here and with high consistency between replicates…
Amplicons with r2 < 0.95
10_fk5_gre_i5.1 does not show consistent pattern between the replicates. I cannot explain this since I have not come across this behaviour so far. One of the replicates presents with good linearity, the others show a bias in the 50% or 75% sample BUT the pattern does not align with the usual biased curve we observe.
15_fk5_ctcf_interg Here, we have the problem that only one replicate has all 5 data points present. Here, the pattern MAY be more in line with the usual bias we observe, however, also still rather mild compared to the failed amplicons in the human FKBP5 panel.
Methods:
Citation: To describe methods, please cite the HAM-TBS publication doi: 10.1186/s13072-018-0209-x. Please note the methods update that minimum Phred quality during methylation calling by methylKit was incresed to 30 (it used to be the default setting of 20).
Displayed are all amplicons included in this run.
<- read.delim("../GENOME/Amplicon_Location.bed", head = F)
amplicons colnames(amplicons) <- c("chr", "start", "stop", "amplicon")
<- amplicons %>% as_tibble() %>% mutate(amplicons, amplicon_length = amplicons$stop -
amplicons $start, amplicon = as.character(amplicon))
amplicons
paged_table(amplicons)
The targeted reference sequences was padded for alignment purpose (+/- 50bp up and down stream) => this needs to be regarded when calculating coordinates. Shift when reading in the data!
<- tibble(
cpg file = list.files(paste0("../05_methylKit"), pattern="CpG.txt", full.names=T)
%>%
) mutate(
sample = gsub(file, pattern="^.*methylKit/|_CpG.txt", replacement=""),
out = map(file, function(x) {
read.delim(x, head=T, sep="\t") %>%
as_tibble() %>%
mutate(
amplicon=gsub(chr, pattern="::.*$", replacement=""),
base = base - 50, # remove padding from targeted alignment coordinate
%>%
) select(-c(chrBase, chr, freqT)) %>%
rename(methylation = freqC)
})
)
<- cpg %>%
cpg unnest(out) %>% # Add amplicon annotation data
left_join(amplicons) %>%
filter(base > 20 & (amplicon_length - base) > 20) %>% # filter primer region (20 bp)
group_by(sample, amplicon) %>%
mutate(
max_coverage = max(coverage, na.rm = T),
is_artifact = coverage < max_coverage*0.3, # assumed at less than 30% of maximum coverage. 50% could be SNP
CpG_genomic = start+base,
median_coverage = median(coverage[!is_artifact], na.rm = T),
is_coverage_fail = median_coverage < 1000,
%>%
) separate(sample, sep = "-", into = c('region', 'timepoint', 'replicate'), remove = F) %>%
mutate(
region=recode_factor(factor(region, levels = c("Ctrl", "A", "H", "P")), Ctrl="control", A="amygdala", H="hippocampus", P="pfc"),
is_control = region == "control",
timepoint = factor(timepoint, levels = c("B", "4", "24", "0", "25", "50", "75", "100"))
)
# add plate layout
<-
cpg tibble(read.delim("plate_layout.csv", head=T, sep=";")) %>% mutate(
sample= gsub(Sample.ID, pattern="_", replacement = "-"),
Well=as.factor(Well)) %>%
rename(column=Well, row=Row) %>%
select(-Sample.ID) %>%
right_join(cpg)
# add flag id CpG is valid for analysis (requires a minimum of 3 replicates)
<-
cpg %>%
cpg filter(!is_control) %>%
group_by(region, amplicon, CpG_genomic, timepoint) %>%
nest() %>%
mutate(
n_replicates = map_dbl(data, ~ length(unique(.x$replicate))),
%>%
) group_by( region, amplicon, CpG_genomic ) %>% mutate(
n_timepoints = sum( n_replicates >= 3 ),
is_valid_for_analysis= n_replicates >= 3 & n_timepoints >= 2
%>% right_join(cpg) )
Extract conversion levels for each amplicon and each sample separately.
Track both mean and median conversion! This makes a difference here.
Check more closely! (TODO)
<- tibble(
chh file = list.files(paste0("../05_methylKit"), pattern="CHH.txt", full.names=T)
%>%
) mutate(
sample = gsub(file, pattern="^.*methylKit/|_CHH.txt", replacement=""),
out = map(file, function(x) {
read.delim(x, head=T, sep="\t") %>%
as_tibble() %>%
mutate(
amplicon=gsub(chr, pattern="::.*$", replacement=""),
base = base - 50, # remove padding from targeted alignment coordinate
%>%
) select(-c(chrBase, chr, freqT)) %>%
rename(methylation = freqC)
})
)
<- chh %>%
chh unnest(out) %>% # Add amplicon annotation data
left_join(amplicons) %>%
filter(base > 20 & (amplicon_length - base) > 20) %>% # filter primer region (20 bp)
group_by(sample, amplicon) %>%
mutate(
max_coverage = max(coverage, na.rm = T),
is_artifact = coverage < max_coverage*0.3, # assumed at less than 30% of maximum coverage. 50% could be SNP
CpG_genomic = start+base,
median_coverage = median(coverage[!is_artifact], na.rm = T),
is_coverage_fail = median_coverage < 1000
)
save(chh, file="01_CHH_data_raw_190523_M02431_0022_000000000-CGG57.RDS")
<- chh %>% filter( !is_artifact & !is_coverage_fail) %>%
chh summarize( mean_conversion= 100-mean(methylation),
median_conversion= 100-median(methylation),
is_conversion_fail= mean_conversion < 95)
Append CHH info to CpG data.
<- cpg %>% left_join(chh, by = c("sample", "amplicon"))
cpg rm(chh)
Save raw data.
saveRDS(cpg, file = "01_CpG_data_raw_190523_M02431_0022_000000000-CGG57.RDS")
Low alignment rate & very heterogeneous compared to the usual human TBS data. Whole genome alignment did not reveal off target hits. The low alignment rate seems to be rooted in the overrepresented sequences - see below.
As expected, alignment quality similar
<- tibble(file = list.files("../03_bismark/", pattern = ".txt", full.names = T),
dat sample = gsub(file, pattern = ".*//|_S.*$", replacement = ""), mapping_efficiency_pct = as.numeric(map_chr(file,
~gsub(system(paste0("grep \"Mapping efficiency:\" ", .x), intern = T),
pattern = ".*:.|%", replacement = "")))) %>% select(-file)
<- left_join(cpg, dat)
cpg
%>% # filter(!is_control) %>%
cpg ggplot(aes(x = timepoint, y = mapping_efficiency_pct, fill = region)) + geom_boxplot(width = 0.5,
alpha = 0.5) + geom_point(pch = 21, size = 2) + facet_grid(~region, scales = "free") +
labs(y = "Mapping efficiency [%]", x = "Timepoint") + theme(legend.position = "none")
In the unaligned reads some sequences were overrepresented at high proportions. The beginning of the sequence shows the presence of our primers at the start of the read, I cannot find the source of the sequence following it. It is - however - bisulfite converted which makes finding the source more complicated.
The same check was performed on whole genome alignment with similar outcome on the overrepresented sequences within the unaligned sequences
# read in primer info
<- tibble(read.delim("primer.csv", head = F, sep = ";")) %>% rename(amplicon = V1,
primers primer_number = V2, primer_sequence = V3) %>% select(-primer_number) %>%
mutate(primer_sequence = as.character(primer_sequence), primer_no = rep(c("forward_primer",
"reverse_primer"), 23), primer_length = nchar(primer_sequence), seq_start20 = str_sub(primer_sequence,
1, 20))
paged_table(primers)
# read in over-represented sequences per sample and merge with primer info
<- tibble(file = list.files(path = "../03_bismark/01_fastqc", pattern = "overrepresented_sequences.txt",
orseq recursive = T, full.names = T)) %>% mutate(sample = gsub(file, pattern = "^../03_bismark/01_fastqc/|_fastqc.*$",
replacement = ""), out = map(file, function(x) {
read.delim(x, head = T, sep = "\t") %>% as_tibble() %>% rename(sequence = X.Sequence) %>%
mutate(seq_start20 = str_sub(sequence, 1, 20))
%>% select(-file) %>% unnest(out) %>% select(-Possible.Source) %>% select(-sequence) %>%
})) group_by(sample, seq_start20) %>% summarize(count = sum(Count), percentage = sum(Percentage)) %>%
left_join(primers) %>% mutate(R1 = str_detect(sample, "_reads_1$")) %>%
separate(sample, sep = "-", into = c("region", "timepoint", "replicate"),
remove = F) %>% mutate(region = recode_factor(as.factor(region), Ctrl = "control",
A = "amygdala", H = "hippocampus", P = "pfc"), timepoint = as.factor(timepoint))
orseq
## # A tibble: 6,722 x 12
## # Groups: sample [174]
## sample region timepoint replicate seq_start20 count percentage amplicon
## <chr> <fct> <fct> <chr> <chr> <int> <dbl> <fct>
## 1 A-24-… amygd… 24 1_S83_L0… AAAAAAAAAA… 290 0.379 14_fk5_…
## 2 A-24-… amygd… 24 1_S83_L0… AAAAACAAAA… 83 0.108 <NA>
## 3 A-24-… amygd… 24 1_S83_L0… AAAAACAAAA… 27984 36.6 2_fk5_g…
## 4 A-24-… amygd… 24 1_S83_L0… AAAAACAAAA… 90 0.118 <NA>
## 5 A-24-… amygd… 24 1_S83_L0… AAAACAAAAC… 85 0.111 <NA>
## 6 A-24-… amygd… 24 1_S83_L0… AAAAGAGAGT… 25341 33.1 2_fk5_g…
## 7 A-24-… amygd… 24 1_S83_L0… AAAATCCTCT… 116 0.152 <NA>
## 8 A-24-… amygd… 24 1_S83_L0… AAACCAACCT… 246 0.321 9_fk5_g…
## 9 A-24-… amygd… 24 1_S83_L0… AAACCTTTAA… 85 0.111 12_fk5_…
## 10 A-24-… amygd… 24 1_S83_L0… AAAGAGAGTT… 453 0.592 <NA>
## # … with 6,712 more rows, and 4 more variables: primer_sequence <chr>,
## # primer_no <chr>, primer_length <int>, R1 <lgl>
# booxplot of percentages per amplicon
ggplot(subset(orseq, !is.na(primer_no)), aes(x = amplicon, y = percentage, col = R1)) +
geom_point() + labs(title = "primer presence in unaligned sequences", y = "percentage per sample") +
facet_wrap(~primer_no, nrow = 2) + theme(axis.text.x = element_text(angle = 90))
Display pooling ratio based on aligned reads per amplicon:
%>% # filter(!is_control) %>%
cpg select(sample, median_coverage, region, timepoint, amplicon) %>% unique() %>%
ggplot(aes(x = amplicon, y = median_coverage, col = timepoint)) + geom_point() +
# labs(title='primer presence in unaligned sequences', y='percentage per
# sample') +
facet_wrap(~region, nrow = 4) + theme(axis.text.x = element_text(angle = 90))
Pooling ration is very stable across amplicons. In which order is pooling performed? -> pooling oder could explain this pattern. (TODO ask Susi for detail)
In a previous meeting we hypothesized that the artefact status could be related to the specific strain of mice and differences in the genome compared to the reference assembly. However, on second thought, that would present as a SNP with 100% or 50% coverage and not as an artifact. The difference in strain can therefore not be the source of this.
It has to be noted that despite rather high artifact rates the linearity assessment using the control DNA
<- cpg %>% group_by(region, timepoint, replicate, sample, amplicon,
cpg_sum %>% summarize(n_artifacts = sum(is_artifact), n_coverage_fail = sum(is_coverage_fail),
is_control) n_good_quality_artifact = sum(is_artifact, !is_coverage_fail))
cpg_sum
## # A tibble: 1,735 x 9
## # Groups: region, timepoint, replicate, sample, amplicon [1,735]
## region timepoint replicate sample amplicon is_control n_artifacts
## <fct> <fct> <chr> <chr> <chr> <lgl> <int>
## 1 contr… 0 1 Ctrl-… 1_fk5_g… TRUE 2
## 2 contr… 0 1 Ctrl-… 10_fk5_… TRUE 0
## 3 contr… 0 1 Ctrl-… 10_fk5_… TRUE 12
## 4 contr… 0 1 Ctrl-… 10_fk5_… TRUE 23
## 5 contr… 0 1 Ctrl-… 11_fk5_… TRUE 2
## 6 contr… 0 1 Ctrl-… 11_fk5_… TRUE 1
## 7 contr… 0 1 Ctrl-… 12_fk5_… TRUE 2
## 8 contr… 0 1 Ctrl-… 12_fk5_… TRUE 2
## 9 contr… 0 1 Ctrl-… 13_fk5_… TRUE 0
## 10 contr… 0 1 Ctrl-… 14_fk5_… TRUE 6
## # … with 1,725 more rows, and 2 more variables: n_coverage_fail <int>,
## # n_good_quality_artifact <int>
Plot displays the number of artefacts removed per sample/amplicon. Quite stable per amplicon between samples.
ggplot(cpg_sum, aes(x = amplicon, y = sample, fill = n_good_quality_artifact)) +
geom_tile() + geom_text(aes(label = n_good_quality_artifact), size = 1.5) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) + scale_fill_distiller(type = "seq",
direction = 1) + theme(legend.position = "none")
Sum of artefacts per time point.
%>% group_by(region, timepoint, replicate, sample) %>% summarize(total_artifacts = sum(n_good_quality_artifact)) %>%
cpg_sum ggplot(aes(x = timepoint, y = total_artifacts)) + geom_boxplot() + geom_point(aes(fill = replicate),
pch = 21, size = 2) + facet_grid(. ~ region, scales = "free_x") + theme(legend.position = "none")
Sum of artefacts per amplicon.
%>% group_by(amplicon) %>% summarize(total_artifacts = sum(n_good_quality_artifact)) %>%
cpg_sum ggplot(aes(x = amplicon, y = total_artifacts)) + geom_col() + coord_flip() +
theme(legend.position = "none")
Coverage per sample per amplicon:
%>% select(sample, amplicon, median_coverage, region) %>% unique() %>% ggplot(aes(x = amplicon,
cpg y = sample, fill = log2(median_coverage))) + geom_tile() + geom_text(aes(label = round(median_coverage)),
size = 1.5) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_fill_distiller(type = "seq", direction = 1) + theme(legend.position = "none")
Same plot as above - colored by threshold (minimum coverage of 1000):
%>% select(sample, amplicon, median_coverage, region) %>% unique() %>% ggplot(aes(x = amplicon,
cpg y = sample, fill = median_coverage < 1000)) + geom_tile() + geom_text(aes(label = round(median_coverage)),
size = 1.5) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_fill_discrete(direction = -1, na.value = "grey50") + theme(legend.position = "none")
## Coverage status
I checked the missing amplicons. These are expected to be missing, Susi already saw that they did not work:
paged_table(anti_join(amplicons, cpg, by = "amplicon"))
Amplicons to be excluded from differentail analysis only due to lack of coverage in more than 50% of the samples. Add boolean indicator.
<- cpg %>% filter(!is_control) %>% select(sample, amplicon, median_coverage) %>%
cpg unique() %>% group_by(amplicon) %>% summarise(is_valid_amplicon = (sum(median_coverage >=
1000)/n()) > 0.5) %>% right_join(cpg)
%>% filter(!is_control & !is_valid_amplicon) %>% select(chr, start, stop,
cpg %>% unique() %>% paged_table() amplicon, amplicon_length)
-> These 6 amplicons were not included in this analysis.
%>% select(sample, amplicon, mean_conversion, region) %>% unique() %>% ggplot(aes(x = amplicon,
cpg y = sample, fill = mean_conversion > 95)) + geom_tile() + geom_text(aes(label = round(mean_conversion,
1)), size = 1.5) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_fill_discrete(direction = -1, na.value = "grey50") + theme(legend.position = "none")
%>% select(sample, amplicon, median_conversion, region) %>% unique() %>%
cpg ggplot(aes(x = amplicon, y = sample, fill = median_conversion > 95)) + geom_tile() +
geom_text(aes(label = round(median_conversion, 1)), size = 1.5) + theme(axis.text.x = element_text(angle = 90,
hjust = 1)) + scale_fill_discrete(direction = -1, na.value = "grey50") +
theme(legend.position = "none")
# filter out data points that fail coverage or conversion standards or are
# artefacts
<- cpg %>% filter(!is_artifact & !is_coverage_fail & !is_conversion_fail &
cpg %>% select(-c(is_artifact, is_coverage_fail, is_conversion_fail,
is_valid_amplicon)
is_valid_amplicon))saveRDS(cpg, file = "01_CpG_data_filtered_190523_M02431_0022_000000000-CGG57.RDS")
Two amplicons are overlapping and one CpG is quantified in both instances. Check how similar the quantification is!
# are there CpGs that are present on more than one amplicon?
<- cpg %>% select(sample, CpG_genomic) %>% group_by(CpG_genomic,
ambiguos_cpg %>% count() %>% filter(n > 1) %>% ungroup() %>% select(CpG_genomic) %>%
sample) unique()
%>% filter(CpG_genomic == ambiguos_cpg$CpG_genomic & !is_control) %>% select(sample,
cpg %>% pivot_wider(id_cols = sample, names_from = amplicon,
methylation, amplicon) values_from = methylation) %>% mutate(delta = `13_fk5_ctcf_5UTR` - `2_fk5_gre_pE`) %>%
select(sample, delta) %>% right_join(subset(cpg, CpG_genomic == ambiguos_cpg$CpG_genomic &
!is_control, select = c(sample, region, timepoint))) %>% ggplot(aes(x = region,
y = delta, col = timepoint)) + geom_boxplot() + labs(x = "", y = "delta % methylation") +
theme_minimal()
Pretty big differences in my opinion. Need to discuss this.
# exclude control and only keep CpGs which have enough data points to be
# analyzed
<- cpg %>% filter(!is_control & is_valid_for_analysis) %>% group_by(region) %>%
cpg_pca nest() %>% mutate(pca = map(data, ~prcomp(.x %>% select(CpG_genomic, methylation)))) # %>%
# mutate( pca_graph = map( pca, ~ autoplot(.x) ) )
Check for batches: - row - column - alignment rate
Any others I am missing?
<-
batches %>% filter( !is_control & is_valid ) %>% droplevels() %>%
cpg group_by( CpG_genomic ) %>% nest() %>%
mutate( lm_batch = map(data, ~broom::tidy( lm(methylation ~ row + column + alignment_rate , data=.x ) ))
)
<- cpg %>%
cpg_filtered filter(is_valid) %>%
unnest() %>%
group_by(region, amplicon, CpG_genomic) %>%
nest() %>%
mutate(
mean_methylation=map_dbl(data, ~ mean(.x$methylation)),
#shapiro = map(data, ~ broom::tidy( shapiro.test(.x$methylation))),
lm = map(data, ~broom::tidy( lm(methylation ~ timepoint , data=.x ) ) ),
cooks.d= map(lm, ~ cooks.distance(.x) ),
lm_sig= map_lgl(lm, ~ sum( .x$lm$p.value<0.5 )>1 )
%>%
) unnest(lm)
<- cpg_filtered %>% mutate(
cpg_filteredq.value=p.adjust(p.value)
)
Note:
The original intention was to produce 1 common control (0 to 100) and derive 3 replicates before bisulfite conversion to account for variance in conversion efficiency and downstream quantification.
In this case we produced the replicates differently which needs to be factored into the model. 1. 3 instances of 0% or 100% methylation (not sure if this is methylated or demethylated) are the starting material 2. Each of the instances was then (de)methylated to 0 or 100% 3. 25/50/75 titration generated from 0% and 100% control DNA 4. bisulfite conversion and downstream quantification
<- cpg %>% filter(is_control) %>% select(-c(is_control, region)) %>% group_by(amplicon,
bac %>% nest() %>% mutate(n_replicates = map_dbl(data,
CpG_genomic, timepoint) ~length(unique(.x$replicate))), ) %>% group_by(amplicon, CpG_genomic) %>%
mutate(n_timepoints = length(unique(timepoint)), is_valid = n_timepoints >=
2) %>% filter(is_valid) %>% droplevels() %>% mutate(level = as.numeric(as.character(timepoint))) %>%
unnest()
<- bac %>% group_by(amplicon) %>% nest() %>% mutate(mod_sum = map(data,
bac_rsquared ~summary(lm(methylation ~ level + replicate, data = .x, na.action = na.omit))),
r_squared = map_dbl(mod_sum, ~.x$r.squared)) %>% select(amplicon, r_squared)
<- left_join(bac, bac_rsquared) bac
Plot linearity for replicate 1
-> Replicate 1 has a bias (25% level). Here, it seems the titration was scewed. Exclude from linearity check!
Plot linearity for replicate 2
Plot linearity for replicate 3
<- cpg %>% filter(is_control) %>% select(-is_control) %>% droplevels() %>%
bac mutate(level = as.numeric(as.character(timepoint))) %>% filter(!(replicate ==
1 & level == 25)) # exclude rep 1 / 25%
<- bac %>% group_by(amplicon) %>% nest() %>% mutate(mod_sum = map(data,
bac_rsquared ~summary(lm(methylation ~ level + replicate, data = .x, na.action = na.omit))),
r_squared = map_dbl(mod_sum, ~.x$r.squared)) %>% select(amplicon, r_squared)
<- left_join(bac, bac_rsquared) bac
Plot linearity (replicate 1 / 25% was excluded here). r2 displayed per amplicon.
ggplot(bac, aes ( x=level, y=methylation , col=replicate) ) +
geom_smooth(method="lm", se=F, na.rm=T, col="lightgrey", size=0.5) + # dirty - take from actual model (TODO)
geom_smooth(method="loess", se=F, na.rm=T, col="lightgreen", size=0.5) +
geom_point(size=0.5,alpha=0.5) +
geom_text( data = bac_rsquared, aes(x=50,y=90,label=round(r_squared, digits = 2)) , col="grey") +
theme(legend.position = "none") +
facet_wrap(~amplicon)
r2 values are above 0.95 for all but 2 amplicons:
paged_table(subset(bac_rsquared, r_squared < 0.95))
Look at amplicons with r2 < 0.95 in more detail.
Other QC features looked fine!
Replicates do not show same scew. - Replicate 1: looks fine - Replicate 2: scewed 75% data point - Replicate 3: scewed 50% data point
This does not look conclusive to me.
%>% ungroup() %>% filter(r_squared<0.95 & amplicon=="10_fk5_gre_i5.1" ) %>% droplevels() %>%
bac ggplot( aes ( x=level, y=methylation , col=replicate) ) +
ggtitle("10_fk5_gre_i5.1") +
geom_smooth(method="lm", se=F, na.rm=T, col="lightgrey", size=0.5) + # dirty - take from actual model (TODO)
geom_smooth(method="loess", se=F, na.rm=T, col="lightgreen", size=0.5) +
geom_point(size=0.5,alpha=0.5) +
theme(legend.position = "none") +
facet_wrap(~ replicate )
Other QC features looked fine!
Replicate 1 and 2 are missing important data points. Replicate 3 is complete and show more of a mild but classical scew indicative of an amplification bias. However, replicates also don’t fully agree.
Also not a clear picture here, I think it may be worth while to rerun this.
Note: adding the “below coverage” data points didn’t add more info, rep 1 was very low at 25%, rep 2 was in line with 1 and 3 at 50%.
%>% ungroup() %>% filter(r_squared<0.95 & amplicon=="15_fk5_ctcf_interg" ) %>% droplevels() %>%
bac ggplot( aes ( x=level, y=methylation , col=replicate) ) +
ggtitle("15_fk5_ctcf_interg") +
geom_smooth(method="lm", se=F, na.rm=T, col="lightgrey", size=0.5) + # dirty - take from actual model (TODO)
geom_smooth(method="loess", se=F, na.rm=T, col="lightgreen", size=0.5) +
geom_point(size=0.5,alpha=0.5) +
theme(legend.position = "none") +
facet_wrap(~ replicate )
=readRDS("01_CpG_data_raw_190523_M02431_0022_000000000-CGG57.RDS") %>% rename(level=timepoint)
tmp
%>% filter(amplicon=="15_fk5_ctcf_interg" & is_control) %>% droplevels() %>%
tmp ggplot( aes ( x=level, y=methylation , col=replicate) ) +
ggtitle("15_fk5_ctcf_interg") +
geom_smooth(method="lm", se=F, na.rm=T, col="lightgrey", size=0.5) + # dirty - take from actual model (TODO)
geom_smooth(method="loess", se=F, na.rm=T, col="lightgreen", size=0.5) +
geom_point(size=0.5,alpha=0.5) +
theme(legend.position = "none") +
facet_wrap(~ replicate )
rm(tmp)
First, determine which CpGs can be evaluated ( not all samples passed filters -> e.g. some CpGs only have data on 1 time point )
CpGs included for analysis should have: - at least two time points - at least 3 replicates per time point
<- cpg %>% filter(!is_control) %>% group_by(region, amplicon, CpG_genomic,
cpg %>% nest() %>% mutate(n_replicates = map_dbl(data, ~length(unique(.x$replicate))),
timepoint) %>% group_by(region, amplicon, CpG_genomic) %>% mutate(n_timepoints = sum(n_replicates >=
) 3), is_valid = n_replicates >= 3 & n_timepoints >= 2)
# cpg %>% filter(region=='P' & amplicon == '1_fk5_gre_pE' & CpG_genomic ==
# 28660045)
<- cpg %>%
cpg_filtered filter(is_valid) %>%
unnest() %>%
group_by(region, amplicon, CpG_genomic) %>%
nest() %>%
mutate(
mean_methylation=map_dbl(data, ~ mean(.x$methylation)),
#shapiro = map(data, ~ broom::tidy( shapiro.test(.x$methylation))),
lm = map(data, ~broom::tidy( lm(methylation ~ timepoint , data=.x ) ) ),
cooks.d= map(lm, ~ cooks.distance(.x) ),
lm_sig= map_lgl(lm, ~ sum( .x$lm$p.value<0.5 )>1 )
%>%
) unnest(lm)
<- cpg_filtered %>% mutate(
cpg_filteredq.value=p.adjust(p.value)
)
ggplot(cpg_filtered, aes(x=mean_methylation, y=q.value)) +
geom_point()
#test <- cpg_filtered %>% filter( term!="(Intercept)" & p.value<0.1 ) %>% group_by(region, amplicon) %>% select(region, amplicon) %>% unique() %>% mutate(
# plot= my_amplicon_plot(cpg_filtered, region, amplicon)
#)
ggplot(as.data.frame(cpg_filtered[289, "data"][[1]]), aes(x=sample, y=methylation)) +
geom_boxplot()
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] formatR_1.6 rmarkdown_2.7 gridExtra_2.3 ggfortify_0.4.11
## [5] forcats_0.4.0 stringr_1.4.0 dplyr_1.0.5 purrr_0.3.4
## [9] readr_1.3.1 tidyr_1.1.0 tibble_3.0.1 ggplot2_3.2.1
## [13] tidyverse_1.3.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.0 xfun_0.21 reshape2_1.4.3
## [4] haven_2.2.0 lattice_0.20-38 colorspace_1.4-1
## [7] vctrs_0.3.6 generics_0.0.2 htmltools_0.3.6
## [10] yaml_2.2.0 utf8_1.1.4 rlang_0.4.10
## [13] pillar_1.4.4 withr_2.1.2 glue_1.4.1
## [16] DBI_1.0.0 RColorBrewer_1.1-2 dbplyr_1.4.2
## [19] modelr_0.1.5 readxl_1.3.1 plyr_1.8.4
## [22] lifecycle_1.0.0 munsell_0.5.0 gtable_0.3.0
## [25] cellranger_1.1.0 rvest_0.3.5 codetools_0.2-16
## [28] evaluate_0.13 labeling_0.3 knitr_1.23
## [31] fansi_0.4.1 broom_0.5.4 Rcpp_1.0.4.6
## [34] scales_1.0.0 backports_1.1.4 jsonlite_1.6.1
## [37] fs_1.3.1 hms_0.5.3 digest_0.6.25
## [40] stringi_1.4.6 cli_2.0.2 tools_3.6.0
## [43] magrittr_1.5 lazyeval_0.2.2 crayon_1.3.4
## [46] pkgconfig_2.0.3 ellipsis_0.3.1 xml2_1.2.2
## [49] reprex_0.3.0 lubridate_1.7.4 assertthat_0.2.1
## [52] httr_1.4.1 rstudioapi_0.11 R6_2.4.1
## [55] nlme_3.1-140 compiler_3.6.0