1 Description

HAM-TBS project title: TBS_Mouse_Pilot_repeat

Responsible, Involved: Susann, Tobias, Simone, Darina, Mathias Schmidt, Elisabeth

Background/Aim:

Evaluate linearity of amplicons based on control
Assess methylation levels of CpGs in FKBP5 Murine DNA and look for regulation
Setup: Murine DNA Hippocampus 20 + Murine DNA Amygdala 20 + Murine DNA PVN 20

Tissue (H=Hippocampus/A=Amygdala/ P=PVN)

23 PCRs of in vitro methylated control DNA ( Bac)
3x 5 controls (0%, 25%, 50%, 75%, 100%)
methylation levels of control DNA were verified by PSQ

Run ID: 190523_M02431_0022_000000000-CGG57

72 samples / 15 BAC controls
20 Amplicons / 23 Amplicons in BAC control

2 SUMMARY of results

2.1 Data quality

We observe a significantly lower alignment rate than we usually get from the human samples. This is due to some issues with the primers, specifically of one amplicon 2_fk5_gre_pE, which results in an unalignable fraction of reads (~ up to 40% of our unaligned reads). Potentially excluding or re-designing this PCR would improve the performance. However, our linearity assessment for this amplicon showed very good results, the methylation data derived from it is absolutely usable. It has to be noted that the PCR itself does not seem to be over-represented in the pool. But still, for some samples the fraction of reads that cannot be aligned presenting with primers derived from this PCR is up to 40% suggesting that this artifact probably clusters better than our target. We know this behavior from index dimers that tend to overtake flow cells, however, this artifact is not short enough to exibit this behavior (I would have thought).

For more detail on the on the distribution of over-represented sequences, see section “QC - alignment efficiency”.

I would recommend to redesign this PCR 2_fk5_gre_pE if we are likely to use it more often. It could signifiicantly increase the yield. Additionally, this amplicon has an overlapping target sequence to amplicon 13_fk5_ctcf_5UTR. We usually try to avoid overalapping designs - is there a special reason for this?

Note on artefact status: Some amplicons show very high rates of artefacts. However, other features as linearity assessment of these amplicons are among the best of the panel!

2.2 Linearity

The amplicons show good overall linearity. Only two amplicons may have linearity issues, but the evidence is not strong (failed replicates, atypical pattern). Compared to the amplicons assessed for the human FKBP5 panel, the amplicons here more often show mild bias (but no reason to exclude!). The amplicons that failed linearity in the human FKBP5 panel displayed a much stronger and pronounced bias than found anywhere here and with high consistency between replicates…

Amplicons with r2 < 0.95

10_fk5_gre_i5.1 does not show consistent pattern between the replicates. I cannot explain this since I have not come across this behaviour so far. One of the replicates presents with good linearity, the others show a bias in the 50% or 75% sample BUT the pattern does not align with the usual biased curve we observe.
15_fk5_ctcf_interg Here, we have the problem that only one replicate has all 5 data points present. Here, the pattern MAY be more in line with the usual bias we observe, however, also still rather mild compared to the failed amplicons in the human FKBP5 panel.

2.3 Differential analysis of brain samples

3 Pre-processing info

Methods:

QC: FastQC
Trimming: cutadapt v1.11 (min read length: 100bp)
Reference genome is hg19 is used to extract sequence of amplicons. 50bp padding is added on each side.
Alignment: Bismark v0.18.2
Stitching: We remove overlapping/redundant ends of reads symmetrically (in-house perl script) as sequence quality drops towards the end of each read.
Methylation calling (CpG and CHH): methylKit version 1.6.3, minimum Phred quality INCREASED TO 30 (used to be default 20)

Citation: To describe methods, please cite the HAM-TBS publication doi: 10.1186/s13072-018-0209-x. Please note the methods update that minimum Phred quality during methylation calling by methylKit was incresed to 30 (it used to be the default setting of 20).

4 Assemble data

4.1 Amplicon data

Displayed are all amplicons included in this run.

amplicons <- read.delim("../GENOME/Amplicon_Location.bed", head = F)
colnames(amplicons) <- c("chr", "start", "stop", "amplicon")
amplicons <- amplicons %>% as_tibble() %>% mutate(amplicons, amplicon_length = amplicons$stop - 
  amplicons$start, amplicon = as.character(amplicon))

paged_table(amplicons)

4.2 Raw methylation data

The targeted reference sequences was padded for alignment purpose (+/- 50bp up and down stream) => this needs to be regarded when calculating coordinates. Shift when reading in the data!

cpg <- tibble(
  file = list.files(paste0("../05_methylKit"), pattern="CpG.txt", full.names=T)
) %>% 
  mutate(
    sample = gsub(file, pattern="^.*methylKit/|_CpG.txt", replacement=""),
    out = map(file, function(x) {
      read.delim(x, head=T, sep="\t") %>% 
        as_tibble() %>%
        mutate(
          amplicon=gsub(chr, pattern="::.*$", replacement=""), 
          base = base - 50, # remove padding from targeted alignment coordinate
        ) %>% 
        select(-c(chrBase, chr, freqT)) %>% 
        rename(methylation = freqC)
    })
  )

cpg <- cpg %>% 
  unnest(out) %>% # Add amplicon annotation data
  left_join(amplicons) %>% 
  filter(base > 20 & (amplicon_length - base) > 20) %>% # filter primer region (20 bp)
  group_by(sample, amplicon) %>% 
  mutate(
    max_coverage = max(coverage, na.rm = T),
    is_artifact = coverage < max_coverage*0.3, # assumed at less than 30% of maximum coverage. 50% could be SNP
    CpG_genomic = start+base, 
    median_coverage = median(coverage[!is_artifact], na.rm = T), 
    is_coverage_fail = median_coverage < 1000,
  ) %>% 
  separate(sample, sep = "-", into = c('region', 'timepoint', 'replicate'), remove = F) %>%
  mutate( 
    region=recode_factor(factor(region, levels = c("Ctrl", "A", "H", "P")), Ctrl="control", A="amygdala", H="hippocampus", P="pfc"), 
    is_control = region == "control", 
    timepoint = factor(timepoint, levels = c("B", "4", "24", "0", "25", "50", "75", "100"))
    )


# add plate layout
cpg <- 
  tibble(read.delim("plate_layout.csv", head=T, sep=";")) %>% mutate( 
  sample= gsub(Sample.ID, pattern="_", replacement = "-"),
  Well=as.factor(Well)) %>%
  rename(column=Well, row=Row) %>% 
  select(-Sample.ID) %>%
  right_join(cpg)


# add flag id CpG is valid for analysis (requires a minimum of 3 replicates)
cpg <- 
  cpg %>% 
  filter(!is_control) %>% 
  group_by(region, amplicon, CpG_genomic, timepoint) %>% 
  nest() %>% 
  mutate(
    n_replicates = map_dbl(data, ~ length(unique(.x$replicate))),
    ) %>%
  group_by( region, amplicon, CpG_genomic ) %>% mutate(
    n_timepoints = sum( n_replicates >= 3 ),
    is_valid_for_analysis= n_replicates >= 3 & n_timepoints >= 2
  ) %>% right_join(cpg)

4.3 Raw CHH conversion data

Extract conversion levels for each amplicon and each sample separately.
Track both mean and median conversion! This makes a difference here.
Check more closely! (TODO)

chh <- tibble(
  file = list.files(paste0("../05_methylKit"), pattern="CHH.txt", full.names=T)
) %>% 
  mutate(
    sample = gsub(file, pattern="^.*methylKit/|_CHH.txt", replacement=""),
    out = map(file, function(x) {
      read.delim(x, head=T, sep="\t") %>% 
        as_tibble() %>%
        mutate(
          amplicon=gsub(chr, pattern="::.*$", replacement=""), 
          base = base - 50, # remove padding from targeted alignment coordinate
        ) %>% 
        select(-c(chrBase, chr, freqT)) %>% 
        rename(methylation = freqC)
    })
  )

chh <- chh %>% 
  unnest(out) %>% # Add amplicon annotation data
  left_join(amplicons) %>% 
  filter(base > 20 & (amplicon_length - base) > 20) %>% # filter primer region (20 bp)
  group_by(sample, amplicon) %>% 
  mutate(
    max_coverage = max(coverage, na.rm = T),
    is_artifact = coverage < max_coverage*0.3, # assumed at less than 30% of maximum coverage. 50% could be SNP
    CpG_genomic = start+base, 
    median_coverage = median(coverage[!is_artifact], na.rm = T), 
    is_coverage_fail = median_coverage < 1000
  ) 

save(chh, file="01_CHH_data_raw_190523_M02431_0022_000000000-CGG57.RDS")

chh <- chh %>%  filter( !is_artifact & !is_coverage_fail) %>% 
  summarize( mean_conversion= 100-mean(methylation),
             median_conversion= 100-median(methylation),
             is_conversion_fail= mean_conversion < 95)

Append CHH info to CpG data.

cpg <- cpg %>% left_join(chh, by = c("sample", "amplicon"))
rm(chh)

Save raw data.

saveRDS(cpg, file = "01_CpG_data_raw_190523_M02431_0022_000000000-CGG57.RDS")

5 QC

5.1 Alignment efficiency

Low alignment rate & very heterogeneous compared to the usual human TBS data. Whole genome alignment did not reveal off target hits. The low alignment rate seems to be rooted in the overrepresented sequences - see below.

As expected, alignment quality similar

dat <- tibble(file = list.files("../03_bismark/", pattern = ".txt", full.names = T), 
  sample = gsub(file, pattern = ".*//|_S.*$", replacement = ""), mapping_efficiency_pct = as.numeric(map_chr(file, 
    ~gsub(system(paste0("grep \"Mapping efficiency:\" ", .x), intern = T), 
      pattern = ".*:.|%", replacement = "")))) %>% select(-file)

cpg <- left_join(cpg, dat)

cpg %>% # filter(!is_control) %>%
ggplot(aes(x = timepoint, y = mapping_efficiency_pct, fill = region)) + geom_boxplot(width = 0.5, 
  alpha = 0.5) + geom_point(pch = 21, size = 2) + facet_grid(~region, scales = "free") + 
  labs(y = "Mapping efficiency [%]", x = "Timepoint") + theme(legend.position = "none")

5.1.1 Overrepresented sequences in unaligned reads

In the unaligned reads some sequences were overrepresented at high proportions. The beginning of the sequence shows the presence of our primers at the start of the read, I cannot find the source of the sequence following it. It is - however - bisulfite converted which makes finding the source more complicated.

The same check was performed on whole genome alignment with similar outcome on the overrepresented sequences within the unaligned sequences

# read in primer info
primers <- tibble(read.delim("primer.csv", head = F, sep = ";")) %>% rename(amplicon = V1, 
  primer_number = V2, primer_sequence = V3) %>% select(-primer_number) %>% 
  mutate(primer_sequence = as.character(primer_sequence), primer_no = rep(c("forward_primer", 
    "reverse_primer"), 23), primer_length = nchar(primer_sequence), seq_start20 = str_sub(primer_sequence, 
    1, 20))

paged_table(primers)

# read in over-represented sequences per sample and merge with primer info
orseq <- tibble(file = list.files(path = "../03_bismark/01_fastqc", pattern = "overrepresented_sequences.txt", 
  recursive = T, full.names = T)) %>% mutate(sample = gsub(file, pattern = "^../03_bismark/01_fastqc/|_fastqc.*$", 
  replacement = ""), out = map(file, function(x) {
  read.delim(x, head = T, sep = "\t") %>% as_tibble() %>% rename(sequence = X.Sequence) %>% 
    mutate(seq_start20 = str_sub(sequence, 1, 20))
})) %>% select(-file) %>% unnest(out) %>% select(-Possible.Source) %>% select(-sequence) %>% 
  group_by(sample, seq_start20) %>% summarize(count = sum(Count), percentage = sum(Percentage)) %>% 
  left_join(primers) %>% mutate(R1 = str_detect(sample, "_reads_1$")) %>% 
  separate(sample, sep = "-", into = c("region", "timepoint", "replicate"), 
    remove = F) %>% mutate(region = recode_factor(as.factor(region), Ctrl = "control", 
  A = "amygdala", H = "hippocampus", P = "pfc"), timepoint = as.factor(timepoint))

orseq

## # A tibble: 6,722 x 12
## # Groups:   sample [174]
##    sample region timepoint replicate seq_start20 count percentage amplicon
##    <chr>  <fct>  <fct>     <chr>     <chr>       <int>      <dbl> <fct>   
##  1 A-24-… amygd… 24        1_S83_L0… AAAAAAAAAA…   290      0.379 14_fk5_…
##  2 A-24-… amygd… 24        1_S83_L0… AAAAACAAAA…    83      0.108 <NA>    
##  3 A-24-… amygd… 24        1_S83_L0… AAAAACAAAA… 27984     36.6   2_fk5_g…
##  4 A-24-… amygd… 24        1_S83_L0… AAAAACAAAA…    90      0.118 <NA>    
##  5 A-24-… amygd… 24        1_S83_L0… AAAACAAAAC…    85      0.111 <NA>    
##  6 A-24-… amygd… 24        1_S83_L0… AAAAGAGAGT… 25341     33.1   2_fk5_g…
##  7 A-24-… amygd… 24        1_S83_L0… AAAATCCTCT…   116      0.152 <NA>    
##  8 A-24-… amygd… 24        1_S83_L0… AAACCAACCT…   246      0.321 9_fk5_g…
##  9 A-24-… amygd… 24        1_S83_L0… AAACCTTTAA…    85      0.111 12_fk5_…
## 10 A-24-… amygd… 24        1_S83_L0… AAAGAGAGTT…   453      0.592 <NA>    
## # … with 6,712 more rows, and 4 more variables: primer_sequence <chr>,
## #   primer_no <chr>, primer_length <int>, R1 <lgl>

# booxplot of percentages per amplicon

ggplot(subset(orseq, !is.na(primer_no)), aes(x = amplicon, y = percentage, col = R1)) + 
  geom_point() + labs(title = "primer presence in unaligned sequences", y = "percentage per sample") + 
  facet_wrap(~primer_no, nrow = 2) + theme(axis.text.x = element_text(angle = 90))

Display pooling ratio based on aligned reads per amplicon:

cpg %>% # filter(!is_control) %>%
select(sample, median_coverage, region, timepoint, amplicon) %>% unique() %>% 
  ggplot(aes(x = amplicon, y = median_coverage, col = timepoint)) + geom_point() + 
  # labs(title='primer presence in unaligned sequences', y='percentage per
# sample') +
facet_wrap(~region, nrow = 4) + theme(axis.text.x = element_text(angle = 90))

Pooling ration is very stable across amplicons. In which order is pooling performed? -> pooling oder could explain this pattern. (TODO ask Susi for detail)

5.2 Artifact status

In a previous meeting we hypothesized that the artefact status could be related to the specific strain of mice and differences in the genome compared to the reference assembly. However, on second thought, that would present as a SNP with 100% or 50% coverage and not as an artifact. The difference in strain can therefore not be the source of this.

It has to be noted that despite rather high artifact rates the linearity assessment using the control DNA

cpg_sum <- cpg %>% group_by(region, timepoint, replicate, sample, amplicon, 
  is_control) %>% summarize(n_artifacts = sum(is_artifact), n_coverage_fail = sum(is_coverage_fail), 
  n_good_quality_artifact = sum(is_artifact, !is_coverage_fail))
cpg_sum

## # A tibble: 1,735 x 9
## # Groups:   region, timepoint, replicate, sample, amplicon [1,735]
##    region timepoint replicate sample amplicon is_control n_artifacts
##    <fct>  <fct>     <chr>     <chr>  <chr>    <lgl>            <int>
##  1 contr… 0         1         Ctrl-… 1_fk5_g… TRUE                 2
##  2 contr… 0         1         Ctrl-… 10_fk5_… TRUE                 0
##  3 contr… 0         1         Ctrl-… 10_fk5_… TRUE                12
##  4 contr… 0         1         Ctrl-… 10_fk5_… TRUE                23
##  5 contr… 0         1         Ctrl-… 11_fk5_… TRUE                 2
##  6 contr… 0         1         Ctrl-… 11_fk5_… TRUE                 1
##  7 contr… 0         1         Ctrl-… 12_fk5_… TRUE                 2
##  8 contr… 0         1         Ctrl-… 12_fk5_… TRUE                 2
##  9 contr… 0         1         Ctrl-… 13_fk5_… TRUE                 0
## 10 contr… 0         1         Ctrl-… 14_fk5_… TRUE                 6
## # … with 1,725 more rows, and 2 more variables: n_coverage_fail <int>,
## #   n_good_quality_artifact <int>

Plot displays the number of artefacts removed per sample/amplicon. Quite stable per amplicon between samples.

ggplot(cpg_sum, aes(x = amplicon, y = sample, fill = n_good_quality_artifact)) + 
  geom_tile() + geom_text(aes(label = n_good_quality_artifact), size = 1.5) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + scale_fill_distiller(type = "seq", 
  direction = 1) + theme(legend.position = "none")

Sum of artefacts per time point.

cpg_sum %>% group_by(region, timepoint, replicate, sample) %>% summarize(total_artifacts = sum(n_good_quality_artifact)) %>% 
  ggplot(aes(x = timepoint, y = total_artifacts)) + geom_boxplot() + geom_point(aes(fill = replicate), 
  pch = 21, size = 2) + facet_grid(. ~ region, scales = "free_x") + theme(legend.position = "none")

Sum of artefacts per amplicon.

cpg_sum %>% group_by(amplicon) %>% summarize(total_artifacts = sum(n_good_quality_artifact)) %>% 
  ggplot(aes(x = amplicon, y = total_artifacts)) + geom_col() + coord_flip() + 
  theme(legend.position = "none")

Coverage per sample per amplicon:

cpg %>% select(sample, amplicon, median_coverage, region) %>% unique() %>% ggplot(aes(x = amplicon, 
  y = sample, fill = log2(median_coverage))) + geom_tile() + geom_text(aes(label = round(median_coverage)), 
  size = 1.5) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  scale_fill_distiller(type = "seq", direction = 1) + theme(legend.position = "none")

Same plot as above - colored by threshold (minimum coverage of 1000):

cpg %>% select(sample, amplicon, median_coverage, region) %>% unique() %>% ggplot(aes(x = amplicon, 
  y = sample, fill = median_coverage < 1000)) + geom_tile() + geom_text(aes(label = round(median_coverage)), 
  size = 1.5) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  scale_fill_discrete(direction = -1, na.value = "grey50") + theme(legend.position = "none")

## Coverage status

I checked the missing amplicons. These are expected to be missing, Susi already saw that they did not work:

paged_table(anti_join(amplicons, cpg, by = "amplicon"))

Amplicons to be excluded from differentail analysis only due to lack of coverage in more than 50% of the samples. Add boolean indicator.

cpg <- cpg %>% filter(!is_control) %>% select(sample, amplicon, median_coverage) %>% 
  unique() %>% group_by(amplicon) %>% summarise(is_valid_amplicon = (sum(median_coverage >= 
  1000)/n()) > 0.5) %>% right_join(cpg)

cpg %>% filter(!is_control & !is_valid_amplicon) %>% select(chr, start, stop, 
  amplicon, amplicon_length) %>% unique() %>% paged_table()

-> These 6 amplicons were not included in this analysis.

5.3 Conversion rate

5.3.1 Mean

cpg %>% select(sample, amplicon, mean_conversion, region) %>% unique() %>% ggplot(aes(x = amplicon, 
  y = sample, fill = mean_conversion > 95)) + geom_tile() + geom_text(aes(label = round(mean_conversion, 
  1)), size = 1.5) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  scale_fill_discrete(direction = -1, na.value = "grey50") + theme(legend.position = "none")

5.3.2 Median

cpg %>% select(sample, amplicon, median_conversion, region) %>% unique() %>% 
  ggplot(aes(x = amplicon, y = sample, fill = median_conversion > 95)) + geom_tile() + 
  geom_text(aes(label = round(median_conversion, 1)), size = 1.5) + theme(axis.text.x = element_text(angle = 90, 
  hjust = 1)) + scale_fill_discrete(direction = -1, na.value = "grey50") + 
  theme(legend.position = "none")

5.4 Apply filtering

median coverage < 1000
conversion efficiency < 95%
artefact
exclude amplicons with < 50% of the samples at >1000 coverage (the excluded amplicons pretty much fully failed)

# filter out data points that fail coverage or conversion standards or are
# artefacts
cpg <- cpg %>% filter(!is_artifact & !is_coverage_fail & !is_conversion_fail & 
  is_valid_amplicon) %>% select(-c(is_artifact, is_coverage_fail, is_conversion_fail, 
  is_valid_amplicon))
saveRDS(cpg, file = "01_CpG_data_filtered_190523_M02431_0022_000000000-CGG57.RDS")

5.5 Overlapping amplicons - consistency check

Two amplicons are overlapping and one CpG is quantified in both instances. Check how similar the quantification is!

# are there CpGs that are present on more than one amplicon?
ambiguos_cpg <- cpg %>% select(sample, CpG_genomic) %>% group_by(CpG_genomic, 
  sample) %>% count() %>% filter(n > 1) %>% ungroup() %>% select(CpG_genomic) %>% 
  unique()

cpg %>% filter(CpG_genomic == ambiguos_cpg$CpG_genomic & !is_control) %>% select(sample, 
  methylation, amplicon) %>% pivot_wider(id_cols = sample, names_from = amplicon, 
  values_from = methylation) %>% mutate(delta = `13_fk5_ctcf_5UTR` - `2_fk5_gre_pE`) %>% 
  select(sample, delta) %>% right_join(subset(cpg, CpG_genomic == ambiguos_cpg$CpG_genomic & 
  !is_control, select = c(sample, region, timepoint))) %>% ggplot(aes(x = region, 
  y = delta, col = timepoint)) + geom_boxplot() + labs(x = "", y = "delta % methylation") + 
  theme_minimal()

Pretty big differences in my opinion. Need to discuss this.

5.6 PCA - brain data

# exclude control and only keep CpGs which have enough data points to be
# analyzed
cpg_pca <- cpg %>% filter(!is_control & is_valid_for_analysis) %>% group_by(region) %>% 
  nest() %>% mutate(pca = map(data, ~prcomp(.x %>% select(CpG_genomic, methylation))))  # %>%
# mutate( pca_graph = map( pca, ~ autoplot(.x) ) )

5.7 Batches

Check for batches: - row - column - alignment rate

Any others I am missing?

batches <- 
  cpg %>% filter( !is_control & is_valid ) %>% droplevels() %>% 
  group_by( CpG_genomic ) %>% nest() %>%
  mutate( lm_batch =   map(data, ~broom::tidy( lm(methylation ~ row + column + alignment_rate , data=.x ) ))
          )
  
cpg_filtered <- cpg %>% 
  filter(is_valid) %>% 
  unnest() %>%
  group_by(region, amplicon, CpG_genomic) %>%
  nest() %>%
  mutate( 
    mean_methylation=map_dbl(data, ~ mean(.x$methylation)),
    #shapiro = map(data, ~ broom::tidy( shapiro.test(.x$methylation))), 
    lm = map(data, ~broom::tidy( lm(methylation ~ timepoint , data=.x ) ) ), 
    cooks.d= map(lm, ~ cooks.distance(.x) ),
    lm_sig= map_lgl(lm, ~ sum( .x$lm$p.value<0.5 )>1 )
  ) %>%
  unnest(lm)

cpg_filtered<- cpg_filtered %>% mutate(
  q.value=p.adjust(p.value)
)

6 BAC control - linearity check

Note:

The original intention was to produce 1 common control (0 to 100) and derive 3 replicates before bisulfite conversion to account for variance in conversion efficiency and downstream quantification.

In this case we produced the replicates differently which needs to be factored into the model. 1. 3 instances of 0% or 100% methylation (not sure if this is methylated or demethylated) are the starting material 2. Each of the instances was then (de)methylated to 0 or 100% 3. 25/50/75 titration generated from 0% and 100% control DNA 4. bisulfite conversion and downstream quantification

bac <- cpg %>% filter(is_control) %>% select(-c(is_control, region)) %>% group_by(amplicon, 
  CpG_genomic, timepoint) %>% nest() %>% mutate(n_replicates = map_dbl(data, 
  ~length(unique(.x$replicate))), ) %>% group_by(amplicon, CpG_genomic) %>% 
  mutate(n_timepoints = length(unique(timepoint)), is_valid = n_timepoints >= 
    2) %>% filter(is_valid) %>% droplevels() %>% mutate(level = as.numeric(as.character(timepoint))) %>% 
  unnest()

bac_rsquared <- bac %>% group_by(amplicon) %>% nest() %>% mutate(mod_sum = map(data, 
  ~summary(lm(methylation ~ level + replicate, data = .x, na.action = na.omit))), 
  r_squared = map_dbl(mod_sum, ~.x$r.squared)) %>% select(amplicon, r_squared)
bac <- left_join(bac, bac_rsquared)

Plot linearity for replicate 1

-> Replicate 1 has a bias (25% level). Here, it seems the titration was scewed. Exclude from linearity check!

Plot linearity for replicate 2

Plot linearity for replicate 3

bac <- cpg %>% filter(is_control) %>% select(-is_control) %>% droplevels() %>% 
  mutate(level = as.numeric(as.character(timepoint))) %>% filter(!(replicate == 
  1 & level == 25))  # exclude rep 1 / 25%

bac_rsquared <- bac %>% group_by(amplicon) %>% nest() %>% mutate(mod_sum = map(data, 
  ~summary(lm(methylation ~ level + replicate, data = .x, na.action = na.omit))), 
  r_squared = map_dbl(mod_sum, ~.x$r.squared)) %>% select(amplicon, r_squared)
bac <- left_join(bac, bac_rsquared)

Plot linearity (replicate 1 / 25% was excluded here). r2 displayed per amplicon.

ggplot(bac,  aes ( x=level, y=methylation , col=replicate) ) +
  geom_smooth(method="lm", se=F, na.rm=T, col="lightgrey", size=0.5) + # dirty - take from actual model (TODO)
  geom_smooth(method="loess", se=F, na.rm=T, col="lightgreen", size=0.5) +
  geom_point(size=0.5,alpha=0.5) +
  geom_text( data = bac_rsquared, aes(x=50,y=90,label=round(r_squared, digits = 2)) , col="grey") +
  theme(legend.position = "none") +
  facet_wrap(~amplicon)

6.1 Failed linearity

r2 values are above 0.95 for all but 2 amplicons:

paged_table(subset(bac_rsquared, r_squared < 0.95))

Look at amplicons with r2 < 0.95 in more detail.

6.1.1 10_fk5_gre_i5.1

Other QC features looked fine!

Replicates do not show same scew. - Replicate 1: looks fine - Replicate 2: scewed 75% data point - Replicate 3: scewed 50% data point

This does not look conclusive to me.

bac %>% ungroup() %>% filter(r_squared<0.95 & amplicon=="10_fk5_gre_i5.1" ) %>% droplevels() %>%
  ggplot(  aes ( x=level, y=methylation , col=replicate) ) +
  ggtitle("10_fk5_gre_i5.1") +
  geom_smooth(method="lm", se=F, na.rm=T, col="lightgrey", size=0.5) + # dirty - take from actual model (TODO)
  geom_smooth(method="loess", se=F, na.rm=T, col="lightgreen", size=0.5) +
  geom_point(size=0.5,alpha=0.5) +
  theme(legend.position = "none") +
  facet_wrap(~ replicate )

6.1.2 15_fk5_ctcf_interg

Other QC features looked fine!

Replicate 1 and 2 are missing important data points. Replicate 3 is complete and show more of a mild but classical scew indicative of an amplification bias. However, replicates also don’t fully agree.

Also not a clear picture here, I think it may be worth while to rerun this.

Note: adding the “below coverage” data points didn’t add more info, rep 1 was very low at 25%, rep 2 was in line with 1 and 3 at 50%.

bac %>% ungroup() %>% filter(r_squared<0.95 & amplicon=="15_fk5_ctcf_interg" ) %>% droplevels() %>%
  ggplot(  aes ( x=level, y=methylation , col=replicate) ) +
  ggtitle("15_fk5_ctcf_interg") +
  geom_smooth(method="lm", se=F, na.rm=T, col="lightgrey", size=0.5) + # dirty - take from actual model (TODO)
  geom_smooth(method="loess", se=F, na.rm=T, col="lightgreen", size=0.5) +
  geom_point(size=0.5,alpha=0.5) +
  theme(legend.position = "none") +
  facet_wrap(~ replicate )

tmp=readRDS("01_CpG_data_raw_190523_M02431_0022_000000000-CGG57.RDS") %>% rename(level=timepoint)


tmp %>% filter(amplicon=="15_fk5_ctcf_interg" & is_control) %>% droplevels() %>%
  ggplot(  aes ( x=level, y=methylation , col=replicate) ) +
  ggtitle("15_fk5_ctcf_interg") +
  geom_smooth(method="lm", se=F, na.rm=T, col="lightgrey", size=0.5) + # dirty - take from actual model (TODO)
  geom_smooth(method="loess", se=F, na.rm=T, col="lightgreen", size=0.5) +
  geom_point(size=0.5,alpha=0.5) +
  theme(legend.position = "none") +
  facet_wrap(~ replicate )

rm(tmp)

7 Differential analysis of mouse brain data

First, determine which CpGs can be evaluated ( not all samples passed filters -> e.g. some CpGs only have data on 1 time point )

CpGs included for analysis should have: - at least two time points - at least 3 replicates per time point

cpg <- cpg %>% filter(!is_control) %>% group_by(region, amplicon, CpG_genomic, 
  timepoint) %>% nest() %>% mutate(n_replicates = map_dbl(data, ~length(unique(.x$replicate))), 
  ) %>% group_by(region, amplicon, CpG_genomic) %>% mutate(n_timepoints = sum(n_replicates >= 
  3), is_valid = n_replicates >= 3 & n_timepoints >= 2)

# cpg %>% filter(region=='P' & amplicon == '1_fk5_gre_pE' & CpG_genomic ==
# 28660045)

7.1 Run linear model

cpg_filtered <- cpg %>% 
  filter(is_valid) %>% 
  unnest() %>%
  group_by(region, amplicon, CpG_genomic) %>%
  nest() %>%
  mutate( 
    mean_methylation=map_dbl(data, ~ mean(.x$methylation)),
    #shapiro = map(data, ~ broom::tidy( shapiro.test(.x$methylation))), 
    lm = map(data, ~broom::tidy( lm(methylation ~ timepoint , data=.x ) ) ), 
    cooks.d= map(lm, ~ cooks.distance(.x) ),
    lm_sig= map_lgl(lm, ~ sum( .x$lm$p.value<0.5 )>1 )
  ) %>%
  unnest(lm)

cpg_filtered<- cpg_filtered %>% mutate(
  q.value=p.adjust(p.value)
)

ggplot(cpg_filtered, aes(x=mean_methylation, y=q.value)) +
  geom_point()

#test <- cpg_filtered %>% filter( term!="(Intercept)" & p.value<0.1 ) %>% group_by(region, amplicon) %>% select(region, amplicon) %>% unique() %>% mutate(
#  plot= my_amplicon_plot(cpg_filtered, region, amplicon)
#)

ggplot(as.data.frame(cpg_filtered[289, "data"][[1]]), aes(x=sample, y=methylation)) +
  geom_boxplot()

7.2 Open tasks

Ask Susi about the preparation of control library & include details in intro
Find out about pooling (sequence of events :)
Does median for conversion efficiency work better for brain tissue / look at non-CpG methylation? (does make a difference for some samples / 1 specific amplicon only)

8 sessionInfo()

sessionInfo()

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS  10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] formatR_1.6      rmarkdown_2.7    gridExtra_2.3    ggfortify_0.4.11
##  [5] forcats_0.4.0    stringr_1.4.0    dplyr_1.0.5      purrr_0.3.4     
##  [9] readr_1.3.1      tidyr_1.1.0      tibble_3.0.1     ggplot2_3.2.1   
## [13] tidyverse_1.3.0 
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.0   xfun_0.21          reshape2_1.4.3    
##  [4] haven_2.2.0        lattice_0.20-38    colorspace_1.4-1  
##  [7] vctrs_0.3.6        generics_0.0.2     htmltools_0.3.6   
## [10] yaml_2.2.0         utf8_1.1.4         rlang_0.4.10      
## [13] pillar_1.4.4       withr_2.1.2        glue_1.4.1        
## [16] DBI_1.0.0          RColorBrewer_1.1-2 dbplyr_1.4.2      
## [19] modelr_0.1.5       readxl_1.3.1       plyr_1.8.4        
## [22] lifecycle_1.0.0    munsell_0.5.0      gtable_0.3.0      
## [25] cellranger_1.1.0   rvest_0.3.5        codetools_0.2-16  
## [28] evaluate_0.13      labeling_0.3       knitr_1.23        
## [31] fansi_0.4.1        broom_0.5.4        Rcpp_1.0.4.6      
## [34] scales_1.0.0       backports_1.1.4    jsonlite_1.6.1    
## [37] fs_1.3.1           hms_0.5.3          digest_0.6.25     
## [40] stringi_1.4.6      cli_2.0.2          tools_3.6.0       
## [43] magrittr_1.5       lazyeval_0.2.2     crayon_1.3.4      
## [46] pkgconfig_2.0.3    ellipsis_0.3.1     xml2_1.2.2        
## [49] reprex_0.3.0       lubridate_1.7.4    assertthat_0.2.1  
## [52] httr_1.4.1         rstudioapi_0.11    R6_2.4.1          
## [55] nlme_3.1-140       compiler_3.6.0

Murine TBS pilot1

Simone Roeh

3/17/2021

1 Description

2 SUMMARY of results

2.1 Data quality

2.2 Linearity

2.3 Differential analysis of brain samples

3 Pre-processing info

4 Assemble data

4.1 Amplicon data

4.2 Raw methylation data

4.3 Raw CHH conversion data

5 QC

5.1 Alignment efficiency

5.1.1 Overrepresented sequences in unaligned reads

5.2 Artifact status

5.3 Conversion rate

5.3.1 Mean

5.3.2 Median

5.4 Apply filtering

5.5 Overlapping amplicons - consistency check

5.6 PCA - brain data

5.7 Batches

6 BAC control - linearity check

6.1 Failed linearity

6.1.1 10_fk5_gre_i5.1

6.1.2 15_fk5_ctcf_interg

7 Differential analysis of mouse brain data

7.1 Run linear model

7.2 Open tasks

8 sessionInfo()