190523_M02431_0022_000000000-CGG57.Rmd

---
title: "Murine TBS pilot1"
author: "Simone Roeh"
date: "3/17/2021"
output:
  html_document:
    theme: default
    highlight: haddock
    code_folding: hide
    toc: true
    toc_float:
      collapsed: yes
      smooth_scroll: yes
    number_sections: true
always_allow_html: true
editor_options:
  chunk_output_type: console
  markdown:
    wrap: 72
---


# Description

HAM-TBS project title: TBS_Mouse_Pilot_repeat

**Responsible, Involved:** Susann, Tobias, Simone, Darina, Mathias
Schmidt, Elisabeth

**Background/Aim:**

-   Evaluate linearity of amplicons based on control

-   Assess methylation levels of CpGs in FKBP5 Murine DNA and look for regulation

-   Setup: Murine DNA Hippocampus 20 + Murine DNA Amygdala 20 + Murine DNA PVN 20

**Tissue** (H=Hippocampus/A=Amygdala/ P=PVN)

-   23 PCRs of in vitro methylated control DNA ( Bac)
-   3x 5 controls (0%, 25%, 50%, 75%, 100%)
-   methylation levels of control DNA were verified by PSQ

**Run ID:** 190523_M02431_0022_000000000-CGG57

-   72 samples / 15 BAC controls
-   20 Amplicons / 23 Amplicons in BAC control

# SUMMARY of results

## Data quality
We observe a significantly lower *alignment rate* than we usually get from the human samples.
This is due to some issues with the primers, specifically of one amplicon *2_fk5_gre_pE*, which results in an unalignable fraction of reads (~ up to 40% of our unaligned reads).
Potentially excluding or re-designing this PCR would improve the performance. However, our linearity assessment for this amplicon showed very good results, the methylation data derived from it is absolutely usable.
It has to be noted that the PCR itself does not seem to be over-represented in the pool. But still, for some samples the fraction of reads that cannot be aligned presenting with primers derived from this PCR is up to 40% suggesting that this artifact probably clusters better than our target. We know this behavior from index dimers that tend to overtake flow cells, however, this artifact is not short enough to exibit this behavior (I would have thought).

For more detail on the on the distribution of over-represented sequences, see section "QC - alignment efficiency".

> I would recommend to redesign this PCR *2_fk5_gre_pE* if we are likely to use it more often. It could signifiicantly increase the yield.
> Additionally, this amplicon has an overlapping target sequence to amplicon *13_fk5_ctcf_5UTR*. We usually try to avoid overalapping designs - is there a special reason for this?


*Note* on artefact status:
Some amplicons show very high rates of artefacts. However, other features as linearity assessment of these amplicons are among the best of the panel!


## Linearity

The amplicons show good overall linearity. Only two amplicons may have linearity issues, but the evidence is not strong (failed replicates, atypical pattern).
Compared to the amplicons assessed for the human FKBP5 panel, the amplicons here more often show mild bias (but no reason to exclude!). The amplicons that failed linearity in the human FKBP5 panel displayed a much stronger and pronounced bias than found anywhere here and with high consistency between replicates...

Amplicons with r2 < 0.95

-   *10_fk5_gre_i5.1* does not show consistent pattern between the replicates. I cannot explain this since I have not come across this behaviour so far. One of the replicates presents with good linearity, the others show a bias in the 50% or 75% sample BUT the pattern does not align with the usual biased curve we observe.

-   *15_fk5_ctcf_interg* Here, we have the problem that only one replicate has all 5 data points present. Here, the pattern MAY be more in line with the usual bias we observe, however, also still rather mild compared to the failed amplicons in the human FKBP5 panel.


## Differential analysis of brain samples


# Pre-processing info

**Methods:**

-   *QC*: FastQC
-   *Trimming*: cutadapt v1.11 (min read length: 100bp)
-   *Reference genome* is hg19 is used to extract sequence of amplicons.
50bp padding is added on each side.
-   *Alignment*: Bismark v0.18.2
-   *Stitching*: We remove overlapping/redundant ends of reads
symmetrically (in-house perl script) as sequence quality drops
towards the end of each read.
-   *Methylation calling (CpG and CHH)*: methylKit version 1.6.3,
minimum Phred quality INCREASED TO 30 (used to be default 20)

**Citation:** To describe methods, please cite the HAM-TBS publication
doi: 10.1186/s13072-018-0209-x. Please note the methods update that
minimum Phred quality during methylation calling by methylKit was
incresed to 30 (it used to be the default setting of 20).

```{r setup, include=FALSE}
knitr::opts_chunk$set(tidy=TRUE, tidy.opts=list(arrow=TRUE, indent=2),warning=F, message=FALSE)

library(tidyverse, quietly = T)
library(ggfortify, quietly = T)
library(grid, quietly = T)
library(gridExtra, quietly = T)
library(rmarkdown)
library(formatR)

theme_set(theme_minimal())

getwd()

```

# Assemble data

## Amplicon data

Displayed are all amplicons included in this run.

```{r, layout="l-body-outset"}
amplicons <- read.delim("../GENOME/Amplicon_Location.bed", head=F)
colnames(amplicons) <- c("chr", "start", "stop", "amplicon")
amplicons <- amplicons %>%
  as_tibble() %>%
  mutate(amplicons, amplicon_length=amplicons$stop - amplicons$start,
         amplicon=as.character(amplicon))

paged_table(amplicons)
```

## Raw methylation data

The targeted reference sequences was padded for alignment purpose (+/-
50bp up and down stream) =\> this needs to be regarded when calculating
coordinates. Shift when reading in the data!

```{r cpg}

cpg <- tibble(
  file = list.files(paste0("../05_methylKit"), pattern="CpG.txt", full.names=T)
) %>%
  mutate(
    sample = gsub(file, pattern="^.*methylKit/|_CpG.txt", replacement=""),
    out = map(file, function(x) {
      read.delim(x, head=T, sep="\t") %>%
        as_tibble() %>%
        mutate(
          amplicon=gsub(chr, pattern="::.*$", replacement=""),
          base = base - 50, # remove padding from targeted alignment coordinate
        ) %>%
        select(-c(chrBase, chr, freqT)) %>%
        rename(methylation = freqC)
    })
  )

cpg <- cpg %>%
  unnest(out) %>% # Add amplicon annotation data
  left_join(amplicons) %>%
  filter(base > 20 & (amplicon_length - base) > 20) %>% # filter primer region (20 bp)
  group_by(sample, amplicon) %>%
  mutate(
    max_coverage = max(coverage, na.rm = T),
    is_artifact = coverage < max_coverage*0.3, # assumed at less than 30% of maximum coverage. 50% could be SNP
    CpG_genomic = start+base,
    median_coverage = median(coverage[!is_artifact], na.rm = T),
    is_coverage_fail = median_coverage < 1000,
  ) %>%
  separate(sample, sep = "-", into = c('region', 'timepoint', 'replicate'), remove = F) %>%
  mutate(
    region=recode_factor(factor(region, levels = c("Ctrl", "A", "H", "P")), Ctrl="control", A="amygdala", H="hippocampus", P="pfc"),
    is_control = region == "control",
    timepoint = factor(timepoint, levels = c("B", "4", "24", "0", "25", "50", "75", "100"))
    )


# add plate layout
cpg <-
  tibble(read.delim("plate_layout.csv", head=T, sep=";")) %>% mutate(
  sample= gsub(Sample.ID, pattern="_", replacement = "-"),
  Well=as.factor(Well)) %>%
  rename(column=Well, row=Row) %>%
  select(-Sample.ID) %>%
  right_join(cpg)


# add flag id CpG is valid for analysis (requires a minimum of 3 replicates)
cpg <-
  cpg %>%
  filter(!is_control) %>%
  group_by(region, amplicon, CpG_genomic, timepoint) %>%
  nest() %>%
  mutate(
    n_replicates = map_dbl(data, ~ length(unique(.x$replicate))),
    ) %>%
  group_by( region, amplicon, CpG_genomic ) %>% mutate(
    n_timepoints = sum( n_replicates >= 3 ),
    is_valid_for_analysis= n_replicates >= 3 & n_timepoints >= 2
  ) %>% right_join(cpg)


```

## Raw CHH conversion data

Extract conversion levels for each amplicon and each sample separately.\
Track both mean and median conversion! This makes a difference here.\
Check more closely! (TODO)\

```{r chh}
chh <- tibble(
  file = list.files(paste0("../05_methylKit"), pattern="CHH.txt", full.names=T)
) %>%
  mutate(
    sample = gsub(file, pattern="^.*methylKit/|_CHH.txt", replacement=""),
    out = map(file, function(x) {
      read.delim(x, head=T, sep="\t") %>%
        as_tibble() %>%
        mutate(
          amplicon=gsub(chr, pattern="::.*$", replacement=""),
          base = base - 50, # remove padding from targeted alignment coordinate
        ) %>%
        select(-c(chrBase, chr, freqT)) %>%
        rename(methylation = freqC)
    })
  )

chh <- chh %>%
  unnest(out) %>% # Add amplicon annotation data
  left_join(amplicons) %>%
  filter(base > 20 & (amplicon_length - base) > 20) %>% # filter primer region (20 bp)
  group_by(sample, amplicon) %>%
  mutate(
    max_coverage = max(coverage, na.rm = T),
    is_artifact = coverage < max_coverage*0.3, # assumed at less than 30% of maximum coverage. 50% could be SNP
    CpG_genomic = start+base,
    median_coverage = median(coverage[!is_artifact], na.rm = T),
    is_coverage_fail = median_coverage < 1000
  )

save(chh, file="01_CHH_data_raw_190523_M02431_0022_000000000-CGG57.RDS")

chh <- chh %>%  filter( !is_artifact & !is_coverage_fail) %>%
  summarize( mean_conversion= 100-mean(methylation),
             median_conversion= 100-median(methylation),
             is_conversion_fail= mean_conversion < 95)
```

Append CHH info to CpG data.

```{r}
cpg <- cpg %>% left_join(chh, by = c("sample", "amplicon"))
rm(chh)
```

Save raw data.

```{r}
saveRDS(cpg, file="01_CpG_data_raw_190523_M02431_0022_000000000-CGG57.RDS")
```

# QC

## Alignment efficiency

Low alignment rate & very heterogeneous compared to the usual human
TBS data. Whole genome alignment did not reveal off target hits. The low alignment rate seems to be rooted in the overrepresented sequences - see below.

As expected, alignment quality similar
```{r, fig.height=3, fig.width=7}
dat <- tibble(
  file = list.files("../03_bismark/", pattern=".txt", full.names=T),
  sample = gsub(file, pattern=".*//|_S.*$", replacement=""),
  mapping_efficiency_pct = as.numeric(map_chr(file, ~gsub(system(paste0("grep \"Mapping efficiency:\" ", .x), intern = T), pattern=".*:.|%", replacement="")))
) %>% select(-file)

cpg <- left_join(cpg, dat)

cpg %>%
#  filter(!is_control) %>%
  ggplot(aes(x = timepoint, y = mapping_efficiency_pct, fill = region)) +
  geom_boxplot(width = 0.5, alpha = 0.5) +
  geom_point(pch = 21, size = 2) +
  facet_grid(~region, scales = "free") +
  labs(y = "Mapping efficiency [%]", x = "Timepoint") +
  theme(legend.position = "none")
```


```{r, echo=F, eval=FALSE}

#### Whole genome alignment
# Test if unaligned reads are off target hits in genome -> whole genome alignment.
# Compare alignment rates.
# Lower alignment rates in whole genome data
# also still primer issue with same distribution

wg_alignment_rate=tibble(read.delim("../03_bismark_whole_genome/mapping_efficiency.txt",  head=F, sep=" ")[,1:2]) %>% rename( sample=V1, wg_alignment_rate=V2 ) %>%
  mutate(
    wg_alignment_rate = as.numeric(gsub(wg_alignment_rate, pattern="%", replacement = ""))/100,
    sample=gsub(sample, pattern="_.*$", replacement = "") )

wg_alignment_rate %>% left_join(dat)


```

### Overrepresented sequences in unaligned reads
In the unaligned reads some sequences were overrepresented at high proportions.
The beginning of the sequence shows the presence of our primers at the start of the read, I cannot find the source of the sequence following it.
It is - however - bisulfite converted which makes finding the source more complicated.

The same check was performed on whole genome alignment with similar outcome on the overrepresented sequences within the unaligned sequences

```{r primers}

# read in primer info
primers=tibble(
  read.delim("primer.csv", head=F, sep=";")
)  %>%
  rename(amplicon=V1, primer_number=V2, primer_sequence=V3) %>%
  select(-primer_number) %>%
  mutate(primer_sequence  = as.character(primer_sequence),
    primer_no=rep(c("forward_primer", "reverse_primer"), 23),
    primer_length=nchar(primer_sequence),
    seq_start20=str_sub(primer_sequence, 1,20)
  )

paged_table(primers)
```

```{r orseq}
# read in over-represented sequences per sample and merge with primer info
orseq <- tibble(
  file = list.files(path="../03_bismark/01_fastqc", pattern="overrepresented_sequences.txt", recursive = T,full.names=T)
) %>%
  mutate(
    sample = gsub(file, pattern="^../03_bismark/01_fastqc/|_fastqc.*$", replacement=""),
    out = map(file, function(x) {
      read.delim(x, head=T, sep="\t") %>%
        as_tibble() %>%
        rename(sequence = X.Sequence ) %>%
        mutate(
          seq_start20=str_sub( sequence, 1, 20)
        )
    })
  ) %>%
  select(-file) %>% unnest(out) %>% select(-Possible.Source) %>%
  select(-sequence) %>%
  group_by(sample, seq_start20) %>%
  summarize( count=sum(Count),
             percentage=sum(Percentage) ) %>%
  left_join(primers) %>%
  mutate( R1= str_detect(sample,"_reads_1$")) %>%
  separate(sample, sep = "-", into = c('region', 'timepoint', 'replicate'), remove = F) %>%
  mutate( region=recode_factor(as.factor(region), Ctrl="control", A="amygdala", H="hippocampus", P="pfc"),
          timepoint=as.factor(timepoint))

orseq

```


```{r, fig.height=8, fig.width=8}
# booxplot of percentages per amplicon

ggplot(subset(orseq, !is.na(primer_no)), aes(x=amplicon, y=percentage, col=R1)) +
  geom_point() +
  labs(title="primer presence in unaligned sequences", y="percentage per sample") +
  facet_wrap(~primer_no, nrow=2) +
  theme(axis.text.x = element_text(angle = 90))

```

Display pooling ratio based on aligned reads per amplicon:

```{r, fig.height=9, fig.width=5}

cpg %>%
  #filter(!is_control) %>%
  select(sample, median_coverage, region, timepoint, amplicon) %>%
  unique() %>%
  ggplot(aes( x=amplicon, y=median_coverage, col=timepoint )) +
  geom_point() +
  #labs(title="primer presence in unaligned sequences", y="percentage per sample") +
  facet_wrap(~region, nrow = 4) +
  theme(axis.text.x = element_text(angle = 90))


```

>Pooling ration is very stable across amplicons.
>In which order is pooling performed? -> pooling oder could explain this pattern. (TODO ask Susi for detail)

## Artifact status

In a previous meeting we hypothesized that the artefact status could be related to the specific strain of mice and differences in the genome compared to the reference assembly.
However, on second thought, that would present as a SNP with 100% or 50% coverage and not as an artifact.
The difference in strain can therefore not be the source of this.

It has to be noted that despite rather high artifact rates the linearity assessment using the control DNA

```{r}
cpg_sum <- cpg %>%
  group_by(region, timepoint, replicate, sample, amplicon, is_control) %>%
  summarize(
    n_artifacts = sum(is_artifact),
    n_coverage_fail = sum(is_coverage_fail),
    n_good_quality_artifact = sum(is_artifact, !is_coverage_fail)
  )
cpg_sum
```

Plot displays the number of artefacts removed per sample/amplicon. Quite
stable per amplicon between samples.

```{r fig1, fig.height = 12, fig.width = 8}
ggplot(cpg_sum,
       aes(x = amplicon, y = sample, fill = n_good_quality_artifact)) +
  geom_tile() +
  geom_text(aes(label = n_good_quality_artifact), size=1.5) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_distiller(type = "seq", direction = 1) +
  theme(legend.position = "none")
```

Sum of artefacts per time point.

```{r fig2, fig.height = 4, fig.width = 8}
cpg_sum %>%
  group_by(region, timepoint, replicate, sample) %>%
  summarize(total_artifacts = sum(n_good_quality_artifact)) %>%
  ggplot(aes( x=timepoint, y = total_artifacts)) +
  geom_boxplot() +
  geom_point(aes(fill=replicate), pch = 21, size = 2) +
  facet_grid(.~region, scales = "free_x")  +
  theme(legend.position = "none")
```

Sum of artefacts per amplicon.

```{r fig3, fig.height = 8, fig.width = 6}
cpg_sum %>%
  group_by(amplicon) %>%
  summarize(total_artifacts = sum(n_good_quality_artifact)) %>%
  ggplot(aes(x=amplicon, y = total_artifacts)) +
  geom_col() +
  coord_flip() +
  theme(legend.position = "none")
```

Coverage per sample per amplicon:

```{r fig4, fig.height = 10, fig.width = 5}
cpg %>% select( sample, amplicon, median_coverage, region) %>% unique() %>%
  ggplot(aes(x = amplicon, y = sample, fill = log2(median_coverage))) +
  geom_tile() +
  geom_text(aes(label = round(median_coverage)), size=1.5) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_distiller(type = "seq", direction = 1) +
  theme(legend.position = "none")
```

Same plot as above - colored by threshold (minimum coverage of 1000):

```{r fig5, fig.height = 10, fig.width = 5}
cpg %>% select( sample, amplicon, median_coverage, region) %>% unique() %>%
  ggplot(aes(x = amplicon, y = sample, fill = median_coverage<1000)) +
  geom_tile() +
  geom_text(aes(label = round(median_coverage)), size=1.5) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_discrete(direction = -1, na.value = "grey50") +
  theme(legend.position = "none")
```
## Coverage status

I checked the missing amplicons. These are expected to be missing, Susi already saw that they did not work:
```{r, layout="l-body-outset"}
paged_table(anti_join(amplicons, cpg, by="amplicon"))


```


Amplicons to be excluded from differentail analysis only due to lack of coverage in more than 50% of the samples. Add boolean indicator.
```{r, layout="l-body-outset"}

cpg <-
  cpg %>%
  filter(!is_control) %>%
  select(sample, amplicon, median_coverage) %>% unique() %>%
  group_by(amplicon) %>%
  summarise(is_valid_amplicon = ( sum(median_coverage >= 1000) / n() ) > 0.5  ) %>%
  right_join(cpg)

cpg %>% filter(!is_control & !is_valid_amplicon ) %>% select(chr, start, stop, amplicon, amplicon_length) %>% unique() %>%
paged_table()


```

>
> -> These 6 amplicons were not included in this analysis.
>

## Conversion rate

### Mean

```{r fig6, fig.height = 10, fig.width = 5}
cpg %>% select( sample, amplicon, mean_conversion, region) %>% unique() %>%
  ggplot(aes(x = amplicon, y = sample, fill = mean_conversion > 95)) +
  geom_tile() +
  geom_text(aes(label = round(mean_conversion, 1)), size=1.5) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_discrete(direction = -1, na.value = "grey50") +
  theme(legend.position = "none")
```

### Median

```{r fig7, fig.height = 10, fig.width = 5}
cpg %>% select( sample, amplicon, median_conversion, region) %>% unique() %>%
  ggplot(aes(x = amplicon, y = sample, fill = median_conversion > 95)) +
  geom_tile() +
  geom_text(aes(label = round(median_conversion, 1)), size=1.5) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_discrete(direction = -1, na.value = "grey50") +
  theme(legend.position = "none")
```


## Apply filtering

1. median coverage < 1000
2. conversion efficiency < 95%
3. artefact
4. exclude amplicons with < 50% of the samples at >1000 coverage (the excluded amplicons pretty much fully failed)

```{r}
# filter out data points that fail coverage or conversion standards or are artefacts
cpg <- cpg %>% filter(!is_artifact & !is_coverage_fail & !is_conversion_fail & is_valid_amplicon) %>% select(-c(is_artifact, is_coverage_fail, is_conversion_fail, is_valid_amplicon))
saveRDS(cpg, file="01_CpG_data_filtered_190523_M02431_0022_000000000-CGG57.RDS")

```


## Overlapping amplicons - consistency check

Two amplicons are overlapping and one CpG is quantified in both instances.
Check how similar the quantification is!

```{r, fig.height=3, fig.width=5}

# are there CpGs that are present on more than one amplicon?
ambiguos_cpg=cpg %>% select(sample, CpG_genomic) %>%
  group_by(CpG_genomic, sample) %>%
  count() %>%
  filter(n>1) %>%
  ungroup() %>%
  select(CpG_genomic) %>% unique()

cpg %>% filter(CpG_genomic==ambiguos_cpg$CpG_genomic & !is_control ) %>% select(sample, methylation, amplicon) %>%
  pivot_wider(id_cols = sample, names_from = amplicon, values_from = methylation) %>%
  mutate(delta = `13_fk5_ctcf_5UTR` - `2_fk5_gre_pE`) %>% select(sample, delta) %>%
  right_join( subset(cpg, CpG_genomic==ambiguos_cpg$CpG_genomic & !is_control , select=c(sample, region, timepoint) ) ) %>%
  ggplot(aes( x=region, y=delta, col=timepoint )) +
  geom_boxplot() +
  labs( x="", y="delta % methylation") +
  theme_minimal()
```

> Pretty big differences in my opinion. Need to discuss this.

## PCA - brain data
```{r pca-brain}

# exclude control and only keep CpGs which have enough data points to be analyzed
cpg_pca <-
  cpg %>% filter(!is_control & is_valid_for_analysis) %>%
  group_by(region) %>% nest() %>%
  mutate( pca = map( data, ~prcomp(.x %>% select( CpG_genomic ,methylation )) ))# %>%
  #mutate( pca_graph = map( pca, ~ autoplot(.x) ) )


```


## Batches
Check for batches:
- row
- column
- alignment rate

Any others I am missing?


```{r, eval=F}

batches <-
  cpg %>% filter( !is_control & is_valid ) %>% droplevels() %>%
  group_by( CpG_genomic ) %>% nest() %>%
  mutate( lm_batch =   map(data, ~broom::tidy( lm(methylation ~ row + column + alignment_rate , data=.x ) ))
          )

cpg_filtered <- cpg %>%
  filter(is_valid) %>%
  unnest() %>%
  group_by(region, amplicon, CpG_genomic) %>%
  nest() %>%
  mutate(
    mean_methylation=map_dbl(data, ~ mean(.x$methylation)),
    #shapiro = map(data, ~ broom::tidy( shapiro.test(.x$methylation))),
    lm = map(data, ~broom::tidy( lm(methylation ~ timepoint , data=.x ) ) ),
    cooks.d= map(lm, ~ cooks.distance(.x) ),
    lm_sig= map_lgl(lm, ~ sum( .x$lm$p.value<0.5 )>1 )
  ) %>%
  unnest(lm)

cpg_filtered<- cpg_filtered %>% mutate(
  q.value=p.adjust(p.value)
)


```


# BAC control - linearity check

**Note:**

The original intention was to produce 1 common control (0 to 100) and derive 3 replicates before bisulfite conversion to account for
variance in conversion efficiency and downstream quantification.

In this case we produced the replicates differently which needs to be factored into the model.
1. 3 instances of 0% or 100% methylation (not sure if this is methylated or demethylated) are the starting material
2. Each of the instances was then (de)methylated to 0 or 100%
3. 25/50/75 titration generated from 0% and 100% control DNA
4. bisulfite conversion and downstream quantification


```{r}
bac <- cpg %>%
  filter( is_control ) %>%
  select(-c(is_control, region)) %>%
  group_by(amplicon, CpG_genomic, timepoint) %>%
  nest() %>%
  mutate(
    n_replicates = map_dbl(data, ~ length(unique(.x$replicate))),
    ) %>%
  group_by(amplicon, CpG_genomic ) %>% mutate(
    n_timepoints = length(unique(timepoint)),
    is_valid= n_timepoints >= 2
  ) %>%
  filter(is_valid) %>%
  droplevels() %>%
  mutate(level=as.numeric(as.character(timepoint))) %>%
  unnest()

bac_rsquared <- bac %>%
  group_by( amplicon ) %>%
  nest() %>%
  mutate(
    mod_sum = map(data, ~summary(lm(methylation ~ level + replicate, data = .x, na.action = na.omit))),
    r_squared = map_dbl(mod_sum, ~.x$r.squared)
  ) %>%
  select(amplicon, r_squared)
bac <- left_join(bac, bac_rsquared)
```

Plot linearity for replicate 1
```{r, echo=FALSE, warning=FALSE, fig.height=7, fig.width=7}
ggplot(subset(bac, replicate==1),  aes ( x=level, y=methylation , col=replicate) ) +
  geom_smooth(method="lm", se=F, na.rm=T, col="lightgrey", size=0.5) + # dirty - take from actual model (TODO)
  geom_smooth(method="loess", se=F, na.rm=T, col="lightgreen", size=0.5) +
  geom_point(size=0.5,alpha=0.5) +
  theme(legend.position = "none") +
  facet_wrap(~amplicon)
```

> -\> Replicate 1 has a bias (25% level). Here, it seems the titration
> was scewed. Exclude from linearity check!


Plot linearity for replicate 2
```{r, echo=FALSE, warning=FALSE, fig.height=7, fig.width=7}
ggplot(subset(bac, replicate==2),  aes ( x=level, y=methylation , col=replicate) ) +
  geom_smooth(method="lm", se=F, na.rm=T, col="lightgrey", size=0.5) + # dirty - take from actual model (TODO)
  geom_smooth(method="loess", se=F, na.rm=T, col="lightgreen", size=0.5) +
  geom_point(size=0.5,alpha=0.5) +
  theme(legend.position = "none") +
  facet_wrap(~amplicon)
```

Plot linearity for replicate 3
```{r, echo=FALSE, warning=FALSE, fig.height=7, fig.width=7}
ggplot(subset(bac, replicate==3),  aes ( x=level, y=methylation , col=replicate) ) +
  geom_smooth(method="lm", se=F, na.rm=T, col="lightgrey", size=0.5) + # dirty - take from actual model (TODO)
  geom_smooth(method="loess", se=F, na.rm=T, col="lightgreen", size=0.5) +
  geom_point(size=0.5,alpha=0.5) +
  theme(legend.position = "none") +
  facet_wrap(~amplicon)
```


```{r}
bac <- cpg %>%
  filter( is_control ) %>%
  select(-is_control) %>%
  droplevels() %>%
  mutate(level=as.numeric(as.character(timepoint))) %>%
  filter( !(replicate == 1 & level == 25))  # exclude rep 1 / 25%

bac_rsquared <- bac %>%
  group_by( amplicon ) %>%
  nest() %>%
  mutate(
    mod_sum = map(data, ~summary(lm(methylation ~ level + replicate, data = .x, na.action = na.omit))),
    r_squared = map_dbl(mod_sum, ~.x$r.squared)
  ) %>%
  select(amplicon, r_squared)
bac <- left_join(bac, bac_rsquared)
```

Plot linearity (replicate 1 / 25% was excluded here). *r2* displayed per amplicon.

```{r, fig.height=7, fig.width=7}
ggplot(bac,  aes ( x=level, y=methylation , col=replicate) ) +
  geom_smooth(method="lm", se=F, na.rm=T, col="lightgrey", size=0.5) + # dirty - take from actual model (TODO)
  geom_smooth(method="loess", se=F, na.rm=T, col="lightgreen", size=0.5) +
  geom_point(size=0.5,alpha=0.5) +
  geom_text( data = bac_rsquared, aes(x=50,y=90,label=round(r_squared, digits = 2)) , col="grey") +
  theme(legend.position = "none") +
  facet_wrap(~amplicon)
```

## Failed linearity

> *r2* values are above 0.95 for all but 2 amplicons:

```{r}
paged_table(subset(bac_rsquared, r_squared<0.95))
```


Look at amplicons with r2 \< 0.95 in more detail.

### 10_fk5_gre_i5.1
Other QC features looked fine!

Replicates do not show same scew.
-   Replicate 1: looks fine
-   Replicate 2: scewed 75% data point
-   Replicate 3: scewed 50% data point

This does not look conclusive to me.

```{r, fig.height=2, fig.width=4}
bac %>% ungroup() %>% filter(r_squared<0.95 & amplicon=="10_fk5_gre_i5.1" ) %>% droplevels() %>%
  ggplot(  aes ( x=level, y=methylation , col=replicate) ) +
  ggtitle("10_fk5_gre_i5.1") +
  geom_smooth(method="lm", se=F, na.rm=T, col="lightgrey", size=0.5) + # dirty - take from actual model (TODO)
  geom_smooth(method="loess", se=F, na.rm=T, col="lightgreen", size=0.5) +
  geom_point(size=0.5,alpha=0.5) +
  theme(legend.position = "none") +
  facet_wrap(~ replicate )
```


### 15_fk5_ctcf_interg
Other QC features looked fine!

Replicate 1 and 2 are missing important data points.
Replicate 3 is complete and show more of a mild but classical scew indicative of an amplification bias. However, replicates also don't fully agree.

Also not a clear picture here, I think it may be worth while to rerun this.

Note: adding the "below coverage" data points didn't add more info, rep 1 was very low at 25%, rep 2 was in line with 1 and 3 at 50%.

```{r, fig.height=2, fig.width=4}
bac %>% ungroup() %>% filter(r_squared<0.95 & amplicon=="15_fk5_ctcf_interg" ) %>% droplevels() %>%
  ggplot(  aes ( x=level, y=methylation , col=replicate) ) +
  ggtitle("15_fk5_ctcf_interg") +
  geom_smooth(method="lm", se=F, na.rm=T, col="lightgrey", size=0.5) + # dirty - take from actual model (TODO)
  geom_smooth(method="loess", se=F, na.rm=T, col="lightgreen", size=0.5) +
  geom_point(size=0.5,alpha=0.5) +
  theme(legend.position = "none") +
  facet_wrap(~ replicate )
```

```{r, fig.height=2, fig.width=4}
tmp=readRDS("01_CpG_data_raw_190523_M02431_0022_000000000-CGG57.RDS") %>% rename(level=timepoint)


tmp %>% filter(amplicon=="15_fk5_ctcf_interg" & is_control) %>% droplevels() %>%
  ggplot(  aes ( x=level, y=methylation , col=replicate) ) +
  ggtitle("15_fk5_ctcf_interg") +
  geom_smooth(method="lm", se=F, na.rm=T, col="lightgrey", size=0.5) + # dirty - take from actual model (TODO)
  geom_smooth(method="loess", se=F, na.rm=T, col="lightgreen", size=0.5) +
  geom_point(size=0.5,alpha=0.5) +
  theme(legend.position = "none") +
  facet_wrap(~ replicate )

rm(tmp)

```


# Differential analysis of mouse brain data


First, determine which CpGs can be evaluated ( not all samples passed filters -> e.g. some CpGs only have data on 1 time point )

CpGs included for analysis should have:
- at least two time points
- at least 3 replicates per time point

```{r}
cpg <-cpg %>%
  filter(!is_control) %>%
  group_by(region, amplicon, CpG_genomic, timepoint) %>%
  nest() %>%
  mutate(
    n_replicates = map_dbl(data, ~ length(unique(.x$replicate))),
    ) %>%
  group_by( region, amplicon, CpG_genomic ) %>% mutate(
    n_timepoints = sum( n_replicates >= 3 ),
    is_valid= n_replicates >= 3 & n_timepoints >= 2
  )

#cpg %>% filter(region=="P" & amplicon == "1_fk5_gre_pE" & CpG_genomic == 28660045)


```


## Run linear model

```{r, eval = F}
cpg_filtered <- cpg %>%
  filter(is_valid) %>%
  unnest() %>%
  group_by(region, amplicon, CpG_genomic) %>%
  nest() %>%
  mutate(
    mean_methylation=map_dbl(data, ~ mean(.x$methylation)),
    #shapiro = map(data, ~ broom::tidy( shapiro.test(.x$methylation))),
    lm = map(data, ~broom::tidy( lm(methylation ~ timepoint , data=.x ) ) ),
    cooks.d= map(lm, ~ cooks.distance(.x) ),
    lm_sig= map_lgl(lm, ~ sum( .x$lm$p.value<0.5 )>1 )
  ) %>%
  unnest(lm)

cpg_filtered<- cpg_filtered %>% mutate(
  q.value=p.adjust(p.value)
)

ggplot(cpg_filtered, aes(x=mean_methylation, y=q.value)) +
  geom_point()

#test <- cpg_filtered %>% filter( term!="(Intercept)" & p.value<0.1 ) %>% group_by(region, amplicon) %>% select(region, amplicon) %>% unique() %>% mutate(
#  plot= my_amplicon_plot(cpg_filtered, region, amplicon)
#)

ggplot(as.data.frame(cpg_filtered[289, "data"][[1]]), aes(x=sample, y=methylation)) +
  geom_boxplot()
```


## Open tasks

- Ask Susi about the preparation of control library & include details in intro
- Find out about pooling (sequence of events :)
- Does median for conversion efficiency work better for brain tissue / look at non-CpG methylation? (does make a difference for some samples / 1 specific amplicon only)


# sessionInfo()

```{r}
sessionInfo()
```