Fawaz Dabbaghie committed Oct 25, 2019
1 parent 05b7a7a, commit 481f68c
Showing 71 changed files with 15,721 additions and 0 deletions.
@@ -0,0 +1,61 @@
%!TEX root = ../Thesis.tex
\section{Bloom Filters} \label{bloom_filters}
A Bloom filter is a probabilistic data structure introduced by Burton H. Bloom in 1970 \cite{bloom1970space}. It is typically used to test whether an item is a member of a set. Its main advantage is that it never produces false negatives, although it can produce false positives.\\
A Bloom filter consists of an array of $m$ bits, all initially set to 0, together with $h$ independent hash functions. To insert an item, each of the $h$ hash functions maps the item to a position in the array, and the bits at those positions are set to 1. To query an item, the same $h$ positions are checked: if any of them is 0, the item was never inserted, and if all of them are 1, the item is reported as present. A false positive occurs when all the positions of an item that was never inserted have already been set to 1 by other items. False negatives cannot occur, because the same hash functions always map an inserted item to the same positions, and bits are never reset to 0.\\
The false positive rate of a Bloom filter can be approximated by $\left( 1 - e^{-h\frac{n}{m}} \right)^{h}$, where $n$ is the number of unique elements added to the filter, $m$ is the size of the bit array, and $h$ is the number of independent hash functions. The rate can be tuned through $m$ and $h$, with a trade-off: a larger $m$ needs more memory, and a larger $h$ needs more hash computations per operation.
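
As a worked example of the approximation above, assume (purely for illustration, these values are not taken from the thesis) a filter with $m/n = 10$ bits per element and $h = 7$ hash functions:
\[
\left( 1 - e^{-h\frac{n}{m}} \right)^{h} = \left( 1 - e^{-0.7} \right)^{7} \approx \left( 0.5034 \right)^{7} \approx 0.0082 ,
\]
that is, a false positive rate of about $0.8\%$. For fixed $m$ and $n$, the approximation is minimized at $h = \frac{m}{n}\ln 2$, which gives $h \approx 6.9$ here, so $h = 7$ is close to the best choice for this assumed ratio.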
\section{$\ell$-minimizer} \label{l_minimizer}
For a string $x$ of length $k$ and an integer $\ell < k$, the left minimizer of $x$, denoted lmm($x$), is the $\ell$-minimizer of the prefix $(k-1)$-mer of $x$, and the right minimizer, denoted rmm($x$), is the $\ell$-minimizer of the suffix $(k-1)$-mer of $x$.
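
As a small illustration, assume that the $\ell$-minimizer of a string is its lexicographically smallest substring of length $\ell$ (this ordering is an assumption made for the example; other orderings are possible). For the hypothetical string $x = \texttt{GATTACA}$ with $k = 7$ and $\ell = 3$:
\begin{center}
\begin{tabular}{lll}
 & $(k-1)$-mer & $\ell$-mers \\
prefix & \texttt{GATTAC} & \texttt{GAT}, \texttt{ATT}, \texttt{TTA}, \texttt{TAC} \\
suffix & \texttt{ATTACA} & \texttt{ATT}, \texttt{TTA}, \texttt{TAC}, \texttt{ACA}
\end{tabular}
\end{center}
Under this assumption, lmm($x$) $=$ \texttt{ATT} and rmm($x$) $=$ \texttt{ACA}.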
\section{GFA Files} \label{gfa_files}
GFA stands for Graphical Fragment Assembly, a format for representing genome assembly graphs. It is a simple, tab-delimited text format. The first column of each line holds a single capital letter that gives the line type. There are several line types:
\begin{enumerate}
\item \textbf{H}: a header line. It has one mandatory column, ``H'', and an optional column for the version number.
\item \textbf{L}: an edge or link line with 6 mandatory columns: ``L'' for the line type, the ``from'' node ID, its orientation (+ for the sequence as given, - for its reverse complement), the ``to'' node ID, its orientation, and the overlap. Optional fields such as mapping quality or number of mismatches can follow.
\item \textbf{S}: a segment or node line with 3 mandatory columns: ``S'' for the line type, the node name or ID, and the sequence. Optional columns such as sequence length, node count, or k-mer count can follow.
\item \textbf{C}: a containment line, used when one node or sequence is fully contained in another. It has 7 mandatory columns: the identifier ``C'', the ``container'' node, its orientation (whether the sequence is contained in the container as given or in its reverse complement), the ``contained'' node, its orientation, an integer giving the position where the contained sequence starts in the container sequence, and the overlap.
\item \textbf{P}: a path line with 4 mandatory columns: ``P'', the unique path ID, a comma-separated list of the node IDs and orientations on the path, and a comma-separated list of overlaps.
\end{enumerate}
Comment lines start with a hash \#.\\
Table \ref{tab:gfa_example} is a small example of a GFA file with 3 nodes (ACCTT, TCAAGG, CTTGATT), 3 edges, and one path. For example, node 1 overlaps by 4 base pairs with the reverse complement of node 2. The path line describes a path through ACCTT (node 1), CCTTGA (reverse complement of node 2), and CTTGATT (node 3), together with their respective overlaps in this orientation. After removing the overlaps, this path spells ACCTTGATT. The GFA file specification and example are adapted from \texttt{http://gfa-spec.github.io/GFA-spec/GFA1.html}.
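
The path sequence can be checked by laying the three oriented node sequences on top of each other, each shifted so that its overlap with the previous node lines up. This is only an illustrative layout of the example above:
\begin{verbatim}
ACCTT        node 1 (+)
 CCTTGA      node 2 (-), overlaps node 1 by 4 bases (CCTT)
  CTTGATT    node 3 (+), overlaps node 2 by 5 bases (CTTGA)
ACCTTGATT    sequence spelled by the path
\end{verbatim}
The length agrees with the overlaps: $5 + (6 - 4) + (7 - 5) = 9$ bases.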
\begin{table}[h]
\centering
\setlength{\tabcolsep}{1.7em}
\begin{tabular}{
>{\columncolor[HTML]{E8F1FD}}l
>{\columncolor[HTML]{E8F1FD}}l
>{\columncolor[HTML]{E8F1FD}}l
>{\columncolor[HTML]{E8F1FD}}l
>{\columncolor[HTML]{E8F1FD}}l
>{\columncolor[HTML]{E8F1FD}}l }
H & VN:Z:1.0 & & & & \\
S & 1 & ACCTT & & & \\
S & 2 & TCAAGG & & & \\
S & 3 & CTTGATT & & & \\
L & 1 & + & 2 & - & 4M \\
L & 2 & - & 3 & + & 5M \\
L & 1 & + & 3 & + & 3M \\
P & 4 & 1+,2-,3+ & 4M,5M & &
\end{tabular}
\caption{Example of a tab-separated GFA file with three nodes, three edges, and one path}
\label{tab:gfa_example}
\end{table}
\section{Extra Tables}
\renewcommand{\arraystretch}{2}
\begin{sidewaystable}
\centering
\begin{tabular}{|c|c|}
\hline
Long Read$_1$ & \verb|m141228_203435_42248_c100724252550000001823140404301566_s1_p0/124717/534_16502| \\ \hline %1
Long Read$_2$ & \verb|m140817_090215_42175_c100689561270000001823145102281512_s1_p0/77789/1261_15093| \\ \hline %0
Long Read$_3$ & \verb|m140730_234813_42161_c100693980030000001823147103241540_s1_p0/62995/10187_28465| \\ \hline %1
Long Read$_4$ & \verb|m150119_034210_42225_c100724262550000001823140404301552_s1_p0/146960/5678_25609| \\ \hline %0
Long Read$_5$ & \verb|m150114_041139_42248_c100723882550000001823140404301566_s1_p0/14429/8164_19894| \\ \hline %1
Long Read$_6$ & \verb|m150103_220523_42248_c100723822550000001823140404301525_s1_p0/52501/1323_16658| \\ \hline %1
\end{tabular}
\caption{PacBio long-read IDs from the NA19420 sample used to construct Table \ref{tab:mec_bubble_chain} for the bubble chain in Figure \ref{fig:bubble_chain}}
\label{tab:long_reads}
\end{sidewaystable}