\section{Approach} \label{sec:Approach}
Recently, learning continuous, dense, real-valued distributed vector representations of words has found its place in several NLP tasks. These high-quality vectors encode the semantics of words based on their context; a word's vector is typically learned from the words that co-occur with it within sentences or fixed-size windows in a large corpus. As a result, similar words appear close to each other in the high-dimensional (latent) vector space. To address our problem, we adopt a two-stage process. In the first step, we learn the distributed (continuous) representations of the words that appear in a given event description. In the second step, we design a bottom-up approach that combines the independent word vectors to compose a distributed vector representation of the given event-query. Next, to estimate the focus time of the event, we treat temporal expressions representing years (e.g. `1991') as special words and compare their learned vectors to the composed event vector. We obtain a ranked list of all the years mentioned in the corpus by computing vector similarity with the event vector, under the assumption that the top-ranked year represents the focus time of the given event-query.
Intuitively, since the distributed vectors capture the context of words, if the event words and a certain year often co-occur in similar contexts in a large corpus, then that year most likely represents the time period in which the event happened.
In our approach, we present two models for generating a ranked list of years for a given event. In the Early Fusion model, the word vectors of a given event are combined first, and then a single final ranked list of years is generated. In the Late Fusion model, ranked lists of years are retrieved for the individual words in the event, and these lists are then combined at a late stage to obtain the final list. We next describe our proposed approach in detail.
\subsection{Model}
\noindent\textbf{Event}\\
%We define an event as an aggregated summary of the actions contributed by it's actors, locations, organizations with respect to time.
We define an event as a textual description of a phenomenon associated with a point in time or a prolonged time period. In our case, each event is associated with a particular point of time (a particular year). \\
\noindent\textbf{Document}\\
In our case, documents are complete event descriptions present in the corpus, each with a self-explanatory headline and a specific publication date. We follow the query likelihood model to retrieve the top-K documents from the corpus for a given event-query: documents are ranked by the probability of the query $Q$ under the document's language model, $P(Q|M_{d})$.\\
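To make this retrieval step concrete, the following sketch ranks documents by query likelihood under a unigram language model; the Dirichlet smoothing scheme, the parameter \texttt{mu}, and all identifiers are illustrative assumptions rather than part of our method.
\begin{verbatim}
# Sketch: rank documents by log P(Q|M_d) under a Dirichlet-smoothed
# unigram language model (the smoothing choice is an assumption).
import math
from collections import Counter

def score_document(query_terms, doc_terms, coll_tf, coll_len, mu=2000.0):
    doc_tf, doc_len = Counter(doc_terms), len(doc_terms)
    score = 0.0
    for t in query_terms:
        p_coll = coll_tf.get(t, 0) / coll_len            # background model
        p_t = (doc_tf[t] + mu * p_coll) / (doc_len + mu)  # smoothed doc model
        if p_t > 0:
            score += math.log(p_t)
    return score

def retrieve_top_k(query_terms, docs, coll_tf, coll_len, k=10):
    ranked = sorted(docs, key=lambda d: score_document(
        query_terms, d, coll_tf, coll_len), reverse=True)
    return ranked[:k]
\end{verbatim}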
\noindent\textbf{Time}\\
We use SUTime, a library for recognizing and normalizing time expressions \cite{chang2012sutime}, to identify the temporal expressions present in the corpus. We do not handle temporal expressions at a coarser granularity than years (e.g. \textit{`More than a decade'}, \textit{`For the last 16 years'}). Temporal expressions at a finer granularity, such as months, weeks, and days, are converted into years (e.g. we only keep the year part 1991 of the Timex value 1991-07-05 for \textit{`5$^{th}$ July, 1991'}).
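The year normalization can be sketched as follows; since SUTime itself is a Java library, the sketch only covers the mapping from already-normalized TIMEX values to years, and its input format is an assumption.
\begin{verbatim}
# Sketch: keep only the year part of a normalized TIMEX value
# (e.g. "1991-07-05" -> 1991); durations and relative expressions
# such as "P10Y" or "PAST_REF" carry no year and are discarded.
import re

YEAR_PATTERN = re.compile(r"^(\d{4})")

def timex_to_year(timex_value):
    match = YEAR_PATTERN.match(timex_value)
    return int(match.group(1)) if match else None

assert timex_to_year("1991-07-05") == 1991
assert timex_to_year("P10Y") is None
\end{verbatim}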
\subsection{Global and Local Model}
\subsubsection{Global Model}
In the Global Model, we first collect the distributed representations of the words present in the corpus. In the second step, we compute the distributed representation of the event from the individual words present in the event description. In the third step, we compute the cosine similarity between the event vector and the vector of each temporal expression. Finally, we rank the temporal expressions in decreasing order of their similarity scores against the event vector.
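A minimal sketch of the Global Model, assuming pre-trained word vectors are available as a dictionary that also contains year tokens (e.g. \textit{2002}); all identifiers are illustrative.
\begin{verbatim}
# Sketch of the Global Model: compose the event vector by summing the
# event word vectors, then rank year vectors by cosine similarity.
import numpy as np

def cossim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_years_global(event_terms, year_tokens, word_vectors):
    event_vec = np.sum([word_vectors[w] for w in event_terms
                        if w in word_vectors], axis=0)
    scores = {y: cossim(word_vectors[y], event_vec)
              for y in year_tokens if y in word_vectors}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
\end{verbatim}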
\subsubsection{Local Model: Early Fusion}
In this Local Model, we first retrieve the top-K documents from the corpus for the event-query. In the second step, we add local context to the distributed representations of the temporal expressions found in the retrieved documents. For example, if the word \textit{2008} co-occurs within the same sentences of the retrieved documents with the words \textit{obama}, \textit{administration}, and \textit{presidency} 2, 4, and 6 times respectively, we add the distributed representations of these words 2, 4, and 6 times respectively to the vector of \textit{2008} to obtain its final distributed representation. For the event vector, we combine the word vectors of the individual words present in the event-query. Finally, we rank the temporal expressions in decreasing order of their similarity scores against the event vector.
\begin{center}
\includegraphics[scale=.30]{images/early-fusion.png}\\
Fig 1: Architecture of Early Fusion
\end{center}
\begin{table}[ht]
\centering
\caption{Mathematical Notations}
\label{table:math}
\begin{tabular}{|l|l|}
\hline
\textbf{Literal} & \textbf{Meaning} \\ \hline
$\mathbf{V_{y}}$ & Distributed representation of year $y$ \\ \hline
$n$ & Total number of words co-occurring with year $y$ \\ \hline
$\mathbf{V_{w_{i}}}$ & Distributed representation of the $i$-th word co-occurring with year $y$ \\ \hline
$f_{w_{i}}$ & Co-occurrence frequency of word $w_i$ with year $y$ \\ \hline
$\mathbf{V_{y}'}$ & Final distributed representation of year $y$ \\ \hline
$\mathbf{V_{e}}$ & Distributed representation of event $e$ \\ \hline
$\mathbf{V_{e_{j}}}$ & Distributed representation of the $j$-th word in event $e$ \\ \hline
$m$ & Total number of words in event $e$ \\ \hline
\end{tabular}
\end{table}
Using the notations in Table \ref{table:math}, the final vector representation $\mathbf{V_{y}'}$ of a year $y$ is computed as,
\begin{center}
$\mathbf{V_{y}'} = \mathbf{V_{y}} + \sum\limits_{i=1}^{n} f_{w_i}\,\mathbf{V_{w_i}}$\\
$\mathbf{V_{e}} = \sum\limits_{j=1}^{m} \mathbf{V_{e_j}}$\\
\end{center}
The similarity of a year $y$ with respect to the event vector $\mathbf{V_{e}}$ is then,
\begin{center}
$cossim(\mathbf{V_{y}'},\mathbf{V_{e}})$
\end{center}
where the cosine similarity of two vectors $\mathbf{A}$ and $\mathbf{B}$ is defined as,
\begin{center}
$cossim(\mathbf{A},\mathbf{B}) = {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\| \|\mathbf{B}\|} = \frac{ \sum\limits_{i=1}^{n}{A_i B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}} \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} }$
\end{center}
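The Early Fusion computation above can be sketched as follows, reusing the \texttt{cossim} helper from the Global Model sketch; the per-sentence co-occurrence counting over the retrieved documents is assumed to have been done beforehand, and all identifiers are illustrative.
\begin{verbatim}
# Sketch of Early Fusion: V_y' = V_y + sum_i f_{w_i} * V_{w_i},
# then rank years against the composed event vector V_e.
import numpy as np

def fuse_year_vector(year_vec, cooccurrence, word_vectors):
    # cooccurrence: word -> frequency f_w in the retrieved documents
    fused = np.array(year_vec, dtype=float)
    for word, freq in cooccurrence.items():
        if word in word_vectors:
            fused += freq * word_vectors[word]
    return fused

def rank_years_early_fusion(event_terms, year_cooccurrences, word_vectors):
    event_vec = np.sum([word_vectors[w] for w in event_terms
                        if w in word_vectors], axis=0)
    ranked = [(year, cossim(fuse_year_vector(word_vectors[year], cooc,
                                             word_vectors), event_vec))
              for year, cooc in year_cooccurrences.items()]
    return sorted(ranked, key=lambda kv: kv[1], reverse=True)
\end{verbatim}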
\subsubsection{Local Model: Late Fusion}
Late fusion consists of first making a decision for each view separately and then fusing the decisions arising from all views in a Local Model. Late fusion helps in reaching a final consensus after the initial agreements and disagreements produced by the individual words at the early stage of ordering the temporal expressions with respect to each word present in the event description. In this model, we compute the final representations of the temporal vectors by the same approach as above, but we do not create an event vector. Instead, we perform late fusion of the ranked lists of temporal expressions produced with respect to the individual words.
Suppose an event-query contains the terms \textit{bomb, subway, london, attack}; we then rank the temporal-expression vectors (computed in the previous step) with respect to each word present in the query. Finally, we perform Late Fusion on these individual rank lists to obtain the final rank list.
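Since no particular fusion rule is prescribed here, the sketch below uses a simple Borda-count combination of the per-word rank lists as one possible instantiation; both the rule and the identifiers are illustrative assumptions.
\begin{verbatim}
# Sketch of Late Fusion via Borda count: each per-word ranked list of
# years awards points inversely proportional to rank position, and
# years are re-ranked by their total points.
from collections import defaultdict

def late_fusion(rank_lists):
    # rank_lists: one ranked list of years per event word, best first
    scores = defaultdict(float)
    for ranking in rank_lists:
        n = len(ranking)
        for position, year in enumerate(ranking):
            scores[year] += n - position
    return sorted(scores, key=scores.get, reverse=True)
\end{verbatim}
For the example query above, \texttt{late\_fusion} would be applied to the four rank lists produced for \textit{bomb}, \textit{subway}, \textit{london}, and \textit{attack}.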
\begin{center}
\includegraphics[scale=.30]{images/late-fusion.png}\\
Fig 2: Architecture of Late Fusion
\end{center}
\noindent\textbf{Effectiveness of adding local context: \\}
Adding local context to a year brings it closer to the query context. For example, consider the query \textit{``Riots and mass killings in Indian state of Gujarat''}, whose focus time is the year 2002. Fig. 3 shows the position of the year 2002 before and after adding local context: it moves in the semantic space to become even closer to the query context.
\begin{center}
\includegraphics[scale=.35]{images/figure_1.png}\\
Fig 3: Year 2002 after adding local context
\end{center}
\noindent\textbf{Effectiveness of Late Fusion: \\}
\noindent\textbf{Human Judgement: } It has been empirically observed that no utterance completely captures a unique meaning for all readers. Thus, it might be expected that diverse interpretations of an utterance are scattered, with some probability distribution, around the `intention' of the utterer. In this situation, some combination of queries will be more effective than even the best of the input queries a substantial fraction of the time \cite{belkin1995combining}.\\
\noindent\textbf{Signal Processing: } In a signal-processing model, at any choice of threshold $t$ there is some probability $d(t)$ that a relevant item will be retrieved and some probability $f(t)$ that a non-relevant item will be retrieved. If two systems are independent and use the same threshold, the corresponding probabilities for the combined system are (suppressing for a moment the threshold parameter) $d(1)d(2)$ and $f(1)f(2)$. In general (Cherikh, 1989), this can be expected to yield a system with a better operating characteristic than either of the independent systems. Intuitively, there are so many non-relevant documents that could be brought to the top of the list as ``noise'' that there is little chance that two independent systems would bring the same ones to the top. There are far fewer relevant documents, and so the chance that those will appear at the top of both lists is correspondingly greater \cite{belkin1995combining}.\\
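As a purely illustrative example with hypothetical numbers: if both systems retrieve a relevant item with probability $d(1)=d(2)=0.8$ and a non-relevant item with probability $f(1)=f(2)=0.1$, the combined system has
\begin{center}
$d = 0.8 \times 0.8 = 0.64$ \quad and \quad $f = 0.1 \times 0.1 = 0.01$,
\end{center}
so the ratio of relevant to non-relevant retrievals improves from $8$ to $64$.\\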
\noindent\textbf{Expansion of Parameter Space: } When some objective function is to be maximized, any expansion of the space of possible retrieval rules can lead to, at worst, the same best value and, at best, a better value. When the outputs of two systems are already available to use as solution choices, expanding the parameter space by considering different combinations of them can, logically, only improve the performance \cite{belkin1995combining}.