An alternative to the one-hot representation is to figure out the meaning of a word from the words that frequently appear nearby.
The context of a word \(w\) is the set of words that appear nearby (within a fixed-size window). The representation of \(w\) is built from the many contexts in which \(w\) occurs. This representation is called an embedding. It is an \(N\)-dimensional vector.
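As a concrete illustration, here is a minimal sketch (toy sentence and window size are arbitrary choices for illustration) of how fixed-size context windows are collected for each word:

```python
# Minimal sketch: collecting fixed-size context windows for each word.
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2  # number of words on each side of the center word

for t, center in enumerate(sentence):
    # The context of the center word is its neighbours inside the window.
    left = max(0, t - window)
    right = min(len(sentence), t + window + 1)
    context = [sentence[j] for j in range(left, right) if j != t]
    print(center, "->", context)
```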
Question: How do we decide on the dimensions of an embedding?
Word2vec: Given a corpus, represent every word by a vector. Let the center word be \(c\), and the other words in its context window be the outside words \(o\).
For each outside word \(o\), calculate \(P(o \mid c)\). Adjust the vectors to maximize this quantity.
Each word actually has two vectors (see the sketch below):
- \(v_{w}\): used when \(w\) is the center word
- \(u_{w}\): used when \(w\) is a context word
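One way to picture the parameters: two matrices, one row per word, holding the center-word vectors \(v_w\) and the context-word vectors \(u_w\). A hedged sketch (toy vocabulary, dimension, and random initialization are all illustrative choices):

```python
import numpy as np

vocab = ["king", "queen", "man", "woman", "the"]      # toy vocabulary
word_to_id = {w: i for i, w in enumerate(vocab)}
N = 50                                                # embedding dimension (a design choice)

rng = np.random.default_rng(0)
# One row per word: V holds the center-word vectors v_w, U the context-word vectors u_w.
V = rng.normal(scale=0.1, size=(len(vocab), N))
U = rng.normal(scale=0.1, size=(len(vocab), N))

v_king = V[word_to_id["king"]]   # v_w: used when "king" is the center word
u_king = U[word_to_id["king"]]   # u_w: used when "king" appears as a context word
```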
Question: Should the window be symmetric?
The likelihood function is:
\[
L(\theta) \;=\; \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t;\, \theta)
\]
\(t\) is the position of the center word as the window slides over the corpus of length \(T\), and \(m\) is the window size.
The objective function is the average negative log-likelihood:
\[
J(\theta) \;=\; -\frac{1}{T}\,\log L(\theta) \;=\; -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t;\, \theta)
\]
\(\theta\) represents all the parameters, i.e. the vectors \(v_{w}\) and \(u_{w}\) of every word. The probability of an outside word \(o\) given a center word \(c\) is
\[
P(o \mid c) \;=\; \frac{\exp(u_{o}^{\top} v_{c})}{\sum_{w \in V} \exp(u_{w}^{\top} v_{c})}
\]
\(V\) is the vocabulary of the corpus.
The denominator provides the normalization. The exponential ensures every term is positive. The function above is the softmax function: it converts an arbitrary set of real numbers into probabilities. It is a generalization of the logistic function.
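A small sketch of the softmax probability \(P(o \mid c)\), using toy random matrices \(U\) and \(V\) (the max-subtraction is only for numerical stability and does not change the result):

```python
import numpy as np

rng = np.random.default_rng(0)
num_words, N = 5, 50                              # toy vocabulary size and embedding dimension
U = rng.normal(scale=0.1, size=(num_words, N))    # context-word vectors u_w, one row per word
V = rng.normal(scale=0.1, size=(num_words, N))    # center-word vectors v_w

def p_outside_given_center(o_id, c_id):
    """Softmax probability P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)."""
    scores = U @ V[c_id]              # u_w . v_c for every word w in the vocabulary
    scores = scores - scores.max()    # numerical stability only
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o_id]

print(p_outside_given_center(o_id=1, c_id=0))     # some probability between 0 and 1
```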
To compute the vectors, we need to minimize the objective function (equivalently, maximize the likelihood), which means calculating the partial derivatives with respect to every vector.
We can easily show that:
\[
\frac{\partial}{\partial v_{c}} \log P(o \mid c) \;=\; u_{o} \;-\; \sum_{x \in V} P(x \mid c)\, u_{x}
\]
This is zero when \(u_{o} = \sum_{x \in V} P(x \mid c)\, u_{x}\). The second term is really just the expected value of the context vector \(u_{x}\) under the model's distribution \(P(x \mid c)\).
So, given a center word \(c\) and a fixed context window, the maximum is reached when the expected context vector equals the observed context vector \(u_{o}\).
(But we will have multiple context words! The expected value is a constant, independent of \(o\), so when the sum over the outside words \(o\) is brought inside, the condition becomes: the expected context vector equals the average of the observed context vectors in the window.)
Thus the gradient has the form: observed value minus expected value.
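The "observed minus expected" form can be checked numerically. A hedged sketch (toy random vectors, hypothetical word ids) comparing the analytic gradient \(u_{o} - \sum_{x} P(x \mid c)\, u_{x}\) against a finite-difference estimate of \(\partial \log P(o \mid c) / \partial v_{c}\):

```python
import numpy as np

rng = np.random.default_rng(1)
num_words, N = 6, 10
U = rng.normal(size=(num_words, N))   # context vectors u_w
V = rng.normal(size=(num_words, N))   # center vectors v_w
c, o = 0, 3                           # hypothetical center / outside word ids

def log_p(o_id, v_c):
    """log P(o | c) for a given center vector v_c."""
    scores = U @ v_c
    scores = scores - scores.max()    # stable log-softmax; the shift cancels out
    return scores[o_id] - np.log(np.exp(scores).sum())

# Analytic gradient: observed context vector minus the expected context vector.
scores = U @ V[c]
probs = np.exp(scores - scores.max())
probs /= probs.sum()
analytic = U[o] - probs @ U           # u_o - sum_x P(x|c) u_x

# Finite-difference estimate of d log P(o|c) / d v_c, one coordinate at a time.
eps = 1e-6
numeric = np.array([
    (log_p(o, V[c] + eps * np.eye(N)[k]) - log_p(o, V[c] - eps * np.eye(N)[k])) / (2 * eps)
    for k in range(N)
])
print(np.allclose(analytic, numeric, atol=1e-5))  # should print True
```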
If a word has multiple meanings, it still has only one vector.
The vectors for antonyms are actually close to each other. The negative of a word's vector usually lands closest to some fairly random word.
Analogy Problem
Take the vector for king. Subtract the vector for man and add the vector for woman. What is the closest vector? You’ll see it is queen.
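A minimal sketch of the analogy computation using cosine similarity. The `embeddings` dictionary is assumed to come from a trained model (it is not defined here); the query words themselves are usually excluded from the nearest-neighbour search:

```python
import numpy as np

def nearest(target, embeddings, exclude=()):
    """Return the word whose vector has the highest cosine similarity to `target`."""
    best_word, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if word in exclude:
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target) + 1e-12)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# Assuming `embeddings` maps words to vectors from a trained word2vec model:
# target = embeddings["king"] - embeddings["man"] + embeddings["woman"]
# print(nearest(target, embeddings, exclude={"king", "man", "woman"}))
```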