Vector Space Models

# Distributed word representations #

## Meaning representations #

Co-occurrence matrix:

Meaning can be recovered from such a matrix.

• If a word co-occurs often with “excellent,” it likely is a positive word; if it co-occurs often with “terrible,” it likely denotes something negative

## Guiding hypothesis for vector-space models #

The meaning of a word is derived from its use in a language. If two words have similar vectors in a co-occurrence matrix, they tend to have similar meanings (Turney and Pantel, 2010).

## Feature representations of data #

• “the movie was horrible” becomes `[4, 0, 0.25]` (4 words, 0 proper names, 0.25 concentration of negative words)
• Reduces noisy data to restricted feature set

## What do we define co-occurrence as? #

For a sentence, e.g., “from swerve of shore to bend of bay”:

Consider the word “to”

• Window: how many words around “to” (in both directions) do we want to focus on?
• Scaling: how to weight words in the window?
• Flat: treat everything equally
• Inverse: word is weighted 1/n if it is distance n from the target word

Larger, flatter windows capture more semantic information, whereas smaller, more scaled windows capture more syntactic information

Can also consider co-occurrence at different unit sizes: words, sentences, etc.
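The windowing and scaling choices above can be sketched as a small counting function (a minimal sketch; the function name and argument conventions are my own):

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2, scaling="flat"):
    """Count weighted co-occurrences within a symmetric window.

    scaling="flat" weights every neighbor 1.0;
    scaling="inverse" weights a neighbor at distance n by 1/n.
    """
    counts = defaultdict(float)
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if i == j:
                continue
            dist = abs(i - j)
            weight = 1.0 if scaling == "flat" else 1.0 / dist
            counts[(target, tokens[j])] += weight
    return counts

toks = "from swerve of shore to bend of bay".split()
flat = cooccurrence_counts(toks, window=2, scaling="flat")
inv = cooccurrence_counts(toks, window=2, scaling="inverse")
```

With the inverse scheme, the two occurrences of “of” at distance 2 from “to” each contribute only 0.5 instead of 1.0.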

## Constructing data #

• Tokenization
• Annotation
• Tagging
• Parsing
• Feature selection

## Matrix design #

### word x word #

• Rows and columns represent individual words
• Value `a_ij` in a matrix represents how many times words `i` and `j` co-occur with each other in a given set of documents
• Very dense (lots of nonzero entries)! Density increases with more documents in the corpus
• Dimensionality remains fixed as we bring in new data as long as we pre-decide on vocabulary

### word x document #

• Rows represent words; columns represent documents
• Value `a_ij` in a matrix represents how many times word `i` occurs in document `j`
• Very sparse: may be hard to compute certain operations, but easy storage
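A word x document matrix of this kind can be built in a few lines (a minimal NumPy sketch; the mini-documents are invented for illustration):

```python
import numpy as np

docs = [
    "the movie was horrible horrible",
    "the movie was excellent",
    "excellent plot excellent cast",
]
vocab = sorted({w for d in docs for w in d.split()})
row = {w: i for i, w in enumerate(vocab)}

# X[i, j] = number of times word i occurs in document j
X = np.zeros((len(vocab), len(docs)), dtype=int)
for j, d in enumerate(docs):
    for w in d.split():
        X[row[w], j] += 1
```

Even in this toy case most entries are zero; real corpora are stored with sparse-matrix formats rather than dense arrays.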

### word x discourse context #

• Rows represent words; columns represent discourse context labels
• Labels are assigned by human annotators based on what type of context the sentence is (i.e., acceptance dialogue, rejecting part of previous statement, phrase completion, etc)
• Value `a_ij` in a matrix represents how many times word `i` occurs in discourse context `j`

### Other designs #

• word x search proximity
• word x dependency relations

Note: Models like GloVe and word2vec provide packaged solutions in which these design choices have already been made.

## Vector comparison (similarity) #

In the example of three word vectors A, B, and C:

Note that B and C are close in distance (frequency information), but A and B have a similar bias, i.e., direction (syntactic/semantic information)

### Euclidean #

For vectors `u`, `v` of `n` dimensions:

$euc(u, v) = \sqrt{\sum_{i=1}^{n} |u_i - v_i|^2}$

This measures the straight-line distance between `u` and `v`, capturing the pure distance aspect of similarity.

Note: Length normalization

Dividing each vector by its L2 norm before taking the Euclidean distance instead captures the bias (direction) aspect of similarity

### Cosine #

For vectors `u`, `v` of `n` dimensions:

$cosdist(u, v) = 1 - \frac{u^{\top} v}{\|u\|_2 \, \|v\|_2}$

• Division by the length effectively normalizes vectors
• Captures the bias aspect of similarity
• Not considered a proper distance metric because it fails the triangle inequality; however, the following does:

$newcosdist(u, v) = \frac{\cos^{-1}\left(\frac{u^{\top} v}{\|u\|_2 \, \|v\|_2}\right)}{\pi}$

• But the correlation between these two metrics is nearly perfect, so in practice, use the simpler one
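The distance-versus-bias contrast can be made concrete with both metrics side by side (A, B, and C here are hypothetical count vectors invented for illustration: A and B share a direction, B and C share a magnitude):

```python
import numpy as np

def euc(u, v):
    """Straight-line (Euclidean) distance."""
    return np.sqrt(np.sum((u - v) ** 2))

def cosdist(u, v):
    """Cosine distance: 1 minus the cosine of the angle between u and v."""
    return 1 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

A = np.array([2.0, 4.0])     # rare word
B = np.array([20.0, 40.0])   # frequent word, same usage profile as A
C = np.array([40.0, 20.0])   # frequent word, different usage profile
```

B and C come out closer than A and B under Euclidean distance, while A and B are identical (distance 0) under cosine distance.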

### Other metrics #

• Matching
• Dice
• Jaccard
• KL divergence (a measure of dissimilarity between probability distributions; not symmetric)
• Overlap

## Reweighting #

Goal: Amplify the data that is useful for generalization, because raw counts/frequencies are a poor proxy for semantic information

### Normalization #

• L2 norming (see above)
• Probability distribution: divide values by sum of all values
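Both normalizations can be sketched per row of a count matrix (the toy matrix is invented; its two rows have the same profile at different frequencies, so both collapse to the same normalized row):

```python
import numpy as np

X = np.array([[10.0, 20.0],
              [1.0,  2.0]])

# L2 norming: divide each row by its Euclidean length
l2 = X / np.linalg.norm(X, axis=1, keepdims=True)

# Probability distribution: divide each row by its sum
prob = X / X.sum(axis=1, keepdims=True)
```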

### Observed/Expected #

Intuition: word pairs in idioms co-occur more often than expected by chance; most other word pairs co-occur less often than expected

### Pointwise Mutual Information (PMI) #

$pmi(X, i, j) = \ln\left(\frac{X_{ij}}{expected(X, i, j)}\right) = \ln\left(\frac{P(X_{ij})}{P(X_{i*})\, P(X_{*j})}\right)$

This is the log of observed count divided by expected count.

### Positive PMI #

PMI is undefined when `X_{ij} = 0`, since `ln(0)` diverges. So:

$ppmi(X, i, j) = \max(0, pmi(X, i, j))$
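PPMI reweighting of a whole count matrix can be sketched as follows (a minimal NumPy implementation; it assumes no all-zero rows or columns):

```python
import numpy as np

def ppmi(X):
    """Positive PMI reweighting of a count matrix X (rows: words)."""
    P = X / X.sum()
    row = P.sum(axis=1, keepdims=True)   # P(X_{i*})
    col = P.sum(axis=0, keepdims=True)   # P(X_{*j})
    expected = row @ col                 # outer product of marginals
    with np.errstate(divide="ignore"):
        pmi = np.log(P / expected)       # log(0) = -inf for zero counts
    return np.maximum(0, pmi)            # clip negatives (and -inf) to 0

counts = np.array([[10.0, 0.0],
                   [0.0, 10.0]])
M = ppmi(counts)
```

In this toy matrix each word co-occurs only with itself, so the diagonal entries come out positive (ln 2) and the off-diagonal zeros are clipped to 0.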

### TF-IDF #

For a corpus of documents D:
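The notes leave the formula implicit; one standard variant (term frequency normalized per document, inverse document frequency as a log ratio, with no smoothing) can be sketched as:

```python
import numpy as np

def tfidf(X):
    """TF-IDF for a word x document count matrix X.

    tf(i, j) = X[i, j] / (total count in document j)
    idf(i)   = log(#documents / #documents containing word i)
    """
    tf = X / X.sum(axis=0, keepdims=True)
    df = (X > 0).sum(axis=1, keepdims=True)
    idf = np.log(X.shape[1] / df)
    return tf * idf

counts = np.array([[1.0, 1.0],
                   [2.0, 0.0]])
weighted = tfidf(counts)
```

A word that appears in every document gets idf = 0, so its whole row is zeroed out; words concentrated in few documents are amplified. Libraries such as scikit-learn use smoothed variants of the same idea.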

## Dimensionality reduction #

### Latent Semantic Analysis #

• Also known as Truncated Singular Value Decomposition (Truncated SVD)
• Standard baseline, difficult to beat

Intuition:

• Fitting a linear model to the data enables dimensionality reduction (since we can project each point onto the model); this captures the greatest source of variation in the data
• We can continue adding linear models to capture other sources of variation

Method:

Any matrix of real numbers can be written as

$A = TSD^T$

where `S` is a diagonal matrix of singular values and `T` and `D^T` are orthogonal. In NLP, `T` is the term matrix and `D^T` is the document matrix.

Dimensionality reduction comes from being selective about which singular values and terms to include (i.e., capturing only a few sources of variation in the data).
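Truncated SVD can be sketched with NumPy (the toy matrix is invented; it has rank 2, so keeping `k = 2` singular values reconstructs it exactly):

```python
import numpy as np

# Hypothetical word x document count matrix (rank 2 by construction)
X = np.array([[2.0, 2.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 3.0]])

# Full decomposition: X = T @ diag(S) @ Dt, singular values in S sorted descending
T, S, Dt = np.linalg.svd(X, full_matrices=False)

# Truncate to the top k singular values (LSA)
k = 2
X_k = T[:, :k] @ np.diag(S[:k]) @ Dt[:k, :]

# Reduced-dimensional term vectors (one row per word)
term_vectors = T[:, :k] * S[:k]
```

On real data the discarded singular values are small but nonzero, so `X_k` is the best rank-`k` approximation of `X` rather than an exact copy.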

### Autoencoders #

• Flexible class of deep learning architectures for learning reduced dimensional representations

Basic autoencoder model:
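A minimal linear autoencoder can be sketched in NumPy (the toy data, architecture, and hyperparameters here are invented for illustration; practical autoencoders add nonlinear activations and biases and are trained with a deep learning framework):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 points in 10 dimensions that actually lie on a
# 2-dimensional subspace, so a 2-d code suffices to reconstruct them.
Z = rng.normal(size=(100, 2))
X = Z @ rng.normal(size=(2, 10))

d, k = X.shape[1], 2                         # input dim, reduced dim
W_enc = rng.normal(scale=0.1, size=(d, k))   # encoder weights
W_dec = rng.normal(scale=0.1, size=(k, d))   # decoder weights

def loss(X, W_enc, W_dec):
    """Mean squared reconstruction error."""
    H = X @ W_enc          # encode: project to k dimensions
    X_hat = H @ W_dec      # decode: reconstruct the input
    return ((X_hat - X) ** 2).mean()

lr = 0.1
initial = loss(X, W_enc, W_dec)
for _ in range(500):
    H = X @ W_enc
    X_hat = H @ W_dec
    G = 2.0 * (X_hat - X) / X.size        # gradient of loss w.r.t. X_hat
    grad_dec = H.T @ G
    grad_enc = X.T @ (G @ W_dec.T)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
```

With purely linear layers and squared error, the optimal encoder spans the same subspace as truncated SVD; the nonlinear versions used in practice can capture structure SVD cannot.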

### GloVe #

• Goal is to learn vectors for words such that their dot product is proportional to their log probability of co-occurrence
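This objective (Pennington et al., 2014) can be written as:

$J = \sum_{i, j} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \ln X_{ij} \right)^2$

where `w_i` and `\tilde{w}_j` are the word and context vectors, `b_i` and `\tilde{b}_j` are learned bias terms, and `f` is a weighting function with `f(0) = 0` that drops zero-count pairs and damps the influence of very rare and very frequent co-occurrences.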