Friday, July 27, 2012

Layman's Explanation of Online LDA

Topic Modeling!
LDA stands for Latent Dirichlet Allocation, and it is a type of topic modeling algorithm. The purpose of LDA is to learn a representation for a fixed number of topics and, given that number of topics, to learn the topic distribution of each document in a collection. For example, if we were given the following sentences:
A: I spent the day at the beach tanning.
B: I ate Mexican tacos and guacamole.
C: I love tanning on Mexican beaches while eating quesadillas and tacos under the sun.


LDA might say something like:
Sentence A is 100% about Topic 1
Sentence B is 100% about Topic 2
Sentence C is 30% Topic 1, 70% Topic 2


where LDA also discovers that:
Topic 1: 30% beach, 15% tanning, 10% sun, … (where we notice that topic 1 represents things related to the beach)
Topic 2: 40% Mexican, 10% tacos, 10% guacamole, 10% quesadilla, … (where we notice that topic 2 represents things related to Mexico.)
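To make this concrete, here is a minimal sketch (my own illustrative code, not part of the original example) that fits LDA with two topics on the three sentences above using scikit-learn. The library choice, preprocessing, and parameters are all assumptions on my part, and the exact percentages you get will differ from the made-up numbers above.

```python
# Minimal sketch: fit LDA with K = 2 topics on the three example sentences.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "I spent the day at the beach tanning.",                                              # A
    "I ate Mexican tacos and guacamole.",                                                 # B
    "I love tanning on Mexican beaches while eating quesadillas and tacos under the sun.",  # C
]

# Turn the documents into word-count vectors.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Fit LDA with a fixed number of topics (K = 2).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)        # each row ~ p(topic | document)
print(doc_topic)                             # e.g. sentence C mixes both topics

# Topic-word distributions: normalize each row of the learned topic-word counts.
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
words = vectorizer.get_feature_names_out()
for k, row in enumerate(topic_word):
    top = row.argsort()[::-1][:4]
    print(f"Topic {k}:", ", ".join(f"{words[i]} ({row[i]:.0%})" for i in top))
```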


LDA learns these topic and document representations as follows:

1) First, the number of topics to discover, K, is selected (similar to specifying the number of clusters we want a clustering algorithm to find).

2) Once the number of topics is selected, LDA goes through each word in each document and randomly assigns the word to one of the K topics. After this step we already have topic representations (how words are distributed within each topic) and documents represented in terms of topics (just like the example above, where Sentence C is 30% about Topic 1 and 70% about Topic 2). The catch is that the assignment of words to topics was done at random, so this initial representation is neither optimal nor accurate. To improve it, LDA analyzes, per document, what percentage of the words in that document are assigned to a particular topic, and, for each word, what percentage of times that word has been assigned to a particular topic across all documents. LDA therefore calculates:

1) p(topic t | document d) = percentage of words in a document d that are currently assigned to topic t.
2) p(word w | topic t) = percentage of times the word w was assigned to topic t over all documents.

LDA will decide to move a word w from topic A to topic B when:
p(topic A | document d) * p(word w | topic A) < p(topic B | document d) * p(word w | topic B)
After enough passes, LDA "converges" to a more stable state, where both the topic representations and the document-topic mixtures are reasonably good.
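As a rough illustration, here is a toy sketch (my own, not the procedure of any particular LDA implementation) of the loop described above: random initialization followed by repeatedly moving each word to the topic with the larger p(topic | document) * p(word | topic) product. The corpus and K are purely illustrative.

```python
import random
from collections import Counter

# Toy corpus: each document is a list of word tokens (illustrative only).
docs = [
    "beach tanning sun sand".split(),
    "mexican tacos guacamole quesadilla".split(),
    "tanning mexican beach tacos sun quesadilla".split(),
]
K = 2            # number of topics, chosen up front as in step 1
random.seed(0)

# Step 2: randomly assign every word occurrence to one of the K topics.
assignments = [[random.randrange(K) for _ in doc] for doc in docs]

for _ in range(20):                                  # repeat until assignments settle
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            # p(topic t | document d): share of words in document d assigned to t
            # (leaving out the word we are about to reassign).
            doc_counts = Counter(t for j, t in enumerate(assignments[d]) if j != i)
            # p(word w | topic t): share of occurrences of w assigned to t,
            # counted over all documents.
            word_counts = Counter(
                assignments[dd][j]
                for dd, dwords in enumerate(docs)
                for j, token in enumerate(dwords)
                if token == w and not (dd == d and j == i)
            )

            def score(t):
                p_topic_doc = doc_counts[t] / max(sum(doc_counts.values()), 1)
                p_word_topic = word_counts[t] / max(sum(word_counts.values()), 1)
                return p_topic_doc * p_word_topic

            # Move the word to whichever topic gives the larger product.
            assignments[d][i] = max(range(K), key=score)

print(assignments)   # final topic assignment for every word occurrence
```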

Now that we have understood the underlying principle of how LDA works, we can discuss online LDA.
The problem with LDA is that the posterior probability we need in order to reassign words to topics is intractable to compute exactly. Therefore researchers use approximation techniques to estimate this posterior.
Generally, algorithms for approximating this posterior are based on either sampling approaches or optimization approaches. Sampling approaches are typically based on Markov Chain Monte Carlo (MCMC) sampling. MCMC estimates the posterior distribution by randomly drawing values from a complex distribution of interest. MCMC methods are named that way because previously sampled values (previous states) affect the generation of the next sample value (in other words, the transition probabilities between sample values are a function of the most recent sample value).
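To give a feel for the Markov-chain idea, here is a minimal Metropolis-Hastings sketch (my own illustrative example, unrelated to any specific LDA package) where each new proposal depends only on the most recently accepted sample.

```python
import math
import random

def target_density(x):
    # Unnormalized density we want to sample from (a standard normal here).
    return math.exp(-0.5 * x * x)

random.seed(0)
current = 0.0
samples = []
for _ in range(10_000):
    proposal = current + random.gauss(0.0, 1.0)   # proposal depends only on `current`
    accept_prob = min(1.0, target_density(proposal) / target_density(current))
    if random.random() < accept_prob:
        current = proposal
    samples.append(current)

# The empirical mean and variance approach those of the target distribution.
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)
```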
Optimization approaches, on the other hand, are typically based on variational inference. Variational inference can be seen as a deterministic alternative to MCMC: it replaces MCMC's random sampling with optimization, seeking a simplified parametric distribution that is close, in Kullback-Leibler divergence, to the posterior. The following picture intends to show how variational inference defines a subfamily of distributions, where the goal is to find the point in that subfamily that is closest to P(z|x). Closeness is measured using the Kullback-Leibler divergence, a non-symmetric measure of the difference between two probability distributions P and Q.
Variational inference has been shown to be about as accurate as MCMC, but FASTER, which has made it very popular for large datasets.
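As a tiny illustration of the Kullback-Leibler divergence just mentioned (my own example, with made-up distributions), the sketch below computes it for two small discrete distributions and shows that it is not symmetric.

```python
import math

def kl(p, q):
    # KL(P || Q) = sum_i p_i * log(p_i / q_i), assuming q_i > 0 wherever p_i > 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.1, 0.4, 0.5]
Q = [0.8, 0.1, 0.1]
print(kl(P, Q))   # KL(P || Q)
print(kl(Q, P))   # KL(Q || P) -- a different value, showing the asymmetry
```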

Now, despite the benefits variational inference brings, large-scale data analysis can still be difficult. What many groups have done is use batch variational inference, which constantly iterates between analyzing every observation and updating the dataset-wide variational parameters; on really big datasets each such iteration becomes very costly and impractical... and this is where online LDA comes to the rescue!
Online LDA is based on online stochastic optimization, which has been shown to produce good parameter estimates dramatically faster than batch algorithms on large datasets.
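Here is a minimal sketch of the online idea (my own assumed setup, with illustrative documents and parameters) using scikit-learn's LatentDirichletAllocation with partial_fit, which processes the corpus in small minibatches instead of sweeping the whole dataset on every iteration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "the beach was sunny and warm",
    "tacos and guacamole for dinner",
    "tanning under the sun at the beach",
    "quesadillas are a mexican dish",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

online_lda = LatentDirichletAllocation(n_components=2, learning_method="online",
                                       random_state=0)

# Stream the corpus in small minibatches instead of fitting it all at once;
# each call updates the dataset-wide topic parameters from one minibatch.
batch_size = 2
for start in range(0, counts.shape[0], batch_size):
    online_lda.partial_fit(counts[start:start + batch_size])

print(online_lda.transform(counts))   # document-topic mixtures after streaming
```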
Online stochastic optimization in LDA is about finding a balance between exploiting the knowledge gained from the topic assignments seen so far and exploring new topic assignments.

Note: Images from Stanford University and Princeton University.