Monday, May 16, 2011

LDA is not Ladies Ditching Apes

I have recently been working on Topic Modelling and thought I'd do a brief tutorial on how to automatically divide a text into a series of relevant topics.

Before we dive into our coding, let's give a brief overview of the topic so we are all on the same page:

Topic Modelling is all about automatically finding the thematic structure of a document or a series of documents.

Topic modelling specifies a probabilistic process through which documents can be created. First, a distribution over topics is selected. For example, the topics Love and Mexico could be chosen, each assigned a certain weight or probability. If the article were meant to have a more political inclination, Mexico would be assigned a greater weight than Love, whereas if the purpose were to write a romantic novel, Love would receive a much higher weight or probability than Mexico. Once the topics and their corresponding probabilities have been assigned, a topic is chosen at random according to that distribution, and a word is drawn from that topic. This process of randomly choosing a topic and then a word from it is repeated until the system has finished "writing" the article.
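
To make the generative story concrete, here is a minimal sketch in Python (using numpy). The topics, words and weights are invented purely for illustration; real topics would of course be learned from data.

import numpy as np

np.random.seed(0)

# Toy topics: each topic is a probability distribution over words.
# Both the vocabularies and the probabilities here are made up.
topics = {
    "Love":   (["heart", "kiss", "forever", "rose"],    [0.4, 0.3, 0.2, 0.1]),
    "Mexico": (["election", "border", "peso", "city"],  [0.4, 0.3, 0.2, 0.1]),
}

# Topic distribution for one "document": a romantic text leans heavily towards Love.
topic_names = ["Love", "Mexico"]
topic_weights = [0.8, 0.2]

def generate_document(n_words=10):
    """Write a document word by word, exactly as the generative story describes."""
    words = []
    for _ in range(n_words):
        topic = np.random.choice(topic_names, p=topic_weights)  # pick a topic
        vocab, probs = topics[topic]
        words.append(np.random.choice(vocab, p=probs))          # pick a word from that topic
    return " ".join(words)

print(generate_document())

Running this a few times gives different (nonsensical) documents, but their word statistics reflect the chosen topic weights.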

Besides creating documents automatically (hasta la vista, Literature students :P), Topic Modelling can also infer the set of topics responsible for generating a collection of documents.
We care about Topic Modelling because it can enhance search in large archives of text, and it also allows for better similarity measures: given two documents, exactly how similar are they?

Different algorithms exist for finding the thematic structure of a document. Today we will focus on one in particular, called Latent Dirichlet Allocation (LDA), which is a "...generative probabilistic model for text corpora...".
The intuition behind LDA is that a document is composed of a series of different topics, and each topic is a probability distribution over words. Each document is a random mixture of corpus-wide topics, and each word of a document is drawn from one of these topics. LDA aims to infer how the documents are divided among these topics and what the topics are. The only information LDA has is the documents themselves.

In the following, we will use THE FORMAL notation of LDA (mathematical style!) to make things a bit clearer:
P(z) denotes the topic distribution z of a particular document, and P(w | z) is the probability distribution over words w given topic z. LDA assumes each word w_i in a document (where the index i refers to the i-th word token) is generated by first sampling a topic from the document's topic distribution and then choosing a word from that topic's distribution over words. We write P(z_i = j) for the probability that the j-th topic was sampled for the i-th word token, and P(w_i | z_i = j) for the probability of word w_i under topic j. Putting these together, the probability of the i-th word is the mixture P(w_i) = Σ_j P(w_i | z_i = j) P(z_i = j).
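
As a quick sanity check on that mixture formula, here is a tiny numerical example in Python; the two topics and all the numbers are made up:

# Hypothetical numbers for one word token w_i and two topics.
p_z = [0.8, 0.2]            # P(z_i = j): probability that topic j was sampled
p_w_given_z = [0.05, 0.30]  # P(w_i | z_i = j): probability of the word under topic j

# P(w_i) = sum over j of P(w_i | z_i = j) * P(z_i = j)
p_w = sum(pw * pz for pw, pz in zip(p_w_given_z, p_z))
print(p_w)  # 0.05*0.8 + 0.30*0.2 = 0.10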

LDA assumes that the per-document topic proportions follow a Dirichlet distribution, i.e. the mixture weights θ are generated by a Dirichlet prior on θ. Each topic is modelled as a multinomial distribution over words.
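
To see what that prior does, here is a small numpy sketch (again with made-up sizes and hyperparameters): draw the mixture weights θ for a few documents from a Dirichlet, draw multinomial topics over a toy vocabulary, and generate a handful of words.

import numpy as np

np.random.seed(0)

n_topics, vocab_size = 3, 8
alpha, beta = 0.1, 0.5   # small alpha -> each document concentrates on a few topics

# Mixture weights theta for five documents, drawn from the Dirichlet prior.
thetas = np.random.dirichlet([alpha] * n_topics, size=5)
print(thetas.sum(axis=1))  # each row sums to 1: a topic distribution per document

# Each topic is a multinomial distribution over the vocabulary
# (drawn from a Dirichlet here as well, just to have something to sample from).
topic_word = np.random.dirichlet([beta] * vocab_size, size=n_topics)

# Generate a few word ids for the first document, following the generative process.
doc_theta = thetas[0]
for _ in range(5):
    z = np.random.choice(n_topics, p=doc_theta)         # sample a topic
    w = np.random.choice(vocab_size, p=topic_word[z])   # sample a word id from that topic
    print("topic", z, "-> word id", w)
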
Hopefully this brief overview will allow us to have some Python coding fun for our next post!

6 comments:

Anonymous said...

Hey, sounds like a cool topic! I played around with Latent Semantic Indexing some time ago. LDA seems to be the probabilistic model of it, so I guess you have fun fitting parameters with EM. It will be interesting to see the python code. Oh, I like the new colors of your blog. :) Cheers

samee said...

Let's see if I understood:
So from a text with many topics, the program would subdivide it into the relevant topics?
And LDA is applied to find that structure in the texts.
Oh, look how interesting!
Well, we'll be waiting for the code :D

Oh, I also loved the new look of the blog :D

Good vibes!

Little Saiph said...

@bytefish thanks for the comment on my blog =). I found a Python library called Gensim: http://nlp.fi.muni.cz/projekty/gensim/tutorial.html It's pretty nice. I still haven't finished the experiments I need to run, so due to time constraints the Python code post will be postponed :'( But it could be a nice summer read, right? LOL
What were you doing exactly with LSI?

@samee thanks Samy for reading me! And thanks for cheering on my blog's new look! :D Yep, exactly, it's a method with which the computer can automatically detect which topics are present in a text. That can be useful for returning better results in a search, etc.

BTW I'm glad I met you guys through my blog. It's pretty amazing how my blog has brought me real friends =)

Anonymous said...

Are there any practical applications for this that you know of?

Little Saiph said...

hey! thanks for visiting my blog, chuchin :P
Chomp (www.chomp.com) uses LDA for their business.
In addition to Chomp, which is doing really cool things, Jstor's research group is also using topic models for search. Specifically, the discipline browser and their Data for Research website use topics to restrict search results to certain categories (see http://dfr.jstor.org). Google has used topic models for summarizing the content of documents, and researchers at Yahoo have studied LDA for matching items (webpages or ads) with users (search for fLDA by Agarwal and Chen).

ETS has studied topic models for automated essay grading (see "Comparison of Dimension Reduction Methods for Automated Essay Grading"), and Pearson is working on the same sorts of technology (see http://www.pearsonkt.com/) using topic models. IMHO, it's slightly scary that students' essays might be graded by algorithms with the current limitations we have in NLP.

Little Saiph said...

that info came from the lda mailing list ;)