I have recently been working on Topic Modelling and thought I'd do a brief tutorial on how to automatically divide a text into a series of relevant topics.
Before we dive into our coding, let's give a brief overview of the topic so we are all on the same page:
Topic Modelling is all about automatically finding the thematic structure of a document or a series of documents.
Topic modelling specifies a probabilistic recipe through which documents can be generated. First, a distribution over topics is selected. For example, the topics Love and Mexico could be chosen, each assigned a certain weight or probability. If we wanted the article to have a more political inclination, Mexico would be given a greater weight than Love; whereas if the goal was to write a romantic novel, Love would get a much higher probability than Mexico. Once the topics and their probabilities have been set, a topic is drawn at random according to that distribution, and then a word is drawn from that topic. This process of picking a topic and then a word from it is repeated until the system has finished "writing" the article.
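To make that generative story a little more tangible, here is a toy Python sketch of the process. The topics, words and probabilities below are completely made up for illustration:

```python
import random

# Purely illustrative topics: each topic is a probability distribution over words.
topics = {
    "Love":   {"heart": 0.4, "kiss": 0.3, "romance": 0.3},
    "Mexico": {"election": 0.5, "president": 0.3, "border": 0.2},
}

# Document-level topic weights: this "document" leans towards Mexico (politics).
topic_weights = {"Love": 0.2, "Mexico": 0.8}

def write_article(n_words):
    words = []
    for _ in range(n_words):
        # 1) pick a topic according to the document's topic distribution
        topic = random.choices(list(topic_weights), weights=list(topic_weights.values()))[0]
        # 2) pick a word from that topic's word distribution
        word_dist = topics[topic]
        words.append(random.choices(list(word_dist), weights=list(word_dist.values()))[0])
    return " ".join(words)

print(write_article(10))  # e.g. "election president heart election border ..."
```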
Besides creating documents automatically (hasta la vista, Literature students :P), Topic Modelling can also work in reverse: it can infer the set of topics responsible for generating a collection of documents.
We care about Topic Modelling because it can enhance search in large archives of text, and it also allows for better similarity measures: given two documents, exactly how similar are they?
Different algorithms exist for finding the thematic structure of a document. Today we will focus on one in particular, Latent Dirichlet Allocation (LDA), which is a "...generative probabilistic model for text corpora...".
The intuition behind LDA is that a document is composed of a series of different topics, and each topic is a probability distribution over words. Each document is a random mixture of corpus-wide topics, where each word of a document is drawn from one of these topics. LDA aims to infer how the documents are divided according to these topics, and what the topics are. The only information LDA has is the documents themselves.
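As a tiny preview of what that inference looks like in practice, here is a minimal sketch that fits LDA on a few toy documents. I am using scikit-learn here purely for illustration; the documents, the number of topics and the choice of library are all assumptions, and we will do the real coding in the next post:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Three tiny made-up documents mixing the two example themes from above
docs = [
    "love heart romance kiss love",
    "mexico election president mexico policy",
    "love romance mexico president",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # word counts per document

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)           # per-document topic proportions

# Show the highest-weight words for each inferred topic
words = vectorizer.get_feature_names_out()
for j, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[::-1][:3]]
    print(f"Topic {j}: {top}")
```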
In the following, we will use THE FORMAL notation of LDA (mathematical style!) to make things a bit clearer:
P(z) denotes the distribution over topics z for a particular document. P( w | z ) is the probability distribution over words w given topic z. LDA assumes each word wi in a document (where the index i refers to the ith word token) is generated by first sampling a topic from the document's topic distribution, and then choosing a word from that topic's word distribution. We write P( zi = j ) for the probability that the jth topic was sampled for the ith word token, and P( wi | zi = j ) for the probability of word wi under topic j.
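Putting the two pieces together (this step is implicit in the description above, written here in the same notation), the probability of observing the ith word token under a model with T topics is obtained by summing over all topics:

P( wi ) = Σ (over j = 1 … T) P( wi | zi = j ) P( zi = j )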
LDA assumes that the per-document topic proportions follow a Dirichlet distribution, i.e. the mixture weights θ are drawn from a Dirichlet prior. Each topic, in turn, is modelled as a multinomial distribution over words.
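To get a feel for what a Dirichlet prior produces, here is a tiny numpy sketch drawing topic proportions θ for a few hypothetical documents. The value of α and the number of documents are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dirichlet hyperparameter alpha for 2 topics (e.g. Love and Mexico);
# 0.5 is an arbitrary value chosen only for illustration.
alpha = [0.5, 0.5]

# Draw the topic mixture theta for 3 hypothetical documents
theta = rng.dirichlet(alpha, size=3)

print(theta)              # one row per document; each row is a topic mixture
print(theta.sum(axis=1))  # each row sums to 1
```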
Hopefully this brief overview will allow us to have some Python coding fun for our next post!