Wednesday, November 14, 2012

Another layman's explanation of: Expert Evolution in Online Social Networks

I was recently reading a very interesting paper titled Evolution of Experts in Question Answering Communities, by Aditya Pal, Shuo Chang and Joseph Konstan, and I thought I would share it and explain it in layman's terms.
There has been a vast amount of work on detecting experts in question answering communities. Typically this analysis is done either through graph-based methods or feature-based methods. Graph-based methods analyze the link structure of a user in an online social network to find authoritative users; they look at things such as how many other people the user is "friends" with. Feature-based methods, on the other hand, analyze the characteristics of the users: how many best answers does the user have? What language style does he or she use? And so on.
The work we are analyzing seeks to identify experts, but then performs a temporal analysis to study how experts evolve in a community and how they influence the community's dynamics. The online community studied is Stackoverflow. To identify experts, the authors counted the number of positive votes a user's answers and questions have received (a user gets a positive vote when his/her answer is helpful to the community, or when his/her question is interesting or relevant to someone in the community) and labeled the top 10% of users with the highest number of votes as experts.
To analyze how experts evolve and how a community can be influenced in time by the answers and social interactions of experts, the authors performed the following:
  1. The questions and answers of the community were divided into bi-weekly buckets, where the first bucket holds the questions and answers of the first two weeks of the Stackoverflow data they had collected, the second bucket the questions and answers created in weeks 3-4, and so on.
  2. For each user it is then possible to calculate, per bucket (i.e., per every two weeks), the number of questions, answers and best answers he/she has given.
  3. For each user, a relative time series is computed for each data type he/she has generated (questions, answers and best answers). This relative time series is constructed so that the contribution of a user can be valued relative to the contributions of other users. To do this, in each of the time buckets the mean and standard deviation for each data type are calculated. (Recall that a bucket holds the number of answers, questions and best answers different users have given in that particular time period, so for each type of variable we can calculate a mean and standard deviation.) It is then possible to normalize a data point in the time bucket as:
    Z_b = (X_b - Mean_b) / StdDev_b

    where X_b represents the number of answers (or questions, or best answers) a particular user has generated in time bucket b, Mean_b represents the mean number of answers different users have given in time bucket b, and StdDev_b the corresponding standard deviation.
  4. After this step, each user is associated with three relative time series: the time series of their answers, questions and best answers. From the answers and best-answers time series, a pointwise ratio between best answers and answers is then calculated. This pointwise ratio indicates the probability of a user's answers being selected as the best answer.
    The following figure shows an interesting plot of how the likelihood of an expert and of an average user receiving best-answer votes changes over time.

What we notice is that the likelihood of receiving the best answer increases significantly over time for experts in comparison to average users. Initially the likelihood of receiving a best answer is the same for both experts and average users. The authors believe that this occurs, because when a new person, who happens to be an expert, joins the community, other users are wary of marking the answers of newcomers as the best. But as the expert gains reputation, the rest of the community members become more and more comfortable in marking their answers as the best.
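The bucketing and normalization procedure above can be sketched in a few lines of NumPy. The data here is synthetic and the array names are made up for illustration; the paper of course works with real Stackoverflow activity counts.

```python
import numpy as np

# Synthetic stand-in data: rows = users, columns = bi-weekly buckets.
# answers[u, b] = number of answers user u posted in bucket b;
# best[u, b]   = how many of those answers were chosen as the best answer.
rng = np.random.default_rng(0)
answers = rng.poisson(lam=5, size=(100, 26)).astype(float)   # 100 users, 1 year
best = np.minimum(answers, rng.poisson(lam=1, size=(100, 26))).astype(float)

# Step 3: per-bucket z-score normalization, so each user's activity is
# measured relative to everyone else's activity in the same two weeks.
def relative_series(counts):
    mean_b = counts.mean(axis=0)                 # Mean_b for every bucket
    std_b = counts.std(axis=0)                   # StdDev_b for every bucket
    return (counts - mean_b) / np.where(std_b > 0, std_b, 1.0)

rel_answers = relative_series(answers)

# Step 4: pointwise best-answer ratio, i.e. the chance that an answer given
# in bucket b ends up marked as the best answer.
best_ratio = np.divide(best, answers, out=np.zeros_like(best), where=answers > 0)
```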
The next interesting thing the authors analyzed was the likelihood of a user asking a question. It was seen that, in general, expert users do not ask questions: the overall question-to-answer ratio among experts was 1/15 !!! To compare the time series of questions and answers, the authors computed an aggregate time series of the number of questions and answers of experts, and then normalized each time series to have mean 0 and standard deviation 1. From these two resulting series (questions and answers) a cross-covariance was computed. The cross-covariance tells us how similar two signals are as a function of a time lag applied to one of them. The authors found that the optimal time lag was zero for the majority of expert users, which indicates that the likelihood of an expert asking a question and the likelihood of responding to one vary simultaneously.
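The cross-covariance analysis can be illustrated as follows, with synthetic series standing in for the experts' aggregate activity; the lag with the highest cross-covariance between the two normalized series is the "optimal" lag mentioned above.

```python
import numpy as np

# Synthetic aggregate expert activity per bi-weekly bucket; the two series
# are driven by the same underlying activity, so they should co-vary at lag 0.
rng = np.random.default_rng(1)
activity = rng.poisson(lam=10, size=52).astype(float)
questions = activity + rng.normal(0, 1, size=52)
answers = 15 * activity + rng.normal(0, 1, size=52)

# Normalize each series to mean 0 and standard deviation 1, as the authors do.
def zscore(x):
    return (x - x.mean()) / x.std()

q, a = zscore(questions), zscore(answers)

# Cross-covariance of the normalized series at every possible lag; the lag
# with the highest value is the shift that best aligns the two signals.
# A peak at lag 0 means the signals rise and fall together.
xcov = np.correlate(q, a, mode="full") / len(q)
lags = np.arange(-(len(q) - 1), len(q))
best_lag = lags[np.argmax(xcov)]
```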

Friday, July 27, 2012

Layman's Explanation of Online LDA

Topic Modeling!
LDA stands for Latent Dirichlet Allocation, and it is a type of topic modeling algorithm. The purpose of LDA is to learn the representation of a fixed number of topics, and given this number of topics learn the topic distribution that each document in a collection of documents has. For example, if we were given the following sentences:
A: I spent the day at the beach tanning.
B: I ate Mexican tacos and guacamole.
C: I love tanning on Mexican beaches while eating quesadillas and tacos under the sun.

LDA might say something like:
Sentence A is 100% about Topic 1
Sentence B is 100% about Topic 2
Sentence C is 30% Topic 1, 70% Topic 2

where LDA also discovers that:
Topic 1: 30% beach, 15% tanning, 10% sun, … (where we notice that topic 1 represents things related to the beach)
Topic 2: 40% Mexican, 10% Tacos, 10% Guacamole, 10% Quesadilla , … (where we notice that topic 2 represents things related to Mexico.)

LDA learns how topics and documents are represented in the following form:

1) First, the number of topics to discover is selected. (Similar to when we specify the number of clusters we wish our clustering algorithm to consider.)

2) Once the number of topics is selected, LDA will go through each of the words in each of the documents, and it will randomly assign each word to one of the K topics. After this step we will have topic representations (how the words are distributed in each topic) and documents represented in terms of topics (just like the above example, where we said Sentence or Document C is 30% about Topic 1 and 70% about Topic 2). Now, since the assignment of words to topics was done at random, this initial representation is of course not very accurate. To improve it, LDA will analyze, per document, what percentage of words within the document were assigned to a particular topic; and, for each word, what percentage of times that word has been assigned to a particular topic over all the documents. LDA will therefore be calculating:

1) p(topic t | document d) = percentage of words in a document d that are currently assigned to topic t.
2) p(word w | topic t) = percentage of times the word w was assigned to topic t over all documents.

LDA will decide to move a word w from topic A to topic B when:
p(topic A | document d) * p(word w | topic A) < p(topic B | document d) * p(word w | topic B)
After a while, LDA "converges" to a more optimal state, where the topic representations, and the documents represented in terms of these topics, are reasonable.
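As a toy illustration of this reassignment loop, here is a minimal sketch in Python. It uses hard reassignment (always moving each word to its highest-scoring topic, exactly the rule above) rather than the probabilistic sampling a real collapsed Gibbs sampler performs, and it omits the Dirichlet hyperparameters; the documents and names are made up.

```python
import random
from collections import defaultdict

# Toy corpus and number of topics K chosen up front (step 1).
docs = [["beach", "tanning", "sun"],
        ["tacos", "guacamole", "mexican"],
        ["tanning", "beach", "tacos"]]
K = 2

random.seed(0)
# Step 2: randomly assign every word occurrence to one of the K topics.
assign = [[random.randrange(K) for _ in doc] for doc in docs]

for _ in range(50):  # sweep until the assignments settle down
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            doc_topic = defaultdict(int)    # topic counts in this document
            word_topic = defaultdict(int)   # topic counts for word w, corpus-wide
            topic_total = defaultdict(int)  # total words per topic, corpus-wide
            for d2, doc2 in enumerate(docs):
                for i2, w2 in enumerate(doc2):
                    if d2 == d and i2 == i:
                        continue            # exclude the word being reassigned
                    t = assign[d2][i2]
                    if d2 == d:
                        doc_topic[t] += 1
                    if w2 == w:
                        word_topic[t] += 1
                    topic_total[t] += 1
            # score(t) ~ p(topic t | document d) * p(word w | topic t),
            # with add-one smoothing so no topic scores exactly zero.
            scores = [(doc_topic[t] + 1) * (word_topic[t] + 1) / (topic_total[t] + 1)
                      for t in range(K)]
            assign[d][i] = max(range(K), key=lambda t: scores[t])
```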

Now that we have understood the underlying principle of how LDA works, we will discuss online LDA.
The problem with LDA is that the posterior probability we need to calculate in order to reassign words to topics is very difficult to compute. Therefore researchers use approximation techniques to find what this posterior probability is.
Generally, algorithms for approximating this posterior probability are based either on sampling approaches or on optimization approaches. Sampling approaches are typically based on Markov Chain Monte Carlo (MCMC) sampling. MCMC approximates the posterior probability distribution by randomly drawing values from the complex distribution of interest. MCMC methods are named that way because previously sampled values (previous states) affect the generation of the next random sample (in other words, the transition probabilities between sample values are a function of the most recent sample value).
Optimization approaches, on the other hand, are typically based on variational inference. Variational inference can be seen as a deterministic alternative to MCMC: it replaces MCMC's random sampling with optimization, seeking to optimize a simplified parametric distribution so that it is close to the posterior. The following picture shows how variational inference defines a subfamily of distributions; the goal is to find the point in the subfamily that is closest to P(z|x). Closeness is measured using the Kullback-Leibler divergence, a non-symmetric measure of the difference between two probability distributions P and Q.
Variational inference has been shown to be about as accurate as MCMC, but FASTER, which has made it very popular for large datasets.
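Since the Kullback-Leibler divergence keeps coming up, here is a tiny illustration of it for two made-up discrete distributions, including its non-symmetry:

```python
import numpy as np

# Kullback-Leibler divergence between two discrete distributions (toy values).
# Note that KL(P||Q) != KL(Q||P): the measure is not symmetric.
def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

P = [0.7, 0.2, 0.1]
Q = [0.4, 0.4, 0.2]
```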

Now, despite the benefits variational inference brings, large-scale data analysis can still be difficult. What many groups have done is to use batch variational inference, where the algorithm constantly iterates between analyzing every observation and updating dataset-wide variational parameters; on really big datasets each iteration becomes very costly and impractical... and this is where Online LDA comes to the rescue!
Online LDA is based on online stochastic optimization, which has shown to produce good parameter estimates dramatically faster than batch algorithms on large datasets.
Online stochastic optimization in LDA is about finding a balance between exploiting the knowledge gained from the topic assignments so far and exploring new topic assignments. Note: Images from Stanford University and Princeton University.
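To make that balance concrete, here is a hedged sketch of the kind of stochastic update online LDA performs (modeled loosely on Hoffman et al.'s online variational Bayes); the variable names and the fake mini-batch statistics are my own, for illustration only.

```python
import numpy as np

# Sketch of the stochastic update at the heart of online LDA: instead of
# sweeping the whole corpus every iteration, we process one mini-batch, form
# a noisy estimate of the topic-word statistics, and blend it into the
# running parameters with a decaying step size rho_t.
def step_size(t, tau0=1.0, kappa=0.7):
    # kappa in (0.5, 1] is what guarantees convergence of the updates.
    return (tau0 + t) ** -kappa

rng = np.random.default_rng(0)
K, V = 2, 5                        # number of topics, vocabulary size
lam = np.ones((K, V))              # variational topic-word parameters

for t in range(1, 101):
    # Pretend these are the sufficient statistics estimated from a mini-batch.
    batch_stats = rng.random((K, V))
    rho = step_size(t)
    lam = (1 - rho) * lam + rho * batch_stats   # exploit the old, explore the new
```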

Sunday, April 01, 2012

Social Signals and Machine interpretation


Have you found yourself alone in the dark, working on your computer, feeling a tad depressed and wishing your computer could respond in some way to your mental and emotional state? You know, maybe your computer could send you some funny comics to lighten your day, or perhaps send you an inspiring quote to keep you in the fight for life... These ideas might sound a bit far-fetched given our current reality with our machines. But there is actually an active research community whose goal is precisely for machines to understand human emotion and social interaction.
Having machines understand emotion and social interactions is beneficial for:

  • Social Scientists, Psychologists and Doctors: they are all very interested in observing and quantifying human behaviour. For psychologists and doctors this could help in diagnosing and rehabilitating their patients.
  • Intelligent Algorithms: by understanding their user better, they could respond to semantic queries and retrieve more relevant material. As humans we are very familiar with interpreting different meanings in different contexts; think, for example, of that popular 60's song: "it's the same old song, but with a different meaning since you've been gone".
  • Ambient intelligence: Environments can become more responsive to the social context. Perhaps the room detects that the crowd at a small reunion is bored, and so the room could maybe start playing whimsical animal figures on the walls to entertain the audience.

Overall having machines being able to interpret human emotion can improve Human Computer Interaction, as it increases the computer's sensitivity to the user's mental and emotional state [5].

Now that we understand the benefits of machines that can read our emotions better, the question is: how can this be enabled?

Recent investigations focus on something called SOCIAL SIGNALS.

But what is exactly a social signal?
- A social signal (according to Poggi and D'Errico) is a communicative or informative signal that conveys information about social actions, social interactions, social emotions and attitudes.

Where the heck does this idea of "Social Signal" come from?
The term "social signal" was inspired by various psychology studies that analysed how non-verbal behaviour relates to social interactions. Psychologists were studying things such as:

  • How does non-verbal behaviour affect the formation of impressions? For example, apparently when you smile a lot and move rapidly, people take this as a sign that you are an extrovert.
  • How does non-verbal behaviour reinforce the nature of a relationship? Apparently men tend to lean forward more and gaze toward the person they are talking with when that person happens to be a woman, and this behaviour is even more pronounced during intimate conversations than during the general boring interpersonal water-cooler chit chat.
  • Can facial movements be mapped to emotional signals and conversational signals? This is an area greatly studied in deception detection, as certain muscles are expected to move when someone is angry, happy, sad, etc. A person might therefore involuntarily move those muscles and show their true feelings, or not move them, and thereby give clues to the fact that they are being deceitful (see [4]).

Now the big question is: O.K., So how do social signals help machines understand human emotion?
In 2007, Alex Pentland, a professor and researcher at MIT's Media Lab, introduced the notion of "social signal processing", which is about applying traditional signal processing techniques to social signals and using this analysis to predict human social behaviour. For example, his group created a machine that was able to autonomously predict the outcome of a negotiation or of a speed date within its very first minute (see [2] and [3]).

The main goal of social signal processing is to enable the analysis of human behaviour by computers. For this, advanced pattern recognition techniques are used to automatically interpret complex human behavioural patterns. In the coming days, we will talk more about the techniques used to interpret human behaviour. Stay tuned! n_n

1) I. Poggi and F. D'Errico, "Social signals: A psychological perspective," Springer Verlag's Advances in Pattern Recognition series, 2011, pp. 185-225.
2) A. Pentland, "Social signal processing," IEEE Signal Processing Magazine, 24(4), 108-111, 2007.
3) J. Curhan and A. Pentland, "Thin slices of negotiation: Predicting outcomes from conversational dynamics within the first five minutes," Journal of Applied Psychology, 92, 802-811, 2007.
4) P. Ekman and W. V. Friesen, "Nonverbal leakage and clues to deception," Psychiatry, 32, 88-106, 1969.
5) A. A. Salah, M. Pantic, and A. Vinciarelli, "Recent Developments in Social Signal Processing," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, 2011, pp. 380-385.