Wednesday, November 14, 2012

Another layman's explanation of: Expert Evolution in Online Social Networks

I was recently reading a very interesting paper titled "Evolution of Experts in Question Answering Communities" by Aditya Pal, Shuo Chang and Joseph Konstan, and thought I would share it and try to explain it in layman's terms.
A vast amount of work has been done on detecting experts in question answering communities. Typically this analysis uses either graph-based or feature-based methods. Graph-based methods analyze the link structure around a user in an online social network to find authoritative users. They look at things such as: how many other people is the user "friends" with? Feature-based methods, on the other hand, analyze the characteristics of the users: how many best answers does the user have? What language style does he/she use? And so on.
The work we are analyzing seeks to identify experts, but then does a temporal analysis to study how experts evolve in a community and how they influence the community's dynamics. The online community studied is Stack Overflow. To identify experts, the authors used two approaches. In one, they count the number of positive votes a user's answers and questions have received (a user gets a positive vote when his/her answer is helpful to the community, or when his/her question is interesting or relevant to someone in the community) and label the 10% of users with the highest number of votes as experts.
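To make the vote-based labeling concrete, here is a minimal Python sketch. The function name `label_experts`, the `votes` dictionary and its user names are made up for illustration, and rounding the 10% cutoff up is my own assumption, not something taken from the paper.

```python
import math

def label_experts(votes_by_user, top_fraction=0.10):
    """Return the set of users in the top `top_fraction` by positive votes."""
    # Rank users from most to least voted (ties keep their input order).
    ranked = sorted(votes_by_user, key=votes_by_user.get, reverse=True)
    # Assumption: round the cutoff up so that at least one user is labeled.
    n_experts = max(1, math.ceil(len(ranked) * top_fraction))
    return set(ranked[:n_experts])

# Made-up vote totals for ten hypothetical users.
votes = {
    "alice": 950, "bob": 40, "carol": 12, "dave": 7, "erin": 5,
    "frank": 3, "grace": 2, "heidi": 1, "ivan": 1, "judy": 0,
}
experts = label_experts(votes)
print(experts)
```

With ten users, the top 10% is a single user, so only the most voted user ends up labeled as an expert.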
To analyze how experts evolve and how a community can be influenced in time by the answers and social interactions of experts, the authors performed the following:
  1. The questions and answers of the community were divided into bi-weekly buckets, where the first bucket holds the questions and answers of the first two weeks of the Stack Overflow data they had collected, the second bucket holds the questions and answers created in the third and fourth weeks, and so on.
  2. For each user it is then possible to calculate, per bucket (that is, for every two weeks), the number of questions, answers and best answers he/she has given.
  3. For each user, a relative time series is computed for each data type he/she has generated (questions, answers and best answers). This relative time series is constructed so that the contribution of a user can be valued relative to the contribution of other users. To do this, the mean and standard deviation of each data type are calculated in each time bucket. (Recall that a bucket holds the number of answers, questions and best answers that different users have given in that particular time period, so for each type of variable we can calculate the mean and standard deviation.) It is then possible to normalize a data point in the time bucket as:
    Z_b = (X_b - Mean_b) / StandardDeviation_b

    where X_b is the number of answers a particular user has generated in time bucket b, Mean_b and StandardDeviation_b are the mean and standard deviation of the number of answers given by all users in time bucket b, and Z_b is the user's normalized (relative) value for that bucket.
  4. After this step, each user is associated with three relative time series: one for their answers, one for their questions and one for their best answers. From the answers and best answers time series, a point-wise ratio between best answers and answers is then calculated. This point-wise ratio estimates the probability of a user's answer being selected as the best answer.
    The following figure shows an interesting plot where we see how the likelihood of an expert and an average user receiving the votes for best answer changes over time.

What we notice is that the likelihood of having an answer marked as best increases significantly over time for experts in comparison to average users. Initially this likelihood is the same for both experts and average users. The authors believe this occurs because when a new person who happens to be an expert joins the community, other users are wary of marking a newcomer's answers as the best. But as the expert gains reputation, the rest of the community members become more and more comfortable marking his/her answers as the best.
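Steps 1 to 4 above can be sketched in a few lines of Python. This is only my reading of the procedure, not the authors' code: all names and the tiny data set are made up, I use the population standard deviation, I count a user who posted nothing in a bucket as zero, and I take the best answer ratio over raw counts so that it stays between 0 and 1.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev

BUCKET_DAYS = 14  # bi-weekly buckets

def count_per_bucket(events, start):
    """events: iterable of (user, timestamp). Returns {bucket: {user: count}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for user, ts in events:
        counts[(ts - start).days // BUCKET_DAYS][user] += 1
    return counts

def relative_series(counts, users):
    """Z-score each user's count against all users in the same bucket."""
    series = {}
    for b, per_user in sorted(counts.items()):
        values = [per_user.get(u, 0) for u in users]
        mu, sigma = mean(values), pstdev(values)
        # Assumption: if everyone has the same count, the relative value is 0.
        series[b] = {u: (per_user.get(u, 0) - mu) / sigma if sigma else 0.0
                     for u in users}
    return series

def best_answer_ratio(answer_counts, best_counts, users):
    """Point-wise best/answers ratio; None when a user gave no answers."""
    ratio = {}
    for b, per_user in sorted(answer_counts.items()):
        best_in_b = best_counts.get(b, {})
        ratio[b] = {u: best_in_b.get(u, 0) / per_user[u]
                    if per_user.get(u, 0) else None
                    for u in users}
    return ratio

# Made-up data: two hypothetical users over two buckets.
start = datetime(2012, 1, 1)
users = ["ann", "bob"]
answers = [("ann", datetime(2012, 1, 2)), ("ann", datetime(2012, 1, 5)),
           ("ann", datetime(2012, 1, 9)), ("bob", datetime(2012, 1, 3)),
           ("ann", datetime(2012, 1, 20)), ("bob", datetime(2012, 1, 21))]
best = [("ann", datetime(2012, 1, 2)), ("ann", datetime(2012, 1, 5))]

answer_counts = count_per_bucket(answers, start)
best_counts = count_per_bucket(best, start)
rel = relative_series(answer_counts, users)
ratio = best_answer_ratio(answer_counts, best_counts, users)
print(rel)
print(ratio)
```

In the first bucket ann gave three answers and bob one, so ann sits one standard deviation above the bucket mean and bob one below. Two of ann's three answers were marked best, giving her a ratio of 2/3 for that bucket.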
The next interesting thing the authors analyzed was the likelihood of a user asking a question. It was seen that, in general, expert users do not ask questions: the overall question-to-answer ratio among experts was 1/15! To compare the time series of questions and answers, the authors computed an aggregate time series of the number of questions and of the number of answers of experts, and then normalized each time series so that it has mean = 0 and standard deviation = 1. From these two resulting series (questions and answers) a cross-covariance was computed. The cross-covariance tells us how similar two signals are as a function of a time lag applied to one of them. The authors found that the optimal time lag was zero for the majority of expert users, which indicates that the likelihood of an expert asking a question and the likelihood of an expert answering one vary simultaneously.
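For readers who want to see what "cross-covariance as a function of time lag" looks like in practice, here is a small Python sketch. The function names and the two example series are made up, the estimator (dividing by the full series length at every lag) is just one common choice, and the series are assumed not to be constant. The paper may compute it differently.

```python
from statistics import mean, pstdev

def standardize(series):
    """Rescale a series to mean 0 and standard deviation 1."""
    mu, sigma = mean(series), pstdev(series)  # assumes sigma != 0
    return [(x - mu) / sigma for x in series]

def cross_covariance(x, y, lag):
    """Average of x[t] * y[t + lag] over the samples where both exist."""
    n = len(x)
    total = sum(x[t] * y[t + lag] for t in range(n) if 0 <= t + lag < n)
    return total / n

def best_lag(x, y, max_lag):
    """Lag in [-max_lag, max_lag] at which the standardized series agree most."""
    zx, zy = standardize(x), standardize(y)
    return max(range(-max_lag, max_lag + 1),
               key=lambda k: cross_covariance(zx, zy, k))

# Made-up bi-weekly counts for one hypothetical expert (about 15 answers per question).
questions = [1, 2, 4, 3, 1, 2, 5, 3]
answers = [14, 31, 60, 44, 16, 29, 76, 45]
lag = best_lag(questions, answers, max_lag=3)
print(lag)
```

Because the made-up answer counts rise and fall in the same buckets as the question counts, the best lag here is zero, which is the situation the authors report for most experts.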