This spring, I volunteered to teach a lecture in a new Berkeley course called “Analyzing Big Data With Twitter,” developed jointly by Twitter and Berkeley’s School of Information. I had recently done my master’s thesis work on predicting the spread of topics on Twitter by looking at phenomena on a macroscopic scale, such as the time series of tweet activity. But I was curious to see what was happening on a microscopic scale: how do topics spread from person to person? I decided to dig into the data and find out.
First, I needed a way to track when a topic spreads from one person to another. I considered using retweets, but quickly realized there are several drawbacks. For one, a retweet corresponds to a particular tweet rather than a topic. There could be many tweets about the same topic, so tracking the spread of a single tweet just wouldn’t do. Even tracking retweets for a collection of tweets is inadequate, as many of the tweets about the topic may not be retweets at all. Retweets alone wouldn’t reveal how the topic really spread through the network: they don’t account for all tweets, and they are only one way for a topic to pass from one person to another.
So how does a topic spread? Often, you look at your timeline and see tweets about the topic from people you follow. This may influence you to create an original tweet commenting on the same topic.
I wanted to capture these events, which are not explicitly recorded the way that retweets are.
To do that, one has to eliminate all other ways that a person can be influenced to write a tweet about the topic. For example, if the topic was in the news or on the trending topics list, it is not clear whether someone was compelled to tweet about the topic by someone they follow or by an exogenous source.
To attempt to eliminate exogenous sources, I used only “hashtag meme” topics, the type that originate and spread within Twitter: things like #ThingsYouSayToYourBestFriend, rather than events like #election2012. In addition, I tracked topics only up until they started trending.
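As a rough illustration of the cutoff described above (not the code behind the analysis), here is a minimal Python sketch. It assumes the tweets for one hashtag are available as (user, time) records and that the moment the hashtag started trending is known. The names `Tweet`, `first_mentions_before_trending`, and `trending_at` are hypothetical, and keeping only each user's first tweet is my own simplifying assumption.

```python
from typing import NamedTuple


class Tweet(NamedTuple):
    user: str
    time: float  # e.g. seconds since the first tweet in the dataset


def first_mentions_before_trending(tweets, trending_at):
    """Return {user: first Tweet} using only tweets posted strictly
    before the topic started trending (assumed known as trending_at)."""
    first = {}
    for tweet in sorted(tweets, key=lambda t: t.time):
        if tweet.time >= trending_at:
            break  # everything from here on could be driven by the trend list
        first.setdefault(tweet.user, tweet)  # keep only the earliest tweet
    return first
```

Anything posted at or after `trending_at` is dropped, since from that point on the trending topics list itself is a possible exogenous source.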
I generated some videos of information spreading for various topics. To simplify the spreading graph, I kept only the most recent parent for each node. The colors represent components of the resulting graph.
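To make the “most recent parent” simplification concrete, here is a minimal Python sketch under some assumptions of mine: a node's candidate parents are the people they follow who tweeted about the topic earlier, `first_tweet_time` maps each user to the time of their first tweet on the topic, and `follows` maps each user to the set of accounts they follow. All names are hypothetical; this is one plausible reading of the approach, not the original implementation.

```python
from collections import defaultdict


def most_recent_parent(first_tweet_time, follows):
    """Return {child: parent}, where parent is the followed user who
    tweeted about the topic most recently before the child did."""
    parent = {}
    for user, t in first_tweet_time.items():
        earlier = [f for f in follows.get(user, ())
                   if f in first_tweet_time and first_tweet_time[f] < t]
        if earlier:
            parent[user] = max(earlier, key=first_tweet_time.get)
    return parent


def components(users, parent):
    """Group users by the root of their tree. With one parent per node,
    and parents always earlier in time, the graph is a forest."""
    groups = defaultdict(list)
    for user in users:
        root = user
        while root in parent:  # terminates: each parent is strictly earlier
            root = parent[root]
        groups[root].append(user)
    return dict(groups)
```

Because each node keeps at most one parent, every component is a tree rooted at a user with no earlier-tweeting account in the accounts they follow, which is what makes the components easy to color separately.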
In this first video, the green nodes form the largest component. The shade of green represents the distance from the first-ever green node. The red nodes are all other components. The pink nodes trace out the longest path through the green component, which is 53 hops long.
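The shading and the longest path can be sketched in Python as follows, assuming the single-parent graph is stored as a hypothetical `parent` dictionary mapping each node to its parent. I also assume that “longest path” means the deepest chain from the component's first node, which may differ from how the video was actually produced.

```python
def hop_depths(parent):
    """Return {node: hops from the root of its tree}. Assumes no cycles."""
    depth = {}
    for node in parent:
        chain = []
        cur = node
        # Walk up until reaching a node with a known depth, or a root.
        while cur not in depth and cur in parent:
            chain.append(cur)
            cur = parent[cur]
        if cur not in depth:
            depth[cur] = 0  # cur has no parent, so it is a root
        base = depth[cur]
        # Unwind the chain: nodes closer to the root get smaller depths.
        for offset, n in enumerate(reversed(chain), start=1):
            depth[n] = base + offset
    return depth


def longest_root_path(parent):
    """Return the nodes on the deepest root-to-leaf chain, root first."""
    depth = hop_depths(parent)
    if not depth:
        return []
    node = max(depth, key=depth.get)
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    path.reverse()
    return path
```

The number of hops on the returned path is `len(path) - 1`, and the values from `hop_depths` could drive the shade of each node.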
In this video, I kept the two most recent parents for each node.
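Keeping more than one parent per node is a small generalization. The Python sketch below assumes, as my own reading, that a node's candidate parents are the people they follow who tweeted about the topic earlier; `first_tweet_time`, `follows`, and `k_most_recent_parents` are hypothetical names.

```python
import heapq


def k_most_recent_parents(first_tweet_time, follows, k=2):
    """Return {child: [parents]}, listing up to k followed users who
    tweeted about the topic before the child, most recent first."""
    parents = {}
    for user, t in first_tweet_time.items():
        earlier = [f for f in follows.get(user, ())
                   if f in first_tweet_time and first_tweet_time[f] < t]
        if earlier:
            parents[user] = heapq.nlargest(k, earlier,
                                           key=first_tweet_time.get)
    return parents
```

With `k=2` the result is no longer a forest, since a node can hang off two earlier nodes, so components tend to merge and the picture gets denser.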
This is the first step in a work in progress. Using this approach, I’d like to understand more about how people are influenced to spread a topic.
For now, you can see a blog post about the lecture and the video (also above) here. I talk about some theory behind information cascades in the first half, and some more detailed experimental results and difficulties in the second half.