When the Esri DC Dev Center team first found out about the reported explosions at the finish line of the Boston Marathon, we immediately tuned in to Twitter to capture live discussions so that we could understand the series of events. With over 440,000 tweets captured in under 24 hours, it is difficult to piece together by hand how events unfolded over that time period.
To make sense of this volume, we turned to Latent Dirichlet Allocation (LDA) to extract structure from these tweets. LDA is a generative model in which each document is a mixture of topics and each topic has its own distribution over the vocabulary. Since these topics and their vocabulary distributions are not directly observable, we use statistical inference to estimate them from a corpus of text documents. As a result, we obtain a distribution over topics for each document along with a distribution over vocabulary for each topic, which together give insight into how those documents are structured.
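To make the inference step concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling in plain Python. This is an illustration only: the post does not say which inference method or library we used, and the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy tokens are all assumptions made for this example.

```python
import random

def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling.

    Returns (theta, phi, vocab): theta[d][k] is the topic distribution of
    document d, phi[k][v] is the word distribution of topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]        # tokens of doc d in topic k
    topic_word = [[0] * V for _ in range(num_topics)]   # count of word v in topic k
    topic_total = [0] * num_topics                      # tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Turn smoothed counts into probability distributions.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + V * beta) for c in topic_word[k]]
        for k in range(num_topics)
    ]
    return theta, phi, vocab

# Made-up toy "tweets", already tokenized; not taken from our dataset.
toy_docs = [
    ["explosion", "finish", "line", "marathon"],
    ["explosion", "reported", "marathon"],
    ["prayers", "victims", "donate"],
    ["donate", "victims", "prayers", "boston"],
]
theta, phi, vocab = lda_gibbs(toy_docs, num_topics=2)
```

Each row of `theta` is one document's mixture over topics, and each row of `phi` is one topic's distribution over the vocabulary. On a corpus of hundreds of thousands of tweets, a pure-Python sampler like this would be far too slow, so an optimized LDA implementation would be the practical choice.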
LDA has been used to solve a variety of problems, but the question we’re interested in is: what happens when we perform LDA on a large-scale discussion that evolves rapidly over time? We observe that topics exhibit an ordering over time, suggesting that the topics extracted by LDA correspond to topics of discussion surrounding a sequence of events:
Upon inspection of the vocabulary distribution for each topic, we can reconstruct the series of events that drove discussions. The earliest discussions centered on initial reactions and observations (“Two explosions reported near Boston Marathon finish line”). The discussion then transitions to people sympathizing with the victims (“How can I donate?” and “Our prayers go out to the victims”). The fourth and fifth topics focus on the discovery of additional explosives that were dismantled, along with increased participation from news and media organizations. Finally, the last two topics focus on Google Person Finder and Obama’s speech regarding the events. The top vocabulary associated with each topic is given as:
|Topic 1||Topic 2||Topic 3||Topic 4|
|Topic 5||Topic 6||Topic 7||Topic 8|
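One simple way to quantify the temporal ordering of topics is to compute, for each topic, the average tweet timestamp weighted by how strongly each tweet belongs to that topic, and then sort topics by that average. The sketch below is an assumption about how such an ordering could be measured, not a description of how we produced our results; `order_topics_by_time` and the toy numbers are hypothetical.

```python
def order_topics_by_time(timestamps, doc_topic):
    """Sort topics by their weighted mean timestamp.

    timestamps[d] is the time of document d (for example, minutes since the
    first tweet) and doc_topic[d][k] is the weight of topic k in document d.
    Returns (order, means), where order lists topic indices from earliest
    to latest and means[k] is the weighted mean timestamp of topic k.
    """
    num_topics = len(doc_topic[0])
    means = []
    for k in range(num_topics):
        total_weight = sum(row[k] for row in doc_topic)
        if total_weight == 0:
            # A topic that no document uses has no meaningful time; sort it last.
            means.append(float("inf"))
        else:
            weighted_sum = sum(t * row[k] for t, row in zip(timestamps, doc_topic))
            means.append(weighted_sum / total_weight)
    order = sorted(range(num_topics), key=lambda k: means[k])
    return order, means

# Made-up example: four tweets, two topics, times in minutes.
timestamps = [0, 10, 20, 30]
doc_topic = [
    [0.1, 0.9],
    [0.2, 0.8],
    [0.8, 0.2],
    [0.9, 0.1],
]
order, means = order_topics_by_time(timestamps, doc_topic)
```

In this toy input the second topic dominates the early tweets and the first topic dominates the later ones, so the second topic comes first in `order`. A weighted mean is only a summary; plotting each topic's weight over time shows overlap between topics that a single number hides.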
In short, we demonstrated that the topics produced by topic modeling can exhibit a temporal ordering that helps reconstruct the story of a complex series of events. Moving forward, there are several questions that we’re interested in answering:
- How does the evolution of discussion differ depending on geography?
- What additional insight can we gain by using correlated topic models to extract significant changes in these discussions?
- Are there other topic model extensions that can be modified to take spatio-temporal aspects into account?
What are the next steps for the Esri Dev Center team? Well, stay tuned as we extend these methodologies to look at how discussions evolve over geospatial regions!
Update (Friday, 3:50PM):
After hearing about the series of events that occurred on the MIT campus last night, we collected tweets containing keywords associated with “MIT” and “shooting”. Using the same methodology, we see that the extracted topics again exhibit an intuitive ordering over time.
The corresponding vocabulary distribution for each topic is given by:
|Topic 1||Topic 2||Topic 3||Topic 4||Topic 5||Topic 6|