When the Esri DC Dev Center team first learned of the reported explosions at the finish line of the Boston Marathon, we immediately tuned in to Twitter to capture live discussions so that we could understand the series of events. With over 440,000 tweets captured in under 24 hours, piecing together by hand how events unfolded over that period is impractical.

To address this, we turned to Latent Dirichlet Allocation (LDA) to extract structure from these tweets. LDA is a generative model in which each document is a mixture of topics and each topic has its own distribution over the vocabulary. Since these topics and their vocabulary distributions are not directly observable, we use statistical inference to estimate them from a corpus of text documents. As a result, we obtain a distribution over topics for each document, along with a distribution over vocabulary for each topic, which together give insight into how those documents are structured.
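The post does not say which inference method or library was used, so the following is only a minimal sketch of one common approach, collapsed Gibbs sampling, written in plain Python. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count, and the toy corpus are all illustrative assumptions, not details from the original analysis.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns (doc_topic, topic_word, vocab): a topic distribution per document,
    a word distribution per topic, and the sorted vocabulary list.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V, K = len(vocab), num_topics

    ndk = [[0] * K for _ in docs]        # topic counts per document
    nkw = [[0] * V for _ in range(K)]    # word counts per topic
    nk = [0] * K                         # total tokens assigned to each topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(K)
            zs.append(k)
            ndk[d][k] += 1
            nkw[k][word_id[w]] += 1
            nk[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                ndk[d][k] -= 1
                nkw[k][v] -= 1
                nk[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][v] + beta) / (nk[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                assignments[d][i] = k
                ndk[d][k] += 1
                nkw[k][v] += 1
                nk[k] += 1

    # Smoothed estimates of the two families of distributions.
    doc_topic = [
        [(ndk[d][t] + alpha) / (len(doc) + K * alpha) for t in range(K)]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        [(nkw[t][v] + beta) / (nk[t] + V * beta) for v in range(V)]
        for t in range(K)
    ]
    return doc_topic, topic_word, vocab

# Toy corpus (made up for illustration): each tweet is a list of tokens.
toy_docs = [
    ["explosions", "reported", "boston", "marathon", "finish", "line"],
    ["prayers", "victims", "donate", "boston"],
    ["two", "explosions", "near", "finish", "line"],
    ["donate", "victims", "prayers", "prayforboston"],
]
doc_topic, topic_word, vocab = lda_gibbs(toy_docs, num_topics=2)
```

Each row of `doc_topic` is the per-document distribution over topics and each row of `topic_word` is the per-topic distribution over vocabulary, the two outputs described above. A corpus of hundreds of thousands of tweets would call for an optimized library rather than this loop.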

LDA has been used before to solve a variety of problems, but the question we’re interested in here is: what happens when we perform LDA on a large-scale discussion that evolves rapidly over time? We observe that topics exhibit an ordering over time, suggesting that the topics extracted by LDA correspond to topics of discussion surrounding a sequence of events.
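The post does not describe how the ordering was measured. One simple way to look for it, sketched here under assumptions, is to average the per-tweet topic distributions within fixed time bins and then sort topics by the bin in which each one peaks. The function names, the use of timestamps in seconds, and the one-hour bin width are all hypothetical choices.

```python
def topic_share_over_time(timestamps, doc_topic, bin_seconds=3600):
    """Average the topic distributions of the documents in each time bin.

    timestamps: one numeric timestamp (in seconds) per document.
    doc_topic:  one topic distribution (list of floats) per document.
    Returns {bin_index: [mean share of topic 0, topic 1, ...]} in time order.
    """
    start = min(timestamps)
    num_topics = len(doc_topic[0])
    sums, counts = {}, {}
    for ts, theta in zip(timestamps, doc_topic):
        b = int((ts - start) // bin_seconds)
        acc = sums.setdefault(b, [0.0] * num_topics)
        for k, p in enumerate(theta):
            acc[k] += p
        counts[b] = counts.get(b, 0) + 1
    return {b: [s / counts[b] for s in sums[b]] for b in sorted(sums)}

def order_topics_by_peak(shares):
    """Return topic indices sorted by the time bin where each topic peaks."""
    bins = list(shares)
    num_topics = len(shares[bins[0]])
    peak = {k: max(bins, key=lambda b: shares[b][k]) for k in range(num_topics)}
    return sorted(range(num_topics), key=lambda k: peak[k])

# Made-up example: two tweets in the first hour, two in the second.
shares = topic_share_over_time(
    [0, 10, 3600, 3700],
    [[0.1, 0.9], [0.3, 0.7], [0.8, 0.2], [1.0, 0.0]],
)
ordering = order_topics_by_peak(shares)
```

If the topics track a sequence of events, plotting the shares in this peak order should show each topic rising and fading in turn.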

Upon inspection of the vocabulary distribution for each topic, we can reconstruct the series of events that drove the discussions. The earliest discussions focused on initial reactions and observations (“Two explosions reported near Boston Marathon finish line”). The discussion then transitions to people sympathizing with the victims (“How can I donate?” and “Our prayers go out to the victims”). The fourth and fifth topics focus on the discovery of additional explosives that were dismantled, along with the increased participation of news and media organizations. Finally, the last two topics focus on Google Person Finder and Obama’s speech regarding the events. The top vocabulary associated with each topic is given as:

| Topic 1 | Topic 2 | Topic 3 | Topic 4 |
|---|---|---|---|
| line | donate | marathon | explosive |
| finish | receive | dead | dismantled |
| boston | retweet | donate | official |
| marathon | every | injured | intelligence |
| explosions | for | for | found |
| two | bostonmarathon | every | devices |
| near | victims | boston | breaking |
| reported | prayforboston | prayers | boston |
| news | controlled | bostonmarathon | homemade |
| everyone | kingjames | victims | marathon |
| photo | dead | two | section |
| prayers | involvedhurt | retweet | two |

| Topic 5 | Topic 6 | Topic 7 | Topic 8 |
|---|---|---|---|
| library | library | library | person |
| another | jfk | jfk | finder |
| jfk | threat | google | google |
| confirms | west | finder | give |
| reutersus | nyc | person | obama |
| police | location | tips | hospital |
| devices | leave | saudi | straight |
| confirmed | another | call | blood |
| reuters | street | suspect | the |
| skynewsbreak | you | third | president |
| explosion | now | national | shot |
| breaking | bomb | commissioner | created |
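Word lists like the ones above are usually read off a fitted model by ranking the vocabulary within each topic by probability. A minimal sketch, assuming the per-topic word distributions are available as rows of probabilities aligned with a vocabulary list (the names `top_words`, `topic_word`, and `vocab` are hypothetical, and the numbers are made up):

```python
def top_words(topic_word, vocab, n=12):
    """Return the n most probable vocabulary words for each topic.

    topic_word: one list of word probabilities per topic, aligned with vocab.
    """
    result = []
    for row in topic_word:
        # Rank word indices by descending probability; ties keep vocab order.
        ranked = sorted(range(len(vocab)), key=lambda v: -row[v])
        result.append([vocab[v] for v in ranked[:n]])
    return result

# Made-up distributions over a four-word vocabulary.
vocab = ["boston", "donate", "finish", "victims"]
topic_word = [
    [0.5, 0.1, 0.3, 0.1],   # a topic dominated by "boston" and "finish"
    [0.1, 0.4, 0.1, 0.4],   # a topic dominated by "donate" and "victims"
]
words = top_words(topic_word, vocab, n=2)
```

The cutoff `n=12` matches the number of words shown per topic above; it is a display choice, not a property of the model.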

In short, we showed that the topics extracted by topic modeling can exhibit a temporal ordering that helps reconstruct the story behind a complex series of events. Moving forward, there are several questions that we’re interested in answering:

  • How does the evolution of discussion differ depending on geography?
  • What additional insight can we gain by using correlated topic models to extract significant changes in these discussions?
  • Are there other topic model extensions that can be modified to take spatio-temporal aspects into account?

What are the next steps for the Esri Dev Center team? Well, stay tuned as we extend these methodologies to look at how discussions evolve over geospatial regions!

Update (Friday, 3:50PM):

After hearing about the series of events that occurred on the MIT campus last night, we collected tweets containing the keywords associated with “MIT” and “shooting”. Using the same methodologies, we see that the extracted topics again exhibit an intuitive ordering over time.
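How the tweets were collected and tokenized is not described. As a rough illustration only, keyword filtering of raw tweet text into token lists might look like the sketch below. The tokenizer, the function names, the `require_all` switch (the post does not say whether both keywords had to appear), and the sample tweets are all assumptions.

```python
import re

TOKEN_RE = re.compile(r"[a-z0-9_]+")

def tokenize(text):
    """Lowercase the text, drop URLs, and keep alphanumeric runs.

    Hashtags and handles lose their prefix, so "#MITshooting" becomes
    "mitshooting" and "@cnnbrk" becomes "cnnbrk".
    """
    text = re.sub(r"https?://\S+", " ", text.lower())
    return TOKEN_RE.findall(text)

def filter_by_keywords(tweets, keywords, require_all=False):
    """Return token lists for tweets containing any (or all) of the keywords."""
    wanted = {k.lower() for k in keywords}
    match = all if require_all else any
    kept = []
    for text in tweets:
        tokens = tokenize(text)
        present = set(tokens)
        if match(k in present for k in wanted):
            kept.append(tokens)
    return kept

# Made-up sample tweets.
sample = [
    "Shooting reported at MIT http://t.co/abc",
    "Good morning Boston",
    "#MITshooting update via @cnnbrk",
]
kept = filter_by_keywords(sample, ["MIT", "shooting"])
```

Stripping the `#` and `@` prefixes in this way is one plausible reason that tokens such as "mitshooting" and "cnnbrk" show up as ordinary vocabulary entries.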

The corresponding vocabulary distribution for each topic is given by:

| Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | Topic 6 |
|---|---|---|---|---|---|
| police | watertown | safe | akitz | watertown | harvard |
| campus | grenades | campus | watertown | police | emerson |
| officer | scanner | mitshooting | chair | dead | closed |
| shot | custody | officer | wall | suspects | akitz |
| died | akitz | youranonnews | bullet | marathon | classes |
| state | explosions | confirmation | hole | tied | college |
| shooter | suspects | police | sunil | one | today |
| says | police | longer | tripathi | officer | waking |
| breaking | explosives | says | suspect | suspect | universities |
| week | officer | official | mike | robbery | der |
| say | related | arrested | mulugeta | carjacking | last |
| active | reports | update | names | two | swat |
| cambridge | laurel | resume | identified | say | ich |
| cnnbrk | witnesses | operation | mitshooting | cnn | und |

Responses to The evolution of discussion around the Boston Marathon events

  1. It would be wonderful to see this same analysis around the different topics layered on a map to do a further sentiment analysis on the types of emotions expressed from different geographic regions.

  2. Stefan Novak says:

    Thanks for the feedback, Jeff! You’re definitely on to something – I think there are a lot of opportunities to explore the correlation between geospatial sentiment analysis with discussion topic modeling to gain additional insight on significant events.

  3. [...] Another review of the event used a form of topic analysis to explore the semantic textual content different waves of Twitter activity using a Twitter accession count timebase – The evolution of discussion around the Boston Marathon events. [...]

  4. Ciro Cattuto says:

    Very interesting! We did something similar for the recent Italian elections, using non-negative tensor factorization techniques. We then visualized the time-varying topics as streamgraphs, using d3.
