With over 6.4 million tweets shared amongst friends during the Academy Awards, peaking at over 85,000 tweets per minute during Michelle Obama’s presentation of the Best Picture award, quite a number of creative analyses and visualizations have been constructed to demonstrate the predictive and explanatory power of social media for popular events. Most notable was Topsy’s Twitter Oscars Index, which tracked the sentiment of tweets relating to Oscar films over the last several weeks. The team at Esri’s R&D Center in Washington, D.C. decided to take this one step further by translating sentiment scores into the geospatial frame for additional insight. By taking advantage of Esri’s Tapestry market segment data set, we can measure not only what people are saying, but also who is saying it and where.


In order to quantify sentiment within tweets, we wish to construct a binary classifier that differentiates between “positive” and “negative” tweets; this positive/negative distinction is otherwise known as polarity. Constructing a relevant training data set by hand is feasible but arduous, so we are going to rely on publicly available data sets to train our models.

Selected models

Before jumping into the results, let’s take a moment to review the different classification models we’re going to be working with:

  • Naive Bayes: a simple probabilistic model that we all love.
  • Maximum entropy (via MEGAM): a general technique that estimates the conditional distribution of the class variable given a document. For those who are familiar with multinomial logistic regression, note that it satisfies the characteristics of a maximum entropy classifier. To quote Nigam et al.:

    “The underlying principle of maximum entropy is that without external knowledge, one should prefer distributions that are uniform. Constraints on the distribution, derived from labeled training data, inform the technique where to be minimally non-uniform. The maximum entropy formulation has a unique solution which can be found by the improved iterative scaling algorithm.”
  • Support vector machine (via SVMlight): a non-probabilistic linear classifier that takes advantage of kernel functions to construct an optimally separating hyperplane between the two classes.
  • Sentiment140.com: a publicly available API that we can use as a baseline. Their implementation is closed-source; however, their algorithm can be read here.
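To make the first model in the list concrete, here is a minimal from-scratch sketch of a Naive Bayes polarity classifier. It is not the NLTK implementation used in this analysis; the function names, the add-one smoothing, and the four toy documents are all assumptions made purely for illustration.

```python
import math
from collections import Counter

def train_naive_bayes(docs):
    """docs: list of (tokens, label) pairs. Returns the counts the model needs."""
    label_counts = Counter()
    word_counts = {}   # label -> Counter of word frequencies
    vocab = set()
    for tokens, label in docs:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict_proba(model, tokens):
    """Return {label: P(label | tokens)} using add-one (Laplace) smoothing."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)          # log prior
        for tok in tokens:
            if tok not in vocab:
                continue                               # skip words never seen in training
            score += math.log((word_counts[label][tok] + 1) /
                              (total_words + len(vocab)))
        log_scores[label] = score
    # Normalise in log space so the scores sum to 1.
    top = max(log_scores.values())
    unnormalised = {lab: math.exp(s - top) for lab, s in log_scores.items()}
    z = sum(unnormalised.values())
    return {lab: v / z for lab, v in unnormalised.items()}

# Hypothetical toy corpus (not taken from any of the data sets discussed here).
TOY_DOCS = [
    ("a wonderful moving film".split(), "pos"),
    ("great acting and a great script".split(), "pos"),
    ("a dull boring mess".split(), "neg"),
    ("terrible pacing and boring dialogue".split(), "neg"),
]
model = train_naive_bayes(TOY_DOCS)
scores = predict_proba(model, "a great film".split())
```

The "pos" entry of `scores` plays the role of the prediction score discussed later: a number between 0 and 1 that is collapsed to a class label by a threshold.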

Available data

We’re going to focus on training each of the three models on three data sets:

  • B. Pang’s “Movie Review Data” (2004): 1k positive and 1k negative movie reviews taken from IMDb’s archive of the rec.arts.movies.reviews newsgroup.
  • A. Maas’s “Large Movie Review Dataset” (2011): 12.5k positive and 12.5k negative movie reviews taken directly from IMDb. Note that we are using just the provided training set for both training and testing.
  • Sentiment140.com’s tweet corpus, a collection of tweets that have been classified via emoticons.

Note that any models trained on the first two data sets would be biased towards features (vocabulary) that are prevalent in movie reviews, which extend well beyond the 140-character limit of tweets. This is a subtle caveat that may contribute to inaccurate predictions; however, we will talk about how this limitation can be mitigated by the use of ensemble classifiers.

Training and validation: Building our models

Using NLTK (and the help of nltk-trainer), each model is trained and tested using a random 80%/20% split on each available data set. Additionally, we use perf to conduct analysis on the performance of the classifier.
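As a rough illustration of the random 80%/20% split, the sketch below uses only the Python standard library. The helper names, the fixed seed, and the stand-in classifier are assumptions for this example; nltk-trainer handles the equivalent steps through its own options.

```python
import random

def split_train_test(examples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the examples and cut it into train and test portions."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(classify, test_set):
    """Fraction of (text, label) pairs for which classify(text) == label."""
    correct = sum(1 for text, label in test_set if classify(text) == label)
    return correct / len(test_set)

# Hypothetical labelled corpus: ten placeholder documents with alternating labels.
EXAMPLES = [("review %d" % i, "pos" if i % 2 == 0 else "neg") for i in range(10)]
train_set, test_set = split_train_test(EXAMPLES)

def toy_classifier(text):
    """Stand-in for a trained model; it simply reads the label off the document id."""
    return "pos" if int(text.split()[-1]) % 2 == 0 else "neg"

test_accuracy = accuracy(toy_classifier, test_set)
```

In a real run, the model would be fit on `train_set` only, and `accuracy` would be computed on the held-out `test_set`.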

The overall accuracy of each model configuration, based on the evaluation of the test data set, can be given as:

Model             Training Dataset   Accuracy*   Precision*   Recall*
Maximum Entropy   Pang               100.0%      100.0%       100.0%
Maximum Entropy   Maas               93.4%       93.2%        93.7%
Naive Bayes       Pang               96.7%       93.8%        100.0%
Naive Bayes       Maas               92.1%       93.6%        90.3%
SVM               Pang               88.4%       90.5%        85.8%
SVM               Maas               88.6%       88.3%        89.0%

*At a threshold of 0.50

We can assess the performance of each model by analyzing predictions from the test data set. Once a prediction score is generated for a given piece of text, we use a “threshold” to decide whether the text is positive or negative (thus collapsing the prediction score to a class designation). If the prediction score exceeds the threshold, the text is deemed to be of positive sentiment; otherwise it is assumed to be negative. In the extreme case where we set the threshold to 1.0, all tweets would be assumed to be negative, and vice versa if we set the threshold to 0.0. Adjusting the threshold allows us to evaluate the trade-off between Type I and Type II errors, which is captured by the notions of precision and recall. This gives us a more insightful understanding of the performance of our model.
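The thresholding described above can be sketched in a few lines. This is an illustrative example only, not the perf tool: the function names and the six toy scores are assumptions. False positives correspond to Type I errors and false negatives to Type II errors.

```python
def confusion_at_threshold(scores, labels, threshold=0.5):
    """Count outcomes when a score strictly above the threshold means 'positive'."""
    tp = fp = tn = fn = 0
    for score, is_positive in zip(scores, labels):
        predicted_positive = score > threshold
        if predicted_positive and is_positive:
            tp += 1          # correctly flagged positive
        elif predicted_positive:
            fp += 1          # Type I error
        elif is_positive:
            fn += 1          # Type II error
        else:
            tn += 1          # correctly flagged negative
    return tp, fp, tn, fn

def precision_recall(tp, fp, tn, fn):
    """Precision and recall; None when the denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall

# Hypothetical prediction scores and true labels (True = positive sentiment).
SCORES = [0.95, 0.80, 0.65, 0.40, 0.30, 0.10]
LABELS = [True, True, False, True, False, False]

at_half = precision_recall(*confusion_at_threshold(SCORES, LABELS, 0.5))
at_one = confusion_at_threshold(SCORES, LABELS, 1.0)    # nothing exceeds 1.0
at_zero = confusion_at_threshold(SCORES, LABELS, 0.0)   # everything exceeds 0.0
```

Sweeping the threshold between 0.0 and 1.0 and recomputing the pair at each step traces out the precision/recall trade-off for a given model.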

In the example case of the Naive Bayes model trained using Pang’s “Movie Review Data”, we can visualize different performance characteristics of the classifier. Additionally, we can test across different data sets to compare prediction consistency:

In this case, the model that was trained on Maas’s data set performed just as well when tested against Pang’s. (Hence the increased accuracy in the top-left graph.)

For a better description of these measures, I encourage you to take a look at the source code for perf as it includes a lot of fantastic documentation on how to interpret these values.

Training and validation: Validating predictions

Over the course of the Oscars weekend, the Esri R&D Center team was able to collect 3 million tweets related to films highlighted during the Oscars. Generating prediction scores for each model configuration yields interesting distributions:

Note that the “Default” and “Movies” datasets correspond to the “default” and “movie” topics available via the Sentiment140 API. We can see that the Naive Bayes model, trained with Pang’s data set, exhibits a natural distribution over the 3M tweets collected.

Furthermore, we can visualize the correlation of prediction scores across models:

The most notable observation is how both SVM and Sentiment140 predictions exhibit negative correlation with other models. Keep in mind that the data used to train the Sentiment140 model was derived from tweets containing emoticons; thus the negative correlation with the Naive Bayes and maximum entropy models suggests that there is a poor intersection of common features between the two data sets. One possible way to mitigate this is by combining predictions across models / data sets via an ensemble classifier.
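To illustrate both ideas, the sketch below computes a Pearson correlation between two models' prediction scores and combines several models by simple averaging. Averaging is only one of many possible ensemble schemes and is an assumption here, as are the model names and the toy scores.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def average_ensemble(score_lists):
    """Combine per-model scores for the same tweets by taking the mean per tweet."""
    return [sum(column) / len(column) for column in zip(*score_lists)]

# Hypothetical scores from three models for the same four tweets.
MODEL_SCORES = {
    "model_a": [0.9, 0.7, 0.2, 0.1],
    "model_b": [0.8, 0.6, 0.3, 0.2],
    "model_c": [0.2, 0.4, 0.7, 0.9],   # disagrees with the other two
}

agree = pearson(MODEL_SCORES["model_a"], MODEL_SCORES["model_b"])
disagree = pearson(MODEL_SCORES["model_a"], MODEL_SCORES["model_c"])
combined = average_ensemble(list(MODEL_SCORES.values()))
```

Here `agree` comes out strongly positive and `disagree` negative, mirroring the kind of pattern described above; `combined` holds one blended score per tweet.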

For the remainder of the analysis, we chose the Naive Bayes model trained using Pang’s data set.

Aggregating predictions to the county

Taking advantage of Esri’s geocoding services, we are able to geocode approximately 400k of the tweets. Aggregating these tweets to the county level highlights that a significant amount of them (unsurprisingly) originated in the Los Angeles area:

How can we go one step further? After aggregating tweets to county geometries, we can reference the Tapestry data set to extract the market segment composition of each county. Once that is done, a linear regression model is fit to each segment-film combination so that residual scores can be extracted for each county. As a result, we can identify counties which exhibit relatively high sentiment for the Laptops and Lattes segment and the film Argo. Specifically, when we zoom into the Northeast corridor, we observe high sentiment around the Washington, D.C. area:

Additionally, by looking at the residual scores for High Rise Renters and Lincoln, we can see that Los Angeles had a stronger negative sentiment compared to portions of the Northeast:

Please feel free to play around with the publicly available data set to see if you’re able to uncover any interesting trends! Note that you can filter on market segments via the Tapestry Households code, which is described here.
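For readers who want to reproduce the general idea of this section, here is a minimal sketch of aggregating tweet scores to counties and extracting regression residuals for one segment-film combination. The exact regression specification used in the analysis is not given here, so the single-predictor form (county mean sentiment regressed on the segment's share of households), the function names, and the toy numbers are all assumptions.

```python
from collections import defaultdict

def mean_score_by_county(tweets):
    """tweets: iterable of (county, score) pairs -> {county: mean score}."""
    totals = defaultdict(lambda: [0.0, 0])   # county -> [sum of scores, count]
    for county, score in tweets:
        totals[county][0] += score
        totals[county][1] += 1
    return {county: s / n for county, (s, n) in totals.items()}

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residuals_by_county(segment_share, county_sentiment):
    """Residual = observed county sentiment minus the value the fitted line predicts."""
    counties = sorted(county_sentiment)
    xs = [segment_share[c] for c in counties]
    ys = [county_sentiment[c] for c in counties]
    slope, intercept = fit_line(xs, ys)
    return {c: y - (slope * x + intercept) for c, x, y in zip(counties, xs, ys)}

# Hypothetical geocoded tweets for one film, and one segment's share per county.
TWEETS = [("county_a", 0.9), ("county_a", 0.7), ("county_b", 0.4),
          ("county_c", 0.5), ("county_c", 0.3)]
SEGMENT_SHARE = {"county_a": 0.30, "county_b": 0.10, "county_c": 0.20}

county_sentiment = mean_score_by_county(TWEETS)
residuals = residuals_by_county(SEGMENT_SHARE, county_sentiment)
```

A positive residual marks a county whose sentiment is higher than its segment composition alone would suggest, and a negative residual marks the opposite.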


One of the significant challenges in this analysis was trying to identify a training data set that represents a natural separation in vocabulary which parallels what was observed within the production data set. We explained how different models can be trained on various data sets and further combined by the use of ensemble classifiers. Moving forward, additional geospatial analysis can be conducted on the residuals generated from the regression model to extract out additional insight from the data.

Keep in mind that the techniques illustrated here can be easily translated to other NLP-related problems centered around classification. Whether you’re trying to classify documents as threat/non-threat in the security domain, flag insurance claims as fraud/non-fraud, or label social media posts as emergency/non-emergency, the methods here can be easily adapted to those separate domains.

Please feel free to play with the data and let us know if you’re able to uncover any interesting patterns! In the meantime, stay tuned for additional discussions related to further geospatial analysis on this data!

Additional resources

  • To learn about how to assess the performance of a binary classifier, take a look at the source code for perf. There is excellent documentation that provides insightful descriptions on how to interpret binary classification predictions.
  • There are a slew of NLP toolkits out there that feature rich pre-processing tools.
  • For more information on statistical techniques used for solving classification, The Elements of Statistical Learning is available for free online.

One Response to Modeling Twitter sentiment during the Oscars

  1. Zafer says:

    Really cool. I am trying to include geospatial analysis for my website, Buzz Scale, which generates scores for the latest movies by performing sentiment analysis. Interesting to see geographic differences in sentiment (LA vs East Coast). Perhaps I should look into doing some analysis to see if particular movie titles have greater difference in sentiment by geography than other films.