We’ve been working with Twitter’s streaming API for some time and thinking about how we could effectively leverage it for geographic analysis. In particular, as sentiment analysis has made progress, the possibility of using Twitter as a leading indicator of market reaction by geography is very exciting. To this end we’ve combined location-based analysis and sentiment tracking through GeoIQ to gauge market reaction to the Oscars. Thanks to the Herculean efforts of Chris Helm and the rest of the team, I’m proud to say we have a new dashboard for tracking sentiment by geography from Twitter. The new dashboard also gave the team a chance to push what GeoIQ could do with HTML5 and SVG. That said, it is best to check out the new hotness in Safari or Chrome.
Starting with the Oscar dashboard, we collected all the Tweets that mentioned the nominees for best movie, best actor and best actress. From this collection of data we populated the dashboard with a broad array of analyses. We assigned each Tweet to a major market based on its geography and calculated its sentiment. We took a quick pass at putting together the highlights of the analysis in the slides below:
There is a lot going on under the hood here, but to keep it simple: we collect a set of tags from the Twitter streaming API, then perform a variety of analyses against the Tweets we pull. The two main analytical tasks are determining the geographic origin of a Tweet and then analyzing its sentiment. For geography we work at three different levels: 1) grabbing the coordinates for Tweets from GPS-enabled phones, 2) taking the bounding box for Tweets from Geo-IP and user designations and 3) using the location from the user’s profile. We work progressively from 1) to 3) and typically get locations for between 30% and 60% of Tweets. For sentiment we tested out a variety of APIs and ended up using Repustate for this project; it held up well to the load.
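The progressive fallback from 1) to 3) can be sketched roughly as follows. This is a minimal illustration, not the dashboard’s actual code: the field names (`coordinates`, `place.bounding_box`, `user.location`) are assumed to follow the classic Twitter JSON payload, `geocode_profile` is a hypothetical geocoder you would supply yourself, and using the center of the bounding box is our own simplification.

```python
def resolve_location(tweet, geocode_profile):
    """Return a dict with lon, lat and the source used, or None if nothing resolves.

    Assumes a tweet dict shaped like the classic Twitter JSON payload.
    geocode_profile is a hypothetical callable: str -> (lon, lat) or None.
    """
    # Level 1: exact coordinates from a GPS-enabled device (GeoJSON order: lon, lat).
    coords = tweet.get("coordinates")
    if coords and coords.get("coordinates"):
        lon, lat = coords["coordinates"]
        return {"lon": lon, "lat": lat, "source": "gps"}

    # Level 2: bounding box of the place attached to the Tweet; use its center.
    place = tweet.get("place")
    if place and place.get("bounding_box"):
        ring = place["bounding_box"]["coordinates"][0]
        lon = sum(p[0] for p in ring) / len(ring)
        lat = sum(p[1] for p in ring) / len(ring)
        return {"lon": lon, "lat": lat, "source": "bounding_box"}

    # Level 3: free-text profile location, the least reliable option.
    profile = (tweet.get("user") or {}).get("location")
    if profile:
        hit = geocode_profile(profile)
        if hit is not None:
            lon, lat = hit
            return {"lon": lon, "lat": lat, "source": "profile"}

    return None
```

Keeping the `source` alongside the coordinates matters later on, because the accuracy of a GPS fix and a geocoded profile city are very different.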
This approach is not without its challenges. Profile location is notoriously ambiguous, as research studies have shown. While there are good mitigation strategies for the profile issue, we’ve seen a larger issue plague geographic analysis of Twitter. In the vast majority of Twitter geographic visualizations the data is geocoded and represented as a point on the map. The problem is that the accuracy of this point, compared with where the Tweet actually came from, varies wildly. In the case of latitude/longitude coordinates from a GPS-enabled mobile phone the point can be accurate to within a few feet, but in the case of a profile city geocode it can be hundreds of miles off. Despite these large variances in accuracy, the points are typically drawn the same way, which can produce misleading results.
To solve this problem we aggregate all the Tweets to polygons, in this case major market areas. The key is that the polygons you aggregate to must be larger than the accuracy error bound of your geocoding. The cool thing the team did with the Oscar dashboard was to make these aggregations happen dynamically. As Tweets come in they are intersected with the major market polygons and the summary statistics are calculated for each major market. For any of the markets, just click its graduated circle to get the aggregated statistics. You can also click multiple movies or actors/actresses and the dashboard will calculate the aggregate summary statistics for the clicked items. Finally, you can see Tweets from mobile devices by clicking “Current Tweets” to view exact locations, and animate them over time by clicking “play”.
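To make the idea concrete, here is a small sketch of aggregating incoming Tweets to market polygons and keeping running statistics. It is only an illustration under our own assumptions: the market names and shapes are made up, polygons are simple rings of (lon, lat) pairs, and the point-in-polygon test is a plain ray-casting check rather than whatever GeoIQ does internally.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: is the point inside the ring of (lon, lat) vertices?"""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Does this edge straddle the horizontal line through the point?
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


class MarketAggregator:
    """Keeps a running count and sentiment sum for each market polygon."""

    def __init__(self, markets):
        # markets: dict of hypothetical market name -> ring of (lon, lat) vertices
        self.markets = markets
        self.stats = {name: {"count": 0, "sentiment_sum": 0.0} for name in markets}

    def add(self, lon, lat, sentiment):
        """Assign one Tweet to the first market containing it; return the market name."""
        for name, ring in self.markets.items():
            if point_in_polygon(lon, lat, ring):
                self.stats[name]["count"] += 1
                self.stats[name]["sentiment_sum"] += sentiment
                return name
        return None  # the Tweet falls outside every market

    def mean_sentiment(self, name):
        s = self.stats[name]
        return s["sentiment_sum"] / s["count"] if s["count"] else None
```

Because the statistics are updated one Tweet at a time, the summary for each market is always current without re-scanning the whole collection.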
A second challenge with location-based sentiment analysis is how meaningful the results are. I think one of the things we miss is a margin of error calculation for sentiment analysis. Once we’ve aggregated the data we have a sample size for each geography that we can calculate a margin of error against. In the summary statistics for each major market you can find a margin of error calculated from that market’s sample size. This lets the viewer judge how much confidence to place in any analysis by geography.
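As a rough illustration of how sample size drives the margin of error, here is the standard formula for a proportion at a 95% confidence level. This is an assumption on our part for the sake of example: the function name and defaults are ours, it treats the Tweets in a market as a simple random sample, and it is not necessarily the exact calculation used in the dashboard.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Margin of error for a proportion p observed in a sample of size n.

    p=0.5 is the worst case (widest margin); z=1.96 corresponds to 95% confidence.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    return z * math.sqrt(p * (1 - p) / n)
```

With these defaults a market with 100 Tweets has a margin of error of about 9.8 percentage points, while a market with 400 Tweets comes in at about 4.9, so quadrupling the sample halves the margin.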
The last nuance we’ve added to the dashboard is the ability to bring in demographics by major market, overlaying a variety of income, ethnicity, age and gender data beneath the Twitter sentiment. This allows users to see how a variety of demographic trends intersect with the sentiment data. There is lots to play with, and we look forward to feedback on how it can be improved and applied to other use cases.