(by Ming Tsou, Professor of Geography, San Diego State University, firstname.lastname@example.org; November 05, 2012, 11:59pm)
Tomorrow is Election Day for the 2012 U.S. Presidential Election. Who will win? That is probably the most expensive answer on Earth right now. Hundreds of different “predictions” and “analyses” are available on TV, radio, newspapers, and various other media. Can the new generation of media, social media, help us predict the election results? Our CDI project team humbly presents some preliminary findings from our election analysis so far. Our verdict is “YES, but not today”.
Before I elaborate on the results, I would like to acknowledge the wonderful work done by our great research team and our graduate students, especially Jay Yang, Su Han, and Daniel Lusher. They worked day and night on the design of Twitter APIs, SQL databases, data cleaning and analysis, and map creation.
Figure 1 illustrates the DAILY comparison between the total number of tweets containing the “Obama” keyword and the total number of tweets containing the “Romney” keyword in the TOP 30 U.S. cities (combining the tweet frequencies of all 30 cities, 1.87 million tweets in total). One important caveat is that there is a lot of “noise”, “error”, “bias”, and “distortion” within this BIG DATA of tweets. Over 30% of tweets are RTs (re-tweets) and over 20% of tweets are generated by “robots” or media tools (based on our preliminary analysis). A higher number of tweets containing a keyword may not indicate a higher support rate or true popularity. We prefer the term “attention” rather than “popularity” to describe higher numbers of tweets with a keyword. Sometimes, higher attention may be associated with negative comments or dislike tweets. Another potential problem is that we only collect tweets within a 17-mile radius of major U.S. cities, where the urban population profile may favor Democratic candidates.
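To make the counting idea concrete, here is a minimal Python sketch of how daily keyword “attention” could be tallied, with an option to drop re-tweets. This is only an illustration, not our team’s actual pipeline: the tweet record layout (`day` and `text` fields), the function names, and the rule that a tweet starting with “RT @” is a re-tweet are all assumptions made for this example.

```python
from collections import Counter

KEYWORDS = ("obama", "romney")


def is_retweet(text):
    # Assumption: a tweet whose text starts with "RT @" is a re-tweet.
    return text.lstrip().lower().startswith("rt @")


def daily_keyword_counts(tweets, skip_retweets=False):
    """Count tweets per day that mention each keyword (case-insensitive).

    `tweets` is an iterable of dicts with hypothetical keys "day" and "text".
    A tweet that mentions both keywords is counted once for each.
    """
    counts = {}
    for tweet in tweets:
        text = tweet["text"]
        if skip_retweets and is_retweet(text):
            continue
        lowered = text.lower()
        for keyword in KEYWORDS:
            if keyword in lowered:
                counts.setdefault(tweet["day"], Counter())[keyword] += 1
    return counts
```

Running it once with `skip_retweets=False` and once with `skip_retweets=True` gives a quick sense of how much of the raw “attention” comes from re-tweets. Filtering out “robot” accounts would need a separate rule and is not attempted here.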
To correct these errors and biases, it will take a lot of effort and many algorithms to remove the “noise” in social media data. Well, the election is tomorrow and I don’t have time now. So, let’s forget all these potential errors and biases in our tweet data at this point. If we simply take the raw data (Figure 1) and convert it to a trend for the two candidates, we find that the result is surprisingly similar to other official polls and predictions regarding the Presidential Election. [Obama]’s attention became higher than [Romney]’s after October 31, once Hurricane Sandy had caused significant damage on the East Coast. But the gap between the two candidates has become smaller and smaller in the last two days. Interesting, right? The most interesting part is that we can create this analysis in a very CHEAP and FAST way. Only two Ph.D. students and one master’s student at San Diego State University did all this work! I don’t need to hire hundreds of people to do telephone interviews and polls!
So why is our verdict “NOT today”? Because there are still many “inconsistent results” in our tweet analysis, especially when we zoom in to the city or state level, such as in the analysis of “swing” states. Let’s use Ohio and Florida as two examples (Figure 2 and Figure 3).
Figure 2 illustrates the trend of tweet “attention” between Obama and Romney in Ohio, combining the tweet results from Cleveland and Columbus. The gap in tweet attention (57% vs 43%) is exaggerated compared with the gap in other polls (50% vs 48.5%), but the trend is similar. One possible explanation is that social media attention rates exaggerate changes in real-world trends. Social media create several “distortion effects” when reflecting our real world, similar to the “projection” concept in cartography (Figure 3).
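The arithmetic behind this comparison is simple enough to sketch in a few lines of Python. The function names are made up for this illustration, and the only numbers used are the Ohio percentages quoted above; nothing here is our production code.

```python
def attention_share(count_a, count_b):
    """Turn two raw keyword counts into percentage shares of their sum."""
    total = count_a + count_b
    if total == 0:
        raise ValueError("no tweets for either keyword")
    return 100.0 * count_a / total, 100.0 * count_b / total


def gap_points(share_a, share_b):
    """Absolute gap between two shares, in percentage points."""
    return abs(share_a - share_b)


# Ohio percentages quoted in the text.
tweet_gap = gap_points(57.0, 43.0)  # 14.0 points in tweet attention
poll_gap = gap_points(50.0, 48.5)   # 1.5 points in the polls

# How many times wider the tweet gap is than the poll gap.
exaggeration = tweet_gap / poll_gap
```

With these numbers the tweet gap is roughly nine times wider than the poll gap, which is the “exaggeration” described above. Whether that ratio stays stable across cities and days is exactly the kind of thing we still need to check.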
Figure 4 illustrates the trend change in Jacksonville (Florida). It looks like a very tight race now (50% vs 50%) on both November 03 and November 04.
Figure 5 illustrates the weekly trend comparing “Obama” and “Romney” in tweet attention across the 30 U.S. cities. The graph shows a few interesting patterns. First, the week with the highest attention was the week of the Democratic Convention (Sep. 02-08), while the GOP convention week did not create very strong attention in social media. Second, the killing of the U.S. Ambassador in Benghazi did trigger higher attention for “Romney”, and the TV debates then also created attention for both candidates.
Figure 6 compares the RCP Poll Averages with the tweet attention.
In addition to these preliminary analysis results, our project has also created several interactive online tools for mapping the presidential election trend here:
Our next step is to conduct a deeper analysis of these tweets, using computational linguistic tools to separate PRO and CON tweets for each candidate, and to establish the Twitter user profile compared with the U.S. population distribution (Figure 7). Social media user profiles may differ from one another. Different age groups may use different types of keywords in their tweets. Some day, our algorithms may be able to help us identify users’ age groups automatically. But ... not today.
Social media have a lot of potential for election campaigns and poll analysis. They differ from traditional poll surveys, which collect data from “passive” voters. Twitter, on the other hand, can be used to collect data from “active” voters who are actively posting about their favorite candidates. Active voters are more likely to “vote” in the real world than “passive” voters. My prediction is that social media will be used intensively in the 2016 Presidential Election, once we develop more comprehensive data analysis methods and data cleaning algorithms.
Well, it’s past midnight now. I have to go to bed and sleep. If we are “lucky” enough, maybe one of our graphs above will become the magic crystal ball tomorrow.
Ming Tsou from San Diego State University.