Friday, December 2, 2016

Two Data Questions You Need to Ask

The Harvard Business Review article by Michael Li is probably behind a paywall, but it's a great, short note on two important questions to ask a data analyst. Maybe I just like it because I agree with him (it's something I constantly tell my students), but I think it's good advice for anyone. If you have access to HBR, then stop reading my blog, go read the article, and judge it for yourself.

If you don't have HBR access, here's an even shorter version.

1) How was the data collected? I tell my students that statistical calculations, even complicated ones, are fairly straightforward and no one can really argue with your calculations. All of the debate in statistical analysis is about the data.

2) What's the margin of error? If you understand the difference between a point estimate and a population parameter, then you know that people regularly present point estimates as "the" number. That's a mistake. A point estimate is just an estimate. Therefore, the uncertainty behind it needs to be openly acknowledged with a margin of error.
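To make the second question concrete, here's a minimal sketch of how a margin of error can be computed for a simple poll result. It assumes a simple random sample and uses the usual normal approximation for a proportion. The poll numbers and the function name `margin_of_error` are made up for illustration; they are not from Dr. Li's article.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a sample proportion.

    Uses the normal approximation; z=1.96 corresponds to roughly
    95% confidence. Assumes a simple random sample of size n.
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical poll: 520 of 1,000 respondents favor a candidate.
n = 1000
p_hat = 520 / n                      # the point estimate
moe = margin_of_error(p_hat, n)      # the uncertainty around it
low, high = p_hat - moe, p_hat + moe

print(f"Point estimate: {p_hat:.1%}, margin of error: +/-{moe:.1%}")
print(f"Approximate 95% interval: {low:.1%} to {high:.1%}")
```

In this made-up poll the point estimate is 52%, but the interval runs from just under 49% to just over 55%. Reporting only "52%" hides the fact that the data can't rule out a value below 50%.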

Don't take my word for it. Go read Dr. Li's article.

Wednesday, November 30, 2016

About that popular vote ... (more on the 2016 election)

I recently wrote about the pre-election polling for this year's Presidential election. Now it's time to comment on post-election posturing.

This is the second time in my life that the Electoral College winner was not the popular vote winner. It's no surprise that many, especially on the losing side, are calling for the elimination of the Electoral College. I'm not going to get into that debate. There are plenty of opinion pieces out there.

Instead, I want to address the data implications. Nearly everyone seems to assume that without the Electoral College we would have had President Gore and would now have President-Elect Clinton. A thought-provoking column by Aaron Blake points out that we can't make that assumption.

I strongly encourage you to read it but I'll summarize some major points here. Presidential campaign strategies are based on the existence of the Electoral College. Whether it's advertising, candidate appearances, or the "ground game", everything is done in order to win the College, not the popular vote.

Strategies would be very different if candidates were trying to win the popular vote. In particular, non-swing states would get a lot more attention. States like California (blue) or Texas (red) are ignored because 100% of their Electoral votes (55 & 38) consistently go to the same party (Democrat, Republican). These non-swing states tend to have lower voter turnout than swing states.

Under a popular vote campaign strategy, states like these would get a lot more attention. The Democrats would almost certainly still win California but not 100% of California. Would voter turnout increase? If so, how much? Instead of 60/40 vote in favor of Democrats, would it be 55/45? Would small changes in high population states be enough to swing the national popular vote? Keep in mind that while Republicans were trying to shift the California vote, Democrats would be doing the same thing in Texas.
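Here's a rough sketch of the arithmetic behind that question. Every number below is hypothetical and chosen only to show how the calculation works; none of it is actual 2016 data. The function name `two_party_margin` is mine, and the sketch assumes just two parties and no change in turnout.

```python
def two_party_margin(regions):
    """Sum of (votes for party A - votes for party B) over all regions.

    Each region is a (total_votes, share_for_a) pair; party B is
    assumed to receive the remainder of the two-party vote.
    """
    return sum(votes * (2 * share_a - 1) for votes, share_a in regions)

# Entirely hypothetical numbers, for illustration only.
rest_of_country = (90_000_000, 0.495)     # party A narrowly behind elsewhere
big_state_now = (10_000_000, 0.60)        # 60/40 when the state is ignored
big_state_contested = (10_000_000, 0.55)  # 55/45 if it were contested

baseline = two_party_margin([big_state_now, rest_of_country])
contested = two_party_margin([big_state_contested, rest_of_country])

print(f"National margin for A, state ignored:   {baseline:,.0f}")
print(f"National margin for A, state contested: {contested:,.0f}")
```

With these invented inputs, a five-point shift in one large state cuts party A's national lead from roughly 1.1 million votes to roughly 100,000. A similar shift in a second large state, or a change in turnout, could move the total in either direction.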

I'm not saying that Trump (or Bush II) would have won the popular vote. I'm saying that we don't know. When the popular vote under an Electoral College system is won by just a couple of percentage points, we simply have no idea what the outcome would be without an Electoral College.

This brings me back to my previous post. The 13 Keys model predicts only the popular vote, but it was developed using historical popular vote outcomes under an Electoral College system. That makes me wonder whether or not elimination of the College would change the predictive power of those Keys.

Friday, November 18, 2016

The 2016 Presidential Election was NOT the Death of Data

Now that the presidential election is a couple of weeks in our rearview mirror, I thought it a good time to comment on some data aspects of the election. Driving to work the morning after the election, I listened to a national program announce that the election results signaled "the death of data" because the polls had been so terribly wrong. Others have written about the data's failure.

That's simply not true. First, some polls did a pretty good job predicting the outcome. Whether or not those polls should have been considered outliers will be debated in the polling community for many years.

However, I want to focus on another issue. For this discussion, let's concede that "most of the polls were wrong". That's still not the death of data. At worst, it's the death of survey data.

For decades, social scientists have known that self-reported activities often don't match behaviors. Perhaps the most famous example is the Tucson Garbage Project. Political pollsters, and those who report on them, should always keep in mind that they're merely getting people's statements about what they are going to do. They are not getting any information on actual actions. In other words, they don't have empirical data.

How would you use empirical data to predict a presidential outcome? Just ask Allan Lichtman. Lichtman's "13 Keys to the White House" model has correctly predicted every presidential election since 1984 and it doesn't use polling data. Instead, it was created by looking at historical voting data (in other words, empirical data) from 1860 through 1980. There are 13 true/false indicators. If enough of them come up "against", then it predicts the incumbent party in the White House will lose the popular vote to the challenging party. It says nothing about the electoral college.
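The decision rule is simple enough to sketch in a few lines. This is my own illustration, not Lichtman's code. The function name `predicts_incumbent_loss` is hypothetical, and the default threshold of six is an assumption on my part (it's the figure commonly attributed to the model), so I've left it as an adjustable parameter.

```python
def predicts_incumbent_loss(keys, threshold=6):
    """Apply a 13 Keys style rule.

    keys: 13 booleans, where True means the key favors the incumbent
    party and False means it goes "against" the incumbent party.
    threshold: number of "against" keys needed to predict a loss.
    The default of 6 is an assumption, not taken from this post.
    """
    if len(keys) != 13:
        raise ValueError("expected exactly 13 keys")
    against = sum(1 for key in keys if not key)
    return against >= threshold

# Hypothetical key settings, for illustration only.
five_against = [True] * 8 + [False] * 5
six_against = [True] * 7 + [False] * 6

print(predicts_incumbent_loss(five_against))  # below the threshold
print(predicts_incumbent_loss(six_against))   # at the threshold
```

Notice that nothing in the rule involves polls or electoral votes. It also shows how much can hinge on a single Key: when the count sits right at the threshold, flipping one indicator flips the prediction.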

That distinction between popular vote and electoral college is very important. Lichtman's model correctly predicted Gore's popular vote win in 2000. It did not predict Bush's electoral college victory because it doesn't predict anything about the electoral college. It can't be considered either right or wrong on the electoral college.

Sometimes, the 13 Keys are clear fairly early in the election cycle. At least once, Lichtman made his prediction two years ahead of the election. This time around, his prediction was later in the cycle and he predicted a Trump victory. After the election, he was hailed by some as being one of just a few who predicted correctly.

Now if you've been paying attention up to this point, you might say "Wait a minute. You said that Lichtman predicts the popular vote, not the electoral college and Clinton won the popular vote. Therefore he was wrong!".

On the surface it would appear that way. However, the final Key that turned "against" the incumbent party involved third party candidates. A significant third party vote was a signal against the incumbents. When Lichtman made his prediction, it looked very much like Gary Johnson would be a significant third party influence. In the end though, all the third party votes combined were under 5%. In other words, this Key actually went in favor of the incumbent party and, sure enough, Clinton won the popular vote.

Admittedly, calling this particular Key true or false involved polling data on the third parties, but the Key itself was developed with empirical data and the 13 empirically developed Keys once again predicted correctly. The worst you can say about Lichtman's prediction is that he called one Key wrong, but the Keys themselves predicted correctly.

That's far from "the death of data".