Friday, December 2, 2016

Two Data Questions You Need to Ask

The Harvard Business Review article by Michael Li is probably behind a pay wall but it's a great, short note on two important questions to ask a data analyst. Maybe I just like it because I agree with him - it's something I constantly tell my students - but I think it's good advice for anyone. If you have access to HBR, then stop reading this blog post. Go read the article and judge it for yourself.

If you don't have HBR access, here's an even shorter version.

1) How was the data collected? I tell my students that statistical calculations, even complicated ones, are fairly straightforward and no one can really argue with your calculations. All of the debate in statistical analysis is about the data.

2) What's the margin of error? If you understand the difference between a point estimate and a population parameter, then you know that people regularly present point estimates as "the" number. That's a mistake. A point estimate is just an estimate. Therefore, the uncertainty behind it needs to be openly acknowledged with a margin of error.
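To make the second question concrete, here is a minimal sketch of a margin of error for a sample proportion. It assumes a simple random sample and the usual normal approximation at roughly 95% confidence (z = 1.96). The function name and the poll numbers are hypothetical and are not taken from Dr. Li's article.

```python
import math


def margin_of_error(p_hat: float, n: int, z: float = 1.96) -> float:
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample and the normal approximation;
    z = 1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must be between 0 and 1")
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)


# Hypothetical poll: 52% support among 1,000 respondents.
p_hat = 0.52
n = 1000
moe = margin_of_error(p_hat, n)
low, high = p_hat - moe, p_hat + moe
print(f"{p_hat:.1%} +/- {moe:.1%} (roughly {low:.1%} to {high:.1%})")
```

In this made-up case the margin of error is about 3 percentage points, so the interval includes 50%. Reporting "52%" as "the" number would hide that.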

Don't take my word for it. Go read Dr. Li's article.

Wednesday, November 30, 2016

About that popular vote ... (more on the 2016 election)

I recently wrote about the pre-election polling for this year's Presidential election. Now it's time to comment on post-election posturing.

This is the second time in my life that the Electoral College winner was not the popular vote winner. It's no surprise that many, especially on the losing side, are calling for the elimination of the Electoral College. I'm not going to get into that debate. There are plenty of opinion pieces out there.

Instead I want to address the data implications. Nearly everyone seems to assume that without the Electoral College we would have had President Gore and would now have President-Elect Clinton. A thought provoking column by Aaron Blake points out that we can't make that assumption.

I strongly encourage you to read it but I'll summarize some major points here. Presidential campaign strategies are based on the existence of the Electoral College. Whether it's advertising, candidate appearances, or the "ground game", everything is done in order to win the College, not the popular vote.

Strategies would be very different if candidates were trying to win the popular vote. In particular, non-swing states would get a lot more attention. States like California (blue) or Texas (red) are ignored because 100% of their Electoral votes (55 and 38) consistently go to the same party (Democratic and Republican, respectively). These non-swing states tend to have lower voter turnout than swing states.

Under a popular vote campaign strategy, states like these would get a lot more attention. The Democrats would almost certainly still win California but not 100% of California. Would voter turnout increase? If so, how much? Instead of a 60/40 vote in favor of Democrats, would it be 55/45? Would small changes in high population states be enough to swing the national popular vote? Keep in mind that while Republicans were trying to shift the California vote, Democrats would be doing the same thing in Texas.

I'm not saying that Trump (or Bush II) would have won the popular vote. I'm saying that we don't know. When the popular vote under an Electoral College system is won by just a couple percent, we simply have no idea what the outcome would be without an Electoral College.  

This brings me back to my previous post. The 13 Keys model predicts only the popular vote but it was developed using historical popular vote outcomes under an Electoral College system. That makes me wonder whether or not elimination of the College would change the predictive power of those Keys.

Friday, November 18, 2016

The 2016 Presidential Election was NOT the Death of Data

Now that the presidential election is a couple of weeks in our rear-view mirror, I thought it a good time to comment on some data aspects of the election. Driving to work the morning after the election, I listened to a national program announce that the election results signaled "the death of data" because the polls had been so terribly wrong. Others have written about the data's failure.

That's simply not true. First, some polls did a pretty good job predicting the outcome. Whether or not those polls should have been considered outliers will be debated in the polling community for many years.

However, I want to focus on another issue. For this discussion, let's concede that "most of the polls were wrong". That's still not the death of data. At worst, it's the death of survey data.

For decades, social scientists have known that self-reported activities often don't match behaviors. Perhaps the most famous example is the Tucson Garbage Project. Political pollsters, and those who report on them, should always keep in mind that they're merely getting people's statements about what they are going to do. They are not getting any information on actual actions. In other words, they don't have empirical data.

How would you use empirical data to predict a presidential outcome? Just ask Allan Lichtman. Lichtman's "13 Keys to the White House" model has correctly predicted every presidential election since 1984 and it doesn't use polling data. Instead, it was created by looking at historical voting data (in other words, empirical data) from 1860 through 1980. There are 13 true/false indicators. If enough of them come up "against", then it predicts the incumbent party in the White House will lose the popular vote to the challenging party. It says nothing about the electoral college.

That distinction between popular vote and electoral college is very important. Lichtman's model correctly predicted Gore's popular vote win in 2000. It did not predict Bush's electoral college victory because it doesn't predict anything about the electoral college. It can't be considered either right or wrong on the electoral college.

Sometimes, the 13 Keys are clear fairly early in the election cycle. At least once, Lichtman made his prediction two years ahead of the election. This time around, his prediction was later in the cycle and he predicted a Trump victory. After the election, he was hailed by some as being one of just a few who predicted correctly.
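The decision rule described above is just a count, which a short sketch makes clear. This is only an illustration, not Lichtman's own code. The function name is hypothetical, and the threshold of six "against" keys is an assumption based on how the model is commonly described; this post only says "enough of them".

```python
from typing import Sequence


def predicts_incumbent_loss(keys_against: Sequence[bool], threshold: int = 6) -> bool:
    """Return True if the count of keys "against" the incumbent party
    reaches the threshold.

    keys_against holds one boolean per Key (True means "against").
    threshold=6 is an assumption, not something stated in this post.
    """
    if len(keys_against) != 13:
        raise ValueError("expected exactly 13 keys")
    return sum(keys_against) >= threshold


# Hypothetical scorecard: six keys against the incumbent party.
six_against = [True] * 6 + [False] * 7
print(predicts_incumbent_loss(six_against))   # incumbent party predicted to lose

# The same scorecard with a single key flipped back in favor.
five_against = [True] * 5 + [False] * 8
print(predicts_incumbent_loss(five_against))  # incumbent party predicted to win
```

Because the rule is a simple count against a cutoff, calling a single Key differently can flip the prediction when the tally sits right at the threshold.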

Now if you've been paying attention up to this point, you might say "Wait a minute. You said that Lichtman predicts the popular vote, not the electoral college and Clinton won the popular vote. Therefore he was wrong!".

On the surface it would appear that way. However, the final Key that turned "against" the incumbent party involved third party candidates. A significant third party vote was a signal against the incumbents. When Lichtman made his prediction, it looked very much like Gary Johnson would be a significant third party influence. In the end though, all the third party votes combined were under 5%. In other words, this Key actually went in favor of the incumbent party and, sure enough, Clinton won the popular vote.

Admittedly, calling this particular Key true or false involved polling data on the third parties, but the Key itself was developed with empirical data and the 13 empirically developed Keys once again predicted correctly. The worst you can say about Lichtman's prediction is that he predicted a Key wrong but the Keys themselves predicted correctly.

That's far from "the death of data".

Saturday, March 12, 2016

Cheating Age or Aging Cheaters

This article is a bit over a year old but I ran across it today and was struck by the easy acceptance of the conclusions.

Since it's behind a pay wall, here's a short summary: A study was done using data from the Ashley-Madison dating web site. For those who missed last year's headlines, this site is for people already in a committed relationship who want to find someone to cheat with. Researchers looked at the ages of registered users and found a statistically significant bump in the distribution for ages ending in "9".

The conclusion is that people are more likely to cheat when they are approaching a milestone birthday (30, 40, 50, etc.). They get stressed about aging and reach out for a little excitement.

That's certainly a possible explanation. However, I think they overlook another, simpler explanation: people lie about their age. If you're anywhere in the 40 to 45 range and you want to claim you're in your 30s, then 39 is more believable than 38, 37, 36, etc.

I think this explanation is at least as likely as the first. Keep in mind, everyone on this web site is trying to CHEAT. They're all dishonest. Why wouldn't they lie about their age? 

Thursday, March 10, 2016

Data matters - but only if you accept it.

In my classes I regularly tell students that statistical controversy is rarely about the number crunching, it's about the data.

I recently posted a link to an article about political polling. Summary? The data was wrong so the predictions were wrong.

However, there's another data problem and, perhaps, it's the most common problem. What should you do when you think something is true but you have data that contradicts it? Statistically, you're supposed to go with the data and change what you thought was true.

People who know statistics might recognize this situation. You have a null hypothesis, Ho, and the data leads to a small p-value. This means you should reject Ho in favor of the alternative hypothesis, Ha.
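As a rough sketch of that logic, here is a two-sided one-proportion z-test using only the standard library. The function name, the sample counts, and the 0.05 cutoff are all assumptions chosen for illustration; they are not from any study mentioned here.

```python
import math


def two_sided_p_value(successes: int, n: int, p0: float) -> float:
    """Two-sided p-value for Ho: true proportion == p0.

    Uses the normal approximation, so it is only reasonable for large n.
    """
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1.0 - p0) / n)
    z = (p_hat - p0) / standard_error
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


alpha = 0.05  # conventional cutoff, an assumption here

# Hypothetical data: 560 "yes" answers out of 1,000, with Ho: p = 0.5.
p_value = two_sided_p_value(560, 1000, 0.5)
reject_null = p_value < alpha
print(f"p-value = {p_value:.5f}, reject Ho: {reject_null}")
```

With these made-up numbers the p-value is far below 0.05, so the data says to give up Ho, however strongly you believed it beforehand.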

Unfortunately people tend to fall the other way. If the data contradicts their beliefs, they reject the data. This is true even among those who have scientific and statistical training.

The People's Pharmacy recently wrote about a new study that "exonerates" eggs. This is one of many studies that contradict the old idea that eggs are unhealthy because of dietary cholesterol. All the data shows that eggs are healthy. Yet the article states:

"It’s hard to teach old dogs new tricks. Many health professionals will find it challenging to accept the new data from the Finnish Heart Study. But the writing has been on the wall for quite a few years that the evidence supporting dietary cholesterol as the culprit behind heart disease was weak."

Health professionals, people with significant scientific training, will stick with their old beliefs and reject the data.

The limits and challenges of public opinion polling

With the Presidential primaries in full swing, the polls were pretty far off in Michigan.

This article provides some possible explanation. It also talks about the difficulty of getting proper samples and the importance of asking the right questions.