Tuesday, April 15, 2014

It's enough to make a statistician cry

As mentioned before, there is no such thing as perfect data. Political meddling just makes it worse.  This isn't a short story about crime statistics, but it's a worthwhile read for someone trying to understand government data sources.

Tuesday, April 8, 2014

Difficulty with data

In most introductory statistics classes, data magically appears (in the correct electronic format!) with no discussion of its source or validity.  Unintentional or not, this gives students the idea that they should accept data at face value. I frequently tell my students that number-crunching is simple compared to getting good data and that most statistical disagreements are about the data rather than the calculations.

They listen and some of them sort of get it, but the steady stream of magic data they receive can override anything I say. Therefore, I've started scattering examples of imperfect data throughout the class. By "imperfect", I don't mean "mistake". Instead, I want to demonstrate how hard some things are to measure or classify and, therefore, that no data set is perfect (at least not any interesting data set).

The 2000 Presidential election was a good example of how difficult it can be to simply determine whether a vote is for Candidate A, Candidate B, or no one, but that's ancient history for today's students.

Since many of my students are athletes, I'm now including controversial sports calls. There are reams of historical sports data available.  We rarely question that data, even though we often argue about the calls during the game, while the data is being created.

"That's a charge, not a foul!" "What do you mean ball? That was clearly a strike!!"

Regardless of the fans' preferences or the rule book's definitions, in sports the official data is whatever call the officials make at the time.

Charge or foul?  It's whatever the official says it is.
Ball or strike? It's whatever the official says it is.

Packer fans won't forget this call for many, many years. Nearly everyone said that it was not a Seattle touchdown. Unfortunately for the Packers, the operational definition of a touchdown has nothing to do with what "nearly everyone" says. A touchdown is whatever the officials say it is and NFL data will classify this as a touchdown and the game as a Packer loss forever.




Wednesday, April 2, 2014

Nate Silver wasn't "wrong" about the 2012 congressional election.


Nate Silver has been in the news lately for his prediction about the 2014 congressional election. He says that the Republicans have a 60% chance of taking over the Senate. Several Democrats have reacted to the prediction by pointing out that Silver made a similar prediction in 2012 (a 61% chance) and he was "wrong" because it didn't happen. Therefore, we shouldn't trust his prediction this time.

That's a complete misunderstanding of Silver's prediction. He never said that the Republicans would win. He merely said that the probability they would win was larger than the probability that they would not win.

For example, suppose I hand you a bag with 10 poker chips in it and I tell you that four of the chips are blue and six are red. You can't see inside of the bag. You only know what I've told you and, based on my statement, there is a 60% chance that a randomly selected chip will be red.

You shake the bag to mix up the chips, reach in, and pull out a chip. If the chip is blue, would you say that I was "wrong" when I told you that six of the ten were red?  Of course not; there was a 40% probability that the chip would be blue.

What if the Republicans don't win this year either? Would that make Silver wrong?

Let's go back to the bag. If there are six red chips and four blue chips, what's the probability that you'd pull out a blue chip twice in a row (assuming that you put the first chip back)?

It's (0.40)(0.40) = 0.16.  A sixteen percent chance isn't exceptionally large, but it's not tiny either.
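For readers who'd rather see it than compute it, the two-draws-in-a-row figure can be checked with a quick simulation. This is just a sketch in Python; the bag of six red and four blue chips is the one from the example above, and the variable names are my own.

```python
import random

# A bag with 6 red chips and 4 blue chips, as in the example.
bag = ["red"] * 6 + ["blue"] * 4

random.seed(1)  # fixed seed so the estimate is reproducible
trials = 100_000

# Draw two chips with replacement each trial; count how often both are blue.
both_blue = sum(
    random.choice(bag) == "blue" and random.choice(bag) == "blue"
    for _ in range(trials)
)

print(f"Estimated P(two blue draws): {both_blue / trials:.3f}")
```

Over many trials the estimate settles near (0.40)(0.40) = 0.16, which is the point: an outcome with a 16% chance is hardly proof that the stated probabilities were "wrong."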

You can't say that Silver is "wrong" based on his specific prediction. If a knowledgeable statistician wants to go back through Silver's process in detail to look at where and how he got his data and how he analyzed it, it's entirely possible that they would disagree with something.  Even that wouldn't necessarily make Silver wrong. There are legitimate disagreements in the statistical world on how to obtain and analyze data. There might be actual mistakes in someone's process, but disagreements in the discipline aren't mistakes.

In this case, it's more likely an example of confirmation bias. Perhaps the greatest barrier to effective use of data in any organization is getting past our tendency to accept data that confirms our predetermined biases and reject data that contradicts them. That's not use of statistics; that's abuse of statistics.