Data Matters: 2014

Wednesday, September 3, 2014

You can't have it both ways

Take a look at the third "weight loss myth" on this list. It says that research (in other words, actual data) shows that there's no difference in BMI for students with and without regular PE classes. In other words, there is no evidence that PE classes help reduce childhood obesity.

Then it says "Anything that gets children moving is a step in the right direction, however...".

That's a contradiction. You can't have it both ways. Either there's evidence that PE classes help or there isn't. You can't say research shows that PE has no effect on weight but PE classes will help anyway.

If you read a lot of news about various studies you'll see that it's a common error. When people really want to believe that X works they have a hard time accepting evidence that doesn't support their belief. You will find statements such as "There's no evidence that X helps Y but based on our theories/logic, we believe X should help Y so we should keep doing X".

The other common response is "The evidence shows no link between X and Y but that just means we need even more X". In other words, "My belief that X works is still correct, we can ignore the data because we were wrong about how much X we needed".

That response also shows up in the article linked: "..researchers said that PE classes were falling short and suggested a curriculum where children get more activity than classes currently offer".

Thursday, August 28, 2014

This is Statistics

The American Statistical Association created an interesting web site about the world of statistics and careers in statistics. Neat stuff.

Video games versus movies

This is an interesting graph comparing costs and revenues of video games and movies, but it's restricted to the "most" expensive and "highest" revenue. Therefore there's only one item (Grand Theft Auto V) on both lists. I'd like to see the costs and revenues for all 19 items that appear across both lists.

Probability in the Real World

Your tax dollars at work!**

NPR did an interesting series on applications of probability. The longest is about 9 minutes and the shortest is just a couple minutes. Well worth some of your time.

**Well, sort of your tax dollars. It depends on your definition of "tax dollars". I've read claims that as little as 2% and as much as 25% of NPR's funding comes from government sources.

Wednesday, August 27, 2014

Retirement saving crisis?

Recent news carried a story about the lack of retirement savings in the US. The USA Today article referred to a study by BankRate.Com (more on that study can be found here and here).

The opening line is quite alarming: "A third of people (36%) in the U.S. have nothing saved for retirement, a new survey shows". At first glance that looks bad, but 23.3% of the US population is under 18, so you wouldn't expect them to have retirement savings.

Therefore I dug a little deeper. The original study was not all people, it was adults only. OK, that makes more sense but still left me wondering how bad things really were.

Since 23.3% of the US is under 18, that means 76.7% of the 316 million residents in the US are adults. That's roughly 243 million. Thirty-six percent of that group is about 88 million.

Therefore, there are about 88 million adults with no retirement savings at all*.
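For anyone who wants to recheck this arithmetic, here is a minimal Python sketch. The function and variable names are my own, and the inputs are just the rounded figures quoted in this post (316 million residents, 23.3% under 18, 36% of adults with no savings), not a fresh data pull.

```python
# Inputs are the rounded figures quoted in the post, not official data pulls.
US_POPULATION_MILLIONS = 316.0
SHARE_UNDER_18 = 0.233
SHARE_ADULTS_NO_SAVINGS = 0.36  # survey point estimate; margin of error ignored


def adults_without_savings(population, share_under_18, share_no_savings):
    """Return (adults, adults with no retirement savings), both in millions."""
    adults = population * (1.0 - share_under_18)
    return adults, adults * share_no_savings


adults, no_savings = adults_without_savings(
    US_POPULATION_MILLIONS, SHARE_UNDER_18, SHARE_ADULTS_NO_SAVINGS
)
print(f"Adults: about {adults:.0f} million")
print(f"Adults with no retirement savings: about {no_savings:.0f} million")
```

Unrounded, the results land within about a million of the figures quoted above, which is close enough for this argument.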

But how bad is that? Some adults don't work. Some are the non-working spouse in a single-income household and 14 million are on disability. There are 45 million people (14.1% of the total population) 65 and older, most of whom no longer work. There are probably other non-working groups.

Some of these non-working groups may have saved for retirement before they left the work force (of the 65+, only 14% or about 6.25 million have no retirement savings**) but once they stop working, no one should expect them to save for retirement.

Unfortunately, all these numbers don't help much in determining the size of the retirement savings crisis. If 88 million out of 243 million adults have no retirement savings, then 155 million adults do have retirement savings. Compared to a US Labor Force of 156 million people, this doesn't look that bad but we can't assume that these are the same people.

We need to subtract from 155 million all the adults with retirement savings who are not in the Labor Force. Good luck finding that number. For the 65+ nearly 39 million of them have retirement savings but, again, we can't assume that all of them have left the Labor Force.

For now, the best I can say is that tens of millions of the 155 million adults with retirement savings are not in the labor force. Is that 20 million? 40 million? 60 million? I don't know, so let's play with a range.

If it's 20 million, then 135 million adults in the Labor Force (86.5%) have retirement savings. On the other hand, if it's 60 million, then only 95 million adults in the Labor Force (60.9%) have retirement savings.

To flip it to the original headline, I'm estimating that somewhere between 13.5% and 39.1% of the Labor Force has no retirement savings. One end of that range is bad news, but not quite a "crisis". The other end is a disaster.
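The range can be recomputed for any guess with a short Python sketch. The names are mine, and the key assumption is the one made in the text: every adult with savings who is not outside the labor force is counted as being in it.

```python
ADULTS_WITH_SAVINGS = 155  # millions, from the 243 - 88 estimate in the post
LABOR_FORCE = 156          # millions, US Labor Force figure quoted in the post


def share_without_savings(savers_outside_labor_force):
    """Percent of the labor force with no retirement savings, for one guess.

    savers_outside_labor_force: guessed millions of adults who have
    retirement savings but are not in the labor force.
    """
    savers_in_labor_force = ADULTS_WITH_SAVINGS - savers_outside_labor_force
    return 100.0 * (1.0 - savers_in_labor_force / LABOR_FORCE)


for guess in (20, 40, 60):
    print(f"{guess} million savers outside the labor force -> "
          f"{share_without_savings(guess):.1f}% of the labor force has no savings")
```

Trying other guesses shows how sensitive the headline number is to a quantity nobody seems to publish.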

Too bad we don't know which end we're really on. Sometimes statistics leaves more questions than answers.

--------------------------------------------
*The original study had a 3.5% margin of error on the 36% claim. I'm intentionally ignoring the margin of error and using the 36% point estimate in an effort to keep this simpler. The numbers would change by several million people but the overall logic would remain the same, and it's the logic that I'm trying to demonstrate.

**I can't find it on BankRate's site, but another article about the survey claims that 14% of those 65 and older have no retirement savings.

Friday, July 18, 2014

Sometimes a single variable says a lot

Previously, I wrote a post that questioned the value of a single variable. I haven't changed my mind, but I admit that a single variable can sometimes be interesting.

I took a random sample of 8,000 records from the Cash For Clunkers program. Below is a table of the odometer readings from the trade-in vehicles.

Look at the frequencies. The vast majority are one or two (or zero). Therefore it's not believable that 34 vehicles would have exactly 99,999 miles on the odometer or that 12 would have exactly 100,000 miles. I'm not even going to get into the 17 vehicles with zero miles. If you look at the full data set you'll also find this pattern repeated for any large sample. You'll also find an unusually large number of cars with 1,000,000 and 10,000,000 miles on the odometer!
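A frequency check like this takes only a few lines of Python. The sketch below is an illustration, not my actual analysis: the readings in `sample` are made up, and the `min_count` threshold is an arbitrary assumption you would tune to the size of your own sample.

```python
from collections import Counter


def suspicious_readings(odometer_readings, min_count=3):
    """Return readings that repeat at least `min_count` times, with counts."""
    counts = Counter(odometer_readings)
    return {reading: n for reading, n in counts.items() if n >= min_count}


# Made-up readings for illustration only; not taken from the real data set.
sample = [99999, 84512, 99999, 100000, 120331, 99999, 0, 100000, 100000, 76204]
print(suspicious_readings(sample))
```

With real data, most exact readings should appear once or twice, so anything that clears the threshold deserves a second look.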

Just one variable. Yet it tells us a lot. Clearly, the odometer reading data isn't accurate. Since the odometer readings came from the same source as the other variables (the selling dealerships), this single table for a single variable allows us to question the accuracy of all the variables.

This doesn't mean that the data is useless. Nearly every interesting data set has inaccuracies. This just reminds us to be wary about blindly accepting data.

The first question that came to my mind upon seeing this table was: fraud or sloppy? Was data intentionally cooked or were people just too lazy to try to get it right?

I can't answer that, but introducing a second variable can still be instructive.

I've examined the full data set for several states and found that the 99,999 and 100,000-mile odometer readings tend to cluster with just a few dealerships. We still don't know if it's fraud or sloppy, but I can see who did it.

I won't name dealerships here, but you can get the full dataset (it's public data) and do your own analysis.

One variable is never(?) enough

Well, "never" is a strong word but I can't think of many situations where a single variable tells us much (maybe I'll post one example later). Sometimes, even two variables aren't enough.

Consider this article on Millennial employment. It has several "yes, but..." situations. The article says "look at the relationship between two variables" and then goes on to say "yes, but when you add other variables you get different relationships."

Look - there's a gender gap in pay! That's two variables: gender and pay. Yes, but ... when you add in major, the pay gap reduces significantly. When you add in taking time out for raising children, the pay gap reduces even further and reverses in some fields.

Look - graduates of for-profit schools make more money than graduates of not-for-profit schools! Again, two variables: school type and pay. Yes, but ... when you add in age and work experience, you find that younger, inexperienced for-profit graduates make less.

There's another on race and unemployment (I'll blow the surprise - Asians have the highest unemployment rate).

This article is a great example of the difficulty in attributing cause to relationships in two variables. It makes me wonder why introductory statistics classes focus so heavily on single variable methods. Most students take, at most, one statistics course. Maybe that course should spend more time analyzing multivariate relationships.

Monday, July 14, 2014

Alcohol Deaths

A recent CDC study claims that alcohol consumption is killing "1 in 10 adults" and is the "leading cause of preventable death", based on 88,000 deaths in working-age adults (20 to 64) between 2006 and 2010. These deaths were "due to health effects from drinking too much over time, such as breast cancer, liver disease, and heart disease, and health effects from consuming a large amount of alcohol in a short period of time, such as violence, alcohol poisoning, and motor vehicle crashes".

I don't think anyone is surprised that "drinking too much" is bad for you. However, before we revisit prohibition, maybe we should look at the CDC study more carefully.

The numbers are based on "Alcohol-Related Disease Impact (ARDI)" data, which provide nationwide and state counts of alcohol-attributable deaths. This concept has been around since at least the early 1990s. Look at the list of causes in the quote. Breast cancer? Heart disease? Are they really claiming that breast cancer is caused by booze?

Not exactly. The second link doesn't have all the details but it explains that they try to separate deaths completely caused by alcohol from those sometimes caused by alcohol and adjust the data accordingly. There's something called an "alcohol-attributable fraction (AAF)". For example, they take the total number of breast cancer deaths and multiply by the AAF to estimate the number of alcohol-attributable breast cancer deaths. The concept isn't necessarily bad, but like all statistics, the AAFs are estimates and aren't estimated the same way for each cause of death. In other words the 88,000 number is an estimate based on estimates of estimates but they don't provide a margin of error on their estimate.
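To make the AAF idea concrete, here is a small Python sketch of the multiply-and-add step. Everything in it is hypothetical: the cause names, death counts, and fractions are placeholders I chose for illustration, not CDC values.

```python
def attributable_deaths(total_deaths, aaf):
    """Estimated alcohol-attributable deaths for one cause of death."""
    return total_deaths * aaf


# Placeholder causes as (total deaths, AAF); none of these are real figures.
hypothetical_causes = {
    "cause A (entirely due to alcohol)": (1000, 1.0),
    "cause B (partly due to alcohol)": (8000, 0.25),
    "cause C (rarely due to alcohol)": (16000, 0.125),
}

total = 0.0
for cause, (deaths, aaf) in hypothetical_causes.items():
    estimate = attributable_deaths(deaths, aaf)
    total += estimate
    print(f"{cause}: {estimate:.0f} attributable deaths")

print(f"Aggregate estimate: {total:.0f}")
```

Each AAF is itself an estimate, so the aggregate inherits the uncertainty of every row, which is why a margin of error on the total would be helpful.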

That doesn't mean that the basic claim is wrong. Excessive alcohol consumption is a serious problem. However, instead of aggregating a bunch of estimates into one big scary number that isn't necessarily accurate, I'd prefer a table showing what alcohol does. Put the cause of death in Column A and an explanation in Column B. Put "Breast cancer" in A and in B explain how excessive alcohol consumption increases the risk of incurring breast cancer and dying from breast cancer. Do the same with "Heart disease", "Liver disease", etc.

Why don't they do that? I can think of two possibilities:
1) They think most of us are too stupid to understand a table. Sadly, they could be right.
2) Many of these causes of death have small AAFs (in other words, few alcohol-related deaths) so they wouldn't be impressive listed individually. However, added up, they help make the one, big scary number even bigger.

Note: The first link, the most recent study, says that alcohol is the "leading cause of preventable death." The second link is 10 years older and alcohol was merely the "third leading preventable cause of death". Does that mean that alcohol deaths increased relative to population size or other deaths decreased? Maybe someday I'll chase the data and find out.

Wednesday, June 11, 2014

Statistics Makes Liars And...

Liars Make Statistics.

I found this tidbit on the Freakonomics blog. By adding illegal economic activities (or at least estimates of them) to the GDP, Italy will increase their GDP and, therefore, make their debt a smaller percent of GDP. It's magic! Without reducing real debt or growing the actual economy they will instantly improve their debt-to-GDP ratio. If Italy gets away with this I expect other countries to follow suit.
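The mechanics are simple enough to show in a few lines of Python. This is only a sketch with hypothetical round numbers (they are not Italy's actual debt, GDP, or illegal-activity estimates); the point is that the ratio falls even though the debt never changes.

```python
def debt_to_gdp(debt, gdp):
    """Debt as a percentage of GDP."""
    return 100.0 * debt / gdp


# Hypothetical figures in billions, chosen only to illustrate the effect.
debt = 1200.0
measured_gdp = 1000.0
estimated_illegal_activity = 50.0  # an assumed estimate, not real data

before = debt_to_gdp(debt, measured_gdp)
after = debt_to_gdp(debt, measured_gdp + estimated_illegal_activity)
print(f"Debt-to-GDP before: {before:.1f}%")
print(f"Debt-to-GDP after:  {after:.1f}%")
```

The larger the estimate added to the denominator, the better the ratio looks.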

Imagine what incentives this provides for a government. As long as the activities remain illegal you don't have clear data on their economic impact. You have only estimates. Therefore, you have an incentive to:

1) Estimate higher rather than lower. I've already posted about how political incentives to report lower crime rates can create havoc with government data. Adding illegal activities to GDP would reverse that incentive for some crimes.
2) Don't legalize currently illegal activities. Once an activity is legal, there's much better data on its economic impact. Nevada has much more accurate data on prostitution's economic impact than California does. Therefore, California can more easily get away with overestimating prostitution in their state (see point 1). Soon, California and Colorado will have more accurate data on the marijuana economy than Illinois and New York.
3) Make more activities illegal. If you want to boost your GDP by using high estimates of illegal economic activities (point 1) and you accept that legal activities can't easily be over-estimated (point 2), then there's an incentive to make MORE things illegal. Perhaps you'd want to drive the alcohol and tobacco industries underground and exaggerate the size of those markets.

Imagine trying this in your personal life. Suppose that you have a low-income job but a thriving, unreported side income (eBay, Craigslist, off-the-books handyman service, etc.). You want to buy a house but your income is too small to qualify for a mortgage. Wouldn't it be nice if you could tell the bank to boost the reported income by $20,000 for unspecified activities without the IRS or local police getting involved? I guess governments can do things that individuals can't.

Tuesday, May 27, 2014

I just love contradictory news...

According to this article, non-college graduates are making income gains relative to college graduates. As a college professor, I see that as bad news for my profession. In a perfect world, people would want to "become educated" just for the sake of being educated. However, in the real world, most people go to college in hopes of increasing their income. Any research that says college isn't paying off is bad news for colleges.

Therefore, I'm really happy to see this article because it claims that, not only do the college-educated make more money than non-college graduates, the wage gap is growing in favor of the college-educated.

What should we conclude about these contradictory claims? Can they both be correct?

Actually, they can both be correct (and they can both be wrong).

First of all, the first link is about Canada and the claims are based on Canadian data while the second article is about the US. There could be differences in US and Canadian labor markets that narrow the college/non-college wage gap on one side of the border and increase it on the other.

Second, one needs to look at time frames, ages of the workers being considered, means versus medians, etc. to make sure that both claims are really about the same thing. I didn't take time to chase down details on the Canadian article, but I did a little digging on the US article. It refers to a study by the Economic Policy Institute in Washington DC. I found the EPI but could not find this particular study. Instead, I found the following studies:

August 2011: College graduates losing ground on wages. Of course, that was right in the midst of a terrible recession.
May 2014: Too many college graduates can't find job. This is much more recent.

These two articles seem to agree with the Canadian claim rather than the US claim, and this was the agency referenced for the US claim! Weird.

To make it more confusing, this article is about an MIT economist who says that, at a median income level, college graduates are doing much better than non-college graduates. This article also refers to the EPI study that I can't find.

More contradictory claims, possibly from the same research group. I leave it to readers to do their own research and figure out what's going on, but here are some things to consider:

The current college/non-college income gap between 25-year-olds is a completely different question than the current college/non-college income gap between 50-year-olds or the lifetime earnings gap at any age.
Historical data on income gaps (current gap or lifetime gap for any age group) are different questions than predicted income gaps.

Read some of the links and see if you can figure out what data and what questions these "contradictory" claims are addressing.

Thursday, May 1, 2014

What's your message?

Good post over at HBR. I've certainly made errors in this regard.

Analyzing data requires a lot of work that no one other than the analyst should ever see. In order to get a handle on the data you might create many exploratory graphs and compute multiple statistics. After doing all that work, it's understandable that you want to show it to someone. But first ask yourself "why?".

Why use a particular graph? Why a particular computation? Presentation time is when you cut everything down to your essential message and conclusions. Showing extra stuff will just confuse your audience and obscure your message.

Imagine you're at a movie and after every scene they stop the plot to explain how they selected the camera angles and lighting. Then they show you all the footage they edited out and explain why they didn't need that footage in the final film. It wouldn't make any sense. The few people who care buy the DVD with the "director's extras". The rest of us just want to see the movie.

It's the same with data. Determine your message and present just that message as clearly and simply as possible.

Another HBR post covering the same issue.

Tuesday, April 15, 2014

It's enough to make a statistician cry

As mentioned before, there is no such thing as perfect data. Political meddling just makes it worse. This isn't a short story about crime statistics, but it's a worthwhile read for someone trying to understand government data sources.

Friday, April 11, 2014

Census Data Advice

I got this link from Flowing Data. Good stuff on using the Census Bureau's survey results.

Tuesday, April 8, 2014

Difficulty with data

In most introductory statistics classes, data magically appears (in the correct electronic format!) with no discussion about its source or validity. Unintentional or not, this gives students the idea that they should accept data at face value. I frequently tell my students that number-crunching is simple compared to getting good data and that most statistical disagreements are about the data rather than the calculations.

They listen and some of them sort of get it, but the steady stream of magic data they receive can override anything I say. Therefore, I've started scattering examples of imperfect data throughout the class. By "imperfect", I don't mean "mistake". Instead, I want to demonstrate how hard some things are to measure or classify and, therefore, that no data set is perfect (at least not any interesting data set).

The 2000 Presidential election was a good example of how difficult it can be to simply determine whether a vote is for Candidate A, Candidate B, or no one but that's ancient history for today's students.

Since many of my students are athletes, I've covered controversial sports calls. There are reams of historical sports data available. We rarely question that data in spite of the fact that we often argue during the game while the data is being created.

"That's a charge, not a foul!" "What do you mean ball? That was clearly a strike!!"

Regardless of the fans' preferences or the rule book's definitions, in sports the official data is whatever call the officials make at the time.

Charge or a foul? It's whatever the official says it is.

Ball or a strike? It's whatever the official says it is.

Packer fans won't forget this call for many, many years. Nearly everyone said that it was not a Seattle touchdown. Unfortunately for the Packers, the operational definition of a touchdown has nothing to do with what "nearly everyone" says. A touchdown is whatever the officials say it is and NFL data will classify this as a touchdown and the game as a Packer loss forever.

Wednesday, April 2, 2014

Nate Silver wasn't "wrong" about the 2012 congressional election.

Nate Silver has been in the news lately for his prediction about the 2014 congressional election. He says that the Republicans have a 60% chance of taking over the Senate. Several Democrats have reacted to the prediction by pointing out that Silver made a similar prediction in 2012 (a 61% chance) and he was "wrong" because it didn't happen. Therefore, we shouldn't trust his prediction this time.

That's a complete misunderstanding of Silver's prediction. He never said that the Republicans would win. He merely said that the probability they would win was larger than the probability that they would not win.

For example, suppose I hand you a bag with 10 poker chips in it and I tell you that four of the chips are blue and six are red. You can't see inside of the bag. You only know what I've told you and, based on my statement, there is a 60% chance that a randomly selected chip will be red.

You shake the bag to mix up the chips, reach in, and pull out a chip. If the chip is blue would you say that I was "wrong" when I told you that six of the ten were red? Of course not, there was a 40% probability that the chip would be blue.

What if the Republicans don't win this year either? Would that make Silver wrong?

Let's go back to the bag. If there are six red chips and four blue chips, what's the probability that you'd pull out a blue chip twice in a row (assuming that you put the first chip back)?

It's (0.40)(0.40) = 0.16. A sixteen percent chance isn't exceptionally large, but it's not tiny either.
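The same calculation in Python, as a minimal sketch. The function name is mine, and it assumes what the example assumes: the chip goes back in the bag each time, so the draws are independent.

```python
from fractions import Fraction

P_BLUE = Fraction(4, 10)  # four blue chips out of ten


def prob_all_blue(draws, p_blue=P_BLUE):
    """Probability that every one of `draws` independent draws is blue."""
    return p_blue ** draws


print(float(prob_all_blue(1)))  # one blue chip
print(float(prob_all_blue(2)))  # two blue chips in a row
```

Using `Fraction` keeps the arithmetic exact until the final conversion for printing.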

You can't say that Silver is "wrong" based on his specific prediction. If a knowledgeable statistician wants to go back through Silver's process in detail to look at where and how he got his data and how he analyzed it, it's entirely possible that they would disagree with something. Even that wouldn't necessarily make Silver wrong. There are legitimate disagreements in the statistical world on how to obtain and analyze data. There might be actual mistakes in someone's process, but disagreements in the discipline aren't mistakes.

In this case, it's more likely an example of confirmation bias. Perhaps the greatest barrier to effective use of data in any organization is getting past our tendency to accept data that confirms our predetermined biases and reject data that contradicts them. That's not use of statistics, that's abuse of statistics.

Wednesday, March 12, 2014

Maybe Cheeseburgers Aren't so Bad?

After piles of national news headlines about how "bad" high-protein diets are, a little more truth comes out.

Debunked: Cheeseburger as bad as Smoking

The lesson? Never confuse observational studies with experiments (and journalists need to understand statistics better).

How many shoot up? No one knows.

It's very difficult to get good data on many behavioral issues. Here's a great example.

How Many Daily Heroin Users Are There in the U.S.? Somewhere Between 60,000 and 1 Million. Maybe.
