Friday, July 18, 2014

Sometimes a single variable says a lot

Previously, I wrote a post that questioned the value of a single variable. I haven't changed my mind, but I admit that a single variable can sometimes be interesting.

I took a random sample of 8,000 records from the Cash For Clunkers program.  Below is a table of the odometer readings from the trade-in vehicles.


Look at the frequencies. The vast majority of readings appear only once or twice (or not at all). So it isn't believable that 34 vehicles would have exactly 99,999 miles on the odometer, or that 12 would have exactly 100,000 miles. I'm not even going to get into the 17 vehicles with zero miles. If you look at the full data set, you'll find this pattern repeated in any large sample. You'll also find an unusually large number of cars with 1,000,000 and 10,000,000 miles on the odometer!
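The check itself is trivial to reproduce. Here's a sketch in Python with made-up odometer readings standing in for the real trade-in data (the genuine public file is much larger, of course): tally the values and flag anything that shows up more often than a real odometer reading plausibly could.

```python
from collections import Counter

# Hypothetical odometer readings -- stand-ins for the real trade-in data.
# Genuine readings are almost all unique; round numbers like 99999
# showing up repeatedly is the red flag.
readings = [87234, 102611, 99999, 99999, 99999, 54120,
            99999, 76008, 100000, 100000]

counts = Counter(readings)

# In a sample where most values appear once or twice, any value
# appearing, say, 3+ times deserves a closer look.
suspicious = {miles: n for miles, n in counts.items() if n >= 3}
print(suspicious)  # {99999: 4}
```

The threshold of 3 is arbitrary here; on a real 8,000-record sample you'd calibrate it against the typical frequency of one or two.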

Just one variable. Yet it tells us a lot. Clearly, the odometer reading data isn't accurate. Since the odometer readings came from the same source as the other variables (the selling dealerships), this single table for a single variable allows us to question the accuracy of all the variables.

This doesn't mean that the data is useless. Nearly every interesting data set has inaccuracies. This just reminds us to be wary about blindly accepting data.

The first question that came to my mind upon seeing this table was: fraud or sloppy?  Was data intentionally cooked or were people just too lazy to try to get it right?

I can't answer that, but introducing a second variable can still be instructive.

I've examined the full data set for several states and found that the 99,999 and 100,000-mile odometer readings tend to cluster with just a few dealerships. We still don't know if it's fraud or sloppy, but I can see who did it. 
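Grouping the suspect readings by dealership is a one-liner once you have the records. A minimal sketch, using invented dealership names and readings (the real file identifies actual selling dealerships):

```python
from collections import defaultdict

# Hypothetical (dealership, odometer) records -- the real file
# has many more columns and real dealer names.
records = [
    ("Dealer A", 99999), ("Dealer A", 99999), ("Dealer A", 100000),
    ("Dealer B", 87210), ("Dealer B", 99999),
    ("Dealer C", 64401), ("Dealer C", 73916),
]

# The implausible values observed in the data.
SUSPECT = {0, 99999, 100000, 1000000, 10000000}

by_dealer = defaultdict(int)
for dealer, miles in records:
    if miles in SUSPECT:
        by_dealer[dealer] += 1

# Rank dealerships by how many suspect readings they submitted.
print(sorted(by_dealer.items(), key=lambda kv: -kv[1]))
```

If the suspect readings were random sloppiness, you'd expect them spread evenly across dealers; a ranking like this makes the clustering visible.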

I won't name dealerships here, but you can get the full dataset (it's public data) and do your own analysis.

One variable is never(?) enough

Well, "never" is a strong word, but I can't think of many situations where a single variable tells us much (maybe I'll post an example later). Sometimes even two variables aren't enough.

Consider this article on Millennial employment. It's full of "yes, but..." situations. The article shows a relationship between two variables, then goes on to say "yes, but when you add other variables you get different relationships."

Look - there's a gender gap in pay! That's two variables: gender and pay. Yes, but ... when you add in major, the pay gap reduces significantly. When you add in taking time out for raising children, the pay gap reduces even further and reverses in some fields.

Look - graduates of for-profit schools make more money than graduates of non-for-profit schools! Again, two variables: school type and pay. Yes, but ... when you add in age and work experience, you find that younger, inexperienced for-profit graduates make less.
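This "yes, but" pattern is confounding at work. A toy sketch with invented numbers (chosen only to make the arithmetic obvious, not drawn from the article) shows how a raw gap between two groups can exist even when pay is identical within every subgroup:

```python
# Toy illustration: an aggregate pay gap that vanishes once you
# condition on major. All figures are invented for the example.

# Mean pay is the same for everyone within a major...
pay = {"engineering": 70000, "education": 40000}

# ...but the two groups choose majors at different rates.
share_of_major = {
    "group1": {"engineering": 0.8, "education": 0.2},
    "group2": {"engineering": 0.2, "education": 0.8},
}

# Overall mean pay per group = weighted average over majors.
overall = {
    g: sum(share * pay[m] for m, share in majors.items())
    for g, majors in share_of_major.items()
}
print(overall)  # group1 out-earns group2 overall, yet within each
                # major the two groups are paid identically
```

Two variables (group and pay) show a gap; the third variable (major) explains much of it. That's exactly the structure of the article's examples.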

There's another example on race and unemployment (I'll spoil the surprise: Asians have the highest unemployment rate).

This article is a great example of the difficulty in attributing cause to relationships in two variables. It makes me wonder why introductory statistics classes focus so heavily on single variable methods. Most students take, at most, one statistics course. Maybe that course should spend more time analyzing multivariate relationships.

Monday, July 14, 2014

Alcohol Deaths

A recent CDC study claims that alcohol consumption is killing "1 in 10 adults" and is the "leading cause of preventable death," based on 88,000 deaths among working-age adults (20 to 64) between 2006 and 2010. These deaths were "due to health effects from drinking too much over time, such as breast cancer, liver disease, and heart disease, and health effects from consuming a large amount of alcohol in a short period of time, such as violence, alcohol poisoning, and motor vehicle crashes".

I don't think anyone is surprised that "drinking too much" is bad for you. However, before we revisit prohibition, maybe we should look at the CDC study more carefully.

The numbers are based on "Alcohol-Related Disease Impact (ARDI)" data: nationwide and state counts of alcohol-attributable deaths. This concept has been around since at least the early 1990s. Look at the list of causes in the quote. Breast cancer? Heart disease? Are they really claiming that breast cancer is caused by booze?

Not exactly. The second link doesn't have all the details, but it explains that they try to separate deaths completely caused by alcohol from those sometimes caused by alcohol, and adjust the data accordingly. There's something called an "alcohol-attributable fraction (AAF)". For example, they take the total number of breast cancer deaths and multiply it by the AAF to estimate the number of alcohol-attributable breast cancer deaths. The concept isn't necessarily bad, but like all statistics, the AAFs are estimates, and they aren't estimated the same way for each cause of death. In other words, the 88,000 figure is an estimate built on estimates of estimates, yet no margin of error is provided.
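The arithmetic behind the headline number is just multiplication and addition. A sketch with invented figures (these are not the CDC's actual death counts or AAFs) shows how many small per-cause estimates roll up into one big total:

```python
# Invented figures for illustration -- NOT the CDC's actual values.
# cause: (total_deaths, alcohol_attributable_fraction)
causes = {
    "liver disease":        (10000, 0.50),
    "breast cancer":        (40000, 0.05),
    "motor vehicle crashes": (30000, 0.30),
}

# Alcohol-attributable deaths per cause = total deaths * AAF.
attributable = {c: total * aaf for c, (total, aaf) in causes.items()}

# Summing across causes produces the one big number.
total_attributable = sum(attributable.values())
print(attributable)
print(round(total_attributable))
```

Note how a cause with a tiny AAF (breast cancer here) would look unimpressive in a per-cause table, but still pads the aggregate total. Each AAF carries its own estimation error, and those errors propagate into the sum.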

That doesn't mean that the basic claim is wrong.  Excessive alcohol consumption is a serious problem. However, instead of aggregating a bunch of estimates into one big scary number that isn't necessarily accurate, I'd prefer a table showing what alcohol does. Put the cause of death in Column A and an explanation in Column B. Put "Breast cancer" in A and in B explain how excessive alcohol consumption increases the risk of incurring breast cancer and dying from breast cancer. Do the same with "Heart disease", "Liver disease", etc.

Why don't they do that? I can think of two possibilities:
1) They think most of us are too stupid to understand a table. Sadly, they could be right.
2) Many of these causes of death have small AAFs (in other words, few alcohol-related deaths) so they wouldn't be impressive listed individually. However, added up, they help make the one, big scary number even bigger.

Note: The first link, the most recent study, says that alcohol is the "leading cause of preventable death." The second link is 10 years older and alcohol was merely the "third leading preventable cause of death". Does that mean that alcohol deaths increased relative to population size or other deaths decreased? Maybe someday I'll chase the data and find out.