Previously, I wrote a post that questioned the value of a single variable. I haven't changed my mind, but I admit that a single variable can sometimes be interesting.

I took a random sample of 8,000 records from the Cash For Clunkers program. Below is a table of the odometer readings from the trade-in vehicles.

Look at the frequencies. The vast majority of readings occur once or twice (or not at all). So it's not believable that 34 vehicles would have exactly 99,999 miles on the odometer, or that 12 would have exactly 100,000 miles. I'm not even going to get into the 17 vehicles with zero miles. This isn't a quirk of my sample: the same pattern shows up in the full data set and in any large sample drawn from it. You'll also find an unusually large number of cars with 1,000,000 and 10,000,000 miles on the odometer!
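If you want to try this kind of frequency check yourself, here is a minimal Python sketch. The function name `suspicious_readings`, the `min_count` cutoff, and the sample values are all made up for illustration; they are not taken from the program data, and the right cutoff depends on how big your sample is.

```python
from collections import Counter

def suspicious_readings(readings, min_count=5):
    """Return (value, count) pairs for readings that repeat at least min_count times.

    min_count is an arbitrary, hypothetical cutoff; tune it to the sample size.
    """
    counts = Counter(readings)
    flagged = [(value, n) for value, n in counts.items() if n >= min_count]
    # Most frequent values first.
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)

# Made-up odometer readings for illustration only, not the actual program data.
sample = [99999] * 6 + [100000] * 5 + [87412, 132508, 145233, 87412, 120417]

flagged = suspicious_readings(sample)
print(flagged)
```

In a sample where most values appear once or twice, anything this returns deserves a second look.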

Just one variable and it tells us a lot. Clearly the odometer reading data isn't accurate. Since the odometer readings came from the same source as the other variables (the selling dealerships), this single table for a single variable allows us to question the accuracy of all the variables.

This doesn't mean that the data is useless. Nearly every interesting data set has inaccuracies. This just reminds us to be wary about blindly accepting data.

The first question that came to my mind upon seeing this table was: fraud or sloppy? Was data intentionally cooked or were people just too lazy to try to get it right?

I can't answer that, but introducing a second variable can still be instructive. I've examined the full data set for several states and found that the 99,999 and 100,000 mile odometer readings tend to cluster at just a few dealerships. Fraud versus sloppy I can't say, but I can see who did it. I won't name dealerships here, but you can get the full data set (it's public data) and do your own analysis.
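For anyone doing their own analysis, the second-variable check might look something like the sketch below. It assumes you have already reduced each record to a (dealer, odometer) pair; the dealer labels, the `SUSPECT_VALUES` set, and the function name are hypothetical and say nothing about the real file's column names or layout.

```python
from collections import Counter

# Round-number readings treated as suspect in this sketch (an assumption).
SUSPECT_VALUES = {99999, 100000}

def suspect_counts_by_dealer(records):
    """Count suspect odometer readings per dealer.

    records: iterable of (dealer_id, odometer) pairs.
    """
    return Counter(dealer for dealer, odometer in records
                   if odometer in SUSPECT_VALUES)

# Made-up records for illustration only.
records = [
    ("dealer_A", 99999),
    ("dealer_A", 100000),
    ("dealer_A", 99999),
    ("dealer_B", 84210),
    ("dealer_C", 99999),
    ("dealer_B", 120455),
]

by_dealer = suspect_counts_by_dealer(records)
print(by_dealer.most_common())
```

Raw counts favor high-volume dealers, so it is probably worth comparing each count against the dealer's total number of trade-ins before drawing conclusions.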
