I came across an interesting article from The Upshot in a FlowingData post. It seems that analysts have found ways to pull individual records out of aggregated Census data, even though the aggregation is supposed to protect our privacy.
This problem shouldn't be a surprise to anyone who has spent time working with Census data. We use Census data extensively in my introductory statistics class. One semester the class spotted a divorced 13-year-old female in our sample. The sample included her county and state.
Another time, we uncovered a 42-year-old female lawyer whose fourth marriage was within the last year. Again, we had county and state information.
We didn't try to figure out who they were, but we talked about whether or not we could. It depends on where they lived. Had both of them lived in Los Angeles County, California (population 10.2 million), it would have been difficult. Had they lived in Langlade County, Wisconsin (population 19,000), it wouldn't have been very hard to go through public marriage and divorce records to find them.
On the negative side, it's amazing how little statistical ability someone needed to spot these opportunities for bypassing privacy.
On the positive side, these two had to stand out from the rest of the data in order to get noticed and most of us don't stand out.
However, just a little more statistical ability and a few more variables would change what it takes to "stand out". Maybe you and I are more unusual than we think we are and, therefore, we're easier to identify. That's what the researchers in the linked article claim.
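To make the "few more variables" point concrete, here is a minimal sketch that counts how many records are the only one with their particular combination of values. The records, field names, and the `unique_records` helper are all invented for illustration; none of this is real Census data or anyone's actual re-identification method.

```python
from collections import Counter

# Hypothetical toy records, invented for illustration (not real Census data).
RECORDS = [
    {"county": "A", "sex": "F", "age": 42, "marital": "married", "job": "lawyer"},
    {"county": "A", "sex": "F", "age": 42, "marital": "married", "job": "teacher"},
    {"county": "A", "sex": "F", "age": 13, "marital": "divorced", "job": "student"},
    {"county": "A", "sex": "F", "age": 13, "marital": "single", "job": "student"},
    {"county": "A", "sex": "M", "age": 42, "marital": "married", "job": "lawyer"},
    {"county": "A", "sex": "M", "age": 42, "marital": "married", "job": "lawyer"},
]


def unique_records(records, keys):
    """Return the records whose combination of values on `keys` occurs exactly once."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return [r for r in records if counts[tuple(r[k] for k in keys)] == 1]


ALL_KEYS = ["county", "sex", "age", "marital", "job"]

# Add one variable at a time and see how many people now "stand out".
for n in range(2, len(ALL_KEYS) + 1):
    keys = ALL_KEYS[:n]
    n_unique = len(unique_records(RECORDS, keys))
    print(keys, "->", n_unique, "unique of", len(RECORDS))
```

In this toy sample nobody is unique on county, sex, and age alone. Adding marital status singles out two records, and adding occupation singles out four of the six. Nothing about the people changed, only the number of variables we looked at.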
Before you panic and refuse to participate in any future Census, read the article. The Census Bureau is aware of the problem and they're working on it. That's both good and bad. It's good that the Census Bureau takes our individual privacy seriously, but it's bad that the solution might be intentionally screwing up the data.
One solution is virtually moving people (which is a nice way of saying "falsify the data"). Maybe the 42-year-old lawyer doesn't actually live where the data says she lives. Is it OK to swap her with another woman in a different census block (the smallest geographic unit)? If that's OK, is it OK to swap her county? How about her state?
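It depends on your level of geographic analysis. Counties and states are often poor units of analysis (see the modifiable areal unit problem, or MAUP). Census tracts or blocks are more useful, which means the finest levels are where a swap matters most.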
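A toy sketch of swapping shows why the level of analysis matters. This is not the Census Bureau's actual procedure, just an illustration under my own assumptions: the records, the field names, and the `swap_geography` and `lawyers_by` helpers are all hypothetical.

```python
from collections import Counter

# Hypothetical toy records, invented for illustration (not real Census data).
RECORDS = [
    {"county": "X", "block": "X1", "job": "lawyer"},
    {"county": "X", "block": "X2", "job": "teacher"},
    {"county": "Y", "block": "Y1", "job": "teacher"},
]


def swap_geography(records, i, j, fields=("block",)):
    """Return a copy of `records` in which records i and j exchange the given geographic fields."""
    swapped = [dict(r) for r in records]  # copy so the original data is untouched
    for f in fields:
        swapped[i][f], swapped[j][f] = swapped[j][f], swapped[i][f]
    return swapped


def lawyers_by(records, level):
    """Count lawyers per geographic unit at the given level ('block' or 'county')."""
    return Counter(r[level] for r in records if r["job"] == "lawyer")


# Swap the lawyer's block with another resident of the same county.
within_county = swap_geography(RECORDS, 0, 1, fields=("block",))
print("by block: ", dict(lawyers_by(RECORDS, "block")), "->", dict(lawyers_by(within_county, "block")))
print("by county:", dict(lawyers_by(RECORDS, "county")), "->", dict(lawyers_by(within_county, "county")))

# Swap her county and block with a resident of a different county.
across_county = swap_geography(RECORDS, 0, 2, fields=("county", "block"))
print("by county:", dict(lawyers_by(RECORDS, "county")), "->", dict(lawyers_by(across_county, "county")))
```

In this sketch, a swap between blocks in the same county moves the lawyer at the block level but leaves the county totals exactly as they were. A swap across counties changes the county totals too. So a researcher working with counties never notices the first kind of swap, while a researcher working with blocks is analyzing data that is partly fiction.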
As a researcher, I want the best data I can get, and Census data is considered the gold standard of publicly available data (trust me, private companies know a lot more about you). On the other hand, I value data privacy.
I'm glad that the Census Bureau has to solve this instead of me.