Data Matters: 2018

Tuesday, December 11, 2018

Are Bikes Legally Faster or Just Faster (and What Difference Does it Make)?

As a bicyclist who rides for both recreation and commuting, I notice articles about bike-vs-car speeds. Here's one backed up by data! Yes, actual data (because, you know, data matters).

However, I have questions about the data. From my own bike commuting experience, I fully accept that a bike can be as fast as a car. However, there are only two ways that a bicycle can actually be faster than a car.

1. The bike has access to different routes than cars. There could be bike paths that cut through parks, across rivers, etc. that cars can't drive on. This allows bikes to bypass congested streets and intersections. There could also be dedicated bike lanes along the roads that give bicycles different legal rights. I experienced this on a recent visit to Washington DC. There was a path near our hotel that went past the airport, through a park, and over the river with no stop signs. It was a shorter and faster trip to the National Mall than the best car route.
2. The bike uses the same routes as cars but passes on the right at every traffic signal. In heavily congested traffic, this allows bikes to get ahead of the cars. When traffic is stopping every few blocks, this can create a significant time advantage for bikes over cars. I've also experienced this in a variety of cities.

So here's the problem.

Under #1 above, too many cyclists think that sidewalks are legitimate routes for bikes. In most cities, that's illegal. It's also dangerous when there's pedestrian traffic. To clarify, I'm talking about actual sidewalks, not multi-user paths. Legally, those are two different things.

Under #2, in the absence of dedicated bike lanes, passing on the right is usually illegal and dangerous.

So when a headline says "Data From Millions of Smartphone Journeys Proves Cyclist Faster", I'd like to know what percent of those "millions" of journeys were completely legal. Based on my personal experience biking and talking to other bikers, I'm suspicious but I don't have data. Even if I had their data, I doubt that it would show whether the bikers or the drivers broke any laws.

That leaves me at an impasse. Without data, I can't find evidence to support or contradict my suspicion.

That brings me to the second part of my title: does it matter?

Let's assume that I'm right and a large proportion of the "bikes are faster" data is from illegal riding. If the law is rarely, or never, enforced then is it really a law? The last time I saw an officer stop a bicycle for riding on a sidewalk was when Barney Fife stopped the spoiled kid in downtown Mayberry. I've never seen a bike stopped for passing on the right. If our culture accepts this sort of biking, then maybe it doesn't matter if it's technically illegal and it's fair to simply say "bikes are faster".

On the other hand, what if you're the insurance company that provides workman's compensation or liability coverage for Deliveroo? If a delivery rider gets injured or causes injury, then your company's financial responsibility could change if the rider was breaking the law. Even if "bikes are faster", you might want to encourage the use of cars if bikes are more likely to break the law.

I would argue that the legality issue does matter. Now how do we get the data?

Friday, December 7, 2018

Data Access Versus Data Privacy: US Census

I came across an interesting article from The Upshot in a post from FlowingData. It seems that analysts have found ways to pull individual data records out of aggregated Census data that is supposed to protect our privacy.

This problem shouldn't be a surprise to anyone who has spent time working with Census data. We use census data extensively in my introductory statistics class. One semester the class spotted a divorced 13-year-old female in our sample. The sample included her county and state.

Another time, we uncovered a 42-year-old female lawyer whose fourth marriage was within the last year. Again, we had county and state information.

We didn't try to figure out who they were, but we talked about whether or not we could figure it out. It depends on where they lived. Had both of them lived in Los Angeles County, California (population 10.2 million), it would have been difficult. Had they lived in Langlade County, Wisconsin (population 19,000), it wouldn't have been very hard to go through public marriage and divorce records to find them.

On the negative side, it's amazing how little statistical ability someone needed to spot these opportunities for bypassing privacy.

On the positive side, these two had to stand out from the rest of the data in order to get noticed and most of us don't stand out.

However, just a little more statistical ability and a few more variables would change what it takes to "stand out". Maybe you and I are more unusual than we think we are and, therefore, we're easier to identify. That's what the researchers in the linked article claim.
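Here's a minimal sketch of that idea. The records are invented for illustration (they are not Census data), and `count_unique` is a hypothetical helper that counts how many records are the only one with their particular combination of values. The more columns you look at, the more people "stand out".

```python
from collections import Counter

# Hypothetical records, made up for this example:
# (age, sex, marital_status, occupation, county)
records = [
    (42, "F", "married", "lawyer", "A"),
    (42, "F", "married", "teacher", "A"),
    (42, "F", "married", "teacher", "A"),
    (13, "F", "divorced", "none", "B"),
    (13, "F", "single", "none", "B"),
    (13, "F", "single", "none", "B"),
    (35, "M", "married", "teacher", "A"),
    (35, "M", "married", "nurse", "A"),
]


def count_unique(rows, columns):
    """Count rows whose values on `columns` are shared with no other row."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1)


# Add one variable at a time and watch the number of unique records grow.
for columns in [(0, 1), (0, 1, 2), (0, 1, 2, 3)]:
    print(len(columns), "variables:", count_unique(records, columns), "unique records")
```

With age and sex alone, nobody in this toy sample is unique. Add marital status and one record stands out. Add occupation and four of the eight do.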

Before you panic and refuse to participate in any future Census, read the article. The Census Bureau is aware of the problem and they're working on it. That's both good and bad. It's good that the Census Bureau takes our individual privacy seriously, but it's bad that the solution might be intentionally screwing up the data.

One solution is virtually moving people (which is a nice way of saying "falsify the data"). Maybe the 42-year-old lawyer doesn't actually live where the data says she lives. Is it OK to swap her with another woman in a different census block (the smallest geographic unit)? If that's OK, is it OK to swap her county? How about her state?

It depends on your level of geographic analysis. Counties and states are often poor units (see MAUP). Census tracts or blocks are more useful.

As a researcher, I want the best data I can get and Census data is considered the gold standard of publicly available data (trust me, private companies know a lot more about you). On the other hand, I value data privacy.

I'm glad that the Census Bureau has to solve this instead of me.

Tuesday, November 13, 2018

If you doubt that data matters...

What was Amazon looking for with their widely publicized search for an HQ2 location?

One could answer by looking at their list: proximity to international airport, mass transit, regional population, ...

However, some are saying that Amazon was actually looking for DATA. Here are two articles:

Bloomberg: Amazon Said No to Cactus, Yes to Data in Hunt for New Office
Reason: HQ2: How Amazon Made Government Do Their Bidding for Free (note: this article links to the previous one)

Neither article is very long so you should read them yourself, but I'll provide a couple of interesting quotes:

From Bloomberg: "But it kept hundreds of millions of dollars worth of free information from the cities to create the biggest corporate site location database in the world, according to Richard Florida, an urban studies professor at the University of Toronto."

From Reason: "Amazon is now privy to information about where different municipalities are going to direct investment and infrastructure in the near future. The company can exploit this information. ... Maybe Amazon just happens to purchase a new fulfillment center right around a soon-to-be-developed locale which would see increased demand for Amazon products. Maybe it simply decides to squat on land for a while, knowing that it will soon be smack dab in a hive of activity. A new brick-and-mortar store? They'll have the option. Or maybe knowing where news roads will be built will make it easier for Amazon to plan transit routes. There's profit to be extracted from this data that you and I could not even conceive."

Whether Amazon played a game just to obtain data or the data is a side benefit of an honest search, it's clear that data matters.

By the way - while not the same level and volume of data that Amazon got, ALL of us have access to a great deal of government data for free. Check out IPUMS.

Tuesday, October 16, 2018

Demographics is Destiny?

I just came across this link on cities and "peak Millennial" from posts by Digging Data.

We seem to like making broad generalizations when comparing generations. I usually cringe when I hear them because there's rarely much data behind the statements.

In particular, I'm getting tired of hearing how Millennials are so different from any prior generation. I don't see much of it. They show up in my classes. Some are smarter and some not so much. Some are lazy. Some are industrious. Some are liberal, some are conservative, and some don't even think about politics. I could go on but, essentially, they aren't that different from the students who came before them (or the ones before that or the ones before ...).

Still, there is data to support some generalizations about them. As post-college young adults, they - on average - seem more drawn to urban environments. Data supports that. However, some then predicted that their generation would completely revive the urban landscape for decades.

Maybe not. The article linked above says that, even though Millennials are marrying and having children later in life (which is data supported), their housing and community preferences for the married-with-children stage of life might not be all that different from their predecessors:

"But with a view of history and demographics, it’s not difficult to imagine a future where that love [of city life] fades with the years, and a different sort of life starts to seem appealing. Millennials have shown a tendency to delay marriage and children, and thus occupy their studio apartments in urban cores for longer. But that’s no reason not to be concerned that school quality and more space might factor into their choices as they age."

Saturday, September 29, 2018

K.I.S.S. in Graph Design

I ran across this post from Data to Viz. My title is misleading because their post is much more than a call for simplicity.

However, it's intriguing that their solution to many common problems comes down to "stop being fancy and make it a bar chart".

Friday, September 28, 2018

Is There Still Value in Political Polling?

A colleague sent me a link to Why Polling Can Be So Hard by Nate Cohn.

It's interesting and not very long so you should read it.

I'll summarize one of his major points: Voter registration data is important to pollsters but different states store different data for each voter. For example, Wisconsin is known for having minimal data. Of course, Wisconsin was pivotal in the 2016 elections.

However, I found the comments just as interesting as the article. They are largely negative. Some people refuse to participate in polls or intentionally lie. Again, you should read the comments yourself, but they don't look good for the future of polling. I recognized myself in the article and the comments.

I live in Wisconsin. I don't want my voter registration to have ANY data about me beyond the minimum legal need. Information privacy matters and I don't care if our minimal data makes pollsters' jobs harder.

I don't answer a call when I don't already know the caller. If you're not in my contacts and your call is important then you can leave a message.

I suspect that excessive polling and reporting on polls is no longer predicting what will happen as much as it's changing what will happen. Whether it's the bandwagon effect or the Hawthorne effect, I think it's a problem.

Speaking of the Hawthorne effect, campaigns now use their own extensive polling to craft their message. Polling doesn't just change voter behavior, it changes politician behavior.

Perhaps polling is a victim of its own success. When it was new and not overly intrusive, it provided useful information (value). Economic theory says that value attracts new participants and will continue to attract participants until there is no longer any value available. In perfect competition, there are zero long-term profits.

Early polling methodology was pretty standard and easy to replicate - perfect competition. To break out of perfect competition, organizations need something to differentiate their output. If a poll is supposed to accurately predict the vote, how can one poll differentiate itself from the others?

Accuracy? There should be value in being more accurate but you need to come up with better, non-standard methodologies. You also need access to different or better data. Then it's still hard to show that you're more accurate.
Speed? Is there value in being the first to publish results? If you're the first by days then there could be value. Maybe even being first by hours. But minutes?
Frequency? If one group publishes a weekly poll, then you might gain value by publishing a daily poll. But how far can this go?

I think that all three of these approaches have been tried but they lead to the problems that voters complain about: information privacy and bombardment with polls.

The pessimist in me fears that polling and reporting on polling have become so ubiquitous that we're nearing the point of zero value. Worse, we might have reached negative value and polls are doing more harm than good.

A less pessimistic view suspects that the value of polling isn't gone (or negative): it's just changed. Maybe polls no longer tell us what we think they do. That creates new opportunities for the Nate Cohns and FiveThirtyEights of the world to find that new value.

Wednesday, August 29, 2018

Classic Probability Applied (or Not)

My wife and I are watching the America's Got Talent results show. Twelve acts performed last night and the audience voted. Based on those votes (sort of), seven acts get to stay. If we assume equally likely outcomes, every act has a 7/12 chance of going forward.

The first thing they do is pull aside "three acts in danger" for the Dunkin' Save. Out of this group, the audience re-votes to save one. Of the two left, the judges vote to save one. If the judge vote is a tie, then the audience vote from the previous night determines who stays. Either way, two of these three acts get to stay. If we assume equally likely outcomes in that group, then they have a 2/3 chance of going forward.

Since only seven acts go forward and two of them come from the Dunkin' Save group, there are five slots for the remaining nine acts. In other words, they have a 5/9 chance of going forward.

Let's recap. Before the results show starts, each act has a 7/12 ≈ 0.58 chance of staying. After this first sort, each act is in one of two situations:
* Dunkin' Save where they have a 2/3 ≈ 0.67 chance of staying.
* Still on stage where they have a 5/9 ≈ 0.56 chance of staying.
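If you want to check my arithmetic, here's a short sketch using exact fractions. The constant names are my own, and the whole thing rests on the same equally likely outcomes assumption as the numbers above. The final check confirms that the two group probabilities, weighted by group size, average back to the original 7/12.

```python
from fractions import Fraction

TOTAL_ACTS = 12   # acts that performed
ADVANCING = 7     # acts that get to stay
SAVE_GROUP = 3    # acts pulled aside for the Dunkin' Save
SAVED = 2         # one saved by the audience re-vote, one by the judges

# Chance of staying, assuming equally likely outcomes within each group.
before = Fraction(ADVANCING, TOTAL_ACTS)
in_save = Fraction(SAVED, SAVE_GROUP)
on_stage = Fraction(ADVANCING - SAVED, TOTAL_ACTS - SAVE_GROUP)

for label, p in [
    ("Before the show", before),
    ("Dunkin' Save", in_save),
    ("Still on stage", on_stage),
]:
    print(f"{label}: {p} = {float(p):.2f}")

# Weighting each group's chance by the share of acts in that group
# should recover the starting probability.
weighted = (
    Fraction(SAVE_GROUP, TOTAL_ACTS) * in_save
    + Fraction(TOTAL_ACTS - SAVE_GROUP, TOTAL_ACTS) * on_stage
)
assert weighted == before
```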

It appears that acts are unhappy to be in the Dunkin' Save group, but the probabilities suggest otherwise. Which group would you rather be in? Don't answer right away. Think about it.
.
.
.
.
.
.
Think a little more.
.
.
.
.
.
.
Ok. Which group? Why?

Does the equally likely outcomes assumption required by classical probability make any sense? We know that they're not really equally likely because the acts going forward are not randomly selected. But how does this play out?

All twelve acts are ordered by the viewers' votes and the Dunkin' Save acts are the 6th, 7th, and 8th place. Therefore, if you're in this group you know that you're "on the bubble" with the audience. The Save acts could be grouped tightly based on votes. A 2/3 probability might be reasonable and it's a little better than what you had when the show started.

What about the other nine acts? Now you know that you're either in the top five votes or you're at the end of the pack. There's no middle ground left. Would you feel better in this group? If you think you did a great job, then you're really confident. If you think you blew your performance, then you think you're done. The 5/9 probability is probably useless in your mind.

Therefore, the Dunkin' Save group might be neither bad nor good. It's just different. Once the Save group is set aside, the remaining acts probably have a good idea where they stand while the Save group is still in suspense.

Note: There is a potential problem in my use of "probability". Consider a fair coin. If I'm about to flip the coin, the probability of a head is 50%. What if I've already flipped the coin but it's hidden under the couch and no one knows what side is facing up? What's the probability that it's a head? Some would say that it's still 50%. Others would say that the coin flip is already done and, therefore, the probability of a head is either 0 or 1. Our lack of knowledge regarding the outcome doesn't change the fact that it's already done.

If you interpret probabilities the second way, then that could change your preference for being in or out of the Dunkin' Save group. Being put in the Save group puts you into an uncertain outcome where probabilities matter. Being out of the Save group means that your outcome has already been determined (it's 0 or 1) even if you don't know what it is yet.

Wednesday, April 18, 2018

Another YouTube Video

Nothing spectacular here, but my class is making a transition from Excel ("the Swiss Army knife of analytical tools") to SPSS because SPSS is simply better at straight-ahead statistics. This is an example of the things they need to do for their first SPSS lab.
