Data Matters: 2019

Monday, October 7, 2019

Puppies, Mandopop, and Causality

Here's the best puppy video that I've ever seen on YouTube. It's only 9 seconds. Trust me, it's worth watching.

It's even better looped. Try it with YouTube's loop feature. It just gets better and better.

After I watched it a few times, I wondered what that catchy song was. It didn't take long for Shazam to identify it, but I struggled a bit getting the Chinese characters into a search field. I still don't know the name of the song, but it's recorded by the Mandopop* group By2.

Here are some comments for the song's video.

There's a clear causal claim here: the popularity of the puppy video caused the popularity of the song video.

That sounds reasonable, but... is it true?

Think about why the puppy video is so great. Sure, the rear dog's face plant into the snow is funny and the can-do spirit he shows by quickly jumping back up is inspirational. But I think it's the way the song perfectly syncs to the front dog's bounces that makes the video great.

At first, you think that the dog syncs to the song but that's not true. The song was added later. The dog does not sync to the song. The song syncs to the dog. That's an important distinction.

Compare the first version of the puppies to this one. It's longer, about a minute, but you need to watch only the first 10 to 20 seconds.

It's the same puppies doing the same thing but it's not nearly as cute.

That should make you think about the causal claim in the music video's comments. Maybe the puppy video is not the cause of the song's popularity. Maybe they have it backwards. Maybe, just maybe, the song is the cause of the puppy video's popularity.

Then again, maybe the comments are right. Either way, you can't confirm a causal relationship from a simple observational relationship**.

Unjustified causal conclusions are a common statistical error. When two variables show a relationship and "it makes sense" that A causes B***, we often jump to a causal conclusion without question. The jump to a causal conclusion is so common that I think we'd benefit from training ourselves to think the other way. Our first reaction to a statistical relationship should be to question any causal relationship.

--------

* Thanks to my colleague Chao Zheng for clarifying that the song is Mandarin instead of Cantonese.

** To be clear, I don't think the presence of two different sound tracks on the same puppy video is sufficient to claim there's an experiment. This is still observational. However, it's a good thought exercise to think about how you would make it an experiment.

*** Or, worse yet, we already believed the causal relationship and we were just looking for data to prove ourselves right.

Thursday, September 5, 2019

Qualitative Data is Data Too (Part 3)

I laid out the value of qualitative data in Part 1 and gave an example in Part 2. In this part, I've got my own analysis. Here's a series of qualitative data points.

Once upon a time, Council Bluffs Iowa was a major stopping point for westward travelers due to the presence of a steam-powered boat to ferry wagons and livestock over the Missouri River. Then someone decided that a transcontinental railroad would be a good idea. Wagon trains were fine, but railroads would be faster and more efficient for people, products, and livestock.

The big question: where would it go?

Several broad routes were considered and a "central" route was selected. Within the central route, an eastern starting point had to be selected. As POTUS, Abraham Lincoln would make the decision.

Just imagine the economic loss for Council Bluffs if Kansas City Missouri were selected as the starting point. Council Bluffs would have turned into a ghost town.

As luck would have it, one of the investors had previously employed Lincoln as an attorney and this investor wanted Council Bluffs selected. There's no way to be certain, but it would be reasonable to conclude that his connection to Lincoln influenced the decision. Council Bluffs was selected and by the 1930s its status as a wagon train stopping point was replaced by its status as the 5th largest rail center in the US.

The invention of the automobile again changed the nature of transportation and Route 66 was established in 1926. When completed it ran from Chicago to Santa Monica and became one of the most famous highways in US history. By 1930, trucks rivaled rail for dominance in shipping.

Route 66 was a financial boon for towns and businesses along the route but, like many highways, its path wasn't static.

In Atlanta Illinois, the original route ran right through town and businesses thrived. Twenty years later, a bypass was built and businesses died.

In New Mexico, an angry Governor used his lame-duck power to move Route 66 and bypass the state capital: "In 1924, Democrat Arthur Thomas Hannett was unexpectedly elected for a single term (1925–1927) as governor, only to be defeated with various dirty tricks in the next election. Blaming the Republican establishment in Santa Fe for his defeat, Hannett used the lame duck remainder of his term to force through a sixty-nine mile cutoff from Santa Rosa directly to Albuquerque, bypassing Santa Fe entirely."

These are just a few of many examples of Route 66 changes over the years. Most were not as blatantly political as in New Mexico, but each change still had politics in the background. Whether they were elected or appointed, someone or some group made the decision.

Just as rail usurped wagons and highways usurped rail, interstates began usurping highways in the 1950s. Portions of these interstates (I55, I44, I40) follow Route 66 very closely but most of Route 66 was bypassed. Much of the current Route 66 nostalgia focuses on the economic impact of the interstates and how they created a new generation of ghost towns.

This might not look like "data". Maybe you think it's just a story, a narrative of changing modes of transportation (admittedly a very abbreviated narrative). However, I would argue that there's a discernible pattern here and that we can derive insights from that pattern.

1) Things change. Downtown dime stores (Kresge's, Ben Franklin, Grants) had a little of everything and threatened their neighboring, specialized stores. Then Kresge's morphed into big-box discount Kmart and threatened all of downtown. Walmart came along and knocked Kmart off the top of the hill. The internet came along and Amazon knocked Walmart off the top.

2) When things change, politics matters. This is actually the main point, but I used #1 to emphasize the inevitability of change. Changes in transportation required political decisions about routes and rights-of-way. Changes in retail required political decisions about zoning, building codes, and taxation. If you plan to be in business, then you need to understand politics. Study politics. Study political economy. Read Travels of a T-shirt and pay attention to the policy and regulatory decisions made by a myriad of political bodies.

3) Nostalgia is selective. This is a minor point, but I still find it interesting. Route 66 is "hot" at the moment. Perhaps it's because of Disney's Cars or maybe it's related to the age of the Baby Boomers. Either way, there are books, blogs, articles, etc. about the sad loss of prosperity on Route 66. I don't see this level of concern over the ghost towns created when wagon trains disappeared. More recently, Lena Wisconsin's downtown was bypassed by reconstruction of Highway 141. Maybe the locals talk about the impact on Lena's downtown, but there's no national discussion that I've ever heard about.

Sources:

https://en.wikipedia.org/wiki/U.S._Route_66
https://www.national66.org/history-of-route-66/
http://library.nau.edu/speccoll/exhibits/route66/paths.html
https://en.wikipedia.org/wiki/U.S._Route_66_in_New_Mexico
https://www.route66news.com/2007/03/19/road-to-albuquerque-was-a-joke/
Personal experience gathered on a recent Route 66 vacation

Millennials Again - They Still Aren't All That Different

Last year, I wrote a post about millennials, demographics, and the risk of making overly broad generalizations.

Here's another report telling us that millennials aren't all that different than previous generations. They might be hitting some "life moments" later, but they still want the same things that their parents wanted.

Friday, August 23, 2019

Qualitative Data is Data Too (Part 2)

In Part 1, I contrasted traditional data analysis and qualitative analysis. In this part, I tell a story that combines them.

Unfortunately, I don't have a source for the story. At one point, I thought I read it in the work of Russ Ackoff but I haven't been able to find it in his writing. Then I thought it might be from Gene Woolsey. I was fortunate enough to spend some time on the phone with Dr. Woolsey toward the end of his career. He agreed that the story sounded like something that either he or Ackoff would have written but it wasn't his and he didn't recognize it from Ackoff's work. Therefore, even though I know that the story exists, I can't cite it and it remains apocryphal.

Since I can't find the source, I can't double-check the details. If anyone can confirm a source, please let me know.

Here's the story...

Back around 1970, a quantitative analyst was hired as a consultant by a Fortune 500 firm in Los Angeles. They were considering moving their operation out of the city to the suburbs and they wanted him to evaluate the long-range cost/savings of expanding where they were versus relocating. The consultant gathered the necessary data, did the analysis, and concluded that there would be significant savings if the company moved. He reported his results, got paid for his time, and went on his way. Nothing changed at the company.

About 18 months later, he ran into the CEO who had hired him. He brought up the project and apologized that his work hadn't been useful. The CEO seemed surprised and assured him that his analysis had been crucial to their decision.

"But you're not moving," the consultant said.

"No, we're not," admitted the CEO. "We wanted to stay in the city. We thought it was better for our employees and we believed that we had a positive impact on our neighborhood. Our corporate mission seemed to fit with the city location. There was a lot of pressure to move and save money but we didn't know what the savings would be or, conversely, what the cost would be to stay put. Your analysis made the numbers clear and we decided that we could afford to not move."

The moral of the story? The analyst did his job with a traditional quantitative analysis. The company made their decision on a combination of quantitative and qualitative factors. On the numbers alone, one could argue that the company reached the wrong conclusion. On the qualitative analysis, it wasn't clearly a right or wrong conclusion but it was a decision that the executives and the board were comfortable with.

In Part 3, I'll give an example of my own qualitative analysis. By necessity, my conclusions will be "fuzzy" and you might disagree with them but you should still be able to follow the logic connecting my observations and conclusions.

Wednesday, August 14, 2019

Qualitative Data is Data Too (Part 1)

To data professionals, "data" implies some sort of structure with defined records and variables. In this context, quantitative variables are numbers (such as price, income, age) and qualitative variables are non-numbers (color, country, gender, payment method). Depending on the lingo in your particular world, qualitative data could also be called nominal data or categorical data but it's still structured data.

However, some fields use the term "qualitative data" differently. Their text or observations aren't easily codified into records and variables but they still examine large segments of data to discern patterns and themes. Philosophers might read Plato, Hobbes, Smith, and Marx to generate theories and find text passages to support, or refute, those theories.

Modern technology can be used to bridge these approaches (i.e. the digital humanities). There are tools available to process online comments or customer reviews and determine how many are "positive" but some questions and some data simply don't lend themselves to any sort of structured data methods.

Let's use the Bible as an example. One could ask "How many times is the word money in the Bible?" Since the Bible wasn't written in English, we'd first have to agree on which translation we're going to use. Then it wouldn't be difficult to process every single word and count the number of times "money" occurs. It's no surprise that this has already been done. I suppose it's interesting that the King James version uses "money" 140 times, but I'm not sure that this mini-fact is particularly useful.

There are other words for "money". The Bible might mention payment, wages, debt, inheritance, silver, gold, ... This source tells us that there are over 2300 verses in the Bible that mention "money, wealth, or possessions".

But are "possessions" and "money" really the same thing? This analysis requires another step. As before, we'd have to agree on which translation to use but we'd also have to agree on a list of synonyms for "money". We might even come up with an ordinal scale for whether a word is a true equivalent or simply related. Then, as with the previous question, a program could process every word in the text and count how many times "money" and each synonym occurred.
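For the curious, here's a minimal sketch of what that counting program might look like in Python. The sample text and the synonym list are placeholders that I made up for illustration; they are not taken from any translation, and a real analysis would start from whichever translation and word list we agreed on.

```python
import re
from collections import Counter


def count_terms(text, terms):
    """Count case-insensitive whole-word occurrences of each term in text."""
    # Split the lowercased text into words made of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Counter returns 0 for words that never appear.
    return {term: counts[term.lower()] for term in terms}


# Placeholder text (hypothetical); a real run would load the agreed translation.
sample_text = "The money was paid in silver. He kept the gold, and the money was spent."

# Hypothetical synonym list; deciding what belongs here is the hard part.
money_terms = ["money", "silver", "gold", "wages"]

term_counts = count_terms(sample_text, money_terms)
total_mentions = sum(term_counts.values())

print(term_counts)
print(total_mentions)
```

Notice that the code is the easy part. It only matches exact words, so "wage" and "wages" would be counted separately, and it has no idea whether "possessions" belongs on the list. Those decisions are still ours to make.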

In both of these examples, it's possible to process the Biblical text as more or less traditional data but the results aren't all that useful.

A much more interesting question is "What does the Bible teach about money?" Many have attempted to answer that question, but none of them were able to do so with traditional statistical or data analysis tools.

Data professionals often aren't comfortable with this type of qualitative analysis. It's too fuzzy or too touchy-feely. However, it's likely that the people data professionals report to are using all kinds of "fuzzy" analysis so it might be wise to study some fields where qualitative analysis is commonly used.

In Part 2, I'll tell a story of qualitative analysis and decision making. In Part 3, I'll do my own qualitative analysis.

Wednesday, July 31, 2019

A new article

Twenty years ago, I wrote about the role of spreadsheet modeling in Operations Research/Management Science (OR/MS) education. It got a fair amount of attention. This month, I'm taking on another potential controversy: the interplay of "analytics" and "OR/MS".

Tuesday, March 26, 2019

Small Data (really small) and Expectation Management

Years ago, my wife and I went to see the movie Romancing the Stone. We were visiting a small town and it was the only movie playing. It was new in theaters and we didn't know anything about it.

We loved it. We told lots of people how good it was.

We loved it so much that we went to see it again about a month later. It was still good, but it wasn't great. We realized that we had zero expectations the first time - we were just looking for something to do - and high expectations the second time.

Years later, our friends were raving about My Big Fat Greek Wedding. They insisted that we see it. Really insisted. We were told that we needed to see it.

We finally went and it was a disappointment. It wasn't a bad movie, but no movie could live up to the hyped reviews we heard.

Over the years we've referred back to those movies when we find ourselves reacting differently than expected. We recently went to a restaurant that someone close to us insisted that we try. It was disappointing. Then one of us said "I guess this was a Big Fat Greek Wedding instead of Romancing The Stone". It was actually a nice restaurant but it couldn't possibly live up to the expectations we were given.

In related news, I just finished reading The Undoing Project by Michael Lewis. I learned some of this material in graduate school but, as usual, Lewis does a great job telling the story. The discoveries of Kahneman and Tversky explain our experience with the movies and the restaurant.

Small data, such as word-of-mouth reviews from a few friends, are poor statistical samples but people still give them significant weight in forming judgments. With social media, small data can get repeated and amplified so that they look like much larger data and, again, people will give them significant weight in forming judgments.
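To put a rough number on "poor statistical samples", here's a small sketch using the textbook standard error of a mean (standard deviation divided by the square root of the sample size). The rating scale, the spread of 1.0, and the sample sizes are all assumptions that I picked for illustration, and the formula assumes independent reviews, which friends repeating each other on social media are not.

```python
import math


def standard_error(std_dev, n):
    """Standard error of a sample mean, assuming n independent observations."""
    return std_dev / math.sqrt(n)


# Assumed spread of individual opinions on a 5-point scale (hypothetical).
rating_sd = 1.0

# A handful of friends versus a few hundred independent reviewers.
se_few = standard_error(rating_sd, 4)
se_many = standard_error(rating_sd, 400)

print(se_few)
print(se_many)
```

With these assumed numbers, the average from four friends is ten times noisier than the average from four hundred reviewers. Repeating the same four opinions a hundred times doesn't shrink that noise; it only makes the sample look bigger than it is.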

We all want good reviews for our endeavors, but we should also want accurate reviews. What if I do good work, but my good work merely meets your expectations (or even falls slightly short)? I'd rather be judged against an accurate expectation than an inflated one. However, in the world of small data where people aren't completely rational, I'm not sure how to make that happen.

Tuesday, March 12, 2019

Using Excel's Filters (Spring 2019)

I just posted a new video for my students on YouTube.
