Data Matters: 2020

Thursday, August 27, 2020

Millennials: Yes, this was predicted.

I guess I'm developing a habit of an annual post about millennials. My first was in October 2018 and my second was September 2019.

In the first post, I talked about my personal experience with different generations of students and linked to an article about millennials. In the second, I didn't say as much and linked to another article.

The overall message was that millennials aren't wildly different from prior generations.

Well here's another article about millennials. Since it's in the Wall Street Journal, it might be behind a paywall. So here's the main point: millennials - who were supposedly going to reject traditional family housing and completely change urban living - are driving the 2020 housing market.

Yes, this was predicted. The millennials are, on average, marrying later and having kids later but they're still marrying, having kids, and buying houses.

Monday, August 17, 2020

Gaiters - Are they bad or did the news jump the gun (again)?

Yes, I suppose that this is another COVID-19 post but I haven't written about COVID since April and March. Also - one could argue that this post isn't really about COVID. It's really about press coverage of emerging research (maybe any research) and COVID just happens to be the context.

You've probably seen the headlines or heard the news that the neck gaiters people have been wearing as face masks might not be working. Six days ago the headline was pretty scary: "Wearing a neck gaiter may be worse than no mask at all, researchers find".

OK, they did say "may" which could imply some uncertainty, but the coverage that I saw was pretty negative on gaiters. In fact, my employer has banned them based on these reports.

Four days ago, the headline was a little less scary: "Some neck gaiters may be worse than not wearing a mask at all, study shows". Now it says both "may" and "some".

Three days ago, the headline shifted again: "The results of this viral mask study found gaiters weren't effective - but is that true?" That's a very different headline and the article includes a quote from one of the researchers (Brian Labus):

“People have really gone overboard with their interpretation of this study. The goal of the study was not actually to evaluate masks”

What??? They weren't even trying to evaluate masks? You wouldn't know it from the headlines but they were trying to develop a low-cost method that could be used to evaluate masks.

In fairness, I should point out that all three articles include a link to the actual research report. Unfortunately, when major news outlets report on research, very few readers click through to the actual research (did you click my link?). Instead, people count on the news story to accurately summarize the research. In this case, the news blew it and focused on a peripheral issue.

In my experience, it's not unusual for news reports about research to do a poor job of representing the research. Sometimes it's intentional but often it's just sloppy reporting.

That said, there was a peripheral finding with a small sample size for a particular gaiter. That's far from conclusive but it should be enough to raise concerns and encourage further research on gaiters. I hope that research happens soon and gets better reporting.

Oh, about that sample size I mentioned in the previous paragraph? I could comment on it, but I won't. You should click through to the actual research study and see for yourself.

Saturday, April 25, 2020

Wow. How did so many of us miss this?

I've tried to stick to my plan to write only one post about COVID-19 but it's been hard because there are so many data issues to talk about. As an aside - I think that StatNews is doing a pretty good job covering things.

But then I ran across this Wired story by Ferris Jabr. You should read it yourself but I'll be nice and quote the main point:

"Both newspapers and scientific journals frequently state three facts about the Spanish flu: it infected 500 million people (nearly one-third of the world population at the time); it killed between 50 and 100 million people; and it had a case fatality rate of 2.5 percent. This is not mathematically possible. Once a pandemic is over and all the numbers are tallied, its case fatality rate is simply the total number of deaths divided by the total number of recorded cases. Each country and city will have its own CFR, but it’s also common to calculate a global average. If the Spanish flu infected 500 million and killed 50 to 100 million, the global CFR was 10 to 20 percent. If the fatality rate was in fact 2.5 percent, and if 500 million were infected, then the death toll was 12.5 million. There were 1.8 billion people in 1918. To make 50 million deaths compatible with a 2.5 percent CFR would require at least two billion infections—more than the number of people that existed at the time."

Wow. How did we all miss this? Are we so innumerate that we didn't see 500 million and 50 million and immediately say "Hey, that's 10% not 2.5%"? Shame on us.
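If you'd rather check Jabr's point than take it on faith, the arithmetic is easy to reproduce. Here's a minimal Python sketch. The function and variable names are just ones I picked for illustration, and the inputs are the numbers from the quote above, not new data.

```python
def case_fatality_rate(deaths, cases):
    """Case fatality rate: total deaths divided by total cases."""
    return deaths / cases


# The three commonly repeated figures, taken from the quote above.
infected = 500_000_000
deaths_low, deaths_high = 50_000_000, 100_000_000
claimed_cfr = 0.025
world_population_1918 = 1_800_000_000

# CFR implied by the infection count and the death toll range.
cfr_low = case_fatality_rate(deaths_low, infected)
cfr_high = case_fatality_rate(deaths_high, infected)

# Death toll implied by the claimed CFR and the infection count.
implied_deaths = claimed_cfr * infected

# Infections needed for the low-end death toll to match the claimed CFR.
required_infections = deaths_low / claimed_cfr

print(f"CFR implied by the death toll: {cfr_low:.0%} to {cfr_high:.0%}")
print(f"Deaths implied by a 2.5% CFR: {implied_deaths:,.0f}")
print(f"Infections needed for 50 million deaths at 2.5%: {required_infections:,.0f}")
print(f"More than the 1918 population? {required_infections > world_population_1918}")
```

Any two of the three figures can be true at once, but not all three.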

Beyond pointing out that none of us are paying careful attention, Jabr digs into the history behind these numbers and uncovers a lot of uncertainty about the Spanish Flu.

So here's where we stand.

COVID-19. My original post is still correct. The data would matter greatly if we had it. But we don't. It's getting better but it's still inconsistent and unclear and we're still facing extensive uncertainty.
Spanish flu. This data would also matter greatly if we had it. But we don't have good data and, at this point in history, we never will.

The take-away? We need to get more comfortable with margins of error and ranges of estimates. Data literacy should emphasize the need to look beyond simple point estimates.

Along those lines, I've just started a simulation unit in one of my classes. Simulation is a great tool for dealing with high levels of uncertainty. If you want to see my opening lesson, it's right here:

Tuesday, April 14, 2020

Interesting Shifts in Consumer Spending

From one of my favorite blogs: https://flowingdata.com/2020/04/12/change-in-consumer-spending-since-the-virus/

To me, the most fascinating aspect of this graph is that "groceries" initially jumped nearly 50% but now appears to be trending back to pre-stay-at-home levels while all other categories continue to decrease.

Much of the drop may be due to the sudden increase in unemployment but we can hope that some of it is due to people simply not having opportunities to spend. If people are saving money now, then maybe they'll splurge when this is over and help revive the economy.

Sunday, April 12, 2020

If it saves just one life...

How many times have you heard someone say "If it saves just one life, then it's worth it"?

That's one of my pet peeves because it's blatantly wrong. We will not do whatever it takes to "save just one life".

Consider auto accidents. Just under 40,000 people are killed each year in traffic accidents in the US. We could eliminate nearly all of those with just two regulations:

Speed Limit: A strictly enforced national speed limit of 10 miles per hour on all roads of all types.
Safety Equipment: Require all passengers and drivers to use fire suits, helmets, and NASCAR style five-point safety harnesses.

Of course, there's no way that we're going to do that. The cost is too high. It would save tens of thousands of lives but we won't do it.

Maybe that's not enough lives or maybe the rules would be too hard to enforce. So here's another idea.

About 5,000 pedestrians and 800 to 900 cyclists are killed by motorized vehicles each year. Instead of those two suggested rules, we should simply outlaw motorized vehicles completely. Then we'd save about 45,000 lives combined from traffic accidents, vehicle-pedestrian accidents, and vehicle-cyclist accidents.

By banning all motorized vehicles, we'd also stop over 500 boating deaths, several hundred ATV deaths per year, and all plane crash deaths.

Now we're talking WAY more than "just one life" so surely it's worth it. Isn't it? No, it's not. If someone actually attempted to outlaw motorized vehicles, they wouldn't get very far.

Maybe people who say "it's worth it" mean "it's worth it, as long as you don't mess with our transportation system".

OK. Let's explore that.

Between 2012 and 2016, there were over 176,000 home structure fires started by cooking activities. This led to over 5,000 fire injuries and 530 deaths. That's over 100 deaths per year caused by cooking at home. Therefore, we should outlaw cooking at home. It would save over 100 lives per year. That's more than just one life so it's worth it.

We also need to outlaw sports. Obviously, boxing is dangerous with over 500 deaths in the ring since 1884. Football has to go too. There's been only one death on the field in the NFL but there are several every year in youth, high school, and college football. Several. Every single year.

We can't switch over to soccer either. Or hockey. Or basketball. Fatality rates in some of these sports aren't high but they're not zero and "if it saves just one life, then it's worth it". All sports would need to be banned.

Any one of us could easily come up with more examples, but I think you get my point. There are limits to what we will do, pay, give up, etc. to save lives. There always have been and always will be.

So what do people mean when they say "if it saves just one life, then it's worth it"? In some cases, they're probably not thinking rationally. It's a statement made in an emotionally stressful moment and, if they thought about it, they wouldn't say it.

However, I think it's more often an attempt to manipulate. When they say "if it saves just one life..." what they're actually saying is "this rule/policy/expenditure is really important to me and I don't want anyone to argue with me or do any sort of cost/benefit analysis".

It's really just an attempt to cut off discussion and ignore the data. When that happens, I call them on it because - data matters.

Friday, March 20, 2020

COVID-19 Data would matter...

... if we had decent data and knew what to do with it.

I'm not an epidemiologist so I'm not going to write a lot about COVID-19 on this blog but there are some interesting data lessons.

I'm also going to post only one link (edit: see the end of this post). There are many articles about COVID-19 that I could reference, but I'm not going to. Instead, I'm going to give you a single data source (https://coronavirus-realtime.com/) and comment on "things we're hearing about" instead of citing specific sources.

Lesson 1: Simple calculations don't work if you're using the wrong data.

As of this morning, the site above shows 247,400 total cases and 10,067 total deaths worldwide. For the moment, let's assume that both of those numbers are accurate. What's the fatality rate?

The simple calculation that many people are doing is 10,067/247,400 = 4.1%

The arithmetic is correct, but the number doesn't answer the question at all. Most of the 247,400 cases are still in-process. We don't know how they will end.

To estimate the fatality rate, we need to look at cases that are resolved. That restricts us to cases that are recovered (86,037) or deceased (10,067). That's 96,104 cases and, tragically, 10.5% of them have ended in death.
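Here are both calculations side by side as a short Python sketch, using only the three numbers quoted from the site that morning. The variable names are mine, and neither result should be read as "the" fatality rate.

```python
# Figures quoted above from the data source on the morning of the post.
total_cases = 247_400
recovered = 86_037
deaths = 10_067

# Simple calculation: deaths divided by all reported cases,
# including cases that are still in-process.
naive_rate = deaths / total_cases

# Resolved-only calculation: deaths divided by cases with a known outcome.
resolved_cases = recovered + deaths
resolved_rate = deaths / resolved_cases

print(f"Deaths / total cases:    {naive_rate:.1%}")
print(f"Resolved cases:          {resolved_cases:,}")
print(f"Deaths / resolved cases: {resolved_rate:.1%}")
```

Same death count, two very different percentages. The only thing that changed is the denominator.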

But that's still not the fatality rate...

Lesson 2: Bad data can be worse than no data

Let's look at all three of those numbers:

247,400 total cases
86,037 recovered
10,067 deaths

None of them are correct. NONE. Let's take them in order.

First, Total Cases: There have been multiple reports of people being denied tests because they weren't ill enough or they didn't show the right symptoms. However, we've also been told that many who get COVID-19 will show mild or no symptoms. I've read claims that more than half of those who get COVID-19 will be completely asymptomatic.

Therefore, the total number of cases could be double the reported number. Tests are becoming more widely available, but they're still being reserved for people who actually have symptoms. We would need to test large samples of asymptomatic people in order to properly estimate total cases. This might eventually happen for research purposes, but it's not going to happen in the midst of the crisis.
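To get a feel for how much an undercount would matter, here's a rough Python sketch that recomputes the simple deaths-divided-by-cases figure under a few undercount multipliers. The multipliers (1x, 1.5x, 2x) are purely hypothetical values chosen for illustration, not estimates, and the sketch holds the death count fixed even though that number has its own problems.

```python
# Reported worldwide figures quoted earlier in this post.
reported_cases = 247_400
reported_deaths = 10_067


def naive_rate(deaths, cases):
    """Deaths divided by cases, with no adjustment for unresolved cases."""
    return deaths / cases


# Hypothetical multipliers: true cases = reported cases * multiplier.
# These are illustrative assumptions, not measured values.
undercount_multipliers = [1.0, 1.5, 2.0]

rates = {
    m: naive_rate(reported_deaths, reported_cases * m)
    for m in undercount_multipliers
}

for m, rate in rates.items():
    print(f"True cases = {m:.1f}x reported -> deaths / cases = {rate:.1%}")
```

If the true case count really were double the reported one, the simple rate would be cut in half. That's a big swing from a number we can't currently measure.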

The total cases number also has interpretive problems. As of yesterday, my county had four confirmed cases. One was a woman in her 50s with no known travel or contact with infected people (what they're calling "community spread"). The other three cases were all in one family that traveled together. To eventually compute fatality rates, those are four separate cases. In terms of contagion, I would say that they are just two cases. I think that's an important distinction but I doubt that we'll ever have solid data that allows us that distinction.

Second, Recovered: The inaccuracy of total cases makes this number wrong too. If we never knew that you had it, then we'll never count you as recovered.

But it's worse than that. One news story said that patients weren't cleared until they tested negative on two tests administered 24 hours apart. Remember the shortage of tests? How many people are currently recovered but not officially recovered?

The source above shows zero recovered cases in either California or Washington. I've been watching this site for more than two weeks and not everyone who was active two weeks ago has died. By now they should be recovered so I don't know why there are no reported recoveries. It could be the issue in the prior paragraph or something else.

Third, Deaths: As with recoveries, the total cases number makes this number wrong too. If we never knew that you had COVID-19, then we won't attribute your death to it. Even if we know that you had COVID-19, death can be complicated.

My 90-year-old mother passed away last year 10 days after a bad fall. Did she die from "falling" or from "complications of a fall" or ...? Several months earlier, she nearly died from sepsis. But the sepsis was the result of an untreated urinary tract infection. The recurring urinary tract infections were a result of other medical complications. If she had died during the sepsis incident, what would the real cause be?

Consider someone who is completely healthy. Then they test positive for COVID-19 and they die fairly soon. It would be pretty clear that COVID-19 killed them. On the other hand, if someone has a myriad of health problems and COVID-19 becomes the tipping point, then maybe COVID-19 "sort-of" killed them.

When I say that "bad data can be worse than no data", it's not because the data shouldn't be collected. It's because people don't understand it, they make simple calculations with it, and then they push for public policy and private decisions based on incorrect numbers. However, we do want the data...

Lesson 3: Keep collecting the data and keep a level of skepticism

There are other issues that could be raised with all of these numbers. I touched on age and health but didn't dive deeply into those. I've also ignored differing data methods in different countries.

Still, we'll eventually have better data. It will never be perfect (no useful data is) but it will get better. I'm a big fan of statistical literacy and public access to raw data but "armchair" statisticians need to be careful about their own number crunching. This is a situation where you should listen to the experts but focus on

a) admitted uncertainties in their calculations - like margins of error - and

b) disagreements among them.

Admitted uncertainties and disagreements from the experts will give you a good idea of how uncertain their conclusions are.

------------------------------------------------------------------
NOTE: The day after I wrote this, I ran across a great article on the data aspect of COVID-19 so I'm posting it here and not re-writing my post.
------------------------------------------------------------------
