An “as if” ramble about Type I error rate
November 21, 2019 | General | 1 Comment
I just finished the below post in one sitting and came back to write a reasonable introduction — or maybe more of a disclaimer. Once again, it did not proceed as I had hoped or planned. It’s a ramble. But, perhaps the fact this topic always turns into a ramble for me is part of the point. So, instead of adding it to the collection of other unpublished draft posts labeled “too-rambly,” I’m going to make myself just put it out there for what it is. Have fun…?
***********************************************************
I have multiple draft posts in the queue related to the concept of a Type I error. That's not because I like it, but because it is so embedded in many areas of science and in discussions about current problems in science. I feel obligated to try to take it on in a different way. Given the press it gets, there is way too little discussion of what concepts and assumptions lie behind the words. When I used to try to teach about related concepts I often drew a line down the whiteboard and wrote “Hypothetical Land” on one side and “Real life” on the other. An awareness of the hypothetical land is important for understanding parts of statistical inference, even if you choose to go forward assuming it’s a reasonable enough model to be useful in your science. Each post I have attempted has blown up into something much too big and chaotic. So, I am going to try to stay close to the trunk in this post and avoid venturing off on the branches, no matter how attractive they look.
To provide a starting place outside of my own brain, I just looked up “Type I error” in the only Intro Stats book I happened to still have on my shelves. It’s Intro Stats by De Veaux & Velleman with contributions by Bock (2004 by Pearson). It should be a little outdated, but I’m not convinced it really is — at least for the sake of today’s topic. The table of contents lists “Type I error” with subheadings “Explanation of” and “Methods for reducing.” I’m not sure how you reduce a single Type I error, so I assume in the second heading they are referring to reducing the Type I error rate. This is a subtle, but very important point.
I flipped to page 395 and found the relevant section. It starts with “Nobody’s perfect.” Okay — I can get behind that and completely agree.
Then, in the second sentence, it moves right to decisions, as if the goal and end result of any statistical analysis is a simple binary decision. This is the first branch I will avoid as I fight hard to stay on task for today. I just ask that you don’t ignore the as if as we pass this branch (and others).
According to the text, it turns out we can only “make mistakes in two ways.” I could certainly find that a huge relief — but really it’s more disturbing than not. But, remember the as if — if we appeal to a greatly simplified binary reality, then maybe we buy into the usefulness of the two-error theory of mistakes, and here are the two (as quoted from the book):
I. The null hypothesis is true, but we mistakenly reject it.
II. The null hypothesis is false, but we fail to reject it.
These two types of errors are known as Type I and Type II errors.
Page 395. De Veaux and Velleman (2004), Intro Stats, Pearson.
Okay – even within this greatly simplified, binary decision, as-if world, the null hypothesis typically states that a parameter (of some model) is equal to a single value — and most commonly that single value is zero. Even taking an as-if view (to gain mathematical traction), I will never believe a parameter (defined on a continuum) is equal to a single number like 0.000000….. So, what does that imply about these errors? It appears I can never actually make a Type I error because the null hypothesis is never true. Phew — so my Type I error-rate is always zero. What a relief. I guess we can stop here. But wait, … I promised to stay on the trunk and this is definitely one of the branches luring me away. I’m using all the willpower I can muster to stay on track.
Here we go — Let’s narrow our as-if hypothetical world view even further and suppose it makes sense to consider the potential “truth” of a null hypothesis (it either is true or it is not true). It certainly provides space to approach the problem mathematically and perhaps that’s worth it (invoking the idea of willful ignorance).
Where does that send us? Well, the text (like many other intros to this concept) proceeds by making the analogy to medical testing procedures and false positives and false negatives. I get the connection and I see the teaching angle of tying a scenario of tangible real-life errors to those associated with the idea of statistical hypothesis testing. The trial analogy with guilty or innocent is another one often appealed to, and I admit I have gone there with intro classes (before I thought hard enough about it). But it has always made me uncomfortable and I think it should make you uncomfortable too. The relevance of the analogy for teaching rests on the assumption that the null hypothesis being true or false is like a person having a disease or not, and the diagnostic test is like a statistical test. I won’t go further up this branch now, but I will say I do think there is huge benefit to asking students to think about how the scenarios with a clearly defined truth (disease or not, guilty or not) relate (and don’t relate) to the statistical hypothesis testing scenario. That is, let’s question the strength of the analogy and the assumptions it rests on.
Moving on again. The title of this post has “rate” in it, so let’s move in that direction. On page 396, De Veaux and Velleman provide this:
How often will a Type I error occur? It happens when the null hypothesis is true but we’ve had the bad luck to draw an unusual sample. To reject H₀, the P-value must fall below α. When H₀ is true, that happens exactly with probability α. So when you choose level α, you’re setting the probability of a Type I error to α.
Page 398. De Veaux and Velleman (2004). Intro Stats, Pearson.
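A minimal simulation sketch (not from the book) can make the quoted claim tangible. It assumes a Normal(0, 1) population, so the null of mean zero is true by construction, and a two-sided z-test with the population SD treated as known. The function names, sample size, and seed are hypothetical choices for illustration only.

```python
import math
import random
from statistics import NormalDist, fmean


def z_test_p_value(sample, null_mean=0.0, sigma=1.0):
    """Two-sided p-value for a z-test with a known population SD."""
    z = (fmean(sample) - null_mean) / (sigma / math.sqrt(len(sample)))
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def rejection_proportion(alpha, n=20, n_samples=20_000, seed=1):
    """Proportion of simulated random samples whose p-value falls below alpha.

    Every sample is drawn from Normal(0, 1), so the null (mean = 0) is true
    by construction and every rejection counts as a Type I error.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_samples):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if z_test_p_value(sample) < alpha:
            rejections += 1
    return rejections / n_samples


if __name__ == "__main__":
    for alpha in (0.01, 0.05, 0.10):
        rate = rejection_proportion(alpha)
        print(f"alpha = {alpha:.2f}  simulated proportion = {rate:.4f}")
```

Inside this fully specified as-if world, the simulated proportions should land close to each chosen α. That says something about the collection of simulated samples, and nothing about whether any one real study made an error.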
Here we’re getting closer to a “rate” — though I don’t see the term ever used in this section of the text, which is interesting. Let’s do some dissecting of words and see where that gets us. First — the words “How often will a Type I error occur?” This idea is key and often (if not usually) glossed over. It’s nice that they explicitly state the question here, but then they skirt around the context that’s needed to make sense of it. It is typical that words around this idea are written in such a matter-of-fact way that the intended recipient does not even realize the context is missing. To understand this statement, we have to understand what is being referred to by the term “how often.” How is this defined or supposed to be thought of? Over all hypothesis tests conducted? How often in time (e.g. once a week)? How often over a researcher’s or data analyst’s career? There has to be some reference or basis to move forward in terms of attaching meaning, as well as mathematics, to it.
The clue is in their next sentence: “… but we’ve had the bad luck to draw an unusual sample.” There is a lot embedded in these words and perhaps I should assume that a student who read the first 395 pages of the book would catch the subtle connections, but my years of attempting to teach statistical inference lead me to believe that’s not the case. There is already so much going on and it is referencing concepts that are extremely challenging for most students to master in a first course, regardless of their mathematical backgrounds.
The reader likely focuses more on the words “bad luck” than on the words “an unusual sample.” We need to connect the idea of the “how often” to the idea of the “sample.” Where does this connection come from? Well, hopefully the concepts of a “sampling distribution” and a “randomization distribution” are still in our conscious memory. Because they used the “sample” wording, we’ll go with the sampling distribution idea here. (The randomization distribution idea is analogous, but based on random assignment to groups rather than random sampling from a population.) The word I have added here is “random,” and this is really necessary to support the theory underlying the idea. It’s hard enough to wrap our heads around the idea with the random, and so, so much harder to get at the “how often” issue if there is no random mechanism fed into the design. But, that is yet another branch of the tree I am not going to get tricked into following now.
So, let’s go with the assumption that you chose one sample (collection of units, individuals, etc.) from your population of interest at random. If you actually did this, then you shouldn’t have to stretch your imagination too far to imagine a different (although hypothetical to you) random sample, and another, and another, and another, …. you get the point. So, the “how often” in the statement is connected to “how often” out of all possible random samples!! This idea of a collection of possible random samples or random assignments to groups provides a tangible (though largely hypothetical) basis for taking on the “how often” concept and this is what gets us to the rate.
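The “out of all possible random samples” idea can be sketched by brute force when the population is tiny. The example below is an illustration under loud assumptions: a made-up population of 12 values whose mean is exactly zero (so the null is true by construction), samples of size 4 drawn without replacement, and a z-style rejection rule that treats the population SD as known. The population values and the function name are hypothetical.

```python
import math
from itertools import combinations
from statistics import fmean, pstdev

# Hypothetical population of 12 measurements, symmetric around 0 so that
# the null hypothesis "population mean = 0" is true by construction.
POPULATION = [-2.4, -1.7, -1.1, -0.6, -0.3, -0.1, 0.1, 0.3, 0.6, 1.1, 1.7, 2.4]


def type_one_error_rate(population, n, critical_z=1.96):
    """Share of all possible size-n samples that lead to rejecting the null.

    Uses a z-style rule with the population SD treated as known:
    reject when |sample mean| / (sigma / sqrt(n)) exceeds critical_z.
    Returns the share of rejecting samples and the number of samples.
    """
    sigma = pstdev(population)
    standard_error = sigma / math.sqrt(n)
    samples = list(combinations(population, n))
    rejected = sum(
        1 for s in samples if abs(fmean(s)) / standard_error > critical_z
    )
    return rejected / len(samples), len(samples)


if __name__ == "__main__":
    rate, n_possible = type_one_error_rate(POPULATION, n=4)
    print(f"{n_possible} possible samples, share rejecting the null: {rate:.4f}")
```

Here the “how often” is a literal count over every sample that could have been drawn. The resulting share need not equal the nominal 0.05, because the rule borrows a cutoff from a Normal, infinite-population model that this small population does not satisfy.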
As much as I tried, I can’t move on without admitting that I cringe to see the word “exactly” in the quote and my eye starts twitching when I notice the emphasis added to it. Remember, this all makes sense in some very hypothetical, very over-simplified model of reality. The word “exactly” only makes sense within that world as well — so take it with a grain of salt. Don’t stop asking yourself how much simplification and hypotheticalization (note to look up this word) is too much — when is the model too simple or too far from reality to be useful?
Continuing… Let’s gently, but seriously, remind ourselves that this whole Type-I error idea rests on assuming the null hypothesis is TRUE. So, IF the null hypothesis is true (e.g., the true mean of some measured quantity in your population of interest is 0.000000….), THEN, just because of variability in the measured outcome among individuals, a proportion of the possible random samples will result in summary statistics (like sample averages) that are far enough away from zero that, if an automatic criterion and decision framework is used, the researcher will “reject the null” when, under the assumption of the null being true, they shouldn’t have. I think of this as a realistic implication of not having enough information, not bad luck, but I assume they use ‘luck’ to appeal to the ‘random’ that was left out of the words. There are plenty of random samples that would lead to the decision to “fail to reject the null,” which of course does NOT directly support the null being true. Phew. This is exhausting. We can see why intro stats books sweep a lot under the rug and further simplify an “as-if” model that is already unrealistically simplified. [Note: my use of mean and average in the example assumes the mean of some quantity over the entire population is what we should be trying to learn about, an assumption we’re rarely asked to justify or think hard about, but that’s yet another branch! And see previous related post.]
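The point that “fail to reject” does not support the null can be illustrated with a small simulation sketch. The assumptions are all mine for illustration: a Normal population with SD 1 treated as known, a true mean of 0.1 (so the null of mean zero is false), samples of size 10, and a two-sided z-test at 0.05. The function name and seed are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean


def fail_to_reject_proportion(true_mean, n=10, alpha=0.05,
                              n_samples=20_000, seed=2):
    """Proportion of simulated samples where a z-test fails to reject mean = 0."""
    rng = random.Random(seed)
    critical_z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    standard_error = 1.0 / math.sqrt(n)  # population SD assumed known, equal to 1
    not_rejected = 0
    for _ in range(n_samples):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        if abs(fmean(sample)) / standard_error <= critical_z:
            not_rejected += 1
    return not_rejected / n_samples


if __name__ == "__main__":
    for true_mean in (0.0, 0.1, 1.0):
        share = fail_to_reject_proportion(true_mean)
        print(f"true mean = {true_mean:.1f}  share failing to reject = {share:.4f}")
```

With a small sample, the share of samples that fail to reject is large whether the true mean is 0 or 0.1, so the decision alone cannot distinguish a true null from a slightly false one.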
Back to the “rate.” Let’s say we decide, probably arbitrarily or because someone told us to do it, that we’ll set our Type I error rate to be 0.05 (I am watching this branch go by as well). We happily say “the probability of making a Type I error is 0.05.” This feels good to say, but what does it really mean? I argue that most people have very little idea what that really means and that to be more honest we should at least refer to the Type I error rate, and not this vague idea of the “probability of a Type I error.” Here again, another enormous branch leads off the trunk I’m trying to stay on, related to the difficulties inherent in defining probability to begin with. We’ll stick with the rate version, as we are tying “rate” to a proportion of all the possible random samples [that would lead to rejecting the null when really the null is true and conditional on the set-up of the test and all its underlying assumptions]. Oh, did I mention that we might assume an infinite population with infinitely many possible random samples from it? Shoot. I’ll let that one lie for now too.
I knew this was going to be hard, but it’s been far harder than I anticipated!! Hang in there. I’m almost to the point that I really wanted to make with this post. The De Veaux and Velleman quote, as well as many other references teaching about Type I error, makes it very clear that you as the researcher can control your Type I error rate. This is another comforting thought: finally, something you can control regarding the outcome of your study. But, how useful is this, even if you do buy into the concept of a Type I error and the model behind it? Well, let’s be clear on one thing. You CANNOT control whether your study and analysis results in a Type I error; you won’t know if it does and you can’t control that. But, you supposedly CAN control the Type I error rate associated with your study. But, what does this really mean and how helpful is it? If taking other samples is purely hypothetical and you don’t know if you’ve made a Type I error, what do you do with that information from a practical standpoint? We typically only take one sample and that’s all the information we have. We rarely (never?) take multiple random samples of the same size from the exact same population at the same time. If we can afford to do more sampling, then we take a single random sample of more individuals. We don’t keep them as separate random samples of a smaller size, because why would we?? We can always reduce this largely hypothetical idea of a Type I error rate by taking a very large sample, trying our best to ensure the sample average is “near” (on some scale that matters practically) to the population mean.
I fear I’m failing to make the point that seems so clear in my own head. We talk about this “Type I error rate” and “Probability of a Type I error” as if it is something that exists beyond a hypothetical construct. We treat it as if it is like false positives and false negatives where the “false” can actually be known at some point. We talk about broad implications for Science, making claims like we should expect 5% of all reported results to be wrong. While I firmly believe our reliance on hypothesis testing and belief in the utility of trying to avoid Type I errors is definitely leading to issues in science, it’s not that simple and we have to start looking deeper at the meaning of these concepts. They are not magic and they are not simple.
The “replication crisis” (known by other names too, including reproducibility) often rests on assuming an integral role of the Type I error rate. And now, there is a push to “replicate” studies, often motivated by the idea of Type I error rates, rather than just a desire to learn more about a problem by gathering additional information. I am all for continuing to build knowledge by repeating the design of a study and doing an in-depth comparison of the results or rigorously combining the information toward a common goal. But, I worry our misunderstandings of Type I errors (or lack of opportunity to think hard about the concept) are leading to oversimplifying the fix to the replication crisis as well. I think it has contributed to a focus on simply comparing p-values, despite statisticians warning against this. And, let’s go back to the “how often” idea. Even if a different researcher, in a different lab, tries their best to repeat the design of a study, we are still a long way from the definition of where the error rate comes from, unless we are drawing new truly random samples from the same population and re-doing the analysis exactly the same under all the same assumptions. So, we are not truly “replicating” a study in the sense that defines a Type I error rate. And even if we were, what do we do with two (or three) out of possibly infinitely many random samples? That’s barely better than one! We should be gathering more information, but let’s not fool ourselves about the usefulness of the Type I error rate construct. And here I’ve ventured onto yet another branch that I have to keep myself from going down.
I keep repeating myself, but this post was not at all as satisfying as it was supposed to be, but I guess that is the lesson for me. There is nothing satisfying about a Type I error rate (or a Type II error rate or Power) when you really try to dig deep into the weeds. These are not laws or principles that should dictate how we should be doing science — they are based on models of reality that score very high on the “as-if” scale. These concepts were created largely for mathematical convenience and are beautiful in a purely mathematical sort of way. I admit, they are superficially satisfying if we happily ignore the many unsettled conceptual issues and proceed as if we’re in an algebra class, rather than practicing science in the real world. It’s like eating a piece of delectable high-fructose-corn-syrup filled candy — it’s so enjoyable as long as you don’t let yourself think about what’s in it and the potential long term implications for your health. It always tastes better before you look at the ingredients.
1 Comment
mks@math.utexas.edu
Megan,
The course notes at https://web.ma.utexas.edu/users/mks/CommonMistakes2016/commonmistakeshome2016.html might be helpful. But don’t just go to the discussion of Type 1 and Type 2 errors — the notes are designed to give the (often missing) background for things like what you are discussing above — in the hope that the background and cautions along the way will help in understanding of the (often not straightforward) concepts, and also in understanding cautions about their limitations in actual practice.