The problem with “no evidence” – and is it enough to bust a myth?
March 31, 2020 | General
“Failing to reject the null is not the same thing as accepting the null!!” “No evidence for an effect is not the same thing as evidence of no effect!!” For anyone who has taken a traditional intro stats class based on null hypothesis tests, these messages were probably drilled into your head (or at least should have been).
There are sound theoretical foundations behind these statements (conditional on many assumptions of course). Despite the foundations, they represent a sticking point for students first exposed to the logic and become a chronic problem in practice beyond the classroom. In my experience, most people who use statistics in their work fall into one of two camps: (1) they accept the statements as fact and adhere to the wording recommendations (even if they haven’t really made sense of the words), or (2) they do not find the recommendations useful and so ignore them in practice, either explicitly or implicitly through carefully chosen words. This may sound overly critical, but that is not my point — these are not people trying to mislead others on purpose. These are people trying to do their best to communicate scientific results and make decisions. And maybe we should listen harder to those who choose (2), and question those who always go with (1).
Statisticians, and other teachers of introductory statistics, tend to throw up their hands and say things like “they just don’t get it!” “How can they blatantly ignore what we teach in practice!??” And so on. I don’t blame the statisticians and teachers either. I’ve certainly been there and feel the frustration. But, what if we are missing something? What if it is unsatisfying and confusing for a reason — because there is more to the story? If so many well educated humans push back against, or even blatantly ignore, these statements and the reasons for them, then maybe we need to open our minds a bit and reconsider these things we’ve been taught are truths. Maybe our recommended wording falls short in the real world (even if technically correct under assumptions that most people don’t really understand or consider).
Given our societal reliance on statistical tests and their results, the “no evidence” wording makes its way into journal articles read by practitioners needing to make decisions and also into general media reports (we’ll look at an example soon). What if our wording that’s fitting for Intro Stats homework assignments falls quite short when transferred into real life? What if it’s not strong enough to stand on its own, and yet we rarely give it the extra support it needs because that isn’t part of what we teach or model in practice? We should value technically correct wording, but what if it’s misinterpreted when taken out of its technical context? And how useful is it if the majority of people reading it do not understand the theory and assumptions behind it — and why should they!!??
Before we go on, I want to be clear about something. I suspect comments to this post may go straight toward blaming the method or approach. “Well, don’t do null hypothesis significance testing and that will solve the problem!” “Use a Bayesian approach to get the statements you want!” Etc. While I don’t completely disagree with these statements, that’s not the point I’m trying to make. The bigger problem does not go away with a different choice of method — the same wording is often used or different phrases are substituted in much the same way. In general, I think we try to oversimplify the challenges inherent in trying to communicate inferences — which always must be made without all the information we would like. And, the wording I discuss here is incredibly widespread. Like it or not, it is still most commonly taught in the context of null hypothesis testing and used in that context in many scientific disciplines.
Okay – let’s walk (or wade) through an example from the media last week.
Myth busting around coronavirus
Someone shared this article with me: Corona Myths Explored in Medical News Today. It’s not something I would usually find or read, but I figured I would glance through the “24 coronavirus myths busted” (that’s the label of the article in my browser tab). I had already begun drafting a post reflecting on the confusion expressed by intro stats students when trying to buy into believing that “no evidence of an effect” is always a different thing than “evidence of no effect.” I have a PhD in Statistics and understand the logic — but I also see the practical problems that grow out of those phrases and can’t really blame the students any more. Then, when I read the myth busting article, the examples were impossible to miss – like the final quarter wind of a jack-in-the-box. I am not placing blame on the authors — just providing an example of something widespread to provide a setting for discussion.
I could have chosen almost any of the 24 myths, but let’s go with myth #11. I encourage you to first read it quickly, as if you’re just scanning the news. Think about the message you took away from it — without fighting it. Then, go back and read it again slowly, trying to really think about what the words convey, or might imply.
There is no evidence that a saline nose rinse protects against respiratory infections. Some research suggests that this technique might reduce the symptoms of acute upper respiratory tract infections, but scientists have not found that it can reduce the risk of infection.
https://www.medicalnewstoday.com/articles/coronavirus-myths-explored#What-should-we-do? (March 2020, written by Tim Newman).
What is implied?
Let’s start with the first sentence: “There is no evidence that a saline nose rinse protects against respiratory infections.” Note that this paragraph is about respiratory infections in general, not specifically coronavirus. The connection may seem obvious (implicitly), but it’s still a connection that should be made and justified. Now, to the key phrase.
What do they mean by “there is no evidence”? Suppose I would really like to make a decision for myself and my family about whether to continue daily saline nose rinses. I need to understand the statement, where it comes from, and what it actually implies. I would also like any information about potential risks and theories about why it might or might not be effective (but I’ll get to that later).
I’m willing to bet that many people read the article quickly and, despite the carefully chosen wording, take the statements as reasonable justification for busting a myth. On your first read, did your brain quickly process the words into “Sinus rinses will not protect you against coronavirus?” There is nothing to be ashamed of if you did. That’s certainly where my brain unwillingly tried to go. We’re forced to constantly filter and simplify information to make sense of all we are bombarded with. The framing of “myth busting” completely primes us to interpret the statements in that way — even if technically the words say something else. Aren’t they implying that the myth is “Saline nose rinses protect against coronavirus”? And therefore the busted version of the myth is that rinses do not protect against coronavirus. Even the connection to coronavirus is implied.
Given the setting and general confusion about the statement, it’s easiest for our brains to interpret by turning it around. “Okay, if there is no evidence that a saline nose rinse protects, then I guess I should infer that it probably doesn’t protect against upper respiratory infections.” The alternative is to take on the possibility that there just isn’t enough (if any) information. This is clearly more difficult to process and doesn’t really jibe with the goal of the piece — why bother wasting space presenting and busting a myth if information is lacking? I think this thought can easily send the brain back to the simpler message. We read and process quickly — it doesn’t really feel painful or hard until it has our attention, and then it can start to drive us crazy.
Possible interpretations of “no evidence”
I see three general options for what “no evidence” could be translated to mean:
- There is no information because we haven’t studied it yet. In other words, translate “no evidence” to “no information.” For this post, academic definitions are not as important as how the phrases are interpreted by individuals. I will save a deeper discussion of evidence vs. information for another day.
- There have been some studies, but we haven’t yet gathered enough information to be comfortable making a call about whether it might be preventative. This may be summarizing a single study or used in a more general sense. In other words, translate “no evidence” to “not enough information” or “too uncertain.”
- There is enough information available to make a call, but researchers are relying on statistical hypothesis testing and their p-value is still over their cut-off (or the null is in their uncertainty interval). They would like to say there’s “evidence of no effect” and it might even be justified — but that wording is not allowed based on the approach. I look into this more below.
These three translations differ by the amount of information available for a problem, yet the exact same wording of “no evidence” can be and commonly is used to describe all three. This is the suggested and accepted wording – not a misuse. The wording is not technically wrong; it’s just really ambiguous and leaves a lot of open questions. It’s certainly not satisfying.
We haven’t found anything. But, how hard have you looked?
I can’t say for sure which translation was intended by the author, but if he was really trying to talk about coronavirus, then I’m pretty sure it has to be the first one. If we care about all upper respiratory infections, then maybe the second. He probably didn’t think through it himself — his job is to report wording from researchers or “experts” who are trying hard not to make the statistical faux pas of conflating “no evidence for an effect” with “evidence for no effect.” Avoiding this faux pas when in the first and second scenarios above is important, but real life isn’t as simple as we either “have evidence” or we “have no evidence.” This is yet another example of the strong pull we have to simplify by falsely dichotomizing — even at the expense of leaving out crucial information. In this case, by not adding context about the amount (and type!) of information available about the “myth,” the first and last sentences are almost meaningless.
I guess we haven’t looked at the last sentence yet, so let’s take a quick peek: “Scientists have not found that it can reduce the risk of infection.” Again, it’s clear care was taken in crafting this statement to avoid saying “Scientists have found that it cannot reduce the risk of infection.” But, how can I interpret “have not found” without some information about how hard they have looked? At the extremes, they might not have looked at all, and therefore definitely didn’t find anything — or, they might have done extensive searching and didn’t find anything. These are two very different scenarios. With respect to coronavirus in particular, I think it’s safe to assume they have not yet looked.
I can’t help thinking about my kids on this one. For you parents out there, how many times have you heard an anguished cry of “I can’t find X!!! It’s gone!” I’m betting that for many of you, your first response is similar to mine: “Where have you looked?” And often, it’s clear not much looking has been done, so I don’t buy the “it’s gone” conclusion. Only after we have thoroughly searched the house and retraced steps will I believe that claim. I suggest we apply a little of this parenting logic to interpretation of claims stated in the media, even if they are dressed up in wording that looks to have come straight from scientific experts.
Do either of the sentences bust a myth? Do they even justify that there is a myth that’s worth busting?
Things to consider beyond amount of data
Disclaimer — this is presented as a thought exercise in response to reading information from the media. This is not a thorough review of theory and data regarding sinus rinses and coronavirus. I am just trying to demonstrate how we can recognize the need for more information and avoid accepting our first interpretation of statements like those presented in the myth busting article.
Should you stop your sinus rinses due to “no evidence”? Well, it depends on how the phrase should be translated relative to the amount of information available (as discussed above), but it also depends on potential risks associated with the behavior (treatment/intervention) and strength of biological reasons for why it might or might not work.
I haven’t dug deep into the literature on the potential negative effects of sinus rinses, and am going on the assumption that negative side effects are minimal for most people. I know a handful of people who have been given medical advice to do daily sinus rinses, and more who rely on them when having sinus discomfort or feeling a cold coming on (I’m one of those). I’m not convinced they always help, but I’m not convinced they hurt either. So when I weigh the potential risks vs. benefits, I end up going with the sinus rinse. I put frequent drinking of water, and even hot herbal teas, in this category too. I believe hydration is healthy. There are also potential positive psychological effects of “doing something” that I don’t think should be ignored. The placebo effect is still an effect.
I would also like to hear more about the biological reasons why it might work, or why it shouldn’t be expected to work. I like a little theory and logic, preferably with a lot of data, but I’ll take it alone if there aren’t any data. It’s still information. Is it possible that it clears some of the virus out of the sinuses that could make it down to the lungs? What about “we don’t have data yet, but here’s why some people think it’s worth studying (or not)…”?
In the case where there is no (or very little) data about the benefits over some “control” and no worrisome side effects (I’m assuming this), what is the point of publishing this as a busted myth? Presenting it in this way may stop people from engaging in a low-risk, but potentially useful, behavior simply because we have too little information. What if it is actually helpful for some people? It’s worth thinking about. Suggesting (even implicitly) that people should not do something is not equivalent to telling them there simply isn’t enough information to make a recommendation, so they should weigh the potential benefits and risks themselves and make their own decision without the help of data collected for that purpose. It may not be what they want to hear, but it’s honest.
Plenty of information, but still “no evidence” — more on the 3rd translation
Now, I’m going to dig into the statistical aspects of the third translation I offered. This is my explanation for what I see happening in practice in multiple disciplines. We have to really see and understand the problem before we care enough to try to fix it. I’m trying out a presentation that I hope will hit home with some.
Suppose plenty of data have been collected in a large single study, but the stated conclusion was in the form of a “no evidence” statement. What is probably going on in this situation? It probably stems from a long history of null hypothesis testing, and habits developed during that history. Wording and “logic” from that realm have even carried over to other approaches that don’t directly support it.
I have found I get the best traction on this “no evidence” idea when I appeal to uncertainty intervals. It’s really a demonstration of the additional information contained in a confidence interval as compared to a p-value. While this may sound like an old and boring topic, I find it to often be a key in understanding the confusion surrounding “no evidence.” I feel compelled to say that I don’t think confidence intervals are the answer to all of our statistical-use problems, but they contain information that can be used in more holistic (for lack of a better word) and creative ways (no, I do not think creativity is a bad thing in science). However, confidence intervals have their own baggage and should be taken with healthy skepticism and an awareness of all the assumptions behind them. Okay, here we go…
Let’s compare two intervals and the conclusions supported by each. And before we do, let’s also suppose that before carrying out the research, we did the hard work to think through what values of the parameter we decided to estimate (e.g., a difference in means) would be considered clinically relevant or practically meaningful. We even drew a nice picture representing the backdrop of clinical relevance — any results would be laid out in front of this backdrop and interpreted relative to it. In this case, let’s suppose that values greater than about 2 are considered definitely practically meaningful and values below about 0.5 are considered not clinically relevant. That leaves values between about 0.5 and 2 in the uncomfortable, but unavoidable, gray area. Maybe I should give the backdrop an acronym to make it more attractive — Backdrop of Clinical Relevance (BCR)? (For another day: Why do acronyms make things more attractive?)
Here’s my picture:
Let’s consider the two uncertainty intervals (labeled A and B) drawn below the BCR. Let’s suppose they are 95% confidence intervals, for ease of comparing them to two-sided p-values and the usually automatic choice of 0.05 as a cutoff (again – I am not recommending this, just acknowledging current practice and trying to appeal to what most people are still taught).
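In case it helps to see how such a picture could be put together, here is a minimal sketch in Python (my own illustration, not from the original figure). The relevance thresholds of 0.5 and 2 come from the backdrop described above, but the endpoints of intervals A and B are made-up values chosen only to match the story: A narrow and centered near 0, B wide and stretching into clearly relevant territory.

```python
import matplotlib.pyplot as plt

# Backdrop of Clinical Relevance (BCR): thresholds from the text above;
# the interval endpoints are hypothetical, chosen only for illustration.
not_relevant_below = 0.5   # effects smaller than this: not clinically relevant
relevant_above = 2.0       # effects larger than this: definitely meaningful

intervals = {
    "A": (-0.35, 0.40),    # narrow, centered near 0, excludes relevant values
    "B": (-0.60, 2.30),    # wide, contains 0 AND clinically relevant values
}

fig, ax = plt.subplots(figsize=(7, 2.5))
ax.axvspan(-1.0, not_relevant_below, color="0.90", label="not clinically relevant")
ax.axvspan(not_relevant_below, relevant_above, color="0.75", label="gray area")
ax.axvspan(relevant_above, 3.0, color="0.55", label="clearly relevant")
ax.axvline(0, linestyle="--", color="black")   # the null value

for y, (name, (lo, hi)) in enumerate(intervals.items(), start=1):
    ax.hlines(y, lo, hi, linewidth=4)          # draw the 95% interval
    ax.text(hi + 0.05, y, name, va="center")

ax.set_ylim(0, 3)
ax.set_yticks([])
ax.set_xlabel("difference in means")
ax.legend(loc="upper left", fontsize=8)
plt.tight_layout()
plt.show()
```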
First, note that both intervals include the value 0. Unfortunately, this is often the sole use of a confidence interval in practice, which is no different than just looking at a p-value. If the null value (0 in this case) is in the 95% confidence interval, then the two-sided p-value will be greater than 0.05. The textbook-recommended wording will lead to statements of “no evidence for…” — for both A and B!!
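If that duality between the interval and the test sounds abstract, a tiny simulated example (Python; the data and sample size are made up) shows it directly: the null value 0 sits inside the 95% confidence interval exactly when the two-sided p-value is above 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.2, scale=1.0, size=30)      # hypothetical sample

# two-sided one-sample t-test of H0: mean = 0
t_stat, p_value = stats.ttest_1samp(x, popmean=0)

# the matching 95% confidence interval for the mean
se = stats.sem(x)
half_width = stats.t.ppf(0.975, df=len(x) - 1) * se
ci = (x.mean() - half_width, x.mean() + half_width)

# duality: 0 is inside the 95% CI exactly when the two-sided p-value > 0.05
print(f"p = {p_value:.3f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print("0 in CI:", ci[0] < 0 < ci[1], "| p > 0.05:", p_value > 0.05)
```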
But A and B look very different!! The difference is particularly obvious when interpreted relative to the BCR. If the backdrop is never drawn, then it can be incredibly hard to assess how similar or different they are (leading to an over-reliance on statistical “significance”). A and B tell very different stories and should lead to different conclusions about what to do next (both in terms of future research and making a decision about what to do)!!
Interval A represents a situation where there are plenty of data available. That is, there is enough precision in the estimate to distinguish between values that are and are not considered clinically relevant. If interval A were centered at 2.5, then you would conclude “there is evidence for a clinically relevant effect.” So why, when it’s centered near 0 and excludes all values representing clinically relevant effects, are we not allowed to say “there is evidence of no clinically relevant effect”? Because of the structure and logic behind hypothesis tests and their lingering negative effects on critical thinking and statistical inference. The recommended wording of “no evidence for an effect” is not wrong — it’s just ambiguous. No wonder things get so confusing.
A large p-value (or a null value contained in a CI) can arise because (1) there actually is no effect, OR (2) there isn’t enough information to distinguish values deemed clinically relevant from those that are not. In this case, interval A is consistent with reason (1) and interval B is consistent with reason (2). They are not the same!! Interval B leaves us with more uncertainty than we would like. We could consider the study a failure and toss it out, or we could take it as valuable information that suggests further study. Maybe it was incredibly hard to collect data or there was simply more variability in the system than expected, etc. “No evidence for an effect” seems an overly harsh and easily misinterpreted end. It might lead to tossing something in the file drawer that shouldn’t be tossed, or concluding an intervention is not effective — a conclusion that is not really backed up by the statistical results.
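To make the A-versus-B contrast concrete, here is a small hypothetical sketch (my own illustration in Python, not a formal rule from any method) of reading an interval against the backdrop instead of only checking whether it contains 0. The thresholds and the interval endpoints are the same made-up values used above.

```python
def interpret_interval(lo, hi, margin=0.5, relevant=2.0):
    """Read a 95% interval against the Backdrop of Clinical Relevance.

    The thresholds and the wording of the labels are illustrative
    assumptions, not a formal decision rule.
    """
    if -margin < lo and hi < margin:
        # the whole interval sits among effects judged too small to matter
        return "evidence of no clinically relevant effect"
    if lo > relevant or hi < -relevant:
        # the whole interval sits among clearly meaningful effects
        return "evidence of a clinically relevant effect"
    return "too uncertain: the interval mixes relevant and irrelevant effect sizes"

# Both intervals contain 0, so both earn "no evidence" under the usual habit,
# yet they read very differently against the backdrop (endpoints are made up).
print("A:", interpret_interval(-0.35, 0.40))
print("B:", interpret_interval(-0.60, 2.30))
```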
People have been thinking about this problem for a long time
People have been thinking about and trying to address this problem for a long time. I’m talking about it here and trying to describe it in my own words because it’s still out there in real life! And the confusion is real.
There are recommendations for approaches that try to deal with this problem (e.g., equivalence testing, Kruschke’s region of practical equivalence (ROPE), Bentsky’s discussion here, my previous post here, as well as many other commentaries). Note that all of these (as far as I can tell) rely on a commitment to developing a meaningful backdrop. And again, the point of this post was not to present alternative methods to better distinguish between intervals A and B, but to focus on how the statement of “no evidence” is interpreted in scientific sounding reports.
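For readers curious what the equivalence-testing route looks like in practice, here is a rough sketch of the two one-sided tests (TOST) idea in Python, with simulated data and an assumed margin of ±0.5 standing in for the “not clinically relevant” threshold. It is my own illustration of the general approach, not code from any of the linked discussions.

```python
import numpy as np
from scipy import stats

def tost_one_sample(x, margin):
    """Two one-sided tests (TOST): is the mean equivalent to 0 within ±margin?

    Equivalence is declared only if BOTH one-sided tests reject, so the
    overall p-value is the larger of the two.
    """
    se = stats.sem(x)
    df = len(x) - 1
    # H0: mean <= -margin  vs  H1: mean > -margin
    p_lower = stats.t.sf((np.mean(x) + margin) / se, df)
    # H0: mean >= +margin  vs  H1: mean < +margin
    p_upper = stats.t.cdf((np.mean(x) - margin) / se, df)
    return max(p_lower, p_upper)

rng = np.random.default_rng(7)
x = rng.normal(loc=0.05, scale=0.5, size=200)    # simulated data, true effect near 0
print(f"TOST p-value with margin 0.5: {tost_one_sample(x, margin=0.5):.4f}")
# A small p-value here supports "evidence of no clinically relevant effect,"
# which is exactly the statement the plain null hypothesis test cannot make.
```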
Similarly, I would like to think that using Bayesian inference, and looking at a posterior interval like A, would lead to a different interpretation. While sometimes this is the case, I have witnessed (more times than I care to remember) the same habit of checking whether zero is in the interval and concluding “no evidence” or “evidence” based on that check. Many people using Bayesian methods are first trained in hypothesis testing and its associated logic and wording. The move to using Bayesian methods does not imply a shift to fully understanding or buying into the philosophy of Bayesian inference.
Finally, we have to start grappling with our over-reliance on averages and our default mode of operating as if an average applies to an individual. Our imperfect human brains can’t help but assume results stated in impressive scientific-sounding wording apply to us individually. See here and here for previous posts on this, though I feel there’s still so much more to say on the topic. I worry about it, a lot.
Responsibilities and recap
At least for today, I see the confusion around statements of “no evidence” coming from the combination of two things: (1) negative effects of automatically applying hypothesis testing logic and wording, and (2) real difficulties in distinguishing between “no information” and “no evidence.” This isn’t a trivial issue affecting only students in Intro Stats. It’s a real problem we should all be struggling with. It’s one of those things that, after you become aware of it, will be in your face in every direction you turn — particularly in this time of the global pandemic. We must add more context and information to be able to make sense out of the phrase “no evidence.” It’s incredibly important for how statements sent out to the public are interpreted and internalized — and acted upon!
Scientists and writers have a responsibility to try to think about what is implied by their wording — not only whether it is technically correct according to dry rules applied out of context. Of course we cannot be expected to read the minds of others and predict all interpretations. But, there is a middle ground. We can recognize when wording is justifiably confusing and when there are obvious misinterpretations. We can recognize when we’re taking the easy way out and not presenting enough information and context. It’s too easy to keep falling back to textbook-justified wording. I have certainly been guilty of this — before I took the time to think through the real consequences it can have on our ability to make decisions.