As president of the American Statistical Association in 2019, Karen Kafadar appointed a task force to “clarify the role of hypothesis tests, p-values, and their relation to replicability” – as described in her recent editorial in the Annals of Applied Statistics (where she is editor-in-chief). The editorial directly precedes the outcome of the task force – the statement titled “The ASA President’s Task Force Statement on Statistical Significance and Replicability” (of which Kafadar is a co-author). The statement itself describes the establishment of the task force to “address concerns that a 2019 editorial in The American Statistician (an ASA journal) might be mistakenly interpreted as official ASA policy. (The 2019 editorial recommended eliminating the use of ‘p<0.05’ and ‘statistically significant’ in statistical analysis.)” The authors go on to more specifically identify the purpose of the statement as “two-fold: to clarify that the use of P-values and significance testing, properly applied and interpreted, are important tools that should not be abandoned, and to briefly set out some principles of sound statistical inference that may be useful to the scientific community.”
My prior expectations
I admit I had low expectations for the broad usefulness of whatever would eventually come out of the task force – not because of the task force itself, but perhaps mainly because of the implied value placed on producing a unified statement in the end, along with the reasons for the task force’s creation. But, I also thought it was a well-meaning and probably benign exercise. I’m familiar with the work of a few of the task force members, and of course could have done more research on others, but I wasn’t involved with and didn’t care to know what hidden agendas and politics were lurking behind the scenes – I’m no longer naive enough to think there weren’t any, but my views here are not wrapped up in anything of the sort. When I originally heard about the task force, I actually had a positive view of the motivation and point behind it (perhaps naively), thinking “it couldn’t hurt” and that it was a good thing to continue these difficult conversations and acknowledgements of disagreements and challenges. My views here are just my thoughts after reading the statement and editorial, and before reading the opinions of others.
I would have loved to be part of the discussions, or even just to have overheard them – but for the disagreements rather than the points of consensus. We need to see the disagreements and discussion more than we need to see a pretty unified statement that doesn’t acknowledge the inherent messiness and nuances that create the problems to begin with. The acknowledgement of a lack of consensus was the best part of the discussion around the ASA’s Statement on Statistical Significance and p-values, including the Supplemental Materials. The task force statement does not have that feel. Kafadar’s editorial introduces it by saying “remarkable unanimity was achieved,” which is in stark contrast to the introduction to the ASA’s Statement on Statistical Significance and p-values written by Wasserstein and Lazar that highlights points of disagreement rather than unanimity: “Though there was disagreement on exactly what the statement should say, there was high agreement that the ASA should be speaking out about these matters.” There must be more to learn from disagreement than from agreement. But disagreement is uncomfortable and not what most practicing scientists want on this matter – even if it’s simply the reality of where we are.
I suspected that getting to consensus would result in vague language that would have little direct applicability for practicing scientists and statisticians still trying to make sense of all these statements and form their own opinions and philosophies for practice. It is no surprise that many statisticians and scientists see the value in p-values and statistical hypothesis tests (I’m still not sure why we call them “significance tests”) and see them as important to science if properly applied. It wasn’t clear to me why a task force was needed to further articulate this, and I guess it still isn’t (beyond feeling the need to respond to Wasserstein’s TAS editorial). I suppose I’m writing this to help myself come to some opinion about how useful (or not) the statement might be and to speculate on what the implications might be for practice. Even if not overly insightful (as I expected), will it be a benign contribution, or is there the possibility of negative consequences?
Thoughts on the editorial
I will try to stay just shy of the weeds in providing my thoughts on what’s found in the editorial and statement. The first time through, I read the statement first and then the editorial. But before writing this, I read them in the reverse (and intended) order so the editorial could serve as an introduction to the statement.
It didn’t take me long into the editorial to start feeling disappointed. I perceive recent gains in momentum and motivation for questioning statistical practice norms, and even a willingness to take hard looks at statistical foundations (by statisticians as well as scientists) – and the editorial did not inspire me (at least) in that direction. It strikes an interesting chord of almost implying that we’re doing just fine, while trying to avoid any ruffling of feathers – and I don’t follow some of the logic.
Kafadar seems to argue that statistical hypothesis tests are important because they are used as if they are important (in reports of scientific work and in the judicial system). Their current use in courts of law (and probably often inappropriately) is not a reason for their continued use there, or elsewhere. The fact that they have been relied on in the past for courts of law does not automatically imply they are important for courts of law – there is nothing about the fact of their past use that makes an argument for how helpful (or not) they actually were. In reality, I suspect they helped develop arguments in the right direction in some cases, but one has to also consider the mistakes that have been made in arguments through common misuses and misunderstandings. She seems to simply ignore this possibility. Overall, the use of statistical hypothesis tests in the judicial process is not, in and of itself, evidence of their “importance or centrality.”
She states that hypothesis tests and p-values “when applied and interpreted properly, remain useful and valid, for guiding both scientists and consumers of science (such as courts of law) toward insightful inferences from data.” It’s hard to argue with a statement that starts with “when applied and interpreted properly,” but that disclaimer also points to the substance that is missing for me – there is plenty of disagreement on what constitutes “properly” – and that is the real issue. Unfortunately, the editorial and the statement gloss over this real substance and make arguments and statements conditional on “proper” application. I’m not sure where the conditioning gets us in terms of helping the scientific (or judicial?) community grapple with what the role of different statistical methods should be in different contexts.
I am not sure I understand, or that most readers will understand, what she is referring to with “unproductive discussion” – it would be nice to have an example of discussion around the topic that she deems unproductive. I don’t disagree that there has been unproductive discussion, but I suspect my views on what is productive and what is unproductive differ a bit from hers – but that’s hard to judge. The whole sentence is “The ‘unproductive discussion’ has been unfortunate, leading some to view tests as wrong rather than valid and often informative.” To me, this sentence implies an argument for the blanket validity of hypothesis tests with no needed context. It is not the case that the tests are always valid, and just because users might “often” find them useful and informative does not mean they are actually providing the information the user is seeking.
I do agree that “A misuse of a tool ought not to lead to complete elimination of the tool from the toolbox, but rather to the development of better ways of communicating when it should, and should not, be used.” But – back to my comment about conditioning on “proper” – I don’t see this type of advice in the editorial or the task force statement. I also agree that “some structure in the problem formulation and scientific communication is needed,” but I don’t see how this is addressed either.
Kafadar then boldly asserts that “The Statement of the Task Force reinforces the critical role of statistical methods to ensure a degree of scientific integrity.” I cringe at statements like this. I think they do more harm than good – overselling a glorified and simplified view of Statistics. Sure, I agree there are times when statistical methods play a role in increasing scientific integrity, but their use certainly doesn’t “ensure” some degree of scientific integrity, and in my opinion the use of statistical methods can often be associated with practices that compromise scientific integrity. Scientific integrity is far larger and more important than anything statistical methods can insert into the process. It doesn’t help to simply state the potential positives and ignore the negatives.
Finally, I wasn’t sure how to digest the following: “I hope that the principles outlined in the Statement will aid researchers in all areas of science, that they will be followed and cited often, and that the Statement will inspire more research into other approaches to conducting sound statistical inference.” I assume by “other approaches” she means those other than statistical hypothesis tests? This makes it sound as if there aren’t many out there already, which is obviously not the case. As stated, the hope describes a worthy and lofty goal for such a task force, but it is unfortunately unrealistic given what was actually produced. The statement might be cited often, but as discussed further below I don’t think the principles are specific enough to “aid researchers in all areas of science.” And lots of citations aren’t always evidence of good work.
Thoughts on the statement
Now, to the statement itself. The first thing that will likely jump out to anyone sitting down to read the statement is its length – or lack thereof. This isn’t necessarily a bad thing, but given the nuances, different points of view, etc. regarding the topic, it surprised me – either as an impressive feat or an indication of lack of depth. What follows are my initial thoughts on the intro and each of the “principles” described in the statement. Admittedly, I focus on criticisms of words that left me disagreeing or worried about how they might be interpreted and used by practicing scientists. So before I go there, I also want to acknowledge that there are phrases in the statement that I appreciate for their careful wording. For example, “An important aspect of replicability is the use of statistical methods for framing conclusions.” They could have ended the sentence after “methods,” but including the “for framing conclusions” highlights an important, though subtle, point.
P-values always valid?
I am not one who thinks p-values (and associated statistical hypothesis tests) are evil, but I am one who is leery about their common use in practice, and I can’t help but worry about their negative contributions to science. It’s easy to say they have “helped advance science,” but do we really know that? Can we really weigh the harms and mistakes against the good? Sure, they are “valid” and useful in some settings, but I can’t go so far as agreeing with the first statement of substance in the statement: “P-values are valid statistical measures that provide convenient conventions for communicating the uncertainty inherent in quantitative results.” I could write a whole post on this sentence – I would not have been part of the unanimity had I been a member of the task force. Here are a few things: (1) Stating “P-values are valid statistical measures” says nothing of when they are or are not valid (or any gray area in between) – instead, it implies they are always valid (especially to those who want that to be the case); (2) I completely agree that they “provide convenient conventions,” but that is not a good reason for using them and works against positive change relative to their use; and (3) I don’t think p-values do a good job “communicating uncertainty” and definitely not The uncertainty inherent in quantitative results as the sentence might imply to some readers. To be fair, I can understand how the authors of the statement could come to feel okay with the sentence through the individual disclaimers they carry in their own minds, but those disclaimers are invisible to readers. In general, I am worried about how the sentence might be used to justify continuing with poor practices. I envision the sentence being quoted again and again by those who do not want to change their use of p-values in practice and need some official, yet vague, statement of the broad validity of p-values and the value of “convenience.” This is not what we need to improve scientific practice.
It is easy to say “they have advanced science through their proper application” – when one doesn’t have to say what counts as “proper application.” And it is easy to say that “much of the controversy surrounding statistical significance can be dispelled through a better appreciation of uncertainty, variability, multiplicity, and replicability” – particularly if definitions of the terms and details don’t have to be provided.
Uncertainty
The first general principle provided states that “capturing uncertainty associated with statistical summaries is critical.” It’s hard to disagree with this statement, but I am not sure how practicing scientists will be able to use this “general principle” in practice, particularly without depth of understanding around the concepts of uncertainty and variability, and lacking practice identifying and communicating about different sources of uncertainty relative to what is captured in statistical summaries. Given the difficulty even formally trained statisticians have in these areas, it’s no small feat for scientists in general – and the statement does not point to a roadmap for practice. The sentence includes a “where possible” – another disclaimer that I think could end up eliciting a feeling of relief in many readers. What is possible depends on the person and their current state of knowledge.
The statement implies the p-value is a measure of uncertainty – but I don’t think there are many people who can easily explain why the authors describe it in this way, or what sources of uncertainty it encompasses (and how) – particularly for observational studies. And if that can’t be articulated by the user, should the user be using them? I think we can agree, as a scientific community, that we (as scientists) have a responsibility to understand, clearly communicate, and interrogate our methods – convention and convenience are not adequate justification for continued use. If I can’t explain to you why a p-value is an important and useful summary measure in the context of my study (beyond that it’s convenient and expected by convention), then I shouldn’t be using it.
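To make this concrete, here is a small simulation of my own (nothing like it appears in the statement or editorial) illustrating one source of uncertainty a p-value does not account for: dependence in the data that the test’s assumptions ignore. The cluster structure, sample sizes, and variances below are arbitrary choices for illustration, written in Python with NumPy and SciPy.

```python
# Toy simulation: the null hypothesis is true, but observations share
# cluster-level effects while the two-sample t-test assumes independence.
# The p-value quantifies sampling variability under the assumed model only,
# so "p < 0.05" happens far more often than 5% of the time here.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_clusters, per_cluster = 2000, 5, 10

def clustered_sample():
    cluster_effects = rng.normal(0, 1, n_clusters)        # shared within each cluster
    noise = rng.normal(0, 1, (n_clusters, per_cluster))   # individual-level noise
    return (cluster_effects[:, None] + noise).ravel()

false_positives = 0
for _ in range(n_sims):
    group_a, group_b = clustered_sample(), clustered_sample()  # identical true means
    false_positives += stats.ttest_ind(group_a, group_b).pvalue < 0.05
print(f"Nominal 5% test rejects in {false_positives / n_sims:.1%} of null datasets")
```

Whatever one makes of the particular numbers, the point is that the “uncertainty” a p-value communicates is conditional on the assumed model, and that conditioning is exactly what tends to be glossed over, especially in observational settings.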
The heart of statistical science?
The statement describes “dealing with replicability and uncertainty” as lying at the heart of statistical science, and part of the title of this general principle is “Study results are replicable if they can be verified in further studies with new data.” Again, this is vague enough to be hard to disagree with, but it glosses over the hard stuff. What are the criteria for “verified”? How many further studies? And so on. There is a lot contained in the sentence and each reader may interpret it differently based on their own prior experiences and notions – for example, some will automatically interpret “verified” to mean that the p-values fall on the same side of a threshold, and there is nothing to elicit questioning of current norms. I agree with their list of problems with research, but I am still not sure how they are defining a “replicability problem.”
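As a quick illustration of why that threshold-based reading of “verified” is shaky, here is another small simulation of my own (again, not from the statement): a real but modest effect, studied twice with identical designs and roughly 50% power. The effect size and sample size are assumptions I picked for the sketch.

```python
# Two identical studies of the same true effect, each with ~50% power.
# Purely by sampling variation, they frequently land on opposite sides of
# p = 0.05, so "same side of the threshold" is a fragile notion of "verified."
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_pairs, n, effect = 5000, 30, 0.5   # standardized effect of 0.5, 30 per group

disagree = 0
for _ in range(n_pairs):
    p_vals = []
    for _ in range(2):               # an "original" study and one exact replication
        treated = rng.normal(effect, 1.0, n)
        control = rng.normal(0.0, 1.0, n)
        p_vals.append(stats.ttest_ind(treated, control).pvalue)
    disagree += (p_vals[0] < 0.05) != (p_vals[1] < 0.05)
print(f"The two studies disagree about 'significance' in {disagree / n_pairs:.1%} of pairs")
```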
It is also hard to disagree with “even in well-designed, carefully executed studies, inherent uncertainty remains” – of course we can’t get rid of all uncertainty. But the statement that “the statistical analysis should account properly for this uncertainty” is vague and unrealistic. What sources of uncertainty can the analysis account for, and how are those conveyed? What assumptions are we conditioning on? Again, it is easy to say we should “account properly,” but who knows what counts as “properly”? It doesn’t serve science to make simple-sounding statements like these without at least acknowledging that what counts as “proper” is not agreed upon or easy to communicate.
Theoretical basis of statistical science
The next general principle states “The theoretical basis of statistical science offers several general strategies for dealing with uncertainty.” Again – to me, this oversells what statistical science is capable of, as if it can deal broadly with the uncertainty. What sources of uncertainty can it deal with and what sources can it not? What assumptions are inserted along the way? I suspect I understand what the authors mean, but it’s important to consider what will be taken away from the words by practicing scientists more than what was really meant by the authors. What does the “principle” really offer in terms of advice to help improve science?
Thresholds and action
Another principle states “Thresholds are helpful when actions are required.” This is often a true statement in my opinion, but it again does not reflect the foundation and justification for thresholds in the context of the action or decision required. There is a difference between “are helpful” in the sense that better decisions or actions are arrived at, and “are helpful” as perceived by the user just to get a decision made, regardless of justification or quality. The sentence again seems to appeal to placing value on convenience and perceived usefulness, regardless of the foundations or theory behind it. They say p-values shouldn’t be taken “necessarily as measures of practical significance,” but the wording seems to imply that often they can be – and there is no mention of the work it takes to assess when that might be the case. I agree that thresholds should be context dependent, but there is a deeper layer to consider regarding whether it’s even appropriate to directly tie a p-value and threshold to the decision. Yes, it’s convenient, but what is the implied loss function, and does it really align with the problem? I’m starting to sound like a broken record, but again – I think I understand what the authors are after, but the paragraph leaves so much open to interpretation that a reader can easily use it to back up their preferred way of doing things.
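To illustrate the loss-function point with a deliberately crude sketch of my own (the stakes, the flat-prior shortcut, and the one-sided normal setup are all assumptions, not anything from the statement): if acting on a real effect is worth ten times what acting on a null effect costs, an explicit expected-loss calculation implies a decision threshold nowhere near 0.05.

```python
# Toy decision rule: with a flat prior and a one-sided normal test,
# P(effect > 0 | data) is approximately 1 - p. Acting on a real effect gains
# `benefit_if_real`; acting on a null effect costs `cost_if_null`; not acting
# gains nothing. Minimizing expected loss then says "act" whenever
# (1 - p) * benefit_if_real > p * cost_if_null.
benefit_if_real, cost_if_null = 10.0, 1.0          # arbitrary illustrative stakes
implied_p_threshold = benefit_if_real / (benefit_if_real + cost_if_null)
print(f"Expected-loss rule acts whenever the one-sided p-value < {implied_p_threshold:.2f}")
# Flip the stakes (benefit 1, cost 10) and the implied threshold drops to ~0.09.
```

The point is not that these particular numbers are right, but that a fixed p<0.05 rule silently fixes one particular trade-off whether or not it matches the problem at hand.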
More “properly” and rigor
Finally – the last general principle provided is: “In summary, P-values and significance tests, when properly applied and interpreted, increase the rigor of the conclusions drawn from data.” It’s hard to know where to start with this one. It repeats the dangers I have already discussed. It can easily be used as justification for continuing poor practices, because the issue is a lack of agreement or understanding of what is “proper” and what counts as “rigor.” As is, I don’t agree with such a general statement as “increase the rigor of the conclusions.” Too broad. Too big. Too little justification for such a statement. Again, I’m not sure what a practicing scientist is to take away from this that will “aid researchers in all areas of science,” as Kafadar states in the accompanying editorial. Scientists do not need vague, easily quotable, and seemingly ASA-backed statements to defend their use of current practices that might be questionable – or at least science doesn’t need scientists to have them.
Closing thoughts
The process of writing this blog post shifted my initial perspective on the statement and the editorial. I did not start out intending to be so critical (in a negative way). I started out by viewing them as well-meaning and rather benign (even if not overly helpful to practice), but now I’m not convinced they are benign. I am worried about the potential harmful consequences, though I’m sure they are unintended. I won’t pretend to have an easy solution, but providing vague statements qualified by “proper use” that make it easier for people to justify current practices by grabbing a quick citation does have the potential for harm. And this will likely come across as ASA-endorsed, thus continuing the confusion over what is officially endorsed by the ASA and what is not. I guess I’m just struggling to see the good that could come of them relative to scientific practice. But, I hope I’m just being pessimistic and missing something in the whole exercise.