Category: General

For years, I have been proposing a relatively simple strategy for better tying quantitative analysis to specific research and decision contexts — with the goal of helping researchers give less authority to statistical methods detached from context (e.g., move away from reliance on arbitrary and default statistical criteria). Through work on different projects, I have developed more explicit steps to help researchers/analysts work through the process of developing a quantitative contextual “backdrop” early in the research process. The ideas started in the context of working through sample size investigations with alternatives to power (https://critical-inference.com/sample-size-without-power-yes-its-possible/), but have evolved into what I see as an important undertaking in any project depending on a quantitative scale and some form of estimation (whether sample size justification is needed or not). The process can also be particularly important if practical decisions based on the results are high-stakes (politically, financially, etc.), as it lays out a framework for decision making in the face of uncertainty before results are in; ideally it can motivate early conversations among stakeholders with potentially opposing values or viewpoints and lead to less conflict around interpretation after data analysis.

The term “backdrop” is borrowed from theater. As part of a play, the backdrop is a picture that hangs behind the action to provide meaningful contextual reference. For quantitative research, the action is study design, analysis, and interpretation — and the picture created in the planning stages of research hangs behind and provides structure and justification along the way. The actual backdrop is very simple (a colored number-line), but the process of developing it is surprisingly challenging — partly because researchers just haven’t been asked to do something like this before.

The following document outlines, and attempts to justify, steps in the process of creating a backdrop. It is a work in progress that I have decided to share publicly before it is “done” — and ultimately is meant to accompany a longer paper (that has been in progress for a few years) and helpful examples.
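
To make the picture a little more concrete, here is a minimal code sketch of what a backdrop-style colored number-line could look like. To be clear, the scale, region boundaries, colors, and labels below are placeholders I made up purely for illustration; working out defensible regions for a specific research and decision context is exactly the hard part the process above is meant to support.

```python
# A minimal sketch (not the actual process described above) of a "backdrop":
# a number line for a hypothetical estimand, with shaded regions whose
# boundaries and labels are placeholders chosen purely for illustration.
import matplotlib.pyplot as plt

# Hypothetical scale: difference in mean outcome (units invented here)
regions = [
    (-2.0, 0.5, "#d9d9d9", "practically negligible"),
    (0.5,  1.5, "#fdd9a0", "gray zone"),
    (1.5,  4.0, "#a6d8a8", "practically meaningful"),
    (4.0,  6.0, "#f4a6a6", "implausibly large?"),
]

fig, ax = plt.subplots(figsize=(8, 1.8))
for lo, hi, color, label in regions:
    ax.axvspan(lo, hi, color=color)                      # shaded region
    ax.text((lo + hi) / 2, 0.5, label, ha="center", va="center", fontsize=8)

ax.set_xlim(-2.0, 6.0)
ax.set_yticks([])
ax.set_xlabel("difference in mean outcome (hypothetical units)")
ax.set_title("Backdrop sketch: context-based regions on the estimand's scale")
plt.tight_layout()
plt.show()
```

In practice, the regions and their justifications would come out of early conversations among researchers and stakeholders, not from a plotting script; the plot is just the picture that hangs behind the rest of the work.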

The soup we are cooked in

February 10, 2023 | General | No Comments

I am still spending a fair amount of time reading, listening, and thinking about various forms of therapy (very generally interpreted as practices/approaches to help us understand and be at peace with our own minds — including things like mindfulness practices, meditation, somatic experiencing, psychotherapy, etc.). The time provides intellectual pleasure and of course has obvious benefits for my life more broadly – and for the lives of those around me (particularly my kids!). As I’ve expressed before, there’s a fairly constant series of collisions between that “work” and my thoughts about how we humans “do science.” Processing the collisions continues to help me understand and put words to my long-term discomfort with the attractive cloak I see us often laying over an inherently messy process of science.

Photo from Pexels.

How do we end up in the habits of mind and relationships with others that we do? How do we end up in the scientific habits of mind and places within scientific institutions that we do? I love the phrase used by Brad Reedy (in his book The Audacity to Be You and podcasts): the soup we are cooked in. The soup metaphor conveys something deeper and more complex than usual about how our experiences simmer and seep into what feels like our inherent way of being in the world – the amalgamation of all that now just feels like us. The metaphor is not of an external environment passively leaking into an organism, or of spices being sprinkled on top, but instead one of being rolled around and softened at a boil, and then simmered for long periods of time with ingredients that quickly become unrecognizable alone. Flavors from all the ingredients seep into each other. It’s impossible to recognize a carrot as it was before it went into the soup — it is now part of the soup and the essence of that soup in that carrot, and the carrot in the soup, cannot be erased. The many flavors cannot be separated; everything in the pot becomes part of everything else in the pot — even if something is removed.

We are all humans who were cooked in a unique soup – and most of us can see that on a personal level, even if we haven’t spent a lot of time reflecting on the processes contributing to our own soup. If we think ours was a simple recipe of common ingredients, we are likely very wrong. We can see it on a personal level, but are we able to see it in our roles as scientists? (I am assuming you have some scientist-like role even if you don’t identify as a scientist professionally – we are all consumers of science on a daily basis). In my experience, I see very little awareness of the science-related soup we are cooked in among scientists, and even less reflection or deep curiosity around it. Even admitting there is a soup that scientists are cooked in feels a little unscientific — according to the soup we were cooked in, that is. The soup-related work has largely been handed over to philosophers of science to do from the outside (who were also cooked in an academic soup), rather than taken on by the practicing scientists themselves. Some might refer to the science soup as a Kuhn-like “paradigm” – and while that fits to a degree, I am trying to get at something deeper that includes the uncomfortable fact that we cannot separate our personal soups from our scientist soups. There are so many aspects of culture and values that I don’t think are often considered part of a “scientific paradigm,” not to mention an individual’s way of interacting with the world (both human and other).

Looking at the soup we’re cooked in is a natural part of personal therapy. Many of us have had the experience of how hard it can be to actually gain enough perspective, and courage, to put a ladle deep into the soup, stir it around, and sip a little at a time to try to understand why it looks and tastes the way it does. It’s a process of starting to understand, as much as possible, why we might keep doing the same things over and over again — without much awareness of doing them. It is a process of questioning our motivations and our reactions, interrogating all the implicit and explicit assumptions underlying how we live and interact with others, and letting go of being “right.” It’s a serious blow to the ego – and that’s really the point. And this leads me back to science.

Photo by Jamie Diaz (Pexels)

Doing science still has a too-good-to-be-true vibe. The soup is served with fancy garnishes, but there is a fear of peering under the lid of the pot. It might upset too much of the status quo and the way of doing business. There is a fear it might inadvertently contribute to mistrust of science. It might expose limitations based on a culture of narratives and beliefs that are rarely questioned by individuals. This is ironic when the scientist stereotype is portrayed as valuing questioning and healthy skepticism — as long as the ladle doesn’t dip too far below the surface of the soup. I believe a deeper stir and curiosity would be a good thing for science, even if uncomfortable and even if it means giving up a little of the scientist-as-hero narrative that’s been added in large doses to the soup.

There are the obvious ingredients in our personal soups that don’t take much work to be aware of (e.g. exposure to religion as children); though being aware of an ingredient does not mean it is translated into questioning or interrogating or thinking about it as an ingredient in a complex soup. But I’m more interested in the very subtle parts of the process – the mild spices, the temperature, the type of pot, the length of time it simmers, when different ingredients are added, how often it is stirred, etc. — things that are not easy to become aware of, but can drastically change the nature of the soup. All those things that make up our experience, but that we rarely give attention to – most likely because we aren’t aware of them. Why do we do what we do? Why do we interact with the world as we do?

Back to science. What is different about doing science? Why do you do science the way you do science? Where did your beliefs about how to do science come from? We usually are aware of obvious things, like the program within which we received formal education, our advisors, and other mentors. This feels like the same level as the religion of our parents – we see that part of the soup, even if its effects are not questioned deeply. But again, I’m more interested in the subtler parts of the scientific soup. Things like valuing objectivity, believing in a materialistic truth that is discoverable, believing a scientist can overcome human biases, valuing the peer-review process, beliefs and habits about how to carry out statistical inference, valuing RCTs over experiences, etc. And even just within Statistics, the soup and its process are hardly recognized – instead presented as a garnished meal in an attractive bowl to satisfy almost anyone and keep them from thinking about how to make their own meal. But, what are the ingredients when you look deeper? Where did they come from? Who cooked the original soup and why did they? Who kept it simmering on the stove? Who adds new ingredients and who prefers the original recipe? I could go on.

It’s easy to think we don’t start getting cooked as scientists until college or graduate school, but the science culture is part of American culture. It definitely starts early — probably preschool for a lot of kids who end up becoming professional scientists, or even earlier from their parents. It has been hard for me as a parent to watch what comes home from school about how to “do science” – and definitely anything related to statistical inference. For example, at an early age we teach kids to form and latch on to hypotheses with all their might — and judge the success of an experiment (even if implicitly) by whether they “proved” or “disproved” their hypothesis. I spent most of my comments while volunteering at science fairs just reassuring kids that their projects weren’t failures if they didn’t find support for their hypotheses. I advocated for adding “limitations” sections to poster boards.

It can take an enormous amount of painful work to make sense of how the soup we were cooked in affects our lives and the lives of those who interact with us. It makes sense that few people do it. Many days I would like to go back to being blissfully unaware of the soup. What would the practice of Science look like if scientists were more willing to really look at the scientific culture soup they, and their scientific ancestors, were cooked in? It would be painful, but could it help lead to a more creative, fulfilling, unencumbered process of discovery? I think so. A kind of science therapy, or maybe therapy science — though both of those phrases could easily be misinterpreted to imply our current scientific soup applied to therapy. All these words are just expressing what I see as a need to look deeply at our practices, beliefs, and assumptions, to ask where they might have come from, and to figure out what’s worth holding onto and what’s worth letting go of. I envision a shift away from simply accepting what has been handed down and what constitutes “rigor” and “success” – and really all those ingredients that have been tossed into the soup and stirred to the point they’re not recognizable as individual ingredients anymore.

What’s in the soup matters, but what matters more is to recognize that we are not separate from a soup. No one is without a soup. It is frustrating to hear people talk as if science-related work is somehow separate from a messy soup. The scientist-as-hero narrative, to me, portrays scientists as if they weren’t cooked in a complex soup, both personally and professionally (as if we can separate those).

Here’s to recognizing there is a soup, and you would be interacting with your environment differently had the soup you were cooked in been different — and that includes the ways we do and consume science.

I enjoy listening to the hours of recorded lectures by Alan Watts (they happen to be easily accessible to me in my favorite meditation app). This quote from my morning listen motivated me to write a few thoughts about objectivity and science, not too different from things I have said before.

“So people began to think that the differentiation between mind and matter was no use. Because actually what happens in making such a differentiation is that you impoverish both sides of it. When you try to think of matter as mindless or mind as immaterial, you get kind of a mess on both sides.”

Alan Watts – at about 1:46:00 of The Power of Space in Sam Harris’s Waking Up app

“But, you see, what has enabled us to make a transition is first of all, above all I would say, two sciences — Biology and Neurology. Because through Biology and to some extent Physics — the methods that physics has shown us — that the idea that man can be an objective observer of an external world – that is not himself – so that as it were he can stand back from it and look at it and say “what is out there?” – we see that this cannot be done. We can approximately do it, but we cannot really and fully do it for the simple… for two reasons. One – the most important reason – is that the biologists will show us very clearly there is no way of definitively separating a human organism from its external environment. The two are single field of behavior. And then furthermore, to observe something, either simply by looking at it or more so, by making experiments, by doing science on it, you alter what you’re looking at. You cannot carry out an observation without in some way interfering with what you observe. It is this that we try when we’re watching, say, the habits of birds, to be sure the birds don’t notice us that we’re watching. To watch something, it must not know you’re looking. And of course what you ultimately want to do is to be able to watch yourself without knowing you are looking – ha ha. Then, you can really catch yourself, ahh… not on your best behavior and see yourself as you really are. But this can never be done. And likewise, the physicist cannot simultaneously establish the position and the velocity of very minute particles or wavicles. And this is in part because the experiment of observing nuclear behavior alters and affects what you are looking at. This is one side of it – the inseparability of man and his world, which deflates the myth of the objective observer standing aside and observing a world that is merely mechanical – a thing that operates as a machine out there. The second is from the science of neurology where we understand so clearly now that the kind of world we see is relative to the structure of the sense organ. That, in other words, what used to be called the qualities of the external world – its qualities of weight or color, texture, and so on are possessed by it only in relation to a perceiving organism. The very structure of our optical system confers light and color upon outside energy.”

Alan Watts – at about 1:47:44 of The Power of Space in Sam Harris’s Waking Up app

Objectivity is a concept that keeps coming up for me in a lot of scientific work and discussions, both implicitly and explicitly. It is a concept I’m used to being frustrated over — mainly because I often feel the lack of deep thinking around it within the scientific community. It is one of those things that budding scientists are told they should value and then they take it on without ever thinking through the difficulties and challenges and complexities of being a human and being “objective” in the process of learning about the world that we cannot separate from ourselves. What does it really mean for a human being (scientist or not) to be objective in their observations or experiments? Is it possible to be objective in the way many seem to believe is possible? Can particular methods be objective in some way that is separate from the human applying the methods? What are the harms in carrying on pretending that we are capable of some level of objectivity that isn’t really possible? There may be value in working toward an ideal of objectivity in science, but that is different than pretending we can actually reach that ideal. And should we continue to value acts of pretending over honest admissions of lack of objectivity that come from actually doing the hard and uncomfortable work of interrogating the ways we *do not* meet that ideal in our own work? These thoughts are not separate from other things I have said about the importance of scientists accepting that scientists are human too and bringing this in as an integral part of doing science and thinking about the limitations of science. I think we can greatly benefit from admitting humanness is wrapped up in science, instead of pretending we can rise out of our humanness for the sake of science.

Examined life … and scientific practice

October 2, 2022 | General | No Comments

I happen to be reading James Hollis’ Living an Examined Life: Wisdom for the Second Half of the Journey. And, as seems to happen often, I can’t help but see the connections between his descriptions and how I view the way statistical methods are often used in science today. Of course there is no real separation between how we live life and how we do science, even if much of society finds comfort in pretending there is. For myself, and especially for those areas where I have no direct connection to the science, there is a sneaky feeling of wishing that we could separate doing science from all our human challenges; a wish that we could really hang on to the “objective” and the “seeking truth” from an unbiased place. We can try for such an ideal, but I think to keep it from being dangerous, the impossibility (in most cases) of achieving the ideal also needs to be acknowledged. We give too much authority to automatic methods and precise-looking numbers without examining the why. In my experience, there just isn’t a sense of urgency instilled for the need to examine one’s own practices and current paradigms — just as for many of us, there doesn’t seem to be a need to examine life in a different way — until there is. It’s too bad we often need a crisis to hit home to get us there.

Here are a few quotes from the first couple of chapters that sent a little bolt of thought toward common statistical practices in science today:

I am not in any way suggesting that our cultural values, our religious traditions, our communal practices are wrong; that is not for me to judge. Many of those values link us with community, give us a sense of belonging and guidance in the flood of choices that beset us daily. I am saying, however, that the historic powers of such expectations, admonitions, and prohibitions are to be rendered conscious, considered thoughtfully, and tested by the reality of our life experience and inner prompting. No longer does received authority — no matter how ratified by history, sanctioned by tradition — automatically govern. We are rather called to a discernment process.

James Hollis, page 3 of Living an Examined Life: Wisdom for the Second Half of the Journey, 2018, Sounds True, Inc.

In any moment, we view the world through a distorting lens and make choices based on what the lens allows us to see, not what lies outside its frame.

James Hollis, page 3 of Living an Examined Life: Wisdom for the Second Half of the Journey, 2018, Sounds True, Inc.

Tiny in a world of giants, we reason that surely the world is governed by those who know, who understand, who are in control. How disconcerting it is then when we find our own psyches in revolt at these once protective adaptations, and how disillusioning it is to realize that there are very few, if any, adults on the scene who have a clue what is going on.

James Hollis, page 5 of Living an Examined Life: Wisdom for the Second Half of the Journey, 2018, Sounds True, Inc.

Thee uncertainty

June 5, 2022 | General | 2 Comments

I’m back for a long overdue post as I reflect on my number of years on this planet. I officially started this blog 3 years ago on this day when the name came to me on my annual birthday trail run/hike (the blog picture is from that day – a beautiful Montana May dusting of snow). It marked a hard left turn in my career path, which was followed by many unexpected sharp turns in my life path. The turns keep coming, but my relationship to them keeps changing for the better.

It is hard to sum up the last three years, but if there’s one word that works – it’s uncertainty. So much change, so much unexpected, so much joy, so much pain, and so much learning. The uncertainty explains the lack of consistent posts here, but even when I’m not posting, I credit the existence of this blog with helping me synthesize my thoughts into mental posts, and I still often record thoughts and ideas in draft posts. And, one of the things I have been forced to learn through the uncertainty is self-compassion for not publishing many posts – not an easy one for me, but very important. I now realize that most of the recent ideas I have written draft posts for hit on the theme of uncertainty in some way, so it seems a good place to jump back in.

Before I go on, I want to acknowledge that *uncertainty* is a very complicated and unsettled concept, and today I am certainly (hah!) not digging deep into foundations or historical and philosophical arguments about the meaning of uncertainty.

I have been involved in several conversations in the last few months where the term “the uncertainty” has come up; in the context of Statistics, it’s variations on phrases like “we want to quantify the uncertainty” or “get the uncertainty right.” These are said with a plain “the”, but I think it’s worth playing with the “thee” to emphasize my point because I don’t think the implicit interpretation is much different. I know I have uttered such phrases myself over the years, but I am now trying to be more aware. And I don’t think it’s a silly, picky wording issue. It matters because it can subconsciously affect how we interpret what a method is capable of, which in turn affects how results are interpreted.

Describing the goal of statistical inference as “getting the uncertainty right” sounds okay on the surface, but has implications that I worry are not good for science. I suppose it is fine to state as a goal – as long as it is clear that the goal isn’t one that can be fully met, but I don’t see that as part of the conversations. Can trusting the statement that we can “get the uncertainty right” lead to overconfidence in statistical results, giving them too much authority, or taking them to represent something they simply don’t, and can’t? Accentuating the “the” by replacing it with “thee” makes the problem more obvious.

It is attractive to quantify. It feels good to feel that we’ve captured “thee” uncertainty using our statistical methods – that we’ve taken messy uncertainty and turned it into a tidy-feeling interval (or some other summary). It not only feels good, but it’s expected. An expectation to attempt to quantify uncertainty isn’t necessarily a bad thing, as long as the specific expectation is reasonable. There is a big difference between quantifying “thee uncertainty” and quantifying “some uncertainty.” We talk as if we can get to “thee,” but we’re always really doing “some” and we don’t even do a good job understanding or communicating about what is included in that “some.” We’re usually leaving out a lot of the story, sweeping it under the rug, or more commonly nonchalantly tossing it into the closet so it’s out of sight.

A crucial question, even if we often can’t answer it easily or satisfactorily, is “What sources of uncertainty are we actually quantifying? And what assumptions are we relying on to do it?” How often do we attempt to convey what sources of uncertainty are actually captured in a standard error, interval, or posterior distribution? The complementary question is just as crucial: “What sources of uncertainty are we NOT quantifying?” This one blows up quickly for most real-world problems, particularly in an observational study setting, but that frustration is important information in and of itself.

How much is missing vs. captured in reported quantities (meant to convey something about uncertainty in another quantity) of course depends on the context, the data, the model, the question, etc. However, except for very well controlled laboratory studies with random assignment, or sampling exercises using finite and very homogeneous populations, “thee uncertainty” is really “some sources of uncertainty under a bunch of assumptions.” Somehow the latter doesn’t sound nearly as appealing, but it’s more honest and I think science would be better off if we were better at acknowledging that. When statisticians imply any and all uncertainty can be captured by our methodology, and that we can “get the uncertainty right,” we’re already sending a misleading, and in my view harmful, message. Maybe if we do a better job working with researchers to attempt to articulate what is represented in an interval, for example, the limitations of methods would be clearer, and over-statements of results would be lessened. Maybe we wouldn’t give so much authority to those attractive, tidy seeming results.
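
To make the “some sources of uncertainty under a bunch of assumptions” point a little more tangible, here is a hedged simulation sketch. The true mean, noise levels, and the single unmodeled systematic error below are all invented; the point is only that a textbook 95% interval does its job for the sampling variation it models, and stays silent about a source it was never told about.

```python
# A hedged sketch of the "some uncertainty, not thee uncertainty" point:
# a standard 95% t-interval quantifies sampling variation under its
# assumptions, but says nothing about an unmodeled systematic bias.
# The true mean, bias size, and sample sizes below are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mean, n, n_sims = 10.0, 50, 5000

def coverage(bias_sd):
    hits = 0
    for _ in range(n_sims):
        bias = rng.normal(0, bias_sd)                     # unmodeled systematic error
        x = true_mean + bias + rng.normal(0, 2, size=n)   # sampling noise, sd = 2
        lo, hi = stats.t.interval(0.95, df=n - 1,
                                  loc=x.mean(), scale=stats.sem(x))
        hits += (lo <= true_mean <= hi)
    return hits / n_sims

print("coverage with no unmodeled bias:    ", coverage(0.0))  # close to 0.95
print("coverage with unmodeled bias (sd=1):", coverage(1.0))  # well below 0.95
```

Nothing in the interval’s width warns you about the second source; that information has to come from the context, not from the formula.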

Progress doesn’t come from pretending we have harnessed uncertainty by hiding the mess and chaos driving that uncertainty. Progress comes from acknowledging the mess and our part in it, and learning to change our relationship to the uncertainty. We can’t get rid of the uncertainty, we can only change how we deal with it.

(Note: I wrote the draft of this post on May 18th, 2022)

Andrew Gelman published a post last week titled Stop talking about “statistical significance and practical significance.” Combined with previous posts he links to, he shares traps people can fall into when using logic/explanations such as “it’s statistically significant, but not practically significant.” He points out the importance of not forgetting about variation in effect sizes, conveys issues with implying larger effect size estimates are better, and says “I’m assuming here that these numbers have some interpretable scale.” These are all things I agree with — so my comment is not in disagreement with his points, but to voice my concern that we should balance them by also talking about the positive aspects of discussing practical importance. My motivation probably comes out of worry for how the post might be interpreted by those who didn’t really want to deal with justification of practical significance anyway, or were struggling to even know how to do it. As I said in my comment (copied below), I may be overreacting, but I don’t think it can hurt to put my reaction out there. The post produces an image in my head of a comic strip where the researcher is first working hard to justify what effect magnitudes might be judged practically meaningful as part of their study design, then they take a break to read Andrew’s blog for statistical wisdom, and then the next frame shows them smiling with a text balloon saying “Gelman says I don’t have to worry about practical significance!”

I am not going deep into the issues here, but just sharing my comment, which I tried to keep fairly short. I guess the main message I’m hoping to get across is that understanding and actually interpreting statistical summaries relative to the scale of the chosen measurement is important to good research. A researcher should understand what differences or changes on that scale might have relevance to practice or theory before setting out. One of my biggest frustrations as a collaborative statistician has been hearing the perception that such an exercise is unnecessary because that is believed to be the role of statistical tests. And I think I even remember feeling this way as a graduate student in another discipline before going back to school in Statistics. I believe we should have higher expectations for researchers to take this on in the design and in the interpretation of results from statistical models — after years of many letting only “statistical significance” do the judging for them. Sure, mistakes will still be made, but anything that encourages or builds expectations for doing such work is a step in the right direction in my opinion. I would love to see people arguing over how practically meaningful, or even realistic, an estimated effect (or range of effects) is — as opposed to continuing to accept an unjustified verdict based on a p-value or other default criteria (assuming people aren’t going to give up on such criteria anytime soon).

Andrew,

While I agree with your points about the potential pitfalls of “talking about statistical significance and practical significance” (as made in this and the previous blog posts you link to), I worry more about the harm in *not* acknowledging clinical/practical relevance than I do about the harm from falling into traps you describe. We have seen what the world is like when researchers are not required to explain/interpret/justify magnitudes of effect sizes in the specific context of the problem before declaring them “significant” (or not); a world in which it is easy to bypass the challenges of interrogating choice of measurement and how that choice connects to statistical parameters to be estimated in favor of handing the hard work over to “statistical significance” thresholds. I see the momentum for talking about clinical significance as a small step in building expectations for justifying practical relevance, *including* justification for why a large estimated effect is realistic and should (or should not) be trusted. I am likely overreacting here, but I worry that some people will take too much from your blog post headlines (without reading or understanding the details) and think “Gelman says we don’t have to worry about clinical relevance,” thus inadvertently giving some the perception of a free pass.

Mistakes are going to be made on either end (for large and small estimated effects), but not talking about clinical relevance isn’t going to solve that, particularly if we continue to largely rely on single studies, or even a few studies by the same research group. Encouraging discussions of practical/clinical significance can at least start to push people to not stop with “statistical significance.” I believe anything that gets researchers, or those using the outcomes of research, to have to think hard about and justify what magnitudes of an effect would have practical/clinical relevance is important to research. This goes both ways – not only having to justify why a small, but precisely estimated, effect should not be celebrated, but also having to justify why a large (probably uncertain) effect could be plausible in real life before celebrating (given the design, measurement, things that cannot be controlled for, etc.). For example, the reported effects of childhood interventions, such as the Perry Preschool program, on adult earnings always seemed unrealistically large to me, especially when considering the huge challenges in estimating such an effect in real life (RCT or not). Another small step forward in the practical/clinical relevance discussion is questioning large effects, as well as small.

So, I agree that discussion of statistical vs. clinical significance could be improved and that there are still holes one can fall in even if considering clinical significance, but I see an expectation for the discussion as a step in the right direction. I see the good outweighing the harm given current practices in many disciplines. In general, I think it is about creating an expectation for justification and explanation in context, and getting away from simply trusting a point estimate and/or p-value — the whole interval of values should be considered and interpreted, and then revisited and interpreted when results from other studies studying a similar effect come out. And, even better, ranges of clinical/practical relevance (as well as those too large to be realistic) could be specified a priori.

From the Statistical Modeling, Causal Inference, and Social Science blog – December 29, 2021

Thanks again to Andrew for motivating a post here after a bit of a dry spell. It’s a new year!
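
For readers who want a concrete feel for why “statistically significant” and “practically relevant” are different questions, here is a small simulation sketch. The effect sizes, sample sizes, and the relevance threshold of 0.5 are all hypothetical; picking a defensible threshold is precisely the contextual work I am arguing researchers should be expected to do.

```python
# A minimal sketch of why "statistically significant" and "practically
# relevant" are different questions. The effect sizes, sample sizes, and
# the relevance threshold below are hypothetical, chosen only to
# illustrate the distinction discussed above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
relevance_threshold = 0.5   # hypothetical smallest difference anyone would act on

def summarize(label, diff, sd, n):
    a = rng.normal(0.0, sd, n)
    b = rng.normal(diff, sd, n)
    _, p = stats.ttest_ind(b, a)
    est = b.mean() - a.mean()
    se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    lo, hi = est - 1.96 * se, est + 1.96 * se   # normal-approximation 95% CI
    verdict = ("entire CI below relevance threshold" if hi < relevance_threshold
               else "CI reaches practically relevant values")
    print(f"{label}: estimate={est:.2f}, 95% CI=({lo:.2f}, {hi:.2f}), "
          f"p={p:.2g} -> {verdict}")

# Tiny true effect, huge study: easily "significant", well below the threshold
summarize("tiny but precise  ", diff=0.05, sd=1.0, n=200_000)
# Larger true effect, small noisy study: the estimate comes with a wide CI
summarize("large but unstable", diff=0.8, sd=3.0, n=20)
```

The first scenario clears any significance threshold while sitting entirely below the hypothetical relevance threshold; the second has a much larger underlying effect, but a sample so small and noisy that the interval spans practically very different conclusions and should temper any celebration.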

How typical is a median person?

October 8, 2021 | General | 1 Comment

I have talked about cautions and implicit assumptions that surface around the use of averages before, but seeing a few references to the “median person” in the news a couple of weeks ago led me to write this post for the Statisticians React to the News blog. I am cross-posting it here.

***************************

Last week, references to a “median person” showed up enough times in my life to spur me to write this post. I think the use of an “average person” is still more popular, but this “median person” seems to be gaining popularity too. Regardless, as attached to a person, the concept is essentially the same – it’s an appealing way to take a collection of averages (or medians) calculated for each of many measured characteristics and conceptually construct a new individual, at least hypothetically. The question that doesn’t get asked enough — Who, if anyone, does an individual constructed in this way actually “look” like?

There’s certainly an appeal to the phrase, most obviously that it is short (in number of words), but there’s also something story-like and personal conveyed that makes it easy to digest. Would you rather read “the median voter is expected to …” or something like “we predict a person with median income, median socio-economic status, in a median-sized county, of median age will vote …” It’s easy to see why journalists, and researchers, use the shortcut. The median voter description actually conjures up an image of a particular type of person, most likely a “typical” person – which of course probably looks different to each of us based on our own biases and experiences of what is typical. Sometimes writers will provide a description, but usually it’s up to us as readers to throw a little reality check into our thinking – to ask about how the median/average person is defined and how useful the hypothetical individual really is.

The main problem comes from the natural, though potentially misleading, connection between the concepts of a median/average person and a “typical” or “common” person. Journalists may never explicitly say they are talking about a typical or common person, but there is a tendency for our brains to go there. The extent of the problem largely depends on how many characteristics are being used to construct the hypothetical person. If we’re only defining an average or median person based on one characteristic, then it can be perfectly reasonable, as long as there are many individuals in the group of interest who do fall near the average or median. But, imagine when the list of characteristics defining the person starts to get long! The median person is defined as an individual whose measurements of all the characteristics of interest are at, or very close to, the median! As the list of characteristics gets longer, the median or average person becomes rarer and rarer — such that the hypothetical person described is not at all typical or common, and might not even exist!

The story of the average pilot

Todd Rose, in his book The End of Average, uses an effective historical example to illustrate this point. In the 1940s, the United States Air Force was experiencing an issue with too many pilots losing control of their planes, which ultimately led to examining the design of the cockpits. It’s hard to imagine now, but the first cockpits were fixed in their dimensions with no way to adjust things like the shape of the seat, the distance to the pedals and stick, the height of the windshield, etc. To get the fixed dimensions used in all planes, the engineers took measurements of physical dimensions of hundreds of male pilots in 1926, calculated the averages, assumed the collection of them represented an average pilot, and then designed the cockpit according to those dimensions.

Decades later, when the cockpit dimensions became a suspect in the rate of crashes, the initial thought was that the size of the average pilot had just changed. They decided to update their average pilot dimensions with more data – measuring 140 dimensions on over 4000 pilots! Fortuitously, they hired a physical anthropologist, Gilbert Daniels, to participate in collecting and analyzing the data. Daniels had just finished graduate work in anthropology where he had measured dimensions of human hands – and he had concluded that the idea of an “average hand” was not useful because the average hand, as constructed from the collection of many average dimensions, did not resemble any individual’s hand measurements.

Needless to say, Daniels’ previous work led to skepticism about the “average pilot” idea and he used the data to do a little extra analysis. Based on just 10 of the physical dimensions thought to be most relevant to cockpit size, he defined an “average pilot” as an individual with measured dimensions falling within the middle 30% of the range of values for each of the ten dimensions (a pretty loose definition!). He then went through the individual pilots to count the number who met the “average pilot” criteria. There was general agreement among Air Force researchers that most pilots would be considered “average” based on this definition – recall that pilots, at least at that time, were already pre-selected based on their size to fit into the existing cockpits!

How do you think it came out? Well, out of the 4063 pilots, none (zero!) of the pilots met the “average pilot” criteria! Even for just three relevant characteristics, fewer than 3.5 percent had all three measurements fall near enough to the individual dimension averages. The “average pilot” just didn’t exist – the cockpit had been designed to fit no one! This ultimately led to cockpits with adjustable features, and of course eventually to all the ergonomic adjustments we have in our cars, and even bikes, today. Can you imagine trying to drive a car made just for the average person?
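
Out of curiosity, it’s easy to watch this phenomenon play out in simulated data. The sketch below is not a reconstruction of the pilot measurements; the correlation structure and the “middle 30% of observed values” band are simplifications I chose for illustration, but the qualitative collapse is the same.

```python
# A hedged simulation sketch of why an "average person" gets rare fast as
# the number of characteristics grows. Not a reanalysis of the pilot data;
# the correlation level and the middle-30% band are simplifications.
import numpy as np

rng = np.random.default_rng(42)
n_people, n_dims, corr = 4000, 10, 0.3

# Correlated "body dimensions": multivariate normal with a common correlation
cov = np.full((n_dims, n_dims), corr) + (1 - corr) * np.eye(n_dims)
X = rng.multivariate_normal(np.zeros(n_dims), cov, size=n_people)

for k in (1, 3, 10):
    lo = np.percentile(X[:, :k], 35, axis=0)   # middle 30% band, per dimension
    hi = np.percentile(X[:, :k], 65, axis=0)
    in_band = ((X[:, :k] >= lo) & (X[:, :k] <= hi)).all(axis=1)
    print(f"{k:2d} characteristics: {in_band.mean():.1%} 'average' on all of them")

# With independent characteristics the fraction would be roughly 0.3**k;
# correlation among dimensions softens the collapse, but not by much.
```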

As an aside, there is a lot of interesting history (and far reaching implications!) related to going after the concept of an average person – starting in the mid-1800s with the work of a Belgian astronomer turned social scientist, Adolphe Quetelet. The End of Average provides a nice starting place for those interested.

How big is the problem?

Is this a big problem in reading the news? It depends. How long is the list of characteristics? Do many characteristics tend to vary together, or do individual profiles (collections of measurements) tend to look very different? Do we attach too much meaning to a “median person” and can we get beyond the “typical” and “common” misinterpretation? The distinction can seem subtle between a “median person” and a “hypothetical person with all characteristics falling at median values,” but the implications in terms of what one takes away from the article may not be subtle. It’s so tempting to assume the “median person” looks close to us or the people we know.

The phrase also gives writers the easy chance to leave out the list of characteristics used in the definition, which takes away a chance for readers to have information that could be used to better criticize the work or consider how it might apply to them or the people they care about. We are usually in need of more details about what’s behind a stated result – and cute, vague shortcuts don’t do us any favors in such a situation, except to reduce the number of words we must read.

The “real median voter”

Some writers at least acknowledge the issue; this very short section called “The real median voter” showed up in the New York Times The Morning newsletter on September 29, 2021 by David Leonhardt and Ian Prasad Philbrick. The authors discuss the assumption made by those in elite U.S. political circles that a “median voter” is a “political moderate” (that happens to look a lot like those around them), and contrast that with what is actually common in the rest of the country where “this ideological combination is not so common.”  From a political perspective, the essay has some shortcomings, but I was still happy to see some explicit acknowledgment of problems with the interpretation of a “median person” — that the median person is probably not common, and that our ideas of what a median person may look like depend on who we are surrounded by.

In summary

I hope more journalists, and readers, can do a better job asking how typical or common a median or average person might really be in a particular situation. It is worth trying to separate our image of an actual person from the more appropriate, but also more boring, idea of simply reporting some prediction based on average or median values of all the characteristics, or variables, of interest – even if a person with that combination of characteristics does not exist.

Reasonable reflections from jury duty

August 25, 2021 | General | 4 Comments

I was called for jury duty yesterday and spent about 5 hours experiencing the process. I was not ultimately selected for the jury – I wasn’t even close given where I sat in the long line (3rd from the end). I’m not sure if my position near the end was based on information in the questionnaires we submitted (mainly about employment and education) or just random. Regardless, I got to listen to the juror selection process (voir dire) – and had to be ready to address the same questions myself should enough people have been excused that they ever got to me.

As someone who thinks a lot about uncertainty in science, and life, I found the questions posed by the attorneys, and the discussion around them, fascinating. There are a few things I want to remember and so decided to write about them briefly here.

p-values?!

I was expecting some discussion of making judgements in the presence of uncertainty, but was not expecting the word “p-value” to pop up in the couple hours of questioning I witnessed. Not as surprising as p-values coming up in the conversation was that a wrong definition and interpretation of p-value was provided by a potential juror (with a PhD in a natural science). I don’t really fault the scientist, but it was a very clear example of how confident people can be in their incorrect understandings of the concept – so confident that they might voluntarily repeat it several times in front of a relatively large audience under oath! They provided the court with the information that “a p-value gives the probability of a hypothesis being true.” I admit I squirmed a bit in my seat — but I wasn’t one of those being directly questioned and had to stay quiet. Plus, I couldn’t see how it would really affect anything related to the case or selection of the jurors, as I suspect that any connection between p-values and duties of a juror was lost on most, if not all, others in the room.
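
Since the misunderstanding is so common, here is a minimal simulation sketch of why that definition fails. The base rate of true nulls (90% here), the effect size, and the study size are assumptions I picked just to make the gap visible, and the fact that the answer depends entirely on those assumptions is part of the point: a p-value by itself cannot be the probability that a hypothesis is true.

```python
# A small simulation sketch of why "a p-value gives the probability of a
# hypothesis being true" doesn't hold. The base rate of true nulls, the
# effect size, and the sample size are assumptions chosen for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_studies, n, effect, p_null_true = 20_000, 30, 0.6, 0.9

null_true = rng.random(n_studies) < p_null_true
pvals = np.empty(n_studies)
for i in range(n_studies):
    diff = 0.0 if null_true[i] else effect   # true difference for this study
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(diff, 1.0, n)
    pvals[i] = stats.ttest_ind(b, a).pvalue

sig = pvals < 0.05
print("Among results with p < 0.05, fraction where the null was actually true:")
print(f"  {null_true[sig].mean():.2f}  (a long way from the small p-values themselves;")
print("   the exact number depends entirely on the assumed base rate and power)")
```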

More interesting was the set of questions from the defense attorney that prompted the scientist to give the mini-pseudo-lecture on p-values. The attorney noted the PhD’s involvement in doing research and asked about uncertainty in conclusions and how that uncertainty is typically handled – and whether they ever know anything without a doubt. The researcher first described that they use modeling, and that they report uncertainty using statistical techniques – like confidence or credible intervals — giving a nice description of not relying solely on a point estimate. The attorney then asked how they test hypotheses – which I think he meant in a general sense, but was immediately interpreted as statistical null hypothesis testing; the researcher responded by saying they don’t test hypotheses in practice and then that led to the topic of p-values. I guess my main point in telling this story is that it was an interesting example of someone trying to engage a scientist in a high-level discussion of making decisions in the face of uncertainty — and it quickly ending up in the weeds of null hypothesis tests.

Beyond reasonable doubt

This was all part of a larger period of questioning around the standard of proof to be used in the trial – centered around the concept of reasonable doubt. I’m a little embarrassed to say I think I’ve taken the phrase for granted and never thought through its implications to the extent I should have. It was fun to see the attorneys create this realization among most, if not all, the people there. What does it really mean to establish guilt beyond reasonable doubt? I certainly didn’t have an easy answer and neither did any of the potential jurors actually questioned. And, it got me thinking about the potential relevance of the term to a scientific context.

Given the vagueness of the phrase, it must be common practice for attorneys to start with a discussion about what the phrase “beyond reasonable doubt” means during voir dire. Interpretation of the phrase is clearly challenging and varies by state (to get a sense, do a web search for “what does ‘beyond a reasonable doubt’ mean?”)! The interpretation (I hesitate to call it a definition) given verbally to us was the following (or close to it):

Proof beyond a reasonable doubt is proof of such a convincing character that a reasonable person would rely and act upon it in the most important of his or her own affairs. Beyond a reasonable doubt does not mean beyond any doubt or beyond a shadow of a doubt.

From https://lawofselfdefense.com/jury-instruction/mt-1-104/

I wish it would have been provided visually as well, but as I sat there for two hours after hearing it once, I kept thinking about the implication that “reasonable” apparently refers not to the “doubt” itself, but to the person expressing the doubt. So, not only do we maybe need to assess what qualifies as “beyond” reasonable doubt, but we should invoke some assessment of what makes a person reasonable. It also seems to imply that “reasonable” people should have some very similar threshold for what constitutes “proof of such convincing character.” The interpretation thus brings in other vaguely defined terms, like “convincing character” to give people a “common sense” feel for what the standard is after — since there really is no way to explicitly and clearly define it (or I’m sure it would have been done by now!). The attorneys also focused on making the point that the term does not mean “beyond any doubt” or “beyond the shadow of doubt.” But there is no way to clearly define the line and the line must be expected to vary across individuals.

At one point, one of the attorneys even asked one of the potential jurors something like the following: “If you were in a position to uphold the standard of proof (given it was decided upon by courts), would you do so?” I admit, I was thankful to not be on the spot for that one. There are clearly problems with its vagueness and openness to different interpretations – but it’s not something that can be abandoned without an alternative that is deemed better. Does it result in mistakes being made? Sure. Can we come up with different wording for a standard that would lead to fewer mistakes in judgements? Probably, but it’s not so easy to come up with, implement, or measure and compare something like number of mistakes without knowing the truth.

Reasonable?

The conversation made me reflect on how often I rely on the term “reasonable” when discussing use of statistical inference in science. I have found it really useful, but maybe I have relied on it too much – without an adequate definition of what I mean by it. I remember being laughed at once at a conference when I used it as a suggestion for how we can refer to assumptions (rather than being “met”) – I think the person maybe thought I was joking and muttered something like “right – like we use with toddlers!” I remember being surprised by the comment – which is probably why I still remember it — and obviously it missed the mark of what I intended.

To me, there is an important difference between the questions “How reasonable are the assumptions?” and “How correct are the assumptions?” and “reasonable” captures something for me that’s difficult to capture with other words. But what do I really mean by it? I guess I connect it to sound reasoning or good judgement, but with an air of practicality. Is the reasoning good enough that others who are knowledgeable on the subject would make the same decision about its usefulness? I’m usually using it in the context of whether something seems good enough to be relied on, but not in the sense of being perfect or correct or without uncertainty. Does that really send us to the same place as burden of proof in Montana courts? Maybe it’s not that far off — in that it’s beyond any doubt that would be considered useful or that should be acted on?

I suppose I also implicitly invoke the “by a reasonable person” part of it. If the decision is one that is judged to be okay by many reasonable people (if given the justification) or represents a decision that would be made by many reasonable people, then that would lead me to saying the decision was reasonable. But, then I’m left with having to define who counts as a reasonable person. Ugh. It’s sure easy to end up in the downward spiral.

As is painfully clear in public discourse today, for most people the judgement of who counts as “a reasonable person” is probably more a statement of how much the person doing the judging agrees with the views of the person whose reasonableness is being questioned. As far as I know, there isn’t some objective measure or threshold for “reasonable” that could stand up to judgement from individuals with diverse views and backgrounds.

Clearly I don’t have answers, but I will certainly stop and think more before I use the term reasonable and try to give it a more meaningful interpretation in context when I can. But, it also seems like it’s a word that crops up to capture something that is otherwise hard to put into words – making it inherently hard to drill down through.

For fun, I briefly looked through different definitions of “reasonable” in on-line dictionaries. They didn’t offer much more insight, but did open up additional cans of worms like “what is fair?” “what constitutes sound thinking?” “who counts as a rational or just person?” etc. Here are a few:

Cambridge English Dictionary: “based on or using good judgment, and therefore fair and practical.”

Vocabulary.com: “showing reason or sound judgment,” “marked by sound judgment,” “not excessive or extreme”

Some of the “Best 16 definitions of Reasonable” as given here:

  • Governed by or being in accordance with reason or sound thinking.
  • A standard for what is fair and appropriate under usual and ordinary circumstances; that which is according to reason; the way a rational and just person would have acted.
  • Having the faculty of reason; endued with reason; rational.
  • Just; fair; agreeable to reason.
  • Not excessive or immoderate; within due limits; proper.
  • Being within the bounds of common sense.

Finally, a query into the legal definition here brings me full circle by giving the following disclaimer: “The term reasonable is a generic and relative one and applies to that which is appropriate for a particular situation.” There you have it.

I guess you get to judge if this is a reasonable post by a reasonable person. I hope you are a reasonable person to provide a reasonable opinion.

Statistical hermeneutics

August 3, 2021 | General | 2 Comments

I started this post a few months ago and it caught my attention as I was looking through drafts. I now can’t remember exactly what instigated the draft, but I’m guessing it was something in the Philosophize This! podcast pointing me to this entry by Theodore George in the Stanford Encyclopedia of Philosophy. I just read quickly through the essay and, despite the broad and varied history and use of the term hermeneutics (/hərməˈn(y)o͞odiks/), I find something very valuable in considering the term and what its implications could be relative to the study of statistical inference in science. I have alluded to this in other posts, but not said explicitly — a huge aspect of what I see as missing in the use of statistical methods in science is a meta-view, but even bigger than what meta-research often hits on (at least as discussed in a science reform context). I hadn’t before identified what I’m after as meta-interpretation, but that is where hermeneutics leads me and it seems fitting.

Here is George’s introduction to the concept:

Hermeneutics is the study of interpretation. Hermeneutics plays a role in a number of disciplines whose subject matter demands interpretative approaches, characteristically, because the disciplinary subject matter concerns the meaning of human intentions, beliefs, and actions, or the meaning of human experience as it is preserved in the arts and literature, historical testimony, and other artifacts. Traditionally, disciplines that rely on hermeneutics include theology, especially Biblical studies, jurisprudence, and medicine, as well as some of the human sciences, social sciences, and humanities. In such contexts, hermeneutics is sometimes described as an “auxiliary” study of the arts, methods, and foundations of research appropriate to a respective disciplinary subject matter (Grondin 1994, 1). For example, in theology, Biblical hermeneutics concerns the general principles for the proper interpretation of the Bible. More recently, applied hermeneutics has been further developed as a research method for a number of disciplines (see, for example, Moules inter alia 2015)

Stanford Encyclopedia of Philosophy

This leads me to thinking about a type of missing meta study that keeps us from tackling big questions about the role of Statistics within the larger scientific process. The complexities in the interpretation of statistical models, results, predictions, etc. are often downplayed and taken for granted, as well as the downstream effects on science and decision making. It’s just not often that we are hit with an explicit call to discuss and study our, often implicit, interpretations of the statements that come out of our use of statistical methods and results (statements we think of as “interpretation”). So I like the word, despite the fact that it doesn’t exactly roll off the tongue, at least at first.

George focuses on its meaning within philosophy, but I don’t see a great distance between his words in the context of Philosophy and those that could be useful in the context of Statistics:

Within philosophy, however, hermeneutics typically signifies, first, a disciplinary area and, second, the historical movement in which this area has been developed. As a disciplinary area, and on analogy with the designations of other disciplinary areas (such as ‘the philosophy of mind’ or ‘the philosophy of art’), hermeneutics might have been named ‘the philosophy of interpretation.’ Hermeneutics thus treats interpretation itself as its subject matter and not as an auxiliary to the study of something else. Philosophically, hermeneutics therefore concerns the meaning of interpretation—its basic nature, scope and validity, as well as its place within and implications for human existence; and it treats interpretation in the context of fundamental philosophical questions about being and knowing, language and history, art and aesthetic experience, and practical life.

The key sentence to me is “Hermeneutics thus treats interpretation itself as its subject matter and not as an auxiliary to the study of something else.” There are times when we focus heavily on interpretation in teaching and practicing Statistics, but it’s rare to go up a level to where we might seriously study interpretation, beyond quantitative and theoretical properties of methods. Hermeneutics would bring in the complex human and social dimensions that are carved into interpretation of statistical results in science, whether we like it or not. And that’s hard for many reasons that I think are obvious.

Here’s my first attempt at translating George’s words from the philosophy-context into a statistical context: Statistical hermeneutics concerns the meaning of interpretation of quantitative results based on statistical approaches – its basic nature, scope and validity, as well as its place within and implications for science; and it treats interpretation in the context of fundamental questions of inference, philosophy of science, history and philosophy of statistics, decision making, and scientific practice.

Maybe statistical hermeneutics can be defined more simply as something like “the study of the interpretation of results and conclusions based on statistical theory, methods, and reasoning.” I’m sure there’s a better definition waiting for us, but I think this simple one gets us somewhere.

In Statistics courses, we teach how to interpret things like estimated regression coefficients, credible/confidence/compatibility intervals, and p-values. But there is rarely a meta-interpretation layer that honors the disagreements in many of those interpretations, the unsettled foundations of how they should be used in science, and the downstream effects on science and decision making. The meta-interpretation context is probably what I find most fascinating in Statistics and also most challenging, and why I have a hard time fully hopping on board with typical interpretations and find it uncomfortable to carry out analyses of my own – I just never quite buy into the whole process, which I think mostly comes down to my being uncomfortable with how the simple explicit interpretations are ultimately interpreted on a deeper and implicit level by the humans doing the processing.
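To make the object level concrete (the thing a meta-interpretation layer would then sit above), here is a minimal sketch of the kind of output we teach students to interpret. The simulated data, the use of Python’s statsmodels, and the specific numbers are my own choices for illustration – nothing here comes from any particular course or text.

```python
# A minimal sketch of the object-level output we teach students to interpret.
# Simulated data; the "true" slope is 0.5 by construction.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2.0 + 0.5 * x + rng.normal(size=100)

model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.summary())

# The textbook interpretation stops at statements of the form:
#   "a one-unit increase in x is associated with an estimated ~0.5-unit
#    increase in the mean of y (with a 95% interval and a p-value attached)."
# The meta-interpretation layer -- what that statement is taken to mean by the
# humans reading it, and what it quietly conditions on -- is the part we
# rarely teach or study.
```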

Here is another quote from the essay that particularly jumped out at me, because its message is one I think is often missing in the context of Statistics. I also like the positive spin on the finitude of human understanding – a view I find myself strongly agreeing with.

Hermeneutics may be said to involve a positive attitude—at once epistemic, existential, and even ethical and political—toward the finitude of human understanding, that is, the fact that our understanding is time and again bested by the things we wish to grasp, that what we understand remains ineluctably incomplete, even partial, and open to further consideration. In hermeneutics, the concern is therefore not primarily to establish norms or methods which would purport to help us overcome or eradicate aspects of such finitude, but, instead, to recognize the consequences of our limits. Accordingly, hermeneutics affirms that we must remain ever vigilant about how common wisdom and prejudices inform—and can distort—our perception and judgment, that even the most established knowledge may be in need of reconsideration, and that this finitude of understanding is not simply a regrettable fact of the human condition but, more importantly, that this finitude is itself an important opening for the pursuit of new and different meaning.

There are also interesting connections to discussions about the validity of Human Science. I won’t go into the details here, but you can find them in the essay under that heading. Here’s just a taste: “In this, Dilthey’s concern is to defend the legitimacy of the human sciences against charges either that their legitimacy remains dependent on norms and methods of the natural sciences or, to his mind worse, that they lack the kind of legitimacy found in the natural sciences altogether.” There is also more on beliefs among philosophers that “modern science, despite all the methodological and technological sophistication, has failed to account for the basic epistemic foundation on which it relies.”

Things that we, as scientists, could benefit from thinking about more often, even if we might disagree.

Section 7 describes Postmodern Hermeneutics and mainly references the work of Lyotard related to the dangers and possibilities of postmodern rejection of “meta-narratives”, where “meta-narratives include, say, stories about the objectivity of science and the contribution that science makes to the betterment of society.” Lyotard sees both “danger and possibility in the postmodern rejection of metanarratives.” 

I barely scratched the surface of the philosophy of hermeneutics by reading this one essay – but I see value in continuing to dig into the concept (?) of hermeneutics. The essay makes it clear there are obvious ties to influential philosophical work related to science, but I’m not sure how much it has filtered into the worlds of practicing scientists. It seems to me that there is much that could be done to help scientists grapple with some underlying challenges with the use of statistical methods in practice – and how results are ultimately interpreted and used in the process of doing science and contributing to “understanding.”

After writing most of this post, I did a quick search to see if/how others might have combined hermeneutics and Statistics. There are some interesting looking ideas and contributions. I share a few here, but with the disclaimer that I have not yet read them carefully myself (they are now part of my anti-library for the time being). Ironically, maybe there are too many interpretations of hermeneutics to make it directly useful, but I still find myself thinking it could carry something worthwhile.

Paolo Gerbaudo on Data Hermeneutics

Robert Groves on Bayesian Statistics and Hermeneutics

Diana Taylor on Hermeneutics, Statistics, and the Repertory Grid

Graham White on Semantics, Hermeneutics, Statistics – and the Semantic Web

Herbert Kritzer on the Nature of Interpretation in Quantitative Research

Thoughts on the Task Force Statement

July 9, 2021 | General | 11 Comments

As president of the American Statistical Association in 2019, Karen Kafadar appointed a task force with the goal to “clarify the role of hypothesis tests, p-values, and their relation to replicability” – as described in her recent editorial in the Annals of Applied Statistics (where she is editor-in-chief). The editorial directly precedes the outcome of the task force – a statement titled “The ASA President’s Task Force Statement on Statistical Significance and Replicability” (of which Kafadar is a co-author). The statement itself describes the establishment of the task force to “address concerns that a 2019 editorial in The American Statistician (an ASA journal) might be mistakenly interpreted as official ASA policy. (The 2019 editorial recommended eliminating the use of ‘p<0.05’ and ‘statistically significant’ in statistical analysis.)” The authors go on to more specifically identify the purpose of the statement as “two-fold: to clarify that the use of P-values and significance testing, properly applied and interpreted, are important tools that should not be abandoned, and to briefly set out some principles of sound statistical inference that may be useful to the scientific community.”

My prior expectations

I admit I had low expectations for any broad usefulness of what would eventually come out of the task force – not because of the task force itself, but perhaps mainly because of the implied value placed on producing a unified statement in the end, along with the reasons for the task force’s creation. But I also thought it was a well-meaning and probably benign exercise. I’m familiar with the work of a few of the task force members, and of course could have done more research on others, but I wasn’t involved and didn’t care to know what hidden agendas and politics were lurking behind the scenes – I’m no longer naive enough to think there weren’t any, but my views here are not wrapped up in anything of the sort. When I originally heard about the task force, I actually had a positive view of the motivation and point behind it (perhaps naively), thinking “it couldn’t hurt” and that it was a good thing to continue these difficult conversations and acknowledgements of disagreements and challenges. My views here are just my thoughts after reading the statement and editorial, and before reading opinions of others.

I would have loved to be part of the discussions, or even just to have overheard them – mostly for the disagreements rather than the points of consensus. We need to see the disagreements and discussion more than we need to see a pretty unified statement that doesn’t acknowledge the inherent messiness and nuances that create the problems to begin with. The acknowledgement of a lack of consensus was the best part of the discussion around the ASA’s Statement on Statistical Significance and p-values, including the Supplemental Materials. The task force statement does not have that feel. Kafadar’s editorial introduces it by saying “remarkable unanimity was achieved,” which is in stark contrast to the introduction to the ASA’s Statement on Statistical Significance and p-values written by Wasserstein and Lazar, which highlights points of disagreement rather than unanimity: “Though there was disagreement on exactly what the statement should say, there was high agreement that the ASA should be speaking out about these matters.” There must be more to learn from disagreements than from agreements. But disagreement is uncomfortable and not what most practicing scientists want on this matter – even if it’s simply the reality of where we are.

I suspected that getting to consensus would result in vague language with little direct applicability for practicing scientists and statisticians still trying to make sense of all these statements and form their own opinions and philosophies for practice. It is no surprise that many statisticians and scientists see the value in p-values and statistical hypothesis tests (I’m still not sure why we call them “significance tests”) and see them as important to science if properly applied. It wasn’t clear to me why a task force was needed to further articulate this, and I guess it still isn’t (beyond feeling the need to respond to Wasserstein’s TAS editorial). I suppose I’m writing this to help myself come to some opinion about how useful (or not) the statement might be and to speculate on what the implications might be for practice. Even if not overly insightful (as I expected), should it be expected to be a benign contribution, or is there the possibility of negative consequences?

Thoughts on the editorial

I will try to stay just shy of the weeds in providing my thoughts on what’s found in the editorial and statement. The first time through, I read the statement first and then the editorial. But before writing this, I read them in the reverse (and intended) order, so the editorial could serve as an introduction to the statement.

It didn’t take me long into the editorial to start feeling disappointed. I perceive recent gains in momentum and motivation for questioning statistical practice norms, and even a willingness to take hard looks at statistical foundations (by statisticians as well as scientists) – and the editorial did not inspire me (at least) in that direction. It strikes an odd chord of almost implying that we’re doing just fine while trying to avoid ruffling any feathers, and I don’t follow some of the logic.

Kafadar seems to argue that statistical hypothesis tests are important because they are used as if they are important (in reports of scientific work and in the judicial system). Their current use in courts of law (probably often inappropriate) is not a reason for their continued use there, or elsewhere. The fact that they have been relied on in the past in courts of law does not automatically imply they are important for courts of law – there is nothing about the fact of their past use that makes an argument for how helpful (or not) they actually were. In reality, I suspect they helped develop arguments in the right direction in some cases, but one also has to consider the mistakes that have been made in arguments through common misuses and misunderstandings. She seems to simply ignore this possibility. Overall, the use of statistical hypothesis tests in the judicial process is not, in and of itself, evidence of their “importance or centrality.”

She states that hypothesis tests and p-values “when applied and interpreted properly, remain useful and valid, for guiding both scientists and consumers of science (such as courts of law) toward insightful inferences from data.” It’s hard to argue with a statement that starts with “when applied and interpreted properly,” but that disclaimer also points to the substance that is missing for me – there is plenty of disagreement on what constitutes “properly” – and that is the real issue. Unfortunately, the editorial and the statement gloss over this real substance and make arguments and statements conditional on “proper” application. I’m not sure where the conditioning gets us in terms of helping the scientific (or judicial?) community grapple with what the role of different statistical methods should be in different contexts.

I am not sure I understand, or that most readers will understand, what she is referring to with “unproductive discussion” – it would be nice to have an example of discussion around the topic that she deems unproductive. I don’t disagree that there has been unproductive discussion, but I suspect my views on what is productive and what is unproductive differ a bit from hers – though that’s hard to judge. The whole sentence is “The ‘unproductive discussion’ has been unfortunate, leading some to view tests as wrong rather than valid and often informative.” To me, this sentence implies an argument for the blanket validity of hypothesis tests with no needed context. It is not the case that the tests are always valid, and just because users might “often” find them useful and informative does not mean they are actually providing the information the user is seeking.

I do agree that “A misuse of a tool ought not to lead to complete elimination of the tool from the toolbox, but rather to the development of better ways of communicating when it should, and should not, be used.” But – back to my comment about conditioning on “proper” – I don’t see this type of advice in the editorial or the task force statement. I also agree that “some structure in the problem formulation and scientific communication is needed,” but I don’t see how this is addressed either.

Kafadar then boldly asserts that “The Statement of the Task Force reinforces the critical role of statistical methods to ensure a degree of scientific integrity.” I cringe at statements like this. I think they do more harm than good – overselling a glorified and simplified view of Statistics. Sure, I agree there are times when statistical methods play a role in increasing scientific integrity, but their use certainly doesn’t “ensure” any degree of scientific integrity, and in my opinion the use of statistical methods can often be associated with practices that compromise it. Scientific integrity is far larger and more important than anything statistical methods can insert into the process. It doesn’t help to simply state the potential positives and ignore the negatives.

Finally, I wasn’t sure how to digest the following: “I hope that the principles outlined in the Statement will aid researchers in all areas of science, that they will be followed and cited often, and that the Statement will inspire more research into other approaches to conducting sound statistical inference.” I assume by “other approaches” she means those other than statistical hypothesis tests? This makes it sound as if there aren’t many out there already, which is obviously not the case. As stated, the hope describes a worthy and lofty goal for such a task force, but it is unfortunately unrealistic given what was actually produced. The statement might be cited often, but as discussed further below, I don’t think the principles are specific enough to “aid researchers in all areas of science.” And lots of citations aren’t always evidence of good work.

Thoughts on the statement

Now, to the statement itself. The first thing that will likely jump out to anyone sitting down to read the statement is its length – or lack thereof. This isn’t necessarily a bad thing, but given the nuances, different points of view, etc. regarding the topic, it surprised me – either as an impressive feat or an indication of lack of depth. What follows are my initial thoughts on the intro and each of the “principles” described in the statement. Admittedly, I focus on criticisms of words that left me disagreeing or worried about how they might be interpreted and used by practicing scientists. So before I go there, I also want to acknowledge that there are phrases in the statement that I appreciate for their careful wording. For example, “An important aspect of replicability is the use of statistical methods for framing conclusions.” They could have ended the sentence after “methods,” but including the “for framing conclusions” highlights an important, though subtle, point.

P-values always valid?

I am not one who thinks p-values (and associated statistical hypothesis tests) are evil, but I am one who is leery about their common use in practice, and I can’t help but worry about their negative contributions to science. It’s easy to say they have “helped advance science,” but do we really know that? Can we really weigh the harms and mistakes against the good? Sure, they are “valid” and useful in some settings, but I can’t go so far as agreeing with the first statement of substance in the statement: “P-values are valid statistical measures that provide convenient conventions for communicating the uncertainty inherent in quantitative results.” I could write a whole post on this sentence – I would not have been part of the unanimity had I been a member of the task force. Here are a few things: (1) Stating “P-values are valid statistical measures” says nothing of when they are or are not valid (or any gray area in between) – instead, it implies they are always valid (especially to those who want that to be the case); (2) I completely agree that they “provide convenient conventions,” but that is not a good reason for using them, and it works against positive change relative to their use; and (3) I don’t think p-values do a good job “communicating uncertainty” – and definitely not “the uncertainty inherent in quantitative results,” as the sentence might imply to some readers. To be fair, I can understand how the authors of the statement could come to feel okay with the sentence through the individual disclaimers they carry in their own minds, but those disclaimers are invisible to readers. In general, I am worried about how the sentence might be used to justify continuing with poor practices. I envision the sentence being quoted again and again by those who do not want to change their use of p-values in practice and need some official, yet vague, statement of the broad validity of p-values and the value of “convenience.” This is not what we need to improve scientific practice.
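A toy example of point (3), entirely my own construction (the numbers and the simple z-test framing are made up for illustration): two studies can produce essentially the same p-value while painting completely different pictures of what is known about the size of the effect.

```python
# Two results with the same p-value but very different uncertainty about the
# magnitude of the effect (toy numbers, simple z-test approximation).
from scipy.stats import norm

def p_and_ci(estimate, se):
    p = 2 * norm.sf(abs(estimate / se))                 # two-sided p-value
    ci = (estimate - 1.96 * se, estimate + 1.96 * se)   # ~95% interval
    return p, ci

# Study A: tiny effect, estimated very precisely (think: enormous sample)
print(p_and_ci(estimate=0.02, se=0.01))   # p ~ 0.046, CI ~ (0.0004, 0.04)

# Study B: large effect, estimated very imprecisely (think: tiny sample)
print(p_and_ci(estimate=2.00, se=1.00))   # p ~ 0.046, CI ~ (0.04, 3.96)

# Identical p-values; wildly different states of knowledge about the effect.
```

If the p-value really were a summary of “the uncertainty inherent in quantitative results,” these two situations should not look identical through its lens.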

It is easy to say “they have advanced science through their proper application” – when one doesn’t have to say what counts as “proper application.” And it is easy to say that “much of the controversy surrounding statistical significance can be dispelled through a better appreciation of uncertainty, variability, multiplicity, and replicability” – particularly if definitions of the terms and details don’t have to be provided.

Uncertainty

The first general principle provided states that “capturing uncertainty associated with statistical summaries is critical.” It’s hard to disagree with this statement, but I am not sure how practicing scientists will be able to use this “general principle” in practice, particularly without depth of understanding around the concepts of uncertainty and variability, and lacking practice identifying and communicating about different sources of uncertainty relative to what is captured in statistical summaries. Given the difficulty even formally trained statisticians have in these areas, it’s no small feat for scientists in general – and the statement does not point to a roadmap for practice. The sentence includes a “where possible” – another disclaimer that I think could end up eliciting a feeling of relief in many readers. What is possible depends on the person and their current state of knowledge.
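As one concrete example of a source of uncertainty that a statistical summary quietly leaves out (my own simulated illustration, not something from the statement): the interval reported from any single fitted model reflects sampling uncertainty conditional on that model, while the uncertainty about which model to fit in the first place never shows up in the summary.

```python
# Two "reasonable" models fit to the same simulated data; neither interval
# carries the uncertainty about which model should have been fit.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
z = 0.7 * x + rng.normal(size=n)                 # covariate correlated with x
y = 1.0 + 0.5 * x + 0.8 * z + rng.normal(size=n)

m1 = sm.OLS(y, sm.add_constant(x)).fit()                        # omit z
m2 = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()  # include z

print(m1.conf_int()[1])   # 95% CI for the x coefficient, model without z
print(m2.conf_int()[1])   # 95% CI for the x coefficient, model with z
# The first interval sits near 0.5 + 0.7 * 0.8 ~ 1.06 (omitted-variable bias),
# the second near 0.5 -- and neither one says anything about the uncertainty
# of having chosen between the two models.
```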

The statement implies the p-value is a measure of uncertainty – but I don’t think there are many people who can easily explain why the authors describe it in this way and what sources of uncertainty it encompasses (or how) – particularly for observational studies. And if that can’t be articulated by the user, should the user be using them? I think we can agree, as a scientific community, that we (as scientists) have a responsibility to understand, clearly communicate, and interrogate our methods – convention and convenience are not adequate justification for continued use. If I can’t explain to you why a p-value is an important and useful summary measure in the context of my study (beyond that it’s convenient and expected by convention), then I shouldn’t be using it.

The heart of statistical science?

The statement describes “dealing with replicability and uncertainty” as lying at the heart of statistical science, and part of the title of this general principle is “Study results are replicable if they can be verified in further studies with new data.” Again, this is vague enough to be hard to disagree with, but it glosses over the hard stuff. What are the criteria for “verified”? How many further studies? Etc. There is a lot contained in the sentence, and each reader may interpret it differently based on their own prior experiences and notions – for example, some will automatically interpret “verified” to mean that the p-values fall on the same side of a threshold, and there is nothing to elicit questioning of current norms. I agree with their list of problems with research, but I’m still not sure how they are defining a “replicability problem.”
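To show why the choice of criterion matters, here is a small simulation of my own construction (the effect size, sample size, and test are arbitrary choices): a real effect studied with roughly 50% power gets “verified” in only about half of exact replications if “verified” means landing on the same side of p = 0.05.

```python
# How often does an exact replication of a modestly powered study "verify" a
# real effect, if "verify" means p < 0.05 again? (Toy simulation.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_effect, n, reps = 0.4, 25, 10_000    # one-sample design, roughly 50% power

def replication_pvalue():
    sample = rng.normal(loc=true_effect, scale=1.0, size=n)
    return stats.ttest_1samp(sample, 0.0).pvalue

pvals = np.array([replication_pvalue() for _ in range(reps)])
print((pvals < 0.05).mean())   # roughly 0.5 -- about half of the replications
                               # "fail to verify" an effect that is real by
                               # construction
```

Under that reading of “verified,” a real effect fails verification about half the time – and the failure says more about the criterion than about the effect.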

Also hard to disagree with “even in well-designed, carefully executed studies, inherent uncertainty remains” – of course we can’t get rid of all uncertainty. The statement that “the statistical analysis should account properly for this uncertainty” is vague and unrealistic. What sources of uncertainty can the analysis account for and how are those conveyed? What assumptions are we conditioning on? Again, it is easy to say we should “account properly,” but who knows what counts as “properly”. It doesn’t serve science to make simple sounding statements like these without at least acknowledging that what counts as “proper” is not agreed upon or easy to communicate.

Theoretical basis of statistical science

The next general principle states “The theoretical basis of statistical science offers several general strategies for dealing with uncertainty.” Again – to me, this oversells what statistical science is capable of, as if it can deal broadly with the uncertainty. What sources of uncertainty can it deal with and what sources can it not? What assumptions are inserted along the way? I suspect I understand what the authors mean, but it’s important to consider what will be taken away from the words by practicing scientists more than what was really meant by the authors. What does the “principle” really offer in terms of advice to help improve science?

Thresholds and action

Another principle states “Thresholds are helpful when actions are required.” This is often a true statement in my opinion, but again it does not address the foundation and justification for thresholds in the context of the action or decision required. There is a difference between “are helpful” in the sense that better decisions or actions are arrived at, and “are helpful” as perceived by a user who just wants to get a decision made, regardless of justification or quality. The sentence again seems to place value on convenience and perceived usefulness, regardless of the foundations or theory behind them. They say p-values shouldn’t be taken “necessarily as measures of practical significance,” but the wording seems to imply that often they can be – and there is no mention of the work it takes to assess when that might be the case. I agree that thresholds should be context dependent, but there is a deeper layer to consider regarding whether it’s even appropriate to directly tie a p-value and threshold to the decision. Yes, it’s convenient, but what is the implied loss function, and does it really align with the problem? I’m starting to sound like a broken record, but again – I think I understand what the authors are after, but the paragraph leaves so much open to interpretation that a reader can easily use it to back up their preferred way of doing things.
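To make the loss-function point concrete, here is a rough sketch with a toy setup that is entirely my own (the costs, benefits, and the simple normal approximation are invented for illustration): a rule built around an explicit cost-benefit structure can disagree with a p < 0.05 rule in both directions.

```python
# A p < 0.05 rule versus a decision rule with an explicit (toy) loss structure.
from scipy.stats import norm

def p_value(estimate, se):
    return 2 * norm.sf(abs(estimate / se))    # two-sided z-test approximation

def act_by_threshold(estimate, se, alpha=0.05):
    return estimate > 0 and p_value(estimate, se) < alpha

def act_by_expected_net_benefit(estimate, se, benefit_per_unit, cost_of_acting):
    # Treat (estimate, se) as an approximate normal posterior for the effect.
    # With a benefit that is linear in the effect, the expected net benefit only
    # needs the posterior mean; richer loss functions would also pull in se.
    return benefit_per_unit * estimate > cost_of_acting

# Tiny, precisely estimated effect: "significant," but not worth the cost.
print(act_by_threshold(0.02, 0.01),
      act_by_expected_net_benefit(0.02, 0.01, benefit_per_unit=10, cost_of_acting=1))
# -> True False

# Larger, noisier effect: not "significant," yet expected net benefit is positive.
print(act_by_threshold(0.30, 0.20),
      act_by_expected_net_benefit(0.30, 0.20, benefit_per_unit=10, cost_of_acting=1))
# -> False True
```

The point is not that this particular rule is right; it is that once the loss structure is written down, the threshold question becomes a question about the decision problem rather than about the p-value.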

More “properly” and rigor

Finally – the last general principle provided is: “In summary, P-values and significance tests, when properly applied and interpreted, increase the rigor of the conclusions drawn from data.” It’s hard to know where to start with this one. It repeats the dangers I have already discussed. It can easily be used as justification for continuing poor practices, because the issue is a lack of agreement or understanding about what is “proper” and what counts as “rigor.” As is, I don’t agree with such a general statement as “increase the rigor of the conclusions.” Too broad. Too big. Too little justification for such a statement. Again, I’m not sure what a practicing scientist is to take away from this that will “aid researchers in all areas of science,” as Kafadar states in the accompanying editorial. Scientists do not need vague, easily quotable, and seemingly ASA-backed statements to defend their use of current practices that might be questionable – or at least science doesn’t need scientists to have them.

Closing thoughts

The process of writing this blog post shifted my initial perspective on the statement and the editorial. I did not start out intending to be so critical (in a negative way). I started out viewing them as well-meaning and rather benign (even if not overly helpful to practice), but now I’m not convinced they are benign. I am worried about the potential harmful consequences, though I’m sure any would be unintended. I won’t pretend to have an easy solution, but providing vague statements qualified by “proper use” – statements that make it easier for people to justify current practices by grabbing a quick citation – does have the potential for harm. And this will likely come across as ASA-endorsed, thus continuing the issue of what is officially endorsed by the ASA and what is not. I guess I’m just struggling to see the good that could come of them relative to scientific practice. But I hope I’m just being pessimistic and missing something in the whole exercise.