
This is not a typical post. It’s part ode to my dog, who is in the late stages of hemangiosarcoma, and part commentary on information I found related to the diagnosis. This is not an update on the state of research on hemangiosarcoma — it’s just a post following the path of about an hour’s worth of my thoughts as I processed information. I’ll put a review of the state of research on hemangiosarcoma on my to-do list.

A little over a month ago, my beloved dog Ginny started bleeding internally from a tumor on her spleen. She went through a few bouts of lethargy over a few weeks before the diagnosis was made. I made the quick decision to have her spleen + tumor removed and she made a rapid recovery to her usual self. For a lovely month, we joked that her spleen was just weighing her down. She didn’t miss it at all and we didn’t have to miss her.

Ginny in 2016

The surgery was on a Tuesday and we got biopsy results back early the following week — visceral hemangiosarcoma (HSA). I don’t know why I didn’t look up more information at the time or ask more questions of the vet. In retrospect, it’s rather unlike me, but I didn’t want to dwell on what I couldn’t control. I wanted to spend quality time with my best friend. The vet who gave me the news simply said that the cancer was “in her blood” and she would have symptoms again at some point — it could be in months or years and there was no way to know. It might have just been what I wanted to hear, but I feel she emphasized the years option more than the months option. She said we would do check-ups every 6 months, instead of every year. She said to look out for signs of liver disease, but she didn’t say (that I remember) to stay on the lookout for more internal bleeding. And she didn’t tell me that dogs with this type of cancer who aren’t treated with chemo typically survive only a few months — it’s pretty rare to go a whole year or more. I’m not saying this to criticize or blame — it’s just what happened. I trusted and interpreted implicit messages how I wanted to and chose not to ask more questions. It seems I have a habit of trusting too much in personal matters. And, there’s the interesting aspect of clinicians simplifying things for people with little science background — she isn’t someone I had enough of a relationship with for her to know I’m a statistician and fully capable of digesting and evaluating information about uncertainty in outcomes — but that’s a bigger topic for another day.

Ginny was just shy of her 10th birthday at the time — and we celebrated her 10 years in this world on June 24th. We adopted her from a local animal shelter when she was 4. The six years went by so fast. She made our lives richer and brought daily joy to the family. She’s an incredibly perceptive dog, always in tune with our emotions and determined to always be with us. She looks deep into my eyes with her wise, beautiful brown ones. She’s smart and very stubborn. She’s my first dog with extreme facial hair — and I have grown to love it so much. I love her bed-head when she wakes up in the morning and I love smoothing it out for her. She’s not soft to the touch, but she doesn’t need to be.

It turns out we had weeks, not months… and certainly not years. Early last week she got lethargic and weak again, and I knew … even though I didn’t want to know. I tried to blame it on a rawhide we had given her, but my gut took us back to the vet — where they confirmed she was bleeding internally again. I knew there wouldn’t be options, beyond keeping her comfortable with medication and trying some medicine to help clotting and reabsorption. After a night of thinking we might lose her, she slowly got better over a few days — some really good hours turned into a few really good days. She even took a couple of steps toward chasing a bunny, she chased a magpie out of the yard, she played with a little stuffed horse she found that makes neigh sounds, and she had a couple of nice walks with wading in creeks and rivers.

I wrote most of this post last week, but never found the time to finish it up and post it — probably for a reason. This morning she’s lethargic again and her gums are pale again. The looks she gives me are hard. I’m sitting on the floor next to her while she rests. Maybe the bleeding will stop again and maybe it will not. It is a lesson in accepting and living with uncertainty.

This post really is more about her than hemangiosarcoma. But, I’ll still share thoughts that came up as I looked through some information. It helps things sink in a little more, as I start to learn and wonder more about this type of cancer.

[Some thoughts on information I found]

This fact sheet by the American College of Veterinary Internal Medicine (ACVIM) seemed like a good place to start to me:

  • Cancer of the endothelial cells that line blood vessels
  • More common in dogs than in any other species [interesting…]
  • Primary sites it shows up are spleen, liver, heart, and skin [Ginny got spleen first]
  • Internal tumor symptoms: unexplained weight loss, bulging belly [Ginny had], decreased exercise/stamina [Ginny had], lethargy/sleeping more [Ginny had], decreased appetite [Ginny had only when feeling really bad], increased panting, pale gums [yep], weakness [yep], cough [nope], and collapse [nope].
  • “There is not currently a perfect blood screening test for HSA, though one has been developed and investigators are working on refinement of our understanding of how to use such a test.” [Interesting — we did not get that option and went with the biopsy after removing the tumor on her spleen.]
  • “it is rare when patients with spleen HSA are cured following surgery to remove the spleen as tumors that arise in that site are usually associated with metastasis (spread of tumor cells from primary site via the blood stream to new locations such as lung). This metastasis occurs even if there is no evidence of secondary tumor sites at the time of surgery. The average survival prognosis for patients with spleen HSA following surgery alone is approximately two months, with only 10% survival at one year. The average survival for dogs with spleen HSA treated with surgery and chemotherapy is improved at six to eight months and patients typically experience an excellent QoL during treatment.” [This one is interesting, and clearly relevant. I actually really appreciate that the vet did not just hand me the average survival prognosis! She made it very clear that the prognosis was uncertain and could be a couple months to a few years. She focused on variability rather than average. I do find it somewhat ironic though that despite my often loathing averages, it looks like my dog is pretty close to average (only on this outcome of course). We weren’t offered and didn’t consider chemotherapy.]

In my hour of looking and reading, here’s the second source I took the time to read and a couple of quotes:

  • “The reported median survival times for dogs with splenic hemangiosarcoma treated only with surgery are 19-86 days. Nevertheless, patients who do undergo surgery tend to feel better in the short term.” [I can attest to the feeling better and Ginny was in the 28 day range]
  • “Chemotherapy after surgery is often recommended because hemangiosarcoma is highly malignant and readily metastasizes. Indeed, this cancer has typically already spread at the time of diagnosis. Many different chemotherapeutic agents have been investigated to treat hemangiosarcoma. Use of the drug doxorubicin is associated with longer survival times. The reported median survival times for splenic hemangiosarcoma treated with surgery and doxorubicin-based chemotherapy is 141-179 days.”

I then found this trial at University of Minnesota, but…

For a few minutes, it made me wish I had done immediate research, but really I am at peace with where we ended up. I don’t know what chemotherapy effects might look like and there is a financial reality, even if that’s hard to admit.

And finally, here’s a 2018 paper describing a Brazilian study comparing outcomes from surgery vs surgery + doxorubicin (the same treatment being looked at in the University of Minnesota study). The analysis uses survival data from 37 dogs who were treated between 2005 and 2014. It’s retrospective and they do a good job of making that clear. Of course, there’s “significance” all over the place and it’s not clear if they mean statistical significance or practical or both — but I think they make the common mistake of assuming statistical significance implies practical importance. And, it’s hard for me to stay engaged with papers when I get to paragraphs like this:

But to be fair, there’s a lot of good stuff in the paper and I don’t blame the authors specifically — they’re doing what they’re expected to do within the system they work in. But, when will expectations shift to require real justification of the reasonableness of p-value cutoffs? What will it take? Why does it not make scientists uncomfortable that most people using p-value cutoffs probably could not provide a reasonable justification for their choice if forced to? But, for today, I can forgive the cutoffs… I just really want to see the raw data displayed clearly and with relevant information. I mean, there are only 37 dogs at the beginning… it should not be a huge task. Just show me the data. But no such luck.

They do show Kaplan-Meier curves and give estimates of median survival time for the two groups, trying to take into account censoring for dogs that were still alive at the end of the study. But no raw data plots, or even tables, just some descriptions in the text. It is my experience that we don’t teach creative data visualization in data analysis courses and many researchers do not have the computing skills to make non-automatically generated plots from raw data. Again, a fault of the system and norms — but I can’t help but wonder how they could not really want to see their data. But maybe that’s just because I love looking at raw data and thinking about creative ways to display relevant information with it (like treatment (obviously), age, severity, etc.).

In this case, of the 37 total dogs included, the paper states that 11 were alive at time of data analysis and 2 were lost to follow-up. Twenty-three of the 37 dogs were treated with surgery only and 14 with surgery + the chemotherapy. It’s not explicitly stated (that I saw), but looking at the stats, it appears that all 11 of the dogs still alive at time of data analysis were in the chemo group. That would be some important information!? What survival times are actually going into the 274-day summary statistic? What about the two lost to follow-up that were censored? Were they treated in the same way as dogs who were still alive? Instead of offering an intuitive discussion of the raw data, as the information going into the model, it’s all about default and automatic results from the survival analysis. I just want raw data… is that too much to ask?
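For anyone who hasn’t thought much about where those censored dogs enter the calculation, here is a minimal sketch of a Kaplan-Meier estimate using made-up survival times (not the paper’s data, since they don’t report individual values). Dogs still alive, or lost to follow-up, stay in the risk set until their last known day and then drop out without ever counting as a death.

```python
# Minimal Kaplan-Meier sketch with hypothetical data (NOT from the paper).
days = [30, 45, 60, 90, 120, 200, 274, 300, 400, 500]  # last known day for each dog
died = [1,  1,  1,  1,  1,   1,   0,   1,   0,   0]    # 1 = died, 0 = censored (alive / lost)

def kaplan_meier(days, died):
    """Return (time, estimated survival probability) at each observed death time."""
    at_risk, surv, curve = len(days), 1.0, []
    for t in sorted(set(days)):
        deaths = sum(1 for d, e in zip(days, died) if d == t and e == 1)
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, round(surv, 3)))
        at_risk -= days.count(t)  # deaths AND censored dogs leave the risk set here
    return curve

curve = kaplan_meier(days, died)
median = next((t for t, s in curve if s <= 0.5), None)
print(curve)
print("estimated median survival:", median, "days")
```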

Here are a couple of relevant results paragraphs:

They talk about clinical stage, type of tumors, and give the numbers of dogs in each of these groups. They even get their p-value comparing estimated survival time for different stages, but they don’t take into account treatment in that analysis. That is, the analysis looks at potential questions separately (e.g., differences in stages are completely separate from differences in treatments) — making the very strong implicit assumption that there is no interaction between the stage of disease and effect of treatment (or choice of treatment). There may not be enough data to estimate the interaction, but that doesn’t mean it should unquestionably be ignored in the analysis or discussion.
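To make that concrete, here is a toy example with entirely hypothetical numbers (not from the paper) where the apparent benefit of adding chemotherapy depends heavily on stage, which is exactly the kind of pattern that analyzing stage and treatment in separate analyses would never reveal.

```python
# Hypothetical median survival times (days) by stage and treatment -- made up
# purely to illustrate an interaction, not taken from the study.
medians = {
    ("stage I",   "surgery only"):     150,
    ("stage I",   "surgery + chemo"):  400,  # chemo appears to add a lot here
    ("stage III", "surgery only"):      40,
    ("stage III", "surgery + chemo"):   55,  # ...and very little here
}

for stage in ("stage I", "stage III"):
    gain = medians[(stage, "surgery + chemo")] - medians[(stage, "surgery only")]
    print(f"{stage}: chemo adds ~{gain} days to median survival")

# A stage-only comparison (pooling over treatment) or a treatment-only
# comparison (pooling over stage) averages these very different effects
# together -- and can easily mislead if, say, sicker dogs were less likely
# to receive chemo in the first place.
```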

I appreciate that they gave the fractions of dogs in different categories and not just the percentages. At least that’s some adherence to the raw data.

They acknowledge that the “absence of a significant difference” among primary organs “… may represent a false-negative due to a type II error related to the small sample size of the study.” It would be interesting to look into how often the possibility of type II errors is brought up in papers when large p-values occur vs. the possibility of type I errors when small p-values occur. Just a thought. They even say “it is not possible to establish definitive conclusions on the importance of primary tumor locations of HSA based on our study.” It is just 37 dogs that are likely pretty heterogeneous in many respects.

If I can get beyond the usual issues I have reading scientific papers that overly rely on automatic, default-like use of statistics, I do get useful information from this one. It does appear — and I could get this without any statistical modeling — that survival times were generally longer for the dogs whose owners chose the chemotherapy option after surgery. The raw data would be helpful! Eleven of the dogs were still alive at the time of data analysis, so really they would just have to give me 26 numbers and some text about characteristics of the still alive dogs. Even without the skills to make a cool plot, it could easily all go into an ugly (but useful) table. I can look at the Kaplan-Meier curves and pull out a lot of information, but should I have to?
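For what it’s worth, here is a rough sketch of the kind of bare-bones raw-data display I keep asking for, with invented survival times since the paper doesn’t report the individual values (filled points are dogs that died; open points are dogs still alive, i.e., censored, at the time of analysis):

```python
# Sketch of a raw-data display using invented numbers (the paper does not
# report individual survival times). Filled = died; open = censored.
import matplotlib.pyplot as plt

groups = {
    "surgery only":    [(19, 1), (30, 1), (42, 1), (60, 1), (86, 1), (120, 1)],
    "surgery + chemo": [(90, 1), (141, 1), (179, 1), (250, 0), (300, 0), (400, 0)],
}

fig, ax = plt.subplots(figsize=(6, 2))
for y, (label, dogs) in enumerate(groups.items()):
    for days, died in dogs:
        ax.plot(days, y, "o", mfc="black" if died else "white", mec="black")

ax.set_yticks(range(len(groups)))
ax.set_yticklabels(list(groups.keys()))
ax.set_xlabel("days after surgery (open circles = still alive at analysis)")
plt.tight_layout()
plt.show()
```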

Finally, I do appreciate the last paragraph. I think it is humble and honest and doesn’t try to oversell anything. Overall, my take is the authors did a good job within their understanding and skills relative to data analysis and the system they are operating in.

As I said in the beginning, this post is not meant to be a thorough or rigorous review of hemangiosarcoma research. I just provided a running view of my thoughts as I spent an hour looking up information. Am I being overly picky and critical of the paper I found? Maybe. But, most of the time I don’t think so. To me, the point is to consider what the people making care recommendations and decisions are getting out of such an article. That is — what is the veterinarian getting out of it? What’s the take-home message? Likely the take-home message would be that chemotherapy is helpful and should be recommended if the dog’s humans are open to it and can afford it. And maybe that’s the right choice, but this paper doesn’t provide as much support for that as I think a vet would conclude. Vets are busy treating our beloved dogs and shouldn’t be expected to have the time or specific expertise to dig into how trustworthy a data analysis is – to think about the information in the data vs. the information inserted through modeling assumptions. They may not understand the issues with p-values and the assumptions underlying estimates from a Kaplan-Meier survival analysis — so they are put in a situation to trust. I guarantee they could easily understand a plot (or even table!) of the raw data though — particularly one that codes by stage of cancer, breed, etc. In my opinion, no fancy statistical modeling was needed here. We just needed to see the raw data in ways that are easy to digest and include as much relevant information as possible. But, statistical software packages automatically spit out Kaplan-Meier plots; they don’t automatically spit out creative and useful plots of the raw data. So, here we are. We need statisticians whose job is creating plots of raw data for people — I would take that job, as long as I don’t have to do needless modeling for the sake of modeling.

Thank you

And finally… I have to do it. Thank you Ginny. Thank you for all the support, the laughs, and the general joy you brought me over the last six years. You have been my constant. I hope we gave you as much. Regardless of estimated survival times and raw data, I just can’t imagine life without your wise, caring, deep brown eyes keeping an eye on me.

Like many of you — I’ve had enough of 2020.

References

Batschinski, K., Nobre, A., Vargas-Mendez, E., Tedardi, M. V., Cirillo, J., Cestari, G., Ubukata, R., & Dagli, M. (2018). Canine visceral hemangiosarcoma treated with surgery alone or surgery and doxorubicin: 37 cases (2005–2014). The Canadian Veterinary Journal, 59(9), 967–972.

https://vetmed.umn.edu/centers-programs/clinical-investigation-center/current-clinical-trials/pro-dox-propranolol-and-doxorubicin-dogs-splenic-hemangiosarcoma

http://www.acvim.org/Portals/0/PDF/Animal%20Owner%20Fact%20Sheets/Oncology/Onco%20Hemangiosarcoma.pdf

Oh, the words we use

June 16, 2020

On a road trip this morning, I decided to catch up on some science and statistics related podcasts. Maybe it was the little break I took from listening to them, but it wasn’t long before I was cringing. The implicit messages sent by statisticians and other scientists to a broader audience play a part in perpetuating and contributing to misconceptions and misinterpretations related to statistical inference… and therefore many scientific results.

The goal of simplifying concepts of statistical inference for a general audience is an admirable one. But it does not come without mistakes and unforeseen consequences. There are many phrases that are rampant among scientists who rely heavily on statistical methods in their work, and even among statisticians. Some statisticians are working hard to counter misinterpretations and misused language, but it will take an awful lot of work to counter the subtle statements fed to the public at a high frequency. And the subtlety of the words makes it more difficult. It’s more about things implied, rather than things explicitly wrong. We home in on things like p-values, but we rarely talk about the more subtle language used to convey statistics-related concepts within science communication. That’s what made me cringe during the podcasts.

I don’t listen to as many science and statistics related podcasts as I would like to — mostly because of the cringing. The feelings of frustration, or at least feelings that the scientific community is making big mistakes in communication, creep in around the edges and steal my attention — and any enjoyment along with it. It’s probably not fun to be in the car with me at those points, though I’m getting better at transferring the energy into writing instead of voicing my opinions out loud to my family in a less than pleasant tone. Sometimes, I just turn it off. But, I realize that’s no solution — it just keeps me from noticing and learning.

Simplification for communication

In this post, I briefly call your attention to a few phrases that came up in the podcasts I listened to this morning. The goal is just to bring your attention to them, not to do a deep dive into meanings and implications. It’s important to keep in mind that they come from a place of wanting to communicate effectively to a broad audience — they do not come from any blatant desire to mislead. They are meant to make answers or explanations accessible to those with no formal background in statistical theory and foundations.

The goal is to convey complex concepts in simple terms and few words. There is nothing wrong with this goal, but simplifying can be dangerous. It means we’re presenting words that don’t quite capture the truth — the words shave off corners — but we hope they don’t shave off enough to do harm. There are very few statisticians and other scientists who escape this problem if they are trying to communicate about their work, as they should be. I know I am guilty and I’m sure I will be guilty again in the future — even when I’m thinking hard about it! There will always be phrases I haven’t thought through enough before they come out of my mouth.

Opinions will always vary about the severity of the problem and how much harm certain phrases might actually be causing. I believe the harms are bigger than we like to think they are, or at least we should start from that assumption. I think benign sounding phrases have contributed greatly to misunderstandings of the role of Statistics in science, and will continue to do so.

Whether or Not

The phrase “whether or not” is one that comes up over and over again. It can show up with many words around it, but the meaning is generally in line with the statement: “Statistics allows us to conclude ‘whether or not’ there is a real effect.” In one of the podcasts from this morning, the statement (by a prominent statistician) described the purpose of a clinical trial as determining whether or not evidence exists to say that there is a real difference.

On the surface, the statement is simple enough and accessible, but assessing the potential harms of the statement requires trying to put ourselves in the shoes of those who don’t have the knowledge to understand the simplification. Interpretation comes from their understanding of the words in their typical use, not in their understanding of statistical inference. In this phrase, there are some big words whose meanings are pretty clear. We have the “whether or not” — implying a yes or no answer and the ability of research to distinguish between black and white. It ignores the gray area where most of the work lands and implies either “evidence exists” or “evidence does not exist”. It ignores considering evidence as measured on a continuum. And we have the words “determine” and “real” – but for today I will stick with the “whether or not.”

This is a phrase where the problem is fairly straightforward. Evidence is not presented as if it exists on a continuum and is subject to many assumptions; instead, it is presented as an oversimplified binary “evidence” or “no evidence.” The language implies there is nothing in between. It does not matter if statisticians know there is something in between — it is about how this is translated and internalized by those who don’t realize the subtleties of that point.

We need to think more about how the message is interpreted by those who do not necessarily understand the continuum and the gray area. It suggests the outcome is simply a binary one and that the answer comes without uncertainty. In broad audience explanations, there’s rarely any discussion about arbitrary thresholds needed to move from continuous to dichotomous. This wording implies the ultimate goal is dichotomization and that statistics is helpful because it removes uncertainty and gets us to the black or white answer we all crave.

The “whether or not” phrase is one I have been talking about for almost a decade, mostly with students in my classes. I have tried to remove it from all my wording — in speech and in writing — though it still can surprise me by creeping in.

Objectivity and unbiasedness

I hesitate to give this section such an enormous heading and I will not dig in deeply. But, the fact it is such a huge topic is exactly the reason we should stop using the terms as if they are simple and uncomplicated. Commonly stating that the use of statistics, or even aspects of it, magically makes things objective and unbiased is misleading. The entire process of statistical inference is not as objective as statisticians and scientists would like to believe and definitely not as objective as commonly conveyed to broad audiences.

I am skirting around the definition of objective here, but the main point is acknowledging the many decisions and judgement calls along the way — from the design through analysis and interpretation. It is not reasonable to think the same thing would be done by two qualified and reasonable statisticians or scientists, which I think is often what people interpret “objective” to mean — something objective doesn’t have a human component and the “answer” shouldn’t be dependent on the human carrying it out. It’s a story we’ve told about statistics for decades. And, it’s a dangerous story. Even simple reported statistics fall prey to this — there are decisions in terms of how to aggregate or not aggregate that make a difference to the outcome. Those decisions are not objective and they are not free of human biases.

I think I understand what people are trying to say, and because it’s said over and over again, it sounds okay to most ears. But, what are the implications? Should we really be saying it? Are we oversimplifying? The statements may be true on some level, but most people listening to them don’t understand the conditions under which that level is true. Here are some paraphrased examples I heard in the podcasts.

  • “Use of random assignment means we have no conscious or unconscious bias.”
  • “We develop objective rules.”
  • “There’s a desire for decisions that are objectively based on evidence.”
  • “A randomized study allows for an unbiased evaluation of the two things being tested.”

Do we have an understanding of what people listening to these statements interpret “unbiased” and “objective” to mean? Are we making things sound far better and more trustworthy than they are?

Defaulting to averages

This is something I have written about before here, here, and here and I suspect there will be more in the future. I don’t have anything against averages, but I do have something against the default use of averages. By default use I mean choosing to base analyses (and then conclusions) on averages (or models for means) without ever thinking about it as a choice or a decision. It is often done without awareness of how the choice of assuming there are groups of homogeneous individuals may impact inferences. Use of averages is so accepted, and expected, that justification for the choice is not asked for.

In my experience, most people relying on them do not even think of it as a choice or an assumption to be justified. When conclusions are translated for a broad audience, the “on average” is often excluded or tagged onto the end of a sentence in a way that implies it can be ignored. It’s not easy to bring this into a conversation simply, but I hope we can stop implying averages are the option and that implications need not be questioned.

For example, one of the podcasts today referred to the use of average survival time measured in months. It is often implied that there are no other options and that it is inherently a parameter of interest. There’s no mention of what the distribution of survival times for those of interest might look like. What if survival times for about half of participants range from 1-3 months and the rest survive over a year? How useful is the average?
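Here is a tiny illustration with made-up numbers of the kind of split I have in mind, where the average survival time ends up describing almost no one in the group:

```python
# Made-up survival times (months): about half between 1 and 3, the rest over a year.
survival_months = [1, 2, 2, 3, 3, 2, 1, 3,
                   14, 16, 18, 20, 15, 13, 17, 19]

mean = sum(survival_months) / len(survival_months)
print(f"average survival: {mean:.1f} months")              # about 9.3 months
print("dogs within 3 months of that average:",
      sum(abs(m - mean) <= 3 for m in survival_months))    # 0
```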

Disclaimers or explanations about who or what results may actually represent are rarely provided, and it is easy to assume that the estimate inserted into the statement represents some common person or even the individual listening. This may seem like a minor oversight, but I do not think it’s minor at all when we try to carefully consider how things are interpreted by those who have not had a reason to think through the details (and shouldn’t have to). I’m not suggesting we get rid of using the word “average” or “mean” — just that we consider it and talk about it as yet another assumption that does influence statements about results and general conclusions.

Stop pretending it isn’t hard

I see pretending as a huge part of the problem. We want so badly to communicate ideas and results in a way that is understandable that we greatly simplify and then pretend that the simplification covers it. We ignore the extent to which the corners are shaved. We don’t openly acknowledge how hard it is to simplify and not accidentally mislead. There are plenty of other words and phrases we could go into — like “proven.”

We need to reel in the tendency to pretend like most aspects of statistical inference are simple enough to convey in a tweetish-length response. And, we need to stop pretending that those with formal degrees in Statistics always get it right. It is not about whether the person talking or writing understands what they are trying to say; it is about carefully considering how the words may be interpreted in potentially misleading ways.

It is difficult to talk about how statistical inference works (or doesn’t) to a broad audience. If we’re trying, we will make mistakes and we will want to update things we’ve said and written in the past. I don’t think we can realize the potential harm in what we’re saying until we recognize that. It takes others pointing it out or time spent thinking about implications. So, if we slip up or inadvertently imply something, let’s just admit it openly and be honest, even if it’s hard to talk about. The more we acknowledge the challenge as statisticians, maybe the more other scientists and writers will start saying it too, without worrying about egos. Inference and uncertainty are hard — we should try not to deny that or lose sight of it, especially when speaking from a place of expertise.

I have been keeping at least one trade science book on my nightstand — trying to learn from how others have gone after big topics in science, or more interestingly how we do science. I ordered Rigor Mortis: How Sloppy Science Creates Worthless Cures, Crushes Hopes, and Wastes Billions on a whim. It’s written by Richard Harris, the NPR science correspondent, and published by Basic Books in 2017. For some reason, I had low expectations going into it — but in retrospect I don’t think they were deserved. I think they mainly came from what I perceived to be an overly dramatic title. I like the play on words with rigor mortis — but somehow the whole title has the feel of a tabloid title to me. But, I get it — I suspect it did its job attracting people who might not have given it a second look if it had a less dramatic title. So it goes, whether I like it or not.

It didn’t take many pages for me to get over my unfair title-based expectations. Harris packs a lot of information, opinions, and arguments into an accessible, fairly short, and very easy to read book. It’s really pretty impressive. What did I like most about it? I think its effectiveness at giving the reader a bird’s eye view of different problems affecting biomedical research and how the negative effects of weak links can accumulate unseen. It makes the reader acutely aware that an individual researcher in a particular field, or a technician carrying out a very specific and seemingly minor task, rarely — if ever — gets the chance to see and contemplate how their decisions, and potential mistakes, may get magnified when considered over the whole process of carrying out the research and translating results to practice.

I don’t intend this as a full-on book review. I just felt like sharing — maybe because it’s well written and not expensive. It’s a very low-cost investment for the reader in terms of time and money. Why not?

I’ll end with one paragraph from the last chapter — mostly because it fed into something I was already thinking a lot about. I believe we should try harder to recognize, understand, and acknowledge the social and human sides of doing science — but I also am acutely aware of the huge challenges making rigorous work in the area difficult. I don’t think many would argue against the idea that much of doing science is largely a social enterprise within a social system. Yet, it largely feels like we are living in a state of denial over this, usually stuck in the mentality that things like objectivity and unbiasedness are the real deal — even in social science research. We’re not doing ourselves any favors by pretending that when we’re doing science, we’re able to rise above all the human faults that afflict us in our non-science-related lives. It’s like we need therapists specifically for scientists and their work! Hmmm… maybe I’ve finally hit on my career calling! Though I question my qualifications on multiple levels.

And finally — here’s the paragraph:

Scientists make judgments all the time, not only about their own work but about the papers they read. Kimmelman hopes that these judgments can be quantified and reported as a matter of course. With this strategy, Kimmelman is trying to take advantage of human abilities that are not conveyed in the dry analysis of journal articles. It’s “getting to what’s going on in the heads of people,” he told me. “That’s not only one of the missing pieces of the puzzle here, but I think it’s a really, really critical issue.”

Page 234 from Rigor Mortis: How Sloppy Science Creates Worthless Cures, Crushes Hopes, and Wastes Billions by Richard Harris (2017, Basic Books)

I haven’t yet found the time to dig into Jonathan Kimmelman’s work, but his university webpage is here, along with one for STREAM (Studies of Translation, Ethics, and Medicine). Overall, the work related to “translation, ethics, and medicine” looks fascinating and important to me. It’s something I could easily get very excited about, but of course I have some reservations too.

I’m not super optimistic about being able to quantify and report most judgments (as suggested in the quote). That idea feeds into my general worry that “meta-research” work has clear potential to fall into exactly the same traps as the work it is trying to assess. It is social science research largely within the same system and framework that other social science research is done. It’s one of those never-ending mirror situations — if we study how to do research, then we should probably study how to do research on how to do research, etc. I’m not arguing this makes the endeavor not worthwhile, just that it has complicated layers that should be acknowledged. It reminds me of a more tangible analogy in data analysis — the never-ending conundrum created by checking assumptions of a test using another test with its own assumptions that should be checked, etc. Where does it ever stop? It doesn’t mean the original assumptions aren’t useful or reasonable to appeal to — it just means we need to be careful and do the hard work of justifying them more qualitatively (not shortcutting the hard work with another statistical test with its own assumptions). This all leads me to the feeling that very few things worth doing are simple and easy.

The “sorry for being me” attitude

May 19, 2020

Apologies are hard. Really good apologies can seem almost impossible. I suspect many of us who are pandemic-confined with other humans have had a lot of opportunities to think about apologies. Or at the very least, see them go wrong. I am thankful for Brené Brown‘s new podcast Unlocking Us and last week she gave us a long, thoughtful conversation with Harriet Lerner (I’m Sorry: How To Apologize & Why It Matters). I highly recommend it for any human who intends to interact with other humans, and who might make a mistake now and then.

With the words of the podcast still fresh in my head, I was handed a “sorry for being me” apology. It’s a flavor of apology I’m sure most of us have given at one time or another – probably feeling like it was noble of us to admit our serious inherent faults. It can be hard to see the implicit messages it sends to the person it’s addressed to, or the unproductive stories it reinforces. Despite presumably good intentions, it is packaged in a way that can’t really be opened by the other person. It doesn’t acknowledge the recipient’s hurt or feelings, and what options does it leave? Comfort the apologizer? Tell the apologizer they are forgiven for being them? It gets complicated fast. So much is contained in that one phrase.

I home in on one of the implicit messages that echoes most loudly in my ears: “I am sorry, but there’s nothing to be done about it.” The “things just are as they are” message sends me into a strong, and probably overreactive, negative response mode. It zaps my hope.

Why am I writing about it here, on my blog meant for talking about science and statistical inference? Because… I was recently hit with the realization that I had heard and felt this message repeatedly over years working as a collaborating statistician. I just didn’t see the connection – even though it seems obvious now. It’s so easy to compartmentalize our work lives from our personal lives — and not see how feelings in one might be magnified because of feelings in the other. We need awareness first, before we can use such information in a positive way — and becoming aware is hard.

I am certainly no expert in this area of apologies and analyzing feelings; but I do have the human thing going for me. I suppose my hope in writing this is that people may understand and relate to the personal apology story — and then by seeing the connection will understand the one related to my experiences as a statistician, which then might lead to increased awareness of problems underlying use of statistical methods in science. On the surface it may sound like quite a reach, but at the moment it sure doesn’t feel like it.

Just to be clear – I am well aware that I fall into the trap I’m describing in many areas of my life. But, there’s no rule against writing about things we also suffer from and are trying to understand. We’re all human — in our personal lives and in our professional lives. Yes, even the scientists.

Same underlying message

I’ve never had an explicit “I’m sorry for being me” from a researcher or client (thank goodness). But, I’ve had plenty of comments that I now see come from a very similar place and share a basic underlying message — they still convey the “sorry for being me” attitude, just dressed up in a more professional context. These typically arrive as some sort of apology after the person decides not to adopt my, or another statistician’s, professional advice to change their intended approach to design or analysis or interpretation.

The message takes me straight to frustrations around dogmatic practices based on statistical methods that continue in many fields – despite plenty of cautions and warnings. This post isn’t about specific practices, but about the attitude, or mindset, that reinforces and encourages behaviors despite well documented problems or professional advice from well-meaning statisticians.

Here are a few examples of paraphrased comments I have received repeatedly over the years in response to suggesting a different approach that is more justifiable from a statistical perspective:

  • “I agree with you, but I think I need to go with this approach to get my grant [or get my paper published].”
  • “I see what you’re saying, but I have to go with what’s accepted in my field.”
  • “I think the approach you’re suggesting makes sense, but it will never fly with reviewers. It’s just not how we do things.”
  • “For my career, I need to stick with what people are used to, even if it seems wrong to you.”
  • “I’m sorry to disagree with you, but I’m going to keep doing what I was originally taught.”

On the surface, these may not sound a lot like “I’m sorry for being me,” but I hear the same underlying message. The message can be further translated into something like “I’m sorry for how I’m going to do things, but I’m not open to change,” or “I’m not proud of my choice, but I can’t change it, even if I wanted to. It’s how I have to operate to survive.”

Just as in my personal life, the underlying message zaps my hope and respect. In the language now common for many elementary and middle school students — it reflects a fixed mindset (as opposed to a growth mindset) [based on work by psychologist Carol Dweck]. Willingness for change doesn’t bubble up from a belief that things are inherently fixed and unchangeable. There is no need to find the effort or put up with the pain of change if it won’t be worth it in the end. The message reflects a mindset that justifies maintaining the status quo – even when we’re well aware of its problems or at least see others trying to get us to hear about the problems. Our brains are so good at ignoring warning signs just to be able to stay the course we’re on.

In doing science, it’s easier to continue to operate as you’re used to operating, and probably as your advisors, and their advisors, also operated. It may have already led you to a successful career, or it may look like the only way to end up with success. The dangers of changing course, of trying something different, and putting in the additional effort and fight to justify it, are real. They are real because the dogmatic approaches get so embedded in the system of doing business and in the surrounding culture. The culture is generally unforgiving when it comes to change. Those with the most power in the system generally got that power using the approaches. Recognizing the problems and allowing others to change requires (or at least encourages) some pretty deep and uncomfortable reflection on previous work.

There are so many forces acting against change. It is easy to get overwhelmed or simply choose to ignore the problems. Maybe seeing the connections to struggles we tend to place only in our personal life compartment is a way to glimpse things from different perspective and gain a different type of awareness? It may be worth a try, as it’s hard to see how it could hurt.

Statistician’s perspective

I want to go into a little more detail related to use of statistical inference in scientific practice. Let’s take the statement:  “I’m sorry, but I have to continue using statistics as I was originally taught.” What am I to do with this as a statistician? It’s quite likely they were “originally taught” by someone with little formal background in statistical inference, but we don’t know what we don’t know. Does the person want me to forgive them for their decision, even if it goes against my own professional opinion? To me, that feels like condoning practices I disagree with – not something I can do and still preserve my integrity. While I might not be able to forgive, I can understand and sympathize with the pressures and forces against change. I do get it. I understand how the system operates. But, that doesn’t mean I have to accept and feel okay with the decision to choose status quo over greater rigor.

I also can’t ignore how the collection of many, many of those decisions may work against progress in science. I think statisticians get a relatively unique perspective through our work with many different individuals across disciplines. We get a bird’s eye view of the accumulation of these decisions that isn’t obvious from the ground within the very specific niche of a single person’s work. We see the heaping mound of debris resulting from decisions made by a whole culture reinforcing itself to produce more of the same.

Statements like those mentioned above send the message that it is more important to cling to “the way things are” than to be open to looking at why things are as they are and how they can be improved (because things can always be improved). For you scientists, what is the evidence that change will be that bad? Do you know you won’t get your grant? Do you know you won’t get your paper published? Or, are you proceeding as if that’s the case, without decent evidence or justification?

What do we do about it?

I don’t have the secret to pushing our brains outside their default way of operating, but I do know humans manage to do it all the time. Something happens — some awareness is gained, some level of motivation is reached — that pushes a person to reach outside their comfort zone or suddenly see the harm in their usual way of thinking and operating. We expect this in the context of tending to meaningful relationships with other people, so why not set expectations for something similar in the practice of science?

Finding motivation has to first come from a willingness to search for greater understanding and an openness to awareness. Maybe it can be captured with a renewed or deeper sense of curiosity about why and how we tend to do the things we do. We’re human — in our relationships and in our work. There is no easy magical fix. But, it is clear that staying confined within a fixed mindset is not helpful. It continues the pain, contributes to systemic problems, and makes widespread positive change seem nearly impossible.

The “I’m sorry for being me” attitude pops up in all areas of our lives. Even when delivered with the best intentions, it’s worth realizing that it’s likely serving as poor justification for an excuse to not change. It does not move us forward in doing science or living our lives — as if there is really much of a difference between the two most of the time.

I have a lot of hope for those growing up with awareness of the power of mindsets, and even the language to talk about it. Maybe they will be able to better shed the fixedness that hurts relationships and hinders scientific rigor and progress. We’re all stuck somewhere and not doing the world any favors by it. Why does it seem easier to make our mistakes by sticking with the status quo than risk making our mistakes through change? The latter artificially feels like we’re more accountable, whereas the former leaves us somehow absolved because we didn’t really “do” anything. We can so easily trick ourselves into believing that acting in the way we usually do (not changing) is not acting. But, both involve making a decision about how to act. Deciding to do nothing is making a decision to act in a certain way, as much as we would like it not to be.

In my last post (Thieving from the Assumption Market), I suggested effort put into the design can earn a nice down payment toward justifying a model and associated inferences. This post considers down payments earned through random assignment and random sampling. It may not sound that exciting, or at all new, but I think I offer an atypical perspective that might be useful. You may be expecting me to argue that the down payment takes the form of combatting sources of bias — the motivation for common statements like “we used random assignment to ensure balance” and “we used random sampling to ensure representativeness.” While combatting potential biases and thinking about balance and representativeness are important, they are unfortunately misunderstood in the context of randomization and random sampling relative to a single study (see more here). This post tackles a different type of down payment — one that is often neglected.

Connecting design and model

There is a fundamental connection between use of a probability mechanism in the design (random assignment or random sampling) and justification of a probability model to base statistical inferences on. It’s not common to see acknowledgement of this connection. It’s a connection that’s easy to lose sight of in the model-regardless-of-design centric world we now live in. I suspect a lot of users of statistical models, and even statisticians, never got the chance in their education to appreciate the historical and practical importance of the connection.

I had been using statistical methods in research for a few years before seeing the connection (or at least appreciating it) for the first time. It didn’t happen until I had changed my path to study Statistics. I still remember the feeling of really grasping it for the first time — a bit of amazement at the logic behind the idea, but also at the fact I hadn’t seen it before. I doubt that I would have fully appreciated it without the lag. While simulation-based teaching strategies are now common in intro stats classes, I’m not optimistic it’s sinking in in a way that will transfer to research or work later on. To me, the connection feels like a crucial piece of the puzzle to understanding how we got to where we are in terms of use of statistical methods — sure, we can continue without it and call the puzzle done, but it’s a lot more satisfying to find the piece and fill in the hole. When I’m feeling optimistic, I even believe it can be an integral part of starting to re-value and set higher expectations for time and effort spent justifying choice of model.

Design vs. model based inference

I used to think a lot about the differences between model-based and design-based inference, but that’s not where I’m going here because dwelling on the difference isn’t often that helpful in practice. Those going with design-based inference already understand the benefits and are likely on that path from the beginning. The use of “versus” makes it sound like those not choosing design-based inference can just ignore design, or at least that it’s not firmly connected to their models and inferences. Really it’s all model based — just different types of models with different justifications — and design is always important.

I am probably coming across as more of a purist on this issue than I intend to. I am not about to argue that you should only use probability models if you insert probability into your design. I am simply arguing that expectations for justification should depend on design. It affects the size of the loan needed to buy your assumptions, and the size of your down payment.

Now, finally to the thought exercises I promised.

Envisioning haphazard and convenience distributions

Before we go further — I am assuming you have some basic understanding of what is represented in a randomization distribution or a sampling distribution (beyond just picturing a textbook perfect t-distribution) and how they can be used for statistical inference (intro stats level). Also, when I use the word random, I am referring to actual use of a probability mechanism (nowadays a computer’s pseudorandom number generator). A human brain does not count as a random number generator.

Part I: Random assignment and the randomization distribution

When I say random assignment, I am referring to the random assignment of individuals (or experimental units) to treatments (or groups) — or vice versa. Random assignment is widely recognized as a valuable design concept and action, but mainly for the reason of suppressing “bias” by buying “balance” (as previously mentioned). We’re going after something different here — its connection to building a relatively easy to justify probability model.

Creating the collection of all possible assignments

What is created by inserting random assignment into the design? It brings to life the collection of all possible random assignments that could have occurred – not just the one that did. This collection is well defined and justified because of the random assignment — it creates a situation where we know all the possible assignments and their probabilities of being chosen. So… while carrying out the experiment under different random assignments is hypothetical, the collection of possible random assignments is not hypothetical – IF random assignment is actually used in the design.

It’s this collection that leads directly to a randomization distribution, as commonly used for inference (by sprinkling some assumptions on top). For example, if the treatment does nothing (a common “null” assumption), then each unit would have the same outcome regardless of the group it was randomly assigned to. Because the collection of possible random assignments is well defined (and not hypothetical!), a summary statistic can be calculated for each one of the potential assignments — and voila! — we get a randomization distribution! This is not only conceptually important, but also practically important.
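Here is a minimal sketch of that construction for a handful of units, with made-up outcomes and the simple “treatment does nothing” null described above: enumerate every possible assignment, compute the statistic for each, and the collection of results is the randomization distribution.

```python
# Brute-force randomization distribution for 6 units split 3 and 3.
# Outcomes are invented; under the null, each unit keeps its outcome no
# matter which group it is assigned to.
from itertools import combinations

outcomes = {1: 4.3, 2: 5.1, 3: 6.0, 4: 4.8, 5: 5.5, 6: 6.2}
units = list(outcomes)

diffs = []
for group1 in combinations(units, 3):            # all 20 equally likely assignments
    group2 = [u for u in units if u not in group1]
    diff = (sum(outcomes[u] for u in group1) / 3
            - sum(outcomes[u] for u in group2) / 3)
    diffs.append(round(diff, 2))

print(sorted(diffs))   # the randomization distribution of the difference in means
```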

You might have done plenty of group comparisons without ever constructing or visualizing a randomization distribution — or even considering its existence. The number of possible random assignments in the collection gets quite large, quite fast. (Even if you have only 20 total units randomly assigned to two groups of 10, you’re at ’20 choose 10 = 184,756′.) So, it’s not surprising that we typically rely on the t-distribution as an approximation to the randomization distribution we’re really after. And, it’s a pretty darn good approximation under a lot of conditions.

With computing power now, we don’t need the t-distribution so much, but it’s easy and it’s “how we do things.” Unfortunately, it also lets us forget foundations. It lets us forget it’s there as an approximation of a distribution we can build and justify through our design. If we employed random assignment, then we have a huge down payment toward justification of using the t-distribution in our statistical inferences. It doesn’t buy all assumptions outright, but it gets us pretty far down the road. Or, I can forget the t-distribution altogether and get an even larger down payment using the randomization distribution directly.

The thought exercise …

Now to the question I have been trying to get to. What are the implications of not employing a random mechanism? What if I haphazardly assign units to groups in a way that feels random to me? What if I assign them out of convenience or in a way that clearly improves balance between groups? Does it really matter? You might think — well, I could have gotten that same assignment through actual random assignment, so it shouldn’t affect my inferences. But, it’s not about the one assignment you implement (random or otherwise), it’s about the collection of possible assignments! The act of real random assignment creates a collection that is not hypothetical – the probabilities associated with each assignment can be calculated! It’s this collection that can form the basis for a statistical inference.

What does the collection look like if you haphazardly assigned units to groups? How many possible haphazard assignments were there? How likely were some to show up vs. others? If you can’t answer these questions, then you don’t have a well defined collection of possible assignments. If you can’t answer these questions, the concept of a randomization distribution breaks down (as well as the approximation using the t-distribution). Sure, you can bust ahead with your inferences, but you should have to take out a bigger loan to justify your model.

Let’s go with a mini-example. Suppose I have 8 individuals (creatively numbered 1 through 8) and randomly assign them to two groups. I actually did this using my computer and got 4,3,7,5 in group 1 and 1,2,6,8 in group 2. But this information isn’t that interesting. What’s interesting is that I know there were ‘8 choose 4 = 70’ ways the randomization could have turned out. I can use my computer to list all those possibilities, or even write them out by hand on a piece of paper. I know they all could have happened and I know their probabilities of happening.
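If you want to see that collection rather than take my word for it, it’s only a few lines on a computer (the specific group 1 below is just the assignment I happened to get):

```python
# Enumerate every way to put 4 of the 8 individuals in group 1.
from itertools import combinations
from math import comb

assignments = list(combinations(range(1, 9), 4))
print(comb(8, 4), len(assignments))    # 70 70
print((3, 4, 5, 7) in assignments)     # the assignment I got (4,3,7,5) -- True
print(1 / len(assignments))            # each assignment had probability 1/70
```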

Now, let’s repeat the exercise. Suppose I decide to haphazardly or conveniently assign individuals to groups using my brain and judgement rather than an external probability mechanism (see later note about this). Maybe I inadvertently use some information to assign individuals to groups — perhaps trying to “balance” according to some obvious factor I can see (like age). I could have easily ended up with the same four units in each group as I did when using the random assignment (described above). So, what does it matter?

Well, now try to write out the collection of all possible assignments to groups. Some of the assignments included in the list of 70 above are now no longer an option. For example, I doubt the split of 1,2,3,4 in group 1 and 5,6,7,8 in group 2 would ever show up — it seems too non-random when someone is trying hard to be random. If you really try to construct the list of assignments and assign probabilities (or equivalently, take the list of 70 and just try to assign probabilities, including zeroes) — you should feel the conundrum it puts you in. How can you construct and envision your “haphazard distribution?” To be able to do it, you would have to understand the inner workings of your brain far better than I believe any human is capable of. I don’t think it’s too strong to say — it’s impossible!

How do you then go forward, assuming you’re not going to throw your hands in the air and just give up? Well, you are forced to make more assumptions and use a pretend probability model — to assume assignment was random when in fact it was not. You should lose your down payment over this and need to get a larger loan to cover your use of the same model.

Is it worth it? In an ideal world, it wouldn’t be worth it because you would actually have to get the loan and do the justifying, but … if you live in a culture of thieving from the assumption market, then it’s actually less effort in the long run to forgo the actual random assignment. It’s not likely you’ll ever be expected to really justify it. Okay, I’m being pessimistic — but I don’t think unrealistic. I am optimistic that attempting to envision one’s “haphazard distribution” or “convenience distribution” will open some eyes as to why actual random assignment should get you a down payment.

Part II: Random sampling and sampling distributions

The same exercise can be applied to the idea of random sampling — and its fundamental connection to the concept of a sampling distribution. Instead of the collection of all possible random assignments, we have the collection of all possible random samples from a population. The sampling distribution is admittedly harder to wrap one’s head around — particularly when we enter into infinite populations and an infinite number of possible random samples.

But, it’s easy enough to walk through the exercise with a baby example — say you are taking a random sample of 2 individuals from a population of 20. It would take you a while to list out the 190 possibilities, but you could do it and it’s easy enough to envision that entire collection. This collection, with the known probabilities, would form the basis for a sampling distribution.
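Again, a minimal sketch (my own, in Python) makes the collection tangible:

```python
from itertools import combinations

population = range(1, 21)  # a population of 20 individuals

samples = list(combinations(population, 2))
print(len(samples))       # 190 possible samples of size 2
print(1 / len(samples))   # each equally likely under simple random sampling
```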

Now, suppose obtaining a random sample is deemed too difficult and too time consuming — or maybe even impossible for logistical reasons. What does the collection of possibilities look like? What does the implementation of convenience sampling do to the look of the sampling distribution? What “samples” are possible and are some individuals much more likely than others to appear? What are the probabilities associated with each of the “samples”? For example, maybe unit 3 is very obvious and easy to include and then all possible samples include that unit! Would two different people in charge of the convenience sampling end up with the same “convenience distribution”? Not at all likely.
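Continuing the toy sketch above (again my own illustration, with an assumed scenario where unit 3 shows up in every convenience “sample”), the collection of possibilities collapses dramatically, and the probabilities attached to what remains are anyone’s guess:

```python
from itertools import combinations

population = range(1, 21)

all_samples = [set(s) for s in combinations(population, 2)]  # 190 possibilities under random sampling
with_unit_3 = [s for s in all_samples if 3 in s]             # only 19 remain if unit 3 is always included

print(len(all_samples), len(with_unit_3))  # 190 19
```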

Where does this leave us in terms of justification of an underlying probability model based on sampling distribution theory? It puts us in an analogous situation to that discussed for the randomization distribution. There’s no way to build a distribution to align with the actual design, so we’re left pretending as if random sampling took place. The foundation is tenuous without some serious justification. The potential down payment has been lost — and we should have to apply for a much larger loan to be able to justify the same inferences.

In current practice, what do we charge for using convenience sampling? Not much. We don’t typically expect more justification in return for giving up the foundations of a sampling distribution. Giving up effort in the design should lead to more effort in justification of the model, but I don’t see that happening. Typically, we just bust ahead, use the probability model as if there was random sampling — pretending as if our convenience sampling would have churned out the same collection of possible samples as probabilistic sampling would have. It’s another theft from the Assumption Market.

If any justification is given, it’s typically something like “we assume the convenience sample was as representative of the population as a random sample would have been.” This is analogous to the “balance” example under randomization. For a single study, you might get a more “representative sample” through convenience sampling, but that’s not the point. We’re missing the bigger picture that lack of random sampling destroys the foundation of a sampling distribution.

A note on “random”

As a consulting/collaborative statistician for almost two decades, I have witnessed mis-use (or misunderstanding) of the word “random” over and over again. If a researcher says or writes that they used “random assignment” or “random sampling” in their design — you should always ask for more detail about the actual mechanism used. In my experience, it is more common for “random” to mean “haphazard” than random as defined by a statistician. In my experience, it has been rare to see a formal probability mechanism used — it’s usually just a human brain trying its hardest to be random. There is no malicious intent to cut corners, just a genuine lack of understanding of why it matters. It is my hope that this post might help convince people that it does matter.

Wrapping up

I believe it is important to recognize the crucial link between inserting probability into design in a useful way and the justification of probability models as a basis for inference. Next time you are deciding whether to employ randomization and/or random sampling — don’t think just about “balance” and “representativeness,” but think about putting effort into the design to be able to afford better justification of a probability model. It seems like a reasonable first step toward curbing the culture of theft from the Assumption Market.

You are completely free to use whatever probability model you want, whenever you want, but … you should have to justify your choice all the way through the inferences you ultimately make. And, we should be checking each other’s justifications all the time — and weaving those checks into our judgements about how much to trust an inference.

Thieving from the Assumption Market

April 16, 2020 | General

This post grew unexpectedly from other drafts. I found myself vaguely referencing an analogy around the cost and purchase of assumptions. I decided it was worth exploring a little more on its own, and for reference in other posts — it ended up being more fun than I expected!

The heart of what I’m after is more awareness and discussion about how we justify our inferences — and any models (in a broad sense) propping them up. This post relates to thoughts I recently shared on the cost of assumptions here.

Envisioning the Assumption Market

Let’s suppose we have a huge store, filled with shelves of neatly stacked and organized assumptions needed to support scientific inferences. Some of the shelves hold the shiny mathematical assumptions (the ones we’re most used to talking about — normality, linearity, independence, etc.) and some shelves are messy and dusty from lack of turnover (the ones we don’t talk about much — like assumptions about what is possible to measure, that averages are the most meaningful quantity, what we must ignore to rely on probability models in general, etc.).

The shop owners fill the shelves and aisle ends at the front of the store with the shiny, accessible mathematical assumptions — they disappear quickly and there’s constant demand. The messier and less accessible assumptions are in the very back aisles, and maybe just downstairs in overflow where few customers actually wander. Customers typically run in in a hurry, grab their shiny assumptions from the shelf (usually the popular ones on sale), and head toward the door (hopefully stopping to pay on the way out!)

The pricing challenge

Now, the hard part. What is an assumption worth? How is it priced? What currency should we even use to buy it?

As I sort-of communicated here, assumptions add information into a statistical analysis, and therefore into inferences. There is a tendency to maintain laser focus on the information in the data — while ignoring how much information is added through assumptions. It would be wonderful if we could easily quantify the information in a particular assumption using a data-like currency, but we don’t currently have the ability to do that for most assumptions and I’m not optimistic we ever will, particularly for the assumptions in the back of the store collecting dust. And, even if we were able to come up with a reasonable method to quantify an assumption’s information content relative to data (through some sort of sensitivity analysis) — and thus price them for the Market — that method would have its own assumptions. So … where would it ever end?

I accept that defining currency through such quantification will not be a practical solution, at least one that is broad enough to cover the range of assumptions we should be considering. But, this doesn’t mean the concept isn’t still useful in a more qualitative way. Yeah, it’s messy and difficult and won’t be satisfying to those used to shopping from just the front of the market, but it may be far better than how we’re currently operating.

What am I suggesting practically? Well, I’m still working on this, but here are some thoughts. We can create a list of all the assumptions we are aware that we are making that have some influence on the scientific inference or decision. The list will be long and making it will be tedious and uncomfortable. Then, we need to think through how each one might affect our inferences and make judgements about the relative size of the influence. How much information might it be adding to inferences beyond (and relative to) that contained in the data? Reorder the list in terms of increasing perceived impact on inferences (or sensitivity of inferences to the assumption). This is then the list of assumptions from cheapest to most expensive — even if a specific price can’t be quantified. Using the information about relative cost, how can we purchase them?

Taking out a loan

How do we actually make a purchase if we don’t really have a currency? Here’s how I envision it. We are all cash poor and need to take out a loan to buy any assumption. The Assumption Market is also a lender — by necessity. In the real world, to secure a loan we need to justify to the lender that we will be able to pay it back. The lender has no way of knowing for sure if we will be good for it — so their decision to lend depends on how well we can convince them that we will be able to pay it back. Ultimately, the transaction proceeds on good faith — but the better the justification that the loan will be repaid in a timely manner, the better the chances of getting that loan. So, the hidden currency is the strength of the justification. The bigger the loan, the more documentation and support is needed — the more convincing the justification must be.

I see purchasing of assumptions as completely analogous to this. The more costly the assumption, the more justification should be required to get the loan to purchase that assumption. Lacking the cash currency to buy assumptions, the default currency to get loans should be justification. The more expensive the assumption, the more extensive and convincing the justification should have to be to get the loan. Unfortunately, in real life, the most expensive assumptions are often the ones with the least expected justification — those at the back of the store that are rarely even purchased. That’s a door to walk through at a later time, but it does lead me to the next section…

A culture of thieves

In current practice, the amount of justification required to buy assumptions is minimal and often non-existent. We treat assumptions as if they are readily available in the free bin — but they are never free. There is always a cost. If assumptions are relied upon with no justification, they are stolen from the market (gasp!). We have created a culture based on stealing from the Assumption Market. Common statistical (and more generally scientific) practice makes us a bunch of thieves. [Note — listing the shiny assumptions and stating that they are “met” does not count as justification (more here in an old post).]

I don’t suppose anyone is trying to steal from the market or wants to be a thief. It’s just the accepted way of doing business. We keep teaching how to sneak into the market, grab what we need, and sneak out — no stopping at the checkout, no checking the cost, no applying for the loan — or even teaching students how to apply for a loan. It’s like there’s someone passing out invisibility cloaks to customers as they come in the door.

How to raise a down payment

There is one other important piece to add to the analogy. Loans should be easier to get if one has a down payment and the same should go for buying assumptions. For scientific inferences, down payments can be raised through efforts put into study design. The same assumption requires a much smaller loan for the customer who brings in a large down payment — earned through care in the design. For example, use of random assignment in the design provides enough down payment to easily purchase a probability model based on a randomization distribution. It also puts the buyer well on their way to purchasing the assumptions needed for causal inferences. Random sampling is the other obvious design decision leading to a large down payment.

Let’s come by our assumptions honestly

We could improve scientific inferences (and associated decisions) by coming by our assumptions more honestly. The thieving, and culture of handing out invisibility cloaks, should stop. It will not make life easier, but so it goes. It’s time to acknowledge that assumptions are not free and time to build a system of accountability around their purchase. This might go a long way toward increasing humility around our methods, our models, and most importantly … our inferences.

“Failing to reject the null is not the same thing as accepting the null!!” “No evidence for an effect is not the same thing as evidence of no effect!!” For anyone who has taken a traditional intro stats class based on null hypothesis tests, these messages were probably drilled into your head (or at least should have been).

There are sound theoretical foundations behind these statements (conditional on many assumptions of course). Despite the foundations, they represent a sticking point for students first exposed to the logic and become a chronic problem in practice beyond the classroom. In my experience, most people who use statistics in their work fall into one of two camps: (1) they accept the statements as fact and adhere to the wording recommendations (even if they haven’t really made sense of the words), or (2) they do not find the recommendations useful and so ignore them in practice, either explicitly or implicitly through carefully chosen words. This may sound overly critical, but that is not my point — these are not people trying to mislead others on purpose. These are people trying to do their best to communicate scientific results and make decisions. And maybe we should listen harder to those who choose (2), and question those who always go with (1).

Statisticians, and other teachers of introductory statistics, tend to throw up their hands and say things like “they just don’t get it!” “How can they blatantly ignore what we teach in practice!??” And so on. I don’t blame the statisticians and teachers either. I’ve certainly been there and feel the frustration. But, what if we are missing something? What if it is unsatisfying and confusing for a reason — because there is more to the story? If so many well educated humans push back against, or even blatantly ignore, these statements and the reasons for them, then maybe we need to open our minds a bit and reconsider these things we’ve been taught are truths. Maybe our recommended wording falls short in the real world (even if technically correct under assumptions that most people don’t really understand or consider).

Given our societal reliance on statistical tests and their results, the “no evidence” wording makes its way into journal articles read by practitioners needing to make decisions and also into general media reports (we’ll look at an example soon). What if our wording that’s fitting for Intro Stats homework assignments falls quite short when transferred into real life? What if it’s not strong enough to stand on its own, and yet we rarely give it the extra support it needs because that isn’t part of what we teach or model in practice? We should value technically correct wording, but what if it’s misinterpreted when taken out of its technical context? And how useful is it if the majority of people reading it do not understand the theory and assumptions behind it — and why should they!!??

Before we go on, I want to be clear about something. I suspect comments to this post may go straight toward blaming the method or approach. “Well, don’t do null hypothesis significance testing and that will solve the problem!” “Use a Bayesian approach to get the statements you want!” Etc. While I don’t completely disagree with these statements, that’s not the point I’m trying to make. The bigger problem does not go away with a different choice of method — the same wording is often used or different phrases are substituted in much the same way. In general, I think we try to oversimplify the challenges inherent in trying to communicate inferences — which always must be made without all the information we would like. And, the wording I discuss here is incredibly widespread. Like it or not, it is still most commonly taught in the context of null hypothesis testing and used in that context in many scientific disciplines.

Okay – let’s walk (or wade) through an example from the media last week.

Myth busting around coronavirus

Someone shared this article with me: Corona Myths Explored in Medical News Today. It’s not something I would usually find or read, but figured I would glance through the “24 coronavirus myths busted” (that’s the label of the article in my browser tab). I had begun drafting a post already about reflecting on the confusion expressed by intro stats students when trying to buy into believing that “no evidence of an effect” is always a different thing than “evidence of no effect.” I have a PhD in Statistics and understand the logic — but I also see the practical problems that grow out of those phrases and can’t really blame the students any more. Then, when I read the myth busting article, the examples were impossible to miss – like the final quarter wind of a jack-in-the-box. I am not placing blame on the authors — just providing an example of something widespread to provide a setting for discussion.

I could have chosen almost any of the 24 myths, but let’s go with myth #11. I encourage you to first read it quickly, as if you’re just scanning the news. Think about the message you took away from it — without fighting it. Then, go back and read it again slowly, trying to really think about what the words convey, or might imply.

There is no evidence that a saline nose rinse protects against respiratory infections. Some research suggests that this technique might reduce the symptoms of acute upper respiratory tract infections, but scientists have not found that it can reduce the risk of infection.

https://www.medicalnewstoday.com/articles/coronavirus-myths-explored#What-should-we-do? (March 2020, written by Tim Newman).

What is implied?

Let’s start with the first sentence: “There is no evidence that a saline nose rinse protects against respiratory infections.” Note this paragraph is about respiratory infections in general, not specifically coronavirus. The connection may seem obvious (implicitly), but it’s still a connection that should be made and justified. Now, to the bold part.

What do they mean by “there is no evidence”? Suppose I would really like to make a decision for myself and my family about whether to continue daily saline nose rinses. I need to understand the statement, where it comes from, and what it actually implies. I would also like any information about potential risks and theories about why it might or might not be effective (but I’ll get to that later).

I’m willing to bet that many people read the article quickly and, despite the carefully chosen wording, take the statements as reasonable justification for busting a myth. On your first read, did your brain quickly process the words into “Sinus rinses will not protect you against coronavirus?” There is nothing to be ashamed of if you did. That’s certainly where my brain unwillingly tried to go. We’re forced to constantly filter and simplify information to make sense of all we are bombarded with. The framing of “myth busting” completely primes us to interpret the statements in that way — even if technically the words say something else. Aren’t they implying that the myth is “Saline nose rinses protect against coronavirus”? And therefore the busted version of the myth is that rinses do not protect against coronavirus. Even the connection to coronavirus is implied.

Given the setting and general confusion about the statement, it’s easiest for our brains to interpret by turning it around. “Okay, if there is no evidence that a saline nose rinse protects, then I guess I should infer that it probably doesn’t protect against upper respiratory infections.” The alternative is to take on the possibility that there just isn’t enough (if any) information. This is clearly more difficult to process and doesn’t really jibe with the goal of the piece — why bother wasting space presenting and busting a myth if information is lacking? I think this thought can easily send the brain back to the simpler message. We read and process quickly — it doesn’t really feel painful or hard until it has our attention, and then it can start to drive us crazy.

Possible interpretations of “no evidence”

I see three general options for what “no evidence” could be translated to mean:

  1. There is no information because we haven’t studied it yet. In other words, translate “no evidence” to “no information.” For this post, academic definitions are not as important as how the phrases are interpreted by individuals. I will save a deeper discussion of evidence vs. information for another day.
  2. There have been some studies, but we haven’t yet gathered enough information to be comfortable making a call about whether it might be preventative. This may be summarizing a single study or used in a more general sense. In other words, translate “no evidence” to “not enough information” or “too uncertain.”
  3. There is enough information available to make a call, but researchers are relying on statistical hypothesis testing and their p-value is still over their cut-off (or the null is in their uncertainty interval). They would like to say there’s “evidence of no effect” and it might even be justified — but that wording is not allowed based on the approach. I look into this more below.

These three translations differ by the amount of information available for a problem, yet the exact same wording of “no evidence” can and is commonly used to describe all three. This is the suggested and accepted wording – not a mis-use. The wording is not technically wrong, it’s just really ambiguous and leaves a lot of open questions. It’s certainly not satisfying.

We haven’t found anything. But, how hard have you looked?

I can’t say for sure which translation was intended by the author, but if he was really trying to talk about coronavirus, then I’m pretty sure it has to be the first one. If we care about all upper respiratory infections, then maybe the second. He probably didn’t think through it himself — his job is to report wording from researchers or “experts” who are trying hard not to make the statistical faux pas of conflating “no evidence for an effect” with “evidence for no effect.” Avoiding this faux pas when in the first and second scenarios above is important, but real life isn’t as simple as we either “have evidence” or we “have no evidence.” This is yet another example of the strong pull we have to simplify by falsely dichotomizing — even at the expense of leaving out crucial information. In this case, by not adding context about the amount (and type!) of information available about the “myth,” the first and last sentences are almost meaningless.

I guess we haven’t looked at the last sentence yet, so let’s take a quick peek: “Scientists have not found that it can reduce the risk of infection.” Again, it’s clear care was taken in crafting this statement to avoid saying “Scientists have found that it cannot reduce the risk of infection.” But, how can I interpret “have not found” without some information about how hard they have looked? At the extremes, they might not have looked at all, and therefore definitely didn’t find anything — or, they might have done extensive searching and didn’t find anything. These are two very different scenarios. With respect to coronavirus in particular, I think it’s safe to assume they have not yet looked.

I can’t help thinking about my kids on this one. For you parents out there, how many times have you heard an anguished cry of “I can’t find X!!! It’s gone!” I’m betting that for many of you, your first response is similar to mine: “Where have you looked?” And often, it’s clear not much looking has been done, so I don’t buy the “it’s gone” conclusion. Only after we have thoroughly searched the house and retraced steps will I believe that claim. I suggest we apply a little of this parenting logic to interpretation of claims stated in the media, even if they are dressed up in wording that looks to have come straight from scientific experts.

Do either of the sentences bust a myth? Do they even justify that there is a myth that’s worth busting?

Things to consider beyond amount of data

Disclaimer — this is presented as a thought exercise in response to reading information from the media. This is not a thorough review of theory and data regarding sinus rinses and coronavirus. I am just trying to demonstrate how we can recognize the need for more information and avoid accepting our first interpretation of statements like those presented in the myth busting article.

Should you stop your sinus rinses due to “no evidence”? Well, it depends on how the phrase should be translated relative to the amount of information available (as discussed above), but it also depends on potential risks associated with the behavior (treatment/intervention) and strength of biological reasons for why it might or might not work.

I haven’t dug deep into the literature on the potential negative effects of sinus rinses, and am going on the assumption that negative side effects are minimal for most people. I know a handful of people who have been given medical advice to do daily sinus rinses, and more who rely on them when having sinus discomfort or feeling a cold coming on (I’m one of those). I’m not convinced they always help, but I’m not convinced they hurt either. So when I weigh the potential risks vs. benefits, I end up going with the sinus rinse. I put frequent drinking of water, and even hot herbal teas, in this category too. I believe hydration is healthy. There are also potential positive psychological effects of “doing something” that I don’t think should be ignored. The placebo effect is still an effect.

I would also like to hear more about the biological reasons why it might work, or why it shouldn’t be expected to work. I like a little theory and logic, preferably with a lot of data, but I’ll take it alone if there aren’t any data. It’s still information. Is it possible that it clears some of the virus out of the sinuses that could make it down to the lungs? What about “we don’t have data yet, but here’s why some people think it’s worth studying (or not)…”?

In the case where there is no (or very little) data about the benefits over some “control” and no worrisome side effects (I’m assuming this), what is the point of publishing this as a busted myth? Presenting it in this way may stop people from engaging in a low-risk, but potentially useful, behavior simply because we have too little information. What if it is actually helpful for some people? It’s worth thinking about. Suggesting (even implicitly) that people should not do something is not equivalent to telling them there simply isn’t enough information to make a recommendation, so they should weigh the potential benefits and risks themselves and make their own decision without the help of data collected for that purpose. It may not be what they want to hear, but it’s honest.

Plenty of information, but still “no evidence” — more on the 3rd translation

Now, I’m going to dig into the statistical aspects of the third translation I offered. This is my explanation for what I see happening in practice in multiple disciplines. We have to really see and understand the problem before we care enough to try to fix it. I’m trying out a presentation that I hope will hit home with some.

Suppose plenty of data have been collected in a large single study, but the stated conclusion was in the form of a “no evidence” statement. What is probably going on in this situation? It probably stems from a long history of null hypothesis testing, and habits developed during that history. Wording and “logic” from that realm have even carried over to other approaches that don’t directly support it.

I have found I get the best traction on this “no evidence” idea when I appeal to uncertainty intervals. It’s really a demonstration of the additional information contained in a confidence interval as compared to a p-value. While this may sound like an old and boring topic, I find it to often be a key in understanding the confusion surrounding “no evidence.” I feel compelled to say that I don’t think confidence intervals are the answer to all of our statistical-use problems, but they contain information that can be used in more holistic (for lack of a better word) and creative ways (no, I do not think creativity is a bad thing in science). However, confidence intervals have their own baggage and should be taken with healthy skepticism and an awareness of all the assumptions behind them. Okay, here we go…

Let’s compare two intervals and the conclusions supported by each. And before we do, let’s also suppose that before carrying out the research, we did the hard work to think through what values of the parameter we decided to estimate (e.g. a difference in means) would be considered clinically relevant or practically meaningful. We even drew a nice picture representing the backdrop of clinical relevance — any results would be laid out in front of this backdrop and interpreted relative to it. In this case, let’s suppose that values greater than about 2 are considered definitely practically meaningful and values below about 0.5 are not considered clinically relevant. That leaves values between about 0.5 and 2 in the uncomfortable, but unavoidable, gray area. Maybe I should give the backdrop an acronym to make it more attractive — Backdrop of Clinical Relevance (BCR)? (For another day: Why do acronyms make things more attractive?)

Here’s my picture:

Backdrop for judging clinical relevance of parameter values — darker blue represents less clinically relevant and darker green represents more practically meaningful. Two potential uncertainty intervals are shown (labeled A and B). This assumes the chosen statistical parameter is meaningfully connected to clinical relevance (e.g. means are of interest) and a lot of other assumptions.

Let’s consider the two uncertainty intervals (labeled A and B) drawn below the BCR. Let’s suppose they are 95% confidence intervals, for ease of comparing them to two-sided p-values and the usually automatic choice of 0.05 as a cutoff (again – I am not recommending this, just acknowledging current practice and trying to appeal to what most people are still taught).

First, note that both intervals include the value 0. Unfortunately, this is often the sole use of a confidence interval in practice, which is no different than just looking at a p-value. If the null value (0 in this case) is in the 95% confidence interval, then the two-sided p-value will be greater than 0.05. The textbook recommended wording will lead to statements of “no evidence for…” — for both A and B!!

But A and B look very different!! The difference is particularly obvious when interpreted relative to the BCR. If the backdrop is never drawn, then it can be incredibly hard to assess how similar or different they are (leading to an over-reliance on statistical “significance”). A and B tell very different stories and should lead to different conclusions about what to do next (both in terms of future research and making a decision about what to do)!!

Interval A represents a situation where there are plenty of data available. That is, there is enough precision in the estimate to distinguish between values that are and are not considered clinically relevant. If interval A was centered at 2.5, then you would conclude “there is evidence for a clinically relevant effect.” So, why when it’s centered near 0 and excludes all values representing clinically relevant effects are we not allowed to say “there is evidence of no clinically relevant effect”? Because of the structure and logic behind hypothesis tests and their lingering negative effects on critical thinking and statistical inference. The recommended wording of “no evidence for an effect” is not wrong — it’s just ambiguous. No wonder things get so confusing.

A large p-value (or null contained in a CI) can arise because (1) there actually is no effect OR (2) there isn’t enough information to distinguish values deemed clinically relevant from those that are not. In this case, interval A is consistent with reason (1) and interval B is consistent with reason (2). They are not the same!! Interval B leaves us with more uncertainty than we would like. We could consider the study a failure and toss it out, or we could take it as valuable information that suggests further study. Maybe it was incredibly hard to collect data or there was simply more variability in the system than expected, etc. “No evidence for an effect” seems an overly harsh and easily misinterpreted end. It might lead to tossing something in the file drawer that shouldn’t be tossed, or concluding an intervention is not effective — a conclusion that is not really backed up by the statistical results.
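To make the contrast concrete, here is a small sketch (my own, with made-up interval endpoints and the 0.5 and 2 thresholds assumed from the backdrop described above). The usual “is zero in the interval?” check cannot tell A from B, while comparisons against the backdrop can:

```python
# A toy sketch, not the post's actual numbers: two hypothetical 95% intervals,
# both containing 0, judged against assumed clinical-relevance thresholds.
NOT_RELEVANT_BELOW = 0.5   # values below this are not clinically relevant
RELEVANT_ABOVE = 2.0       # values above this are clearly clinically relevant

interval_A = (-0.3, 0.3)   # precise; excludes all clearly relevant values
interval_B = (-1.5, 3.0)   # imprecise; spans the entire backdrop

def checks(lo, hi):
    contains_null = lo <= 0 <= hi                    # the usual check (p > 0.05 equivalent)
    contains_relevant = hi > RELEVANT_ABOVE          # overlaps clearly relevant values?
    contains_not_relevant = lo < NOT_RELEVANT_BELOW  # overlaps clearly irrelevant values?
    return contains_null, contains_relevant, contains_not_relevant

print(checks(*interval_A))  # (True, False, True) -> consistent with no clinically relevant effect
print(checks(*interval_B))  # (True, True, True)  -> too uncertain to make a call
```

The first element of each result is identical; only the comparisons against the backdrop separate the two very different stories.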

People have been thinking about this problem for a long time

People have been thinking about and trying to address this problem for a long time. I’m talking about it here and trying to describe it in my own words, because it’s still out there in real life! And the confusion is real.

There are recommendations for approaches that try to deal with this problem (e.g., equivalence testing, Kruschke’s region of practical equivalence (ROPE), Bentsky’s discussion here, my previous post here, as well as many other commentaries). Note that all of these (as far as I can tell) rely on a commitment to developing a meaningful backdrop. And again, the point of this post was not to present alternative methods to better distinguish between intervals A and B, but to focus on how the statement of “no evidence” is interpreted in scientific sounding reports.

Similarly, I would like to think that use of Bayesian inference and seeing the posterior interval in A would be interpreted differently. While sometimes this is the case, I have witnessed (more times than I care to remember) the same “check if zero is in the interval” routine — concluding “no evidence” or “evidence” based on that check. Many people using Bayesian methods are first trained in hypothesis testing and its associated logic and wording. The move to using Bayesian methods does not imply a shift to fully understanding or buying into the philosophy of Bayesian inference.

Finally, we have to start grappling with our over-reliance on averages and our default mode of operating as if an average applies to an individual. Our imperfect human brains can’t help but assume results stated in impressive scientific sounding wording apply to us individually. See here and here for previous posts on this, though I feel there’s still so much more to say on the topic. I worry about it, a lot.

Responsibilities and recap

At least for today, I see the confusion around statements of “no evidence” coming from the combination of two things: (1) negative effects of automatically applying hypothesis testing logic and wording, and (2) real difficulties in distinguishing between “no information” and “no evidence.” This isn’t a trivial issue affecting only students in Intro Stats. It’s a real problem we should all be struggling with. It’s one of those things that after you become aware of it, will be in your face every direction you turn — particularly in this time of the global pandemic. We must add more context and information to be able to make sense out of the phrase “no evidence.” It’s incredibly important for how statements sent out to the public are interpreted and internalized — and acted upon!

Scientists and writers have a responsibility to try to think about what is implied by their wording — not only if it is technically correct according to dry rules applied out of context. Of course we cannot be expected to read the minds of others and predict all interpretations. But, there is a middle ground. We can recognize when wording is justifiably confusing and when there are obvious misinterpretations. We can recognize when we’re taking the easy way out and not presenting enough information and context. It’s too easy to keep falling back to textbook-justified wording. I have certainly been guilty of this — before I took the time to think through the real consequences it can have on our ability to make decisions.

At this time in history, it’s worth reflecting on medical research and how it’s done. The following two opinion pieces made it into my awareness through recent Twitter posts, and I decided to share them here. I also spent last Thursday writing a letter to the editor at New England Journal of Medicine, and got the rejection email today. So, figured I should just share what I wrote in this post instead.

Scandal of Poor Medical Research

The editorial The Scandal of Poor Medical Research by D. G. Altman, published in the British Medical Journal (BMJ) in 1994, and a follow-up opinion titled Medical research — still a scandal, written by Richard Smith 20 years later.

My unpublished editorial

In reading my less-than-400-word editorial again, I’m not in love with it, but so it goes. I was trying to contribute something practical and relevant to the situation we are in — as we try to learn about the virus, the disease, and potential treatments while never having enough time or information.

*********************************

After decades of habitually using statistical significance as if it measures clinical relevance, maybe the Corona crisis will motivate change.  Scientists have a responsibility to use all information available to them and to justify their methods and conclusions.  We cannot risk ignoring or misinterpreting information from any study that fails to reach an arbitrary threshold in terms of precision in estimates. We cannot afford to fail to pursue potentially valuable treatments simply because of large p-values or intervals including default null values. 

In this crisis, we are forced to deal with uncertainty due to lack of information.  In response, we must work harder to interpret the information we do have relative to a useful clinical yardstick, rather than an unjustified statistical yardstick.  It is time to reach outside the comfort zone of statistical tests to embrace more practically meaningful inference.  Statistical methods can quantify uncertainty — under a lot of assumptions — but cannot rid a problem of uncertainty.  They do not logically lead to the decisions we would like them to. Common misinterpretations can and will have serious consequences.

  • Do not imply a binary decision about clinical relevance based only on whether a default null value is included (or not) in an uncertainty interval.  This is equivalent to claiming statistical significance (or not) by comparing a p-value to a cut-off.
  • Prepare a practical backdrop for inference.  Picture a number line distinguishing (as much as possible) values deemed clinically relevant from those not believed to indicate clinical relevance.  This should be based on clinical and scientific knowledge, not previous statistical estimates or tests.  Start collaborating, debating, and preparing the backdrop now, even before we have the data!
  • Use all the information presented in an uncertainty interval.  Does the interval contain values considered clinically relevant?  Does the interval contain values not deemed clinically relevant?  
    • The answer can be ‘yes’ to both.
    • An answer of ‘yes’ to the second does not imply an answer of ‘no’ to the first. This is a common mistake in interpretation.
    • An answer of ‘yes’ to both does not imply the study should be ignored or that we should conclude a potential treatment is not effective.  There is danger in prematurely deciding not to pursue a treatment based on ignoring information.
  • Don’t take the ends of an interval as hard boundaries. There’s always more uncertainty than accounted for mathematically.

Additional articles (will continue to add to this)

With outside-the-home life slowed way down, I have seized the opportunity for home organization. Yawning my way through old files yesterday, I happened across a file folder with a paper version of the application letter I sent out to prospective Statistics graduate programs — over 20 years ago. It’s simultaneously embarrassing and enlightening to look back at things we wrote when we were 20-something. For me, it’s always a weird combination of glaring naivety and surprise at my more mature insights. In some ways I’ve traveled so far, and in some ways I’ve arrived right back where I started. I’m posting the whole letter here — as some sort of tribute to our simultaneous naivety and wisdom … at any age.

[Boldface added to parts I found most surprising or interesting]

My desire to enter the field of Statistics stems from a culmination of my experiences in graduate school thus far. I originally chose a Motor Control and Biomechanics graduate program. The two related topics appealed to me because they encompassed both fascinating biological and mathematical ideas and principles. After earning my Master of Science degree and beginning work in a Motor Control and Biomechanics doctoral program, I came to several important realizations and conclusions. First, I do not want to spend my professional life collecting data only on humans. I became frustrated with the seemingly impossible number of human variables that cannot or should not be controlled and came to respect how difficult it often is to find appropriate and willing participants. Second, I found that I loved being involved in the design and analysis aspects of experiments. I found it very rewarding and motivating to spend time talking with other students and professors about their proposed studies, or problems they were having in their experimental set-up or analysis. This heightened my already existing interest in statistics. Third, I became disenchanted with the research process I was observing in fellow students and professors. I believe the most valuable class I have taken as a graduate student is a Research Methods class based on the principles outlined in Strong Inference by John R. Platt (1964). In my experience, I found these principles seldom applied to real-life research. For example, an emphasis existed on collecting as much data as possible, analyzing it in some convenient way, and then thinking up a question to fit an already statistically significant answer. This frustrated me and motivated me to learn more about proper design and analysis procedures, and to pass that information on. In response, I incorporated research methods principles into a junior/senior level undergraduate Motor Control class that I co-taught with a fellow graduate student, and found this to be an exciting, though challenging, subject to teach. Fourth, I realized that I wanted a broader range of job options and choices, rather than being confined to only an academic setting.

Therefore, as a result of the previous conclusions, I searched for a program that would prepare me to (1) be actively involved in many varied research projects, (2) concentrate on helping others to valid, reliable, and objective research, (3) use the mathematics and biology that I love, and (4) choose from a broader range of career options. A statistics program, focusing on Biostatistics, appears to combine all of the qualities that I desire in a graduate program and degree.

I realize that the actual number of statistics classes I have on my record is small. However, I feel that I have a much broader knowledge of statistics than it appears on paper. I have limited experience with statistical analysis programs Minitab, SAS, and SPSS, and gained valuable experience in the experimental design and analysis of data for my master’s thesis. I have also been exposed to statistics throughout graduate school in various classes, seminars, and discussions. I am very will to, and capable of, completing any independent work or extra classes to give myself a broader statistical base to build on.

I head Graduate Assistantships at both school I attended, and feel this to be an invaluable part of the graduate school experience. My teaching and research experience is summarized on the attached pages. I look forward to the possibility of a new program, new ideas and concepts, and new experiences as a Graduate Assistant.

Megan Dailey (Higgs) – 1999 letter included with my application to Statistics graduate programs

A moment

I had a moment yesterday. One of those moments when you hear the words of someone else and immediately realize those are the words you have been searching for to try to make a point. I owe a thanks to comedian Ricky Gervais for that, and to Sam Harris’ conversation with Scott Galloway (about 10 minutes from the end of Making Sense podcast #189). It was one of those “I have to write this down right now so I don’t forget!!” moments. It was the push I needed to write this post.

Thoughts I have tried to convey (unsuccessfully)

For years, I have expressed my opinion that statistical inference, as currently relied upon in many scientific disciplines, is a historical fluke. I see it as the product of some combination of sociological, philosophical, and psychological factors – not a law of science we “discovered” or a logical result from mathematics. There is no proof we should rely on statistical inference — and the specific methods commonly used — to the extent we do. There was no Scientific Inference Summit of statisticians, philosophers, and other scientists that arrived at a consensus for how we should go about using statistical inference in science.

Instead, current approaches developed as if having a life of their own, often evolving differently in different disciplines. Few scientists are aware enough of methodology beyond their discipline boundaries to see the differences. The variability in dogmatic use of particular methods across disciplines has always been a huge red flag to me — if things were truly settled, wouldn’t they have settled at the same place? Statisticians, in the unique position to work in many disciplines simultaneously, see this. To me, scientists adopting discipline-specific statistical methodological dogma is analogous to people unquestioningly adopting the religion they are born into. Why not at least consider the others, or more importantly, what it means that there are others?

Scientists who rely heavily on statistical methods and inferences to do their science are rarely aware of the foundations and history that led them to that practice. Questions like “How did we end up relying on this method?” and “Why do I feel expected to use this method in my work?” are ignored in favor of time spent on technical skills needed to carry out the methods in practice — with justification conditional on the answers to questions that are not asked.

Methods of statistical inference are sold to future scientists and the rest of the public as the way to do research, an integral part of the scientific method, and as an objective way of doing science. This attitude, its perpetuation, and blind acceptance of methods are serious problems for science… and society.

The eye rolls

Okay, this is where I think I start to lose people (if I haven’t lost them already). I can see the eye shifts, the less subtle eye rolls, the transitions to blank faces, and other physical responses in reaction to my alarmist-sounding words. I can hear the voices. “Here we go again…” “Can we focus on doing science and be less philosophical?” “What else is there?” “What do you expect me to do?” “I don’t have time for history … I assume others have checked this or we wouldn’t be doing it.” “It wouldn’t have been taught in university if there wasn’t broad acceptance and justification.” And so on.

There is some hump that I’m rarely able to get over. I need a different strategy. I need something that doesn’t just lead to zoning out or defensiveness. Maybe Ricky Gervais can help me out.

The thought exercise we need, thanks to Ricky

If we destroyed all the work of the last roughly 1000 years (or rewound 1000 years) and started over again — what would reappear in the same form as we currently have it?

Gervais makes this point relative to religion — arguing that now existing fiction and holy books would not appear in another version of history, while much in astronomy, physics, and chemistry would reappear. This thought exercise is an incredibly important one, and a clever definition of what counts as hard science. Our physics and chemistry books might look similar to how they look now, but how broadly does this apply across what we currently call science? What information conveyed through text books now (as if fact) would not likely reappear in a re-play of the last 1000 years? This is meant to be a fun intellectual exercise, not to inspire fear or defensiveness — we do not as a society assign value to work based only on the answer to this question.

Now, back to my discipline — Statistics reaches across disciplines and affects how we do a lot of science and decision making. Methods and skills for carrying out those methods (calculations and computing) are presented in textbooks, courses, and through mentoring as if they are beyond questioning. Information is presented as if it would re-appear in another version of history. But, I don’t believe it would. I would love to have someone convince me otherwise, but I have thought a lot about this over the last twenty years.

This does not imply that I believe we should not be using statistical inference or that it isn’t valuable — only that I believe we should change our attitudes toward it. Inference is crucially important to science and decision making. We need to study it and debate it — and not treat the problem as if it is already settled.

Maybe we would have some version of inference based on probabilities (assuming we even get to the notion of probability, which isn’t clear to me either), but I cannot even begin to convince myself that statistical inference as carried out in practice today would look anything like what we have now. I believe the probability is zero that our Statistics textbooks would look as they do today, or even that we would have something called Statistics textbooks.

I suspect those who still equate statistics with probability theory may disagree with me. Yes, there are deductive mathematical foundations, but they are conditional on acceptance of so much more. I am worried about what we are conditioning on, not the mathematics and computing we do after the fact. It is one thing to study and develop the mathematics underlying games of chance, and quite another to apply the work to hard and messy scientific questions in need of inductive inference.

A dose of humility toward our methodology

If I’m right (or even just could be right) that statistical inference as practiced today would not reappear, what does this say about how we are practicing Science today? At the very least, we need to stop acting as if we are certain the current version would reappear.

Inference is hard and inference should be hard. Statistical methods can only take us so far. Scientific inference is far larger than statistical inference — we need to stop pretending as if statistical inference is the scientific way to get us scientific inferences.

Again, I am not saying we need to (or even should) give up statistical inference and current statistical methodology. I am just appealing for a more humble view of our methods and a shift in attitudes around how we use them. They are not a law of nature, they are a conceptual human invention whose evolution has been greatly influenced by flukes, social structure, and human psychology. I believe we find ourselves in a difficult situation because they have grown to serve the non-scientific purpose of making inferences and decisions feel easier and more comfortable.

Hear it from Ricky Gervais

The moment came from his thoughts between 3:40 to 4:00, and it obviously resonated with Stephen Colbert as well.

For the part relevant to this post, see the 20 seconds starting at about 3:40.