A large collection of authors describes a “consensus-based transparency checklist” in the Dec 2, 2019 Comment in Nature Human Behaviour. I suspect a generally positive reception to the article and checklist, so I think there is probably room to share some skeptical thoughts here. I have mixed emotions about it — the positive aspects are easy to see, but I also have a wary feeling that is harder to put into words. So, it seems worth giving that a try here. I do suspect this checklist will help with transparency at a fairly superficial level (which is good!), but could it potentially harm progress on deeper issues?

As usual, I’m worried about effects on deeper issues related to the ways statistical methods are routinely (and often inappropriately) used in practice. It seems clear that increased transparency in describing methods and decisions used is an important first step for identifying where things can be improved. However, will the satisfactory feeling of successfully completing the checklist lull researchers into complacency and keep them from spending effort on the deeper layers? Will it make them feel they don’t need to worry about the deeper stuff because they’ve already successfully made it through the required checklist? I suspect these could happen to some degree, and mostly unconsciously. Perhaps one way to phrase my sense of wariness is this: I worry the checklist is going to inadvertently be taken as a false check of quality, rather than simply transparency (regardless of quality). If this happens, I expect it will hinder deeper thought and critique, rather than encourage it. I hope I’m wrong here, but we’re human and we tend to make such mistakes.

My opinions come from my experiences as a statistician and my knowledge and ideas regarding statistical inference. I come at this already believing that many common uses of statistical methods and results contribute to a lack of transparency in science (at many levels) — mainly through implicit assumptions and lack of meaningful justification of choices and decisions. We have a long, hard road ahead of us to change norms across many disciplines, and a simple 36-item checklist (or 12-item version for those who can’t find the time for 36 yes/no/NA items) is a tiny step — and one that might have its own issues. We should always consider the lurking dangers of offering easy solutions and simple checklists that make humans feel that they’ve done all that is needed, thus encouraging them to do no more. I don’t mean this in a disrespectful way — it is just a reality of human nature as we attempt to prioritize our time for professional survival.

My main worries are described above and expanded on in the Shortcuts and easy fixes section below. The other sections are thoughts that nagged me while I was reading, and I end the post with very brief and informal commentary about some of the items on the short checklist. I realize I am actually nervous about doing this — it will count as one of my “things that make me uncomfortable, but I really should do” for the week. Open conversation and criticism should be directly in line with the philosophy and motivation underlying the work, so that makes me feel better.

Consensus among whom?

Let’s start with the title. My first reaction was “Consensus among whom?” Before reading further, I quickly skimmed the article to answer this question — because I needed the answer to reasonably evaluate the work described. Was it consensus among all those included in the author list? I knew the backgrounds of a few of them, but not enough to gauge the overall group. Given the goal of transparency, I thought this would be easy to assess; in my opinion, the worth of a consensus depends entirely on who is involved in the process of coming to that consensus! Does the title imply a stronger inference than it should? Does it imply wide scientific consensus, or is it understood to just refer to the method used to decide on the items? It does feel stronger than it should in my opinion (but I have a very low threshold for such things). On the second page, I found some good info: “We developed the checklist contents using a pre-registered ‘reactive-Delphi’ expert consensus process, with the goal of ensuring that the contents cover most of the elements relevant to transparency and accountability in behavioral research. The initial set of items was evaluated by 45 behavioral and social science journal editors-in-chief and associate editors, as well as 18 open-science advocates.” In the next paragraph: “(for the selection of the participants and the details of the consensus procedure see Supplemental Information). As a result, the checklist represents a consensus among these experts.” I did find the lists of people and their affiliations in supplemental information, as well as more information about the process of choosing the journal editors. But, the “among these experts” led to my next thoughts.

Who qualifies as experts on this topic?

My opinion about who deserves the “expert” label for this topic should not be expected to agree completely with the researchers’ (see previous blog post about labeling people as experts). Rather than just stating they are experts, it would be nice to acknowledge in the actual article that we should expect disagreement on this — instead of presenting it in a way that supposes everyone should trust and agree with their criteria. In the article itself, there is no room for justifying the criteria used, and I think this is crucial to evaluating the work — the reader is expected to trust or to check the Supplementary materials (I’m curious how many people actually did or will check it). In the supplementary info, I was happy to see their main assumption for justifying the choice of editors clearly stated: “We aimed to work with the editors of the high ranked journals, assuming that these journals hold high methodological standards, which makes their editors competent to address questions related to transparency.” This seems to imply that “competent” is good enough for the “expert” label. Fine with me, as long as that definition is transparent. Once it is transparent, then people can actually have a real conversation about it and we can better assess the methods and claims in the paper. With limited space, maybe it is just a fact of life that such things end up in the supplementary info?

Back to the assumption — it would take a little more convincing to get me to the point of believing they should be the experts relied on for the consensus, but maybe they bring expertise in logistics and understanding what journals and authors will realistically do. Also — a discussion of self-selection bias would be great (45 of the 207 people identified by the inclusion criteria actually participated, with 34 completing the second round). Maybe the more competent people self-selected because they cared more, but maybe not — maybe the more competent were too busy? Maybe I am over-reacting, but it’s our job as scientists to be skeptical and I don’t know how much to trust these 45 scientists to provide a consensus to be adopted discipline-wide — they too are integrated in the culture and norms that need changing. In the little I looked at, it doesn’t appear this was a difficult consensus to come to, so maybe I am spending way too much time on this not-so-important part. But, the title led me toward thinking it was important from the first glance. At the very least, we should be asking how qualified everyone is to assess all components, and particularly the statistical ones I’m most worried about (most of the items do rely on statistical inference concepts — design and/or analysis). [Statisticians were included on the panel of experts — it would be nice to see disciplines listed in the spreadsheet.]

Shortcuts and easy fixes

I strongly believe that one of the fundamental problems underlying potential crises in science (including transparency and replication issues) is the ready availability of shortcuts and easy fixes (e.g., dichotomizing results based on p-values, encouraging use of overly simple and unjustified power calculations, etc.). Easy hoops to jump through (for researchers, reviewers, funders, etc.) contribute to the continued use of less than rigorous methods and criteria and low expectations for justifications of their use. As they become part of culture and expectations, it gets harder and harder to push back against them. In the context of Statistics, many researchers are not even aware of the problems underlying common approaches in their discipline. Is this checklist another quick fix that people will be quite happy with because it is very easy and leads to a feeling of comfort that they have done all they need to do to “ensure” transparency of their research? I think the danger is real, I’m just not sure how big it is. And, what if it diverts attention away from deeper levels of transparency — hiding them behind expected methods (like power analysis)? Maybe it’s better than nothing, but maybe it’s not. It’s at least worth thinking about.

Shortcuts and easy fixes. The need for a 12-item instead of a 36-item checklist seems to help make my point and increases my wariness. “By reducing the demands on researchers’ time to a minimum, the shortened list may facilitate broader adoption, especially among journals that intend to promote transparency but are reluctant to ask authors to complete a 36-item list.” Well, I certainly agree that a “shortened list may facilitate broader adoption.” It’s generally easier to get people to do things that require less time and effort. But, how does it not send the following message: “We think this is important, but not so important that we would expect you to answer an additional 24 yes/no/NA questions. Your time is more valuable than worrying about justifying transparency in your research.” So, even if I wholeheartedly agreed with the checklist, I don’t agree with the implicit message sent by giving the option for the 12-item version.

A very low bar

From my statistician perspective, this checklist might give the green light to scientists to continue with questionable practices that are default norms — and to feel the comfort of getting a gold star for it. Could the existence of this checklist actually decrease motivation for change in the realms I, and other statisticians, care about? Again, maybe I am being too cynical, but at the very least, we need to contend with this possibility and discuss it. This checklist represents a very low bar for scientists to jump over (even with the 36 items, not to mention the 12). We don’t need more low bars — we need to encourage critical thinking and deep justification of decisions and ideas.

Getting picky about words

There are multiple phrases used in the paper that I believe should require justification. I personally do not buy the statements with the given information. You may see these as picky, but I think it is important to realize that even our papers about doing better science fall prey to common issues plaguing the dissemination of science. I see this as more the norm than the exception — we all do it to some degree! I provide a few phrases here, each followed by a little commentary.

  • “This checklist presents a consensus-based solution to a difficult task: identifying the most important steps needed for achieving transparent research in the social and behavioral sciences.” Wow — that’s strong. I encourage questioning of the use of the phrases “solution” and “most important” — are they justified in this context? What are the implications of using such strong language that, in my mind, clearly cannot be true? I really think we should try to be very aware of wording and when we can use it to reflect our work in a way that is more honest and humble — even if it means it won’t sell as well.
  • “In recent years many social and behavioral scientists have expressed a lack of confidence in some past findings partly due to unsuccessful replications. Among the causes for this low replication rate are underspecified methods, analyses, and reporting practices.” First, it’s definitely not just social and behavioral scientists expressing lack of confidence. Second, we need to carefully consider this idea of “unsuccessful replication” and the false dichotomy (successful or unsuccessful) and hidden criteria it represents (often based on questionable statistical practices). I have another blog post started on this topic, so will save more discussion on this point for that post.
  • “These research practices can be difficult to detect and can easily produce unjustifiably optimistic research reports. Such lack of transparency need not be intentional or deliberatively deceptive.” I wholeheartedly agree with this statement. I also suggest that my first bullet point, along with thoughts on the title, represent “unjustifiably optimistic” wording relative to this research report — and I doubt it was intentional or deliberative.

On a positive note

On a more positive note, I am happy to see it is a living checklist, subject to continual improvement. It will be interesting to see how it evolves and how it might become embedded in norms and culture. The authors explicitly acknowledge that it doesn’t cover everything: “While there may certainly remain important topics the current version fails to cover, nonetheless we trust that this version provides a useful starting point to facilitate transparency reporting.” I want to be cautiously optimistic, but at the same time I can’t naively ignore the wary feeling trying to get my attention.

A look at some of the items

Quick thoughts about some of the items in the 12-item checklist, included as Figure 1 of the article.

  1. “Prior to analyzing the complete data set, a time-stamped preregistration was posted in an independent, third-party registry for the data analysis plan.” So, it is fine to analyze the incomplete data set? This could technically mean all but one observation. This presents a very easy way to technically adhere while not really adhering at all. I wish the focus was on “prior to collecting data” rather than “prior to analyzing” data. Who knows what changed over the course of collecting data and analyzing the incomplete data set?
  2. “The preregistration fully describes… the intended statistical analysis for each research question (this may require, for example, information about the sidedness of the test, inference criteria, corrections for multiple testing, model selection criteria, prior distributions, etc.).” I’ll just make a point about the general language here. I guarantee my view of “fully describe” will not match that of most authors. Who is qualified to assess that? Self-assessment of this seems to me riddled with potential problems — and again it does not have to have anything to do with the quality of the work, only that whatever is intended is described. Describing completely inappropriate and unreasonable methods still makes a ‘Yes’ a perfectly legitimate answer on the checklist. This is where the deeper layers come in. Is it better than nothing, allowing a place to check for potential problems, or does it leave the author with a feeling like things have been checked for quality?
  3. “The manuscript fully describes… The rationale for the sample size used (e.g. an a priori power analysis).” Again, this one is hard for me to address. See my recent post here on power analysis; I have seen plenty of manuscripts and grant proposals that “fully describe” the ingredients used without adequate justification of those ingredients. This is an easy box to check that will continue to promote poorly justified and misunderstood power analyses, while continuing to help people feel good about them. I realize it’s included just as an e.g., but…
  4. “The manuscript fully describes… the study design, procedures, and materials to allow independent replication.” I assume that by “manuscript” they are including Supplementary Materials? Otherwise, this isn’t realistic. Again, this should not be taken as indicating any quality or rigor in the study design and procedures, only that they are described — which is an important first step!
  5. “The manuscript fully describes… the measures of interest (e.g. friendliness) and their operationalizations (e.g. a questionnaire measuring friendliness).” While I could quibble with the wording here, I think this one could nudge people into doing a better job in this area — as long as the distinction is used throughout the paper, and not simply in one disclaimer sentence.
  6. “The manuscript … distinguishes explicitly between “confirmatory” (i.e. prescribed) and “exploratory” (i.e. not prescribed) analyses.” I agree this one could do good — in my experience many researchers do not really understand the difference and simply don’t report much of the exploratory analysis and results. I don’t think this checklist will protect against that, but it might nudge thinking in the right direction.

In this post, I am sharing an email update from the authors of the March 2019 commentary in Nature that included signatures from hundreds of statisticians and was timed to coincide with the special issue online supplement “Moving to a World Beyond ‘p < 0.05’” of The American Statistician.

The email contains links that may be of interest. I have not read all of them yet and my posting the links here does not imply I agree with the content. However, I do agree with having these conversations and welcome comments and thoughts.

I did watch the NISS webinar and can say I disagreed with a fair amount of what was said — particularly the philosophy underlying Berger’s presentation and many of the assumptions needed to justify the neat and tidy math. I agree with most of Greenland’s talk. We have to get away from the focus on identifying quick methodological fixes and be willing to look at the bigger underlying problems — even if it makes life less comfortable.

Dear Colleague,

We are writing with a brief update on events following the Nature comment "Retire Statistical Significance". In the eight months since publication of the comment and of the special issue of The American Statistician, we are glad to see a rich discussion on internet blogs and in scholarly publications and popular media.

One important indication of change is that since March numerous scientific journals have published editorials or revised their author guidelines. We have selected eight editorials that not only discuss statistics reform but give concrete new guidelines to authors. As you will see, the journals differ in how far they want to go with the reform (all but one of the following links are open access).

1) The New England Journal of Medicine, "New Guidelines for Statistical Reporting in the Journal"
https://www.nejm.org/doi/full/10.1056/NEJMe1906559

2) Pediatric Anesthesia, "Embracing uncertainty: The days of statistical significance are numbered"
https://onlinelibrary.wiley.com/doi/10.1111/pan.13721

3) Journal of Obstetric, Gynecologic & Neonatal Nursing, "The Push to Move Health Care Science Beyond p < .05"
https://www.sciencedirect.com/science/article/pii/S0884217519304046?via%3Dihub

4) Brain and Neuroscience Advances, "Promoting and supporting credibility in neuroscience"
https://journals.sagepub.com/doi/full/10.1177/2398212819844167

5) Journal of Wildlife Management, "Vexing Vocabulary in Submissions to the Journal of Wildlife Management"
https://wildlife.onlinelibrary.wiley.com/doi/full/10.1002/jwmg.21726

6) Demographic Research, "P-values, theory, replicability, and rigour"
https://www.demographic-research.org/volumes/vol41/32/

7) Journal of Bone and Mineral Research, "New Guidelines for Data Reporting and Statistical Analysis: Helping Authors With Transparency and Rigor in Research"
https://asbmr.onlinelibrary.wiley.com/doi/pdf/10.1002/jbmr.3885

8) Significance, "The S word … and what to do about it"
https://rss.onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2019.01295.x

Further, some of you took part in a survey by Tom Hardwicke and John Ioannidis that was published in the European Journal of Clinical Investigation along with editorials by Andrew Gelman and Deborah Mayo:
https://onlinelibrary.wiley.com/toc/13652362/2019/49/10

We replied with a short commentary in that journal, "Statistical Significance Gives Bias a Free Pass"
https://onlinelibrary.wiley.com/doi/full/10.1111/eci.13176

And finally, joining with the American Statistical Association (ASA), the National Institute of Statistical Sciences (NISS) in the United States has also taken up the reform issue:
https://www.niss.org/news/digging-deeper-radical-reasoned-p-value-alternatives-offered-experts-niss-webinar

With kind regards,
Valentin, Sander, Blake

“Don’t ask” teaching in Statistics

November 22, 2019

This short Nature World View article by Jerry Ravetz is definitely worth a read by all teachers of Science. The title “Stop the Science Training That Demands ‘Don’t Ask’” pretty much says it all. He doesn’t explicitly mention teaching of Statistics — so I’m throwing this out there directly. Teaching of Statistics happens behind the scenes within classrooms, research, and mentoring — as well as in formal classes with statistical-sounding words in the title. It certainly flows through other scientific disciplines and media reporting of results (like the vacillating advice on nutrition Ravetz mentions).

In my experience related to the teaching of Statistics within academia (both formally and informally), I think we need to add another layer to the appeal to stop “don’t ask” training. We often assume that those doing the training (again, formal or informal) have a sufficient depth of knowledge on the subject to have some control over perpetuating a “don’t ask” culture. But, such a culture can be conveyed unknowingly when the teacher does not possess the depth of knowledge or practical experience needed to open up space for asking deep questions and to promote meaningful discussions around them. And the most dangerous case is when the teacher does not realize they lack that depth of knowledge. They may think they are offering an open culture of questioning, but the depth of the questions asked and answered is so superficial that it doesn’t get at what I believe Jerry Ravetz is arguing for in this article.

We must face, as a scientific community (not just Statistics community), that many of our teachers of Statistics cannot be expected to elicit or navigate the tough questions that need asking — and may not even realize there is much to ask. This is not the fault of the teachers themselves — it is the fault of a system of norms that has developed over decades within scientific culture. It is completely accepted by most for a math teacher, a researcher with training in another discipline, etc. to teach statistical inference, even formally in a classroom. This is not done out of blatant disregard for quality or ethics, but because it appears to have worked in the past (as judged by those who were also taught by such teachers) and we simply don’t have the workforce of people with depth of knowledge in statistical inference to meet the demand.

Imagine coming straight out of an undergraduate degree in Mathematics, showing up as a new graduate student in Statistics, and being handed your own course to teach statistical inference. You may be another university student’s first and only teacher of Statistics before they head out into the real world. You might laugh this off as unrealistic, but I assure you it is not. The mathematics involved in introductory statistics is low enough level that anyone with some math skills should be able to teach it, right? Never mind about the “inference” part of the story, hopefully they’ll pick that up somewhere else in life — both student and teacher.

For my first master of science degree (pre-Statistics life), I took a calculation-focused first course in statistics (literally using handheld calculators). I repeatedly asked about the difference between a standard deviation and a standard error and finally got an answer that may have been technically correct according to formulas, but was not satisfying conceptually. I would later have an “ah-hah” moment of understanding as a graduate student of Statistics when I was taught by people with PhDs in Statistics and years of experience working in science. I would also later realize how very little knowledge I had about statistical inference from my first master’s degree (though I did use statistical inference for my thesis!) and some doctoral work before then. At the time, I had no clue about how much I didn’t know, but proceeded to help my fellow doctoral students because I had (or at least thought I had) more knowledge and skills than they did. Had you asked me at the time, I probably would have thought I was qualified to teach an introductory class in Statistics. After a year in a Statistics graduate program, I would have changed my answer. After seven years and a PhD I would have added strong emphasis to the ‘no.’ I am now very thankful I did enter a program that immediately called on me to do something I was not prepared to do. And, I was not an undergraduate math student! I had a master of science degree and had done my own research using statistical inference!

It’s very uncomfortable to face the extent of the problem, largely because I don’t believe there is an easy solution to it. The only realistic option is probably more continuing education to get people to the point where they realize there is so much more to learn. The problem with this idea is that the people we really need to reach don’t think they need continuing education — the very source of the problem is hindering the fix of the problem. And there needs to be a workforce of teachers for the teachers — that doesn’t exist even close to the extent needed. Unfortunately, the current love affair with “data science” is likely to just make the problem worse — there are so many more attractive ways to spend resources than educating those who already have jobs that involve teaching. I apologize for ending on a pessimistic note, but I believe it is a realistic note.

Here is a paragraph from Ravetz’s article that should leave us with food for thought relative to the teaching and practice of statistical inference in Science today.

The philosopher Thomas Kuhn once compared taught science to orthodox theology. A narrow, rigid education does not prepare anyone for the complexities of scientific research, applications and policy. If we discourage students from inquiring into the real nature of scientific truths, or exploring how society shapes the questions that researchers ask, how can we prepare them to maintain public trust in science in our ‘post-truth’ world? Diversity and doubt produce creativity; we must make room for them, and stop funnelling future scientists into narrow specialties that value technique over thought.

Jerry Ravetz, Nov 19 2019 online version of Nature https://www.nature.com/articles/d41586-019-03527-y

An “as if” ramble about Type I error rate

November 21, 2019

I just finished the below post in one sitting and came back to write a reasonable introduction — or maybe more of a disclaimer. Once again, it did not proceed as I had hoped or planned. It’s a ramble. But, perhaps the fact this topic always turns into a ramble for me is part of the point. So, instead of adding it to the collection of other unpublished draft posts labeled “too-rambly,” I’m going to make myself just put it out there for what it is. Have fun…?

***********************************************************

I have multiple draft posts in the queue related to the concept of a Type I error — not because I like it, but because it is so embedded in many areas of science and discussions about current problems in science. I feel obligated to try to take it on in a different way. Given the press it gets, there is way too little discussion of what concepts and assumptions lie behind the words. When I used to try to teach about related concepts I often drew a line down the whiteboard and wrote “Hypothetical Land” on one side and “Real life” on the other. An awareness of the hypothetical land is important for understanding parts of statistical inference — even if you choose to go forward assuming it’s a reasonable enough model to be useful in your science. Each post I have attempted has blown up into something much too big and chaotic. So, I am going to try to stay close to the trunk in this post and avoid venturing off on the branches — no matter how attractive they look.

To provide a starting place outside of my own brain, I just looked up “Type I error” in the only Intro Stats book I happened to still have on my shelves. It’s Intro Stats by De Veaux & Velleman with contributions by Bock (2004 by Pearson). You might expect it to be a little outdated, but I’m not convinced it really is — at least for the sake of today’s topic. The table of contents lists “Type I error” with subheadings “Explanation of” and “Methods for reducing.” I’m not sure how you reduce a single Type I error, so I assume in the second heading they are referring to reducing the Type I error rate. This is a subtle, but very important point.

I flipped to page 395 and found the relevant section. It starts with “Nobody’s perfect.” Okay — I can get behind that and completely agree.

Then, in the second sentence, it moves right to decisions, as if the goal and end result of any statistical analysis is a simple binary decision. This is the first branch I will avoid as I fight hard to stay on task for today. I just ask that you don’t ignore the “as if” as we pass this branch (and others).

According to the text, it turns out we can only “make mistakes in two ways.” I could certainly find that a huge relief — but really it’s more disturbing than not. But, remember the as if — if we appeal to a greatly simplified binary reality, then maybe we buy into the usefulness of the two-error theory of mistakes, and here are the two (as quoted from the book):

I. The null hypothesis is true, but we mistakenly reject it.
II. The null hypothesis is false, but we fail to reject it.

These two types of errors are known as Type I and Type II errors.

Page 395. De Veaux and Velleman (2004), Intro Stats, Pearson.

Okay – even within this greatly simplified, binary decision, as-if world, the null hypothesis typically states that a parameter (of some model) is equal to a single value — and most commonly that single value is zero. Even taking an as-if view (to gain mathematical traction), I will never believe a parameter (defined on a continuum) is equal to a single number like 0.000000….. So, what does that imply about these errors? It appears I can never actually make a Type I error because the null hypothesis is never true. Phew — so my Type I error rate is always zero. What a relief. I guess we can stop here. But wait, … I promised to stay on the trunk and this is definitely one of the branches luring me away. I’m using all the willpower I can muster to stay on track.

Here we go — Let’s narrow our as-if hypothetical world view even further and suppose it makes sense to consider the potential “truth” of a null hypothesis (it either is true or it is not true). It certainly provides space to approach the problem mathematically and perhaps that’s worth it (invoking the idea of willful ignorance).

Where does that send us? Well, the text (and many other intros to this concept) proceeds by making the analogy to medical testing procedures and false positives and false negatives. I get the connection and I see the teaching angle of tying a scenario of tangible real-life errors to those associated with the idea of statistical hypothesis testing. The trial analogy with guilty or innocent is another one often appealed to, and I admit I have gone there with intro classes (before I thought hard enough about it). But it has always made me uncomfortable and I think it should make you uncomfortable too. The relevance of the analogy for teaching rests on the assumption that the null hypothesis being true or false is like a person having a disease or not, and the diagnostic test is like a statistical test. I won’t go further up this branch now, but I will say I do think there is huge benefit to asking students to think about how the scenarios with a clearly defined truth (disease or not, guilty or not) relate (and don’t relate) to the statistical hypothesis testing scenario — that is, let’s question the strength of the analogy and the assumptions it rests on.

Moving on again. The title of this post has “rate” in it, so let’s move in that direction. On page 396, De Veaux and Velleman provide this:

How often will a Type I error occur? It happens when the null hypothesis is true but we’ve had the bad luck to draw an unusual sample. To reject H0, the P-value must fall below α. When H0 is true, that happens exactly with probability α. So when you choose level α, you’re setting the probability of a Type I error to α.

Page 398. De Veaux and Velleman (2004). Intro Stats, Pearson.

Here we’re getting closer to a “rate” — though I don’t see the term ever used in this section of the text, which is interesting. Let’s do some dissecting of words and see where that gets us. First — the words “How often will a Type I error occur?” This idea is key and often (if not usually) glossed over. It’s nice that they explicitly state the question here, but then they skirt around the context that’s needed to make sense of it. It is typical that words around this idea are written in such a matter-of-fact way that the intended recipient does not even realize the context is missing. To understand this statement, we have to understand what is being referred to by the term “how often.” How is this defined or supposed to be thought of? Over all hypothesis tests conducted? How often in time (e.g. once a week)? How often over a researcher’s or data analyst’s career? There has to be some reference or basis to move forward in terms of attaching meaning, as well as mathematics, to it.

The clue is in their next sentence: “… but we’ve had the bad luck to draw an unusual sample.” There is a lot embedded in these words and perhaps I should assume that a student who read the first 395 pages of the book would catch the subtle connections, but my years of attempting to teach statistical inference lead me to believe that’s not the case. There is already so much going on and it is referencing concepts that are extremely challenging for most students to master in a first course, regardless of their mathematical backgrounds.

The reader likely focuses more on the words “bad luck” than the words “an unusual sample.” We need to connect the idea of the “how often” to the idea of the “sample.” Where does this connection come from? Well, hopefully the concept of a “sampling distribution” or a “randomization distribution” is still in our conscious memory. Because they used the “sample” wording, we’ll go with the sampling distribution idea here. (The randomization distribution idea is analogous, but based on random assignment to groups rather than random sampling from a population.) The word I have added here is “random” — and this is really necessary to support the theory underlying the idea. It’s hard enough to wrap our heads around the idea with the random — and so, so much harder to get at the “how often” issue if there is no random mechanism fed into the design. But, that is yet another branch of the tree I am not going to get tricked into following now.

So, let’s go with the assumption that you chose one sample (collection of units, individuals, etc.) from your population of interest at random. If you actually did this, then you shouldn’t have to stretch your imagination too far to imagine a different (although hypothetical to you) random sample, and another, and another, and another, …. you get the point. So, the “how often” in the statement is connected to “how often” out of all possible random samples!! This idea of a collection of possible random samples or random assignments to groups provides a tangible (though largely hypothetical) basis for taking on the “how often” concept and this is what gets us to the rate.

As much as I tried, I can’t move on without admitting that I cringe to see the word “exactly” in the quote and my eye starts twitching when I notice the emphasis added to it. Remember, this all makes sense in some very hypothetical, very over-simplified model of reality. The word “exactly” only makes sense within that world as well — so take it with a grain of salt. Don’t stop asking yourself how much simplification and hypotheticalization (note to look up this word) is too much — when is the model too simple or too far from reality to be useful?

Continuing… Let’s gently, but seriously, remind ourselves that this whole Type-I error idea rests on assuming the null hypothesis is TRUE. So, IF the null hypothesis is true (e.g., the true mean of some measured quantity in your population of interest is 0.000000….), THEN, just because of variability in the measured outcome among individuals, a proportion of the possible random samples will result in summary statistics (like sample averages) that are far enough away from zero that if an automatic criterion and decision framework is used, the researcher will “reject the null” — when under the assumption of the null being true, they shouldn’t have. I think of this as a realistic implication of not having enough information, not bad luck, but I assume they use ‘luck’ to appeal to the ‘random’ that was left out of the words. There are plenty of random samples that would lead to the decision to “fail to reject the null” — which of course does NOT directly support the null being true. Phew. This is exhausting. We can see why intro stats books sweep a lot under the rug and further simplify an “as-if” model that is already unrealistically simplified. [Note: my use of mean and average in the example assumes the mean of some quantity over the entire population is what we should be trying to learn about, an assumption we’re rarely asked to justify or think hard about, but that’s yet another branch! And see previous related post.]
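To make the “how often” concrete, here is a minimal simulation sketch in Python (my own illustration, not from the textbook; the normal population, sample size, and seed are arbitrary placeholders). It treats a large number of pseudo-random samples as a stand-in for the hypothetical collection of all possible random samples, with the null made true by construction.

```python
# A minimal sketch: when the null hypothesis really is true, roughly alpha of
# the possible random samples produce a p-value below alpha, purely because of
# sampling variability. The population and settings here are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

alpha = 0.05        # the chosen "Type I error rate"
n = 30              # size of each hypothetical random sample
n_samples = 10_000  # stand-in for "all possible random samples"

rejections = 0
for _ in range(n_samples):
    # The null is true by construction: the population mean really is 0.
    sample = rng.normal(loc=0.0, scale=1.0, size=n)
    _, p_value = stats.ttest_1samp(sample, popmean=0.0)
    if p_value < alpha:
        rejections += 1

print(f"Proportion of samples leading to 'reject the null': {rejections / n_samples:.3f}")
# This prints a number near 0.05 -- the "rate" only has meaning relative to
# this hypothetical collection of repeated random samples from one population.
```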

Back to the “rate.” Let’s say we decide, probably arbitrarily or because someone told us to do it, that we’ll set our Type I error rate to be 0.05 (I am watching this branch go by as well). We happily say “the probability of making a Type I error is 0.05.” This feels good to say, but what does it really mean? I argue that most people have very little idea what that really means and that to be more honest we should at least refer to the Type I error rate, and not this vague idea of the “probability of a Type I error.” Here again — another enormous branch leads off the trunk I’m trying to stay on — related to the difficulties inherent in defining probability to begin with. We’ll stick with the rate version — as we are tying “rate” to a proportion of all the possible random samples [that would lead to rejecting the null when really the null is true, conditional on the set-up of the test and all its underlying assumptions]. Oh, did I mention that we might assume an infinite population with infinitely many possible random samples from it? Shoot — I’ll let that one lie for now too.

I knew this was going to be hard, but it’s been far harder than I anticipated!! Hang in there. I’m almost to the point that I really wanted to make with this post. The De Veaux and Velleman quote, as well as many other references teaching about Type I error, make it very clear that you as the researcher can control your Type I error rate. This is another comforting thought — finally, something you can control regarding the outcome of your study. But, how useful is this, even if you do buy into the concept of a Type I error and the model behind it? Well, let’s be clear on one thing. You CANNOT control whether your study and analysis results in a Type I error — you won’t know if it does and you can’t control that. But, you supposedly CAN control the Type I error rate associated with your study. But, what does this really mean and how helpful is it? If taking other samples is purely hypothetical and you don’t know if you’ve made a Type I error, what do you do with that information from a practical standpoint? We typically only take one sample and that’s all the information we have. We rarely (never?) take multiple random samples of the same size from the exact same population at the same time. If we can afford to do more sampling, then we take a single random sample of more individuals — we don’t keep them as separate random samples of a smaller size, because why would we?? We can always make this largely hypothetical construct feel less relevant by taking a very large sample — trying our best to ensure the sample average is “near” (on some scale that matters practically) the population mean.

I fear I’m failing to make the point that seems so clear in my own head. We talk about this “Type I error rate” and “Probability of a Type I error” as if it is something that exists beyond a hypothetical construct. We treat it as if it is like false positives and false negatives where the “false” can actually be known at some point. We talk about broad implications for Science, making claims like we should expect 5% of all reported results to be wrong. While I firmly believe our reliance on hypothesis testing and belief in the utility of trying to avoid Type I errors is definitely leading to issues in science, it’s not that simple and we have to start looking deeper at the meaning of these concepts. They are not magic and they are not simple.

The “replication crisis” (known by other names too, including reproducibility) often rests on assuming an integral role of the Type I error rate. And now, there is a push to “replicate” studies, often motivated by the idea of Type I error rates, rather than just a desire to learn more about a problem by gathering additional information. I am all for continuing to build knowledge by repeating the design of a study and doing an in-depth comparison of the results or rigorously combining the information toward a common goal. But, I worry our misunderstandings of Type I errors (or lack of opportunity to think hard about the concept) are leading to oversimplifying the fix to the replication crisis as well. I think it has contributed to a focus on simply comparing p-values, despite statisticians warning against this. And, let’s go back to the “how often” idea. Even if a different researcher, in a different lab, tries their best to repeat the design of a study, we are still a long way from the definition of where the error rate comes from — unless we are drawing new truly random samples from the same population and re-doing the analysis exactly the same under all the same assumptions. So, we are not truly “replicating” a study in the sense that defines a Type I error rate. And even if we were, what do we do with two (or three) out of possibly infinitely many random samples? That’s barely better than one! We should be gathering more information, but let’s not fool ourselves about the usefulness of the Type I error rate construct. And here I’ve ventured onto yet another branch that I have to keep myself from going down.

I keep repeating myself: this post was not at all as satisfying as it was supposed to be, but I guess that is the lesson for me. There is nothing satisfying about a Type I error rate (or a Type II error rate or Power) when you really try to dig deep into the weeds. These are not laws or principles that should dictate how we should be doing science — they are based on models of reality that score very high on the “as-if” scale. These concepts were created largely for mathematical convenience and are beautiful in a purely mathematical sort of way. I admit, they are superficially satisfying if we happily ignore the many unsettled conceptual issues and proceed as if we’re in an algebra class, rather than practicing science in the real world. It’s like eating a piece of delectable high-fructose-corn-syrup filled candy — it’s so enjoyable as long as you don’t let yourself think about what’s in it and the potential long term implications for your health. It always tastes better before you look at the ingredients.

Given a few recent questions about my thoughts on statistical power, I’m going to try to get more practical in this post. There is still so much more to be said on this topic — and I am just trying to take it in blog-sized pieces.

For those of you who were lucky enough to first learn from The Statistical Sleuth by Fred Ramsey and Dan Schafer (3rd edition, 2013, Brooks/Cole Cengage Learning), your first introduction to using concepts of statistical inference to help choose sample size for a study did not involve power. I took my first serious course in applied statistics from Fred (as a prerequisite for getting into the master’s program in Statistics), and then was lucky enough to later be his teaching assistant. I did take a Statistics course for my first master’s degree in Kinesiology, but my main memory from that class is using a calculator to produce numbers — it was all about calculations, and light on the concepts motivating the calculations. I don’t even remember what book we used.

“Choosing a Sample Size” is covered in Chapter 23 – Elements of Research Design. Let’s start by noting their choice of words. The section is not titled “Calculating Sample Size” or “Determining Sample Size” — as if there is one correct answer we should arrive at by punching numbers into our calculators. It is a choice. This in itself is a shift from the typical attitude I see in science today and that shift can make a huge difference — it seems completely reasonable to expect someone to fully justify their choice, but less so to have someone justify their calculation. Before I dig in, I need to give the disclaimer that any sample size calculation will suffer from many of the same issues plaguing those based on power. But there are improvements that can be made that encourage a more thoughtful and holistic approach — one that I hope results in a more realistic view of the trustworthiness, and thus usefulness, of the final numbers. As I said about power, I think there is huge value in going through the challenge of the exercise, and the exercise should lead to us putting less emphasis on the numbers coming out of it — a way of seeing them for what they are.

The approach presented by Ramsey and Schafer is not based on power. It is not based on the assumption that the outcome of a study will be dichotomized into “reject” or “fail to reject”, or worse “significant” or “not significant.” Instead, they assume the desired statistical result will be an interval conveying uncertainty in the estimate or parameter (however you look at it). Then, they appeal to the more general idea of going after a desired precision relative to practical significance, rather than avoidance of largely hypothetical Type I and Type II errors (another post to be released soon).

Here is a quote from the intro to the section (pg 705): “The role of an experiment is to draw a clear distinction between practically meaningful alternatives.” That is, the role of a study is to do work distinguishing between values of the parameter that do not have practical relevance (usually coinciding with small values) and values that clearly do have practical relevance — thus allowing researchers to support next steps. Next steps might be deciding to not pursue the idea further (hopefully after disseminating the results!), deciding to repeat the study design, seeking funding for more research, arguing for adopting an intervention, etc. This starting place can bring a different mindset to using statistical inference in practice.

So, from a practical side, how do we start the process when our goal is distinguishing among values of the parameter that have different practical implications? The key is the necessary first step. And ironically, this is the step that has been left out of teaching, left out of textbooks, and left out of practice. The researcher must be willing to do the hard a priori work of attaching practical meaning (or implications) to values of the statistical parameter of interest. The attitude I tend to see rests on a belief that the point of statistical inference is to do the hard work for us — a belief that statistics will tell us what values are practically meaningful or important in some attractively objective way. This belief is where things start to go horribly wrong in practice.

You may still be wondering what I am actually asking that a researcher be able to do. There are multiple approaches to this, but I will show you one here that I have found tangible enough to consistently motivate productive work. I like it because it comes with a picture, and one that takes little artistic skill to produce. It is a sketching and coloring exercise. Here are the general steps:

  1. Draw a number line on a piece of paper. It should convey the range of realistic values of your parameter of interest. Even this step can be hard in situations where researchers have only a vague idea of how the parameter in a statistical model relates to their research goals (more on this below).
  2. Choose a color to represent values of the parameter that would be associated with practically meaningful results, in the sense that they would clearly motivate more work in the area or advocacy for the “treatment”, etc. Color the corresponding interval(s) on the number line.
  3. Choose a different color to represent values of the parameter clearly associated with results that are not practically meaningful (ones that would back up not pursuing the idea further). Color the number line accordingly.
  4. At this point, you should have intervals of the two colors, but you likely have regions between them that are not colored. It also likely felt uncomfortable to make the decision about where a color should end, and it should have! That uncomfortable feeling is okay and reflects depth in thinking about the problem. It is completely unrealistic to think that values identified with Step 2 will change to values identified with Step 3 at one point on the number line! The gray area in between is incredibly annoying when it comes to having to carry out calculations, but incredibly important when taking a more holistic view to interpreting results from statistical analysis. I typically try to represent the uncomfortable gray area in the sketch as a gradual transition and blending of the colors in that region.

Here is a picture I just drew (I only included zero on the number line for reference, and added some wording assuming a difference in means was of interest). Note: I am still working on improving the wording around this “practically meaningful” discussion, but I think it’s an okay starting point.

[Figure: sketch of regions connecting values of the parameter to the idea of practically meaningful/relevant results]
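For readers who prefer code to colored pencils, here is a rough Python/matplotlib sketch of the same kind of picture (my own, not part of the original exercise; the cutoffs of 2 and 5 and the axis range are hypothetical placeholders that a real researcher would have to justify for their own parameter).

```python
# A rough sketch of the colored number line described above, for a difference
# in means where larger values are taken to be more practically meaningful.
# The cutoffs (2 and 5) are hypothetical placeholders, not recommendations.
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(figsize=(8, 2))

# Values clearly NOT practically meaningful (one color) ...
ax.axvspan(-1, 2, color="tab:blue", alpha=0.35)
# ... values clearly practically meaningful (another color) ...
ax.axvspan(5, 10, color="tab:orange", alpha=0.35)
# ... and the uncomfortable gray transition region in between,
# drawn as many thin strips so it reads as a gradual blend.
for x in np.linspace(2, 5, 60):
    ax.axvspan(x, x + 0.06, color="gray", alpha=0.05 + 0.25 * (x - 2) / 3)

ax.axvline(0, color="black", lw=1)  # zero shown only for reference
ax.text(0.5, 0.5, "clearly not\nmeaningful", ha="center")
ax.text(7.5, 0.5, "clearly\nmeaningful", ha="center")
ax.set_xlim(-1, 10)
ax.set_ylim(0, 1)
ax.set_yticks([])
ax.set_xlabel("difference in means (units of the measured outcome)")
plt.tight_layout()
plt.show()
```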

The sketching is simple, but the exercise is typically quite difficult and quite uncomfortable for researchers, and that is okay. Doing this work up front builds a framework to support subsequent statistical inference, rather than pretending as if Statistics can do the uncomfortable work for us. Statisticians can guide researchers through the process, but this is territory for subject-matter expertise — for those who are intimate with the tools of measurement, the literature, and the implications different values of the parameter should have on life. It is a matter of judgement — but the researcher needs to put something out there and then be willing to justify it to their scientific community. This does not imply that all researchers studying a topic should come to the same picture — it depends on the model, the parameter, the implications of the research, etc. It would, however, be great if it motivated researchers studying similar problems and using similar instruments to work together to come to some agreement. The resulting picture then forms an important component of the research — both before and after data are collected. Pre-registration of studies should involve submitting this sketch! 🙂

As I suspected would happen, I have veered away from where I had planned to go in this post, to covering what was supposed to be a separate post. It might not be so bite-sized after all, but I am going to continue and try to get back to where I started!

On that note, let’s continue on a side path for a minute more. The idea I just presented (as well as a more traditional power analysis) is based on the premise that the researcher can satisfactorily interpret the parameter in the context of the research and attach practical meanings to different values. A difference in means is about as simple as it gets, but you may be surprised how complicated even this situation quickly gets. For example, suppose a composite score from a survey with numbers attached to different answers is being used as the response variable — what does the scale really represent from a practical standpoint and are means/averages over people meaningful enough to focus analysis on? Skipping this thought exercise and going straight to relying on statistical summaries like p-values or grabbing estimates from other studies as if they have some magical connection to what is practically meaningful does not make sense and is not justified theoretically. Just because someone else using the same instrument carried out a study and got an estimated difference of 1.5 (or its translation to an “effect size” as defined in your discipline) does not mean it should be plugged into power calculations as if it represents a threshold for practically meaningful values. It may feel more comfortable because it’s easy and feels “objective,” but those are not reasons to support the practice.

The picture should be drawn from a deep understanding of the problem, the measurement tool, the model, and the parameter. Running into problems with this exercise can be incredibly frustrating, but it is an amazing opportunity to understand and possibly adjust your design before you waste time and money collecting data. The existence of a power analysis and its associated result is sometimes used to judge potential worth of a study. My opinion is that if a researcher isn’t willing to, or can’t, go through the process of drawing the above picture, then that is an indication they aren’t yet ready to spend money and time collecting data — because there is no deep plan for what results are going to be gauged against or how.

Okay — finally back to connecting all of this to investigating choices for sample sizes. The sketch provides the context and backdrop for the investigation. I hesitate to even go forward toward calculations, but I think the calculation aspect can help solidify the underlying concepts — and the difficulties in doing it point toward the more holistic (for lack of a better word) approach to interpreting results from statistical analysis. That is, the sticky points in the calculations point to where we have to make hard-to-justify assumptions and decisions to arrive at a number in the end. And the number is only as good as the justifications going into it.

Over the years, I have generalized the ideas presented by Ramsey and Schafer into something I feel comfortable with — mainly distancing my version from the null hypothesized parameter value. But, given how comfortable most people are with that idea, it probably is still a useful starting point. So, here is their Display 23.1, “Four possible outcomes to a confidence interval procedure,” to give you a flavor and context for starting.

[Display 23.1, reproduced from Ramsey & Schafer (2013), The Statistical Sleuth: A First Course in Methods of Data Analysis, 3rd edition, Brooks/Cole Cengage Learning.]

Ramsey and Schafer’s approach to using calculations to help choose a sample size (I’m sure others deserve credit for this too) is based on using sample size to control the width of a confidence interval. Holding all other inputs constant, increasing the sample size decreases the width of the confidence interval. Attempting to control the width of a confidence interval can then serve the goal of trying to design a study that is capable of helping to distinguish among values with different practical consequences. We are going after controlling precision in estimation, rather than preventing Type I and Type II errors. The desired precision is then directly tied to the information in our sketch — we would like to have a confidence interval narrow enough that it’s possible to land completely in one color or the other. To go forward with calculations, we have to be willing to choose a sharp cutoff between our colors even though we know a sharp cutoff is not realistic (willful ignorance is needed). The calculations involve solving for the sample size that gives a desired interval width — conditional on the model and all other inputs (more willful ignorance needed). If you can obtain a confidence interval using a formula or using a computer and statistical software, then you can carry out the rather boring calculations using algebra or computer simulation. I will not spend more time on details of the calculations, because I hope by this point it’s clear that the calculations are not the important part of the process.
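As one concrete illustration of those rather boring calculations, here is a minimal Python sketch in the spirit of that approach (my own, not Ramsey and Schafer’s code). It assumes equal group sizes, a guessed common standard deviation, and a normal approximation, and every input number is a hypothetical placeholder that would need the kind of justification discussed above.

```python
# A minimal sketch of choosing n per group so that a confidence interval for a
# difference in two means has a desired full width. Assumes equal group sizes,
# a guessed common standard deviation, and a normal approximation -- all
# placeholders standing in for the hard justification work.
import math
from scipy import stats

def n_per_group_for_ci_width(desired_width, sd_guess, conf_level=0.95):
    """Smallest n per group giving a CI for (mean1 - mean2) no wider than desired_width."""
    z = stats.norm.ppf(1 - (1 - conf_level) / 2)
    # Full width of the interval is 2 * z * sd_guess * sqrt(2 / n); solve for n.
    n = (2 * z * sd_guess) ** 2 * 2 / desired_width ** 2
    return math.ceil(n)

# Hypothetical example: we want an interval narrow enough (full width of 3
# units) to be able to land entirely inside one colored region of the sketch,
# guessing the within-group standard deviation is about 4 units.
print(n_per_group_for_ci_width(desired_width=3.0, sd_guess=4.0))
```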

Beyond justifying choice of sample size, the sketch exercise can be used throughout the research process. After data collection and analysis, uncertainty intervals can be placed on top of the sketch to provide a framework for critically evaluating the results in the context of the research and its implications — at a much deeper level than statistical summaries alone can possibly provide. In the process of designing the study, you can go through many hypothetical outcomes for where your uncertainty interval may fall and what you would do/say with that information.

I believe this is a tangible way we can improve the use of statistical inference in practice. It has clear connections to calls for the use of interval nulls, but goes well beyond that suggestion in terms of connections to the research context. It doesn’t rely on weighing results against arbitrary p-value or effect size cutoffs. It does not have to result in a yes or no answer. It is simply an honest comparison of the values in an interval conveying particular sources of uncertainty to a priori information about what those values are believed to mean in a larger context involving the implications of the study.

This approach gives power back to the researcher, rather than blindly turning it over to statistical analysis.

I very much welcome comments and questions. And, I would love to have people submit their sketches attached to real studies!

This is another post inspired by an entry on Andrew Gelman’s blog and the first comment: https://statmodeling.stat.columbia.edu/2019/11/14/is-abandon-statistical-significance-like-organically-fed-free-range-chicken/. I hesitate to continue to give so much attention to p-values, but it’s clearly a topic not going away anytime soon and one we need to be discussing, even if it often feels superficial. Hopefully it is an easily accessible door leading to other conversations.

I am intrigued by this commonly given reason for wanting to continue to rely on p-values: “because I understand what they mean.”

In my experience, it is rare that a researcher/reviewer really understands and can articulate what a p-value means, particularly in the context of observational studies. When we venture outside the simple intro-stats contexts that can be easily programmed with pseudo-random number generators — like one-factor experiments with random assignment or random selection from a well-defined population — things get sticky. I’m all for using simulation to initially teach the concept of a p-value (if needed), but we also need to push students to think about how those ideas do, and do not, extend to more complicated situations and studies without any random mechanism linked to the design. To rely on a p-value as if it represents useful information about the results of a study, the researcher should be able to explain what information is contained in that number in the context of their study — in an interpretation sort of way, not reciting a generic definition. This is hard, even for formally trained statisticians. We combine information from the data with assumptions to arrive at a neatly packaged and attractive number — but what are those assumptions and when do they make enough sense to appeal to? For observational studies, it involves a lot of “as if” hypothetical thinking. If it’s not making your brain hurt, you’re probably not going deep enough.
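For readers who want the simulation version of that intro-stats context, here is a minimal sketch using made-up data from a hypothetical two-group experiment with random assignment. The randomization-based p-value is just the proportion of re-randomizations producing a difference at least as extreme as the one observed, under the supposition that the treatment changed nothing for any unit.

```python
# A minimal simulation sketch of a p-value in the cleanest case: a two-group
# experiment with random assignment. The data are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
group_a = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9])   # hypothetical responses
group_b = np.array([5.8, 6.3, 5.9, 7.1, 6.4, 5.2])
observed = group_b.mean() - group_a.mean()

pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
n_sims = 10_000
count = 0
for _ in range(n_sims):
    shuffled = rng.permutation(pooled)           # re-randomize the group labels
    diff = shuffled[n_a:].mean() - shuffled[:n_a].mean()
    if abs(diff) >= abs(observed):               # as or more extreme than observed
        count += 1

# Proportion of re-randomizations giving a difference at least as extreme as
# the one observed, *if* the treatment had no effect on any unit.
print(count / n_sims)
```

Notice how much of that interpretation leans on the physical act of random assignment; that is exactly the piece that goes missing in most observational studies.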

So…my interpretation of the statement “I understand what they mean” is that it typically represents the person’s belief in their ability to successfully use p-values according to common practices — the same practices that some statisticians have been arguing against for decades. The ability to use p-values in a socially accepted way is not synonymous with understanding what they mean. In fact, being comfortable using p-values in ways not recommended by the people who have thought deeply about them is probably evidence that you don’t really understand what a p-value means. I’m not sure I generally find p-values useful enough to enthusiastically promote spending the time to deeply understand them — though of course if you are someone who would like to keep using them (in your own research or to judge the research of others), then I think you should. I also think judging their degree of usefulness should be context dependent — there is plenty of room for disagreement and for assessing the reasonableness of embedded assumptions situation by situation.

If we don’t have a deep understanding of a concept, are we qualified to self-assess whether we understand what it means? I suspect most will answer ‘no.’ Unfortunately, if we don’t have a deep enough understanding, then we don’t have the knowledge to self-regulate our self-assessments. Somehow I keep ending up at the “we don’t know enough to know what we don’t know” conundrum. [And, can’t help but wonder where I am inevitably making such mistakes today.]

Research out of spite?

November 13, 2019 | General | No Comments

The post I started out to finish today got hijacked when I came across a barely legible quote in an old notebook of mine.

A couple of years ago, I sat in a committee meeting as part of the defense of a student’s dissertation proposal (it served as the PhD oral comprehensive exam). We sat around a makeshift conference table in a drab university classroom with no windows. After about an hour of discussing the work she had already done and where she planned to go, I asked her to back up and describe to us why she had originally chosen the dissertation topic — where did the original motivation come from, how had the topic changed, and why? Her research origin story involved her questioning a comment made by another person on the topic. More than just questioning, she found herself disagreeing with the view and wanted to investigate it using research. This sounded like an honest, and time-honored, way of arriving at a dissertation topic. She questioned the conclusions or assumptions of others — in an academic and philosophical way. I’ve always felt we don’t spend enough time on graduate committees discussing how to generate research ideas and how to move theories and ideas forward. Instead, our conversations tend to stay very methods-focused. In this case, she had used her creativity, critical thinking, and expertise to independently identify an interesting research question and then proceeded with the hard work of figuring out how to investigate it. I was feeling good about the question and where the discussion might head.

Then…

The full professor in the group took the floor, and with what I remember as a wry smile, said “Out of spite! You really wanted to win an argument. I admire that.” There was a general obligatory chuckle, but I was shocked. So shocked that I immediately wrote down the exact words she said — which is what I found in my notebook today. Is choosing a research topic out of spite to win an argument really something to be admired? [Note: I do not think a sense of “spite” was actually conveyed by the student.]

There is a lot to reflect on relative to the reasons the phrase was, and still is, disturbing to me. But, today I want to focus on considering the potential impacts of such attitudes on science. What does such a comment imply about admirable reasons for doing science and what unintended consequences might it have?

Bringing spite in as a motivator attaches a pride factor to the work. It creates an “I need a particular outcome from this research to win the argument and not damage my own pride” sort of attitude. Admittedly, there is always some pride factor involved in research, though we typically work hard to convince others it’s not there (we can substitute other p-words, like promotion factor or publication factor). Explicitly laying spite out there as an admirable quality for motivating research seems to move things well across a line. We [should] strive to honestly investigate questions by inserting as little of our own desire for a particular outcome as possible (again, it’s arguable how effective we are at that). It is quite possible that without the comment the student would have subconsciously influenced the results of the study because of her a priori views and a potential vested interest in a particular outcome. But, I feel confident she would not have been proud of this and would have tried to limit it. She certainly would not have celebrated or publicized such motivations as if they were admirable.

I guess one could take the more cynical view that the professor was simply being more honest than the rest of us and stating out loud what is actually happening in our scientific culture. I’m used to taking a pretty cynical view, but I have a hard time swallowing this one. In my view, by turning the research into an exercise of spite and winning an argument, the professor was telling the student to firmly take one side and accept that winning was tied to a particular research outcome. And this is in conflict with setting out to investigate for the sake of gaining knowledge (regardless of how it turns out). I can’t help but believe there are huge implications in sending such messages to students as they embark on their careers in research.

My regret for that day is that I stayed silent. I felt fear at speaking up — which was confusing to me, but very real at the time. Now, time has allowed me to reflect on the situation and better understand why I felt compelled to write down those words.

The word “spite” doesn’t have to be included in a phrase for us to subtly (and most likely unconsciously) push our graduate students toward believing there is only one outcome of their research that will be deemed a success. Setting this tone is just asking for our human faults to disrupt the process of trying to do good science. Motivations for favoring particular outcomes are high enough due to current incentive systems in scientific culture — we don’t need to add admiration of spite.

Giving too much power to power

November 11, 2019 | General | 6 Comments

In many scientific disciplines, power analysis has become a prerequisite for grant funding. And grant funding has become a prerequisite for the survival of a scientist.

I strongly believe that effort spent in the design phase of any study is the most important part of the research process. But… I have always felt very uneasy about power analysis. I suppose my uneasiness is less about power analysis itself, and more about the extreme and automatic reliance on it, coupled with a surprising lack of accountability for justifying its use and its results. Why do presumably skeptical scientists seem to give so much power over to power analysis?

If you’re reading this and don’t really understand what I mean by “power analysis,” I’m referring to statistical power and its common use in justifying the number of subjects (or other units) for a study or experiment. “Sample size calculations” don’t have to be based on statistical power, but often are. Statistical power is directly related to the concept of the Type II error rate — and I have more blog posts coming about Type I and Type II error rates (and what they might really mean, or not mean, to you). There is plenty of intro-level information out there on these concepts — just read it with skepticism. For this post, there’s one really important bit of background information needed, and I don’t think it’s controversial. I’ll state it in one long sentence with three parts. Power analysis relies on a set of assumptions; the results (the seemingly satisfying number(s) spit out from the analysis) are conditional on those assumptions; and the results are only as justified as the assumptions are justified in the context of the problem (e.g., “garbage in, garbage out”).
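To make the “garbage in, garbage out” part concrete, here is a minimal sketch (not tied to any particular study) using the standard normal-approximation formula for a two-sample comparison of means. The grids of assumed differences and SDs are hypothetical; the point is how strongly the “required” sample size depends on them.

```python
# A minimal sketch of how the output of a power calculation is conditional on
# its inputs. Normal-approximation formula for a two-sided, two-sample
# comparison of means; the assumed deltas and sigmas below are hypothetical.
import math
from scipy import stats

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n per group to detect an assumed true difference `delta`
    with the stated power, given an assumed response SD `sigma`."""
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    return math.ceil(2 * ((z_a + z_b) * sigma / delta) ** 2)

for delta in (1.0, 1.5, 2.0):          # assumed "true" difference in means
    for sigma in (2.0, 3.0, 4.0):      # assumed SD of the response
        print(f"delta={delta}, sigma={sigma} -> n per group = {n_per_group(delta, sigma)}")
```

Halving the assumed difference roughly quadruples the answer, which is exactly why the assumptions, and not the arithmetic, deserve the scrutiny.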

Now, back to the question. Why is there such a tendency to over-rely on power analysis, particularly without adequate justification for the underlying assumptions? I am fully aware I can’t answer this huge question adequately in one blog post, but I would like to throw a few thoughts out there. I have to start somewhere.

The reasons for the often blind trust in power are less about Statistics and more about human nature and current scientific culture and paradigms. I keep coming back to two things that help me understand this phenomenon (and others like it). First, relying on it simplifies life for a lot of people and seems logical if you have a superficial understanding of power analysis. It serves the gate-keepers, allowing them to do their job quickly by simply checking tickets without the knowledge to assess if they might be forged. Second, though related, it provides comfort because it spits out numbers and makes an incredibly challenging study design decision seem very easy and as if it has a correct answer. It provides a false sense that we have taken a challenging problem, full of uncertainty, and dramatically reduced the uncertainty associated with it. The inherent uncertainty does not disappear, but is effectively swept under a rug. As conveyed by Herbert Weisberg, we can proceed with the calculations after willfully ignoring many conceptual sticking points. If one is willing to ignore the very tenuous underlying assumptions, then power analysis appears a very useful construct.

But, I think there’s another layer to the willful ignorance part of the story. The term includes the word “willful,” implying the person appealing to it has enough knowledge to be aware of what they are ignoring. That is, they understand the possible problems and unresolved issues, but they willfully make a decision to ignore them — presumably out of weighing risks associated with doing so. Appealing to willful ignorance brings some sense of comfort at being able to move forward with the problem, but it should also carry a healthy dose of discomfort from an understanding of what is being ignored. If one is not aware of the underlying issues and problems, then the decision to go forward is very comforting because there is nothing to invoke the balancing discomfort. Unwillful ignorance brings far more comfort than willful ignorance — naively proceeding without the ugly knowledge of what is being swept away. It gives a greater sense of trust in the method and its results. This envisions yet another continuum based on depth of knowledge about a topic. In order to willfully ignore something, we have to have awareness of it. To gain awareness, we have to be open to listening to the views of people who have spent time thinking hard about a problem — which is often (hopefully!) the case with statisticians and power analysis.

For nearly 20 years, I have been having conversations with researchers about my views on healthier approaches when a power analysis is desired. I have tried many different strategies, tones, etc. But, it nearly always felt like I was arguing against a tidal wave of pressure pushing researchers to do it “the usual way.” My advice as a PhD statistician usually could not compete with the culture and system the researchers were trying to swim in. I walked a fine and uncomfortable line in trying to help as part of my paid job as a statistician — by trying to help justify assumptions and push out of default settings to critically think through the logic for each problem. I am no longer in such a job and hope to never be in that position again — the realization that I have escaped it still elicits an overwhelming sense of relief (it might even qualify as joy). That said, there is still plenty I want to discuss about power analysis. It is a tangible context for researchers and has the potential to be a door into the deeper conversations we need to be having about the use of Statistics in science in general.

Here’s an email exchange I had with a very successful researcher. The quotes are taken directly from the emails — I only left out a minor detail with the ellipsis to help with anonymity.

“Megan, I don’t know what to tell you. Let’s stop here and if the reviewers want a power analysis I’ll find someone else to help me.”

Me: “Just wanted to reach out again and say I am happy to work on this with you if we can collaborate to think through and justify (as best as possible) the choice of numbers going into the power analysis.”

“Thanks for reaching out, Megan, and thanks also for your kind offer. I know you disagree, but I’m going to stick with my bad science power analysis for this proposal — it’s what the NIH program officer I’ve been talking with told me to do. I will appreciate your help with a real power analysis … once we have some pilot data to inform good decision making. But thank you.”

This scenario is not at all uncommon and spans many disciplines. I share this one not to pick on the person, but because I have it in an easily quotable email. I always talked openly about it with my students, but have found it difficult to motivate and engage in productive discussion with researchers who had already had plenty of success navigating through the gate. It is a sensitive topic that is uncomfortable for many to talk about frankly. It can feel embarrassing for everyone involved. It is not productive to blame any one individual who is trying to survive and thrive in their profession with the current gate keepers. Honest and open conversations are needed without fear of it impacting a career.

My hope is that one day my work to educate and help people think through the foundations and underlying logic of things like power analysis will be valued more than my ability (that I refuse to use) to thoughtlessly punch unjustified numbers into an unjustified formula to appease a gate keeper who probably isn’t aware the tickets they are punching are forged. And, refusal of a statistician to participate in forging a power analysis ticket should be professionally respected.

Low tolerance for bullshit

November 6, 2019 | General | 1 Comment

I enjoy the honest narrative shared by Simon Raper, a frequent contributor to Significance magazine, published by the Royal Statistical Society. He routinely delivers a refreshing dose of reality and honesty — inviting thoughts and discussion beyond the seemingly cut-and-dried world of data analysis.

His most recent contribution, “Bullshit jobs in statistics”, did not disappoint and hit a sensitive nerve with me, though not in a bad way. It’s a nerve that should be poked and one I’m still figuring out how to hit in others without causing understandable feelings of depression or serious defensiveness. Raper managed to make some brutal points with a fair tone — by tying them to ideas described in anthropologist David Graeber’s recent work (I haven’t yet read his 2018 book Bullshit Jobs: A Theory, published by Penguin in London).

A little over a year ago, I was deep into trying to understand and accept my career crisis (which just happened to correspond with mid-life). The crisis had been bubbling under the surface for a decade, maybe longer. I came to realize that the very things that sent me toward Statistics were essentially those pushing me away. In beautiful retrospect, I never had a chance of it being a fulfilling career for me, though I fought hard to get there. I was caught in the vicious cycle of telling myself I had dedicated too much of my life to the path I was on to change directions — 6 years of graduate school in Statistics and years as a professor of Statistics! So, I kept putting in more time, which then made it even harder to consider change.

That day, I sat in the waiting room waiting to see the therapist I had finally hired to help me make sense of the deep, sickening feeling I now almost always carried around with me. I glanced through my notebook, still proud of what I had written down the night before. I desperately needed simple words to describe reasons for the work-related feelings I was in constant struggle with, and I felt like maybe I had found them. I had watched a webinar sponsored by the National Center for Faculty Development and Diversity (NCFDD) by Cristi Cook on her Pillars of Genius concept. I spent an hour trying to articulate four pillars of genius for myself. “Low tolerance for bullshit” was the second thing I wrote down, right after “Not patient.” A little note scribbled on the opposite side of the page says “I chose this career to help avoid BS and lack of integrity in research, but instead that just led me here.” I knew I loved science. I knew I loved research. But, I loathed most of my work and interactions as statistician.

Just writing down the words “Low tolerance for bullshit” made a huge difference for me. It explained so much about my life and decisions I had made. I then spent 20 minutes searching for synonyms for the word “bullshit” so I could display it as a reminder for myself — but in a way my kids wouldn’t giggle over and feel that they shouldn’t say. “Crap” was the obvious option, but didn’t sit right either. I had already written an article with “s-word” in the title and hadn’t planned on it becoming a general life theme. It turns out not to be an easy exercise, and one I’m sure many before me have tried — the BS acronym seems to be the best option. But, the process of looking was also therapeutic — definitions popped up including words and phrases like “exaggerated talk”, “foolish talk”, “unjustified talk”, “false importance”, “illogical”, “doing things for the wrong reasons”, “deceitful talk”, “pretentious talk”, “nonsense” and my favorite, “eloquent and insincere rhetoric.” I was certainly in the right place. [Note: on a last read through before publishing this, I found the word “pooped” instead of “popped” in the previous sentence…hmm…definitely explainable.]

Unraveling the ways the theme ran through the experiences in my career as a statistician, and even my decision to become a statistician in the first place, was complicated. I tried to accept “Low tolerance for bullshit” as one of my core “pillars of genius” and make decisions that were consistent with it. It wasn’t something I wanted to get rid of — but it was something that would be hard to live with given my daily work experiences. It still amazes me how naming something we already know and feel on a daily basis can give us new perspectives on life and change our actions and ways of thinking. But there it was.

Finally, back to Simon Raper’s article – you can now imagine my response to seeing the title of it! I enjoyed and agreed with the whole thing, but my favorite part was actually the last section, “All too human”, where he relates the ideas back to our human faults. It’s a common theme in his writings and one that I now see running clearly through my posts. I end here with a couple of quotes from that section of his article:

Not only are we ourselves the victims of bullshit jobs, but when we fail to push back against the incorrect use of statistics we add another shovel-load of dung to the heap.

Simon Raper, “Bullshit Jobs in Statistics”, Significance, October 2019, 16(5).

We also need to give the messily human, anthropological side of statistics as much attention in our journals and conferences as the tidy, safe, non-human mathematical side. After all, the elegance and sophistication of a new statistical technique are worth nothing if its main use is for conning business executives.

Simon Raper, “Bullshit Jobs in Statistics”, Significance, October 2019, 16(5).

Here’s to fighting against bullshit jobs in Statistics, and to any efforts to scrub away the bullshit often covering the use of Statistics in practice.

Oh, the incentives…

November 5, 2019 | General | 1 Comment

For scientists to continue to be scientists, they must survive in the environment in which scientists live. Understandably, most scientists prefer to try to thrive in that environment, not just survive. The environment is a social system, constructed by and for humans just like all social structures. Yes, it is made up of scientists who strive to seek “truths” of the universe, but it is no less human and with no fewer faults than other social systems. In fact, maybe it is destined to have more faults for its difficulty in acknowledging its connections to humanity — because just acknowledging that connection seems to imply weaknesses and biases.

Scientists pride themselves on their deep commitment to “objectively” seeking knowledge in ways that are as unbiased and honest as possible. This is the case, at least, until the focus is on the scientist’s career and not their science. As much as we would like to believe there is a healthy connection between our measurements of a scientist’s success and the quality of their science, it’s no secret that the incentives built into the social system do not always align with quality. I sense cringes from scientists, but I only mean to point out that scientists are human, their social systems are human, and the mistakes and decisions they make are unmistakably human. I consider myself a scientist and am proud to be one. I love science, I love research, and I believe in their value. Peering into the human sides of “doing science” is a positive thing, not an unfair criticism of science or scientists! It is a fascinating part of the process that we shouldn’t ignore if we are truly committed to doing the best science possible.

Well, that was a much longer introduction than I had planned. My mind gets pulled in so many directions when I start down this road — the reason it has taken me years to know where to start writing! I’m now going to drag myself back to where I was heading when I sat down to write this — incentives. Incentives for scientists — and a glimpse from the view of a collaborating statistician.

The social system in which scientists live is based on one main currency — publications. The system around how and what to publish is complex, often discipline specific, and scientists must work hard to understand it and navigate it effectively to survive. Much has been written in the last couple of decades identifying issues with the system — and I think there is a general consensus among scientists that quantity of publications does not imply quality of a scientist’s work. I also think there is general consensus that this fact is usually ignored in the face of promotion (both formal and informal). Scientists measured as successful by quantity-based metrics have figured out how to thrive within the current incentive system — possibly while doing their best possible science, but possibly just by figuring out how to effectively accumulate the prizes.

Let’s construct a quick hypothetical scenario. Suppose you are a statistician (formally trained and with years of experience) collaborating with other researchers on a project. You are not only a statistician, you are a scientist, evaluated in the same way as your collaborators, but your work must span discipline boundaries and the shifts in social systems that come with that. You want to be proud of your work, comfortable that it will be judged as high quality, or at least reasonable, by those who read it (maybe even for an external review of your tenure case!). Hopefully I’m not stretching anyone’s imagination too far yet.

Now, suppose that you have worked to develop and fully justify a reasonable approach to the design, analysis, or interpretation. You present the approach to your collaborators and it is applauded… but then quickly deemed inadequate relative to the unwritten rules of the publication system in their discipline. Your collaborators agree that the approach you have suggested is more reasonable and better justified than the approach they want to go with — but theirs is believed to have a better chance of earning a publication carrot. On the level that should matter most, you are all in agreement! But, they think yours is too boring or uncommon in their discipline and are unwilling to potentially dampen their careers by risking no publication. What to do?

You go back and forth, you promise to fully justify your reasons for the approach, respond to reviewer comments, etc. But, they believe they are correct in their assessment (and they very well might be!) — and in the end, the social system in which they must operate wins out. You are then put in an awkward (and arguably ethical) dilemma: do you remove yourself from the publication and get no credit toward your career, or do you keep your name on the paper and get a carrot for your career for something you might not be proud of? The ultimate decision would be incredibly context dependent and person dependent, and I do not mean to judge it here. There is a lot of extra baggage (emotional, professional, and ethical) entangled with this scenario and everything that it leads to — and I am trying to stay out of that today.

Warning. Here is where my experiences lead me to a cynical view. In my almost 20 years as a collaborating statistician, I think I removed my name from more papers than I left it on (in retrospect, I wish I had kept a careful count). My very first collaboration ended in this way. I was a first-year graduate student in Statistics and the collaborator was a graduate student in the veterinary school. I assumed at the time it was an unfortunate first experience that would be rare in my bright future as a statistician. Instead, nearly 20 years later, my last few collaborations also ended in this way. It is not rare and does not seem to be going away. Statisticians vent to each other about it, but I haven’t seen it talked about as openly as I think it should be. It is a tangible example of our incentive system operating against the flow of doing the best science we can do.

On that note, I was relieved to see the last two paragraphs of this Nov 5th 2019 post of Andrew Gelman’s Statistical Modeling, Causal Inference, and Social Science blog. While Andrew’s tone sounds a little surprised, I felt no surprise at all when reading it. Maybe Andrew doesn’t have to deal with it on a daily basis because his name on a publication is able to outweigh any perceptions that the approach isn’t hip enough. This adds an interesting perspective I haven’t thought a lot about. For me (a less than famous statistician), it felt unavoidable and became a huge force that pushed me hard — away from academia and its social incentive systems.

Andrew Gelman’s blog post from November 5, 2019

A final thought. The irony of my current position — trying to make a go at writing for a living — isn’t lost on me. I’m now more of a slave to publication than ever — but hopefully I’ll be able to publish honest material that I’m proud of.