Sample size without power — Yes, it’s possible.


Given a few recent questions about my thoughts on statistical power, I’m going to try to get more practical in this post. There is still so much more to be said on this topic — and I am just trying to take it in blog-sized pieces.

For those of you who were lucky enough to first learn from The Statistical Sleuth by Fred Ramsey and Dan Schafer (3rd edition, 2013, Brooks/Cole Cengage Learning), your first introduction to using concepts of statistical inference to help choose sample size for a study did not involve power. I took my first serious course in applied statistics from Fred (as a prerequisite for getting into the master’s program in Statistics), and then was lucky enough to later be his teaching assistant. I did take a Statistics course for my first master’s degree in Kinesiology, but my main memory from that class is using a calculator to produce numbers — it was all about calculations, and light on the concepts motivating the calculations. I don’t even remember what book we used.

“Choosing a Sample Size” is covered in Chapter 24 – Elements of Research Design. Let’s start by noting their choice of words. The section is not titled “Calculating Sample Size” or “Determining Sample Size” — as if there is one correct answer we should arrive at by punching numbers into our calculators. It is a choice. This in itself is a shift from the typical attitude I see in science today and that shift can make a huge difference — it seems completely reasonable to expect someone to fully justify their choice, but less so to have someone justify their calculation. Before I dig in — I need to give the disclaimer that any sample size calculation will suffer from many of the same issues plaguing those based on power, but there are improvements that can be made that encourage a more thoughtful and holistic approach that I hope results in a more realistic view of the trustworthiness, and thus usefulness, of the final numbers. As I said about power, I think there is huge value in going through the challenge of the exercise, and the exercise should lead to us putting less emphasis on the numbers coming out of it — a way of seeing them for what they are.

The approach presented by Ramsey and Schafer is not based on power. It is not based on the assumption that the outcome of a study will be dichotomized into “reject” or “fail to reject”, or worse “significant” or “not significant.” Instead, they assume the desired statistical result will be an interval conveying uncertainty in the estimate or parameter (however you look at it). Then, they appeal to the more general idea of going after a desired precision relative to practical significance, rather than avoidance of largely hypothetical Type I and Type II errors (another post to be released soon).

Here is a quote from the intro to the section (pg 705): “The role of an experiment is to draw a clear distinction between practically meaningful alternatives.” That is, the role of a study is to do work distinguishing between values of the parameter that do not have practical relevance (usually coinciding with small values) and values that clearly do have practical relevance — thus allowing researchers to support next steps. Next steps might be deciding to not pursue the idea further (hopefully after disseminating the results!), deciding to repeat the study design, seeking funding for more research, arguing for adopting an intervention, etc. This starting place can bring a different mindset to using statistical inference in practice.

So, from a practical side, how do we start the process when our goal is distinguishing among values of the parameter that have different practical implications? The key is the necessary first step. And ironically, this is the step that has been left out of teaching, left out of textbooks, and left out of practice. The researcher must be willing to do the hard a priori work of attaching practical meaning (or implications) to values of the statistical parameter of interest. The attitude I tend to see rests on a belief that the point of statistical inference is to do the hard work for us — a belief that statistics will tell us what values are practically meaningful or important in some attractively objective way. This belief is where things start to go horribly wrong in practice.

You may still be wondering what I am actually asking that a researcher be able to do. There are multiple approaches to this, but I will show you one here that I have found tangible enough to consistently motivate productive work. I like it because it comes with a picture, and one that takes little artistic skill to produce. It is a sketching and coloring exercise. Here are the general steps:

  1. Draw a number line on a piece of paper. It should convey the range of realistic values of your parameter of interest. Even this step can be hard in situations where researchers have only a vague idea of how the parameter in a statistical model relates to their research goals (more on this below).
  2. Choose a color to represent values of the parameter that would be associated with practically meaningful results in the sense that it would clearly motivate more work in the area or advocacy for the “treatment”, etc. Color the corresponding interval(s) on the number line.
  3. Choose a different color to represent values of the parameter clearly associated with results that are not practically meaningful (ones that would back up not pursuing the idea further). Color the number line accordingly.
  4. At this point, you should have intervals of the two colors, but you likely have regions between them that are not colored. It also likely felt uncomfortable to make the decision about where a color should end, and it should have! That uncomfortable feeling is okay and reflects depth in thinking about the problem. It is completely unrealistic to think that values identified with Step 2 will change to values identified with Step 3 at one point on the number line! The gray area in between is incredibly annoying when it comes to having to carry out calculations, but incredibly important when taking a more holistic view to interpreting results from statistical analysis. I typically try to represent the uncomfortable gray area in the sketch as a gradual transition and blending of the colors in that region.

Here is a picture I just drew (I only included zero on the number line for reference, and added some wording assuming a difference in means was of interest). Note: I am still working on improving the wording around this “practically meaningful” discussion, but I think it’s an okay starting point.

Sketch of regions connecting values of parameters to the idea of practically meaningful/relevant
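If you would rather produce a computer-drawn version of the sketch, here is a minimal Python/matplotlib sketch of the same idea, assuming a difference in means is the parameter of interest. The cutoffs of 0.5 and 1.5 are hypothetical placeholders; in practice they come from the hard a priori work described in the steps above, and the gradient stands in for the uncomfortable gray area.

```python
# Minimal sketch of the colored number line, with hypothetical cutoffs.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap

lo, hi = 0.5, 1.5   # hypothetical: |difference| < 0.5 clearly not meaningful,
                    # |difference| > 1.5 clearly practically meaningful
xs = np.linspace(-3, 3, 601)

def meaningfulness(x):
    """0 = clearly not practically meaningful, 1 = clearly meaningful,
    with a gradual blend (the gray area) in between."""
    return np.clip((np.abs(x) - lo) / (hi - lo), 0, 1)

cmap = LinearSegmentedColormap.from_list("sketch", ["tab:blue", "tab:orange"])

fig, ax = plt.subplots(figsize=(8, 1.6))
ax.imshow(meaningfulness(xs)[np.newaxis, :], cmap=cmap, aspect="auto",
          extent=[xs[0], xs[-1], 0, 1])
ax.set_yticks([])
ax.axvline(0, color="black", lw=1)   # zero included only for reference
ax.set_xlabel("difference in means")
plt.show()
```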

The sketching is simple, but the exercise is typically quite difficult and quite uncomfortable for researchers, and that is okay. Doing this work up front builds a framework to support subsequent statistical inference, rather than pretending as if Statistics can do the uncomfortable work for us. Statisticians can guide researchers through the process, but this is territory for subject-matter expertise – for those who are intimate with the tools of measurement, the literature, and the implications different values of the parameter should have on life. It is a matter of judgement — but the researcher needs to put something out there and then be willing to justify it to their scientific community. This does not imply that all researchers studying a topic should come to the same picture — it depends on the model, the parameter, the implications of the research, etc. It would, however, be great if it motivated researchers studying similar problems and using similar instruments to work together to come to some agreement. The resulting picture then forms an important component of the research — both before and after data are collected. Pre-registration of studies should involve submitting this sketch! 🙂

As I suspected would happen, I have veered away from where I had planned to go in this post, to covering what was supposed to be a separate post. It might not be so bite-sized after all, but I am going to continue and try to get back to where I started!

On that note, let’s continue on a side path for a minute more. The idea I just presented (as well as more traditional power analysis) is based on the premise that the researcher can satisfactorily interpret the parameter in the context of the research and attach practical meanings to different values. A difference in means is about as simple as it gets, but you may be surprised how quickly even this situation gets complicated. For example, suppose a composite score from a survey with numbers attached to different answers is being used as the response variable — what does the scale really represent from a practical standpoint, and are means/averages over people meaningful enough to focus the analysis on? Skipping this thought exercise and going straight to relying on statistical summaries like p-values, or grabbing estimates from other studies as if they have some magical connection to what is practically meaningful, does not make sense and is not justified theoretically. Just because someone else using the same instrument carried out a study and got an estimated difference of 1.5 (or translate to an “effect size” as defined in your discipline) does not mean it should be plugged into power calculations as if it represents a threshold for practically meaningful values. It may feel more comfortable because it’s easy and feels “objective,” but those are not reasons to support the practice.

The picture should be drawn from a deep understanding of the problem, the measurement tool, the model, and the parameter. Running into problems with this exercise can be incredibly frustrating, but it is an amazing opportunity to understand and possibly adjust your design before you waste time and money collecting data. The existence of a power analysis and its associated result is sometimes used to judge the potential worth of a study. My opinion is that if a researcher isn’t willing to, or can’t, go through the process of drawing the above picture, then that is an indication they aren’t yet ready to spend money and time collecting data — because there is no deep plan for what results are going to be gauged against or how.

Okay — finally back to connecting all of this to investigating choices for sample sizes. The sketch provides the context and backdrop for the investigation. I hesitate to even go forward toward calculations, but I think the calculation aspect can help solidify the underlying concepts — and the difficulties in doing it point toward the more holistic (for lack of a better word) approach to interpreting results from statistical analysis. That is, the sticky points in the calculations point to where we have to make hard-to-justify assumptions and decisions to arrive at a number in the end. And the number is only as good as the justifications going into it.

Over the years, I have generalized the ideas presented by Ramsey and Schafer into something I feel comfortable with — mainly distancing my version from the null hypothesized parameter value. But, given how comfortable most people are with that idea, it probably is still a useful starting point. So, here is their Display 23.1 (Four possible outcomes to a confidence interval procedure) to give you a flavor and context for starting.

From Ramsey & Schafer (2013). The Statistical Sleuth: A First Course in Methods of Data Analysis, 3rd edition, Brooks/Cole Cengage Learning.

Ramsey and Schafer’s approach to using calculations to help choose a sample size (I’m sure others deserve credit for this too) is based on using sample size to control the width of a confidence interval. Holding all other inputs constant, increasing the sample size decreases the width of the confidence interval. Attempting to control the width of a confidence interval can then serve the goal of trying to design a study that is capable of helping to distinguish among values with different practical consequences. We are going after controlling precision in estimation, rather than preventing Type I and Type II errors. The desired precision is then directly tied to the information in our sketch — we would like to have a confidence interval narrow enough that it’s possible to land completely in one color or the other. To go forward with calculations, we have to be willing to choose a sharp cutoff between our colors even though we know a sharp cutoff is not realistic (willful ignorance is needed). The calculations involve solving for the sample size that gives a desired interval width — conditional on the model and all other inputs (more willful ignorance needed). If you can obtain a confidence interval using a formula or using a computer and statistical software, then you can carry out the rather boring calculations using algebra or computer simulation. I will not spend more time on details of the calculations, because I hope by this point it’s clear that the calculations are not the important part of the process.
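To make that concrete anyway, here is a small sketch of the algebra version for comparing two means, under the usual willful-ignorance assumptions: a guessed common standard deviation, equal group sizes, and the normal approximation. The numbers plugged in at the bottom are hypothetical; the desired half-width would come from the sketch (roughly, narrow enough that the interval could land entirely in one color).

```python
import math
from statistics import NormalDist

def n_per_group_for_halfwidth(sigma, halfwidth, conf=0.95):
    """Approximate per-group sample size so that a confidence interval for a
    difference in two means has roughly the desired half-width.
    Uses a guessed common SD `sigma` and the normal approximation, where the
    half-width is about z * sigma * sqrt(2 / n)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # about 1.96 for 95%
    return math.ceil(2 * (z * sigma / halfwidth) ** 2)

# Hypothetical inputs: guessed SD of 4 and a desired half-width of 1.
print(n_per_group_for_halfwidth(sigma=4, halfwidth=1))   # about 123 per group
```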

Beyond justifying choice of sample size, the sketch exercise can be used throughout the research process. After data collection and analysis, uncertainty intervals can be placed on top of the sketch to provide a framework for critically evaluating the results in the context of the research and its implications — at a much deeper level than statistical summaries alone can possibly provide. In the process of designing the study, you can go through many hypothetical outcomes for where your uncertainty interval may fall and what you would do/say with that information.
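A tiny helper like the one below, assuming the same hypothetical cutoffs as the sketch above, can make that “hypothetical outcomes” exercise concrete: feed it candidate interval endpoints (made up here) and ask which region of the picture each would land in.

```python
def describe_interval(lower, upper, lo=0.5, hi=1.5):
    """Classify an uncertainty interval for a difference in means against the
    hypothetical sketch cutoffs lo and hi (gray area between them)."""
    if -lo <= lower and upper <= lo:
        return "entirely in the 'not practically meaningful' region"
    if lower >= hi or upper <= -hi:
        return "entirely in the 'practically meaningful' region"
    return "touches the gray area (or spans regions): judgement needed"

# Hypothetical outcomes to think through at the design stage:
for interval in [(-0.2, 0.4), (1.6, 2.3), (0.3, 1.2)]:
    print(interval, "->", describe_interval(*interval))
```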

I believe this is a tangible way we can improve the use of statistical inference in practice. It has clear connections to calls for the use of interval nulls, but goes well beyond that suggestion in terms of connections to the research context. It doesn’t rely on weighing results against arbitrary p-value or effect size cutoffs. It does not have to result in a yes or no answer. It is simply an honest comparison of the values in an interval conveying particular sources of uncertainty to a priori information about what those values are believed to mean in a larger context involving implications of the study.

This approach gives power back to the researcher, rather than blindly turning it over to statistical analysis.

I very much welcome comments and questions. And, I would love to have people submit their sketches attached to real studies!

About Author


MD Higgs

Megan Dailey Higgs is a statistician who loves to think and write about the use of statistical inference, reasoning, and methods in scientific research - among other things. She believes we should spend more time critically thinking about the human practice of "doing science" -- and specifically the past, present, and future roles of Statistics. She has a PhD in Statistics and has worked as a tenured professor, an environmental statistician, director of an academic statistical consulting program, and now works independently on a variety of different types of projects since founding Critical Inference LLC.

4 Comments
  1. Andy

    Hi Megan,

    Your last sketch reminds me of Kruschke’s region of practical equivalence (ROPE). I’d be curious to hear your thoughts about this procedure. There is a similar image in the paper titled “Rejecting or accepting parameter values in Bayesian estimation.”

    • MD Higgs

      Andy,

      Thanks so much for the comment! I admit that I am aware of ROPE, but have not looked into the details of it (despite the fact I probably should have!). There are definitely others presenting similar pictures and ideas, such as Betensky’s paper in The American Statistician special issue. I will use your comment to put “Look into ROPE details” on my to-do list. My sense is that a lot of statisticians have independently come to this general idea or framework over the years, and perhaps where my suggestion/view/presentation differs is in the idea that it should be used as more of a holistic backdrop for the research – a context for interpretation — rather than motivating an alternative test-like procedure. I may be being unfair to ROPE here, and will respond accordingly if I think I am later.

      Megan

      • MD Higgs

        Sorry it took me awhile to get back to this, but here are a few more specific thoughts about the ROPE approach as described by Kruschke in Chapter 12 of Doing Bayesian Data Analysis: A Tutorial with R and BUGS (2011, Academic Press), relative to what I was trying to capture in my post and its pictures.

        There are definitely similarities, but there are important differences too, and as of now I stand by what I said in my original reply — that I am trying to propose a more general framework or backdrop for quantitative research, rather than an alternative procedure that stays within the boundaries of current practice (e.g., rejecting a null). What I see as the important part of my picture is that it portrays the gradual change from values that are clearly practically meaningful to values that are clearly not practically meaningful, without ignoring the uncomfortable gray area in between. This combines ideas going into power analysis and ideas going into equivalence testing (or non-inferiority testing), but it lets go of the hard edges needed for both of them. Maybe I should have left off the last picture altogether — because I see how I may have ventured too far into “procedure” and lost some of the case I was trying to make.

        ROPE seems to stay within the equivalence testing arena – which is only a small part of my picture. Kruschke defines it in the first paragraph of 12.1.3 as “a small range of values that are considered to be practically equivalent to the null values for purposes of the particular application.” I am more interested in defining the gray-area regions where things move from a region of “practical equivalence to a null,” as I think Kruschke is describing, to those considered practically meaningful. Just defining a ROPE says nothing about how far the end of that interval is from an interval that might be described as “practically equivalent to alternative values for the purpose of a particular application.” It stays too close to hypothesis testing (null or equivalence) and all its issues with oversimplification and encouraging the cutting of corners.

        But, if he is going to stay in the binary decision making camp, I do appreciate that Kruschke is using the ROPE construct to make decision rules explicit (hopefully encouraging their justification). If we’re going to use them, they should be explicit. He describes it as “Once a ROPE is set, we make decisions according to the following rule: A parameter value is declared to be not credible, or rejected, if its entire ROPE lies outside the 95% HDI of the posterior distribution of that parameter.” This is explicit and clear. But, I still reject the notion that we must make binary decisions like this — do we need “credible or not credible” or “rejected or not rejected”? Do we need to do so much “declaring?” Why 95%? What if I choose 96%? It is a slippery slope back to all the problems we’re already struggling against when naively checking if an interval contains a value — though admittedly it’s an improvement.

        I appreciate Kruschke’s description of “How is the size of the ROPE determined?”, though I would update the question to think about “How should I choose the ends of the ROPE interval?” This makes it explicit that endpoints must be a choice and that choice should be well justified. There is nothing magic to determine and not everyone will, or should, agree with the choice. I do not agree with the recommendation that the “ROPE might be established with somewhat arbitrary criteria.”

        To be fair, Kruschke communicates his concerns in the last paragraph of the section and for the most part they align with the points I’m trying to make. I’ll use his own words here instead of paraphrasing: “It is important to be clear that any discrete declaration about rejecting or accepting a null value does not exhaustively capture our beliefs about the parameter values. Our beliefs about the parameter value are described by the full posterior distribution. When making a binary declaration, we have merely compressed all that rich detail into a single bit of information. The broader goal of Bayesian data analysis is conveying an informative summary of the posterior and where the value of interest falls within the posterior. Reporting the limits of an HDI region is more informative than reporting the declaration of a reject/accept decision. By reporting the HDI and other summary information about the posterior, different readers can apply different ROPEs to decide for themselves whether a parameter is practically equivalent to a null value.”

        In summary, the ROPE strategy may be useful in starting to draw the picture, but it only provides a portion of it. More effort is needed to complete it and then to use it in a more holistic way in the end.

  2. Quantitative backdrop to facilitate context dependent quantitative research – Critical Inference

    […] started in the context of working through sample size investigations with alternatives to power (https://critical-inference.com/sample-size-without-power-yes-its-possible/), but have evolved into what I see as an important undertaking in any project depending on a […]
