Sorry, but randomization does not ensure balance.

I wasn’t planning to write this today, but it’s something that has bothered me for years and after reading “randomization ensures balance” in two separate places this morning, I felt like discussing it.

First, let’s just think about the word “ensure” — a word that means “to guarantee” or “make certain.” We should always be wary of this word in the context of using probability. Randomization of experimental units (e.g., people, plants, groups of people, groups of plants, etc.) to treatments (or vice-versa) involves appealing to the concept of probability through using a random mechanism to assign the units to groups. At a very fundamental level, how can it make sense to use a random mechanism (one time) to ensure something for a single study?

“Balance” is a word that does have different meanings within statistical contexts. When used in the statement “randomization ensures balance,” I interpret it to mean that differences among units (e.g., characteristics of individuals) that might be related to the outcome — but are not controlled for in the design — are equally allocated across the treatment groups. That is, the groups are “balanced” in the sense that they each contain the same number (or close enough to it) of individuals with said characteristic. If groups are balanced relative to the characteristic, then we feel comfortable saying that the treatment was not confounded with that characteristic, which can help justify any causal statements about the effect of the treatment in the end. This conversation could open many other cans of worms — I am only going so far as to try to get us on the same page.

“Randomization ensures balance” is a myth (relative to a single study) that gets spread over and over and over again, and it reflects a huge lack of understanding about the role of randomization in statistical (and scientific!) inferences. There are very good reasons to employ randomization in a design, but proselytizing for “ensuring balance” in a single study is very misleading and should be stopped. The statement has strong roots in statistical theory in the context of repeating the randomization process over and over again to create alternative versions of the experiment: a sort of balance on average “in the long run.” But it does not apply to a single study, even though it is often invoked that way, or at least interpreted that way by those reading the words. And I think “ensure” is misleading even in the theoretical version.

Sure, randomization helps guard against researcher bias in terms of who ends up in which group, but guarding against researcher bias does not imply balance will be achieved. In fact, I argue that “ensuring balance” (at least relative to variables the researcher has knowledge of) is only possible if NOT using randomization! Balance is impossible to ensure when using randomization (assuming it’s being used honestly).

So, what does randomization buy us? Why go to all the trouble if we can’t ensure balance? It buys us a simple and easily justified probability model that we can use as the foundation for statistical inferences! Isn’t that exciting and amazing?!! Yeah… I get it…. that doesn’t sound even close to as attractive or useful as “ensuring balance.”

Randomly assigning units to groups (randomization) buys us justified use of the associated randomization distribution. I’m tempted to delve into a tutorial about what a randomization distribution is and why we should care, but I’m going to reel myself in and maintain focus on the “ensuring balance” idea. I’ll start a draft post titled something like “It’s been a long time since STAT 101. What’s a randomization distribution and why should I care for my research?” There are also plenty of resources out there using simulation to help convey what a randomization distribution is (as well as a p-value based on that randomization distribution). Yep — hoping the connection to coveted p-values might increase interest in the more important concept of the randomization distribution. If you don’t have a deep understanding of what a randomization distribution or a sampling distribution represents (beyond regurgitation of a textbook definition), then I question your license to use (or teach!) p-values.

Let’s look more closely at this “ensuring balance” idea by appealing to a small and simple example — though hopefully not too simple.

Suppose we have recruited 8 participants for an experiment where we would like to compare the effectiveness of two treatments — A and B (creatively named!). There is some speculation that the difference between the treatments may depend on age. [An aside: At this point, bringing the very important (but often ignored) experimental design concept of blocking into the discussion would be useful, but we’ll naively go ahead as if the researcher firmly believes that randomization will “ensure balance between the groups” and that age doesn’t need to be explicitly brought into the design.]

Let’s suppose 4 of the 8 participants are in their mid-20s and 4 are in their mid-50s, so there is a clear split between young adults and older adults (I said old-ER, not old!). Now, the researcher uses their computer (and its pseudorandom number generator) to randomly assign the 8 participants to treatments (with four in each group). Balance (relative to age) means that each group contains two 20-year-olds and two 50-year-olds. Does use of randomization ensure this outcome?

How many ways can the eight subjects be randomly assigned to two groups of four? “8 choose 4” or 8!/(4!4!) is 70. So, there are 70 possible allocations of individuals to the two groups. If honestly employing unrestricted randomization, then all of these are possible (and have equal probabilities of occurring). How many of the 70 possible randomizations are balanced in terms of age?

Here’s how it breaks out (you can do the math or use statistical software like R to play around yourself — see Appendix if interested). There are three scenarios capturing the relevant allocations of age classes to the treatment groups:

  1. 2 of the 70 randomizations result in all 4 of the 20-year-olds in one group and all 4 of the 50-year-olds in the other. I think we can all agree this doesn’t count as balanced. This is as bad as it can get. Age is completely confounded with treatment and we learn absolutely nothing about the difference between the treatments within an age class. Ugh. But we used randomization!!? The probability of this horrible design is 2/70, almost 0.03! Small, but not that small when you consider the consequences.
  2. 32 of the 70 possible randomizations result in each treatment group having 3 from one age class and 1 from the other. This isn’t as bad as the previous situation, but it certainly isn’t great! Age class and treatment are nearly confounded, and any information about the difference between treatments within an age class relies on a single person in one of the two groups. I don’t think many would label this arrangement as “balanced.”
  3. Finally, and just for completeness, the remaining 36 out of 70 possible randomizations do result in balance relative to age classes, with 2 participants from each age class in each treatment group.
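The three counts above can be checked by brute force. The following is a minimal Python sketch, not the R code from the post's appendix; the participant numbering and the function name `tally_allocations` are made up for illustration. It lists every way of choosing 4 of the 8 participants for treatment A and tallies how many 20-year-olds land there.

```python
from collections import Counter
from itertools import combinations

# Participants 0-3 are the 20-year-olds and 4-7 the 50-year-olds
# (this numbering is an arbitrary labeling chosen for the sketch).
ages = ["young"] * 4 + ["older"] * 4


def tally_allocations(ages, group_size=4):
    """Count allocations by the number of 'young' participants in group A."""
    counts = Counter()
    # Choosing who gets treatment A fixes who gets treatment B.
    for group_a in combinations(range(len(ages)), group_size):
        n_young_in_a = sum(ages[i] == "young" for i in group_a)
        counts[n_young_in_a] += 1
    return counts


counts = tally_allocations(ages)
total = sum(counts.values())

confounded = counts[0] + counts[4]  # Scenario 1: 4-and-0 split
lopsided = counts[1] + counts[3]    # Scenario 2: 3-and-1 split
balanced = counts[2]                # Scenario 3: 2-and-2 split

print(total, confounded, lopsided, balanced)  # 70 2 32 36
print(round(balanced / total, 3))             # 0.514
```

Because all 70 allocations are equally likely under unrestricted randomization, dividing each count by 70 gives the probability of each scenario.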

Here’s the quick summary. Using randomization in the design does not guarantee that you end up in Scenario 3! In fact, only a little over 1/2 of the randomizations (36/70) land you there. Is it at all appropriate to use the word “ensure” to go along with a probability of 0.51? Really, “ensure” only goes with a probability of 1.0 — which doesn’t typically belong in the same room with something called “randomization.” Looking at the other side, nearly half of all possible randomizations result in a design that is clearly not balanced relative to age class.

If the researcher wants to “ensure balance” relative to age, then they should forgo unrestricted randomization. They can ensure balance for this variable by forgoing randomization altogether (and giving up its benefits) or, much better, by employing restricted randomization through blocking: randomly assigning two individuals to each treatment within each age class. Note that use of restricted randomization in the design should then be accounted for in the analysis, because it changes the randomization distribution by changing the collection of possible random assignments! The potential benefits of blocking will be saved for another post as well.
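Here is a rough Python sketch of what blocking by age class could look like for this example. It is an illustration under assumptions, not a recommended implementation: the helper names `blocked_assignment` and `count_blocked_allocations` and the participant numbering are hypothetical.

```python
import math
import random

# Hypothetical participant labels: 0-3 are the 20-year-olds, 4-7 the 50-year-olds.
young = [0, 1, 2, 3]
older = [4, 5, 6, 7]


def blocked_assignment(blocks, per_treatment=2, rng=random):
    """Within each block, randomly pick `per_treatment` units for A; the rest get B."""
    assignment = {}
    for block in blocks:
        to_a = set(rng.sample(block, per_treatment))
        for unit in block:
            assignment[unit] = "A" if unit in to_a else "B"
    return assignment


def count_blocked_allocations(blocks, per_treatment=2):
    """Number of distinct assignments the blocked design can produce."""
    total = 1
    for block in blocks:
        total *= math.comb(len(block), per_treatment)
    return total


assignment = blocked_assignment([young, older])
print(assignment)                                # varies from run to run
print(count_blocked_allocations([young, older]))  # 6 * 6 = 36
```

Every assignment this produces has two participants from each age class in each treatment group, so balance on age holds by construction. The set of possible assignments shrinks from 70 to 36, which is why the analysis needs to use the restricted randomization distribution.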

The example used a well defined variable suspected to be related to effectiveness of the treatment and measured on all the participants. What about the characteristics that we don’t yet know might be related to the effectiveness of the treatment or that we haven’t measured, or don’t know how to? Will those be automatically balanced through randomization? We can only control for so much in the design and then have to get to the point where we are willing to ignore everything else. But we end up at the same place — randomization will not magically ensure balance for the unknown, or willfully ignored, variables in a study.
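To make this concrete, here is a small Python sketch using an invented numeric characteristic that the researcher never measured. The values in `hidden` are made up purely for illustration. The sketch lists all 70 unrestricted allocations and computes the difference in group means for that characteristic.

```python
from itertools import combinations

# Hypothetical values of an unmeasured characteristic for the 8 participants.
# These numbers are invented for illustration only.
hidden = [3, 7, 1, 9, 4, 6, 2, 8]


def mean_differences(values, group_size=4):
    """Difference in group means (A minus B) for every possible allocation."""
    total = sum(values)
    other_size = len(values) - group_size
    diffs = []
    for group_a in combinations(range(len(values)), group_size):
        sum_a = sum(values[i] for i in group_a)
        diffs.append(sum_a / group_size - (total - sum_a) / other_size)
    return diffs


diffs = mean_differences(hidden)
average_diff = sum(diffs) / len(diffs)          # averaged over all allocations
largest_gap = max(abs(d) for d in diffs)        # worst single allocation
n_exactly_equal = sum(d == 0 for d in diffs)    # allocations with equal means

print(len(diffs), average_diff, largest_gap, n_exactly_equal)
```

Averaged over every possible allocation, the difference is zero; that is the “in the long run” sense of balance. Any single allocation can still be far from zero, and with these made-up values the worst case is a gap of 5 between the group means.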

At the risk of being repetitive, I feel the need to restate my point. Randomization does not ensure balance in a single study, and we should stop saying or implying it does. To be very clear: I am not arguing against using randomization. I am arguing against selling false benefits of it.

Appendix – some R code

About Author

MD Higgs

Megan Dailey Higgs is a statistician who loves to think and write about the use of statistical inference, reasoning, and methods in scientific research - among other things. She believes we should spend more time critically thinking about the human practice of "doing science" -- and specifically the past, present, and future roles of Statistics. She has a PhD in Statistics and has worked as a tenured professor, an environmental statistician, director of an academic statistical consulting program, and now works independently on a variety of different types of projects since founding Critical Inference LLC.

2 Comments
  1. Martha Smith

    I think it’s worth pointing out that the problem here is an “abuse of notation” one. In the theory leading to hypothesis tests, randomization does provide a type of “balance” — but it is balance “in the long run”, and not within any particular experiment or study. As so often happens, the devil is in the details — but the details often seem to be “lost in translation” when ideas go from one person to another. (I often call this “the game of telephone problem” — referring to the game in which participants sit in a circle; one person thinks of a word or phrase, whispers it to the person next to them, who whispers it to the next person, etc. When the message gets all around the circle, the last person says it out loud — and the difference between the input and the output typically produces a lot of laughter. Unfortunately, the analogous “successive mishearings” are not funny when trying to communicate the subtleties of statistical techniques. People may be well-intended in trying to explain a complex idea more simply, but the successive simplifications only serve to distort the understanding of the concept.)

    • MD Higgs

      Martha,

      Agreed. I should have left in my draft sentence acknowledging the difference between discussing a single study and discussing the in-the-long-run properties of a hypothetical collection of replications of a study. I chose to stick with the focus on implications of the statement for a single study because that is where the problem lies in practice — taking in-the-long-run properties and talking as if they apply in a meaningful way (or ensure something) about a single study. Thanks again for bringing this up!

      P.S. I decided to edit the post to add back in some wording and acknowledgement of this problem.
