What does your haphazard distribution look like?


In my last post (Thieving from the Assumption Market), I suggested effort put into the design can earn a nice down payment toward justifying a model and associated inferences. This post considers down payments earned through random assignment and random sampling. It may not sound that exciting, or at all new, but I think I offer an atypical perspective that might be useful. You may be expecting me to argue that the down payment takes the form of combatting sources of bias — the motivation for common statements like “we used random assignment to ensure balance” and “we used random sampling to ensure representativeness.” While combatting potential biases and thinking about balance and representativeness are important, they are unfortunately misunderstood in the context of randomization and random sampling relative to a single study (see more here). This post tackles a different type of down payment — one that is often neglected.

Connecting design and model

There is a fundamental connection between use of a probability mechanism in the design (random assignment or random sampling) and justification of a probability model to base statistical inferences on. It’s not common to see acknowledgement of this connection. It’s a connection that’s easy to lose sight of in the model-regardless-of-design-centric world we now live in. I suspect a lot of users of statistical models, and even statisticians, never got the chance in their education to appreciate the historical and practical importance of the connection.

I had been using statistical methods in research for a few years before seeing the connection (or at least appreciating it) for the first time. It didn’t happen until I had changed my path to study Statistics. I still remember the feeling of really grasping it for the first time — a bit of amazement at the logic behind the idea, but also at the fact that I hadn’t seen it before. I doubt that I would have fully appreciated it without the lag. While simulation-based teaching strategies are now common in intro stats classes, I’m not optimistic it’s sinking in in a way that will transfer to research or work later on. To me, the connection feels like a crucial piece of the puzzle to understanding how we got to where we are in terms of use of statistical methods — sure, we can continue without it and call the puzzle done, but it’s a lot more satisfying to find the piece and fill in the hole. When I’m feeling optimistic, I even believe it can be an integral part of starting to re-value and set higher expectations for time and effort spent justifying choice of model.

Design-based vs. model-based inference

I used to think a lot about the differences between model-based and design-based inference, but that’s not where I’m going here because dwelling on the difference isn’t often that helpful in practice. Those going with design-based inference already understand the benefits and are likely on that path from the beginning. The use of “versus” makes it sound like those not choosing design-based inference can just ignore design, or at least that it’s not firmly connected to their models and inferences. Really it’s all model based — just different types of models with different justifications — and design is always important.

I am probably coming across as more of a purist on this issue than I intend to. I am not about to argue that you should only use probability models if you insert probability into your design. I am simply arguing that expectations for justification should depend on design. It affects the size of the loan needed to buy your assumptions, and the size of your down payment.

Now, finally to the thought exercises I promised.

Envisioning haphazard and convenience distributions

Before we go further — I am assuming you have some basic understanding of what is represented in a randomization distribution or a sampling distribution (beyond just picturing a textbook-perfect t-distribution) and how they can be used for statistical inference (intro stats level). Also, when I use the word random, I am referring to actual use of a probability mechanism (nowadays a computer’s pseudorandom number generator). A human brain does not count as a random number generator.

Part I: Random assignment and the randomization distribution

When I say random assignment, I am referring to the random assignment of individuals (or experimental units) to treatments (or groups) — or vice versa. Random assignment is widely recognized as a valuable design concept and action, but mainly for the reason of suppressing “bias” by buying “balance” (as previously mentioned). We’re going after something different here — its connection to building a relatively easy-to-justify probability model.

Creating the collection of all possible assignments

What is created by inserting random assignment into the design? It brings to life the collection of all possible random assignments that could have occurred – not just the one that did. This collection is well defined and justified because of the random assignment — it creates a situation where we know all the possible assignments and their probabilities of being chosen. So… while carrying out the experiment under different random assignments is hypothetical, the collection of possible random assignments is not hypothetical – IF random assignment is actually used in the design.

It’s this collection that leads directly to a randomization distribution, as commonly used for inference (by sprinkling some assumptions on top). For example, if the treatment does nothing (a common “null” assumption), then each unit would have the same outcome regardless of the group it was randomly assigned to. Because the collection of possible random assignments is well defined (and not hypothetical!), a summary statistic can be calculated for each one of the potential assignments — and voila! — we get a randomization distribution! This is not only conceptually important, but also practically important.
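To make this concrete, here is a minimal sketch (in Python) of building a full randomization distribution under the “treatment does nothing” assumption. The six outcome values and the group sizes are made up purely for illustration.

```python
# A minimal sketch: build the full randomization distribution for a
# difference in group means under the "treatment does nothing" null.
# The six outcome values below are hypothetical, chosen only for illustration.
from itertools import combinations

outcomes = [4.1, 5.3, 2.8, 6.0, 3.9, 5.1]   # one fixed outcome per unit (null: same outcome in either group)
units = range(len(outcomes))
n_group1 = 3

randomization_dist = []
for group1 in combinations(units, n_group1):              # every possible assignment of 3 units to group 1
    group2 = [u for u in units if u not in group1]
    mean1 = sum(outcomes[u] for u in group1) / n_group1
    mean2 = sum(outcomes[u] for u in group2) / len(group2)
    randomization_dist.append(mean1 - mean2)

print(len(randomization_dist))   # 20 possible assignments ('6 choose 3'), each equally likely
```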

You might have done plenty of group comparisons without ever constructing or visualizing a randomization distribution — or even considering its existence. The number of possible random assignments in the collection gets quite large, quite fast. (Even if you have only 20 total units randomly assigned to two groups of 10, you’re at ‘20 choose 10 = 184,756’.) So, it’s not surprising that we typically rely on the t-distribution as an approximation to the randomization distribution we’re really after. And, it’s a pretty darn good approximation under a lot of conditions.

With computing power now, we don’t need the t-distribution so much, but it’s easy and it’s “how we do things.” Unfortunately, it also lets us forget foundations. It lets us forget it’s there as an approximation of a distribution we can build and justify through our design. If we employed random assignment, then we have a huge down payment toward justification of using the t-distribution in our statistical inferences. It doesn’t buy all assumptions outright, but it gets us pretty far down the road. Or, I can forget the t-distribution altogether and get an even larger down payment using the randomization distribution directly.
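When the collection is too large to enumerate comfortably, one common route is to approximate the randomization distribution by sampling re-assignments at random. Here is a rough sketch along those lines, using made-up data for two groups of 10; the 184,756 figure above is simply ‘20 choose 10’.

```python
# Sketch: check the size of the full collection, then approximate the
# randomization distribution by Monte Carlo when enumeration is impractical.
# The data values are hypothetical.
import math
import random

print(math.comb(20, 10))   # 184,756 possible assignments of 20 units into two groups of 10

group1 = [12.1, 9.8, 11.5, 10.2, 13.0, 9.4, 10.9, 11.7, 12.4, 10.0]
group2 = [10.3, 9.1, 10.8, 9.9, 11.2, 8.7, 10.1, 9.5, 11.0, 9.2]
observed = sum(group1) / 10 - sum(group2) / 10

pooled = group1 + group2
random.seed(1)
approx_dist = []
for _ in range(10_000):                                   # random subset of the possible assignments
    shuffled = random.sample(pooled, len(pooled))
    approx_dist.append(sum(shuffled[:10]) / 10 - sum(shuffled[10:]) / 10)

# two-sided p-value based on the approximated randomization distribution
p_value = sum(abs(s) >= abs(observed) for s in approx_dist) / len(approx_dist)
print(round(observed, 2), round(p_value, 3))
```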

The thought exercise …

Now to the question I have been trying to get to. What are the implications of not employing a random mechanism? What if I haphazardly assign units to groups in a way that feels random to me? What if I assign them out of convenience or in a way that clearly improves balance between groups? Does it really matter? You might think — well, I could have gotten that same assignment through actual random assignment, so it shouldn’t affect my inferences. But, it’s not about the one assignment you implement (random or otherwise), it’s about the collection of possible assignments! The act of real random assignment creates a collection that is not hypothetical – the probabilities associated with each assignment can be calculated! It’s this collection that can form the basis for a statistical inference.

What does the collection look like if you haphazardly assigned units to groups? How many possible haphazard assignments were there? How likely were some to show up vs. others? If you can’t answer these questions, then you don’t have a well defined collection of possible assignments. If you can’t answer these questions, the concept of a randomization distribution breaks down (as well as the approximation using the t-distribution). Sure, you can bust ahead with your inferences, but you should have to take out a bigger loan to justify your model.

Let’s go with a mini-example. Suppose I have 8 individuals (creatively numbered 1 through 8) and randomly assign them to two groups. I actually did this using my computer and got 4,3,7,5 in group 1 and 1,2,6,8 in group 2. But this information isn’t that interesting. What’s interesting is that I know there were ‘8 choose 4 = 70’ ways the randomization could have turned out. I can use my computer to list all those possibilities, or even write them out by hand on a piece of paper. I know they all could have happened and I know their probabilities of happening.
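If it helps to see the collection itself, here is a small sketch that lists all 70 possible assignments for this mini-example. The particular assignment I got is just one member of the collection, with probability 1/70 like every other member.

```python
# Sketch: the full, non-hypothetical collection of possible assignments of
# units 1-8 into two groups of 4 (matching the mini-example above).
from itertools import combinations
from fractions import Fraction

all_group1_choices = [set(g) for g in combinations(range(1, 9), 4)]

print(len(all_group1_choices))               # 70 = '8 choose 4'
print({4, 3, 7, 5} in all_group1_choices)    # the assignment that actually occurred is in the collection
print(Fraction(1, len(all_group1_choices)))  # each possible assignment had probability 1/70
```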

Now, let’s repeat the exercise. Suppose I decide to haphazardly or conveniently assign individuals to groups using my brain and judgement rather than an external probability mechanism (see later note about this). Maybe I inadvertently use some information to assign individuals to groups — perhaps trying to “balance” according to some obvious factor I can see (like age). I could have easily ended up with the same four units in each group as I did when using the random assignment (described above). So, what does it matter?

Well, now try to write out the collection of all possible assignments to groups. Some of the assignments included in the list of 70 above are now no longer an option. For example, I doubt the split of 1,2,3,4 in group 1 and 5,6,7,8 in group 2 would ever show up — it seems too non-random when someone is trying hard to be random. If you really try to construct the list of assignments and assign probabilities (or equivalently, take the list of 70 and just try to assign probabilities, including zeroes) — you should feel the conundrum it puts you in. How can you construct and envision your “haphazard distribution?” To be able to do it, you would have to understand the inner workings of your brain far better than I believe any human is capable of. I don’t think it’s too strong to say — it’s impossible!

How do you then go forward, assuming you’re not going to throw your hands in the air and just give up? Well, you are forced to make more assumptions and use a pretend probability model — to assume assignment was random when in fact it was not. You should lose your down payment over this and need to get a larger loan to cover your use of the same model.

Is it worth it? In an ideal world, it wouldn’t be worth it because you would actually have to get the loan and do the justifying, but … if you live in a culture of thieving from the assumption market, then it’s actually less effort in the long run to forgo the actual random assignment. It’s not likely you’ll ever be expected to really justify it. Okay, I’m being pessimistic — but I don’t think unrealistic. I am optimistic that attempting to envision one’s “haphazard distribution” or “convenience distribution” will open some eyes as to why actual random assignment should get you a down payment.

Part II: Random sampling and sampling distributions

The same exercise can be applied to the idea of random sampling — and its fundamental connection to the concept of a sampling distribution. Instead of the collection of all possible random assignments, we have the collection of all possible random samples from a population. The sampling distribution is admittedly harder to wrap one’s head around — particularly when we enter into infinite populations and an infinite number of possible random samples.

But, it’s easy enough to walk through the exercise with a baby example — say you are taking a random sample of 2 individuals from a population of 20. It would take you a while to list out the 190 possibilities, but you could do it and it’s easy enough to envision that entire collection. This collection, with the known probabilities, would form the basis for a sampling distribution.
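Here is a sketch of that baby example, with hypothetical values attached to the 20 individuals so the sampling distribution of the sample mean can actually be tabulated.

```python
# Sketch: every possible random sample of size 2 from a population of 20,
# and the sampling distribution of the sample mean it implies.
# The population values are made up for illustration.
from itertools import combinations
from collections import Counter

population = {i: 10 + 0.5 * i for i in range(1, 21)}   # hypothetical value for each individual

samples = list(combinations(population, 2))
print(len(samples))                                    # 190 = '20 choose 2', each with probability 1/190

sampling_dist = Counter((population[a] + population[b]) / 2 for a, b in samples)
print(sorted(sampling_dist.items())[:3])               # a few (possible sample mean, count) pairs
```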

Now, suppose obtaining a random sample is deemed too difficult and too time consuming — or maybe even impossible for logistical reasons. What does the collection of possibilities look like? What does the implementation of convenience sampling do to the look of the sampling distribution? What “samples” are possible, and are some individuals much more likely than others to appear? What are the probabilities associated with each of the “samples”? For example, maybe unit 3 is very obvious and easy to include, and then all possible “samples” include that unit! Would two different people in charge of the convenience sampling end up with the same “convenience distribution”? Not at all likely.
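One way to feel the contrast is to impose a made-up convenience rule (say, unit 3 is always included because it is so easy to reach) and compare the resulting collection to the 190 equally likely samples above. The rule is purely hypothetical; a real convenience mechanism would be far messier and essentially unknowable.

```python
# Sketch of the contrast under a purely hypothetical convenience rule:
# suppose unit 3 is so easy to reach that it appears in every "sample."
from itertools import combinations

units = range(1, 21)

random_collection = list(combinations(units, 2))
convenience_collection = [(3, other) for other in units if other != 3]

print(len(random_collection))        # 190 samples, each with known probability 1/190
print(len(convenience_collection))   # only 19 "samples" are even possible under this rule,
                                     # and their probabilities are unknown and surely unequal
```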

Where does this leave us in terms of justification of an underlying probability model based on sampling distribution theory? It puts us in an analogous situation to that discussed for the randomization distribution. There’s no way to build a distribution to align with the actual design, so we’re left pretending as if random sampling took place. The foundation is tenuous without some serious justification. The potential down payment has been lost — and we should have to apply for a much larger loan to be able to justify the same inferences.

In current practice, what do we charge for using convenience sampling? Not much. We don’t typically expect more justification in return for giving up the foundations of a sampling distribution. Giving up effort in the design should lead to more effort in justification of the model, but I don’t see that happening. Typically, we just bust ahead and use the probability model as if there was random sampling — pretending as if our convenience sampling would have churned out the same collection of possible samples as probabilistic sampling would have. It’s another theft from the Assumption Market.

If any justification is given, it’s typically something like “we assume the convenience sample was as representative of the population as a random sample would have been.” This is analogous to the “balance” example under randomization. For a single study, you might get a more “representative sample” through convenience sampling, but that’s not the point. We’re missing the bigger picture that lack of random sampling destroys the foundation of a sampling distribution.

A note on “random”

As a consulting/collaborative statistician for almost two decades, I have witnessed misuse (or misunderstanding) of the word “random” over and over again. If a researcher says or writes that they used “random assignment” or “random sampling” in their design — you should always ask for more detail about the actual mechanism used. In my experience, it is more common for “random” to mean “haphazard” than random as defined by a statistician. It has been rare to see a formal probability mechanism used — it’s usually just a human brain trying its hardest to be random. There is no malicious intent to cut corners, just a genuine lack of understanding of why it matters. It is my hope that this post might help convince people that it does matter.

Wrapping up

I believe it is important to recognize the crucial link between inserting probability into design in a useful way and the justification of probability models as a basis for inference. Next time you are deciding whether to employ randomization and/or random sampling — don’t think just about “balance” and “representativeness,” but think about putting effort into the design to be able to afford better justification of a probability model. It seems like a reasonable first step toward curbing the culture of theft from the Assumption Market.

You are completely free to use whatever probability model you want, whenever you want, but … you should have to justify your choice all the way through the inferences you ultimately make. And, we should be checking each other’s justifications all the time — and weaving those checks into our judgements about how much to trust an inference.

About Author


MD Higgs

Megan Dailey Higgs is a statistician who loves to think and write about the use of statistical inference, reasoning, and methods in scientific research - among other things. She believes we should spend more time critically thinking about the human practice of "doing science" -- and specifically the past, present, and future roles of Statistics. She has a PhD in Statistics and has worked as a tenured professor, an environmental statistician, director of an academic statistical consulting program, and now works independently on a variety of different types of projects since founding Critical Inference LLC.
