Defining replication
December 20, 2019
Replication is a word that has long been important in conveying concepts fundamental to experimental design and statistical inference in general. It is one of the first ideas students of Statistics must wrap their heads around in terms of designing, analyzing, and interpreting results from an experiment. First exposure to the word in Statistics is usually as a within-study idea, rather than a between-study one. Within the larger scientific community, the between-study version of the concept has been receiving a growing amount of attention — even in the form of dramatic terms such as the “replication crisis” or “the crisis of replicability.” The word “reproducible” is also used, and the similarities and differences in meaning deserve more attention, but I restrict this discussion to the concept of replication and the wording around it. Warning: this post contains a lot of scare quotes, but I find them necessary to adequately get my point across.
We can gain some clarity, or at least perspective, by revisiting the within-study context first. Replication is a word representing a concept that seems simple at first glance and leads to clean mathematics, but it is not as simple as it seems — surprise, surprise. The Statistical Sleuth (Ramsey & Schafer, 2013) broadly defines it on page 704 with the sentence: “Replication means conducting copies of a basic study pattern.” I like this very general description, for what it says and what it does not try to say. It immediately points to the problem that comes up in reality — the reliance on “copies.” We proceed by assuming we have (or can conduct) “copies” when we know this is not actually doable. We end up with things that are not exactly copies, but close enough that we are willing to ignore their differences. That is, any explainable differences are willfully ignored because they are not deemed to present a big enough problem to make us discard the “copies” version of the model we hope to use for inference. There is always a continuum — it might be very easy to willfully ignore differences between widgets or petri dishes, but very hard to willfully ignore differences among humans (or at least it should be). The copies assumption then justifies attributing differences in the measurements taken on the “replicates” to “pure error.” This idea of “pure error” opens another large can of worms — but for the sake of this discussion, it’s important to see that replication is the strategy for obtaining units (used generally) that are considered copies in the sense that we can happily ignore other explanations of differences in measurements from those units.
Now, let’s move closer to the context we’re used to hearing about today. The between-study context frequently referred to in discussions today is still adequately described by the same quote from The Statistical Sleuth: “Replication means conducting copies of a basic study pattern.” In this case “copies” refers to whole studies or experiments carried out in the same (or similar) way to investigate the same question, rather than smaller units within a single experiment. The degree to which one study can be considered a copy of the other also exists on a continuum. In some cases, it may be the same researcher, the same lab, the same protocol, etc., and in other cases it may be an experiment carried out by different people with differences in the design, but with the goal of investigating the same question or estimating the magnitude of the same effect. The idea that any experiment is an exact copy of another is clearly false — so again, we must proceed as if any differences are irrelevant enough that we can ignore them for the sake of making inferences. So, the idea of replication is defined both within an experiment and across experiments. But what actually is the word “replication” referring to, and is it consistent with how it’s being used across science today?
Replication is the act of trying to create a copy of (or repeating) the “basic study pattern.” This says absolutely nothing about the results of the two studies and how similar they might be. A study design is replicated (at least to some degree) if it is copied to the extent that others agree any differences can be ignored (or accounted for in another way). The idea behind replication in statistical inference is to use it to quantify variability in some measured quantity that we can’t (or don’t want to) explain away. If, in a single experiment, measurements taken on multiple experimental units are farther apart than expected, we do not necessarily take that as a failure in the experimental design. We instead, assuming no errors are identified or other explanations arise, take that information and quantify the variation to represent some level of “background variability” (used as a basis for relative comparisons among units that are not copies, but instead differ by characteristics that are of interest). It is worth thinking for a few minutes about how this differs from many discussions around “replication” today.
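To make the within-study version concrete, here is a minimal sketch (in Python) of how measurements on units treated as copies get pooled into an estimate of “pure error” — the background variability used as the basis for relative comparisons. The groups and values are made-up numbers for illustration only, not data from any real study.

```python
import numpy as np

# Toy example (numbers are purely illustrative): three treatment groups,
# each with four "replicate" units we are willing to treat as copies.
measurements = {
    "A": np.array([10.1, 9.8, 10.4, 9.9]),
    "B": np.array([12.3, 11.9, 12.6, 12.0]),
    "C": np.array([11.0, 11.4, 10.7, 11.2]),
}

# "Pure error": pool the within-group variation. Differences among units
# treated as copies have no explanation we are keeping track of, so they
# are used to quantify the background variability.
ss_within = sum(((y - y.mean()) ** 2).sum() for y in measurements.values())
df_within = sum(len(y) - 1 for y in measurements.values())
pure_error_sd = np.sqrt(ss_within / df_within)

print(f"Pooled 'pure error' SD: {pure_error_sd:.3f} (df = {df_within})")
```

Of course, the pooled standard deviation deserves the name “pure error” only to the extent that we are comfortable willfully ignoring the remaining differences among the units within each group.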
What is happening in the evolution of the word “replication” in science, and why? It is used as if it represents a dichotomy (and yes, it is another false one) — a study is either replicated or not. Well, according to the information in the first part of this post, if the follow-up study can be considered an adequate copy of the first then it was [successfully] replicated (under the limitations of “copies”). People may disagree on the quality of the “copy” and there may be argument about whether a study should be used as if it represents a “replication” — and that is all good. This, like a lot of statistical inference, is all carried out in a world of as-if’s — even if it seems hard or uncomfortable to constantly remind ourselves of this. Note that nothing I have said here has anything to do with the results of the two (or more) studies!
Today, in many conversations I see “replication” being used as if it is a property of the results of the studies, and not the designs and methods of conducting the studies. Phrases include “the study replicated” or “the study was (or was not) successfully replicated.” This binary results-focused view and its associated language are causing oversimplification, confusion, and other misunderstandings. In the new language, “replicated” is typically used to mean that results from a study carried out as a “copy” of an original study “match” those from the original study. But this definition brings in a whole other layer of problems, because it not only requires assessing whether the studies should be treated as copies, but also requires assessing whether results “match” according to some criterion or set of criteria. This is not an easy problem and clearly depends on the set of criteria used to categorize results as a match (or not). Going into the nuances of this problem is beyond the scope of this single post, but I hope it is clear how complicated the situation becomes. Instead, I hope we can acknowledge the challenges in the definition implied by the language we use. Let’s back up, pay more attention to implications, and be clear about what is being assumed and what needs to be justified. We should separate the two ideas in our language.
(1) Assess the degree to which the studies are copies. Assessing whether one study replicated a previous study (or was a successful replication of a previous study) should focus on assessing the claim that it can/should be considered a “copy” of the first. From that point, one can then figure out what to do with the results from two (assumed) copies of the same experiment. There is already a lot to think about here.
(2) Assess the degree to which the results of two (or more) studies are consistent (consistency of results). Perhaps I should come up with a catchier phrase, but this is what I have for now. It may make sense to do this even when one can argue that the differences between two studies are such that one should not be considered a replication of the other. And consistency is not a yes-or-no answer. It is way oversimplified and naive to think that the results from two studies investigating the same question are either consistent with each other or not consistent. Let’s try not to fall into this false dichotomy and instead do the hard work of explaining, in a continuous way, how the results might agree and differ (at the same time) — while being transparent about what criteria and methods we use for doing so.
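As one concrete (and certainly not the only) way to describe consistency on a continuum, here is a small Python sketch that compares two studies’ effect estimates directly, reporting the difference and its uncertainty rather than a replicated/not-replicated verdict. The estimates, standard errors, and the normal-approximation interval are all assumptions for illustration.

```python
import math

# Hypothetical summaries from two studies of the same effect
# (estimates and standard errors are made up for illustration).
est1, se1 = 0.42, 0.18   # original study
est2, se2 = 0.25, 0.15   # follow-up study

# Describe (dis)agreement continuously: the difference in estimates
# and an approximate interval for it, rather than a yes/no label.
diff = est1 - est2
se_diff = math.sqrt(se1**2 + se2**2)
lower, upper = diff - 1.96 * se_diff, diff + 1.96 * se_diff

print(f"Difference in estimates: {diff:.2f} (SE {se_diff:.2f})")
print(f"Approximate 95% interval for the difference: ({lower:.2f}, {upper:.2f})")
```

Even this is only one summary among many, and whether it is a reasonable one depends on the designs being comparable enough to warrant the comparison in the first place — point (1) above.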
Just for the record, I completely disagree with using an assessment of whether a p-value is < 0.05 in two studies as the criterion for labeling a study as successfully replicated or not. I disagree on multiple levels — with the implied definition of “successfully replicated,” with the lack of emphasis on comparing the designs and analyses, with using a single criterion to assess a match, and with using the p-value as a criterion. There are likely other layers that I also disagree with that I’m just not thinking of at the moment. It is time to stop oversimplifying and falling prey to the same things that we tend to blame this “replication crisis” on — overuse of false dichotomies, unwillingness to acknowledge uncertainty and work on continuums, and in general, taking short-cuts.
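To see one way the p < 0.05 criterion misleads, consider this small Python illustration (all numbers invented): two studies with nearly identical estimates can land on opposite sides of the 0.05 line simply because of differing precision, and the vote-counting rule would then declare a failure to replicate.

```python
import math

def two_sided_p(est, se):
    """Two-sided p-value from a normal approximation to estimate / SE."""
    z = abs(est / se)
    return math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(z))

# Two hypothetical studies with nearly identical effect estimates,
# differing mainly in precision (e.g., sample size).
p1 = two_sided_p(0.40, 0.18)  # p ~ 0.026 -> "significant"
p2 = two_sided_p(0.38, 0.21)  # p ~ 0.070 -> "not significant"

print(f"Study 1: estimate 0.40, p = {p1:.3f}")
print(f"Study 2: estimate 0.38, p = {p2:.3f}")
# Under the "p < 0.05 in both studies" rule, study 2 "failed to
# replicate" study 1, even though the estimates are nearly identical.
```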
I end with a quote from RA Fisher’s 1944 Statistical Methods for Research Workers — to remind us that the act of replicating (within and between studies) is and was fundamental to statistical inference and the science using it. Let’s remember this historical context and build upon it, rather than ignoring it or replacing it with dramatic-sounding, but oversimplified, new language.
The idea of a population is to be applied not only to living, or even to material, individuals. If an observation, such as a simple measurement, be repeated indefinitely, the aggregate of the results is a population of measurements. Such populations are the particular field of study of the Theory of Errors, one of the oldest and most fruitful lines of statistical investigation. Just as a single observation may be regarded as an individual, and its repetition as generating a population, so the entire results of an extensive experiment may be regarded as but one of a population of such experiments. The salutary habit of repeating important experiments, or of carrying out the original observations in replicate, shows a tacit appreciation of the fact that the object of our study is not the individual result, but the population of possibilities of which we do our best to make our experiments representative. The calculation of means and standard errors shows a deliberate attempt to learn something about that population.
RA Fisher (1944). Statistical Methods for Research Workers, 9th edition reprinted by University of Michigan Libraries Collection, pages 2-3. BOLDFACE added by me for emphasis.
1 Comment
MD Higgs
Linking to this comment by Andrew Gelman titled: “Don’t characterize replications as successes or failures”
http://www.stat.columbia.edu/~gelman/research/published/Making_Replication_Mainstream_gelman_comment.pdf