“It is essential that the experiments give the correct result”
June 14, 2021
I happened across this “fact sheet” a while ago – “How to Reduce the Number of Animals Used in Research by Improving Experimental Design and Statistics,” provided by The Australian and New Zealand Council for the Care of Animals in Research and Teaching (ANZCCART). According to their website, ANZCCART is an “independent body which was established to provide a focus for consideration of the scientific, ethical and social issues associated with the use of animals in research and teaching.” I’m not sure it deserves a whole post – but for some reason I started writing one about it – I think because it represents something very common, but not talked about as much as egregious mistakes, misuses, or misinterpretations related to Statistics. It’s not so much about the details as about the attitude or implicit messages about the role of statistical methods and results in the research process. The idea that churning out the numbers with a software package is the end of the game, and that statistical analysis plays that end-game role in “confirmatory studies.”
The title and goal of the fact sheet sounded good to me, and I was just a little curious about how they wrapped up their advice into a concise “fact sheet”-style report for users without much of a background in stats. Taking experimental design and statistical concepts seriously should be part of trying to reduce the number of animals needed for experiments – as well as avoiding common statistical mistakes and misinterpretations. But the fact sheet leaves some things feeling complicated and others feeling way too easy – like what to do with statistical results after you have them and how they play into the conclusions of a “confirmatory study.”
I’m just going to comment on a few things in this post, focusing more on the implicit messages we can send (inadvertently?) when writing about design and statistical analysis for an audience of researchers with potentially little background in Statistics. I think we do more harm than we realize by perpetuating oversimplified beliefs about what is involved in Statistics – beyond just computing statistics.
The advice in the report is meant to apply to “confirmatory experiments,” as opposed to pilot studies or exploratory studies. This always makes me wonder about the assumed role of statistical hypothesis testing in confirmation of a scientific hypothesis. We should always wonder about what role we’re placing the statistical methods in, and particularly whether we’re giving too much of the science or decision making over to simple statistical estimations or tests.
In this fact sheet, there is the implicit assumption or expectation that statistical tests for means (likely through ANOVA) are the way to test scientific hypotheses of differences among treatments in a “confirmatory” way. I am not delving into the extensive and rich literature and debates about exploratory vs. confirmatory research in Philosophy and Statistics, but I do point out that the fact sheet presents its approach to confirmatory research as if it were a broadly accepted and agreed-upon way of doing business – the disagreements, unsettled questions, and philosophical arguments aren’t mentioned or appreciated. And maybe a “fact sheet” is just not the place or venue to bring them up – but then we at least need to consider the potential implications of continuing to pretend there is more consensus than there really is on the role of Statistics in doing science, even pretty clean-cut science as described here. The “fact sheet”-style strategy and presentation is, simply, oversimplified.
Here is the description of a confirmatory experiment:
Confirmatory experiments are used to test a formal, and preferably quite simple, hypothesis which is specified before starting the experiment. In most cases there will be a number of treatment groups and the aim will be to determine whether the treatment affects the mean, median or some other parameter of interest. In this case it is essential that the experiments give the correct result. It is this type of experiment which is discussed in more detail here.
I find the emphasis on “it is essential that the experiments give the correct result” fascinating. Of course, we would like a “correct result” – but what does that even really mean? Are we talking about capturing some true difference in means by an estimate or statistical interval? Or concluding there is any non-zero difference in means when in fact there really is (or vice-versa)? I could go on, but the point is – if we’re in the land of statistical inference, we’re not in the land of knowing we’ll get “the correct result,” as much as we would love to be. However, I find this attitude common, and concerning. It supposes that the goal of Statistics is to rid a situation of uncertainty, rather than to provide something useful in the face of variability and uncertainty. There are many places in the document that I think feed this attitude or message – even if subtle.
The usual issues arise with the brief discussion of “significance level” and “power” – really, the language just perpetuates dichotomous thinking and decision making through the “false positives” and “false negatives” narrative, which I guess is consistent with the need to get a “correct” or “incorrect” result implied in other wording.
Some things I liked
Some tidbits I did like included the easy-to-digest description of identifying or defining the “experimental unit” – something that is often confusing in laboratory research with animals, and often where a conversation between a statistician and a researcher first leads. The fact sheet also directly discusses how animals will be “caged” – the implications of which are often considered far too late in the process, rather than at the design phase.
Perhaps the part I’m happiest with is this description of “The effect size on the parameter of scientific interest” in the context of power analysis:
This is the difference between the means of the treated and control groups which are of clinical or biological significance. A very small difference would be of no interest, but the investigator would certainly want to be able to detect a large response. The effect size is the cut-off between these two extremes. A large response is easy to detect, but a small one is more difficult so needs larger groups.
It doesn’t say “use the estimate of the effect size from a pilot or other study” – it clearly says to use one that represents “clinical or biological significance,” as I’ve talked about elsewhere. Where the cutoff ends up being placed is tricky business that isn’t discussed, but overall it was refreshing to see this. I wonder how/if it could be translated into practice based on the brief description…
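Out of curiosity, here is roughly what that kind of calculation could look like in practice – a hypothetical sketch in Python using statsmodels, not anything from the fact sheet. The smallest biologically meaningful difference and the assumed standard deviation below are made-up numbers; choosing them is the hard part, not running the function.

```python
# A minimal sketch of a power calculation for comparing two group means,
# where the effect size reflects biological relevance rather than a pilot
# estimate. All numbers are hypothetical.
from statsmodels.stats.power import TTestIndPower

smallest_meaningful_diff = 2.0   # smallest difference of biological interest (response units)
assumed_sd = 2.5                 # assumed within-group standard deviation
effect_size = smallest_meaningful_diff / assumed_sd  # standardized (Cohen's d)

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=effect_size,
                                    alpha=0.05,
                                    power=0.80,
                                    alternative='two-sided')
print(f"Animals needed per group: {n_per_group:.1f}")  # round up in practice
```

The fact sheet doesn’t give a worked example, so this is only my guess at how its advice might translate into an actual calculation.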
A few worrisome things
Use of in-bred strains of mice is encouraged. The decrease in variability makes sense and of course impacts the number of experimental units in an obvious and practically relevant way. However, there is no discussion or acknowledgement of the downsides of using in-bred mice in terms of limiting the scope of inference, or external validity. This is something that should be considered as a tradeoff in the design, though that is hard because it isn’t quantified by any sample size formula.
Blocking is referenced – which I see as a good thing. But, I find it interesting that a randomized block design with no replication within blocks is presented as the default option. This may be reasonable given the emphasis on minimizing the number of animals, but having to purchase the untestable assumption of no block x treatment interaction should at least be mentioned and considered. There are always trade-offs, and we need to be careful when presenting something as a default, rather than a decision to be made and justified.
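To make that trade-off concrete, here is a small sketch with simulated data (mine, not the fact sheet’s) of the additive analysis for a randomized block design with a single animal per block–treatment combination. With only one unit per cell, the block x treatment interaction can’t be estimated separately from error, so the additive model is something you assume rather than something you can check from the design.

```python
# Sketch: randomized block design with one experimental unit per
# block x treatment cell. Data are simulated for illustration only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
blocks = [f"b{i}" for i in range(1, 6)]      # 5 blocks (e.g., cages or days)
treatments = ["control", "low", "high"]      # 3 treatment groups

df = pd.DataFrame([(b, t) for b in blocks for t in treatments],
                  columns=["block", "trt"])
df["y"] = 10 + rng.normal(scale=1.0, size=len(df))  # made-up responses

# Additive model: block + treatment. With one unit per cell, a block:trt
# interaction term would use up all residual degrees of freedom, so it is
# assumed to be zero rather than tested.
fit = smf.ols("y ~ C(block) + C(trt)", data=df).fit()
print(anova_lm(fit, typ=2))
```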
Then, a method for justifying sample sizes called the “Resource Equation method” is presented as an alternative to the “usually preferred” power analysis method. This isn’t a method I’m familiar with, so I was a little intrigued – particularly when it was described as a way to get around the parts of power analysis people find most difficult (and therefore often skip or shortcut). The challenges with power analysis are described:
However, this can be difficult where more complex experimental designs are employed as can happen in more fundamental research projects. For example, if there are several different treatment groups, it can be difficult to specify the effect size (signal) that would be of scientific interest and if many characters are to be measured, it may be difficult to decide which is the most important. The power analysis also requires a reliable estimate of the standard deviation, so it cannot be used if this is not available.
And then the Resource Equation method is provided as an easier-to-use improvement:
The Resource Equation method provides a much better alternative for experiments with a quantitative outcome (i.e. using measurement data). It depends on the law of diminishing returns. Adding one more experimental unit to a small experiment will give good returns, but as the experiment gets larger the value of adding one additional unit diminishes. The resource equation is:
E = (total number of experimental units) - (number of treatment groups)
E should normally be between 10 and 20, although it can be greater than 20 if the cost of an experimental unit is low (e.g. if it is a well in a multi-well plate) or in order to ensure a balanced design with equal numbers in each group. As an example, suppose an experiment is to be set up to study the effect of four dose levels of a compound on activity in male and female mice. This is a factorial design (discussed below), and it involves eight groups (4 doses x 2 sexes). How many animals should be used in each group? According to the Resource Equation if there were, say, three mice per group, that would involve a total of 24 mice and with eight groups E=24-8 = 16. So this would be an appropriate number. Of course, these animals should be chosen to be free of disease, of uniform weight and preferably of an inbred strain.
Wow – very simple and easy. But what is the justification, beyond appealing to the law of diminishing returns? And where does the “between 10 and 20” really come from? I didn’t look into it further. But claiming that “The Resource Equation method provides a much better alternative for experiments with a quantitative outcome (i.e. using measurement data)” is a strong statement. I’m not sure why the previous paragraph states that power analysis is “usually preferred” then – as that would have to use a quantitative outcome as well. I do see why practicing researchers would greatly prefer the Resource Equation method for its simplicity and the lack of justification needed, but is that a good enough reason? It is certainly easier to disconnect number-of-animal considerations from statistical estimation, but how is that consistent with still relying on statistical estimates or tests in the end? I find this part a bit perplexing.
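Just to make the arithmetic concrete, here is the Resource Equation rule as a few lines of code – my own sketch of the rule as stated in the fact sheet, using their 4 doses x 2 sexes example.

```python
# Sketch of the Resource Equation rule of thumb as described in the fact sheet:
# E = (total experimental units) - (number of treatment groups),
# with E "normally" falling between 10 and 20.
def resource_equation_E(units_per_group: int, n_groups: int) -> int:
    return units_per_group * n_groups - n_groups

# The fact sheet's example: 4 doses x 2 sexes = 8 groups, 3 mice per group.
n_groups = 4 * 2
E = resource_equation_E(units_per_group=3, n_groups=n_groups)
print(E)             # 24 - 8 = 16
print(10 <= E <= 20) # True, so "appropriate" by this rule
```

Nothing in those few lines touches the effect size of interest, the variability, or what the analysis is ultimately supposed to detect – which is exactly what makes the “much better alternative” claim hard for me to accept.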
Interactions are mentioned, but the wording implies that main effects and interactions can be estimated – without making the point that meaningful interpretation of “main effects” in the presence of interactions is an issue. In my experience, even intro classes in analysis of variance don’t do a great job in presenting the reasons, except to say that tests for interactions should happen before tests for main effects.
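To illustrate what I mean, here is a sketch with simulated data of my own (not anything from the fact sheet): in a factorial design where the treatment effect differs by sex, the treatment “main effect” averages over two very different sex-specific effects and can look like nothing is going on.

```python
# Sketch: a 2 x 2 factorial where the dose effect depends on sex.
# Simulated data for illustration; the point is that when an interaction
# is present, the dose "main effect" averages over two quite different
# sex-specific effects.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(7)
rows = []
for sex in ["F", "M"]:
    for dose in ["control", "treated"]:
        # treatment raises the mean in females, lowers it in males
        shift = {("F", "treated"): 2.0, ("M", "treated"): -2.0}.get((sex, dose), 0.0)
        for _ in range(6):
            rows.append((sex, dose, 10 + shift + rng.normal(scale=1.0)))
df = pd.DataFrame(rows, columns=["sex", "dose", "y"])

fit = smf.ols("y ~ C(sex) * C(dose)", data=df).fit()
print(anova_lm(fit, typ=2))                      # large interaction; dose "main effect" near zero
print(df.groupby(["sex", "dose"])["y"].mean())   # the cell means tell the real story
```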
Finally, to The Statistical Analysis…
Then, the fact sheet ends with a whole section on “The Statistical Analysis”. It has some good advice, such as:
The statistical analysis of an experiment should be planned at the time that the experiment is designed and no scientist should start an experiment unless he or she knows how the results will be analysed. To do so is asking for trouble. They may find that they do not have the tools or the know-how for the analysis so that it does not get done correctly. They may put off the analysis until they have done several similar experiments, but in this case they will be unable to adjust conditions according to results observed in the earlier experiments.
Beyond that, there’s not much substance in the section – it’s short and quite vague. It sounds like one just needs access to a reputable software package and basic knowledge of analysis of variance. After some exploratory data analysis, “the statistical analysis should be used to assess the significance of any differences among groups.” No mention of estimation of effects or interpretation of results – just assessing “significance of any differences.” Example results are given, but not interpreted.
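For contrast, here is a hypothetical sketch (simulated data, not the fact sheet’s example) of what reporting estimates rather than only “assessing significance” might look like – estimated differences between group means with confidence intervals attached.

```python
# Sketch: report estimated treatment differences with confidence intervals,
# rather than only stating whether an F-test was "significant".
# Simulated one-way layout for illustration only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
groups = ["control", "low", "high"]
df = pd.DataFrame({
    "trt": np.repeat(groups, 8),
    "y": np.concatenate([10 + rng.normal(scale=2.0, size=8),
                         11 + rng.normal(scale=2.0, size=8),
                         13 + rng.normal(scale=2.0, size=8)]),
})

fit = smf.ols("y ~ C(trt, Treatment(reference='control'))", data=df).fit()
ci = fit.conf_int()
est = pd.DataFrame({"estimate": fit.params,
                    "lower95": ci[0],
                    "upper95": ci[1]})
# Intercept row is the control mean; the other rows are estimated
# differences from control, each with a 95% interval.
print(est)
```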
I’m not sure what I expected from this section – maybe less info is better than more, but this still doesn’t feel like enough. I guess what bothers me is that it’s presented as the culmination of the process – the end step. The hard parts of model checking, interpretation, etc. aren’t acknowledged. Maybe that’s meant to be included in a “basic understanding of analysis of variance,” but that’s not consistent with my experiences as a teacher or a collaborator.
How should the results be used and reported? What is the analysis capable of? What is it not capable of? What are common mistakes and misinterpretations in this context? And maybe my worry and skepticism were largely fed by the statement early in the paper that it is “essential” that the confirmatory studies they are referring to give “the correct result.” If that’s the goal, and the end step in the fact sheet is assessing significance from analysis of variance-related F-tests, what does that imply about the role of statistical inference in the science?