Assumptions are not free – they are added information
February 19, 2020 | General | 5 Comments
This is a topic I have thought a lot about and discussed with students for years, but I have not yet tried to convey my thoughts in writing. My goal is to keep this high level and not open too many cans of worms that might derail the main point I'm trying to make. I appeal to the royal "we" throughout, just for simplicity.
Broadening our view of information
For any inference, conclusion, recommendation, etc., it's hard to disagree with the importance of considering what information is behind it. However, we tend to fall short in the range of information sources we consider. If we fall short in evaluating the information used, then we fall short in evaluating the inference itself. While I find this to be the case in many contexts, here I'm focusing on how it manifests through our reliance on statistical methods in science and decision making.
We are trained to think of information (at least the kind we should pay attention to) as coming from collected data. Information from collected data is important, but it is not the only source informing our inferences and associated conclusions. Statistical inferences focus on quantifying a particular type of uncertainty given (i.e., dependent or conditional on) a large collection of assumptions. Some of these assumptions may be explicitly stated (e.g., assuming data are generated from a normal distribution) and others may be very implicit (e.g., choices in the design, other model choices, choices about what results to report, choices about how to interpret the results, etc.). Assumptions are never exactly met, and there are many researcher degrees of freedom, yet we tend to pretend they are free. We are not forced to consider and justify their costs relative to the inferences ultimately made.
We can easily agree that data are not free, and shouldn't be. We pay in time and money to gather information as data to support our inferences. We should not make up data (even though made-up data are easy and free) because… well, it's unethical. It adds information into inferences that is not justified and is potentially very misleading. Glad we can agree on that.
What are assumptions? Can they potentially add unjustified and/or misleading information into our inferences? In my view, they certainly can. Should they be free of any costs?
Assumptions are called assumptions because they are just that. We don't proceed with an analysis under "truths" or "facts"; we proceed under human-made assumptions and other choices that we might not even label as assumptions. Just as collecting additional data has a cost, making assumptions should have a cost too (just a different type of cost).
You may think I’m taking it too far to describe assumptions in the same vein as “making up data,” but I do not see them as so far apart (despite how far apart they are treated on the scale of ethics). It could do us some good to at least go through the thought exercise of considering the similarities.
Assumptions insert information into an inference — this has to be the case if an inference depends at all on what assumptions are made (which it typically does). Statistical results and the inferences associated with them are often sensitive to assumptions and other design, modeling, and interpretation choices. Sensitivity implies information (beyond that in the data) is being inserted into the process.
However, we rarely think about assumptions as information inserted into an inference beyond that coming from the data (as we would if it were fake data). I suspect this is because we don't have an easy way to quantify the amount of information coming from assumptions and choices vs. the information coming from collected data. There are situations in which we can, and do, assess sensitivity of results to a particular assumption, but this is restricted in terms of what assumptions can reasonably be assessed, and even then I don't see it used often. It seems to be deemed more admirable to declare what model (i.e., a set of assumptions) will be used beforehand and stick to it, regardless of how sensitive the results might be to the choice.
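To make the idea of a small sensitivity check concrete, here is a minimal sketch of my own in Python; it is not from any particular analysis, and the data, seed, and all numbers are made up. It summarizes the same skewed sample two ways: an interval leaning on a normality assumption and a bootstrap interval that relaxes (but does not eliminate) distributional assumptions. If the two answers differ noticeably, the assumption is doing real work in the inference.

```python
# Hypothetical sensitivity check: how much does one assumption matter?
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.lognormal(mean=0.0, sigma=1.0, size=30)  # skewed "observed" data

# Normal-theory interval for the mean: leans on the normality assumption.
n = data.size
se = data.std(ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)
normal_ci = (data.mean() - t_crit * se, data.mean() + t_crit * se)

# Bootstrap percentile interval: fewer distributional assumptions, though
# still assumptions (iid sampling, the choice of statistic, etc.).
boot_means = [rng.choice(data, size=n, replace=True).mean() for _ in range(10_000)]
boot_ci = (np.percentile(boot_means, 2.5), np.percentile(boot_means, 97.5))

print(f"normal-theory interval: ({normal_ci[0]:.3f}, {normal_ci[1]:.3f})")
print(f"bootstrap interval:     ({boot_ci[0]:.3f}, {boot_ci[1]:.3f})")
# A noticeable gap between the two is the normality assumption's
# contribution of non-data information showing itself.
```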
My point
Assumptions and choices are not free. They add information to inferences — like data that were not collected.
We have to purchase valuable information by collecting data, but then we're allowed to shoplift additional information in the form of assumptions and other choices without any penalty. I believe we should take the hidden costs of assumptions more seriously. We should see them as added information beyond that contained in the "observed data." We should hold each other accountable for justifying that inferences are not overly influenced by assumptions. What counts as "overly influenced"? That's hard, but again not a reason to avoid the issue altogether.
At the very least, we know we are inserting more information into an inference than we admit to and inferences should be interpreted in light of this. Data-driven is actually data-PLUS-assumption driven — we just prefer not to dwell on the assumption part. We need to consider the possibility that in some cases the information slyly inserted through assumptions may even “swamp” that in the data.
Lessons from data vs. prior tensions?
I hesitate to go here, but then can't find a good enough reason not to. I purposely chose the word "swamp" above because of its use in comparing the amount of information in a posterior distribution coming from the prior distribution relative to that from "the data." In practice, it is common to hear people justify a prior by saying we don't need to worry about it because "the data swamp the prior." This is one setting where people seem to worry about the information coming in through the prior part of the model, but not enough about the information added through assumptions in "the data" part of the model (often forgetting the role of the likelihood!). Note for the record: it is possible for a prior distribution to be based on a lot of prior data!
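For readers who like to see the arithmetic, here is a tiny made-up beta-binomial sketch of the "swamping" claim (everything in it is hypothetical, including the priors and the counts). Two quite different Beta priors give noticeably different posterior means with 10 observations and nearly identical ones with 1000. And notice that even this tidy comparison is conditional on the binomial likelihood, an assumption that gets far less scrutiny than the priors do.

```python
# Hypothetical beta-binomial illustration of "the data swamp the prior."
# A Beta(a, b) prior with y successes in n binomial trials gives a
# Beta(a + y, b + n - y) posterior, with mean (a + y) / (a + b + n).
for n, y in [(10, 3), (1000, 300)]:  # (trials, successes); made-up counts
    for a, b, label in [(1, 1, "flat Beta(1,1)"), (20, 2, "strong Beta(20,2)")]:
        post_mean = (a + y) / (a + b + n)
        print(f"n={n:4d}, prior={label:18s} -> posterior mean {post_mean:.3f}")
# The priors disagree badly at n=10 (0.333 vs 0.719) and barely at
# n=1000 (0.300 vs 0.313); all of it conditional on the binomial model.
```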
I bring this up, not because I want to debate about priors, but because I find it very ironic that people can get so incredibly worked up about the sensitivity of results to prior distributions while forgetting (or simply ignoring) the information inserted by all the other design, modeling, and summarization assumptions and choices.
Apparently, some worry greatly about how much information comes from assumptions vs. data, but only in the context of a Bayesian prior. If one can get worked up about a prior distribution, then by all means get worked up about all the other sources of information silently inserted into the process with no crisis of conscience. The fact that this prior-vs-data tension can lead some to proclaim an analysis without a prior "objective" and an analysis with a prior "subjective" has always seemed absurd to me; we need to consider this relative to the entire process we actually go through in practice (from idea to inference).
As usual, I don't think the disregard is blatant. I think it stems from a lack of practice, opportunity, or reward for looking deeply and critically at our methods and processes. My hope is that starting to recognize assumptions and choices as added information may be a step (albeit tiny) forward. I entered this data-vs-prior realm only because it seems to me to be low-hanging fruit pointing to how we can think more broadly about the degree of impact assumptions have on our inferences.
I consider the prior-vs-data conversation low-hanging fruit because it's not at all hidden and stays within our comfort zone. I'm not arguing against its importance, but I think we need to attempt to view it within a larger frame of reference. The prior-vs-data tension feels like safe territory. The prior can be explicitly stated, sensitivity of the posterior to the prior for given data (and likelihood function!) can easily be examined, there are quantities such as the "effective number of parameters" we can think about, etc. It's an example of explicitly inserting information beyond that in the observed data that appears to resonate with researchers across disciplines.
On the other hand, it is an example that also demonstrates the extent of our blind spots. The belief that inferences are based only on data – unless we are using a prior distribution in Bayesian inference – is incredibly naive. It shows our blindness and lack of willingness to consider all the design and modeling assumptions and choices. Can we use it as a tangible starting point to extend thoughts and discussions – to considering information added through things we are blind to and don't talk about? What about all the other information we inadvertently or silently insert with little or no discussion and justification? It may be too loaded a topic to start with, but maybe worth a try.
Non-parametric vs. parametric
For a quick non-Bayesian reference point, I think it's useful to consider non-parametric vs. parametric methods. It's not uncommon to hear that parametric statistical tests are "more powerful" than non-parametric tests. Even if you're not going to use either, it's worth looking at the reason for this claim. Where does the increased power come from? We increase power by increasing information, typically assumed to come from decreasing variance or increasing sample size. But, given the same experimental design to be carried out, the power changes depending on what assumptions we're willing to make.
Parametric tests involve assuming (relying on) a particular probability distribution (model). Non-parametric tests still have assumptions, but they don't add as much information to the analysis in the form of an assumed probability model. My point is not about power or testing, but about the fact that inserting the assumption of a particular probability distribution is adding information (just as increasing sample size increases precision and power through added information). Yet we seem to take for granted that we added non-data information into the analysis, and thus into the inferences. Sure, you can "check" the assumption, but it is never "met" and therefore always carries some price that should not be ignored.
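A quick simulation makes the trade visible. This is a toy sketch of my own (all settings arbitrary, nothing from a real study): it estimates the power of a two-sample t-test and of the Wilcoxon/Mann-Whitney rank-sum test on the same simulated experiments. When the normality assumption happens to hold, the parametric test's extra assumed information buys a modest power advantage; generate heavy-tailed data instead and that advantage can shrink or reverse.

```python
# Toy power comparison: parametric t-test vs. non-parametric rank-sum test.
# All settings (n, shift, reps, alpha) are arbitrary choices for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, shift, reps, alpha = 20, 0.8, 2000, 0.05
hits_t = hits_w = 0
for _ in range(reps):
    x = rng.normal(0.0, 1.0, n)   # normality assumption holds in this scenario
    y = rng.normal(shift, 1.0, n)
    hits_t += stats.ttest_ind(x, y).pvalue < alpha
    hits_w += stats.mannwhitneyu(x, y, alternative="two-sided").pvalue < alpha

print(f"estimated power, t-test:   {hits_t / reps:.2f}")
print(f"estimated power, rank-sum: {hits_w / reps:.2f}")
# Swap rng.normal for a heavy-tailed generator (e.g., rng.standard_t(2, n))
# and the ordering of these two numbers can flip.
```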
Acknowledging “information” from mathematical statistics
If you took a course in Probability & Mathematical Statistics, then you are likely familiar with terms including the word "information": the information matrix, Fisher information, observed information. Information in this context is presented as quantifiable (under many assumptions that are often glossed over), and its simple mathematical nature feels clean, unobjectionable, and comfortable. The message is that the amount of "information" contained in an estimate of a model parameter comes only from the information in the data. That is, we quantify precision in an estimate in a way that seems to depend only on characteristics of the observed data (e.g., the standard error of an estimated mean is the standard deviation estimated from the data divided by the square root of the sample size). But the math is conditional on model choices (e.g., the likelihood function) and other assumptions, and this gets much less attention.
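To illustrate how the quantified "information" is conditional on the model, here is a small sketch of my own (simulated data, arbitrary seed). The same sample yields different standard errors for a location parameter depending on whether we write down a normal likelihood (MLE is the mean, SE roughly s/sqrt(n)) or a Laplace likelihood (MLE is the median, asymptotic SE roughly b/sqrt(n), with b the mean absolute deviation from the median). Same data, different amount of "information," because the likelihood choice itself inserts information.

```python
# Same data, two likelihood assumptions, two different quantified
# "informations" about the location parameter. Data are simulated
# (heavy-tailed) purely for illustration.
import numpy as np

rng = np.random.default_rng(3)
data = rng.standard_t(df=3, size=200)
n = data.size

# Normal model: Fisher information for the mean is n / sigma^2,
# so the SE of the MLE (the sample mean) is roughly s / sqrt(n).
se_normal = data.std(ddof=1) / np.sqrt(n)

# Laplace model: Fisher information for the location is n / b^2,
# so the SE of the MLE (the sample median) is roughly b / sqrt(n),
# where b is the mean absolute deviation from the median.
b = np.mean(np.abs(data - np.median(data)))
se_laplace = b / np.sqrt(n)

print(f"SE of location, assuming a normal likelihood:  {se_normal:.3f}")
print(f"SE of location, assuming a Laplace likelihood: {se_laplace:.3f}")
```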
The other place "information" often comes up is in the many information criteria (e.g., AIC, BIC, DIC). These can generally be thought of as measures of predictive accuracy. They include penalties for model complexity to avoid overfitting the observed data, given a goal of predicting new data. There is convenient mathematical theory to back up calculating these numbers given data and model choices.
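For concreteness, here is a hypothetical sketch of one such criterion in action: AIC, computed as 2k minus twice the maximized log-likelihood, for two candidate models fit to the same made-up data. Note how the whole exercise is conditional on the menu of models someone chose to compare; the criterion quantifies relative predictive accuracy within that menu, not the information inserted by assembling the menu in the first place.

```python
# Hypothetical AIC comparison: AIC = 2k - 2 * max log-likelihood.
# Data are simulated heavy-tailed values; the model menu is my choice.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
data = rng.standard_t(df=4, size=100)

# Candidate 1: normal model, k = 2 fitted parameters (loc, scale).
mu, sigma = stats.norm.fit(data)
aic_norm = 2 * 2 - 2 * stats.norm.logpdf(data, mu, sigma).sum()

# Candidate 2: Student-t model, k = 3 fitted parameters (df, loc, scale).
df_, loc, scale = stats.t.fit(data)
aic_t = 2 * 3 - 2 * stats.t.logpdf(data, df_, loc, scale).sum()

print(f"AIC, normal model:    {aic_norm:.1f}")
print(f"AIC, Student-t model: {aic_t:.1f}")  # lower is "better" within the menu
```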
Neither of these uses of "information" gets at the point I am trying to make, and I bring them up because I want to flag them as distractions from the goal of recognizing assumptions and choices as information, even when that information doesn't come wrapped in a tidy mathematical box.
Crucial questions and limits to what we can quantify
We can't rely solely on mathematics and things we can quantify to have meaningful conversations among scientists about the trustworthiness of our inferences. There are so many important questions we should be asking to spur the needed conversation, and, I like to believe, to restore more trust in conclusions and recommendations (even if they are stated with more uncertainty).
- What are the sources of information going into the analysis and thus conclusions?
- What are the relative amounts of information from the different sources (even if impossible to quantify)? How do we gauge (qualitatively) the relative amount of information inserted through assumptions compared to collected data?
- How sensitive are inferences to assumptions and choices?
Observed data are not the only source of information going into conclusions. We need to stop pretending that they are.
The second bullet above lends itself to the natural direction of attempting to quantify the relative amounts of information. Quantifying will always have its limits, because much of what is included in assumptions and choices is simply not quantifiable. And if we restrict ourselves to what we can quantify (as we have largely done thus far), then we don't really change the game. Therefore, as attractive as it is, I find myself against the idea of trying to quantify relative amounts of information, and instead favor spending the effort on challenging ourselves to think through the problems more qualitatively and creatively. This forces a different type of scientific communication: needed, though very uncomfortable. But comfort for scientists isn't the goal.
The alternative is to keep doing what we tend to do: collect some data, choose a model, superficially "check" a few assumptions, make some conclusions, and then pretend the only information going into the conclusions is that from the collected data. A starting point is beginning to think about assumptions and decisions as added information, in the same way we think about data as information.
5 Comments
Andrew Gelman
Megan:
I agree with the above post, and I think you put it well. I just want to add one thing, which is that the so-called “likelihood” or data model in statistics is full of assumptions. The division of a model into likelihood and prior is a real thing—it corresponds to a certain model of replication or hierarchical structure of sampling or experimentation (as in the standard example where the likelihood corresponds to sampling balls from an urn, and the prior corresponds to choosing an urn at random)—but it would be a mistake to think of the likelihood as “the data” and the prior as “assumptions.” The likelihood is full of assumptions too.
I’ve often written that it is better to talk about “prior assumptions” or “prior information,” not “prior beliefs.” Reading your post made me realize that we should do the same thing with the data model. We have the data model (which implies a likelihood) and a parameter model (the prior). These models represent assumptions and they carry information. One tricky point from the technical direction is that the amount of information they carry is itself random, in that, except in some very simple cases, the amount of info in these assumps will depend not just on design features such as blocking/stratification, clustering, sample size, etc., but also on the particular data that are observed.
MD Higgs
Andrew,
I completely agree. I should have made the “the data” vs. “the data model” distinction clearer. I see the Bayesian setting as providing an “in your face” example of inconsistent emphasis on acknowledging and justifying different assumptions: a tendency to blindly accept the data model and all its assumptions (often implying through our wording that it’s “just the data”), coupled with huge worries about justifying prior assumptions.
I agree with the “prior assumptions” language and that’s a great reminder. Also, I agree that the amount of information carried is the hard technical problem.
Thieving from the Assumption Market – Critical Inference
[…] The heart of what I’m after is more awareness and discussion about how we justify our inferences — and any models (in a broad sense) propping them up. This post relates to thoughts I recently shared on the cost of assumptions here. […]
Assumptions in science, like assumptions in science – Critical Inference
[…] and comfort. I’ve already dedicated at least a few other posts to assumptions (here, here, here, and here) – I just can’t help coming back to […]
Make an ASS out of U and ME – Critical Inference
[…] head above water. Sure – assumptions related to Statistics (as I’ve hit on before here, here, here, and here and will continue to hit on in the future), but the pool extends far beyond those. […]