I landed my first real gig as a collaborative statistician as part of a required course in my Statistics master’s program — “Statistical Consulting Seminar.” Most programs have one (if they don’t, they should) — it’s the residency or internship part of becoming a practicing statistician. I went into it with some nerves and some excitement. I wanted it to be a confidence-building success, but at the same time wanted to experience some tough aspects of the job while still surrounded by the support of peers and faculty advisors. I guess I got that, but I had no idea at the time that the dilemma I faced in that first experience would grow to feel like a theme underlying my work as a collaborative statistician. I also didn’t fully grasp the magnitude or the source of the underlying issues leading to the dilemma. Twenty years later, things are no better, and might even be worse — as pressure to publish and obtain grant funding continues to drive decisions in research. We need to recognize it and start talking about it — among statisticians and other scientists.

The start to a theme

The researcher (a.k.a. client) was an equine veterinary PhD student who had carried out a few experiments to investigate the effectiveness of a new treatment relative to the standard treatment (enough time has passed that many of the details are now fuzzy and it’s not my intent to use this post to dig back into the details). After the first meeting, I remember feeling very excited about what I could offer in terms of assistance with the project. I took the job seriously and spent many hours beyond those expected — making plots of the raw data, bringing in an “equivalence testing” approach where they had planned to use a typical null hypothesis testing approach, modeling dependence from repeat measures on the same horse, helping with interpretation, justification of models, etc. I’m sure further improvements could have been made, but I’m confident the inferences based on the approach I recommended were better justified than those from the approach they planned to use before reaching out for statistical assistance.

When it became clear I could offer a lot to the project, it was agreed that I would be a co-author on the resulting manuscripts. I used this as justification to myself to continue work on the project beyond the seminar and beyond my graduation — for no charge. I thought it would be a great start to my career as a statistician to have valuable pubs with my name on them. I definitely contributed enough work intellectually and in sweat to deserve co-authorship.

I don’t remember now exactly when the dilemma reared its head, but it had to be pretty close to manuscript submission time. There had been hints along the way that the student’s advisor was worried about trying to publish results from methods that weren’t the “typical” way of doing things in the discipline — or at least in the journal they wanted to be published in. I don’t recall ever meeting the advisor (though it’s possible I did) and I doubt direct interaction would have changed the outcome — realistically it would have been the opinion of a 20-something woman statistician-in-training against the opinion of a successful research veterinarian with probably as many years of experience in research as I had in life.

What I do remember is learning that the manuscripts would ultimately not include some (or most) of the major recommendations and justifications I had contributed. I may have learned of it first via email, but can still picture where I was sitting when we discussed it over the phone. My recollection is that the advisor decided it was safer (in terms of chances of getting it published) to go with a more common approach for that field, even if it was not as well justified from a statistical perspective. The student was left to communicate this to me — putting him in a very difficult spot. I had no weight relative to the advisor who had paid for and sponsored the research. I was shocked. I thought I had done my job well. I had provided a more defensible approach to improve inferences (even approved by my Statistics professors at the time) — and they were going to completely ignore it in favor of an approach simply because it had been used before? I don’t remember the decision being made based on the results (i.e., their approach ending with a better story for the treatment), as that would have set off a different level of alarm for me, but I also can’t guarantee that didn’t contribute to the decision.

Their resistance to me removing my name

My shock at their decision was then followed by an aftershock at the reaction to my immediate decision to remove myself as a co-author from the work. I assumed (wrongly) that removing my name was the next logical step and would be judged so by everyone involved. I was still new in the statistician role, but I had spent over two years doing research in another discipline before graduate school in Statistics. I felt pretty clear (maybe naively so) on what the prerequisites were for being included as a co-author and that being a co-author implied taking responsibility for the work reported. It seemed like a straightforward situation to me. I didn’t agree with the approach presented and therefore, even though I put a lot of work into the project, my name should not be on the paper.

In all honesty, the degree of resistance to my removing my name did make me temporarily second guess my decision, and I think this is a common and understandable reaction among early career statisticians experiencing this for the first time. Was I being unreasonable? Naive? Was I going to burn bridges and hurt my career? Was I just not aware of norms and expectations for co-authorship relative to statistical contributions?

I don’t remember finding a list of authorship criteria then, but my understanding was consistent with these four current criteria from the International Committee of Medical Journal Editors (ICMJE) [boldface emphasis is mine]:

  1. Substantial contributions to the conception or design of the work; or the acquisition, analysis, or interpretation of data for the work; AND
  2. Drafting the work or revising it critically for important intellectual content; AND
  3. Final approval of the version to be published; AND
  4. Agreement to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

It didn’t take long to stop the second guessing. Despite the resistance, I knew I did not want my name on the papers. It felt wrong for me to “sign off” on work I did not agree with. Communicating this directly and respectfully was harder than I expected. To justify my decision, I focused on the argument that not removing my name could hurt my career. This was a legitimate worry. I was starting a career as a statistician and did not want my name associated with a paper using a statistical approach I did not agree with and that may be deemed inappropriate by other statisticians. I imagined that every researcher could put themselves in an analogous situation specific to their own position or discipline. Would a veterinary researcher be okay with having their name associated with a paper describing methods that would be judged harshly by other veterinarians? I doubt it — even if it would mean giving up a publication. This career-centric argument was the one I felt got traction — probably because it didn’t make the researcher feel so guilty. In some way, it was focused more on me than the general quality of their work. Though it was and is more complicated than that.

The gray, sticky, and ugly layers

Before going on, I think we need to pause for a reality check. I had hoped to write this first post to lay out the dilemma and raise awareness of this real and serious issue, without delving too deep into the ethical aspects surrounding it. I’m still trying for this, but have found it impossible to do completely. Writing about it has actually helped me process the ethical aspects and implications — and has helped me see (or admit) the problem’s deeper and uglier layers. I still want to put this post out there without adequately addressing those layers, but I at least want to acknowledge them.

I worry that readers will be quick to make their own judgements from the outside — as if the problem is neatly black or white. I do not believe it is productive to harshly judge the individuals on either side and I ask you to trust me that it doesn’t feel so black and white when you are living it. The problem is rooted in scientific culture — in its current incentive systems and norms. Unfortunately, a lot of the incentives and norms are integrally tied to statistical methods and their results. As crazy as that is, it is a reality that creates tough situations for researchers and collaborating statisticians on a daily basis. It creates tension between choice of statistical approach and taking risks for one’s career — and this percolates negatively through science. If there were such a thing as the correct approach in most cases, then this would be simpler, but choice of statistical approach is largely a judgement call. While there are some approaches most statisticians would agree are wrong for a given setting, there are many reasonable approaches (under different assumptions and justifications).

With that said, you may have read the beginning and jumped to the conclusion that I had an ethical responsibility to do more than just remove my name from the paper. But, in reality, the line is rarely that clear (unless there is clear and intentional research misconduct). In that first situation, and many situations after, I was not sure where the line was and I wasn’t about to make sweeping accusations. I believed my approach was better justified, but I didn’t feel so confident calling into question the research of established scientists. How could I effectively argue it was unethical to use an approach that the top journals in the field were expecting, and essentially requiring? What did I have to stand on? Who or what would back me up? The ethical dilemma to me at the time was very centered on my decision to be a co-author, not about a lack of integrity or research ethics of my collaborators in general. I believe that they believed their approach was also justified — Why would the journals recommend it and publish work based on it if it wasn’t? Should they change their approach because of recommendations from one stats master’s student? I hope you can see how sticky the situation actually is.

Complicated and serious messages

I naively assumed the researcher would be relieved to remove my name and get on with their submission of the manuscript. I didn’t expect resistance because I hadn’t thought through the discomfort and inconvenience it would bring up for them. The message my decision to remove my name sent was not simple. I now imagine it was heard as something like: “I don’t agree with your research” and “I did a lot of work to try to collaborate with you and now I’m not going to get any credit for it.” Both aspects of this message probably led to feelings of guilt and questioning. So of course it would have made him feel better if I had just agreed to keep my name on the paper.

Such resistance to a statistician removing their name is typical in these situations, for the reasons described. It often becomes a dance at trying to come up with reasons and ways to communicate those reasons that do not overly offend — particularly if the collaboration is integral to one’s future work. I am not condoning this dance, but just acknowledging its existence and the fact that many of us have participated in it. I think it’s worth trying to understand the messages sent by declining co-authorship to help us understand the psychological and social aspects of the problem. Here’s a summary of my thoughts about the situation I described:

  • Removing my name called into question the integrity of the research in a way that voicing my disagreement, but keeping my name on the paper, would not have. The plan to remove my name forced the researchers to deal with some discomfort and at least question things a little. However, they ultimately did not see the choice of approach as a problem in the way I did — because it was accepted in their field.
  • I put in a lot of time on the project for no pay. I was to be “paid” through co-authorship, which would presumably have helped my career and “paid off” in the end for me. This is a very common situation for statisticians to find themselves in. When it works, it can benefit the researcher, the statistician, and maybe science, but the more I think about this arrangement, the more I’m convinced that it can lead to serious ethical problems. If a collaboration ends in a situation similar to what I have described, then this agreement adds an ugly layer. The statistician feels robbed and understandably has to think harder about taking their name off and may settle for keeping their name on even when they feel uncomfortable with it. Making the problem worse, the researcher feels guilt at not “paying” the statistician and applies more pressure to have them stay on the paper. If the statistician is paid for their work and then the researcher chooses not to use it, then at least the situation is less sticky (though admittedly still difficult).
  • It is always unfortunate when a graduate student is caught between a statistician’s recommendations and an advisor’s decision. In my experience, this is not rare. The graduate student may completely agree with the statistician, but ultimately the advisor is paying for the work and calling the shots for publication. I think the feelings of guilt and discomfort placed on the student are clear here — and of course they will feel better about the situation if the statistician doesn’t remove their name from the paper.

It’s not about honest disagreements in approach

It’s important to note that in this case, and the many others that would follow in my career, there was no underlying disagreement with my professional recommendations. In fact, many times there was explicit acknowledgement that the approach I was recommending did appear more appropriate than the one the researcher wanted to go with. The tension came from the incredibly strong desire to use the same methods as they had used before and that had been associated with publications in the journal — a huge push to keep doing things the way they’d been doing them because it seemed less risky to stay on the paved path (even with its many potholes). The fact that research using a method was published seems to rubber stamp the method, even when a statistician shines a spotlight on its limitations.

There are many layers to this that can be pulled back and examined, and different ways this can all play out. A common theme, however, is that the choice of methods, or the decision to oversell results, is based on maximizing the perceived probability of getting published, even if it means ignoring professional advice from statisticians relative to methods and inferences. Depending on context, this can mean doing what’s usually done (as in my example), using overly sophisticated and cool-sounding methods (following fads), choosing methods that tell a more attractive story, presenting results with misleading language meant to sell the research, etc. I know this is a strong statement, but I think most statisticians have witnessed it, at least once in their careers. The extent to which researchers realize they are doing it varies — and I think it’s best to give individuals the benefit of the doubt and continue to raise issues with the strong current that pushes them in that direction. However, when a statistician says they do not feel comfortable having their name on the paper despite major contributions to the project, it should motivate more serious conversation.

So much focus on careers – at the expense of the science

Protecting one’s career is a powerful motivation — not just for pride, but for financial security and survival. It’s not surprising that it carries an enormous amount of weight when decisions are made in research (whether we want to admit it or not). Incentive systems matter and they affect the quality of research being done by infiltrating many seemingly small decisions along the way (again, whether we want to admit it or not). As I already alluded to above, this leads to the ugly and sticky sides of the declining co-authorship dilemma.

There are potential negative and positive effects for both the statistician and other researchers when faced with the decision of whether the statistician should decline co-authorship. Individuals weigh risks to their careers in the process of making decisions.

Over the last 20 years, I have seen little evidence that leaving my name on papers I deemed questionable would have actually harmed my career (though I still believe it should have). I know of a few PhD statisticians who seem to happily accept co-authorship on any paper they are invited to, or at least have a fairly low bar for how much they need to contribute or agree with. I have seen no evidence that it hurts them — only that it helps them by growing their list of publications. Career stability and success are so dependent on “objective metrics” based on counts of publications and length of CV. Plus, researchers love collaborating with statisticians who don’t raise a fuss, so those who aren’t likely to decline co-authorship likely get more opportunities for authorship in the future.

Don’t other statisticians read the papers and raise red flags? I haven’t really seen it happening — and again there is the complication of many reasonable approaches to a problem and many differences in opinion, unless something is blatantly wrong. It’s impossible to know what went on behind the scenes and there just isn’t enough time for an external statistician to critically evaluate minor non-Statistics papers when statisticians go up for promotions or tenure. Statisticians feel they deserve credit toward their careers for work done (even if not well represented in the ultimate publication) and often publications are the only currency to make that happen. Other researchers generally appreciate working with someone who is playing the same career-incentive-system game. It feels like a win-win, and it is a win-win if the primary goal is to support careers.

I don’t see it as a win for science though. People try to survive by succeeding at their jobs and to succeed, they have to play according to the incentive systems in place (or take huge risks by refusing to play the game or even trying to change it). Unfortunately, incentive systems for scientists often do not align with (or at least promote) incentives for doing the best science we are capable of. I think there is broad, if not universal, agreement that the goal of doing science is not to promote and protect the careers of scientists. Yet, actions tell a different story. And the “statistician declining co-authorship” scenario is a concrete example that exposes a lot of layers if we are willing to try to see them.

Interactions between researchers and statisticians can demonstrate the power of individual survival (career success) over research and scientific integrity. I included this direct quote from an email to me in an earlier blog post, but will repeat it again here because it is so relevant: “I know you disagree, but I’m going to stick with my bad science power analysis for this proposal — it’s what the NIH program officer I’ve been talking with told me to do.” This is the most explicit example I have to share, but the message is not rare. Most people just dance around the issue and do it in verbal discussions rather than boldly throwing it out there in an email. The point is — researchers make career self-preservation and self-promotion decisions and they are not seeing the potential ethical issues associated with the choices (or I don’t think it would be boldly stated in an email). They are constantly weighing the risks to their careers of stepping outside scientific norms — and those with the most power in the current system are those who generally navigated those risks successfully, so the cycle continues. Unfortunately, statisticians may be the ones unintentionally recommending risky behavior (career-wise) when their professional opinions conflict with discipline-specific expectations about which methods are perceived as less risky.

To be fair, we’ll never know how the decisions ultimately play out relative to good for science and society. Again, things are not as black and white as we would like to think they are. For example, take the above email scenario. Maybe if they had gone with my recommendations they wouldn’t have received a grant and the research would have never happened — and maybe that research will end up having overall benefits to society. We don’t know. But, I hope we can agree that working within a system that seems to value careers of scientists over the quality of the science is a serious problem — for researchers, for collaborating statisticians, and for science. It is a problem with research integrity and ethics, even if in the moment it feels like a problem of survival.

Here’s another anecdote. I recently heard from a statistician who removed his name from a manuscript after his original work was replaced with misleading displays of results that were more story worthy. He followed what he felt was his professional obligation by removing his name from the manuscript and pointing out the reasons the new presentation of results was misleading. In response, he was reprimanded by a supervisor — because he did not place enough value on protecting and nurturing the career of the researcher who made the decision to go with the misleading displays (presumably to boost an early career with a tastier and more consumable story). This story sends clear messages to all involved that careers of researchers are valued over scientific integrity — and over the professional integrity of statisticians.

While I never experienced a reprimand from a supervisor, I certainly experienced more indirect, and sometimes passive aggressive, comments. The message was clearly conveyed to me that I was not being helpful in the way they wanted me to be helpful. Didn’t I understand they had to operate within the current rules of the game? Didn’t I understand how to succeed as a scientist? They usually justified decisions to ignore my recommendations under the pretense that I wasn’t embedded within their discipline and just didn’t have enough of an understanding of “how research was done” in their discipline. Within-discipline reviewers and editors hold the power over careers.

Guidelines for ethical statistical practice

The American Statistical Association’s (ASA’s) Committee on Professional Ethics created a document (approved by the ASA’s board of directors) describing the ethical obligations of a practicing statistician. The document is titled Ethical Guidelines for Statistical Practice, though I think it should be Guidelines for Ethical Statistical Practice. There are huge challenges with drafting such a document and I think overall it’s a thoughtful and useful collection of guidelines. I have found it very useful to discuss with Statistics students and have also shared it with collaborators who seem to have a hard time grasping my ethical responsibilities as a statistician.

I encourage you to read (or re-read) the whole thing, but I will include the last section here because it is most relevant to the topic of this post. It is a section not for statisticians, but directed toward those working with statisticians. I think it’s fair to take this as evidence of the widespread nature of problems like those that come up around declining co-authorship. Unfortunately, I don’t think the guidelines are widely read, acknowledged, or followed by non-statisticians, and they often don’t get enough traction when preached by statisticians themselves. Regardless, here it is — to read and to share [boldface is mine]:

Responsibilities of Employers, Including Organizations, Individuals, Attorneys, or Other Clients Employing Statistical Practitioners

Those employing any person to analyze data are implicitly relying on the profession’s reputation for objectivity. However, this creates an obligation on the part of the employer to understand and respect statisticians’ obligation of objectivity.

  1. Recognize that the ethical guidelines exist and were instituted for the protection and support of the statistician and the consumer alike. 
  2. Maintain a working environment free from intimidation, including discrimination based on personal characteristics; bullying; coercion; unwelcome physical (including sexual) contact; and other forms of harassment.
  3. Recognize that valid findings result from competent work in a moral environment. Employers, funders, or those who commission statistical analysis have an obligation to rely on the expertise and judgment of qualified statisticians for any data analysis. This obligation may be especially relevant in analyses known or anticipated to have tangible physical, financial, or psychological effects.
  4. Recognize the results of valid statistical studies cannot be guaranteed to conform to the expectations or desires of those commissioning the study or the statistical practitioner(s)
  5. Recognize it is contrary to these guidelines to report or follow only those results that conform to expectations without explicitly acknowledging competing findings and the basis for choices regarding which results to report, use, and/or cite.
  6. Recognize the inclusion of statistical practitioners as authors or acknowledgement of their contributions to projects or publications requires their explicit permission because it implies endorsement of the work.
  7. Support sound statistical analysis and expose incompetent or corrupt statistical practice. 
  8. Strive to protect the professional freedom and responsibility of statistical practitioners who comply with these guidelines.

There is nothing about a responsibility of statisticians to compromise their professional integrity to help the careers of other scientists. Period. If you are a statistician, stand up for your professional opinions. You have an ethical responsibility to do so. Something is wrong if you are feeling pressured to “sign off” on work you don’t agree with. If you are someone working with statisticians, please respect their professional opinion and their responsibility to act according to the guidelines and their own internal ethics-meter (which may differ by individual). Statisticians should be able to do their jobs without being placed in ethical dilemmas daily — or forced to compromise their ethics to maintain productive professional relationships and collaborations. If they choose not to be a co-author on your work, respect that decision — and reflect hard on why they are making that decision.

Individual differences in ethics-meters and gray area

There are some actions we can all agree are professionally unethical (like manufacturing data or knowingly presenting results that are misleading). But like it or not, there is a lot of gray area within the practice of using statistical methods to make inferences. What feels inappropriate, under-justified, or even unethical to me because of my understandings, experiences, and philosophies, may not feel the same to a colleague of mine. This is something I have spent a lot of time agonizing over and trying to come to terms with.

Things are rarely black and white, as much as we would like them to be so. I am not so sure of the superiority of my gray-area feelings over those of others that I expect others to move in my direction, but I also should not be expected to move in their direction. Fear of appearing overly critical and judgmental kept me more silent than it should have for many years. I fought to turn down the volume on my own ethics/integrity meter because I was constantly sent the message by others that mine was too sensitive.

I think (hope) I have finally given in to accepting the volume of my meter for what it is and have changed my career to adapt to it, rather than trying to adapt to pressure to help the careers of others in ways I don’t agree with. For me, this took leaving academia and an industry position — giving up a lot of financial security. I certainly feel lighter about the present and the future, but I do still have to contend with decisions I made in the past.

Compromises

Have I compromised? Yes. Is my name on papers that I would rather it not be? Yes. During my career, I often felt cornered and pressured to be on publications or do work in a particular way I didn’t necessarily agree with — for my own career and the careers of others. I worked hard to make sure I contributed positively and openly raised any concerns I had. I made sure papers reflected some intellectual work that I was proud of, even if there were still some parts that were hard for me to swallow. And, in my collaborative work, I still declined authorship on more papers than I accepted authorship on. Some of those I declined were because I honestly did not think my level of contribution was sufficient to warrant co-authorship (another layer I didn’t get to in this post). My publication record does not represent the depth or breadth of my work with other researchers. There is probably a stronger message in my anti-publication record. I am okay with this, but for those who really want to succeed as a collaborative statistician within academia or be a successful consultant (with happy clients who return often) it can present a serious problem.

While working as a traditional collaborative/consulting statistician, I constantly struggled to balance doing my job as expected within the current paradigm (conditional on assumptions and methods that I disagreed with) with staying aligned with my ethical responsibilities as a statistician. I was at times being paid to assist researchers with getting grants and publishing their work. Pushing researchers to adopt approaches beyond those expected within their discipline was pushing them to take risks with their careers. And, in retrospect, I don’t think I always landed on the right side of the line. There are a lot of decisions I’m not proud of — a lot of compromises I made in the spirit of collaboration and teamwork.

Summing up – a few take home messages

Especially for those just skimming, here are a few bulleted take-homes that made up the bulk of an earlier and very short draft.

  • Statisticians have a responsibility to not include their name on work that they do not agree with, even if they have put a lot of time into the project.
  • Statisticians, particularly those in academia whose careers depend on publications (at least for the time being), need to have a way to get credit for work with researchers who in the end choose to ignore the advice of the statistician. It is bad for science to interpret lack of co-authorship after collaboration as evidence the statistician is not an adequate collaborator or applied statistician, or that they didn’t do enough work to get credit. Beware of unknowingly valuing number of publications (and careers in general) over research integrity.
  • Researchers collaborating with a statistician have a responsibility to respect the statistician’s expertise and the statistician’s commitment to professional ethics and research integrity. It is incredibly disrespectful to simultaneously choose to ignore advice and contributions while expecting the statistician to keep co-authorship. Put yourself in the statistician’s position by imagining some analogous situation you could end up in.
  • Just because a statistician’s suggestions do not align with “the way things are done to get published” in your discipline, does not mean the ideas aren’t worth careful consideration. We need to move forward in improving our use of statistical methods by thinking more deeply about how and why we are doing it. Justifying choice of an approach simply because “that’s how we do it” is not good enough — particularly when there is someone with more expertise than you recommending something different. Just because something has been done a million times does not mean it’s a good thing to do or should be continued.

Time to stop ignoring the problem

The problem is real and if you are dealing with it, you are not alone. Just as I was finishing this post, I received a very timely email from an early career statistician asking for advice about a co-authorship dilemma. It is time to start talking openly about the issue with other scientists and support those who find themselves in difficult situations relative to co-authorship in general. As hard as it is to admit, it is an issue of research integrity and ethics.

This post began as a comment on Andrew Gelman’s blog in response to Conditioning on a statistical method as a “meta” version of conditioning on a statistical model (March 1, 2020). Not surprisingly, my plan for a brief and concise comment ended up not so brief. It feels more like a post of its own, so putting it here as well.

Andrew’s entire post:

When I do applied statistics, I follow Bayesian workflow: Construct a model, ride it hard, assess its implications, add more information, and so on. I have lots of doubt in my models, but when I’m fitting any particular model, I condition on it. The idea is we take our models seriously as that’s the best way to learn from them.

When I talk about statistical methods, though, I’m much more tentative or pluralistic: I use Bayesian inference but I’m wary of its pitfalls (for example, here, here, and here) and I’m always looking over my shoulder.

I was thinking about this because I recently heard a talk by a Bayesian fundamentalist—one of those people (in this case, a physicist) who was selling the entire Bayesian approach, all the way down to the use of Bayes factors for comparing models. OK, I don’t like Bayes factors, but the larger point is that I was a little bit put off by what seemed to be evangelism, the proffered idea that Bayes is dominant.

But then, awhile afterward, I reflected that this presenter has an attitude about statistical methods that I have about statistical models. His attitude is to take the method—Bayes, all the way thru Bayes factors—as given, and push it as far as possible. Which is what I do with models. The only difference is that my thinking is at the scale of months—learning from fitted models—and he’s thinking at the scale of decades—his entire career. I guess both perspectives are legitimate.

Andrew Gelman, March 1 2020 post on Statistical Modeling, Causal Inference, and Social Science Blog https://statmodeling.stat.columbia.edu/2020/03/01/conditioning-on-a-statistical-method-as-a-meta-version-of-conditioning-on-a-statistical-model/

My thoughts motivated by the post

I find myself thinking a lot about differences (both real and those just implied through language) among “statistical models”, “statistical methods”, and “statistical inference,” but had never explicitly thought about attitudes toward them in the way you describe. I do think it’s a constructive way to frame some fundamental issues related to lack of questioning (or over trusting) of statistical methods, and their models, in practice. The feeling you expressed toward the evangelical Bayesian basically describes how I feel on a daily basis toward anyone using statistical methods with an unexamined, and probably unjustified, degree of faith. I have tried to avoid the religion analogy, but it is all too fitting. It captures my opinion and explains a lot about why I can no longer handle being a typical statistician in practice. In my experiences, researchers in many disciplines form groups of like-minded believers with little desire to carefully examine the beliefs — that are then proselytized through teaching, dissemination, and reviewing. I have trouble reconciling this attitude with being a scientist — but I also understand the very real fear and panic that comes with turning a spotlight on limitations of accepted methods (and therefore previous results), particularly if one’s career is based on using them.

Layers

Your reflection on attitudes toward models vs. attitudes toward methods points to the importance of recognizing and naming the layers of assumptions involved in reaching statistics-based conclusions or predictions. I see particular model assumptions as the most superficial layer (e.g., Gaussian errors, constant variance, linearity, etc.) — they are tangible, easy to describe in words, we present clear strategies for attempting to justify them (although rarely effectively), and we may even explicitly say our results are conditional on them. It’s the layer of assumptions we are the most aware of – through check-list and fact-like textbook presentation. It’s the layer that gets nearly all the attention that goes toward justification of methods — that is, we naively treat justification of model assumptions as justification of a more general method. We teach “checking of assumptions” as generally limited to a model (or maybe a few models) — but it has become largely an exercise carried out on autopilot with ridiculous conclusions such as “the assumptions are met,” demonstrating the lack of thought put into the exercise or even into the understanding of the limitations of a model in general. It’s also very easy to say “results are conditional on the model” without ever trying to think through the broader implications of the conditioning.

There are problems with not taking even the simplest layer seriously enough, and the problem gets much worse when we dig into the deeper layers. I’ll call “methods” the next layer in a way that is consistent with what you communicated. In this layer, we’re not thinking about the specific model chosen, but the more general methods within which the model is used (e.g., Bayesian data analysis). A common, or even default, method often depends on discipline and need not have a strict definition or easy to attach label (like “Bayesian”). Methods are often accepted and trusted (at least within a discipline) with very little explicit acknowledgement of their assumptions or limitations. In terms of conditioning — there is little awareness or discussion of what we are conditioning on (or even that we are conditioning on many things). Methods are described by phrases like “the way we do things” or “the way I am expected to do the analysis.” I cringe at the number of times I have had researchers use such phrases with me — in the context of trying to have conversations about why they are adamant about using a particular method (even despite serious problems from a statistical perspective).

Questioning our methods

I see little questioning of “the way we do things.” Thomas Basboll recently described the distinction between methods (what we do) and methodology (why we do what we do) and I think this distinction is easy to forget and good to keep in mind. In my couple decades as a scientist and statistician, I haven’t sensed much worry or interest in digging into why we’re doing what we’re doing among practicing researchers. There seems to be an attitude that someone before us put in the hard work to decide how things should be done, and so we just need to follow through with that — and no need to even ask deeper questions like why or how. We act as if we are operating within paradigms that are fully justified, strong, and worthy of blind acceptance. This is dangerous, and wrong. Where are the attitudes of healthy skepticism that lead us to value work that questions the ways we currently do things? Questioning results conditional on the current way of doing things doesn’t go far enough, though it is definitely easier. Questioning deeper layers isn’t currently part of the “workflow,” but it should be.

They will laugh at us

In my experience, there seems to be an underlying belief among researchers in many disciplines that there are no other reasonable options beyond their view of the accepted methods. This is operating under blind acceptance of a paradigm — and so blind that it’s not even recognized as a paradigm that could be moved away from. Perhaps I shouldn’t bring in the word “paradigm” with its historical and philosophical contexts, but I’m simply using it to describe “the accepted way to do things” at this point in time. It’s so hard to envision the time when we will look back and think how naive we were to be doing the things we were doing (and not questioning them) — because we are trying to do the best we can. Why is it so hard to accept that of course we will look back at some point and realize there were serious problems with how we were doing things — that’s what happens in science and what should happen in science. Even if we don’t know exactly what we should be doing, we can know that what we’re doing is probably wrong, or at least can be dramatically improved. We can’t know what the future will look like, but we can know that we will laugh at ourselves (or other future humans will laugh at us).

If we’re going to ignore, then let’s willfully ignore

Probably the main reason I like Herbert Weisberg’s book Willful Ignorance: The Mismeasure of Uncertainty so much is that it delves into the deeper layers, and in an articulate and accessible way. He hits methods and deeper — to “statistical inference” and the decision to rely on probability in general. In order to proceed with statistical inference, there is a lot we need to willfully ignore. However, the problem in practice is that we are not willfully ignoring — we are unknowingly ignoring. There is a lack of awareness and knowledge about what must be ignored (or what we are conditioning on) to proceed with statistical methods and their models to make statistical inferences. Moving toward actual willful ignorance would be a healthy step.

Look over your shoulder

Finally — You point out that you have a lot of doubts about your models and are always looking over your shoulder relative to methods. I think you are an exception, at least in the world of practicing researchers who rely heavily on statistical methods and associated models. I haven’t seen many people looking over their shoulders, at least with anything more than an obligatory and ineffective glance. We should be looking over our shoulders so much that our necks cramp and we purchase rearview mirrors – but instead we often operate as if we are unaware of any threat, or maybe we just don’t know what a threat looks like. We have to know something about what we’re looking for to be able to recognize it, talk about it, and gain the motivation to keep looking. And looking over our shoulders, or in rear-view mirrors, does not mean we are overly paranoid. It can reflect healthy awareness of our surroundings (what we are conditioning on) — not only of the things straight in front of us, but of the things outside our usual field of vision. There’s a lot there.

This is a topic I have thought a lot about and discussed with students for years, but I have not yet tried to convey my thoughts in writing. My goal is to keep this high level and not open too many cans of worms that might derail the main point I’m trying to make. I appeal to the royal “we” throughout — just for simplicity.

Broadening our view of information

For any inference, conclusion, recommendation, etc., it’s hard to disagree with the importance of considering what information is behind it. However, we fall short in terms of the sources of information we tend to consider. If we fall short in terms of evaluating the information used, then we fall short in evaluating the inference itself. While I find this to be the case in many contexts, here I’m focusing on how it manifests through our reliance on statistical methods in science and decision making.

We are trained to think of information (at least that which we should pay attention to) as coming from collected data. Information from collected data is important, but it is not the only source of information informing our inferences and associated conclusions. Statistical inferences focus on quantifying a particular type of uncertainty given (i.e., dependent or conditional on) a large collection of assumptions. Some of these assumptions may be explicitly stated (e.g., assuming data are generated from a normal distribution) and others may be very implicit (e.g., choices in the design, other model choices, choices about what results to report, choices about how to interpret the results, etc.). Assumptions are never met and there are many researcher degrees of freedom — and we tend to pretend they are free. We are not forced to consider and justify their costs relative to inferences ultimately made.

We can easily agree that data are not free — and shouldn’t be. We pay in time and money to gather information as data to support our inferences. We should not make up data (even though made up data are easy and free) because… well, it’s unethical. It is adding information into inferences that is not justified and potentially very misleading. Glad we can agree on that.

What are assumptions? Can they potentially add unjustified and/or misleading information into our inferences? In my view, they certainly can. Should they be free of any costs?

Assumptions are called assumptions, because they are just that. We don’t proceed with an analysis under “truths” or “facts” — we proceed under human-made assumptions and other choices that we might not even label as assumptions. Just as collecting additional data has a cost, this should have a cost (just a different type of cost).

You may think I’m taking it too far to describe assumptions in the same vein as “making up data,” but I do not see them as so far apart (despite how far apart they are treated on the scale of ethics). It could do us some good to at least go through the thought exercise of considering the similarities.

Assumptions insert information into an inference — this has to be the case if an inference depends at all on what assumptions are made (which it typically does). Statistical results and the inferences associated with them are often sensitive to assumptions and other design, modeling, and interpretation choices. Sensitivity implies information (beyond that in the data) is being inserted into the process.

However, we rarely think about assumptions as information inserted into an inference beyond that coming from the data (as we would if it were fake data). I don’t think we do because we don’t have an easy way to quantify the amount of information coming from assumptions and choices vs. the information coming from collected data. There are situations in which we can, and do, assess sensitivity of results to a particular assumption, but this is restricted in terms of what assumptions can reasonably be assessed and even then I don’t see it used often. It seems to be deemed more admirable to declare what model (i.e., a set of assumptions) will be used beforehand and stick to it, regardless of how sensitive the results might be to the choice.
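To make the sensitivity idea concrete, here is a minimal sketch in Python. The data are simulated and the two modeling choices are arbitrary (nothing here comes from any project I have described): the same two-group comparison is analyzed under two different distributional assumptions and the resulting estimates are printed side by side.

```python
# A minimal sensitivity sketch: estimate a two-group difference under two
# different distributional assumptions and compare. The data are simulated
# (hypothetical) just to make the script self-contained.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical skewed outcome data for two groups
control = rng.lognormal(mean=0.0, sigma=0.8, size=30)
treated = rng.lognormal(mean=0.3, sigma=0.8, size=30)

# Assumption set 1: roughly Gaussian errors on the raw scale
#   (difference in means, Welch t-test)
raw_test = stats.ttest_ind(treated, control, equal_var=False)
raw_diff = treated.mean() - control.mean()

# Assumption set 2: roughly Gaussian errors on the log scale
#   (difference in log-means, i.e., a multiplicative effect)
log_test = stats.ttest_ind(np.log(treated), np.log(control), equal_var=False)
log_diff = np.log(treated).mean() - np.log(control).mean()

print(f"Raw scale: diff in means = {raw_diff:.2f}, p = {raw_test.pvalue:.3f}")
print(f"Log scale: diff in log-means = {log_diff:.2f} "
      f"(ratio about {np.exp(log_diff):.2f}), p = {log_test.pvalue:.3f}")
# If the qualitative conclusion changes between the two analyses, the
# inference is being driven by the modeling assumption as much as by the data.
```

The point is not which of the two analyses is right; it is that the choice between them inserts information beyond what the data alone provide.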

My point

Assumptions and choices are not free. They add information to inferences — like data that were not collected.

We have to purchase valuable information by collecting data, but then we’re allowed to shoplift additional information in the form of assumptions and other choices without any penalty. I believe we should take the hidden costs of assumptions more seriously. We should see them as added information beyond that contained in the “observed data.” We should hold each other accountable for justifying that inferences are not overly influenced by assumptions. What counts as “overly influenced”? That’s hard — but again not a reason to avoid the issue altogether.

At the very least, we know we are inserting more information into an inference than we admit to and inferences should be interpreted in light of this. Data-driven is actually data-PLUS-assumption driven — we just prefer not to dwell on the assumption part. We need to consider the possibility that in some cases the information slyly inserted through assumptions may even “swamp” that in the data.

Lessons from data vs. prior tensions?

I hesitate to go here, but then can’t find a good enough reason not to. I purposely chose the word “swamp” above because of its use in comparing the amount of information in a posterior distribution coming from the prior distribution relative to that from “the data.” In practice, it is common to hear people justify a prior by saying we don’t need to worry about it because “the data swamp the prior.” This is one setting where people seem to worry about the information coming in through assumptions for the prior part of the model (the prior), but not worry enough about the information added through assumptions coming from “the data” part of the model (often forgetting the role of the likelihood!). Note for the record — it is possible for a prior distribution to be based on a lot of prior data!
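As a toy illustration of what “the data swamp the prior” does and does not mean, here is a small conjugate Beta-Binomial sketch. The priors and datasets are invented for the example: the same two priors are updated with a small dataset and with a large one, and the resulting posteriors are compared.

```python
# A toy Beta-Binomial sketch: how much the prior matters depends on how much
# data there is. All numbers here are invented for illustration.
from scipy import stats

priors = {"flat Beta(1, 1)": (1, 1), "skeptical Beta(2, 18)": (2, 18)}
datasets = {"small data: 7 successes in 10 trials": (7, 10),
            "large data: 700 successes in 1000 trials": (700, 1000)}

for data_label, (successes, n) in datasets.items():
    print(data_label)
    for prior_label, (a, b) in priors.items():
        post = stats.beta(a + successes, b + n - successes)  # conjugate update
        print(f"  {prior_label}: posterior mean = {post.mean():.3f}, "
              f"95% interval = ({post.ppf(0.025):.3f}, {post.ppf(0.975):.3f})")
# With 10 trials the two priors give visibly different posteriors; with 1000
# trials they nearly coincide. Either way, the binomial likelihood itself is
# an assumption that is quietly inserting information.
```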

I bring this up, not because I want to debate about priors, but because I find it very ironic that people can get so incredibly worked up about the sensitivity of results to prior distributions while forgetting (or simply ignoring) the information inserted by all the other design, modeling, and summarization assumptions and choices.

Apparently, there is great worry for some about how much information comes from assumptions vs. data, but it only matters in the context of a Bayesian prior. If one can get worked up about a prior distribution, then by all means get worked up about all the other sources of information silently inserted into the process with no crisis of conscience. The fact that this prior-vs-data tension can lead some to proclaim an analysis without a prior is “objective” and an analysis with a prior is “subjective” has always seemed absurd to me — we need to consider this relative to the entire process we actually go through in practice (from idea to inference).

As usual, I don’t think the disregard is blatant. I think it stems from lack of practice, opportunity, or reward to look deeply and critically at our methods and processes. My hope is that starting to recognize assumptions and choices as added information may be a step (albeit tiny) forward. I entered this data-vs-prior realm only because it seems to me to be low hanging fruit pointing to how we can think more broadly about the degree of impact assumptions have on our inferences.

I consider the prior-vs-data conversation low hanging fruit because it’s not at all hidden and stays within our comfort zone. I’m not arguing against its importance, but I think we need to attempt to view it within a larger frame of reference. The prior-vs-data tension feels like safe territory. The prior can be explicitly stated, sensitivity of the posterior to the prior for given data (and likelihood function!) can easily be examined, there are quantities such as “effective number of parameters” we can think about, etc. It’s an example of explicitly inserting information beyond that in the observed data that appears to resonate with researchers across disciplines.

On the other hand, it is an example that also demonstrates the extent of our blind spots. The belief that inferences are based only on data (unless we are using a prior distribution in Bayesian inference) is incredibly naive. It shows our blindness and lack of willingness to consider all the design and modeling assumptions and choices. Can we use it as a tangible starting point to extend thoughts and discussions — to considering information added through things we are blind to and don’t talk about? What about all the other information we inadvertently or silently insert with little or no discussion and justification? It may be too pre-loaded of a topic to be able to start with, but maybe worth a try.

Non-parametric vs. parametric

For a quick non-Bayesian reference point, I think it’s useful to consider non-parametric vs parametric methods. It’s not uncommon to hear that parametric statistical tests are “more powerful” than non-parametric tests. Even if you’re not going to use either, it’s worth looking at the reason for this claim. Where does the increased power come from? We increase power through increasing information — typically assumed to come from decreasing variance or increasing sample size. But, given the same experimental design to be carried out, the power changes depending on what assumptions we’re willing to make.

Parametric tests involve assuming (relying on) a particular probability distribution (model). Non-parametric tests still have assumptions, but they don’t add in as much information to the analysis in the form of an assumed probability model. My point is not about power or testing — but about the fact that inserting the assumption associated with a particular probability distribution is adding information (just as increasing sample size increases precision and power through the added information). But, we seem to take for granted the fact that we added non-data information into the analysis, and thus inferences. Sure, you can “check” the assumption — but it is never “met” and therefore always has some price that should not be ignored.
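Here is a rough simulation of the kind of comparison I have in mind; the effect size, sample size, and number of simulated datasets are arbitrary choices for illustration. When the Gaussian assumption actually holds, the t-test rejects more often than the rank-based test, and that extra power is purchased with the assumed probability model.

```python
# A rough power simulation comparing a parametric test (Welch t-test) with a
# non-parametric one (Wilcoxon rank-sum / Mann-Whitney U) in a scenario where
# the Gaussian assumption holds. All settings are arbitrary illustrations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, effect, n_sims, alpha = 20, 0.7, 2000, 0.05
reject_t = reject_w = 0

for _ in range(n_sims):
    x = rng.normal(0.0, 1.0, size=n)
    y = rng.normal(effect, 1.0, size=n)
    if stats.ttest_ind(x, y, equal_var=False).pvalue < alpha:
        reject_t += 1
    if stats.mannwhitneyu(x, y, alternative="two-sided").pvalue < alpha:
        reject_w += 1

print(f"t-test power:   {reject_t / n_sims:.2f}")
print(f"rank-sum power: {reject_w / n_sims:.2f}")
# The gap between the two is the "payment" received for assuming a particular
# probability model -- information added on top of the observed data.
```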

Acknowledging “information” from mathematical statistics

If you took a course in Probability & Mathematical Statistics, then you are likely familiar with terms that include the word “information” — the information matrix, Fisher information, observed information. Information in this context is presented as quantifiable (under many assumptions that are often glossed over), and its simple mathematical nature feels clean, unobjectionable, and comfortable. The message is that the amount of “information” contained in an estimate of a model parameter comes only from the information in the data. That is, we quantify precision in an estimate mathematically seemingly based only on characteristics of observed data (e.g., standard error of an estimated mean is equal to the standard deviation estimated from the data divided by the square root of the sample size). The math is conditional on model choices (e.g., the likelihood function) and other assumptions, but this gets much less attention.
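For the record, here is the standard textbook case written out (nothing new, just making the conditioning visible): the “information” about a normal mean is defined relative to an assumed Normal likelihood, and the familiar standard error formula comes out of that assumption.

```latex
% Fisher information for the mean of an assumed Normal(mu, sigma^2) model,
% based on n independent observations y_1, ..., y_n. The calculation is
% conditional on that assumed likelihood.
I(\mu) = -\,\mathbb{E}\!\left[\frac{\partial^2}{\partial\mu^2}
  \log L(\mu, \sigma^2 \mid y_1, \ldots, y_n)\right] = \frac{n}{\sigma^2},
\qquad
\mathrm{SE}(\bar{y}) = \sqrt{I(\mu)^{-1}} = \frac{\sigma}{\sqrt{n}}
  \approx \frac{s}{\sqrt{n}}.
```

The n/σ² depends entirely on having assumed a Normal likelihood; change the assumed model and the “information” changes with it.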

The other place “information” often comes up is in the many information criteria (e.g. AIC, BIC, DIC, etc.). These can be generally thought of as measures of predictive accuracy. They include penalties for model complexity to avoid overfitting the observed data given a goal of predicting new data. There is convenient mathematical theory to back up calculation of numbers given data and model choices.

Neither of these references to information gets at the point I am trying to make — and I bring them up because I want to identify them as distracting from the goal of recognizing assumptions and choices as information, even if it doesn’t come wrapped in a tidy mathematical box.

Crucial questions and limits to what we can quantify

We can’t rely solely on mathematics and things we can quantify to have meaningful conversations among scientists about the trustworthiness of our inferences. There are so many important questions to be asking to spur needed conversation — and, I like to believe, to restore more trust in conclusions and recommendations (even if they are stated with more uncertainty).

  • What are the sources of information going into the analysis and thus conclusions?
  • What are the relative amounts of information from the different sources (even if impossible to quantify)? How do we gauge (qualitatively) the relative amount of information inserted through assumptions compared to collected data?
  • How sensitive are inferences to assumptions and choices?

Observed data are not the only source of information going into conclusions. We need to stop pretending that they are.

The second bullet above lends itself to the natural direction of attempting to quantify the relative amounts of information. Quantifying will always have its limits because much of what is included in assumptions and choices is simply not quantifiable. And if we restrict ourselves to what we can quantify (as we have largely done thus far), then we don’t really change the game. Therefore, as attractive as it is, I find myself against the idea of trying to quantify relative amounts of information and instead spending the effort on challenging ourselves to think through the problems more qualitatively and creatively. This forces a different type of scientific communication — needed, though very uncomfortable. But comfort for scientists isn’t the goal.

The alternative is to keep doing what we tend to do — collect some data, choose a model, add some assumptions, superficially “check” a few of them, make some conclusions — and then pretend as if the only information going into the conclusions is that from the collected data. A starting point is beginning to think about assumptions and decisions as added information, in the same way we think about data as information.

I wasn’t planning to write this today, but it’s something that has bothered me for years and after reading “randomization ensures balance” in two separate places this morning, I felt like discussing it.

First, let’s just think about the word “ensure” — a word that means “to guarantee” or “make certain.” We should always be wary of this word in the context of using probability. Randomization of experimental units (e.g., people, plants, groups of people, groups of plants, etc.) to treatments (or vice-versa) involves appealing to the concept of probability through using a random mechanism to assign the units to groups. At a very fundamental level, how can it make sense to use a random mechanism (one time) to ensure something for a single study?

“Balance” is a word that does have different meanings within statistical contexts. When used in the statement “randomization ensures balance,” I interpret it to mean that differences among units (e.g., characteristics of individuals) that might be related to the outcome — but are not controlled for in the design — are equally allocated across the treatment groups. That is, the groups are “balanced” in the sense that they each contain the same number (or close enough to it) of individuals with said characteristic. If groups are balanced relative to the characteristic, then we feel comfortable saying that the treatment was not confounded with that characteristic, which can help justify any causal statements about the effect of the treatment in the end. This conversation could open many other cans of worms — I am only going so far as to try to get us on the same page.

“Randomization ensures balance” is a myth (relative to a single study) that gets spread over and over and over again — and reflects a huge lack of understanding about the role of randomization in statistical (and scientific!) inferences. There are very good reasons to employ randomization in a design, but proselytizing for “ensuring balance” in a single study is very misleading and should be stopped. The statement has strong roots in statistical theory in the context of repeating the randomization process over and over again to create alternative versions of the experiment — a sort of balance on average “in the long run.” But it does not apply to a single study, as it is often appealed to do — or at least interpreted to do by those reading the words. And I think “ensure” is misleading even in the theoretical version.

Sure, randomization helps guard against researcher bias in terms of who ends up in which group, but guarding against researcher bias does not imply balance will be achieved. In fact, I argue that “ensuring balance” (at least relative to variables the researcher has knowledge of) is only possible if NOT using randomization! It is impossible to ensure if using randomization (assuming it’s being used honestly).

So, what does randomization buy us? Why go to all the trouble if we can’t ensure balance? It buys us a simple and easily justified probability model that we can use as the foundation for statistical inferences! Isn’t that exciting and amazing?!! Yeah… I get it…. that doesn’t sound even close to as attractive or useful as “ensuring balance.”

Randomly assigning units to groups (randomization) buys us justified use of the associated randomization distribution. I’m tempted to delve into a tutorial about what a randomization distribution is and why we should care, but I’m going to reel myself in and maintain focus on the “ensuring balance” idea. I’ll start a draft post titled something like “It’s been a long time since STAT 101. What’s a randomization distribution and why should I care for my research?” There are also plenty of resources out there using simulation to help convey what a randomization distribution is (as well as a p-value based on that randomization distribution). Yep — hoping the connection to coveted p-values might increase interest in the more important concept of the randomization distribution. If you don’t have a deep understanding of what a randomization distribution or a sampling distribution represents (beyond regurgitation of a textbook definition), then I question your license to use (or teach!) p-values.

Let’s look more closely at this “ensuring balance” idea by appealing to a small and simple example — though hopefully not too simple.

Suppose we have recruited 8 participants for an experiment where we would like to compare the effectiveness of two treatments — A and B (creatively named!). There is some speculation that the difference between the treatments may depend on age. [An aside: At this point, bringing the very important (but often ignored) experimental design concept of blocking into the discussion would be useful, but we’ll naively go ahead as if the researcher firmly believes that randomization will “ensure balance between the groups” and that age doesn’t need to be explicitly brought into the design.]

Let’s suppose 4 of the 8 participants are in their mid-20s and 4 are in their mid-50s — so there is a clear split between young adults and older adults (I said old-ER, not old!). Now, the researcher uses their computer (and its pseudorandom number generator) to randomly assign the 8 participants to treatments (with four in each group). Balance (relative to age) means that each group contains two of the 20-somethings and two of the 50-somethings. Does use of randomization ensure this outcome?

How many ways can the eight subjects be randomly assigned to two groups of four? “8 choose 4” or 8!/(4!4!) is 70. So, there are 70 possible allocations of individuals to the two groups. If honestly employing unrestricted randomization, then all of these are possible (and have equal probabilities of occurring). How many of the 70 possible randomizations are balanced in terms of age?

Here’s how it breaks out (you can do the math or use statistical software like R to play around yourself — see Appendix if interested). There are three scenarios capturing the relevant allocations of age classes to the treatment groups:

  1. 2 of the 70 randomizations result in all 4 of the 20-somethings in one group and all 4 of the 50-somethings in the other. I think we can all agree this doesn’t count as balanced. This is as bad as it can get. Age is completely confounded with treatment and we learn absolutely nothing about the difference between the treatments within an age class. Ugh. But we used randomization!!? The probability of this horrible design is 2/70 — almost 0.03! Small, but not that small when you consider the consequences.
  2. 32 of the 70 possible randomizations result in each treatment group having 3 from one age class and 1 from the other. This isn’t as bad as the previous situation, but it certainly isn’t great! Age class and treatment are nearly confounded, and any information about the difference between treatments within an age class relies on a single person in one of the groups. I don’t think many would label this arrangement as “balanced.”
  3. Finally, and just for completeness, the remaining 36 out of 70 possible randomizations do result in balance relative to age classes, with 2 participants from each age class in each treatment group.

Here’s the quick summary. Using randomization in the design does not guarantee that you end up in Scenario 3! In fact, only a little over 1/2 of the randomizations (36/70) land you there. Is it at all appropriate to use the word “ensure” to go along with a probability of 0.51? Really, “ensure” only goes with a probability of 1.0 — which doesn’t typically belong in the same room with something called “randomization.” Looking at the other side, nearly half of all possible randomizations result in a design that is clearly not balanced relative to age class.

If the researcher wants to “ensure balance” relative to age, then they should forgo unrestricted randomization. They can ensure balance for this variable by forgoing randomization altogether (and giving up its benefits) — OR, much better, by employing restricted randomization through blocking: randomly assigning two individuals to each treatment within each age class. Note that use of restricted randomization in the design should then be accounted for in the analysis — because it does change the randomization distribution by changing the collection of possible random assignments! The potential benefits of blocking will be saved for another post as well.

The example used a well-defined variable suspected to be related to effectiveness of the treatment and measured on all the participants. What about the characteristics that we don’t yet know might be related to the effectiveness of the treatment, or that we haven’t measured, or don’t know how to measure? Will those be automatically balanced through randomization? We can only control for so much in the design and then have to get to the point where we are willing to ignore everything else. But we end up at the same place — randomization will not magically ensure balance for the unknown, or willfully ignored, variables in a study.

At the risk of being repetitive, I feel the need to restate my point again. Randomization does not ensure balance in a single study, and we should stop saying or implying it does. To be very clear – I am not arguing against using randomization. I am arguing against selling false benefits of it.

Appendix – some R code
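
Here is a minimal sketch of the kind of enumeration I have in mind; it reproduces the counts used above (2, 32, and 36 out of 70):

  # Label the 8 participants by age class: four in their mid-20s, four in their mid-50s
  age <- c(rep("young", 4), rep("older", 4))

  # Every possible choice of which 4 of the 8 end up in treatment group A
  allocations <- combn(8, 4)    # a 4 x 70 matrix; choose(8, 4) = 70

  # For each allocation, count how many of the younger participants land in group A
  young_in_A <- apply(allocations, 2, function(idx) sum(age[idx] == "young"))

  # 0 or 4 young in A: complete confounding (2 allocations)
  # 1 or 3 young in A: nearly confounded (32 allocations)
  # 2 young in A: balanced (36 allocations)
  table(young_in_A)

  # Probability that unrestricted randomization lands on the balanced 2-2 split
  mean(young_in_A == 2)         # 36/70, about 0.51

Restricting the randomization within each age class (blocking) keeps only the 36 balanced allocations — which is exactly the point of the restricted randomization mentioned above.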

Why I don’t like hypotheses

February 7, 2020 | General | 2 Comments

In the beginning

There was a time when I believed that stating a hypothesis was a crucial part of doing science — it was engrained in my education — starting in elementary school. I think every science fair project I did sported a neat and tidy hypothesis. And then, I dutifully set out in an attempt to “prove” that hypothesis, with a very vested interest in obtaining results toward that end. I certainly didn’t want any failure to harm my chances of getting a ribbon.

Over thirty years later, my own kids were clearly expected to display such a hypothesis for their early elementary school projects too. I did get involved and try to change this later, but I clearly remember trying to convince my young kids that they really didn’t need to include a superficial hypothesis on their science fair project board! I tried to make the case that being attached to a particular outcome could lead to them accidentally affecting the results, and that the exercise emphasized getting a particular “answer” rather than investigating and learning. I won some years, but not all. The social pressure is real, even in elementary school.

I have seen many kids convey their conclusion as “my experiment failed because I didn’t prove my hypothesis.” I am not going to venture into the realm of issues with “confirmation” or “proof” here, as the point of this post is meant to be much simpler. We are in the habit of teaching and using superficial affirmative statements as hypotheses and it is not helping science. It might make us feel like we’re adhering to a systematic scientific method (and maybe convince some that we are), but it is not really supporting rigor in inductive inferences. Appealing to a theme of my blog posts — superficial hypotheses are yet another example of how we try to squeeze the process of doing science into frameworks that feel deductive — it’s comfortable to believe and act as if an answer is within reach.

This isn’t just happening in the beginning of our lives as scientists.

But not just in the beginning

I would like to say “Don’t worry, things improve after elementary school.” Unfortunately, I can’t honestly say that. It continues where the stakes are much higher than a ribbon in the school science fair (though it doesn’t take much imagination to replace “ribbon” with “publication”, “award”, or “grant”).

What I have witnessed over the years from researchers with PhDs and plenty of grant funding is not much different from what happens in elementary school science fairs. This is not meant to be a commentary about the individual scientists, but the larger system in which they operate that feeds on its bad habits and superficial expectations. The use of superficial hypotheses is a particularly glaring example of one of those habits.

Over the years as a statistician, I have often brought up my negative view of hypotheses with researchers, usually in the context of helping with or reviewing grant proposals — and this is largely met with a look I don’t think I can adequately describe. The best I can do is “uh… okay, lady” or “Seriously?!!” Or, if not in person, it’s met with responses that it is a needed (i.e., expected) component of the paper or grant proposal. I think I have gotten through to a few (mainly those early in their careers and still open to views that might challenge norms) — but I think it’s rare for it to outlast the tidal wave of superficial-hypothesis-demanding culture they are diving into.

Template wording

What do I mean by superficial hypotheses? I’ll give a quick example. I am removing the specific context because the form is more important than the specific context — and this is a wording formula I saw repeatedly, particularly in the context of grants related to human health research. My suggestions to remove or change such wording were generally ignored in favor of staying within the template to increase chances of funding or publication. I have no idea whether it actually did increase the chances — but staying within the template was considered less risky, and I have no real evidence to argue against that.

So, here’s the example (in the context of an NIH grant proposal):

Aim 1: We will determine whether providing children with intervention A affects their academic achievement. Hypothesis 1: We hypothesize that children receiving intervention A will have higher school achievement scores compared to those who do not participate in the intervention.

I could spend more time discussing potential problems with the wording of the Aim and the Hypothesis, but I will force myself to stay on topic for now! Stated in this way, the hypothesis appears to exist to justify the researchers’ vested interest in a particular outcome. And, let’s be clear, they do have a vested interest in that outcome — their career and future grant funding probably depend on it (but that is a deeper part of the problem). Why dress up a very simple prediction and call it a Hypothesis? Does it trick us into thinking that rigorous science is being done through adherence to The Scientific Method? What does it really provide over information that could be included in the Aim or stated in a question?

I lied — I have to go off topic for a moment to mention just one thing about the wording of the Aim. I strongly believe we need to be more aware of when we use the word “determine” inappropriately. And, to me, this counts as an inappropriate context because I do not consider “determine” and “investigate” to be synonyms. This may seem like a subtle difference, but in my mind it is huge — particularly when non-scientists read and internalize that wording. I have had researchers counter my requests to remove the word with “But, it’s just what we use and everyone knows what we mean by it.” I disagree — it’s misleading to those who don’t know what you mean by it.

What about statistical hypotheses?

The connection (if any) between a scientific hypothesis and a statistical hypothesis (e.g., your “null hypothesis” and your “alternative hypothesis” from intro stats) was originally going to be part of this post. It’s clear now that it needs to be a separate one and will go on the draft list. But, the simple response is — they are not the same thing! It’s possible that statistical hypotheses have contributed to the over-simplified way of stating scientific hypotheses and I guess that’s something worth thinking more about. Regardless, the fact that some researchers think statistical hypotheses are their scientific hypotheses makes me dislike hypotheses even more. Has our use of Statistics helped ruin the potentially positive aspects of forming creative scientific hypotheses? Hmmm.

Strong Inference forgotten?

I was lucky that my first semester of graduate school landed me in a research methods class focused on J.R. Platt’s 1964 article in Science titled “Strong Inference: Certain systematic methods of scientific thinking may produce much more rapid progress than others.” It is worth reading yourself, but the gist is an argument for following systematic steps in the process of carrying out inductive inferences, referencing ideas of Francis Bacon on conditional inductive trees, T.C. Chamberlin on multiple competing hypotheses, and Karl Popper on the importance of falsification.

According to Platt, Chamberlin recognized in 1897 the “we become attached to it” trouble with a single hypothesis. The ideas in Platt’s paper immediately resonated with me as a new graduate student looking forward to a career as a scientist. They fed into my then still naive and romantic notion of what graduate school and science would be like. I suspect they gave me a good push in the direction I’m still headed today.

J.R. Platt wrote the paper because he was worried that “many of us have almost forgotten” important foundations underlying the method of science (in 1964). Sadly, I think it has only gotten worse in the last half century, at least in some disciplines. There’s an interesting 2014 commentary in The Journal of Experimental Biology by Douglas Fudge called “Fifty years of J.R. Platt’s strong inference”.

Here’s one of my favorite quotes from the paper that I can’t resist sharing here:

How many of us write down our alternatives and crucial experiments every day, focusing on the exclusion of a hypothesis? We may write our scientific papers so that it looks as if we had steps 1, 2, and 3 in mind all along. But in between, we do busywork. We become “method-oriented” rather than “problem-oriented.”

Page 348 Platt (1964)

Even the creation of hypotheses is rarely more than busywork in many fields – to the point that I believe we are generally better off without the hypotheses at all (see earlier example). I think it is worse to pretend we’re following a deeper systematic, and creative, process by using wording originally associated with it than to honestly follow some different process.

Not alone in my dislike

I was genuinely excited, and even relieved, to find a similar opinion conveyed by Stuart Firestein in his book Ignorance: How it Drives Science. I had a draft of this post started and its first title was “Why I hate hypotheses” — it has a nicer ring to it, but I decided it might be too strong. Firestein goes ahead and says it.

You may have noticed that I haven’t made much use of the word hypothesis in this discussion. This might strike you as curious, especially if you know a little about science, because the hypothesis is supposed to be the starting point for all experiments.

The hypothesis is a statement of what one doesn’t know and a strategy for how one is going to find it out. I hate hypotheses. Maybe that’s just a prejudice, but I see them as imprisoning, biasing, and discriminatory. Especially in the public sphere of science, they have a way of taking on a life of their own. Scientists get behind one hypothesis or another as if they were sports teams or nationalities — or religions.

Page 77-78 Stuart Firestein (2012). Ignorance: How it Drives Science. Oxford Press

And a little more about the specific dangers:

At the personal level, for the individual scientist, I think the hypothesis can be just as useless. No, worse than useless, it is a real danger. First, there is the obvious worry about bias. Imagine you are a scientist running a laboratory, and you have a hypothesis and naturally you become dedicated to it — it is, after all, your very clever idea about how things will turn out. Like any bet, you prefer to be a winner. Do you now unconsciously favor the data that prove the hypothesis and overlook the data that don’t? Do you, ever so subtly, select one data point over another — there is always an excuse to leave an outlying data point out of the analysis (e.g., “Well, that was a bad day, nothing seemed to work,” “The instruments probably had to be recalibrated,” “Those observations were made by a new student in the lab.”). In this way, slowly but surely, the supporting data mount while the opposing data fade away. So much for objectivity.

Page 78 Stuart Firestein (2012). Ignorance: How it Drives Science. Oxford Press

What else is there?

I am not quite naive enough to believe that doing away with hypotheses will really change researchers’ vested interests in a particular outcome. But to dress up a preferred outcome as an objective scientific hypothesis feels dishonest and unethical to me (those may be strong words, but they do capture my feelings about it). At the very least, it feels like a bad case of bullshit.

What would we be losing if we just gave up the requirement to include hypotheses? In many cases, not much. If scientists are using sets of competing hypotheses in rich ways, they will still do it. They don’t have to be labeled as “hypotheses” to be a useful part of scientific thinking leading to new research. In my experiences, the problem lies in forcing superficial, affirmative statements just for the sake of going through the motions of stating a hypothesis. I picture (perhaps unfairly) the grant reviewer with their checklist laid out in front of them: “Hypothesis? Check. Power analysis? Check.”

Let’s put more energy into great research questions, their justification, and then great experimental design to follow — and label predictions as what they are — predictions (not hypotheses). While they may often look the same in practice, I do think there is a different psychology around them. A “prediction” doesn’t carry the same guise of objective scientific thinking as the word “hypothesis” does. And then maybe our future scientists will learn the difference. A success is designing an experiment to learn something and inform next steps, not to get data to support a hypothesis in one study.

Why is it so hard to get traction to even just have informal conversations with some scientists about the limitations and potential harms of becoming wedded to specific hypotheses early in a process? Why do some graduate students live in fear of their hypotheses not being supported, and thus of not being able to publish and get their degree? Or why do researchers live in similar fear relative to future grant funding and tenure? I am speaking in generalities that apply differently to different disciplines, but I assure you, from firsthand experience, that the fear is there.

Here’s a bit from Firestein about the important role of creativity:

The alternative to hypothesis-driven research is what I referred to earlier as curiosity-driven research. Although you might have thought curiosity was a good thing, the term is more commonly used in a derogatory manner, as if curiosity was too childish a thing to drive a serious research project.

Stuart Firestein (2012). Ignorance: How it Drives Science

I’m not saying it’s easy

I do want to acknowledge that letting go of statements of predictions disguised as hypotheses can be more difficult in some research contexts than others — particularly when testing efficacy of interventions or treatments. But letting go of superficial hypotheses and bringing in creativity is not irrelevant or impossible in that setting, just because it might look more difficult initially. We need to take steps toward being less vested in one obvious outcome — to not balance a career and future funding on an affirmative result for something that cannot be proven.

For example, if a lot of previous (perhaps more mechanistic or theoretical) research points to potential benefits of an intervention, but little or no evidence is found when the idea is first investigated using real humans (and assuming this conclusion is not simply based on a large-ish p-value) — then there are a lot of fun and creative questions to ask that can lead to more research! It is not necessarily a dead end, and it is not a failure. Is the instrument chosen for measurement able to get close enough to what we really wanted to measure? Can we improve it? What are the other sources of variability among individuals? Can we control for some of the sources in the future? Are we putting too much emphasis on group averages when really we care about individuals and don’t expect them all to respond similarly to the treatment? Why might some people respond positively and some negatively? And so on.

Buddy the Dinosaur

And, for those of you with kids (or those who just enjoy cartoons and science), here is a link where you can see a video of where I think my kids first learned the word hypothesis — thanks to PBS: “I have a hypothesis.”

References

Firestein, Stuart (2012). Ignorance: How it Drives Science, Oxford Press.

Fudge, Douglas (2014). Fifty years of J.R. Platt’s strong inference. The Journal of Experimental Biology, 217, pp. 1202-1204. doi:10.1242/jeb.104976

Platt, J.R. (1964). Strong Inference: Certain systematic methods of scientific thinking may produce much more rapid progress than others. Science, 146(3642), pp. 347-353.

Appreciating the anti-library

January 31, 2020 | General | 3 Comments

There is so so much to read and so little time. This week’s post will be short and to that point.

I find myself often falling into feelings of frustration that can accompany realizations of the vastly overwhelming number of articles, blogs, and books that I would love to read. And not only that I would “love” to read, but that I feel like I “should” read. I suspect most, if not all of you, feel the same.

The feeling that “I haven’t read enough yet” used to keep me from putting my own thoughts out there. But of course, the more we read, the more we realize there is to read. There will never be an “enough” and there shouldn’t be. What we do read, and what we don’t read, shape our own ideas and ways of thinking about problems. I love thinking about how the unique collection of what one person has read (and not read) influences their thinking and ideas. How boring and unproductive life would be if we could all read everything we should.

I am re-reading The Black Swan: The Impact of the Highly Improbable by Nassim Nicholas Taleb (2007 by Random House), and wanted to share his antilibrary idea here — to help you feel a little better about all you will never read. (He goes even further — to antiresumes and antischolars — but I’ll leave those alone for today).

Read books are far less valuable than unread ones. The library should contain as much of what you do not know as your financial means, mortgage rates, and the currently tight real-estate market allow you to put there. You will accumulate more knowledge and more books as you grow older, and the growing number of unread books on the shelves will look at you menacingly. Indeed, the more you know, the larger the rows of unread books. Let us call this collection of unread books an antilibrary.

Page 1. Taleb (2007). The Black Swan: The Impact of the Highly Improbable. Random House, New York.

Happy reading and anti-reading.

We need more ignorance

January 14, 2020 | General | No Comments

Books about ignorance and science

I recently read another book with Ignorance in the title: Ignorance: How it Drives Science by Stuart Firestein (2012 by Oxford Press). It says things we all need to hear and think about more often. The book requires minimal effort — it was written to be read in one or two sittings. In the author’s own words: “a couple of hours spent profitably focusing your mind on perhaps a novel way of thinking about science, and by extension other kinds of knowledge as well.” It’s a hard one not to find time for — and perfect for a seminar or discussion-based course. I wish it would have been on my radar before the last class I taught. Follow the above link to Firestein’s webpage to see his other writings on science — including Ignorance‘s sequel, Failure (on my to-read-very-soon list).

Some of you will already know of my appreciation for Herbert Weisberg‘s 2014 book Willful Ignorance: The Mismeasure of Uncertainty. The two books differ in their goals, with Weisberg’s specifically focused on probability and statistics, but they certainly hit a common theme of accepting ignorance as a necessary, and even positive, force in science.

Firestein says this about the purpose of his book: “…to describe how science progresses by the growth of ignorance, to disabuse you of the popular idea that science is entirely an accumulation of facts, to show how you can be part of the greatest adventure in the history of human civilization without slogging through dense texts and long lectures.” The ideas he is communicating are not complicated and I suspect very few disagree with them. So, why do they feel so refreshing and unique? Firestein came to science through an unusual path (theater) and his writings are another reminder of value added from unconventional paths and experiences.

Letting go of negative connotations

We tend to default to the negative connotations associated with the word ignorance. While there are certainly types of ignorance we hope to avoid, defaulting to the negative in science is part of the problems we face today. Acknowledging and recognizing ignorance is not a sign of weakness and does not imply lack of knowledge or lack of expertise. On the contrary — it takes knowledge to understand, or even just glimpse, what we don’t know. Awareness of ignorance is evidence of knowing enough to know what we don’t know — and admitting that there is a lot we don’t know.

Ignorance, all of what we don’t know, and even what we don’t know we don’t know, is the driving force of science. 

Stuart Firestein

Tension between facts and ignorance

As a statistician, I feel the raw tension between the desire for facts (and rules) and honest acknowledgement of ignorance. Looking at the ways statistical methods are often used in practice provides examples of current scientific culture valuing fact finding over understanding sources of ignorance. Enormous value has been placed on using statistical methods as if they are fact detectors — with little acknowledgement of what we willfully and un-willfully ignore to use them like instruments.

Firestein’s take is inspirational for science in general. Because ignorance can take many forms, it’s easy to wander a bit when thinking about the topic. My mind automatically wanders to ways ignorance (or a lack of awareness of ignorance) plays into the current use of Statistics in science and views of Statistics from those outside the discipline. The following thoughts are less inspirational than Firestein’s — but hopefully constructive to put out there.

Ignorance and the field of Statistics

Statistics is viewed (by those who don’t have the knowledge to know better) as a subject that can be taught in one semester (or, in some cases — like some medical school programs — just a two-week module). When exposure to the field of Statistics is extremely limited, it is very easy to leave the experience with little awareness of how small the nibble was relative to what’s out there and still being created. And, if most (or all) of a person’s education in Statistics comes from within-discipline teachers who never had enough training to see beyond that nibble themselves — then a vicious cycle is fed with a focus on the nibble. Many teachers of Statistics have not themselves been exposed enough to recognize how much they don’t know. Learning “on the job” within another discipline often suffers from this same problem.

I get it. We were all there once, for varying amounts of time, and I still vividly remember the naive feeling of “how could someone get a graduate degree in Statistics — it just doesn’t seem like there’s much to know.” After all, I had spent one semester as a graduate student (not in Statistics) using a calculator to calculate averages, standard deviations, standard errors, t-statistics, and p-values. I could reject and fail to reject with no problem, spout off the stated assumptions, and even tell a paired situation from an unpaired one. I could randomize individuals to groups — even counterbalancing to try to protect against order effects in a within-subject design. It seemed like the right amount of info and I felt quite clear about how it should be used — it was presented as the correct, and only, way to do data analysis. Almost 25 years later, things haven’t changed that much in a lot of disciplines, though calculators have been replaced by statistical software on computers.

Questions and gradual realizations

Luckily, my extreme lack of ignorance (in the constructive sense of knowing what I didn’t know) was accompanied by a love of thinking about research methods and philosophy of science. As I started on non-Statistics PhD work, discussions about “statistical significance” and the use of statistics in general started to nag at me. I didn’t have the knowledge to put satisfying words to it — beyond feeling like how we were doing science was too methods-and-stats-results-focused, at the expense of more question-focused investigation. The nagging got strong enough that I left the PhD program and applied to master’s programs in Statistics. At that point, I was aware there was a lot I didn’t know, but I didn’t have enough knowledge to even know how to describe it.

Even when I started the master’s program in Statistics, I was far from constructively ignorant. But, even by the end of the first week, I started to feel the weight of newfound ignorance and to be able to put words to it. I would like to say I was enlightened enough to recognize it as a positive sign, but that is not true. I was still fact-focused and every new concept introduced just let in more discomfort about all I did not know. Eventually, I was forced to accept the feeling of ignorance and its positive side, as well as its connection to knowledge. I am now thankful that the memories of that transition are still accessible to me.

Enough to be dangerous?

The phrase “knowing just enough to be dangerous” is often used in the context of Statistics and with good reason. The phrase describes a person thinking they know enough to know what they don’t know, when in fact they do not. They are not aware of their ignorances, and therefore can be dangerous in their over-confident use of Statistics. This isn’t to say it doesn’t happen in other disciplines, but Statistics is in a unique situation because it is relied on across so many scientific disciplines.

It is not productive to place fault on the individuals themselves — as they are just growing up in a system that nurtures it. I experienced it myself. Traditions and seemingly working systems are hard to break. And, ignorance is bliss. Maybe placing more value on recognizing the importance of constructive ignorance in science will help. Maybe it will seep into the view of Statistics held by scientists in other disciplines — or at least get them to question how much they really know and what exists beyond that.

Judging knowledge based on admissions of ignorance

To be honest, I make quick assessments (yes, judgements) of a scientist’s general knowledge of statistical inference based largely on the level of confidence they convey — with extent of knowledge being inversely related to level of confidence conveyed. High levels of confidence are often accompanied by lack of awareness of ignorance. Stated more simply, when meeting with researchers for the first time, I gauge their level of general Statistics knowledge by their attitude around what they state they do and don’t know. Often, those who come in touting their ability to “run their own stats” and who say they don’t really need a statistician end up having the least depth to their knowledge. It is not uncommon for researchers to say or imply “I’m a statistician too,” as if my years focused on studying Statistics didn’t really add anything beyond what I would have gained getting a PhD in another discipline. On my more gracious days I don’t take it as a sign of disrespect to the discipline, but instead as a sign they haven’t had the opportunity to gain enough knowledge to have awareness of what I could have possibly studied in those years. I understand, by appealing to the memories of when I was there myself. But, by focusing on how much they know, they are communicating loud and clear about how much they don’t.

Technical skills don’t imply deep knowledge

Being adept at “running” a particular analysis or fitting a particular type of model using a computer is the technician part of being a statistician, not the scientist part of being a statistician. I often say — I wish I was labeled as an -ologist instead of an -ician. It’s impossible to say how much our “statistician” label affects opinions of what the discipline is all about, but the -ician certainly doesn’t help. Even the statistic- part of the name is an issue. But, that’s another post sitting in the list of drafts.

The point I think deserves attention is this: Carrying out the technician-like tasks does not imply you have yet come up against the hard questions and challenges of inference or that you have thought deeply about the underlying theory and foundations of the tasks you are carrying out.

It can be helpful to make an analogy with other disciplines. One can be a field technician for ecological research, or a lab technician for psychological research, without a deep understanding of the history, theory, and sources of ignorance plaguing, and driving, the field. I may be naive about how those in other disciplines feel, but I don’t think people with PhDs in other disciplines come face-to-face with this issue in the way that statisticians do (or at least not as often). I have done a lot of work with ecologists and psychologists, but I would never call myself an ecologist or a psychologist — and particularly not to a colleague with a PhD in the discipline!

It’s a curious phenomenon that I think mainly stems from a view that statisticians are mainly just technicians with skills in calculation and computation — rather than scientists within a discipline of their own. At times I have taken this as simply an annoyance or frustration, but given the huge reliance on statistical inference in and for science, I am convinced this is a huge part of the problems we’re facing in science. And, I don’t think the “data science” craze is helping the situation.

Productive crisis in graduate school

If we talked more about positive, constructive ignorance, it might help ease the pain of a fundamental crisis graduate students often go through when they really bump up against it for the first time. This is the “Oh no! I’m going to have my master’s or PhD soon and now I feel like I know nothing compared to all there is to know. The more I learn, the less comfortable I am with my degree of knowledge!” Understandably, this leads to feelings of serious frustration and even failure. If we carry around the vision of collecting facts from a bucket with a bottom, it can be rough to realize the bucket is actually bottomless — and we will never arrive at the calm place of “knowing enough” we had hoped to achieve in graduate school.

To me, this crisis (in its many forms) is a sign students are ready to move on and a source of pride. One of my favorite parts of teaching and advising was trying to help students turn the fear and frustration into understanding — which means accepting the associated discomfort. It can be quite discombobulating and downright depressing to realize how little all the people we have trusted actually know. We’re all in this together. I wish I would have done more. A seminar like the one Firestein developed that motivated Ignorance could go a long way — especially if it included students from different disciplines.

Social dilemmas of ignorance

The social dilemmas created around ignorance are far from simple — and I want to touch on that just a little more here. As described above, conveying lack of ignorance to a person who has more knowledge than you on a topic can instead be taken as a sign of lack of knowledge. If you really want to demonstrate to someone that you know a lot about a subject, it may be best to start by acknowledging that you recognize there is so much you don’t know. The problem is … this will only gain you respect with those who have enough knowledge themselves to recognize and appreciate it. Otherwise, your admission of ignorance may backfire, particularly if you are supposed to be the one with more expertise about a topic and you are being trusted for the knowledge you are supposed to have. Imagine your doctor walking in and professing ignorance about your condition before giving you a diagnosis and recommendations? The social dance between knowledge and ignorance is complex. And this has made it hard to embrace and acknowledge the importance of ignorance.

Discomfort and curiosity

The realization that with knowledge comes uncomfortable (though exciting!) feelings of ignorance should be a prerequisite for obtaining a graduate degree. That should be the point. Instead, at least in Statistics, we continue to try to cram more and more “facts” and skills into each student in a very short amount of time — as if we have achieved some final state of knowledge already (a bottom on the bucket). Associating the discomfort of ignorance with knowledge and encouraging curiosity about how ignorance and knowledge are related could go a long way toward improving science, and the use of Statistics in science.

Appendix

Here are eight more quotes from Ignorance that I wanted to type up for future reference — and figured I might as well share to hopefully further pique your interest.

We may look at these quaint ideas smugly now, but is there any reason, really, to think that our modern science may not suffer from similar blunders? In fact, the more successful the fact, the more worrisome it may be. Really successful facts have a tendency to become impregnable to revision.

Page 23-24 Stuart Firestein (2012). Ignorance: How it Drives Science. Oxford Press

So it’s not so much that there are limits to our knowledge, more critically there may be limits to our ignorance. Can we investigate these limits? Can ignorance itself become a subject for investigation? Can we construct an epistemology of ignorance like we have one for knowledge? Robert Proctor, a historian of science at Stanford University, and perhaps best known as an implacable foe of the tobacco industry’s misinformation campaigns, has coined the word agnotology as the study of ignorance. We can investigate ignorance with the same rigor as philosophers and historians have been investigating knowledge.

Page 30 Stuart Firestein (2012). Ignorance: How it Drives Science. Oxford Press

So how should our scientific goals be set? By thinking about ignorance and how to make it grow, not shrink — in other words, by moving the horizon.

Page 50 Stuart Firestein (2012). Ignorance: How it Drives Science. Oxford Press

There is also a certain conclusive, but wrong, notion that comes from an explicit number. In a peculiar way it is an ending, not a beginning. A recipe to finish, not to continue.

Page 54 Stuart Firestein (2012). Ignorance: How it Drives Science. Oxford Press

Big discoveries are covered in the press, show up on the University’s home page, garner awards, help get grants, and make the case for promotions and tenure. But it’s wrong. Great scientists, the pioneers that we admire, are not concerned with results but with next questions.

Page 57 Stuart Firestein (2012). Ignorance: How it Drives Science. Oxford Press

We often use the word ignorance to denote a primitive or foolish set of beliefs. In fact, I would say that “explanation” is often primitive or foolish, and the recognition of ignorance is the beginning of scientific discourse.

Page 167 Stuart Firestein (2012). Ignorance: How it Drives Science. Oxford Press

Getting comfortable with ignorance is how a student becomes a scientist. How unfortunate that this transition is not available to the public at large, who are then left with the textbook view of science. While scientists use ignorance, consciously or unconsciously, in their daily activity, thinking about science from the perspective of ignorance can have an impact beyond the laboratory as well.

Page 167 Stuart Firestein (2012). Ignorance: How it Drives Science. Oxford Press

Today, however, we find ourselves in a situation where science is as inaccessible to the public as if it were written in classical Latin. The citizenry is largely cut off from the primary activity of science and at best gets secondhand translations from an interposed media. Remarkable new findings are trumpeted in the press, but how they came about, what they may mean beyond a cure or new recreational technology, is rarely part of the story. The result is that the public rightly sees science as a huge fact book, an insurmountable mountain of information recorded in a virtually secret language.

Page 171 Stuart Firestein (2012). Ignorance: How it Drives Science. Oxford Press

Fact detector? It is not.

January 2, 2020 | General | 5 Comments

I continue to search for more effective, and simpler, ways to convey my views on misunderstandings and mis-uses of Statistics to others — including scientists. At the heart of my discomfort are not the limitations of statistical inference (I still find it fascinating and useful), but that we use it as if it provides something it was never intended to provide, and simply can’t provide. I have said in the past that it is not an “answer finder” and should not be used as one — but I don’t have a lot of evidence from facial expressions and subsequent work of students and researchers that the idea hits home. I want to try again here, with a little different wording spurred from reading Ignorance: How it Drives Science by Stuart Firestein (2012 Oxford Press) – which is a quick read and a great way for anyone interested in science to start off 2020! My opinions come mainly from my experiences in science over the last 25 or so years, and I fully recognize that they do not apply broadly across all scientists or scientific disciplines.

Let’s assume that most people see science as the process of collecting more and more facts (where facts are taken as evidence of knowledge). I find this a realistic assumption because of how science is typically presented, taught, and discussed. I wholeheartedly believe it is more about understanding ignorance than collecting facts, but even then I have found myself accidentally reinforcing this view with my own kids at times. Firestein’s excellent discussion of the role of ignorance in science seems, at least to me, to often imply that scientists themselves understand the complexities and creative nature of the scientific process, but just haven’t done a good job conveying that to others. Based on my experiences, I think this is true to some degree, but I also see too much emphasis from scientists on fact-finding and fact-reporting and adherence to expectations for this in dissemination of work. This is where my views on use of Statistics come in. I see statistical methods often used in a way that reinforces, and even further contributes to, a fact-centered way of operating in science. The common (and wrong) explanation for what some statistical methods provide is that of a litmus test for whether observed effects “are real or not.” What does “real” mean? I can’t help but interpret it as a reflection of the view of Statistics as a convenient fact-finding machine.

Take this quote from Firestein’s book:

How do scientists even know for sure when they know something? When is something known to their satisfaction? When is the fact final?

Page 21, Stuart Firestein, Ignorance: How it Drives Science

Firestein writes to help non-scientists or future scientists understand the reality of the scientific process, but it is worth thinking more about how current scientists can benefit from reflecting on these ideas. And, of course, I want to think about the role of common uses of statistical methods in the process. (Note – I purposefully choose between the phrases statistical methods and statistical inference, because I believe statistical methods can be (and are) used with a disregard for the inference part of the process).

I think the three questions posed by Firestein are uncomfortable to scientists, and they should be because they are actually impossible to answer. But, we’ve created a scientific culture and incentive system that expects scientists to pretend as if they have contributed a new fact and that it is close enough to “real” or “final” to be published in a scientific journal. In this system, who is to judge this? How much space in a presentation or paper should be dedicated to this onerous task? This is why and where I believe we have come to rely on statistical methods to provide a cheap shortcut — a shortcut they aren’t designed for.

Statistical methods are not fact detectors.

Counter to foundations based on concepts like variation, uncertainty and probability — statistical methods have been dressed up as meters to help scientists pretend to answer the impossible-to-answer questions. Instead of leaving them to struggle with the questions or admit they are unanswerable (or the wrong questions to begin with), much effort goes into creation of cheap tests and criteria as a shortcut to the wrong destination.

Statistical inferences are, and should be, complex and uncomfortable, not simple and comforting. Statistical inference is about inferring based on combining data and probability models, not about judging whether an experimental or observational result should be taken as a fact. There is no “determining” and no “answers” and no distinguishing “real” from “not real” — even though this language is common in scientific reporting. It is not helping science to keep pretending as if Statistics can detect facts.

You may think I am exaggerating the problem, but I encourage you to read reports of research with this in mind and judge for yourself. How common is it for scientists and other reporters of scientific work to talk and write about statistical methods as if they are fact detectors? The problem gets more complicated if we start to consider whether the authors actually believe statistical methods are capable of detecting facts, or if they are just following conventions and expectations for their own survival. This complication is beyond this post, because either way, awareness that it is a problem has to be the starting point.

What can we do? Well, here’s an easy question to ask when reviewing your work or the work of others: Are statistical methods being used or presented as fact detectors? If the answer at all leans toward ‘yes,’ then it’s time to back up and think about the shortcut being taken, as well as the presumed destination. What can be added to more honestly acknowledge uncertainty and assumptions in any reported inferences? Let us try hard to avoid using statistical inference as if it presents a shortcut to facts.

Defining replication

December 20, 2019 | General | 1 Comment

Replication is a word that has long been important in conveying concepts fundamental to experimental design and statistical inference in general. It is one of the first ideas students of Statistics must wrap their heads around in terms of designing, analyzing, and interpreting results from an experiment. First exposure to the word in Statistics is usually as a within-study idea, rather than between-study. Within the larger scientific community, the between-study version of the concept has been receiving a growing amount of attention — even in the form of dramatic terms such as the “replication crisis” or “the crisis of replicability.” The word “reproducible” is also used and the similarities and differences in meaning are deserving of more attention, but I restrict this discussion to the concept of replication and wording around it. Warning: this post contains a lot of scare quotes, but I find them necessary to adequately get my point across.

We can gain some clarity, or at least perspective, by revisiting the within-study context first. Replication is a word representing a concept that seems simple at first glance and leads to clean mathematics, but it is not as simple as it seems — surprise, surprise. The Statistical Sleuth (Ramsey & Schafer, 2013) broadly defines it on page 704 with the sentence: “Replication means conducting copies of a basic study pattern.” I like this very general description, for what it says and what it does not try to say. It immediately points to the problem that comes up in reality — the reliance on “copies.” We proceed by assuming we have (or can conduct) “copies” when we know this is not actually doable. We end up with things that are not exactly copies, but close enough that we are willing to ignore their differences. That is, any explainable differences are willfully ignored because they are not deemed to present a big enough problem that we should discard the “copies” version of the model we hope to use for inference. There is always a continuum — it might be very easy to willfully ignore differences between widgets or petri-dishes, but very hard to willfully ignore differences among humans (or at least it should be). The copies assumption then justifies attributing differences in the measurements taken on the “replicates” to “pure error.” This idea of “pure error” opens another large can of worms — but for the sake of this discussion it’s important to see that replication is the strategy for obtaining units (used generally) that are considered copies in the sense that we can happily ignore other explanations of differences in measurements from those units.

Now, let’s move closer to the context we’re used to hearing about today. The between-study context frequently referred to in discussions today is still adequately described by the same quote from The Statistical Sleuth: “Replication means conducting copies of the basic study pattern.” In this case “copies” refers to whole studies or experiments carried out in the same (or similar) way to investigate the same question, rather than smaller units within a single experiment. The degree to which one study can be considered a copy of the other also exists on a continuum. In some cases, it may be the same researcher, the same lab, the same protocol, etc., and in some cases it may be an experiment carried out by different people with differences in the design, but with the goal of investigating the same question or estimating the magnitude of the same effect. The idea that any experiment is an exact copy of another is clearly false — so again, we must proceed as if any differences are irrelevant enough that we can ignore them for the sake of making inferences. So, the idea of replication is defined both within an experiment and across experiments. But what, actually, is the word “replication” referring to, and is it consistent with how it’s being used across science today?

Replication is the act of trying to create a copy of (or repeating) the “basic study pattern.” This says absolutely nothing about the results of the two studies and how similar they might be. A study design is replicated (at least to some degree) if it is copied to the extent that others agree any differences can be ignored (or accounted for in another way). The idea behind replication in statistical inference is to use it to quantify variability in some measured quantity that we can’t (or don’t want to) explain away. If, in a single experiment, measurements taken on multiple experimental units are farther apart than expected, we do not necessarily take that as a failure in the experimental design. We instead, assuming no errors are identified or other explanations arise, take that information and quantify the variation to represent some level of “background variability” (used as a basis for relative comparisons among units that are not copies, but instead differ by characteristics that are of interest). It is worth thinking for a few minutes about how this differs from many discussions around “replication” today.

What is happening in the evolution of the word “replication” in science, and why? It is used as if it represents a dichotomy (and yes, it is another false one) — a study is either replicated or not. Well, according to the information in the first part of this post, if the follow-up study can be considered an adequate copy of the first then it was [successfully] replicated (under the limitations of “copies”). People may disagree on the quality of the “copy” and there may be argument about whether a study should be used as if it represents a “replication” — and that is all good. This, like a lot of statistical inference, is carried out in a world of as-if’s — even if it seems hard or uncomfortable to constantly remind ourselves of this. Note that nothing I have said here has anything to do with the results of the two (or more) studies!

Today, in many conversations I see “replication” being used as if it is a property of the results of the studies, and not the designs and methods of conducting the studies. Phrases include “the study replicated” or “the study was (or was not) successfully replicated.” This binary results-focused view and its associated language are causing oversimplification, confusion, and other misunderstandings. In the new language, “replicated” is typically used to mean that results from a study carried out as a “copy” of an original study “match” those from the original study. But, this definition brings in a whole other layer of problems because it not only requires assessing whether the studies should be treated as copies, but also requires assessing whether results “match” according to some criterion or set of criteria. This is not an easy problem and clearly depends on the set of criteria used to categorize results as a match (or not). Going into the nuances of this problem is beyond the scope of this single post, but I hope it is clear how complicated the situation becomes. Instead, I hope we can acknowledge the challenges in the definition implied by the language we use. Let’s back up, pay more attention to implications, and be clear about what is being assumed and what needs to be justified. We should separate the two ideas in our language.

(1) Assess the degree to which the studies are copies. Assessing whether one study replicated a previous study (or was a successful replication of a previous study) should focus on assessing the claim that it can/should be considered a “copy” of the first. From that point, one can then figure out what to do with the results from two (assumed) copies of the same experiment. There is already a lot to think about here.

(2) Assess the degree to which the results of two (or more) studies are consistent (consistency of results). Perhaps I should come up with a catchier phrase, but this is what I have for now. It may make sense to do this even when one can argue that the differences between two studies are such that one should not be considered a replication of the other. And, consistency is not a yes or no answer. It is way oversimplified and naive to think that the results from two studies investigating the same question are either consistent with each other or not consistent. Let’s try not to fall into this false dichotomy and instead do the hard work of explaining, in a continuous way, how the results might agree and differ (at the same time) — and of being transparent about the criteria and methods we use for doing so (a small sketch of what that might look like follows).
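Here is one rough illustration of a continuous description of consistency — my own sketch in Python, using invented summary numbers (the estimates and standard errors are hypothetical, not from any real pair of studies): compare the two studies’ effect estimates directly, with uncertainty attached, instead of issuing a replicated/not-replicated verdict.

```python
# Hypothetical summaries from two studies estimating the same effect
est_original, se_original   = 0.42, 0.15
est_follow_up, se_follow_up = 0.25, 0.12

# Describe how much the estimates differ, with uncertainty, rather than a yes/no verdict
diff = est_original - est_follow_up
se_diff = (se_original**2 + se_follow_up**2) ** 0.5
ci_low, ci_high = diff - 1.96 * se_diff, diff + 1.96 * se_diff

print(f"between-study difference in estimates: {diff:.2f} "
      f"(approx. 95% interval {ci_low:.2f} to {ci_high:.2f})")
```

Even a summary this simple rests on criteria and assumptions that have to be stated and justified — which is exactly the point.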

Just for the record, I completely disagree with using an assessment of whether a p-value is < 0.05 in two studies as the criterion for labeling a study as successfully replicated or not. I disagree on multiple levels — with the implied definition of “successfully replicated,” with the lack of emphasis on comparing the designs and analyses, with using a single criterion to assess a match, and with using the p-value as that criterion. There are likely other layers that I also disagree with that I’m just not thinking of at the moment. It is time to stop oversimplifying and falling prey to the same things that we tend to blame this “replication crisis” on — overuse of false dichotomies, unwillingness to acknowledge uncertainty and work on continuums, and in general, taking short-cuts.
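A tiny numerical illustration of one of those levels — again my own, with invented numbers, not results from any actual studies: two studies can produce nearly identical effect estimates and still land on opposite sides of the 0.05 line, so the dichotomous criterion calls the second one a “failed replication” even though the results are about as consistent as one could reasonably hope for.

```python
import math

def two_sided_p(estimate, se):
    """Approximate two-sided p-value for H0: effect = 0, assuming a normal sampling distribution."""
    z = estimate / se
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical studies with essentially the same estimated effect
p_original  = two_sided_p(estimate=0.40, se=0.18)   # roughly 0.03 -> "significant"
p_follow_up = two_sided_p(estimate=0.38, se=0.21)   # roughly 0.07 -> "not significant"

print(f"original study p = {p_original:.3f}, follow-up study p = {p_follow_up:.3f}")
# By the "p < 0.05 in both studies" criterion the follow-up "failed to replicate,"
# even though 0.40 and 0.38 are nearly the same estimate.
```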

I end with a quote from RA Fisher’s 1944 Statistical Methods for Research Workers — to remind us that the act of replicating (within and between studies) is and was fundamental to statistical inference and to the science that uses it. Let’s remember this historical context and build upon it, rather than ignoring it or replacing it with dramatic-sounding, but oversimplified, new language.

The idea of a population is to be applied not only to living, or even to material, individuals. If an observation, such as a simple measurement, be repeated indefinitely, the aggregate of the results is a population of measurements. Such populations are the particular field of study of the Theory of Errors, one of the oldest and most fruitful lines of statistical investigation. Just as a single observation may be regarded as an individual, and its repetition as generating a population, so the entire results of an extensive experiment may be regarded as but one of a population of such experiments. The salutary habit of repeating important experiments, or of carrying out the original observations in replicate, shows a tacit appreciation of the fact that the object of our study is not the individual result, but the population of possibilities of which we do our best to make our experiments representative. The calculation of means and standard errors shows a deliberate attempt to learn something about that population.

RA Fisher (1944). Statistical Methods for Research Workers, 9th edition reprinted by University of Michigan Libraries Collection, pages 2-3. BOLDFACE added by me for emphasis.

Trust is complicated

December 17, 2019 | General

I’ve been thinking a lot about trust again lately — in science, and in life more generally. As I gather more experiences, I find that trust becomes more complicated. It’s complex in personal relationships with other humans, and complex within the ways we do science (which inherently involve other humans).

I often find myself returning to the concept of trust when I try to dig into why we have come to rely on certain ways of using statistical methods and results — especially those that seem to have obvious flaws that have been publicly identified (by me and many others). My thoughts generally settle on the feeling that people are too willing to trust things they don’t understand. This is counter to a phrase we often hear — that people don’t trust things they don’t understand. That version makes more logical sense to me, but it isn’t what I have experienced as a statistician in the context of doing science.

I have also come to realize that pointing a finger at issues with trust, or even lack of trust, is far from straightforward. Our ability, and desire, to trust is both a good and bad thing for science — and trying to analyze it seems to lead to contradictions and further challenges. With that in mind, I use this post just to share my first layer of thoughts on the subject. Like so many of the topics I have been writing about, degree of trust exists on a continuum from no trust to complete trust — and in this post I recognize I walk dangerously close to falling into another one of our false dichotomies (trust or no trust). I will try to be careful.

I strongly believe we should all try to be more in tune with when and why we exhibit a large (or small) degree of trust without questioning, or even when we definitely know better. And, when and why do we make what seem to be reasonable judgements of trust? My current opinion is that we tend to trust too much, particularly in the process of doing science, despite scientists holding on to a reputation of healthy skepticism. For most (not all) of the scientists I have worked with over the years, I don’t believe they deserve the healthy-skeptic label, at least not across all aspects of their work. I don’t mean this in a disrespectful way — I am just being honest and acknowledging the impact of our humanness in the process of doing science.

The act of deciding to trust something (or simply not thinking about how much it should be trusted before trusting it to a large degree) is complicated. It depends on so many ancillary aspects of life that may have little, if anything, to do with a logical and rational approach to assessing the degree of trust warranted for a situation. For example, it depends on our mood, on who’s telling us to trust (or not to trust), and on the implications of the decision in terms of making our lives easier or harder. In short, I believe our decisions about how much to trust something are largely self-serving — will it serve us better to place a lot of trust in it, or to place little trust in it? The answer to that question seems to depend heavily on our current situation, and less on a rigorous practice of digging into details and critically evaluating evidence. The more rigorous alternative is simply too onerous a task for an individual to undertake as often as would be needed. Because of this, judgements about trust are often not individualized tasks. Systems are established over time as largely social constructs that seem to give individuals a free pass on rigorously assessing trust — allowing decisions to be made quickly and with little pain. The large degree of trust I observe scientists giving to statistical methods and their results seems to come from such a system.

Heavily relying on statistical inferences for research and decision making requires a substantial amount of trust. Over-relying on simple statistical summary measures without a deep understanding of the information they contain (e.g., p-values, confidence intervals, posterior intervals, effect sizes, etc.) requires an enormous amount of trust. There are so many layers of trust that have to be in place to even get to the obvious layer of assessing trust in the presented results (hopefully described as inferences and not facts). When we start to peel back the layers, it gets overwhelming quickly, and I admit this lands me feeling uncomfortable and facing my low tolerance for bullshit.

For this post, I focus on the layer of researchers trusting that they should use particular methods and state their results in a particular way because that is how it has been done by those before them — a very social construct, as I previously mentioned. In this layer, my observation is that healthy questioning and some level of mistrust have largely disappeared in many disciplines. Ways of carrying out a study using statistical methods are presented and treated as if they were facts about the way science should be done. And I don’t think most who use them even know how those methods came to be. There seems to be a trust that some great committee of knowledgeable people got together and made the hard decisions by deciding the “best” way to collect, analyze, and report on data. The degree of trust I see is consistent with this view — but this view is false. This is trusting that we should just trust — and I believe this mentality underlies the use of statistical methods in practice and contributes to (or is at least related to) many of the problems we see and hear about related to their use.

Trusting is far more comfortable and easy. Struggling with lack of trust is hard and difficult work that rarely lets up (if it is honest lack of trust, and not simply superficial trusting of those who tell you not to trust something). I believe this is the reason we develop social systems around trust, but they can backfire in major ways. This backfiring is what I see in the use of Statistics. Instead of encouraging healthy skepticism, people seem to love to trust methods they don’t really understand, or even to trust that they understand methods when in fact they do not (the knowledge needed to carry out a method is not the same as the knowledge needed to understand its limitations). This approach keeps things from becoming too overwhelming.

Take machine learning algorithms — which are all the rage. Many (most?) people using these algorithms do not have a deep understanding of how they are turning information from data into inferences or decisions — yet they can easily apply them to data in their research. Many (most?) of the assumptions are hidden from view. Their black-boxiness may be uncomfortable to some, but mainly it appears as an invitation to not have to figure out what’s in the box. It’s an invitation to completely trust the algorithms; to trust those using the algorithms; and to trust those who developed and programmed the algorithms (who may be detached from how they are actually being used). As you may have guessed by now, I strongly believe there is too much trust in these algorithms and far too little critical evaluation and discussion of assumptions. Some are trying to raise awareness and start conversations around the attached ethical issues (e.g., Cathy O’Neil’s book Weapons of Math Destruction), but continuing to trust is so much easier.

Thinking about trust in machine learning algorithms is low-hanging fruit in this context. What about the magnitude of trust given to p-values, confidence intervals, effect sizes, linear models, etc.? In my experience as a scientist, there is widespread trust in the idea that much of science should be based on statistical methods, though it is rare that we study why and how the belief in this paradigm came to be. It is far easier to continue to trust in the trust that is already established than to start to distrust and have to figure out what new path to take.

At the risk of repeating myself (again), it is easy to trust statistical methods and automatic ways of drawing conclusions and decisions from their results without adequately justifying the assumptions or deeply understanding the limitations of the methods. It’s easy for researchers, it’s easy for those reading the research, and it’s easy for those making downstream decisions.

The efforts I spent as a collaborative statistician and teacher trying to help others form a more realistic view of the limitations of Statistics were usually wasted. People generally seemed to agree with what I was saying, but in the end clearly wanted to just trust what had been done by others before them and follow suit. This led to many hours of reflection on my part as to why it seemed so difficult to get researchers to be more skeptical and to have less trust in approaches that hyperextend the natural limits of statistical inference. All the reflection did not lead me to a satisfactory, settled position. It did lead me to acknowledge the deep connection between trust and ease of living or feelings of comfort. Healthy distrust through questioning and critical thinking is uncomfortable and really hard work. And, it is not really possible without a deep understanding of that which you are questioning.

Because I have dedicated my professional life to statistical inference, I of course feel strongly about this topic and have the knowledge and experience to immediately push back against trust without doing a lot of extra work. I had hoped to be able to pass this information on in a productive way to keep others from having to do that work for themselves. But, ironically, that means they must trust my view and the information I’m providing them over the information provided by their previous teachers, their research mentors, grant reviewers, peer reviewers, funding agencies, etc. It is not a question of trusting me or not trusting me — it is a question of weighing whether they should trust me or the rest of the system within which they live and must count on for survival. Framed in this way, I don’t take it so personally — but it isn’t any less frustrating.

I catch myself trusting when I shouldn’t every day. Not because I’m not a skeptical person and not because I don’t have the skills to be able to question and critically evaluate, but because I have limited time and energy each day and must proceed with life and work. For example, I tend to trust doctors’ advice more than I know I should — because it’s generally comforting and easier than not doing so. However, I have made a point of trying to be aware of when I am doing this, and I think that is important. It has helped me triage my trust scenarios and put effort into second opinions and reading research when the stakes are higher. I also continue to make my fair share of mistakes in trusting other humans.

In summary, I strongly believe science and decision making can be improved if we start the research process from a place of healthy lack of trust — and then build up trust in our methods as we go. Instead of starting by accepting all the assumptions we need to make and then going back and half-heartedly “checking” them, let’s start from a place of questioning our trust in the assumptions and having to convince ourselves that it’s reasonable before proceeding and before trusting anything that comes from using the methods. Starting from a mindful lack of trust should be an integral part of statistical inference and science — despite the discomfort and difficulty it can add to our lives.