Wary of a "Consensus-based transparency checklist"


A large collection of authors describes a "consensus-based transparency checklist" in the Dec 2, 2019 Comment in Nature Human Behaviour. I suspect the article and checklist will receive a generally positive reception, so I think there is probably room to share some skeptical thoughts here. I have mixed emotions about it — the positive aspects are easy to see, but I also have a wary feeling that is harder to put into words. So, it seems worth giving that a try here. I do suspect this checklist will help with transparency at a fairly superficial level (which is good!), but could it potentially harm progress on deeper issues?

As usual, I'm worried about effects on deeper issues related to the ways statistical methods are routinely (and often inappropriately) used in practice. It seems clear that increased transparency in describing the methods and decisions used is an important first step toward identifying where things can be improved. However, will the satisfying feeling of successfully completing the checklist lull researchers into complacency and keep them from spending effort on the deeper layers? Will it make them feel they don't need to worry about the deeper stuff because they've already made it through the required checklist? I suspect both could happen to some degree, and mostly unconsciously. Perhaps one way to phrase my sense of wariness is this: I worry the checklist will inadvertently be taken as a check of quality, rather than simply of transparency (regardless of quality). If that happens, I expect it will hinder deeper thought and critique rather than encourage it. I hope I'm wrong here, but we're human and we tend to make such mistakes.

My opinions come from my experiences as a statistician and my knowledge and ideas regarding statistical inference. I come at this already believing that many common uses of statistical methods and results contribute to a lack of transparency in science (at many levels) — mainly through implicit assumptions and a lack of meaningful justification of choices and decisions. We have a long, hard road ahead of us to change norms across many disciplines, and a simple 36-item checklist (or 12-item version for those who can't find the time for 36 yes/no/NA items) is a tiny step — and one that might have its own issues. We should always consider the lurking dangers of offering easy solutions and simple checklists that make humans feel they've done all that is needed, thus encouraging them to do no more. I don't mean this in a disrespectful way — it is just a reality of being human as we attempt to prioritize our time for professional survival.

My main worries are described above and expanded on in the Shortcuts and easy fixes section below. The other sections are thoughts that nagged me while I was reading, and I end the post with very brief and informal commentary on some of the items in the short checklist. I realize I am actually nervous about doing this — it will count as one of my "things that make me uncomfortable, but I really should do" for the week. Open conversation and criticism should be directly in line with the philosophy and motivation underlying the work, so that makes me feel better.

Consensus among whom?

Let's start with the title. My first reaction was "Consensus among whom?" Before reading further, I quickly skimmed the article to answer this question — because I needed the answer to reasonably evaluate the work described. Was it consensus among all those included in the author list? I knew the backgrounds of a few of them, but not enough to gauge the overall group. Given the goal of transparency, I thought this would be easy to assess — in my opinion, the worth of a consensus depends entirely on who is involved in the process of coming to that consensus! Does the title imply a stronger inference than it should? Does it imply wide scientific consensus, or is it understood to refer only to the method used to decide on the items? It does feel stronger than it should, in my opinion (but I have a very low threshold for such things). On the second page, I found some good info: "We developed the checklist contents using a pre-registered 'reactive-Delphi' expert consensus process, with the goal of ensuring that the contents cover most of the elements relevant to transparency and accountability in behavioral research. The initial set of items was evaluated by 45 behavioral and social science journal editors-in-chief and associate editors, as well as 18 open-science advocates." In the next paragraph: "(for the selection of the participants and the details of the consensus procedure see Supplemental Information). As a result, the checklist represents a consensus among these experts." I did find the lists of people and their affiliations in the supplemental information, as well as more information about the process of choosing the journal editors. But the phrase "among these experts" led to my next thoughts.

Who qualifies as experts on this topic?

My opinion about who deserves the "expert" label for this topic should not be expected to agree completely with the researchers' opinion (see my previous blog post about labeling people as experts). Rather than just stating they are experts, it would be nice for the article itself to acknowledge that we should expect disagreement on this, instead of presenting it in a way that supposes everyone should trust and agree with their criteria. In the article itself, there is no room for justifying the criteria used, and I think this is crucial to evaluating the work — the reader is expected to trust, or to check the supplementary materials (I'm curious how many people actually did or will check them). In the supplementary info, I was happy to see their main assumption for justifying the choice of editors clearly stated: "We aimed to work with the editors of the high ranked journals, assuming that these journals hold high methodological standards, which makes their editors competent to address questions related to transparency." This seems to imply that "competent" is good enough for the "expert" label. Fine with me, as long as that definition is transparent. Once it is transparent, people can actually have a real conversation about it and we can better assess the methods and claims in the paper. With limited space, maybe it is just a fact of life that such things end up in the supplementary info?

Back to the assumption — it would take a little more convincing to get me to the point of believing these editors should be the experts relied on for the consensus, but maybe they bring expertise in logistics and in understanding what journals and authors will realistically do. Also, a discussion of self-selection bias would be great (45 of the 207 people identified by the inclusion criteria actually participated, with 34 completing the second round). Maybe the more competent people self-selected because they cared more, but maybe not, because maybe the more competent people were too busy? Maybe I am overreacting, but it's our job as scientists to be skeptical, and I don't know how much to trust these 45 scientists to provide a consensus to be adopted discipline-wide — they too are integrated into the culture and norms that need changing. In the little I looked at, it doesn't appear this was a difficult consensus to come to, so maybe I am spending way too much time on this not-so-important part. But the title led me toward thinking it was important from the first glance. At the very least, we should be asking how qualified everyone is to assess all the components, particularly the statistical ones I'm most worried about (most of the items do rely on statistical inference concepts related to design and/or analysis). [Statisticians were included on the panel of experts — it would be nice to see disciplines listed in the spreadsheet.]

Shortcuts and easy fixes

I strongly believe that one of the fundamental problems underlying potential crises in science (including transparency and replication issues) is the ready availability of shortcuts and easy fixes (e.g., dichotomizing results based on p-values, encouraging the use of overly simple and unjustified power calculations, etc.). Easy hoops to jump through (for researchers, reviewers, funders, etc.) contribute to the continued use of less-than-rigorous methods and criteria and to low expectations for justifying their use. As they become part of the culture and expectations, it gets harder and harder to push back against them. In the context of Statistics, many researchers are not even aware of the problems underlying common approaches in their discipline. Is this checklist another quick fix that people will be quite happy with because it is very easy and leads to a feeling of comfort that they have done all they need to do to "ensure" transparency of their research? I think the danger is real; I'm just not sure how big it is. And what if it takes attention away from deeper levels of transparency — hiding problems behind expected methods (like power analysis)? Maybe it's better than nothing, but maybe it's not. It's at least worth thinking about.

Shortcuts and easy fixes. The need for a 12-item version instead of the 36-item checklist seems to help make my point and increases my wariness. "By reducing the demands on researchers' time to a minimum, the shortened list may facilitate broader adoption, especially among journals that intend to promote transparency but are reluctant to ask authors to complete a 36-item list." Well, I certainly agree that a "shortened list may facilitate broader adoption." It's generally easier to get people to do things that require less time and effort. But how does it not send the following message: "We think this is important, but not so important that we would expect you to answer an additional 24 yes/no/NA questions. Your time is more valuable than worrying about justifying transparency in your research." So, even if I wholeheartedly agreed with the checklist, I don't agree with the implicit message sent by offering the 12-item version.

A very low bar

From my statistician's perspective, this checklist might give scientists the green light to continue with questionable practices that are default norms — and to feel the comfort of getting a gold star for it. Could the existence of this checklist actually decrease motivation for change in the realms I, and other statisticians, care about? Again, maybe I am being too cynical, but at the very least we need to contend with this possibility and discuss it. This checklist represents a very low bar for scientists to jump over (even with the 36 items, not to mention the 12). We don't need more low bars — we need to encourage critical thinking and deep justification of decisions and ideas.

Getting picky about words

There are multiple phrases used in the paper that I believe should require justification. I personally do not buy the statements given the information provided. You may see these as picky, but I think it is important to realize that even our papers about doing better science fall prey to the common issues plaguing the dissemination of science. I see this as more the norm than the exception — we all do it to some degree! I provide a few phrases here, each followed by a little commentary.

  • "This checklist presents a consensus-based solution to a difficult task: identifying the most important steps needed for achieving transparent research in the social and behavioral sciences." Wow, that's strong. I encourage questioning the use of the phrases "solution" and "most important" — are they justified in this context? What are the implications of using such strong language that, in my mind, clearly cannot be true? I really think we should try to be very aware of wording and choose it to reflect our work in a way that is more honest and humble — even if it means it won't sell as well.
  • “In recent years many social and behavioral scientists have expressed a lack of confidence in some past findings partly due to unsuccessful replications. Among the causes for this low replication rate are underspecified methods, analyses, and reporting practices.” First, it’s definitely not just social and behavioral scientists expressing lack of confidence. Second, we need to carefully consider this idea of “unsuccessful replication” and the false dichotomy (successful or unsuccessful) and hidden criteria it represents (often based on questionable statistical practices). I have another blog post started on this topic, so will save more discussion on this point for that post.
  • "These research practices can be difficult to detect and can easily produce unjustifiably optimistic research reports. Such lack of transparency need not be intentional or deliberatively deceptive." I wholeheartedly agree with this statement. I also suggest that my first bullet point, along with my thoughts on the title, represents "unjustifiably optimistic" wording relative to this research report — and I doubt it was intentional or deliberative.

On a positive note

On a more positive note, I am happy to see it is a living checklist, subject to continual improvement. It will be interesting to see how it evolves and how it might become embedded in norms and culture. The authors explicitly acknowledge that it doesn’t cover everything: “While there may certainly remain important topics the current version fails to cover, nonetheless we trust that this version provides a useful to facilitate starting point for transparency reporting.” I want to be cautiously optimistic, but at the same time I couldn’t naively ignore the wary feeling trying to get my attention.

A look at some of the items

Quick thoughts about some of the items on the 12-item checklist shown in Figure 1 of the article.

  1. "Prior to analyzing the complete data set, a time-stamped preregistration was posted in an independent, third-party registry for the data analysis plan." So, it is fine to analyze the incomplete data set? That could technically mean all but one observation. This presents a very easy way to technically adhere to the item without really adhering to it. I wish the focus were on "prior to collecting data" rather than "prior to analyzing" data. Who knows what changed over the course of collecting data and analyzing the incomplete data set?
  2. "The preregistration fully describes… the intended statistical analysis for each research question (this may require, for example, information about the sidedness of the test, inference criteria, corrections for multiple testing, model selection criteria, prior distributions, etc.)." I'll just make a point about the general language here. I guarantee my view of "fully describe" will not match that of most authors. Who is qualified to assess that? Self-assessment here seems riddled with potential problems — and again it does not have to have anything to do with the quality of the work, only that whatever is intended is described. Describing completely inappropriate and unreasonable methods still makes 'Yes' a perfectly legitimate answer on the checklist. This is where the deeper layers come in. Is it better than nothing, providing a place to check for potential problems, or does it leave the author feeling like things have been checked for quality?
  3. "The manuscript fully describes… The rationale for the sample size used (e.g. an a priori power analysis)." Again, this is hard for me to address briefly. See my recent post here on power analysis — I have seen plenty of manuscripts and grant proposals that "fully describe" the ingredients used without adequately justifying those ingredients (see the sketch after this list for a small illustration of why the ingredients matter). This is an easy box to check that will continue to promote poorly justified and misunderstood power analyses, while continuing to help people feel good about them. I realize it's included just as an e.g., but…
  4. "The manuscript fully describes… the study design, procedures, and materials to allow independent replication." I assume that by "manuscript" they are including supplementary materials? Otherwise, this isn't realistic. Again, this should not be taken as indicating any quality or rigor in the study design and procedures, only that they are described — which is an important first step!
  5. "The manuscript fully describes… the measures of interest (e.g. friendliness) and their operationalizations (e.g. a questionnaire measuring friendliness)." While I could quibble with the wording here, I think this one could nudge people into doing a better job in this area — as long as the distinction is used throughout the paper, and not simply in one disclaimer sentence.
  6. "The manuscript … distinguishes explicitly between "confirmatory" (i.e. prescribed) and "exploratory" (i.e. not prescribed) analyses." I agree this one could do some good — in my experience, many researchers do not really understand the difference and simply don't report much of the exploratory analysis and results. I don't think this checklist will protect against that, but it might nudge thinking in the right direction.
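
To make the power analysis concern in item 3 a bit more concrete, here is a minimal sketch (my own illustration, not anything from the paper or the checklist) of how sensitive a "fully described" sample size calculation is to the assumed effect size, which is often the least defensible ingredient. It assumes Python with the statsmodels package; the effect sizes plugged in are hypothetical, which is exactly the point.

```python
# Minimal sketch (assumption: Python + statsmodels available), illustrating how the
# required sample size from an a priori power analysis swings with the assumed
# standardized effect size (Cohen's d) for a two-sample t-test at alpha = 0.05, power = 0.8.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for assumed_d in [0.2, 0.5, 0.8]:  # hypothetical effect sizes, simply plugged in
    n_per_group = analysis.solve_power(effect_size=assumed_d, alpha=0.05, power=0.8)
    print(f"assumed d = {assumed_d}: about {n_per_group:.0f} participants per group")

# Roughly: d = 0.8 needs ~26 per group, d = 0.5 needs ~64, and d = 0.2 needs ~394.
# A manuscript can "fully describe" whichever d was chosen without ever justifying it.
```

The checklist item only asks whether such a rationale is described, not whether the assumed inputs are defensible — which is the gap between transparency and quality I keep worrying about.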

About Author


MD Higgs

Megan Dailey Higgs is a statistician who loves to think and write about the use of statistical inference, reasoning, and methods in scientific research - among other things. She believes we should spend more time critically thinking about the human practice of "doing science" -- and specifically the past, present, and future roles of Statistics. She has a PhD in Statistics and has worked as a tenured professor, an environmental statistician, director of an academic statistical consulting program, and now works independently on a variety of different types of projects since founding Critical Inference LLC.

2 Comments
  1. MD Higgs

    An email from Andrew Gelman made me reflect on whether I would have signed onto this paper if I had been asked. This is really a question of whether I believe the potential good might outweigh the potential bad. If I'm completely honest with myself, I think I would have signed on — but I also would have shared my concerns and reservations publicly. Just posting this for the record.

  2. She’s wary of the consensus based transparency checklist, and here’s a paragraph we should’ve added to that zillion-authored paper « Statistical Modeling, Causal Inference, and Social Science

    […] Megan Higgs writes: […]
