Failed experiments do not always fail toward the null

There is a common argument among psychologists that null results are uninformative. Part of this is the logic of NHST – failure to reject the null is not the same as confirmation of the null. That is an internally valid statement, but it ignores the fact that studies with good power also have good precision to estimate effects.
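
To make that concrete, here is a minimal sketch, assuming a simple two-group design, purely illustrative sample sizes, and the standard large-sample approximation for the standard error of Cohen's d. It shows how tightly a well-powered "null" result constrains the effect:

```python
# A rough sketch: how precise a "null" estimate is in a two-group design.
# Sample sizes are illustrative; n = 64 per group is roughly 80% power for d = 0.5.
import math
from scipy import stats

def ci_halfwidth_d(n_per_group, conf=0.95):
    """Approximate CI half-width for Cohen's d when the estimate is near zero."""
    se = math.sqrt(2.0 / n_per_group)      # large-sample SE of d near d = 0
    z = stats.norm.ppf(0.5 + conf / 2.0)   # ~1.96 for a 95% interval
    return z * se

for n in (20, 64, 200):
    print(f"n = {n:>3} per group: a null estimate has a 95% CI of about ±{ci_halfwidth_d(n):.2f} in d units")
```

With 64 per group, an estimate near zero comes with an interval of roughly ±0.35, which rules out the medium-sized effect the study was powered to detect.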

However there is a second line of argument which is more procedural. The argument is that a null result can happen when an experimenter makes a mistake in either the design or execution of a study. I have heard this many times; this argument is central to an essay that Jason Mitchell recently posted arguing that null replications have no evidentiary value. (The essay said other things too, and has generated some discussion online; see e.g., Chris Said’s response.)

The problem with this argument is that experimental errors (in both design and execution) can produce all kinds of results, not just the null. Confounds, artifacts, failures of blinding procedures, demand characteristics, outliers and other violations of statistical assumptions, etc. can all produce non-null effects in data. When it comes to experimenter error, there is nothing special about the null.

Moreover, we commit a serious oversight when we use substantive results as the sole evidence that our procedures worked. Say that the scientific hypothesis is that X causes Y. So we design an experiment with an operationalization of X, O_X, and an operationalization of Y, O_Y. A “positive” result tells us O_X -> O_Y. But unless we can say something about the relationships between O_X and X and between O_Y and Y, the result tells us nothing about X and Y.

We have a well-established framework for doing that with measurements: construct validation. We expect that measures can and should be validated independently of substantive results, to document that Y -> O_Y (convergent validity) and that P, Q, R, etc. !-> O_Y (discriminant validity). We have papers showing that measurement procedures are generally valid (in fact these are some of our most-cited papers!). And we typically expect papers that apply previously-established measurement procedures to show that the procedure worked in a particular sample, e.g. by reporting reliability, factor structure, correlations with other measures, etc.
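
As a rough illustration of what that evidence can look like in practice, here is a minimal sketch assuming a hypothetical validation dataset (the file and column names are invented for the example). It computes internal consistency for a new measure O_Y along with one convergent and one discriminant correlation:

```python
# A hypothetical sketch of measurement validation evidence; the dataset and
# column names are illustrative, not from any actual study.
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Internal consistency of a multi-item scale."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

df = pd.read_csv("validation_sample.csv")              # hypothetical validation sample
oy_items = df[["oy_item1", "oy_item2", "oy_item3", "oy_item4"]]
o_y = oy_items.mean(axis=1)                            # scale score for the new measure O_Y

print("alpha:", round(cronbach_alpha(oy_items), 2))
print("convergent r with an established measure of Y:", round(o_y.corr(df["established_y"]), 2))
print("discriminant r with a measure of unrelated construct P:", round(o_y.corr(df["measure_p"]), 2))
```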

Although we do not seem to publish as many validation papers on experimental manipulations as on measurements, the logic of validation applies just as well. We can obtain evidence that O_X -> X, for example by showing that experimental O_X affects already-established measurements O_X2, O_X3, etc. And in a sufficiently powered design we can show that O_X does not meaningfully influence other variables that are known to affect Y or O_Y. Just as with measurements, we can accumulate this evidence in systematic investigations to show that procedures are generally effective, and then when labs use the procedures to test substantive hypotheses they can run manipulation checks to show that they are executing a procedure correctly.
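
A minimal sketch of that logic, assuming simulated data and an arbitrary ±0.3 equivalence bound, might pair a manipulation check with a two one-sided tests (TOST) equivalence analysis on a potential confound:

```python
# A hypothetical sketch, not a specific lab's procedure: a manipulation check
# plus a TOST equivalence test that the manipulation leaves a known confound alone.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated data: the manipulation-check measure shifts, the confound does not.
check_t, check_c = rng.normal(0.4, 1, 200), rng.normal(0.0, 1, 200)
conf_t, conf_c = rng.normal(0.0, 1, 200), rng.normal(0.0, 1, 200)

# 1) Manipulation check: did O_X move an established measure of X?
t, p = stats.ttest_ind(check_t, check_c)
print(f"manipulation check: t = {t:.2f}, p = {p:.4f}")

# 2) Equivalence test (TOST) on the confound. The bound is an assumption standing
#    in for "the smallest shift on the confound we would consider meaningful."
bound = 0.3
n1, n2 = len(conf_t), len(conf_c)
diff = conf_t.mean() - conf_c.mean()
sp2 = ((n1 - 1) * conf_t.var(ddof=1) + (n2 - 1) * conf_c.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
dof = n1 + n2 - 2
p_lower = 1 - stats.t.cdf((diff + bound) / se, dof)   # tests diff > -bound
p_upper = stats.t.cdf((diff - bound) / se, dof)       # tests diff < +bound
print(f"TOST p = {max(p_lower, p_upper):.4f} (small values support 'no meaningful shift')")
```

If both one-sided tests come out significant, the confound's shift is reliably inside the equivalence bounds, which is the kind of affirmative evidence about the procedure that the argument above calls for.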

Programmatic validation is not always necessary — some experimental procedures are so face-valid that we are willing to accept that O_X -> X without a validation study. Likewise for some measurements. That is totally fine, as long as there is no double standard. But in situations where we would be willing to question whether a null result is informative, we should also be willing to question whether a non-null is. We need to evaluate methods in ways that do not depend on whether those methods give us results we like — for experimental manipulations and measurements alike.

6 thoughts on “Failed experiments do not always fail toward the null”

  1. “Does not always fail toward the null” is a pretty safe statement but I’d be curious to know what the probability on either side really is.

    I have made an uncountable number of mistakes over two decades that have both propelled my thinking forward and made me realize how easy it is to fail to replicate. There is a reason this point keeps getting made — it is every scientist’s experience. And the scientists I most respect have a high internal bar for failed replications because of it.

    In my experience, failed replications usually lead to a face palm as I realize the error and/or an interesting qualification of the original effect. I cannot think of a time I ruled out the original effect wholesale. It’s not that it can’t happen, but I would love to know the practical probabilities. On the contrary, I have replicated dozens of findings – which I’m now starting to publish due to the current climate. Progress of a sort.

    Here’s an example of a recent failure to replicate from my lab. We tested hundreds of villagers in a remote culture on a famous perceptual experiment. Because we were careful about the procedure and had tons of power, we were confident that we had established an important negative result. Then, one of the villagers made an offhand comment that made us realize a critical issue with our auditory stimuli. We fixed the issue and the original finding replicated, clear as day. Twice that trip we failed to replicate a previous finding and, in both cases, those failures reversed dramatically after realizing a key (yet non-obvious) procedural or stimulus issue. Both failures to replicate were interesting, but not because they showed the original effect to be false or fragile. Rather, they showed how even robust effects can be obscured in interesting ways. I feel incredibly fortunate that I was able to understand what happened before publishing either as a straight failure to replicate.

    These mistakes are a welcome part of the process because they help propel understanding forward. They help us learn about the real, underlying phenomena which are usually even more complex and fascinating than we thought.

    As I see it, the push-back against the replication movement is largely from scientists like myself who have adopted asymmetric weights for negative relative to positive findings based on our professional experience. I’d be fascinated to know whether or not that’s good science. I’m betting it is.

    1. There are all kinds of mistakes scientists can make. Different kinds will have different consequences: some will systematically bias toward the null, some away, some randomly. It’s enough for me that all kinds happen often enough to matter, and can be made by original researchers and replicators alike.

      I have caught myself making all kinds of mistakes. Beyond that, I am somewhat hesitant to speculate about which is most common (in my own work or in general) for a few reasons. Some kinds of mistakes may be intrinsically harder to catch. More worryingly, some may be more likely to be caught because of biased procedures or motivated reasoning. If you are more likely to go looking for errors when you get null results, you will probably find more errors that lead to null results. I think that is a pretty natural tendency — it is certainly one I have caught myself falling into. And I have to assume that if I’ve sometimes caught myself doing that, there are other times I haven’t.

      That’s why I am arguing that as much as possible we need to validate our methods and procedures, both in design and execution, separately from looking at our results. When we do that, we can have more confidence that our “positive” results aren’t due to biases in which errors we let slip through, and our “null” results are diagnostic of small effects.

    2. P.S. Re this: “In my experience, failed replications usually lead to a face palm as I realize the error and/or an interesting qualification of the original effect. I cannot think of a time I ruled out the original effect wholesale. It’s not that it can’t happen, but I would love to know the practical probabilities.”

      I agree completely. A discrepancy between an original study and a replication (I don’t like the term “failed replication”) has multiple potential interpretations. Sometimes we have data or information at hand to figure out which ones are more or less plausible; sometimes we need more data. I don’t see that as a problem with doing replications; to the contrary, it is actually a very good argument in favor of them: https://hardsci.wordpress.com/2014/07/01/some-thoughts-on-replication-and-falsifiability-is-this-a-chance-to-do-better/

  2. >Different kinds will have different consequences: some will systematically bias toward the null, some away, some randomly.

    Saying it this way implies an even distribution – "sometimes this, sometimes that, sometimes this or that" – or worse, that the distribution kind of doesn’t matter. My point was that this assumption may be wrong and that this is important.

    >It’s enough for me that all kinds happen often enough to matter, and can be made by original researchers and replicators alike.

    It’s easy to make the case that bad things matter. But in terms of policy, there is an opportunity cost to mattering that makes the probabilities important. Many more people died after 9/11 than would have otherwise because they got into cars instead of airplanes. Terrorist attacks and car accidents both matter. It’s the probabilities that made driving a worse decision. When Anthony Fauci spoke at the NIH about going all in against smallpox, he was criticized for taking that money from cancer and AIDS research. No one was arguing that smallpox didn’t matter. It was about opportunity costs given probabilities.

    Let me be clear, I don’t know what the probabilities are for negative vs. positive findings. It may be that mistakes that cause spurious positive findings happen “often enough to matter despite the opportunity costs associated with policies aimed at their reduction.” But that’s what we need to consider – the full picture – what are the probabilities and what are the costs?

    >That’s why I am arguing that as much as possible we need to validate our methods and procedures, both in design and execution, separately from looking at our results.

    Agreed.

    And thanks for the PS and the general reasonable tone throughout. It gives me some optimism that Science will come out stronger for these kinds of discussions.

    1. To clarify, I am not saying the distribution is uniform – I am saying it is unknown (apart from all possibilities being nontrivial). You might say that is easy, but I am challenging an assumption of Jason’s essay — and the line of thinking that says that null results are uninterpretable — that says that one kind of error dominates and others either don’t exist or are ignorable.

      I agree with you that we need to think about costs and benefits. But that depends on our estimates both of probabilities and of various outcomes. How do we count up the hours and resources researchers spend trying to build on and extend findings that are unreliable, or develop applications based on them? How do we calculate the value of the discoveries they did not make while they were doing that? How do we calculate the costs of policy recommendations and interventions based on incorrect findings? And the lost credibility when they do not work? And of course the worry on the other side is that if we spend too much time and effort on replication we will discover fewer new things, develop fewer policy improvements and interventions from them, etc. We need to think about these things, but it is not like tallying automobile fatalities.

      Also, a cost-benefit analysis focuses on tradeoffs. But as I argued in my last post, replication is not in opposition to discovery; it is a vital part of the discovery process. In fact many of the reforms currently being advocated reduce errors of many kinds at relatively low cost. Running adequately powered studies, reducing researcher degrees of freedom, reducing publication bias, validating methods, posting details of stimuli and procedures, posting datasets, and even routine replication all help us reduce errors of all kinds and make valid new discoveries.

  3. >To clarify, I am not saying the distribution is uniform – I am saying it is unknown (apart from all possibilities being nontrivial). You might say that is easy…

    Actually, I said “do not always fail toward the null” was easy. I argued that the distribution is more important and also unknown. We agree.

    >but I am challenging an assumption of Jason’s essay — and the line of thinking that says that null results are uninterpretable — that says that one kind of error dominates and others either don’t exist or are ignorable.

    I think most of us can agree that extreme statements are unhelpful, on both sides. Words like qualify, distribution, opportunity costs, and probability are more helpful than false, failed, always, never.

    >I agree with you that we need to think about costs and benefits. But that depends on our estimates both of probabilities and of various outcomes…. We need to think about these things but it is not like tallying automobile fatalities.

    Agreed that estimating the costs and benefits will be important and challenging. I assume you realize that my car example was making a probability point, not an ease-of-measurement point.

    > In fact many of the reforms currently being advocated reduce errors of many kinds at relatively low cost.

    No one is going to be against "low cost reforms" per se. The problem is deciding what "low" is and to whom. Some of the reforms you mentioned are clear, or should be, to everyone – e.g., "don't run inadequately powered studies." Others get more complicated – e.g., "post datasets" – I agree in principle (and do so when journals allow me to post my raw behavioral data), but MEG and fMRI datasets can take a grad student a long time to usefully annotate, have storage and HPC transfer issues at the university level, and have been historically severely underutilized (see the JoCN data center). I wouldn’t call this "relatively low cost" but rather the start of a bigger conversation about how to make it so.
