I just spent a couple of hours at the meeting of the West Coast Experiments group, a set of political scientists interested in using experiments.
One of the speakers was talking about the need for “credible” or “honest” p-values. “This will be good”, I thought to myself…
What the speaker was alluding to were the current moves afoot in political science to help stamp out “fishing” for statistically significant results, including pre-registering research plans. The problem is that after you’ve looked at a data set multiple times, the p-values aren’t telling you what you think they are. The problem – as some Bayesians would point out – is that a p-value isn’t ever what you’d like it to be, even when you’re looking at the data for the first time…
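To see what the fishing worry amounts to, here is a minimal simulation (my own illustration, not the speaker’s; the 20 outcomes and 50 units per arm are made-up numbers): test many independent outcomes, none of which has a true effect, and count how often at least one comes up “significant”.

```python
import numpy as np

# Illustration of "fishing": test 20 independent null outcomes and count how
# often at least one test comes in at p < .05 purely by chance.
rng = np.random.default_rng(0)

n_sims, n_outcomes, n_per_arm = 10_000, 20, 50
false_positive_any = 0

for _ in range(n_sims):
    # Simulate a null experiment: treatment and control draws for each outcome.
    treat = rng.normal(size=(n_outcomes, n_per_arm))
    control = rng.normal(size=(n_outcomes, n_per_arm))
    # Two-sample z-style statistic for each outcome (large-sample approximation).
    diff = treat.mean(axis=1) - control.mean(axis=1)
    se = np.sqrt(treat.var(axis=1, ddof=1) / n_per_arm
                 + control.var(axis=1, ddof=1) / n_per_arm)
    z = diff / se
    if np.any(np.abs(z) > 1.96):  # "significant" at the nominal .05 level
        false_positive_any += 1

print(false_positive_any / n_sims)
```

With twenty true nulls the chance of at least one nominal “discovery” is roughly \(1 - 0.95^{20} \approx 0.64\), not .05, which is why the p-value on the one outcome you end up reporting isn’t telling you what it seems to.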
From the Bayesian perspective, all this stuff is kind of ridiculously overblown, a consequence of an unthinking acceptance of \(p < .05\) as a model for scientific decision-making, point null hypothesis testing, the whole box and dice. That is worth a separate post one day.
For now, I'll remark that pre-registration of research plans is a bit like eliciting very crude priors: enumerating the things that will be looked at in the analysis, because their effects aren't thought to be zero, and the things that won't be looked at, because prior beliefs about those effects are concentrated close to zero.
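To make the analogy concrete (my gloss, with notation that is mine rather than anything from the talk): a pre-registered outcome \(j\) gets a diffuse implicit prior on its effect, while an outcome left off the plan gets one pinned near zero,
\[
\theta_j \sim N(0, \tau_1^2), \quad \tau_1 \text{ large}; \qquad
\theta_k \sim N(0, \tau_0^2), \quad \tau_0 \approx 0.
\]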
The best moment of Bayesian irony came when the speaker emphasized that the need for honest p-values is especially pressing in situations where the experiment is expensive or intrusive and therefore unlikely to be run very often. This was just awesome, when you remember what a p-value is supposed to measure.
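Recall the textbook definition: with \(T\) the test statistic and \(t_{\text{obs}}\) its observed value,
\[
p = \Pr\bigl(|T| \ge |t_{\text{obs}}| \,\big|\, H_0\bigr),
\]
a probability computed over hypothetical repetitions of the experiment, which is to say, over exactly the repetitions we have just been told are unlikely ever to happen.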
More generally, it’s been very interesting to bring a Bayesian perspective to my teaching about experimental design and analysis, or to a meeting like the one I was at today.
To begin with, try this on: the role of randomization in Bayesian inference. As a formal matter, randomization plays no role in the Bayesian analysis of data from an experiment, or of any other data for that matter. This sounds very odd to non-Bayesians at first, particularly people who are doing a lot of experiments. But recall that repeated sampling properties like unbiasedness just aren’t the first or second or even third thing you consider in the Bayesian approach.
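Here is a minimal sketch of what I mean, with made-up data and a deliberately simple conjugate model (outcome variance assumed known): the posterior for the treatment effect is built from the likelihood of the observed outcomes given the observed assignments, and nothing in the calculation refers to how those assignments were generated, so long as assignment is ignorable.

```python
import numpy as np

# Minimal sketch: a conjugate-normal Bayesian analysis of a two-arm experiment.
# Nothing below refers to HOW treatment was assigned; given ignorable
# assignment, the posterior depends only on the likelihood of the observed data.
rng = np.random.default_rng(1)

# Hypothetical data: y = outcomes, d = treatment indicator (0/1).
n = 200
d = rng.integers(0, 2, size=n)
y = 1.0 + 0.5 * d + rng.normal(scale=2.0, size=n)

sigma2 = 4.0                        # outcome variance, assumed known
prior_mean, prior_var = 0.0, 10.0   # prior on the average treatment effect

# Sufficient statistics for the difference in means.
diff = y[d == 1].mean() - y[d == 0].mean()
diff_var = sigma2 / (d == 1).sum() + sigma2 / (d == 0).sum()

# Standard normal-normal posterior update.
post_var = 1.0 / (1.0 / prior_var + 1.0 / diff_var)
post_mean = post_var * (prior_mean / prior_var + diff / diff_var)

print(f"posterior for ATE: mean {post_mean:.2f}, sd {post_var**0.5:.2f}")
```

The point is not that this little model is right, only that the assignment mechanism drops out of the likelihood once you condition on the observed assignments.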
So just what is the value of randomization to a Bayesian? Surely not zero, right? Don Rubin has written a little on this; Rubin’s point – that randomization limits the sensitivity of a Bayesian analysis to modeling assumptions – is a stronger conclusion than it first sounds, and one of the more helpful things I’ve come across on the topic. I also found this note by J. Ghosh helpful: a concise, accessible summary of some of the Bayesian thinking on the matter (Savage, Kadane, Berry, etc.). But my sense is that there’s not a lot out there on this. There is actually more writing along these lines in the sampling literature, where model-based inference is put up against design-based inference, an essentially parallel debate.
So, vast chunks of the (overwhelmingly classical/frequentist) literature on the analysis of experiments can seem very odd to a Bayesian. Randomization inference, or permutation tests. Re-randomization of assignment status if one detects imbalance. Virtually all Bayesians take the Likelihood Principle seriously, but so much of the work on experiments seems to violate it. It is also pretty obvious that experimenters are carrying around prior information and using it: balance checks would seem to be guided by prior expectations as to likely confounders, no? Just in the same way that post-stratification weighting for non-response in a survey setting seems to be guided by an (implicit, and rather simplistic) model of response/non-response.
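For anyone who hasn’t seen randomization inference up close, here is a toy permutation test (again my own illustration, on fake data): the p-value is a tail probability computed by re-assigning treatment labels that were never actually assigned, which is precisely the sort of dependence on unobserved data that makes Likelihood Principle devotees wince.

```python
import numpy as np

# Toy randomization (permutation) test for a difference in means.
rng = np.random.default_rng(2)

# Hypothetical data: y = outcomes, d = treatment indicator (0/1).
n = 100
d = rng.integers(0, 2, size=n)
y = 0.3 * d + rng.normal(size=n)

observed = y[d == 1].mean() - y[d == 0].mean()

# Re-randomize the labels many times and see how extreme the observed
# difference in means is relative to that reference distribution.
n_perm = 10_000
perm_stats = np.empty(n_perm)
for i in range(n_perm):
    d_perm = rng.permutation(d)
    perm_stats[i] = y[d_perm == 1].mean() - y[d_perm == 0].mean()

p_value = np.mean(np.abs(perm_stats) >= abs(observed))
print(f"observed difference {observed:.3f}, permutation p-value {p_value:.3f}")
```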
There is a lot to work through. Above all, it is important to keep in mind what is relevant for the applied scientist, what is more esoteric, and where Bayesian ideas can be of real practical use (e.g., Andy Gelman et al. on hierarchical models for multiple comparison problems, or for the analysis of blocked or clustered designs).
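As one example of that practical payoff, here is a toy version of the partial-pooling idea behind the hierarchical-model approach to multiple comparisons (in the spirit of Gelman and colleagues, though the numbers and the crude plug-in variance estimates below are mine): instead of adjusting p-values, noisy group-level estimates are shrunk toward the grand mean, which automatically tempers the most extreme comparisons.

```python
import numpy as np

# Partial pooling in a toy normal-normal setup with known standard errors:
# each group estimate is pulled toward the overall mean, more so when the
# between-group variance is small relative to the sampling noise.
y = np.array([2.8, 0.8, 3.0, 1.0, -0.5, 0.2])   # hypothetical group estimates
se = np.full_like(y, 1.0)                        # their standard errors

mu = y.mean()                                    # crude estimate of the overall mean
tau2 = max(y.var(ddof=1) - se.mean()**2, 0.01)   # crude between-group variance estimate

# Normal-normal shrinkage: a precision-weighted average of each group's own
# estimate and the overall mean.
weight = tau2 / (tau2 + se**2)
shrunk = weight * y + (1 - weight) * mu

for raw, post in zip(y, shrunk):
    print(f"raw {raw:+.2f}  ->  partially pooled {post:+.2f}")
```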
For now, I’m blessed to have colleagues like Persi Diaconis, Guido Imbens and Doug Rivers, who indulge (or encourage) my thinking out loud on these matters.