The typical study testing a campaign message goes something like this: a sample of (mostly) human¹ online respondents is recruited, and some are randomly assigned to be shown a message or video. Everyone is then asked about whatever the campaign wants to measure, often how they will vote. The analyst then compares the answers of those who saw the message with those who didn’t to work out the effect of the message.
Hewitt et al. (2024) study an archive of such message tests conducted by the company Swayable during the 2018 and 2020 US elections. Their conclusion is simple: pre-testing online messages creates huge electoral advantages by allowing campaigns to identify the most persuasive ads and focus their budget on these.
But there’s a slight problem. From academic research at least, we don’t know that any online ads really persuade voters. Our best guess is that, when deployed in the real world, online ads increase vote share by about 0.7 of a percentage point (Coppock et al., 2022). Moreover, we can’t take it for granted that other, more intensive forms of voter contact — such as door-to-door canvassing — persuade voters to change their vote either (Kalla & Broockman, 2017).
A more direct comparison comes from a brilliant paper by two researchers interested in how to persuade the public to support policies addressing the climate crisis. Carnes & Henderson (2025) essentially replicate a pre-testing-to-deployment pipeline, but carry on testing the interventions as they are sent out into the world. They found that messages which looked promising in pre-testing didn’t work when mailed out to people.
So what goes wrong? In a note, Ben Tappin has outlined a framework for understanding, from the perspective of practitioners, what surveys are useful for. Surveys can be misleading when the population who take them differs from the population who receive the message in the real world, or when the message simply doesn’t reach its audience in the real world. For an extensive description of all these features, do make sure to read the note.
Carnes & Henderson (2025) focus on attention. In a follow-up survey experiment they vary the extent to which respondents are forced to pay attention to the treatment, and find that messages only work when respondents are made to pay attention to them.
Attention requires people to actually receive a message. In experimental design, we call this “compliance”. Broadly, analysts have two options when dealing with non-compliance: ignore it or model it. Ignoring it, we can look at the effect of trying to deliver the intervention: the intent-to-treat (ITT) effect. Modelling it, we first estimate how much trying to deliver the intervention increases the share of people who actually get it, and then divide the ITT by that share. This gives us the complier average causal effect (CACE): the effect of the intervention among the type of people who would receive it if you attempted to contact them.
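To make the arithmetic concrete, here is a minimal Python sketch of both quantities. The function name `itt_and_cace` and the toy data are my own illustrations, not taken from any of the studies cited here. Reading the ratio as a CACE also relies on the usual instrumental-variable assumptions: assignment only affects outcomes through receipt, and nobody does the opposite of what they are assigned.

```python
from statistics import mean


def itt_and_cace(assigned, received, outcome):
    """Return (ITT, compliance rate, CACE) from three parallel 0/1 lists.

    assigned: 1 if we tried to deliver the message to this person, else 0
    received: 1 if the person actually got the message, else 0
    outcome:  1 if the person gave the outcome of interest (e.g. vote intention)
    """
    treat = [i for i, z in enumerate(assigned) if z == 1]
    control = [i for i, z in enumerate(assigned) if z == 0]

    # ITT: difference in mean outcomes by assignment, ignoring who received it.
    itt = mean(outcome[i] for i in treat) - mean(outcome[i] for i in control)

    # First stage: how much assignment raises the share who actually receive it.
    compliance = mean(received[i] for i in treat) - mean(received[i] for i in control)
    if compliance == 0:
        raise ValueError("Assignment did not change receipt; CACE is undefined.")

    # CACE: the ITT scaled up by the compliance rate (the Wald estimator).
    return itt, compliance, itt / compliance


# Hypothetical data: 10 people assigned to the message, 10 to control.
assigned = [1] * 10 + [0] * 10
received = [1] * 5 + [0] * 15           # only half of those assigned are reached
outcome = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0] + [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

itt, compliance, cace = itt_and_cace(assigned, received, outcome)
print(f"ITT={itt:.2f}, compliance={compliance:.2f}, CACE={cace:.2f}")
```

With these made-up numbers the ITT is 10 percentage points, half of those assigned are reached, and the CACE is therefore 20 points. The sketch gives point estimates only; a real analysis would also need standard errors.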
Often, however, there are multiple ways to define compliance in each case, so the CACE can mean very different things between studies. For example, when can we say someone has actually ‘received’ a political leaflet: when it’s gone through their letterbox, or only when they’ve seen it? What if they’ve seen it but not read it, or read it but not processed it? In practice we can usually measure only one of these, typically delivery, so the CACE in this case would tell us about the effect of the leaflet among people who can be successfully reached by mail, not among those who read or processed it.
It’s not just about attention, though. Sometimes treatments are simply easier to execute in a survey setting. In a project I’m working on, we randomly assigned campaign supporters to messages tailored to the issue they had previously stated they cared most about. The problem here is that the treatment (tailored messaging) requires the sender to know something about the person receiving it — in this case, the issue they cared about most. In a survey, this is trivial: ask them. But in the real world this type of information is hard to obtain, and quickly gets outdated. So whilst this type of issue tailoring has been shown to work in surveys, we found it didn’t work in the field (unless the message explicitly mentioned that the recipient had previously raised the issue).
What can practitioners do? Firstly, they should pay close attention to what a message test can tell them. In Tappin’s (2026) framework, they should reflect on whether they want to know whether to deploy a particular message, or which message to deploy. Message tests from surveys are rarely useful for understanding whether a message will be effective overall, because so many factors differ between a survey and the real world that the effect size estimated in a survey is often much larger than the effect in the field.² In political campaigns, the “whether” question is generally not the one practitioners are asking (although perhaps it should be, given how little we know about the effects of persuasion interventions). Instead, they are often interested in finding out which message to use. Surveys can be more helpful here, as it’s less likely that something about the survey setting changes the relative effectiveness of messages.
Secondly, practitioners are generally well placed to test things in the field, as they are delivered. This is certainly more costly than message testing in surveys. However, a few well-executed field tests can buy valuable knowledge about whether future surveys will tell us what might work in the field.
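As a rough illustration of what such field tests could buy, the Python sketch below compares survey and field estimates for the same set of messages. All the numbers are hypothetical, and the two summaries are my own choice for illustration rather than a method from the studies above. The first summary is rank agreement (does the survey order the messages the way the field does?) and the second is shrinkage (how much smaller are field effects on average?).

```python
from statistics import mean


def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties, for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation for tie-free data of equal length."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Hypothetical effect estimates (percentage points) for the same four messages.
survey = {"A": 4.0, "B": 2.5, "C": 1.0, "D": 3.2}
field = {"A": 0.6, "B": 0.3, "C": 0.1, "D": 0.2}

messages = list(survey)
s = [survey[m] for m in messages]
f = [field[m] for m in messages]

# Near 1: the survey orders messages much as the field does,
# so it is useful for choosing which message to deploy.
rank_agreement = spearman(s, f)

# Well below 1: survey effect sizes overstate what happens in the field,
# so they are a poor guide to whether a message works at all.
shrinkage = mean(f) / mean(s)

print(f"rank agreement={rank_agreement:.2f}, shrinkage={shrinkage:.2f}")
```

In this invented example the survey picks the same best message as the field even though field effects are roughly a tenth of the survey effects. With only a handful of messages, and noisy estimates for each, both summaries are themselves uncertain and should be read as a sanity check, not a calibration.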
² It’s worth noting that this suggests a message showing only a very small effect in surveys is very unlikely to work in the field.