Old Man Warner snorted. “Pack of crazy fools,” he said. “Listening to the young folks, nothing’s good enough for them. Next thing you know, they’ll be wanting to go back to living in caves, nobody work any more, live that way for a while. Used to be a saying about ‘Lottery in June, corn be heavy soon.’ First thing you know, we’d all be eating stewed chickweed and acorns. There’s always been a lottery,” he added petulantly.
Shirley Jackson, “The Lottery”
A few weeks back, there was much social media drama about a paper titled “Social Media and Job Market Success: A Field Experiment on Twitter” (2024) by Jingyi Qiu, Yan Chen, Alain Cohn, and Alvin Roth (recipient of the 2012 Nobel Prize in Economics). The study posted job market papers by economics PhDs, and then assigned prominent economists (who had volunteered) to randomly promote half of them on their profiles (more detail on this paper in a bit).
The “drama” in question was generally: “it is immoral to throw dice around on the most important aspect of a young economist’s career”, versus “no it’s not”. This, of course, awakened interest in a broader subject: Randomized Controlled Trials, or RCTs.
R.C.T. T.O. G.O.
Let’s go back to the 1600s, when bloodletting was a common way to cure diseases. Did it work? Well, the physician Joan Baptista van Helmont had an idea: randomly divvy up a few hundred invalids into two groups, one of which got bloodletting and one of which didn’t.
While it’s not clear this experiment ever happened, it sets up the basic principle of the randomized controlled trial: to study the effects of a treatment (in a medical context, a medicine; in an economics context, a policy), a sample is divided into two groups: the control group, which does not receive the treatment, and the treatment group, which does. The modern randomized controlled (or control) trial has three “legs”: it’s randomized because who ends up in each group is chosen at random, it’s controlled because there’s a group that doesn’t get the treatment and serves as a counterfactual, and it’s a trial because you’re not deploying anything “at scale” just yet.
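To make those three “legs” concrete, here is a minimal sketch of the mechanics, in Python and with entirely made-up numbers (the sample size, the +2 treatment effect, and the noise are all illustrative, not taken from any real study): randomly split a sample, treat one half, and compare average outcomes.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up sample of 1,000 people with a baseline outcome (say, income in some unit).
n = 1_000
baseline = rng.normal(loc=50, scale=10, size=n)

# "Randomized": each person lands in the treatment group by coin flip.
treated = rng.random(n) < 0.5

# Simulate a hypothetical treatment effect of +2, plus idiosyncratic noise.
outcome = baseline + 2 * treated + rng.normal(loc=0, scale=5, size=n)

# "Controlled": the untreated half serves as the counterfactual,
# so the difference in average outcomes estimates the treatment effect.
effect_estimate = outcome[treated].mean() - outcome[~treated].mean()
print(f"Estimated average treatment effect: {effect_estimate:.2f}")
```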
Why is it so important to randomly select people for economic studies? Well, you want the only difference, on average, between the two groups to be whether or not they get the treatment. Consider military service: it’s regularly trotted out that drafting kids would reduce crime rates. Is this true? Well, the average person who serves could be systematically different from the average person who doesn’t - volunteers could come from wealthier, more patriotic families, or from poorer families who need certain benefits; those who are exempted could have physical disabilities that impede their labor market participation, or be wealthier university students who get a deferral. But because many countries use lotteries to allocate draftees versus non-draftees, you can get a group of people who are randomly assigned to the draft, and who on average should be similar enough to each other. One study in particular, on Argentina’s mandatory military service over pretty much all of the 20th century, finds that being conscripted raises the crime rate relative to people who weren’t drafted through the lottery. This doesn’t mean that soldiers have higher crime rates than non-soldiers, because of selection issues - but it does provide pretty good evidence that getting drafted is not good for your non-criminal prospects.
Ni yanquis ni marxistas, randomistas
While Randomized Controlled Trials (or RCTs for short) were originally developed in the natural sciences, namely medicine, they do have something of a history in economics: starting in the 2000s, they began to be applied to economic interventions, especially in the field of development economics. In fact, the 2019 Nobel Prize in Economics went to three researchers (technical summary here) who pioneered this approach: Esther Duflo, Abhijit Banerjee, and Michael Kremer.
In the 2000s and early 2010s, the field of development economics (or dev econ) was in the midst of a pretty brutal discussion (realheads know all about this) about whether international aid worked at all. For example, there was an extremely acrimonious debate about whether deworming children provided any economic benefits (a very heavily litigated issue, reaching top global experts, and mostly about really minute details of the study’s implementation), and another about whether farmers should be given anti-mosquito bednets for free or not.
Traditionally, people answered these types of questions either by coming up with a model and gaming it out in the realm of theory, or by observing some similar-enough policy implementation and measuring outcomes. However, the “randomista” movement proposed a third option: running an experiment to see what worked. For instance, microcredit was generally considered a very good idea, receiving the approval of the Nobel Committee itself with Muhammad Yunus’s 2006 Peace Prize. But RCT evidence found very limited proof of its efficacy, and then more comprehensive Bayesian metastudies found that the evidence in its favor was basically worthless (less technical writeup here, and abridged version of the writeup at #3 here)1
What’s the argument for RCTs? Well, their proponents claim they are the gold standard of economics research: instead of getting bogged down in big, over-ambitious questions, researchers can chip away at smaller, more tractable ones. So, for instance, instead of asking “what causes people to be poor” (an overly large, extremely complex question), economists can ask something like “does giving people money make them work less” (the answer is no). The idea is that a well-designed RCT can provide nearly unimpeachable evidence about which solutions work best for which problems - so if you stay away from questions such as “why are there problems”, you can take a piecemeal approach to basically any large-scale issue.
Of course, this approach has inherent limitations. Firstly, not all issues are amenable to an RCT - because it would be materially impossible (think interest rates), it would be unethical (more on this later), or it would be hard to come up with a counterfactual to test against (if you want to do an RCT on the gender wage gap, what’s the control group?). While two of these won’t stop one Jesse Singal from demanding RCTs, most economists would thus agree that there are clear limits on the scope and practice of RCTs.
There are also caveats even for the RCTs that can be performed: they can capture the internal validity of the treatment’s effects, i.e. the impact on outcomes, but not the external validity, i.e. the applicability of the results to other settings. In what is perhaps the best-known example, Duflo, Banerjee, and their coauthors tested whether giving people lentils incentivized them to get vaccinated, versus giving them other items or nothing at all. They found that the effect was properly identified and positive: people prefer “eating” to “not eating”. This means the result is internally valid, provided the samples and subsamples are balanced and the data is all properly collected. However, the external validity is in question: while the approach works in India, other cultures might consider it unseemly to receive compensation for immunization, and the same scheme could reduce participation instead.
The main value added by the RCT movement, it seems, has not been in finding things that “work”, of which there are too few, but in steering resources away from programs that make intuitive sense but don’t actually improve lives. The major area of application is alleviating global poverty, a source of endless questions and no easy answers - unless, perhaps, one starts running random experiments in Africa. One may be reminded of the Effective Altruism movement, which, in spite of the controversy, is largely about getting rich people to give less money to the opera and more money to unfashionable causes, such as traffic safety initiatives in Sub-Saharan Africa2. But much like Effective Altruism, the randomista movement is rather controversial, with two major sources of questioning: first, its efficacy at producing knowledge; second, how ethical it is to turn human beings into guinea pigs.
RCT of the trains running on time
“Do RCTs produce useful knowledge?” seems like a rather stupid question, because they won a Nobel Prize. But let’s look at one example: does building schools make young girls go to school? Yes. Groundbreaking stuff, someone call Stockholm to give them the news of this monumental finding. However, the reason this is an important, RCT-worthy question is quite obvious: the study was conducted in Taliban Afghanistan, which is not, uh, very friendly towards women’s education (source: Yousafzai, Malala). Still, a similar comment - that the RCT on Afghan girls’ schools is a stupid one - comes from Lant Pritchett, one of the most distinguished development economists out there. RCTs have, in fact, been the centerpiece of a massive economics debate featuring multiple top-notch academics.
There are also bigger-picture issues to consider. The first, obvious one is that external validity is weak: there’s no real way to verify whether a study generalizes short of actually applying it elsewhere. This stems from a series of issues, primarily that transporting the results is mostly done on “faith”. The internal validity of a study is usually in question too: basically, the effects across the population are often too heterogeneous for the estimates to be both unbiased (correctly capturing the value of the effect) and precise (not swamped by “noise”, so to say). And even when the average effect of a treatment is correctly identified, it is never guaranteed that the average effect is the most relevant statistic. A lot of issues come from the fact that RCTs are not always perfectly designed or perfectly measured, which can make their outcomes suspect - but which is difficult to ascertain in general.
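For reference, the unbiased-versus-precise distinction can be read off the standard difference-in-means estimator and its usual standard error: unbiasedness is about whether the first expression is centered on the true effect, precision is about how small the second one is.

```latex
\hat{\tau} = \bar{Y}_{\text{treated}} - \bar{Y}_{\text{control}},
\qquad
\widehat{\mathrm{SE}}(\hat{\tau}) = \sqrt{\frac{s^2_{\text{treated}}}{n_{\text{treated}}} + \frac{s^2_{\text{control}}}{n_{\text{control}}}}
```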
Additionally, some argue that RCTs take attention away from trying to “solve development”, which is quite hard, in favor of pursuing very narrow targets that only tackle symptoms of a much broader problem. However, even critics reckon that structural change isn’t always possible, and that improving public services is a good enough stopgap while we figure out the big issues. Other issues come from public discussion and dissemination: bad reporting on studies and their results by the media, and a failure of authors to correct it; treating estimates as facts in work based on research (I would never stoop to such lows); no audience scrutiny of poor design and poor reporting of details; too narrow solutions to too narrow a framing of the problem; poor choice of outcome variables; and overall bad study design.
The biggest, most “ambitious” critique of RCTs goes after their “gold standard” designation: many consider that calling the controlled trial the pinnacle of possible economics research implies that no other research is good enough. While a simple browse through the bibliographies of the randomista leaders disproves this notion, the RCT movement has grown in strength and influence recently, leading (as all big movements do) to outrageous claims and overconfidence. RCTs are very expensive, and (as mentioned above) the evidence they provide isn’t always the most useful or the most generalizable, even if they are indeed the most credible research method in general. The scientific character of the RCTs themselves never appears to be questioned much, which raises many eyebrows. There are few studies on whether participation in the RCT itself distorts the results, which would make external validity even weaker, and usually little consideration of what an appropriate counterfactual is. There are also concerns about whether RCTs will cause over-reliance on specific metrics, or will force participants to “teach to the test”.
Voting for the randomistas
Finally, there’s a fairly crucial issue: are RCTs useful for policymakers? Randomista institutions, such as Duflo and Banerjee’s J-PAL, have influencing real-world policy as a key goal of their research agenda. This question can be broken into three parts: “do RCTs ask policy-relevant questions?”, “are the answers actually useful?”, and “do policymakers actually listen to the randomistas?”.
The first question - do RCTs ask policy-relevant questions? - is quite hard. By definition, they can only tackle policy-relevant questions that can be answered with an RCT, leaving quite a narrow field to choose from - in fact, just under 5% of development-related questions can be answered this way. Beyond the extremely limited domain of potential policies (which is a major issue), there’s also the problem that an RCT is incapable of answering whether a program has worthwhile goals; it can only answer whether those goals are effectively achieved.
The second issue is, once again, the question of internal and external validity: are the results useful? By and large, it depends: policymakers want to know more than the average treatment effect (a self-explanatory concept, TBF) and whether the program passes a cost-benefit analysis - questions such as “who benefits the most”, “how much do group X and group Y benefit”, and “over what timeframe are benefits fully accrued” need really careful and mindful design, which is not always the case. In addition, whether the results scale at all, and how to implement the RCT-based policy at a larger scale, is not always fully clear and is not always, by itself, something the “RCTers” can work out. A final, major issue is non-compliance and attrition - whether everyone assigned to the “treated” group actually receives the treatment and stays in the study - which is not always discussed. Fundamentally, the problem for RCTs is that context seems to matter a lot, even when the heterogeneity of impacts (which pertains to internal validity) can be accounted for. The results from “gold standard” RCTs might even be worse than good ole’ regressions!
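To illustrate why the average treatment effect alone can hide exactly what a policymaker cares about, here is a toy simulation (synthetic data, hypothetical group labels, nothing drawn from any real study) where the headline ATE looks healthy while one subgroup barely benefits:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2_000

# Synthetic population with a hypothetical subgroup label.
df = pd.DataFrame({
    "group": rng.choice(["landless", "smallholder"], size=n),
    "treated": rng.random(n) < 0.5,
})

# Hypothetical effects: large for smallholders, close to zero for the landless.
true_effect = np.where(df["group"] == "smallholder", 4.0, 0.2)
df["outcome"] = rng.normal(10, 3, size=n) + true_effect * df["treated"]

# The headline ATE averages over very different subgroup effects...
ate = df.loc[df["treated"], "outcome"].mean() - df.loc[~df["treated"], "outcome"].mean()
print(f"Overall ATE: {ate:.2f}")

# ...which only subgroup estimates reveal.
for name, g in df.groupby("group"):
    sub = g.loc[g["treated"], "outcome"].mean() - g.loc[~g["treated"], "outcome"].mean()
    print(f"{name}: {sub:.2f}")
```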
Lastly, the third question: do RCTs provide answers that governments will listen to? Not necessarily. Conducting a large-scale experiment requires approval from government officials, who may have corrupt motivations for allowing such experimentation. Officials in Kenya, for instance, regard the experiments as more or less another source of services that residents can get for free. Additionally, the researchers themselves may be biased towards action: there are clear incentives to under-invest in creating reliable impact-evaluation knowledge, because having credible estimates of policy impacts could discourage adopting certain pre-preferred policy actions. Another problem is the lack of a positive (i.e. actionable) model of policy and policy change - RCT recommendations may not be relevant because the researchers do not understand how policy actually comes to be in the context where it would be applied.
It’s about ethics in randomista journals
Now onto the paper that motivated this (honestly far too long) post. On May 21st, economist Jingyi Qiu posted this paper, written with Yan Chen, Alain Cohn, and Al Roth, on Twitter (blog post by Roth here).
The methodology is simple: they conducted a field experiment on Twitter to test whether getting endorsed by a top economics professor benefits your job market chances. To do this, they took a group of 519 job market papers (the papers PhD economists write at the end of their program in order to get jobs) and randomly assigned half of them to receive favorable quote tweets from prominent economists, and half to not get anything3. Importantly, the authors agreed to participate in parts of the study, for example by submitting data on their JMPs and their academic and personal backgrounds. The authors were stratified by whether they belonged to underrepresented groups (women, people of color, LGBT+), the rank of their academic institution of origin, and nationality; sorted according to the field their PhD advisors were in; and matched to an economist in the “booster” set for that field. Authors from the underrepresented groups were given a higher chance of treatment, so that the effects on diversity could be studied more closely. Then, over the course of a month, each booster economist quote tweeted three of the studies assigned to them - and a combination of Twitter engagement stats and job market outcome surveys was collected for both groups.
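In essence, that assignment mechanism is stratified randomization with unequal treatment probabilities. The paper’s actual procedure (matching within advisors’ fields, stratifying on several characteristics) is more involved than this; the sketch below only illustrates the unequal-probability part, with hypothetical fields and probabilities:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical candidate list; the real study stratified on more characteristics
# and matched papers to boosters within the advisors' fields.
papers = pd.DataFrame({
    "paper_id": range(8),
    "field": ["labor", "labor", "macro", "macro", "micro", "micro", "dev", "dev"],
    "underrepresented": [True, False, True, False, False, True, False, True],
})

# Illustrative probabilities only: underrepresented authors get a higher
# chance of being assigned to the "boosted" (treatment) arm.
p_treat = np.where(papers["underrepresented"], 0.6, 0.4)
papers["boosted"] = rng.random(len(papers)) < p_treat

# Boosted papers would then be handed to a booster working in the same field.
print(papers.sort_values("field"))
```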
The results were rather unsurprising: the promoted posts got substantially more engagement than the non-promoted ones (a 442% increase in views, and a 303% increase in likes), an effect that was non-significantly bigger for authors from underrepresented backgrounds. Regarding job market outcomes, the treatment group received 1.2 more interviews than the control group (17.8 versus 16.6), a 7% increase - which is not statistically significant, largely due to timing. Looking at “flyouts”, a key step where interviewees are flown out to give a talk at the university campus, the authors of boosted papers got one more flyout offer than the non-boosted authors (6.4 versus 5.4), a statistically significant 19% increase. Finally, regarding job offers, the treatment group received 0.4 more offers than the control group average of 3, a not-very-significant 13% increase. There was no significant effect on pay or satisfaction, or on differences between tenure-track and non-tenure-track outcomes. Also, strangely enough, getting assigned a more popular booster does not result in more engagement or a larger effect, probably because their bigger audiences include more non-academics (source: I’m Twitter friends with a Succession meme account).
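For reference, the headline percentages follow directly from the reported group means:

```latex
\frac{17.8 - 16.6}{16.6} \approx 7\%,
\qquad
\frac{6.4 - 5.4}{5.4} \approx 19\%,
\qquad
\frac{3.4 - 3.0}{3.0} \approx 13\%
```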
For members of underrepresented groups, there were non-significant positive differences in base rates (i.e., superficially better outcomes), but it is possible this is explained by the DEI policies in place in 2022/23, and by selectivity resulting in more (unobservable) ability. In terms of treatment effects, there was a smaller and less significant impact on flyouts, but a larger, still non-significant one on offers. For women specifically, there actually is an effect on job offers: a 33% hike (3.9 versus 3 offers), with a reasonable degree of significance.
Okay, so the finding is that more social media visibility on academic Twitter benefits academics. How did people react? Not very well - many questioned the interpretation of the results and the usefulness of the study, but mostly the ethics: the paper may not have received an institutional sign-off, and the explicit and random prioritization of one set of papers over another rubbed people the wrong way. The participants, lastly, did not actually fully consent to having their papers experimented with. And of course, since the Job Market Helper account itself was quite large by Twitter standards, the entire sample was effectively prioritized over all the other potential JMPs that weren’t in the study.
The backlash gets at the ethical concerns many have with RCTs in general - for example, take this piece by The Economist’s 1843 magazine4, which also generated tons of social media buzz: recipients of the program felt it was unfair that they were the ones benefitting from the aid, and in some cases shared it (violating a core assumption of RCTs, called SUTVA - that one participant’s treatment does not spill over onto anyone else’s outcome). In fact, the vast majority of people consider RCTs immoral, thinking it inappropriate to play dice with outcomes. This criticism considers it immoral that the best possible treatments are potentially being withheld in order to learn lessons, most of which are somewhat obvious - and where, for instance, the benefits to RCT participants could in fact come at the expense of people who are not participating in the RCT.
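Formally, the no-interference part of SUTVA says that a participant’s potential outcome depends only on their own treatment assignment, not on anyone else’s - which is exactly what sharing the aid with people in the control group breaks:

```latex
Y_i(t_1, t_2, \dots, t_N) = Y_i(t_i) \quad \text{for every participant } i
```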
Additionally, there are many issues with informed consent (people need to know they are part of a trial), and with a concept known as equipoise (genuine uncertainty about whether the treatment works). The best-known example of an ethically suspect dev econ RCT was the sad incident where economists shut down water services to half of a subset of people who hadn’t paid their water bills, to figure out whether they would start paying again or not. The move was widely criticized for intentionally inflicting harm on the participants (in a very impoverished country with remarkably bad public services, no less) to reach rather obvious conclusions - that people threatened with disconnection pay their bills. There are also other issues, such as corruption in project management, projects kicking out participants to boost their stats, field workers subverting the experiments, falsified data, and (worst of all) rampant sexual abuse at schools run by international organizations.
Conclusion
RCTs: the only good way to do economics, or Satanic Maoism pretending to be science? Well, neither. They obviously answer some questions quite credibly, clearly cannot answer others, and the answers aren’t always perfectly useful. There is a lot of debate within economics about their application, but the only reasonable stance is “it depends on the study”. Serious ethical concerns do exist, but they can be addressed, and on many occasions have been: a variety of changes to the RCT process itself, alongside transparency and improved results sharing, could really change how RCTs are perceived both in the field and by their recipients.5 Ultimately, why projects work is a better question than whether they work.
Shoutout to longtime Twitter mutual Rachel Meager and their very good, usually not econ-focused, Substack. They’re really cool!
Actually, ZUSHA! is no longer a top-rated charity, it seems
To avoid the issue of the effect coming from any social media promotion at all, rather than from the professors themselves, they had a dedicated account collate and post all 519 papers - so every paper got the same baseline exposure - and the boosters quote tweeted the treated ones from there. The account, Econ Job Market Helper, has over 2,000 followers.
I don’t know if this disclaimer is really necessary but the author of the piece, Linda Kinstler, spoke to me a few weeks later for another article for a different medium.
Another proposal to deal with the fundamental “unfairness” of randomized assignment (besides just giving everyone the treatment at the end of the study) is to more or less create a pseudo-market for being assigned treatment.