Academic studies in the social sciences often find very different results. Even in disciplines like medicine, where one might imagine there to be a direct, physical relationship between the intervention being tested and its consequences, results can vary — but many think the situation is worse in the social sciences. This is because the relationship between an intervention and its effects may depend on multiple factors, and differences in context or implementation can have a large impact on the studies’ results.
There are other reasons that studies might report different effects. For one, chance errors could affect a study’s results. Researchers may also consciously or subconsciously bias their results. All these sources of variability have led to fears of a “replication crisis” in psychology and other social sciences relevant to business. Given this variability, how should we consume evidence?
The immediate answer is to not rely too much on any one study. Whenever possible, look for meta-analyses or systematic reviews that synthesize results from many studies, as they can provide more-credible evidence and sometimes suggest reasons why results differ.
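To make the idea of synthesizing results concrete, here is a minimal sketch of one common pooling approach, an inverse-variance (fixed-effect) weighted average. It is an illustration, not the method of any particular review: the function name `pooled_estimate` and the effect sizes and standard errors are hypothetical, and real meta-analyses often use random-effects models that also allow the true effect to differ across studies.

```python
import math

def pooled_estimate(effects, std_errors):
    """Inverse-variance weighted average of study effects.

    Assumes every study estimates the same underlying effect
    (a fixed-effect model), which is a simplification.
    """
    # More precise studies (smaller standard error) get more weight.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se

# Hypothetical inputs: a small, noisy study with a large effect
# and a larger, more precise study with a modest effect.
pooled, pooled_se = pooled_estimate(effects=[0.5, 0.1], std_errors=[0.2, 0.1])
print(f"pooled effect = {pooled:.2f}, standard error = {pooled_se:.3f}")
```

With these made-up inputs the pooled estimate sits much closer to the precise study's 0.1 than to the noisy study's 0.5, which is the sense in which a synthesis keeps any one study from dominating.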
When considering how much weight to give a study and its results, pay attention to its sample size. Studies are particularly likely to fail to replicate if they were based on a small sample. The most positive and negative results are often those with the smallest samples or widest confidence intervals. Smaller studies are more likely to fail to replicate in part due to chance, but effects may also be smaller as sample size increases, for several reasons. If the study was testing an intervention, there may be capacity constraints that prevent high-quality implementation at scale. For example, if you were testing out a training program, you might not need to hire any full-time personnel to run it — but if you were to expand the program, you would need to hire new staff, and they may not run it quite as well.
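The role of chance in small samples can be illustrated with a short simulation. This is only a sketch under stated assumptions: the true effect (0.2), the noise level, the sample sizes, and the function name `simulate_estimates` are all hypothetical choices, and every simulated study measures exactly the same underlying effect.

```python
import random
import statistics

def simulate_estimates(true_effect, sample_size, n_studies, rng):
    """Return the estimated effect from each of n_studies simulated studies."""
    estimates = []
    for _ in range(n_studies):
        # Each observation is the true effect plus standard normal noise.
        sample = [rng.gauss(true_effect, 1.0) for _ in range(sample_size)]
        estimates.append(statistics.mean(sample))
    return estimates

rng = random.Random(0)  # fixed seed so the run is repeatable
small = simulate_estimates(0.2, sample_size=20, n_studies=300, rng=rng)
large = simulate_estimates(0.2, sample_size=500, n_studies=300, rng=rng)

spread_small = statistics.stdev(small)
spread_large = statistics.stdev(large)
print(f"small studies: min={min(small):.2f}, max={max(small):.2f}")
print(f"large studies: min={min(large):.2f}, max={max(large):.2f}")
```

Even though nothing differs between the two groups except sample size, the small studies should produce both the most positive and the most negative estimates, simply because their averages are noisier.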
Smaller studies also often target the exact sample that would yield the biggest effects. There’s a logic to this: If you have a costly intervention that you can allocate to only a few people, you might perform triage and allocate it to those who could benefit from it the most. But that means the effect would likely be smaller if you implemented the intervention in a larger group. More generally, it can be helpful to think about what things might be different if the intervention were scaled up. For example, small interventions are unlikely to affect the broader market, but if scaled up, competitors or regulators might change their behavior in response.
Similarly, consider peculiarities of the sample, context, and implementation. How did the researchers come to study the people, firms, or products they did? Would you expect this sample to have performed better or worse than the sample you are interested in? The setting could have affected the results too. Was there anything special about the setting that could have made the results larger?
If the study was evaluating an intervention, how that intervention was implemented is very important. For example, suppose you hear that reminder messages can improve attendance at appointments. If you were considering implementing a reminder system, you would probably want to know the frequency of the messages the researchers sent and their content in order to gauge whether you might have different results.
You may also have more confidence in the results of a study if there is some clear, causal mechanism that explains the findings and is constant across settings. Some results in behavioral economics, for instance, suggest that certain rules of human behavior are hardwired. Unfortunately, these mechanisms can be very hard to uncover, and many findings in behavioral economics that initially seemed to reflect a hardwired rule, such as a finding that happiness increases patience, have failed to replicate. Nonetheless, if there is a convincing reason to expect the results that a study has found, or a strong theoretical reason to expect a particular result to generalize, that should lead us to trust the results more.
Finally, if it sounds too good to be true, it probably is. This might sound like a cliché, but it’s based on a principle from Bayesian statistics: Stranger claims should require stronger evidence in order to change one’s prior beliefs, or “priors.” If we take our priors seriously (and there is reason to believe that, on average, humans are pretty good at making many kinds of predictions), then results that seem improbable actually are less likely to be true. In other words, holding constant the significance of a result or the power of a study, the probability that a reported finding is a “false positive” or “false negative” varies with how likely we thought the claim was to be true before hearing the new evidence.
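The arithmetic behind this point follows from Bayes' rule. The sketch below assumes a simplified setup in which a claim is either true or false, a study detects a true claim with probability equal to its power, and it flags a false claim as significant with probability equal to its significance threshold. The function name and the specific priors, power, and threshold are hypothetical values chosen for illustration.

```python
def prob_true_given_positive(prior, power=0.8, alpha=0.05):
    """Probability that a claim is true given a statistically significant result.

    prior: how likely we thought the claim was before the study
    power: chance the study detects the effect if it is real
    alpha: chance the study reports an effect when there is none
    """
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)

# Same study quality, different prior plausibility (hypothetical values).
plausible = prob_true_given_positive(prior=0.5)
surprising = prob_true_given_positive(prior=0.05)
print(f"plausible claim:  {plausible:.2f}")
print(f"surprising claim: {surprising:.2f}")
```

Under these assumptions, a significant result for a claim we gave even odds is true about 94% of the time, while the same result for a claim we gave 1-in-20 odds is true only about 46% of the time. The evidence is identical; only the prior differs.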
This article emphasizes the importance of drawing on many studies, rather than relying too much on any one study. What if there haven’t been many studies? If that’s the case, you may wish to consider other sources of evidence, such as advice or predictions from others. Just as with social science research, you may get conflicting advice, but aggregated forecasts can be quite accurate. However, make sure your sources are not relying on the same information; research has found that people are subject to “correlation neglect,” so that when multiple experts or news outlets base their reports on the same study, people incorrectly treat those sources as independent and end up over-weighting that study’s results.
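One way to guard against correlation neglect is to group reports by the evidence they rest on before averaging them. The following is a minimal sketch with entirely hypothetical sources, study labels, and estimates. It assumes you can tell which underlying study each source is drawing on, which in practice may take some digging.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical reports: (source, underlying study, estimated effect).
reports = [
    ("Outlet 1", "study_A", 0.6),
    ("Outlet 2", "study_A", 0.6),
    ("Expert 1", "study_A", 0.6),
    ("Expert 2", "study_B", 0.2),
]

def naive_average(reports):
    """Treat every source as independent evidence."""
    return mean(estimate for _, _, estimate in reports)

def average_by_study(reports):
    """Collapse sources that rely on the same study, then average across studies."""
    by_study = defaultdict(list)
    for _, study, estimate in reports:
        by_study[study].append(estimate)
    return mean(mean(values) for values in by_study.values())

print(f"naive average:     {naive_average(reports):.2f}")
print(f"per-study average: {average_by_study(reports):.2f}")
```

In this made-up example, three of the four sources repeat the same study, so the naive average is pulled toward that study's estimate. Counting each underlying study once gives it no more weight than the single independent source.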
Overall, trust a mix of your experience and the evidence, but be careful not to be overconfident in your assessments. Most people could benefit from giving the evidence more weight, even when results vary.