"Replication crisis" in science

Comment from r/medicine.

I hate to be the one to say this.

But is it any wonder that the general public has lost so much faith in science? And other institutions?

We all know that even in the face of something like this, science is still better than the alternatives. But for everyday people going about their lives, how are they supposed to know that? How are they supposed to trust an institution that preaches "follow the data" then can't replicate even half of the data they want everyone to follow? And the replication crisis is even worse in some other academic fields.

We are better than the alternative. But we are also under greater scrutiny because we claim to have science on our side. So it falls to us to make sure that what we are practicing is science and not fabrication.

"Publish or perish" has got to go. We've taken the capitalist version of academia far past its ideal level and well into the stage where competitive pressure is crushing scientific integrity as a motivation. Our academic systems don't just allow fabricators to exist (as they always have), we are selecting for them by rewarding quantity over quality.

I don't know how to get rid of it, and I wish I did.
 
Just stumbled on this. How true is this? They claim there is no evidence of a replication crisis.

We show that OSC's article contains three major statistical errors and, when corrected, provides no evidence of a replication crisis. Indeed, the evidence is also consistent with the opposite conclusion -- that the reproducibility of psychological science is quite high and, in fact, statistically indistinguishable from 100%. The moral of the story is that meta-science must follow the rules of science.

https://projects.iq.harvard.edu/psychology-replications
 
There was a project (the OSC, i.e. the Open Science Collaboration) that wrapped up around 2015; it tried to reproduce 100 published studies and concluded that replication rates are problematic:
Aarts et al. describe the replication of 100 experiments reported in papers published in 2008 in three high-ranking psychology journals. Assessing whether the replication and the original experiment yielded the same result according to several criteria, they find that about one-third to one-half of the original findings were also observed in the replication study.

The paper you referenced (Gilbert et al.) criticized the methodology of this project. They argue that the authors of the OSC project didn't account for three factors that can explain the differing results, and that if these three aspects are taken into account, the results wouldn't support the conclusion that reproducibility is an issue.

There are also responses to Gilbert et al. arguing that these authors themselves ignore various relevant aspects, and so on. The original study has countless citations. It's hard to judge the validity unless you want to go down a rabbit hole of technical details.

To give one example:
Gilbert et al. argue that several of the replication studies differed from the original studies, and they list various examples. Here is a passage from their comment:
In fact, many of OSC’s replication studies differed from the original studies in other ways as well. For example, many of OSC’s replication studies drew their samples from different populations than the original studies did. An original study that measured American’s attitudes toward African Americans (3) was replicated with Italians, who do not share the same stereotypes; an original study that asked college students to imagine being called on by a professor (4) was replicated with participants who had never been to college; and an original study that asked students who commute to school to choose between apartments that were short and long drives from campus (5) was replicated with students who do not commute to school. What’s more, many of OSC’s replication studies used procedures that differed from the original study’s procedures in substantial ways: An original study that asked Israelis to imagine the consequences of military service (6) was replicated by asking Americans to imagine the consequences of a honeymoon; an original study that gave younger children the difficult task of locating targets on a large screen (7) was replicated by giving older children the easier task of locating targets on a small screen; an original study that showed how a change in the wording of a charitable appeal sent by mail to Koreans could boost response rates (8) was replicated by sending 771,408 e-mail messages to people all over the world (which produced a response rate of essentially zero in all conditions). All of these infidelities are potential sources of random error that the OSC’s benchmark did not take into account.

At first this might sound quite bad and fishy. Why would the people who replicate studies change the original protocols? The answer is that reproduction isn't always as easy as it seems. Gilbert et al. omit that several of these changes were endorsed as improvements by the original authors of the studies being replicated. So changes in protocols aren't always a problem: sometimes they are a legitimate concern and sometimes they are warranted. One would have to look at every single study to evaluate whether the changes are a problem, and nobody has time for that. Still, if you pick this as a point of contention, that part should probably be included and discussed fairly.

Gilbert et al. raise the legitimate point that replication studies can be equally challenging and flawed. They are never the "objective last word" and have to be scrutinized. The big picture, though, is that there are so many red flags about reproducibility that their conclusion isn't convincing. A generous interpretation is that they highlight some weaknesses of the OSC study that should be taken into account in future research.
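To make that statistical point a bit more concrete, here is a rough toy simulation (my own sketch, not anything from OSC or Gilbert et al.; the effect size, sample sizes, and threshold are arbitrary assumptions) of how often an exact replication of a perfectly real effect still comes out non-significant purely because of sampling error:

```python
# Toy simulation (illustrative only): even exact replications of a true effect
# "fail" by the significance criterion some of the time, purely from sampling error.
# The effect size, per-group n, and alpha below are arbitrary assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d = 0.4          # assumed true standardized mean difference
n_per_group = 50      # assumed sample size per group in each replication
alpha = 0.05
n_replications = 10_000

failures = 0
for _ in range(n_replications):
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(true_d, 1.0, n_per_group)
    _, p = stats.ttest_ind(treatment, control)
    if p >= alpha:
        failures += 1

print(f"'Failed' replications despite a real effect: {failures / n_replications:.1%}")
# With these (made-up) numbers, roughly half of the replications are non-significant,
# so a low replication rate is not, by itself, proof that the originals were wrong.
```

Of course, the real dispute is over whether the OSC benchmark handled exactly this kind of random error properly, which is what the back-and-forth between the two camps is arguing about.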
 
But how valid is this claim?
Indeed, the evidence is also consistent with the opposite conclusion -- that the reproducibility of psychological science is quite high and, in fact, statistically indistinguishable from 100%
 
But how valid is this claim?

If one accepts their methodological corrections, which I wouldn't, this claim would be true if reproducibility is understood in a narrow way ("is there an effect"). They don't address the fact that the OSC study also found that effect sizes were substantially smaller in the replication studies.
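To see why shrinking effect sizes are what you'd expect anyway, here is a rough, made-up simulation of publication bias (my own sketch with arbitrary numbers, not taken from either paper): if only "significant" originals get published, the published effect sizes are inflated, and honest replications of the same true effect regress back toward it.

```python
# Toy simulation (illustrative only) of effect-size shrinkage under publication bias.
# The true effect, sample size, and alpha are arbitrary assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d, n = 0.2, 30          # assumed small true effect, small original samples
alpha, trials = 0.05, 20_000

published, replication = [], []
for _ in range(trials):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_d, 1.0, n)
    t, p = stats.ttest_ind(b, a)
    if p < alpha and t > 0:   # only "significant" originals get published
        observed_d = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        published.append(observed_d)
        # an honest replication of the same true effect, with the same n
        a2 = rng.normal(0.0, 1.0, n)
        b2 = rng.normal(true_d, 1.0, n)
        replication.append((b2.mean() - a2.mean()) / np.sqrt((a2.var(ddof=1) + b2.var(ddof=1)) / 2))

print(f"true effect:                  d = {true_d}")
print(f"mean published effect size:   d = {np.mean(published):.2f}")    # inflated
print(f"mean replication effect size: d = {np.mean(replication):.2f}")  # shrinks back toward the truth
```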
 
Replicate this :lol:



Psychology is bad, but there's also a lot of nonsense in biology that is allowed to go through because of vast sample sizes rather than any actual science. (In this paper, for clarity, the authors are claiming that that distribution tilts towards one end, and those blue/pink are significantly different from a horizontal line...)
 
Skimming through the paper, this is a buzzword paradise for someone without background knowledge. I am surprised that concepts like fluid vs. crystallized intelligence are considered reputable.
 
The deeper I go into this, the more convinced I get that outside of math and physics, most research papers are junk. The further you go from the fundamental sciences, the more junk the science becomes.

Of course, science is better than any alternative, no doubt there. But a large portion of science is fake as feck. Of course, in the long term, the fake papers influence the field less than the good ones, but still.
 
I have a feeling that it's been corrupted for a while, given the financial incentives for certain research/studies to be undertaken, the careers that can be had, etc. Once the system is set up like this, humans will always bend things in their favour.
 
Yep. The issue is that 'fame' is based on citation counts and the h-index. The more famous someone is, the better their chances of getting grants. The more grants, the more postdocs and PhD students, which in turn means more papers and thus more citations and a higher h-index. PhD students get pushed to death to get those two or three papers that earn them the PhD, and no one really does long-term work. It is always a push to get the paper out, even if deep down they know the paper has no impact at all.
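For anyone not familiar with the metric: the h-index is just the largest h such that you have h papers with at least h citations each. A minimal sketch (the citation counts below are made up):

```python
# Minimal sketch of the h-index: the largest h such that the author has
# h papers with at least h citations each. The citation counts are made up.
def h_index(citations: list[int]) -> int:
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

print(h_index([50, 18, 7, 6, 5, 3, 1, 0]))  # -> 5 (five papers with >= 5 citations each)
```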

Long gone are the days when people spent a decade on something. Nowadays it is mostly about working on a project for the next 3-6 months.

NB: My experience is mostly in the AI field, where it is like this, but from talking with other people, it is hardly much different in other fields.
 
Definitely true in the field I work in (Machine Learning / Computer Vision).

A lot of papers don't have code at all, which makes them non-reproducible by default. Many others have code, but good luck getting the results they claim. And of the remaining ones, many are reproducible, but the results come not from the main idea of the paper but from a lot of engineering to push the results past state-of-the-art. I know I have done this for a couple of my papers, because anything short of state-of-the-art means a reject, and everybody does it, which makes it the only way to actually get your idea published. I think this is a symptom of a (to some degree) broken system: people are incentivised to publish as much as they can, not necessarily to really push the science forward. Postdoc and tenure-track positions are primarily based on the number of top-venue papers, as are offers from the big tech companies. Doing meaningful research is a distant second.

Nevertheless, I think the field has progressed massively, because when there are so many top-venue papers each year, some of them are bound to be good. After all, the standards to publish in top venues are very high, and even if 99% of them don't do anything in the grand scheme of things, the 1% is going to push the science forward. There has also been a push to at least publish the code, which improves reproducibility and makes cheating on results more difficult (it's hard to cheat when your code is online). Nowadays, around 80% of top-venue papers have code online; just a few years ago, most of them didn't.

That said, while I think this is a big problem in my field, it is far better than in others. In medicine or the social sciences, some of the experiments are completely irreproducible. The venues not being double-blinded also means that at least some of the decision is based on the authors' reputation, not on the science itself. With results not being properly reproducible, it also gets easy to cheat. And at least in neuroscience (I also work a bit at the intersection of ML and neuroscience), the quality of some top-venue papers I read is completely appalling: Nature papers that I would not have accepted as a Master's thesis.

It is much better in theoretical fields, though. For example, on the rare occasions I check theoretical physics papers, they are top-notch (as far as I can understand them). But then, I guess, by definition they cannot have reproducibility issues.

-------

TLDR: Most research, despite being peer-reviewed and published at top-tier venues (in general, every field has just a few journals/conferences that are top-tier; the rest is somewhere between junk and not worth reading), is probably non-reproducible and generally useless.
Very interesting.
 
what is your take on behavioral economics as an area of study?
Not asking me, but economics is shaman-like smoke-blowing. Beyond the basic structural stuff, which is very useful, you're dealing with probabilistic models and modellers who do not have a clue what they are doing, even when they are supposedly top-tier economists (and they don't know this). It's an art, not a science, and people often forget that. The only scientific part of it is the structural, logical stuff, which excludes essentially every single economic model ever devised.

Behavioral economics is literally a joke. Economics itself, however, is an art disguised as a science in 90% of the cases I've ever witnessed. Fortune tellers with statistics. Weathermen with better probability charts. It depends on whether you're into economic and sociological analysis at the structural level, which, logically, is about as scientific as it gets, or into stock-market, quant-type stuff, which is wind-pissing based on whichever shaman has the best run of winning horses.
 
what is your take on behavioral economics as an area of study?

I think it's at times interesting and potentially useful, but its use is in tweaking standard models rather than as a standalone thing. It's not the marginal revolution mk. 2, it's not going to change everything, and the initial hype and claims were too much. The pop-sci stuff that has reached the mainstream, both in book sales and policy-wise, is largely bullshit. Obviously, fraud like this Ariely thing is very damaging for credibility, and then you have Nudge, which is crap science. I'd consider Ariely more psychology than economics, but his stuff was important for the early development of BE, and of course Thaler (Nudge) won the Nobel Memorial Prize, so it's not a good look.
 
So this whole “crisis in scientific research” thing seems to be getting worse and worse.

Link

Watchdog groups – such as Retraction Watch – have tracked the problem and have noted retractions by journals that were forced to act on occasions when fabrications were uncovered. One study, by Nature, revealed that in 2013 there were just over 1,000 retractions. In 2022, the figure topped 4,000 before jumping to more than 10,000 last year

The startling rise in the publication of sham science papers has its roots in China, where young doctors and scientists seeking promotion were required to have published scientific papers. Shadow organisations – known as “paper mills” – began to supply fabricated work for publication in journals there.

Can only assume that LLM technology is going to supercharge these paper mills, making the problem even worse.
 
Just what we need as we inch into our post-truth era.

you don’t need scientists. just subscribe to my youtube channel ‘rimaldo’s red-hot reckons.’
 
We may have improved our balance and we are quicker to get up when we faceplant, but at the end of the day we are still just stumbling forwards like drunken fools.
 
'Publish or perish' culture blamed for reproducibility crisis
Nearly three-quarters of biomedical researchers think there is a reproducibility crisis in science, according to a survey published in November. The leading cause cited for that crisis was “pressure to publish”.
Sixty-two per cent of respondents said that pressure to publish “always” or “very often” contributes to irreproducibility, the survey found.
https://www.nature.com/articles/d41586-024-04253-w
 
Is the TLDR that people in academia typically have unrealistic standards for research publications that are linked to promotion and financial incentive pathways, so they end up making up a bunch of fraff to pump numbers so it looks better?
 
It may just be the wording of what you posted (I can't access Nature on my non-work computer, so I'm going by the blurbs above), but it reeks of excuse-making. Sure, the pressure to publish is massive (it was that way almost 2 decades ago when I was in academia), but the decision to publish fake or misleading data is a choice, and no blame attaches to anyone other than the person committing the fraud. I would add that the partner in crime to the replication issue is the "first over the line" issue. The acceptance of the first published work as the "gold standard" is dangerous, as it makes it that much harder to address the replication issue.

I saw it when I was at the bench, and it was infuriating, especially when that work was counter to mine and had already been published (a month before my manuscript was to be submitted). That meant I was the one with the onus to completely disprove the already published work, since it was the "gold standard" simply by having been published first. Was that work fraud? Probably not, but it was shoddy work that our lab could not reproduce. Did it cause harm? Probably not, but it did confuse our little pocket of the infectious disease world for a bit. What it did do was lead to me spending a full year performing experiments to show that the work that led to their conclusion was faulty and not reproducible (made infinitely more difficult by a poorly written methods section... but the widespread presence of that practice is a different rant).

Pressure to perform exists in all industries and careers, but few have greater potential for catastrophic downstream outcomes than fraud in the biomedical community. Whether it's resources and time wasted pursuing unvalidated drug and treatment targets, or harm caused by the actual release of those, there are real consequences.
 
Is the TLDR that people in academia typically have unrealistic standards for research publications that are linked to promotion and financial incentive pathways, so they end up making up a bunch of fraff to pump numbers so it looks better?
Yes
 
I agree that, ultimately, the scientist is to blame for deciding to commit fraud; but it's also true that the pressures and incentives are perverse, especially in China (from what I understand). But yeah, better classes on research ethics and the possible societal implications of fraud would be welcome.

That doesn't fully address the issues with the psychological sciences, though; it seems they also need proper classes in methodology (I mean, if you think p-hacking is acceptable science...).
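Since p-hacking came up, here is a rough toy simulation (my own, with arbitrary numbers) of one common flavour of it: measure many unrelated outcomes on pure noise, then report whichever comparison happens to come out "significant". The false-positive rate blows up well past the nominal 5%.

```python
# Toy simulation (illustrative only): p-hacking by testing many outcomes on pure noise.
# The number of outcomes, sample size, and alpha are arbitrary assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, n_outcomes, alpha, experiments = 40, 20, 0.05, 5_000

false_positive_runs = 0
for _ in range(experiments):
    group_a = rng.normal(0.0, 1.0, (n, n_outcomes))   # no real effect on any outcome
    group_b = rng.normal(0.0, 1.0, (n, n_outcomes))
    _, pvals = stats.ttest_ind(group_b, group_a, axis=0)
    if (pvals < alpha).any():                          # report "a finding" if anything is significant
        false_positive_runs += 1

print(f"Experiments reporting a 'significant' result from pure noise: "
      f"{false_positive_runs / experiments:.0%}")      # roughly 1 - 0.95**20, i.e. about 64%
```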
 
Is the TLDR that people in academia typically have unrealistic standards for research publications that are linked to promotion and financial incentive pathways, so they end up making up a bunch of fraff to pump numbers so it looks better?

The standards are not unrealistic per se, but the number of papers, and the impact factor of the journals they are published in, are indeed the metrics by which promotion/tenure are measured. I think there are 2 main issues at play here:

1. Favoritism: Peer-reviewed journals are supposed to exist to ensure that the most impactful work is elevated. The issue is that the review groups selected for journals are usually made up of investigators who are experts in that field of study, and usually those people fall into one of two buckets: rivals and friends. You can usually request that your rivals are not on the review committee, which usually means your reviewers are going to be made up, in part, of people who are less likely to go at you hard. This also means that less-connected researchers will have a harder time breaking into higher-impact journals unless they are submitting work with extraordinary claims. I think you can see where that leads.

2. Ego: Never, ever, ever underestimate how much ego drives academics. The most egotistical people I have ever worked with are not biotech CEOs or executives; nope, it's PIs, whether they run a lab of 2 people or a lab of 200.
 
Yeah, to clarify, I had meant unrealistic in terms of the quantity of submissions, or the timeframe.
 
1) This is so true in many fields, and the solution is beyond simple. It already happened in AI fields like 15 years ago, but for whatever reason, no other field is doing it.

Make the peer-review system double-blind. The authors do not know who is reviewing their paper. The reviewers do not know whose paper they are reviewing. The area chair / action editor might know the reviewers but does not know the authors. The authors must declare domain conflicts, which means that the reviewers and the action editor cannot be from the same institution as the authors, or from institutions they work closely with.

Far from perfect, but it minimizes the amount of scientific cheating that goes on in the review process (a toy sketch of such a conflict check follows point 2 below).

2) Agree. The amount of fighting that goes on over who gets last authorship on collaborative papers is insane, even in cases where the egomaniac PI has done the grand total of feck all.
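To make the conflict-declaration part of point 1 concrete, here is a minimal hypothetical sketch (the names, institutions, and exact rule are invented for illustration; this is not any venue's actual system) of the kind of domain-conflict filter a double-blind system can apply before assigning reviewers:

```python
# Hypothetical sketch of a domain-conflict check in double-blind review.
# Names, institutions, and the conflict rule itself are invented for illustration.
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    institution: str
    collaborators: frozenset[str]   # institutions they work closely with

def has_conflict(reviewer: Person, author: Person) -> bool:
    """A reviewer is conflicted if they share an institution with an author
    or closely collaborate with the author's institution (and vice versa)."""
    return (reviewer.institution == author.institution
            or author.institution in reviewer.collaborators
            or reviewer.institution in author.collaborators)

def eligible_reviewers(reviewers: list[Person], authors: list[Person]) -> list[Person]:
    # The system filters on declared conflicts; reviewers never learn who the authors are.
    return [r for r in reviewers if not any(has_conflict(r, a) for a in authors)]

authors = [Person("A. Author", "Uni X", frozenset({"Lab Y"}))]
reviewers = [
    Person("R. One", "Uni X", frozenset()),            # same institution -> conflicted
    Person("R. Two", "Lab Y", frozenset()),            # close collaborator -> conflicted
    Person("R. Three", "Institute Z", frozenset()),    # no conflict
]
print([r.name for r in eligible_reviewers(reviewers, authors)])  # -> ['R. Three']
```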

———

Depending on the field, standards are beyond unrealistic. Again talking about AI, because that's what I am familiar with: to get accepted into a PhD at a top university nowadays, you must have 2+ first-author top-tier papers. Which is insane, because five years ago that was enough to get the PhD degree. China is extremely competitive; there are students who managed to get 3-5 first-author papers during their undergraduate studies. I mean, not very long ago that was enough for a top-tier postdoc position, and it is still enough for a research scientist role at a FAANG company.

Btw, I have an intern at that level (now a second-year PhD student) who has more top-tier papers, a higher h-index, and more citations than me. Complete insanity.
 