
Why Most Published Research Findings Are False



John P. A. Ioannidis

There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.

John P. A. Ioannidis is in the Department of Hygiene and Epidemiology, University of Ioannina School of Medicine, Ioannina, Greece, and Institute for Clinical Research and Health Policy Studies, Department of Medicine, Tufts-New England Medical Center, Tufts University School of Medicine, Boston, Massachusetts, United States of America.

Competing Interests: The author has declared that no competing interests exist.

Published: August 30, 2005

DOI: 10.1371/journal.pmed.0020124

Copyright: © 2005 John P. A. Ioannidis. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abbreviation: PPV, positive predictive value

Citation: Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8): e124

Published research findings are sometimes refuted by subsequent evidence, with ensuing confusion and disappointment. Refutation and controversy is seen across the range of research designs, from clinical trials and traditional epidemiological studies [1–3] to the most modern molecular research [4,5]. There is increasing concern that in modern research, false findings may be the majority or even the vast majority of published research claims [6–8]. However, this should not be surprising. It can be proven that most claimed research findings are false. Here I will examine the key factors that influence this problem and some corollaries thereof.

Modeling the Framework for False Positive Findings

Several methodologists have pointed out [9–11] that the high rate of nonreplication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values. Research findings are defined here as any relationship reaching formal statistical significance, e.g., effective interventions, informative predictors, risk factors, or associations. “Negative” research is also very useful. “Negative” is actually a misnomer, and the misinterpretation is widespread. However, here we will target relationships that investigators claim exist, rather than null findings.


As has been shown previously, the probability that a research finding is indeed true depends on the prior probability of it being true (before doing the study), the statistical power of the study, and the level of statistical significance [10,11]. Consider a 2 × 2 table in which research findings are compared against the gold standard of true relationships in a scientific field. In a research field both true and false hypotheses can be made about the presence of relationships. Let R be the ratio of the number of “true relationships” to “no relationships” among those tested in the field. R is characteristic of the field and can vary a lot depending on whether the field targets highly likely relationships or searches for only one or a few true relationships among thousands and millions of hypotheses that may be postulated. Let us also consider, for computational simplicity, circumscribed fields where either there is only one true relationship (among many that can be hypothesized) or the power is similar to find any of the several existing true relationships. The pre-study probability of a relationship being true is R/(R + 1). The probability of a study finding a true relationship reflects the power 1 − β (one minus the Type II error rate). The probability of claiming a relationship when none truly exists reflects the Type I error rate, α. Assuming that c relationships are being probed in the field, the expected values of the 2 × 2 table are given in Table 1. After a research finding has been claimed based on achieving formal statistical significance, the post-study probability that it is true is the positive predictive value, PPV. The PPV is also the complementary probability of what Wacholder et al. have called the false positive report probability [10]. According to the 2 × 2 table, one gets PPV = (1 − β)R/(R − βR + α). A research finding is thus more likely true than false if (1 − β)R > α. 
Since usually the vast majority of investigators depend on α = 0.05, this means that a research finding is more likely true than false if (1 − β)R > 0.05.
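
As a rough numerical check of the formula above, the following Python sketch rebuilds the expected 2 × 2 counts from R, the power, and α, then derives the PPV from those counts and compares it with the closed form. The function names and the default c = 1 are illustrative choices, not part of the original article.

```python
def expected_table(R, power, alpha=0.05, c=1.0):
    """Expected counts of the 2 x 2 table for c probed relationships."""
    beta = 1.0 - power
    n_true = c * R / (R + 1.0)   # relationships that truly exist
    n_null = c / (R + 1.0)       # relationships that do not exist
    return {
        "true_positive": n_true * (1.0 - beta),
        "false_negative": n_true * beta,
        "false_positive": n_null * alpha,
        "true_negative": n_null * (1.0 - alpha),
    }


def ppv(R, power, alpha=0.05):
    """PPV computed directly from the expected table counts."""
    t = expected_table(R, power, alpha)
    return t["true_positive"] / (t["true_positive"] + t["false_positive"])


def ppv_closed_form(R, power, alpha=0.05):
    """PPV = (1 - beta) R / (R - beta R + alpha), as stated in the text."""
    beta = 1.0 - power
    return (1.0 - beta) * R / (R - beta * R + alpha)


if __name__ == "__main__":
    # Pre-study odds from 1:1 down to 1:100 at 80% power.
    for R in (1.0, 0.1, 0.01):
        print(f"R={R:<5} PPV={ppv(R, power=0.8):.3f}")
```

The break-even point (1 − β)R = α corresponds to a PPV of exactly 0.5, which is one way to sanity-check any implementation of the formula.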

Table 1. Research Findings and True Relationships

What is less well appreciated is that bias and the extent of repeated independent testing by different teams of investigators around the globe may further distort this picture and may lead to even smaller probabilities of the research findings being indeed true. We will try to model these two factors in the context of similar 2 × 2 tables.


First, let us define bias as the combination of various design, data, analysis, and presentation factors that tend to produce research findings when they should not be produced. Let u be the proportion of probed analyses that would not have been “research findings,” but nevertheless end up presented and reported as such, because of bias. Bias should not be confused with chance variability that causes some findings to be false by chance even though the study design, data, analysis, and presentation are perfect. Bias can entail manipulation in the analysis or reporting of findings. Selective or distorted reporting is a typical form of such bias. We may assume that u does not depend on whether a true relationship exists or not. This is not an unreasonable assumption, since typically it is impossible to know which relationships are indeed true. In the presence of bias (Table 2), one gets PPV = ([1 − β]R + uβR)/(R + α − βR + u − uα + uβR), and PPV decreases with increasing u, unless 1 − β ≤ α, i.e., 1 − β ≤ 0.05 for most situations. Thus, with increasing bias, the chances that a research finding is true diminish considerably. This is shown for different levels of power and for different pre-study odds in Figure 1.
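
The bias formula can be evaluated the same way. This is a minimal sketch that transcribes the expression above term by term; the function name and the sample values of u are assumptions made for illustration.

```python
def ppv_with_bias(R, power, u, alpha=0.05):
    """PPV in the presence of bias u, transcribed from the formula in the text."""
    beta = 1.0 - power
    numerator = (1.0 - beta) * R + u * beta * R
    denominator = R + alpha - beta * R + u - u * alpha + u * beta * R
    return numerator / denominator


if __name__ == "__main__":
    # PPV at 1:1 pre-study odds and 80% power for increasing bias.
    for u in (0.0, 0.1, 0.3, 0.5, 0.8):
        print(f"u={u:.1f} PPV={ppv_with_bias(1.0, 0.8, u):.3f}")
```

With u = 0 the expression reduces to the bias-free PPV, and when the power equals α the PPV stays at the pre-study probability R/(R + 1) whatever the value of u.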

Figure 1. PPV (Probability That a Research Finding Is True) as a Function of the Pre-Study Odds for Various Levels of Bias, u

Panels correspond to power of 0.20, 0.50, and 0.80.

Table 2. Research Findings and True Relationships in the Presence of Bias

Conversely, true research findings may occasionally be annulled because of reverse bias. For example, with large measurement errors relationships are lost in noise [12], or investigators use data inefficiently or fail to notice statistically significant relationships, or there may be conflicts of interest that tend to “bury” significant findings [13]. There is no good large-scale empirical evidence on how frequently such reverse bias may occur across diverse research fields. However, it is probably fair to say that reverse bias is not as common. Moreover, measurement errors and inefficient use of data are probably becoming less frequent problems, since measurement error has decreased with technological advances in the molecular era and investigators are becoming increasingly sophisticated about their data. Regardless, reverse bias may be modeled in the same way as bias above. Also, reverse bias should not be confused with chance variability that may lead to missing a true relationship because of chance.

Testing by Several Independent Teams

Several independent teams may be addressing the same sets of research questions. As research efforts are globalized, it is practically the rule that several research teams, often dozens of them, may probe the same or similar questions. Unfortunately, in some areas, the prevailing mentality until now has been to focus on isolated discoveries by single teams and interpret research experiments in isolation. An increasing number of questions have at least one study claiming a research finding, and this receives unilateral attention. The probability that at least one study, among several done on the same question, claims a statistically significant research finding is easy to estimate. For n independent studies of equal power, the 2 × 2 table is shown in Table 3: PPV = R(1 − β^n)/(R + 1 − [1 − α]^n − Rβ^n) (not considering bias). With increasing number of independent studies, PPV tends to decrease, unless 1 − β < α, i.e., typically 1 − β < 0.05. This is shown for different levels of power and for different pre-study odds in Figure 2. For n studies of different power, the term β^n is replaced by the product of the terms β_i for i = 1 to n, but inferences are similar.
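
A short sketch of the multiple-study formula, reading the exponents as β raised to the power n and (1 − α) raised to the power n. The function names are illustrative, and the unequal-power variant follows the verbal description (the product of the β_i replaces β^n).

```python
import math


def ppv_multiple_studies(R, power, n, alpha=0.05):
    """PPV when at least one of n equally powered studies is significant."""
    beta = 1.0 - power
    return R * (1.0 - beta**n) / (R + 1.0 - (1.0 - alpha) ** n - R * beta**n)


def ppv_unequal_power(R, powers, alpha=0.05):
    """Same quantity for studies of different power: prod(beta_i) replaces beta**n."""
    n = len(powers)
    prod_beta = math.prod(1.0 - p for p in powers)
    return R * (1.0 - prod_beta) / (R + 1.0 - (1.0 - alpha) ** n - R * prod_beta)


if __name__ == "__main__":
    # PPV at 1:1 pre-study odds and 80% power as more teams test the question.
    for n in (1, 5, 10, 50):
        print(f"n={n:<3} PPV={ppv_multiple_studies(1.0, 0.8, n):.3f}")
```

For n = 1 the expression collapses to the single-study PPV, which gives a convenient consistency check.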

Figure 2. PPV (Probability That a Research Finding Is True) as a Function of the Pre-Study Odds for Various Numbers of Conducted Studies, n
Panels correspond to power of 0.20, 0.50, and 0.80.

Table 3. Research Findings and True Relationships in the Presence of Multiple Studies


A practical example is shown in Box 1. Based on the above considerations, one may deduce several interesting corollaries about the probability that a research finding is indeed true.

Corollary 1: The smaller the studies conducted in a scientific field, the less likely the research findings are to be true.
Small sample size means smaller power and, for all functions above, the PPV for a true research finding decreases as power decreases towards 1 − β = 0.05. Thus, other factors being equal, research findings are more likely true in scientific fields that undertake large studies, such as randomized controlled trials in cardiology (several thousand subjects randomized) [14] than in scientific fields with small studies, such as most research of molecular predictors (sample sizes 100-fold smaller) [15].

Corollary 2: The smaller the effect sizes in a scientific field, the less likely the research findings are to be true.
Power is also related to the effect size. Thus research findings are more likely true in scientific fields with large effects, such as the impact of smoking on cancer or cardiovascular disease (relative risks 3–20), than in scientific fields where postulated effects are small, such as genetic risk factors for multigenetic diseases (relative risks 1.1–1.5) [7]. Modern epidemiology is increasingly obliged to target smaller effect sizes [16]. Consequently, the proportion of true research findings is expected to decrease. In the same line of thinking, if the true effect sizes are very small in a scientific field, this field is likely to be plagued by almost ubiquitous false positive claims. For example, if the majority of true genetic or nutritional determinants of complex diseases confer relative risks less than 1.05, genetic or nutritional epidemiology would be largely utopian endeavors.
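
Corollaries 1 and 2 both act through power. The sketch below links sample size and effect size to power, and power to PPV. It assumes a two-sided, two-group z-test on a standardized mean difference with a normal approximation; that test, the function names, and the sample numbers are illustrative assumptions and do not come from the article, which speaks in terms of relative risks.

```python
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    shift = effect_size * (n_per_group / 2.0) ** 0.5
    # Probability of landing in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def ppv(R, power, alpha=0.05):
    """PPV = (1 - beta) R / (R - beta R + alpha), written in terms of power."""
    return power * R / (power * R + alpha)


if __name__ == "__main__":
    R = 0.1  # assumed pre-study odds of 1:10
    for d, n in ((0.5, 200), (0.5, 20), (0.2, 200), (0.2, 20)):
        pw = approx_power(d, n)
        print(f"effect={d} n/group={n:<4} power={pw:.2f} PPV={ppv(R, pw):.2f}")
```

Holding the pre-study odds fixed, shrinking either the sample size or the effect size lowers the power and with it the PPV.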

Corollary 3: The greater the number and the lesser the selection of tested relationships in a scientific field, the less likely the research findings are to be true.
As shown above, the post-study probability that a finding is true (PPV) depends a lot on the pre-study odds (R). Thus, research findings are more likely true in confirmatory designs, such as large phase III randomized controlled trials, or meta-analyses thereof, than in hypothesis-generating experiments. Fields considered highly informative and creative given the wealth of the assembled and tested information, such as microarrays and other high-throughput discovery-oriented research [4,8,17], should have extremely low PPV.

Corollary 4: The greater the flexibility in designs, definitions, outcomes, and analytical modes in a scientific field, the less likely the research findings are to be true.
Flexibility increases the potential for transforming what would be “negative” results into “positive” results, i.e., bias, u. For several research designs, e.g., randomized controlled trials [18–20] or meta-analyses [21,22], there have been efforts to standardize their conduct and reporting. Adherence to common standards is likely to increase the proportion of true findings. The same applies to outcomes. True findings may be more common when outcomes are unequivocal and universally agreed (e.g., death) rather than when multifarious outcomes are devised (e.g., scales for schizophrenia outcomes) [23]. Similarly, fields that use commonly agreed, stereotyped analytical methods (e.g., Kaplan-Meier plots and the log-rank test) [24] may yield a larger proportion of true findings than fields where analytical methods are still under experimentation (e.g., artificial intelligence methods) and only “best” results are reported. Regardless, even in the most stringent research designs, bias seems to be a major problem. For example, there is strong evidence that selective outcome reporting, with manipulation of the outcomes and analyses reported, is a common problem even for randomized trials [25]. Simply abolishing selective publication would not make this problem go away.

Corollary 5: The greater the financial and other interests and prejudices in a scientific field, the less likely the research findings are to be true.
Conflicts of interest and prejudice may increase bias, u. Conflicts of interest are very common in biomedical research [26], and typically they are inadequately and sparsely reported [26,27]. Prejudice may not necessarily have financial roots. Scientists in a given field may be prejudiced purely because of their belief in a scientific theory or commitment to their own findings. Many otherwise seemingly independent, university-based studies may be conducted for no other reason than to give physicians and researchers qualifications for promotion or tenure. Such nonfinancial conflicts may also lead to distorted reported results and interpretations. Prestigious investigators may suppress via the peer review process the appearance and dissemination of findings that refute their findings, thus condemning their field to perpetuate false dogma. Empirical evidence on expert opinion shows that it is extremely unreliable [28].

Corollary 6: The hotter a scientific field (with more scientific teams involved), the less likely the research findings are to be true.
This seemingly paradoxical corollary follows because, as stated above, the PPV of isolated findings decreases when many teams of investigators are involved in the same field. This may explain why we occasionally see major excitement followed rapidly by severe disappointments in fields that draw wide attention. With many teams working on the same field and with massive experimental data being produced, timing is of the essence in beating competition. Thus, each team may prioritize on pursuing and disseminating its most impressive “positive” results. “Negative” results may become attractive for dissemination only if some other team has found a “positive” association on the same question. In that case, it may be attractive to refute a claim made in some prestigious journal. The term Proteus phenomenon has been coined to describe this phenomenon of rapidly alternating extreme research claims and extremely opposite refutations [29]. Empirical evidence suggests that this sequence of extreme opposites is very common in molecular genetics [29].

These corollaries consider each factor separately, but these factors often influence each other. For example, investigators working in fields where true effect sizes are perceived to be small may be more likely to perform large studies than investigators working in fields where true effect sizes are perceived to be large. Or prejudice may prevail in a hot scientific field, further undermining the predictive value of its research findings. Highly prejudiced stakeholders may even create a barrier that aborts efforts at obtaining and disseminating opposing results. Conversely, the fact that a field is hot or has strong invested interests may sometimes promote larger studies and improved standards of research, enhancing the predictive value of its research findings. Or massive discovery-oriented testing may result in such a large yield of significant relationships that investigators have enough to report and search further and thus refrain from data dredging and manipulation.

Most Research Findings Are False for Most Research Designs and for Most Fields

In the described framework, a PPV exceeding 50% is quite difficult to get. Table 4 provides the results of simulations using the formulas developed for the influence of power, ratio of true to non-true relationships, and bias, for various types of situations that may be characteristic of specific study designs and settings. A finding from a well-conducted, adequately powered randomized controlled trial starting with a 50% pre-study chance that the intervention is effective is eventually true about 85% of the time. A fairly similar performance is expected of a confirmatory meta-analysis of good-quality randomized trials: potential bias probably increases, but power and pre-test chances are higher compared to a single randomized trial. Conversely, a meta-analytic finding from inconclusive studies, where pooling is used to “correct” the low power of single studies, is probably false if R ≤ 1:3. Research findings from underpowered, early-phase clinical trials would be true about one in four times, or even less frequently if bias is present. Epidemiological studies of an exploratory nature perform even worse, especially when underpowered, but even well-powered epidemiological studies may have only a one in five chance of being true, if R = 1:10. Finally, in discovery-oriented research with massive testing, where tested relationships exceed true ones 1,000-fold (e.g., 30,000 genes tested, of which 30 may be the true culprits) [30,31], PPV for each claimed relationship is extremely low, even with considerable standardization of laboratory and statistical methods, outcomes, and reporting thereof to minimize bias.
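
The scenarios described above can be approximated with the bias formula. The table itself is not reproduced in this post, so the specific power and bias values below are assumptions chosen to match the verbal descriptions (for example, 80% power and u = 0.10 for the adequately powered randomized trial); only the pre-study odds are taken directly from the text.

```python
def ppv_with_bias(R, power, u, alpha=0.05):
    """PPV in the presence of bias u, using the formula from the text."""
    beta = 1.0 - power
    numerator = (1.0 - beta) * R + u * beta * R
    denominator = R + alpha - beta * R + u - u * alpha + u * beta * R
    return numerator / denominator


# (label, power, R, u): power and u are assumed inputs, not quoted values.
SCENARIOS = [
    ("Adequately powered RCT, 1:1 odds", 0.80, 1.0, 0.10),
    ("Underpowered early-phase trial, 1:5 odds", 0.20, 1 / 5, 0.20),
    ("Well-powered exploratory epidemiology, 1:10 odds", 0.80, 1 / 10, 0.30),
    ("Discovery-oriented research, 1:1000 odds", 0.20, 1 / 1000, 0.80),
]

if __name__ == "__main__":
    for label, power, R, u in SCENARIOS:
        print(f"{label:<50} PPV={ppv_with_bias(R, power, u):.4f}")
```

Under these assumed inputs the first scenario comes out near 0.85 and the third near 0.20, in line with the figures quoted in the paragraph above.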

Table 4. PPV of Research Findings for Various Combinations of Power (1 − β), Ratio of True to Not-True Relationships (R), and Bias (u)

Claimed Research Findings May Often Be Simply Accurate Measures of the Prevailing Bias

As shown, the majority of modern biomedical research is operating in areas with very low pre- and post-study probability for true findings. Let us suppose that in a research field there are no true findings at all to be discovered. History of science teaches us that scientific endeavor has often in the past wasted effort in fields with absolutely no yield of true scientific information, at least based on our current understanding. In such a “null field,” one would ideally expect all observed effect sizes to vary by chance around the null in the absence of bias. The extent that observed findings deviate from what is expected by chance alone would be simply a pure measure of the prevailing bias.

For example, let us suppose that no nutrients or dietary patterns are actually important determinants for the risk of developing a specific tumor. Let us also suppose that the scientific literature has examined 60 nutrients and claims all of them to be related to the risk of developing this tumor with relative risks in the range of 1.2 to 1.4 for the comparison of the upper to lower intake tertiles. Then the claimed effect sizes are simply measuring nothing else but the net bias that has been involved in the generation of this scientific literature. Claimed effect sizes are in fact the most accurate estimates of the net bias. It even follows that between “null fields,” the fields that claim stronger effects (often with accompanying claims of medical or public health importance) are simply those that have sustained the worst biases.
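
One way to make the “null field” idea concrete is in terms of how often findings are claimed, rather than how large the claimed effects are. If no true relationships exist (R = 0), a fraction α of analyses is significant by chance, and bias converts a proportion u of the remaining 1 − α into reported findings. The sketch below follows from the definition of u given earlier; the function names and the inversion step are illustrative assumptions, not something the article spells out.

```python
def claimed_fraction_null_field(u, alpha=0.05):
    """Expected fraction of probed analyses reported as findings when R = 0."""
    return alpha + u * (1.0 - alpha)


def implied_bias(claimed_fraction, alpha=0.05):
    """Bias u implied by an observed fraction of claimed findings in a null field."""
    return (claimed_fraction - alpha) / (1.0 - alpha)


if __name__ == "__main__":
    # In the nutrient example all 60 examined nutrients are claimed as findings.
    print(f"implied u = {implied_bias(60 / 60):.2f}")
    # With no bias, only the chance rate alpha would be expected.
    print(f"expected claims at u=0: {claimed_fraction_null_field(0.0):.2f}")
```

In this simplified reading, any excess of claimed findings over α in a null field is attributable to bias alone.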

For fields with very low PPV, the few true relationships would not distort this overall picture much. Even if a few relationships are true, the shape of the distribution of the observed effects would still yield a clear measure of the biases involved in the field. This concept totally reverses the way we view scientific results. Traditionally, investigators have viewed large and highly significant effects with excitement, as signs of important discoveries. Too large and too highly significant effects may actually be more likely to be signs of large bias in most fields of modern research. They should lead investigators to careful critical thinking about what might have gone wrong with their data, analyses, and results.

Of course, investigators working in any field are likely to resist accepting that the whole field in which they have spent their careers is a “null field.” However, other lines of evidence, or advances in technology and experimentation, may lead eventually to the dismantling of a scientific field. Obtaining measures of the net bias in one field may also be useful for obtaining insight into what might be the range of bias operating in other fields where similar analytical methods, technologies, and conflicts may be operating.

How Can We Improve the Situation?

Is it unavoidable that most research findings are false, or can we improve the situation? A major problem is that it is impossible to know with 100% certainty what the truth is in any research question. In this regard, the pure “gold” standard is unattainable. However, there are several approaches to improve the post-study probability.

First, better powered evidence, e.g., large studies or low-bias meta-analyses, may help, as it comes closer to the unknown “gold” standard. However, large studies may still have biases and these should be acknowledged and avoided. Moreover, large-scale evidence is impossible to obtain for all of the millions and trillions of research questions posed in current research. Large-scale evidence should be targeted for research questions where the pre-study probability is already considerably high, so that a significant research finding will lead to a post-test probability that would be considered quite definitive. Large-scale evidence is also particularly indicated when it can test major concepts rather than narrow, specific questions. A negative finding can then refute not only a specific proposed claim, but a whole field or considerable portion thereof. Selecting the performance of large-scale studies based on narrow-minded criteria, such as the marketing promotion of a specific drug, is largely wasted research. Moreover, one should be cautious that extremely large studies may be more likely to find a formally statistically significant difference for a trivial effect that is not really meaningfully different from the null [32–34].

Second, most research questions are addressed by many teams, and it is misleading to emphasize the statistically significant findings of any single team. What matters is the totality of the evidence. Diminishing bias through enhanced research standards and curtailing of prejudices may also help. However, this may require a change in scientific mentality that might be difficult to achieve. In some research designs, efforts may also be more successful with upfront registration of studies, e.g., randomized trials [35]. Registration would pose a challenge for hypothesis-generating research. Some kind of registration or networking of data collections or investigators within fields may be more feasible than registration of each and every hypothesis-generating experiment. Regardless, even if we do not see a great deal of progress with registration of studies in other fields, the principles of developing and adhering to a protocol could be more widely borrowed from randomized controlled trials.

Finally, instead of chasing statistical significance, we should improve our understanding of the range of R values—the pre-study odds—where research efforts operate [10]. Before running an experiment, investigators should consider what they believe the chances are that they are testing a true rather than a non-true relationship. Speculated high R values may sometimes then be ascertained. As described above, whenever ethically acceptable, large studies with minimal bias should be performed on research findings that are considered relatively established, to see how often they are indeed confirmed. I suspect several established “classics” will fail the test [36].

Nevertheless, most new discoveries will continue to stem from hypothesis-generating research with low or very low pre-study odds. We should then acknowledge that statistical significance testing in the report of a single study gives only a partial picture, without knowing how much testing has been done outside the report and in the relevant field at large. Despite a large statistical literature for multiple testing corrections [37], usually it is impossible to decipher how much data dredging by the reporting authors or other research teams has preceded a reported research finding. Even if determining this were feasible, this would not inform us about the pre-study odds. Thus, it is unavoidable that one should make approximate assumptions on how many relationships are expected to be true among those probed across the relevant research fields and research designs. The wider field may yield some guidance for estimating this probability for the isolated research project. Experiences from biases detected in other neighboring fields would also be useful to draw upon. Even though these assumptions would be considerably subjective, they would still be very useful in interpreting research claims and putting them in context.

Box 1. An Example: Science at Low Pre-Study Odds

Let us assume that a team of investigators performs a whole genome association study to test whether any of 100,000 gene polymorphisms are associated with susceptibility to schizophrenia. Based on what we know about the extent of heritability of the disease, it is reasonable to expect that probably around ten gene polymorphisms among those tested would be truly associated with schizophrenia, with relatively similar odds ratios around 1.3 for the ten or so polymorphisms and with a fairly similar power to identify any of them. Then R = 10/100,000 = 10^−4, and the pre-study probability for any polymorphism to be associated with schizophrenia is also R/(R + 1) = 10^−4. Let us also suppose that the study has 60% power to find an association with an odds ratio of 1.3 at α = 0.05. Then it can be estimated that if a statistically significant association is found with the p-value barely crossing the 0.05 threshold, the post-study probability that this is true increases about 12-fold compared with the pre-study probability, but it is still only 12 × 10^−4.

Now let us suppose that the investigators manipulate their design, analyses, and reporting so as to make more relationships cross the p = 0.05 threshold even though this would not have been crossed with a perfectly adhered to design and analysis and with perfect comprehensive reporting of the results, strictly according to the original study plan. Such manipulation could be done, for example, with serendipitous inclusion or exclusion of certain patients or controls, post hoc subgroup analyses, investigation of genetic contrasts that were not originally specified, changes in the disease or control definitions, and various combinations of selective or distorted reporting of the results. Commercially available “data mining” packages actually are proud of their ability to yield statistically significant results through data dredging. In the presence of bias with u = 0.10, the post-study probability that a research finding is true is only 4.4 × 10^−4. Furthermore, even in the absence of any bias, when ten independent research teams perform similar experiments around the world, if one of them finds a formally statistically significant association, the probability that the research finding is true is only 1.5 × 10^−4, hardly any higher than the probability we had before any of this extensive research was undertaken!
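
The numbers in Box 1 can be re-derived from the three PPV formulas, reading the exponents as powers of ten and of β. This is a sketch under the stated inputs (R = 10/100,000, 60% power, α = 0.05, u = 0.10, n = 10); the function and variable names are illustrative.

```python
ALPHA = 0.05


def ppv(R, power, alpha=ALPHA):
    """Single study, no bias."""
    beta = 1.0 - power
    return (1.0 - beta) * R / (R - beta * R + alpha)


def ppv_with_bias(R, power, u, alpha=ALPHA):
    """Single study with bias u."""
    beta = 1.0 - power
    return ((1.0 - beta) * R + u * beta * R) / (
        R + alpha - beta * R + u - u * alpha + u * beta * R
    )


def ppv_multiple_studies(R, power, n, alpha=ALPHA):
    """At least one significant result among n independent studies, no bias."""
    beta = 1.0 - power
    return R * (1.0 - beta**n) / (R + 1.0 - (1.0 - alpha) ** n - R * beta**n)


R = 10 / 100_000   # pre-study odds from Box 1
power = 0.60       # 60% power at alpha = 0.05

base = ppv(R, power)
biased = ppv_with_bias(R, power, u=0.10)
ten_teams = ppv_multiple_studies(R, power, n=10)

if __name__ == "__main__":
    print(f"single study, no bias : {base:.2e}")
    print(f"bias u = 0.10         : {biased:.2e}")
    print(f"ten teams, no bias    : {ten_teams:.2e}")
```

The first two values reproduce the quoted 12 × 10^−4 and 4.4 × 10^−4. Evaluating the multiple-study formula exactly as written gives roughly 2.5 × 10^−4 for ten teams rather than the quoted 1.5 × 10^−4; either way the result is of the same order as the pre-study probability, so the qualitative point stands.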


1. Ioannidis JP, Haidich AB, Lau J (2001) Any casualties in the clash of randomised and observational evidence? BMJ 322: 879–880. Find this article online
2. Lawlor DA, Davey Smith G, Kundu D, Bruckdorfer KR, Ebrahim S (2004) Those confounded vitamins: What can we learn from the differences between observational versus randomised trial evidence? Lancet 363: 1724–1727. Find this article online
3. Vandenbroucke JP (2004) When are observational studies as credible as randomised trials? Lancet 363: 1728–1731. Find this article online
4. Michiels S, Koscielny S, Hill C (2005) Prediction of cancer outcome with microarrays: A multiple random validation strategy. Lancet 365: 488–492.
5. Ioannidis JPA, Ntzani EE, Trikalinos TA, Contopoulos-Ioannidis DG (2001) Replication validity of genetic association studies. Nat Genet 29: 306–309.
6. Colhoun HM, McKeigue PM, Davey Smith G (2003) Problems of reporting genetic associations with complex outcomes. Lancet 361: 865–872.
7. Ioannidis JP (2003) Genetic associations: False or true? Trends Mol Med 9: 135–138.
8. Ioannidis JPA (2005) Microarrays and molecular research: Noise discovery? Lancet 365: 454–455.
9. Sterne JA, Davey Smith G (2001) Sifting the evidence—What's wrong with significance tests. BMJ 322: 226–231.
10. Wacholder S, Chanock S, Garcia-Closas M, El ghormli L, Rothman N (2004) Assessing the probability that a positive report is false: An approach for molecular epidemiology studies. J Natl Cancer Inst 96: 434–442.
11. Risch NJ (2000) Searching for genetic determinants in the new millennium. Nature 405: 847–856.
12. Kelsey JL, Whittemore AS, Evans AS, Thompson WD (1996) Methods in observational epidemiology, 2nd ed. New York: Oxford U Press. 432 p.
13. Topol EJ (2004) Failing the public health—Rofecoxib, Merck, and the FDA. N Engl J Med 351: 1707–1709.
14. Yusuf S, Collins R, Peto R (1984) Why do we need some large, simple randomized trials? Stat Med 3: 409–422.
15. Altman DG, Royston P (2000) What do we mean by validating a prognostic model? Stat Med 19: 453–473.
16. Taubes G (1995) Epidemiology faces its limits. Science 269: 164–169.
17. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, et al. (1999) Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286: 531–537.
18. Moher D, Schulz KF, Altman DG (2001) The CONSORT statement: Revised recommendations for improving the quality of reports of parallel-group randomised trials. Lancet 357: 1191–1194.
19. Ioannidis JP, Evans SJ, Gotzsche PC, O'Neill RT, Altman DG, et al. (2004) Better reporting of harms in randomized trials: An extension of the CONSORT statement. Ann Intern Med 141: 781–788.
20. International Conference on Harmonisation E9 Expert Working Group (1999) ICH Harmonised Tripartite Guideline. Statistical principles for clinical trials. Stat Med 18: 1905–1942.
21. Moher D, Cook DJ, Eastwood S, Olkin I, Rennie D, et al. (1999) Improving the quality of reports of meta-analyses of randomised controlled trials: The QUOROM statement. Quality of Reporting of Meta-analyses. Lancet 354: 1896–1900.
22. Stroup DF, Berlin JA, Morton SC, Olkin I, Williamson GD, et al. (2000) Meta-analysis of observational studies in epidemiology: A proposal for reporting. Meta-analysis of Observational Studies in Epidemiology (MOOSE) group. JAMA 283: 2008–2012.
23. Marshall M, Lockwood A, Bradley C, Adams C, Joy C, et al. (2000) Unpublished rating scales: A major source of bias in randomised controlled trials of treatments for schizophrenia. Br J Psychiatry 176: 249–252.
24. Altman DG, Goodman SN (1994) Transfer of technology from statistical journals to the biomedical literature. Past trends and future predictions. JAMA 272: 129–132.
25. Chan AW, Hrobjartsson A, Haahr MT, Gotzsche PC, Altman DG (2004) Empirical evidence for selective reporting of outcomes in randomized trials: Comparison of protocols to published articles. JAMA 291: 2457–2465.
26. Krimsky S, Rothenberg LS, Stott P, Kyle G (1998) Scientific journals and their authors' financial interests: A pilot study. Psychother Psychosom 67: 194–201.
27. Papanikolaou GN, Baltogianni MS, Contopoulos-Ioannidis DG, Haidich AB, Giannakakis IA, et al. (2001) Reporting of conflicts of interest in guidelines of preventive and therapeutic interventions. BMC Med Res Methodol 1: 3.
28. Antman EM, Lau J, Kupelnick B, Mosteller F, Chalmers TC (1992) A comparison of results of meta-analyses of randomized control trials and recommendations of clinical experts. Treatments for myocardial infarction. JAMA 268: 240–248.
29. Ioannidis JP, Trikalinos TA (2005) Early extreme contradictory estimates may appear in published research: The Proteus phenomenon in molecular genetics research and randomized trials. J Clin Epidemiol 58: 543–549.
30. Ntzani EE, Ioannidis JP (2003) Predictive ability of DNA microarrays for cancer outcomes and correlates: An empirical assessment. Lancet 362: 1439–1444.
31. Ransohoff DF (2004) Rules of evidence for cancer molecular-marker discovery and validation. Nat Rev Cancer 4: 309–314.
32. Lindley DV (1957) A statistical paradox. Biometrika 44: 187–192.
33. Bartlett MS (1957) A comment on D.V. Lindley's statistical paradox. Biometrika 44: 533–534.
34. Senn SJ (2001) Two cheers for P-values. J Epidemiol Biostat 6: 193–204.
35. De Angelis C, Drazen JM, Frizelle FA, Haug C, Hoey J, et al. (2004) Clinical trial registration: A statement from the International Committee of Medical Journal Editors. N Engl J Med 351: 1250–1251.
36. Ioannidis JPA (2005) Contradicted and initially stronger effects in highly cited clinical research. JAMA 294: 218–228.
37. Hsueh HM, Chen JJ, Kodell RL (2003) Comparison of methods for estimating the number of true null hypotheses in multiplicity testing. J Biopharm Stat 13: 675–689.
Interesting article, and I'd definitely give the author the award for the biggest balls...

But I don't know how applicable that is to most physiological/pharmacological research... Don't get me wrong, obviously a lot of research is bullshit, but this guy is coming from a hard-core biostatistician's clinical trial/epidemiology point of view. I mean, paired t-tests on obvious measures like EPSP amplitude or behaviour frequencies aren't open to much manipulation, unlike complex measures from some clinical trials.
Microparadigms: Chains of collective reasoning in publications about molecular interactions

Andrey Rzhetsky*,†,‡,§, Ivan Iossifov*,†, Ji Meng Loh¶, and Kevin P. White‖

*Department of Biomedical Informatics, †Columbia Genome Center, and ‡Center for Computational Biology and Bioinformatics, Columbia University, New York, NY 10032; ¶Department of Statistics, Columbia University, New York, NY 10027; and ‖Department of Genetics, Yale University, New Haven, CT 06520

Communicated by Sherman M. Weissman, Yale University School of Medicine, New Haven, CT, January 23, 2006 (received for review August 15, 2005)


We analyzed a very large set of molecular interactions that had been derived automatically from biological texts. We found that published statements, regardless of their verity, tend to interfere with interpretation of the subsequent experiments and, therefore, can act as scientific "microparadigms," similar to dominant scientific theories [Kuhn, T. S. (1996) The Structure of Scientific Revolutions (Univ. Chicago Press, Chicago)]. Using statistical tools, we measured the strength of the influence of a single published statement on subsequent interpretations. We call these measured values the momentums of the published statements and treat separately the majority and minority of conflicting statements about the same molecular event. Our results indicate that, when building biological models based on published experimental data, we may have to treat the data as highly dependent-ordered sequences of statements (i.e., chains of collective reasoning) rather than unordered and independent experimental observations. Furthermore, our computations indicate that our data set can be interpreted in two very different ways (two "alternative universes"): one is an "optimists’ universe" with a very low incidence of false results (<5%), and another is a "pessimists’ universe" with an extraordinarily high rate of false results (>90%). Our computations deem highly unlikely any milder intermediate explanation between these two extremes.


More than 5 million biomedical research and review articles have been published in the last 10 years. Automated analysis and synthesis of the knowledge locked in this literature has emerged as a major challenge in computational biology. Recent advances in automated text analysis have provided an opportunity for collecting and scrutinizing huge collections of published statements, offering a unique and previously inaccessible "bird’s eye" view of a large field. Among others, the GeneWays text-mining project (1–3) recently made available for analysis millions of biological statements extracted from 78 contemporary research journals. By developing computational tools that allowed detailed statistical analysis of millions of statements extracted from scientific texts, we used these unique data to probe the large-scale properties of the scientific knowledge-production process. We explicitly modeled both the generation of experimental results and the experimenters’ interpretation of their results and found that previously published statements, regardless of whether they are subsequently shown to be true or false, can have a profound effect on interpretations of further experiments and the probability that a scientific community would converge to a correct conclusion.

In this study, we focused on chronologically ordered chains of statements about published molecular interactions, such as "protein A activates gene B" or "small molecule C binds protein D." Each chain comprises chronologically ordered positive and/or negative statements about the same pair of molecules; for brevity, we encode each such chain with a series of 0’s for the negative statements and 1’s for the positive statements. For example, an imaginary chain of length 3 could include "protein A activates protein B" (1), "protein A does not activate protein B" (0), and "protein A activates protein B" (1) (see also Fig. 1). Discrepancies across published statements may arise because of variations in experimental conditions, errors in the conduct of the experiment, misinterpretation of results, or a combination of these factors.

Fig. 1. A hypothetical chain of collective reasoning. The chain is started by a scientist who performs an experiment hidden from the outside world. The results of the experiment involve some fuzziness, and the chain originator publishes the most likely interpretation given the absence of prior publications. The second, third, and all other scientists who join the chain later, think in the context of the published opinions and can be led to interpret their experimental results differently than would be done in the absence of prior data. The fourth and fifth persons in the chain publish interpretations of their data that would be opposite in the absence of prior publication.

There is a well established term in economics, "information cascade" (4), which represents a special form of a collective-reasoning chain that degenerates into repetition of the same statement (4). Here we suggest a model that can generate a rich spectrum of patterns of published statements, including information cascades. We then explore patterns that occur in real scientific publications and compare them to this model.

Results and Discussion

Modeling Experiments and Publication Process. There are numerous possible ways to evaluate dependencies across published statements. The simplest approach is to evaluate a correlation between the chronologically consecutive statements within the same chain of reasoning. We started with the simple correlation analysis and observed an overwhelmingly strong dependence between statements within a chain (correlation coefficient is 0.9857, with 842,720 pairs of neighboring statements studied; the corresponding P value is 0 for all practical purposes). However, this simple analysis is very hard to interpret, because the strong correlation across statements in a chain can be due to a number of different factors: because all statements within a chain have the same "true" value, because biological experiments have a very low error rate, because most of the published statements are not experimentally supported but are restatements of earlier statements, because published statements have large "momentums" that affect interpretation of later experiments, or because the general likelihood of randomly formulating a correct statement is high (if most of the molecules were capable of interacting in some way, most of the random statements of the form "A interacts with B" would be correct). Therefore, we designed a probabilistic model that allowed us to discriminate contributions of these factors while being simple and computationally tractable (see Model Box and Fig. 2 for explanation of the model assumptions and its application; see also Supporting Text, which is published as supporting information on the PNAS web site, for further in-depth mathematical details). One of the benefits of formulating our model in probabilistic terms is the ability to quantify our confidence in the results of the analysis, given the model and its assumptions.
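
The simple correlation analysis described above can be sketched in a few lines. This is only an illustration under assumptions: it pools every pair of chronologically adjacent statements across all chains and computes a Pearson correlation on the 0/1 codes, which is one plausible reading of "pairs of neighboring statements". The function names and the toy chains are made up for the example and are not taken from the paper or from GeneWays.

```python
from math import sqrt

def neighbor_pairs(chains):
    """Collect (earlier, later) pairs of chronologically adjacent statements."""
    pairs = []
    for chain in chains:
        pairs.extend(zip(chain, chain[1:]))
    return pairs

def pearson(pairs):
    """Pearson correlation between the first and second element of each pair."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)

# Made-up toy chains: 1 = positive statement, 0 = negative statement.
chains = [[1, 1, 1, 1], [0, 0, 0], [1, 0, 1], [1, 1]]
r = pearson(neighbor_pairs(chains))
print(round(r, 4))
```

As the text notes, a high value of this coefficient does not by itself say why neighboring statements agree, which is the motivation for the probabilistic model.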

Fig. 2. The probabilistic model that generates chains of collective reasoning in our study. Plot A gives an overview of the major stochastic components of the model. Plot B specifies a pathway that leads to generation of the first link in a chain of reasoning. Plot C explains the stochastic processes that lead to chain extension. Note that, to save space, we show only the extension of a chain with a positive statement; the probability for extending the chain with a negative statement is 1 minus the probability of publishing a positive statement. Definitions of the nine model parameters are provided in the Model Box section and Table 1, and the probability equations required for implementing the model are provided in Supporting Text.

Table 1. Major parameters and variables used in the modeling

Patterns. Our model can represent a rich spectrum of possible patterns in chains of published statements (see Fig. 3): all patterns that we propose here occur in the real chains, albeit some of them are rarer than others. The plots on the left side of Fig. 3 represent sequences of simulated positive and negative statements, viewed as white and gray cells, respectively; plots on the right side of the figure show the probability of reaching the correct answer (which is, of course, known only in a simulation) at each step of a reasoning chain for the chains shown to their left. All five patterns are characterized by distinct distributions of single-digit run lengths; for example, the runs of single digits are on average longest for the third pattern (Fig. 3 E and F) and shortest for the fourth pattern (Fig. 3 G and H).

Fig. 3. Hypothetical patterns of conflicting statements that can be observed in real publications. Each row in the left group of plots (A, C, E, G, and I) corresponds to an independent chain (from left to right) of reasoning, where white cells indicate positive statements and gray cells indicate negative statements. Different plots correspond to different parameter values in the underlying model. Each row in the right group of plots (B, D, F, H, and J) represents the probability that the correct result will be reached at the given step of the corresponding chain shown in the same row in the left group.
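
The run-length statistic mentioned for these patterns is easy to compute for any 0/1 chain. The sketch below is a minimal illustration; the helper names are hypothetical and the example chains are invented, not drawn from the data set.

```python
from itertools import groupby

def run_lengths(chain):
    """Lengths of maximal runs of identical digits, in chronological order."""
    return [len(list(group)) for _, group in groupby(chain)]

def mean_run_length(chain):
    """Average run length; long runs suggest conformism, short runs alternation."""
    runs = run_lengths(chain)
    return sum(runs) / len(runs)

print(run_lengths([1, 1, 1, 0, 1, 1]))   # three ones, one zero, two ones
print(mean_run_length([1, 1, 1, 1]))     # a single unbroken run
print(mean_run_length([0, 1, 0, 1]))     # strict alternation
```

A chain that never changes has one run as long as the chain itself, whereas a strictly alternating chain has mean run length 1, which matches the contrast drawn between the conformist and anticonformist patterns.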

The first pattern, shown in Fig. 3 A and B, corresponds to a complete statistical independence of published statements ("trust nobody" but yourself): scientists in this imaginary world do not read one another’s papers (momentums of all published statements are zero), and prior publications produce no bias in interpretation of experiments by a scientist. The probability of publishing a correct statement in this case is the same for all links in a reasoning chain (Fig. 3B).

The second, third, and fourth patterns (Fig. 3 C and D, E and F, and G and H, respectively) illustrate three possible modes of dependence within a single reasoning chain. The third and fourth patterns correspond to extreme conformism (the superconformism pattern, indicating high concordance with the majority of statements), and anticonformism (the superanticonformism pattern, indicating a tendency to disagree with the majority of statements), respectively. Both patterns can result from published statements having large momentums: If the majority statements are heavier than the minority statements, the model produces the extreme conformism pattern, whereas if the minority statements are heavier, the resulting pattern is anticonformism. The superconformism pattern (Fig. 3 E and F) is a perfect example of an information cascade. Another pattern (anticonformism with an inferiority complex; Fig. 3 C and D) is a curious hybrid between the conformism and anticonformism patterns: The scientists in this hypothetical universe tend to follow the majority of published statements as long as there are no conflicts, but once the first conflicting statement is published, the same scientists tend to follow an anticonformist model, by joining the minority opinion and generating a stutter-like publication signature.

We call the fifth pattern, shown in Fig. 3 I and J, mild skepticism: The published statements here are dependent, because they have small, but positive, momentums, and the majority statements are heavier than the minority statements. This dependence is manifested by runs of zeros and ones longer than those observed in the independent model (Fig. 3 A and B), but the dependence is relatively weak. In this hypothetical world, scientists do read their peers’ articles and try to compare their own results to the published ones but tend to trust their own data more than the data published by their peers.

Patterns that resemble the mild skepticism were prevalent in our real-world data set (described below), but analysis revealed the presence of all five hypothetical patterns.

Data Analysis. To estimate the momentums of published statements, we applied our computational tools to data stored in the GENEWAYS 6.0 database (3). To detect possible variations in behavior of statements about different types of molecular interactions, we divided interactions in the large data set into logical interactions (such as activate, regulate, and inhibit) and physical interactions (such as bind, phosphorylate, and methylate). This subdivision resulted in three distinct data sets: (i) the whole data set (all), and (ii) logical and (iii) physical interaction subsets.

Our first observation, based on computation, was that, because of the huge data set, we can clearly demonstrate that momentums of published statements are notably positive, but <0.1 (see Fig. 4A and B). This result means that scientists are often strongly affected by prior publications in interpreting their own experimental data, while weighting their own private results (which have weight 1 under our model) at least 10-fold as high as a single result published by somebody else.

Fig. 4. The estimated posterior distributions of the parameters for three data sets (all, logical, and physical). A, B, C, D, and E correspond to parameters α, ι, ν, μ, and ρ, respectively.

The second observation was that, for all three data sets, the dominant statements were considerably "heavier" than the nondominant statements, revealing a tendency toward conformism (see Fig. 4; see also Appendix 1 and Data Sets 1 and 2, which are published as supporting information on the PNAS web site).

Our third and most striking finding emerged from the need to explain the observation that the published statements in our data set are predominantly positive (<5% of them are negative) and are highly correlated within chains. The estimated momentums of published statements are too small to wholly account for the high correlation, and a mechanical republication of the old statements (without experimental reevaluation) appears to be insufficient to explain the trend either (see Model Box and Fig. 4). Our stochastic analysis of the real data produced not a single, most likely explanation, but rather two sets of nearly equally probable "alternative universes" (see Fig. 4 C–E). These statistically derived "universes" reflect a conclusion that perhaps can be reached through a common-sense logical reasoning (such a derivation, however, would lack quantification of the confidence). One explanation is that the high agreement among published statements is due to a very low rate of experimental errors (optimists’ universe, where both false-positive and false-negative error rates are <0.05) and an overwhelming predominance of positive statements over negative ones among true statements. The alternative explanation posits a pessimists’ universe that is characterized by exceptionally error-prone experiments; both false-positive and false-negative error rates are significantly >0.9, and a randomly chosen positive statement is far more likely to be false than true. The statistical tools allow us to conclude that intermediate milder universes are very unlikely under our assumptions (see estimated marginal posterior distributions of parameter values in Fig. 4) and that both universes enjoy considerable support by data (see Fig. 5). This ambiguity is not due to the model’s weakness, but due to the lack of information about the actual proportion of positive versus negative statements that exist. 
In fact, our data-shuffling experiments (described in detail in Supporting Text) showed that the two-universe effect disappears for many types of reshuffled data. Furthermore, our model parameter estimates are sensitive to data randomization and to elimination of constant (single-digit) chains or variable (double-digit) chains (see Supporting Text).
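
To see why a positive statement is far more likely to be false than true in the pessimists’ universe, it helps to apply Bayes' rule to a single experiment. The sketch below is a simplification and not the paper's likelihood: it ignores exceptions, restatements, and the influence of earlier publications, and the parameter values are illustrative picks inside the ranges quoted in the text, not the estimates from the analysis. Here rho stands for ρ (probability that the underlying rule is positive), mu for μ (false-positive rate), and nu for ν (false-negative rate).

```python
def prob_true_given_positive(rho, mu, nu):
    """P(underlying rule is positive | one experiment reports positive).

    Simplifying assumptions: no exceptions to rules and no publication
    filter, so only rho, mu, and nu enter the calculation.
    """
    true_positive = rho * (1.0 - nu)      # rule positive, experiment correct
    false_positive = (1.0 - rho) * mu     # rule negative, experiment wrong
    return true_positive / (true_positive + false_positive)

# Illustrative values only (hypothetical, chosen to match the quoted ranges).
optimist = prob_true_given_positive(rho=0.9, mu=0.03, nu=0.03)
pessimist = prob_true_given_positive(rho=0.1, mu=0.95, nu=0.95)
print(optimist, pessimist)
```

With these made-up values, nearly every positive report is true in the optimistic case and nearly every positive report is false in the pessimistic case, even though both settings would produce a literature dominated by positive statements.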

Fig. 5. The estimated posterior probabilities of the universe classes for three data sets (all, logical, and physical). The universes are defined in terms of the values of parameters ρ, μ, and ν: The optimists’ universe with significant posterior probability has low values of both μ and ν (<0.5) and a large value of ρ, and the pessimists’ universe with significant posterior probability has high values of both error-related parameters (>0.5) and a small value of ρ. There is one more optimists’ universe with a small value of ρ and one more pessimists’ universe with a large value of ρ that have negligible posterior probabilities and are included in the group other. The remaining four universes (also included in the group other in the plot) have one low and one high error-parameter value. Only one of the pessimists’ and one of the optimists’ universes have nonnegligible posterior probabilities.

Optimum Parameter Values. Our probabilistic model allows us to find optimum parameter values that maximize the probability that a given chain of scientific reasoning will converge to the correct result. An evaluation of the optimum parameters under our model (see Model Box) indicated that the momentums of published statements estimated from real data are too high to maximize the probability of reaching the correct result at the end of a chain. This finding suggests that the scientific process may not maximize the overall probability that the result published at the end of a chain of reasoning will be correct.

A detailed analysis of a measure leading to improved probability of publishing correct results is outside of the focus of this study, but experience in the fields of physics (5) and structural biology (6) offers concrete steps (such as random and independent benchmarking of published results) that provide scientists with feedback about the true distribution of experimental errors. Another major question also remains open: In which of the two alternative universes discovered in our analysis are we living? Our results indicate that the optimistic and pessimistic realities are almost equally likely given currently available data.

Evaluating the quality of the published facts is more than a matter of pure academic curiosity: If the problem of convergence to a false "accepted" scientific result is indeed frequent, it might be important to focus on alleviating it through restructuring the publication process or introducing a means of independent benchmarking of published results.

Model Box. Our model is built on eight simple and intuitive assumptions. First, we assume that for every pair of substances, there is a general truth or rule: These substances either usually do or usually do not interact. The odds of encountering a negative rule ("A usually does not interact with B") are not necessarily the same as the odds of encountering a positive rule ("C usually does interact with D"); we denote the corresponding probabilities by 1 − ρ and ρ, respectively. Second, each general rule may have an exception, with probability φ (e.g., proteins A and B interact in most cases, but do not interact when in tissue X). Third, we allow experiments to produce erroneous results: They produce false-negative results with probability ν and false-positive results with probability μ. Fourth, we assume an asymmetry in terms of ease of publication between negative and positive experimental results. Many experimentalists believe that it is more difficult to publish a negative result ("we were unable to demonstrate that A and B interact") than to publish a positive result ("we demonstrated that A and B interact"), so the model allows negative results to be discarded, without publication, with probability 1 − η. Fifth, we assume that a published statement can be based on original experiments (with probability 1 − β_i) or can be a restatement of an earlier published statement (with probability β_i). We tested two formulations of the model: The simpler version assumes that β_i is constant, whereas, in the more complicated version of the model, β_i increases as the chain grows longer: β_i = 1 − i^{−ψ}, ψ > 0. The more complicated formulation asserts that the chances that a scientist would experimentally reverify an old statement drop with the growth of the available evidence. We assume that the first statement in every chain is always supported by an experiment.
Sixth, we allow an experimenter’s interpretation of her own data (and hence of her published result) to differ from the "unbiased" interpretation of the same data that an expert would have in the absence of prior publications. This model feature reflects our observation that, when reading about published experiments similar to their own, scientists build in their minds an equivalent of statistical prior distributions of experimental outcomes that they are using for interpreting their own experimental data. We assume that each published statement has a weight that is different for statements in reasoning chains where they are in the majority (α), are in the minority (ι), and are of equal number (τ). For example, for the chain of reasoning 1, 0, 0, 1, 1, 1, every published positive statement would have weight α (because it is in the majority), whereas each negative statement would have weight ι. For the hypothetical chain 0, 1, 0, 1, the weight of each statement would be equal (τ), because there are an equal number of zeros and ones. The weight of each published statement is nonnegative and reflects the importance of published statements in influencing both a researcher’s choice of experiments (and thus ultimately observed results) and her interpretation of the results. We set the subjective weight of the researcher’s own experiment to 1. Seventh, we assume that the relationship among statements related to the same molecular interaction is adequately represented with a linear structure (a chain). Eighth, we assume that different chains are statistically independent.
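
The assumptions above can be turned into a small simulation of one chain. This is a heavily simplified sketch, not the authors' model: the actual probability equations are in Supporting Text and are not reproduced here. Two rules in the code are assumptions introduced purely for illustration and are marked as such: a restatement copies a uniformly chosen earlier statement, and an experimenter publishes a positive statement with a probability given by a weighted vote between her own result (weight 1) and the earlier statements. Exceptions (φ) and the publication filter (η) are left out, which corresponds to the values 0 and 1 used in the paper's computations, and the restatement probability is held constant.

```python
import random

def statement_weights(chain, alpha, iota, tau):
    """Weights of (positive, negative) published statements for a chain so far."""
    n_pos = sum(chain)
    n_neg = len(chain) - n_pos
    if n_pos == n_neg:
        return tau, tau            # tie: both sides weigh tau
    if n_pos > n_neg:
        return alpha, iota         # positives in the majority
    return iota, alpha             # negatives in the majority

def run_experiment(truth, mu, nu, rng):
    """Hidden experimental result: the truth, flipped at the relevant error rate."""
    if truth == 1:
        return 0 if rng.random() < nu else 1   # false negative with probability nu
    return 1 if rng.random() < mu else 0       # false positive with probability mu

def simulate_chain(length, rho, mu, nu, beta, alpha, iota, tau, rng):
    """Return (truth, chain) for one simulated chain of 0/1 statements."""
    truth = 1 if rng.random() < rho else 0
    # The first statement is always backed by an experiment.
    chain = [run_experiment(truth, mu, nu, rng)]
    while len(chain) < length:
        if rng.random() < beta:
            # ASSUMPTION: a restatement copies a uniformly chosen earlier statement.
            chain.append(rng.choice(chain))
            continue
        result = run_experiment(truth, mu, nu, rng)
        w_pos, w_neg = statement_weights(chain, alpha, iota, tau)
        n_pos = sum(chain)
        n_neg = len(chain) - n_pos
        # ASSUMPTION: weighted vote, own result has weight 1.
        p_positive = (result + w_pos * n_pos) / (1.0 + w_pos * n_pos + w_neg * n_neg)
        chain.append(1 if rng.random() < p_positive else 0)
    return truth, chain

rng = random.Random(0)
truth, chain = simulate_chain(8, rho=0.5, mu=0.1, nu=0.1, beta=0.2,
                              alpha=0.08, iota=0.02, tau=0.05, rng=rng)
print(truth, chain)
```

Setting all three weights to zero recovers the "trust nobody" pattern, in which each published statement depends only on the experimenter's own result.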

In our model (Fig. 2A), each chain of reasoning results from a combination of two processes: one determines the length of the chain, whereas the other specifies the arrangement of zeros and ones within the chain given that length. The first process is described in detail in Supporting Text. The second process (see Fig. 2B and C) is responsible for generating a specific sequence of zeros and ones within a chain of a given length. In this study, we emphasize analysis of the second process.

Note that in our model, experimental data cannot be observed directly by the research community. They have to be inferred from publications. For that reason, we call the experimental results "hidden" (see Fig. 2 B and C) by analogy with the hidden states in the hidden Markov models.

To estimate the marginal posterior distributions for the parameters, we used the Metropolis-coupled Markov chain Monte Carlo technique (refs. 7 and 8; our implementation of the algorithm closely followed that of Altekar and colleagues, ref. 9), run on a cluster of 40 Intel processors. For these computations, the value of parameter τ was assumed to be equal to the average of α and ι, and the values of φ and η were set to 0 and 1, respectively. This assumption did not affect our findings regarding values of the other parameters, yet it greatly reduced computational complexity. Using the Metropolis-coupled Markov chain Monte Carlo technique, we estimated a full posterior distribution of parameters given data, P(parameters | data). We then divided the whole space of the permissible parameter values into "bad" and "good" neighborhoods [the error rate is very high in the bad neighborhood (>0.5) and low in the good neighborhood (<0.5)] and computed the posterior probabilities that the parameter values belong to each neighborhood. Our estimates of the marginal posterior distributions for major parameters are shown in Fig. 4. Because we assumed noninformative prior parameter distributions in our computation, the mode of each estimated marginal density corresponds to the maximum-likelihood estimate of the parameter value; a narrow peak indicates a high degree of certainty in the estimate, whereas a wide peak indicates that the variance of the estimate is large.
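
For readers unfamiliar with Metropolis-coupled MCMC, the following toy sketch shows the general idea on a one-dimensional bimodal target. It is not the authors' implementation and has nothing to do with their likelihood: the target density, the inverse temperatures, and the step size are all made up for illustration. Several chains run in parallel, heated chains sample a flattened version of the target, states are occasionally exchanged between neighboring chains, and only the cold chain (inverse temperature 1) is recorded.

```python
import math
import random

def log_target(x):
    """Unnormalized log density of a toy bimodal target with modes at -3 and +3."""
    a = -0.5 * ((x + 3.0) / 0.5) ** 2
    b = -0.5 * ((x - 3.0) / 0.5) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def swap_log_ratio(beta_i, beta_j, logp_i, logp_j):
    """Log acceptance ratio for exchanging the states of chains i and j."""
    return (beta_i - beta_j) * (logp_j - logp_i)

def mc3(n_iter, betas, step, rng):
    """Run coupled chains; betas[0] must be 1.0 (the cold chain)."""
    states = [0.0 for _ in betas]
    cold_samples = []
    for _ in range(n_iter):
        # Ordinary Metropolis update within each chain, on the tempered target.
        for k, beta in enumerate(betas):
            proposal = states[k] + rng.gauss(0.0, step)
            log_ratio = beta * (log_target(proposal) - log_target(states[k]))
            if rng.random() < math.exp(min(0.0, log_ratio)):
                states[k] = proposal
        # Propose exchanging the states of one random pair of neighboring chains.
        k = rng.randrange(len(betas) - 1)
        log_ratio = swap_log_ratio(betas[k], betas[k + 1],
                                   log_target(states[k]), log_target(states[k + 1]))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            states[k], states[k + 1] = states[k + 1], states[k]
        cold_samples.append(states[0])
    return cold_samples

cold = mc3(n_iter=2000, betas=[1.0, 0.5, 0.1], step=1.0, rng=random.Random(0))
print(len(cold), min(cold), max(cold))
```

The heated chains cross the low-density region between the modes more easily than the cold chain does, and the exchange step passes those crossings down to the cold chain. That property matters when the posterior has well-separated modes, as with the two "universes" described in the paper.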

The data set that we used for analysis included 2.5 million reasoning chains containing 3.3 million individual statements extracted from the GENEWAYS 6.0 database (1–3). We did our data analysis in two ways. In one version of the analysis, we used only one (most frequent) statement of each kind per article, whereas in the other version of analysis, we used all statements available in the database. The results of the two analyses are qualitatively indistinguishable, and we show here only results of the analysis of the former type.

Analysis of all three data sets under the constant-β model produced consistent estimates of β with the posterior mean close to 0.2 (the 95% credible interval for the largest data set, all, was bounded by 0.166 and 0.235). To determine whether this simple way of estimating restatements was reasonable, we manually analyzed 200 statements about molecular interactions in Drosophila melanogaster (these statements were randomly sampled from the fly-specific portion of the GENEWAYS 6.0 database). Among these 200 statements, 107 were based on original experiments, which gave us an estimate of β equal to 0.465 with a 95% confidence interval (0.394 and 0.637; see Supporting Text). Therefore, the simple constant-β model was falsified with the data. We were then able to estimate the value of the decay parameter (ψ) by using the manually collected data (ψ was equal to 0.445, 0.479, and 0.502 for data sets all, logical, and physical, respectively). Notably, however, when we compared estimation results under the simpler constant-β model with those under the more complicated decaying-β model, all of the major results reported here held for both computations, demonstrating the robustness of our model to estimates of mechanical restatements.
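
The arithmetic behind the manual estimate, and the shape of the decaying restatement probability, can be checked directly. The sketch below uses only numbers quoted in the text (200 sampled statements, 107 based on original experiments, and the decay parameter 0.445 for the data set all). The function names are hypothetical, and the decay formula is the one given in the Model Box, with i read as the position of a statement in its chain.

```python
def restatement_fraction(n_sampled, n_experimental):
    """Share of sampled statements that were not backed by an original experiment."""
    return 1.0 - n_experimental / n_sampled

def decaying_beta(i, psi):
    """Restatement probability at chain position i: beta_i = 1 - i**(-psi), psi > 0."""
    return 1.0 - i ** (-psi)

beta_hat = restatement_fraction(200, 107)
print(round(beta_hat, 3))

# Restatement probability along a chain for the decay value quoted for "all".
for position in (1, 2, 5, 10):
    print(position, round(decaying_beta(position, 0.445), 3))
```

At position 1 the formula gives a restatement probability of exactly 0, consistent with the assumption that the first statement in every chain is always supported by an experiment.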

The probabilistic approach to analysis of data naturally allowed us to compare directly the relative plausibility of each alternative universe (by estimating the latter’s posterior probability, see Fig. 5). To define the bounds of universes, we divided the parameter space into eight equal-sized subspaces, and we estimated the proportion of the posterior density associated with each subspace. [The subspaces were separated by three mutually orthogonal planes cutting axes (ρ, μ, ν) at ρ = 0.5, μ = 0.5, and ν = 0.5, respectively.] Because the number of informative (cold) Metropolis-coupled Markov chain Monte Carlo iterations in our analysis was enormous (3 × 10^6), the differences between posterior probabilities of the two universes are statistically significant.
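
Given posterior draws of (ρ, μ, ν), the partition into eight subspaces amounts to counting how many draws fall on each side of the three cutting planes. The sketch below assumes the draws are available as a list of triples; the sample values are invented for illustration, and the labels follow the description in the Fig. 5 caption (optimists: low μ and ν with large ρ; pessimists: high μ and ν with small ρ).

```python
from collections import Counter

def octant(rho, mu, nu, cut=0.5):
    """Label a parameter draw by the side of each cutting plane it falls on."""
    return (rho > cut, mu > cut, nu > cut)

def universe_probabilities(samples, cut=0.5):
    """Fraction of posterior draws that land in each occupied octant."""
    counts = Counter(octant(rho, mu, nu, cut) for rho, mu, nu in samples)
    total = len(samples)
    return {label: count / total for label, count in counts.items()}

OPTIMIST = (True, False, False)    # large rho, low mu, low nu
PESSIMIST = (False, True, True)    # small rho, high mu, high nu

# Made-up draws standing in for the cold-chain MCMC output.
samples = [(0.90, 0.03, 0.04), (0.10, 0.95, 0.93),
           (0.12, 0.97, 0.94), (0.08, 0.96, 0.95)]
probs = universe_probabilities(samples)
print(probs[OPTIMIST], probs[PESSIMIST])
```

Octants that receive no draws are simply absent from the returned dictionary, which corresponds to the subspaces with negligible posterior probability.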

We can draw several conclusions from this analysis of the relative plausibility of the alternative universes. For the largest combined data set (all), the most likely universe was the pessimists’ (posterior probability 0.73), followed by the optimists’ (posterior probability 0.27). A very similar picture is observed for the smaller data sets (Fig. 5), but for all practical purposes, both universes successfully explained reality.

Optimum Parameter Values. Assuming that both experimental error rates (false-negative and false-positive) do not exceed 0.5, the optimum value for parameter ι is zero, whereas the optimum value of α depends on the length of the chain. The optimum values of α are close to 0.39, 0.29, and 0.21 for chains of length 3, 6, and 9, respectively (data not shown). As the chain grows longer, the optimum value of α becomes progressively smaller.

Supporting Information. For more information, see Figs. 6–23, which are published as supporting information on the PNAS web site.


We thank Lynn Caporale, Murat Cokol, Lyn Dupré, Michael Krauthammer, Ani Nenkova, Paul Pavlidis, Valerie Reinke, James J. Russo, Rita Rzhetsky, and Tian Zheng for comments on the earlier version of the manuscript; Dennis Vitkup for suggesting the term "momentum of a statement"; and Ahmet Sinav for the artwork. This study was supported by grants from the National Institutes of Health (to A.R. and K.P.W.), the National Science Foundation, the Department of Energy, the Cure Autism Now Foundation, and the Defense Advanced Research Projects Agency (to A.R.), and the W. M. Keck Foundation (to K.P.W.).


§To whom correspondence should be addressed. E-mail: [email protected]

Freely available online through the PNAS open access option.

Author contributions: A.R. and K.P.W. designed research; A.R. and I.I. performed research; A.R., I.I., and J.M.L. analyzed data; and A.R. and K.P.W. wrote the paper.

Conflict of interest statement: No conflicts declared.

© 2006 by The National Academy of Sciences of the USA


1. Friedman, C., Kra, P., Yu, H., Krauthammer, M. & Rzhetsky, A. (2001) Bioinformatics 17, Suppl. 1, S74–S82.
2. Krauthammer, M., Kra, P., Iossifov, I., Gomez, S. M., Hripcsak, G., Hatzivassiloglou, V., Friedman, C. & Rzhetsky, A. (2002) Bioinformatics 18, Suppl. 1, S249–S257.
3. Rzhetsky, A., Iossifov, I., Koike, T., Krauthammer, M., Kra, P., Morris, M., Yu, H., Duboue, P. A., Weng, W. & Wilbur, W. J., et al. (2004) J. Biomed. Inform. 37, 43–53.
4. Anderson, L. R. & Holt, C. A. (1997) Am. Econ. Rev. 87, 847–862.
5. Henrion, M. & Fischhoff, B. (1986) Am. J. Physics 54, 791–798.
6. Venclovas, C., Zemla, A., Fidelis, K. & Moult, J. (2003) Proteins 53, Suppl. 6, 585–595.
7. Gilks, W. R., Richardson, S. & Spiegelhalter, D. J., eds. (1996) Markov Chain Monte Carlo in Practice (Chapman & Hall/CRC, London).
8. Geyer, C. J. (1991) in Computer Science and Statistics: Proceedings of the 23rd Symposium on the Interface, ed. Keramidas, E. M. (Interface Found., Fairfax Station, VA), pp. 156–163.
9. Altekar, G., Dwarkadas, S., Huelsenbeck, J. P. & Ronquist, F. (2004) Bioinformatics 20, 407–415.
All very interesting, and it is certainly unusual that such papers get published in "scientific" journals, but really this has been said before within philosophy of science.

Particularly all those corollaries in the first paper; Harry Collins' groundbreaking book Changing Order laid bare the effects that supra-scientific concerns have on results. It's not a great leap forward to expand on those ideas to note that smaller sample sizes etc. reduce the probability of a "true" result.

But honestly - whatever happened to falsification? The most "realist" of philosophers of science (Popper et al) were very clear on that - hypotheses can never be proven to be "true", only "false".

Collins: http://www.amazon.com/gp/product/02...ef=sr_1_10/103-8665657-2258241?_encoding=UTF8