1. Read Only the Methods and Results; Bypass the Discussion
The Discussion—and to some extent the Introduction and Conclusion sections—of published research articles often offers inferences that differ from those a dispassionate reader would draw from the Methods and Results sections of the same articles.10
Consider, for example, 2 systematic reviews with meta-analyses published in 2001 that summarized randomized trials that assessed the effect of albumin use for fluid resuscitation. One review, funded by the Plasma Proteins Therapeutic Association, pooled 42 short-term trials reporting mortality and found no significant difference in mortality with albumin vs crystalloid solutions across all groups of patients (relative risk [RR], 1.11; 95% confidence interval [CI], 0.95-1.28) and in patients with burns (RR, 1.76; 95% CI, 0.97-3.17).11 The other review, funded by the UK National Health Service, pooled 31 short-term trials that reported mortality and found a significantly higher mortality with albumin in all patient groups (RR, 1.52; 95% CI, 1.17-1.99) and in patients with burns (RR, 2.40; 95% CI, 1.11-5.19).12
Although these 2 systematic reviews included a slightly different set of trials (eg, the former included an additional trial in patients with burns), both yield point estimates that suggest that albumin may increase mortality and CIs that include the possibility of a considerable increase in mortality. The trials were small, many had a high risk of bias, and the results were heterogeneous. The authors of the first review concluded, in their discussion, that their results “should serve to allay concerns regarding the safety of albumin.” In contrast, the discussion section of the second review recommended banning the use of albumin outside the context of a rigorously conducted RCT.
Authors of an editorial accompanying the first systematic review13 suggested that the funding source may have been, at least in part, responsible for the different interpretations. At that time, the Plasma Proteins Therapeutic Association was promoting access to and reimbursement for the use of albumin, an expensive intervention; on the other hand, the National Health Service paid for albumin use in the United Kingdom.
Examples of potential conflicts of interest that apparently drive conclusions abound. Systematic examinations of the association between funding and conclusions have found that trial investigators have greater enthusiasm for the experimental treatment when funded by for-profit than nonprofit interests.14-17 Even after adjusting for magnitude of treatment effect and adverse events, for-profit funding has been reported to be associated with a 5-fold increase in the odds of recommending an experimental drug as treatment of choice (odds ratio [OR], 5.3; 95% CI, 2.0-14.4) compared with nonprofit funding.14
These issues also extend to systematic reviews of RCTs. Industry-supported systematic reviews of drug treatments, although reporting similar treatment effects, provide more favorable conclusions than Cochrane reviews that address the same question.18 Industry influence also extends to cost-effectiveness analyses and clinical practice guidelines.19
To apply this first guide and thereby bypass the Discussion section of research reports, clinicians must be able to make sense of the methods and results.
2. Read the Summary Structured Abstract Published in Evidence-Based Secondary Publications (Preappraised Resources)
Secondary journals, such as ACP Journal Club, Evidence-Based Medicine, and Evidence-Based Mental Health, publish structured abstracts and commentary that summarize research articles published elsewhere. These materials are produced by a team of clinicians and methodologists, often in collaboration with the authors of the original articles. The abstracts often include critical information about research conduct (eg, allocation concealment; blinding of patients, clinicians, data collectors, data analysts, and outcome adjudicators; and complete follow-up) that may have been omitted from the original reports.20 They also may diminish some of the “spin” that distorts the abstracts of the original publications.21 The structured abstracts do not include the Introduction or Discussion sections of the original report or the conclusions of the original study. The title and conclusions of the secondary abstract are typically the product of critical appraisal by individuals for whom competing financial or personal interests will be minimal or absent.
Compare, for example, the ACP Journal Club abstract and commentary with that of the full publication of an important trial22 that addressed the prevention of stroke.23 The title of the original publication describes the study as testing “a perindopril-based blood pressure lowering regimen,” and the article reports that the perindopril-containing regimen resulted in a 28% relative risk reduction (RRR) in the risk of recurrent stroke (95% CI, 17%-38%).23
The ACP Journal Club abstract and its accompanying commentary identified the publication as describing 2 parallel but separate randomized placebo-controlled trials, including approximately 6100 patients with a history of stroke or transient ischemic attack. In the first trial, patients were randomized to receive perindopril or placebo; active treatment had no appreciable effect on stroke (RRR, 5%; 95% CI, –19% to 23%). In the second trial, patients were allocated to receive perindopril plus indapamide or double placebo. Combined treatment resulted in a 43% RRR (95% CI, 30%-54%) in recurrent stroke. The ACP Journal Club commentary notes that the authors, in communication with the editors, refused to accept the interpretation of the publication as reporting 2 separate RCTs (which explains why it is difficult for even the knowledgeable reader to get a clear picture of the design from the original publication).
The objectivity and methodologic sophistication of those preparing the independent structured abstracts may provide additional value for clinicians. We suggest reviewing the structured abstract of any article that appears in high-quality preappraised secondary publications. We do not claim that this methodologic review is perfect: residual hidden bias or misleading presentation may elude the methodologists. Nevertheless, the resource will, on occasion, prove helpful.
3. Beware Large Treatment Effects in Trials With Only a Few Events
Clinicians should be skeptical of large treatment effects from trials that are stopped early with few events (see Chapter 11.3, Randomized Trials Stopped Early for Benefit). In addition, clinicians should be cautious about an unusually large effect (eg, an RRR >50%) from a study with few events (eg, <100). One reason for caution is that investigators may not have applied a formal stopping rule but instead may have taken repeated looks at their data and chosen to stop early when they saw a large effect. If this is the case, neither the nominal P value nor the CI is valid.
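The inflation produced by repeated, unplanned looks can be demonstrated directly. The following simulation is a minimal sketch, not part of the original chapter; the number of looks and the sample sizes are arbitrary illustrations:

```python
# Monte Carlo sketch: repeated, unplanned looks at accumulating trial data
# inflate the false-positive rate well beyond the nominal alpha of .05.
# The number of looks and sample sizes here are arbitrary illustrations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, looks = 2000, [100, 200, 300, 400, 500]
false_positives = 0

for _ in range(n_sims):
    # Both arms are drawn from the same distribution: the true effect is 0.
    treatment = rng.normal(0, 1, looks[-1])
    control = rng.normal(0, 1, looks[-1])
    # "Look" at the data at 5 interim points; stop at the first P < .05.
    if any(stats.ttest_ind(treatment[:n], control[:n]).pvalue < 0.05
           for n in looks):
        false_positives += 1

print(f"False-positive rate with 5 looks: {false_positives / n_sims:.1%}")
# Typically prints a rate near 14%, far above the nominal 5%.
```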
Very large effects are implausible because multiple mechanisms underlie most diseases, and therapies typically address only 1 or 2 of those mechanisms.24 The complementary success of angiotensin-converting enzyme (ACE) inhibitors, antiplatelet agents, lipid-lowering agents, and β-blockers in reducing cardiac events in patients with myocardial infarction (MI) illustrates this multiplicity of disease mechanisms. Predictably, each agent offers only a modest magnitude of risk reduction (from 20% to 33%).
An empirical evaluation of more than 85 000 meta-analytic forest plots from 3082 systematic reviews indicates that in almost 10% of the analyses, the first trial had statistically significant results and a very large effect, but the conduct of subsequent studies almost always revealed a much smaller treatment effect.25
For example, a study conducted in 1997 aimed to determine the efficacy and safety of an angiotensin II receptor blocker (ARB) compared with an ACE inhibitor in patients with heart failure.26 This trial randomized 772 patients and found a 46% RRR in death with ARB treatment (P = .03). However, only 49 events were observed. Subsequently, a large RCT that recruited 3152 participants found no mortality benefit for the same comparison.27 A larger trial of 5477 patients with congestive heart failure did not find a statistically significant increase in mortality with ARB treatment (RR, 1.13; 95% CI, 0.99-1.28; P = .07).28 Finally, a Cochrane systematic review that included 22 studies and more than 17 000 patients found that ARBs, compared with ACE inhibitors, have a similar effect on mortality (RR, 1.05; 95% CI, 0.91-1.22; P = .48).29 Thus, evidence of promising large treatment effects from small—or even not so small—RCTs should be used cautiously. The possibility that further, larger trials or meta-analyses will contradict early results cannot be dismissed.24
Consider another RCT, stopped early, that assessed the effects of β-blockers in 112 participants undergoing surgery for peripheral vascular disease.30 Of 59 patients who received the intervention, 2 had a major event (perioperative mortality or nonfatal MI) compared with 18 of 53 patients who received standard care (RR, 0.10; 95% CI, 0.02-0.41). With only 20 events in total, the study results suggested a large treatment effect—a 90% RRR.
Although the CI excludes an RR of 1.0, conclusions derived from this very large treatment effect, estimated in a trial that was stopped early with a small sample size and only 20 events, warrant extreme caution (see Chapter 11.3, Randomized Trials Stopped Early for Benefit). Another reason for questioning the trial results emerged subsequently: the trial was identified as a possible case of research misconduct.30 Box 13.3-2 presents 6 reasons to be cautious about adopting new treatments on the basis of initial promising results, including the possibility of subsequent discovery of research misconduct.
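The arithmetic behind these numbers is easy to verify. The sketch below is our own check, using the conventional confidence interval for the logarithm of the RR; it reproduces both the point estimate and the CI reported above:

```python
# Worked check of the beta-blocker trial arithmetic (2/59 vs 18/53),
# using the standard log relative-risk confidence interval.
import math

events_rx, n_rx = 2, 59      # intervention: events / patients
events_ctl, n_ctl = 18, 53   # standard care: events / patients

rr = (events_rx / n_rx) / (events_ctl / n_ctl)
# Standard error of log(RR): sqrt(1/a - 1/n1 + 1/c - 1/n2)
se = math.sqrt(1/events_rx - 1/n_rx + 1/events_ctl - 1/n_ctl)
lo, hi = (math.exp(math.log(rr) + z * se) for z in (-1.96, 1.96))

print(f"RR = {rr:.2f} (95% CI, {lo:.2f}-{hi:.2f})")  # RR = 0.10 (0.02-0.41)
print(f"RRR = {1 - rr:.0%}")                         # 90%
```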
BOX 13.3-2
Reasons for Being Cautious in Adopting New Interventions
Initial studies may be biased by inadequacies in concealment, blinding, loss to follow-up, or stopping early.
Initial studies are particularly susceptible to reporting bias.
Initial studies are particularly susceptible to dissemination bias; markedly positive studies are likely to receive disproportionate attention.
Initial studies may overestimate effects by chance (particularly if effects are large and the number of events is small).
There is a substantial probability (20%) that serious adverse effects will emerge subsequently (cyclooxygenase 2 inhibitors provide a notable example).
On rare occasions, research results will prove to have been fraudulent.
It is not only individual trials that sometimes provide potentially misleading large estimates of effect on the basis of relatively small numbers of events—this is also true of systematic reviews and meta-analyses. Consider a systematic review of RCTs that evaluated antibiotic prophylaxis in neutropenic patients and concluded that prophylaxis with fluoroquinolones reduces the risk of infection-related mortality by an impressive 62% (RR, 0.38; 95% CI, 0.21-0.69; P = .001).31 In total, 1022 patients were included in this meta-analysis, with only 47 events. If a trialist were planning an RCT to answer the same clinical question, a minimum sample size of 6400 participants would be required to detect a 25% RRR in infection-related mortality (assuming α = 0.05, β = 0.20, and a control event rate of 7%). We call the sample size required for a single trial anticipating a modest treatment effect the optimal information size (OIS). The fact that the total sample size in this meta-analysis (n = 1022) is substantially smaller than the OIS (n = 6400), the remarkable 62% RRR in mortality, and the relatively small number of events (n = 47) all support skepticism regarding the results.
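Readers who wish to reproduce an OIS of this kind can apply the conventional 2-proportion sample-size formula to the numbers given above. The sketch below is our own illustration; the exact result depends on which variant of the formula is used (this variant yields roughly 5900, and more conservative variants approach the 6400 cited in the text):

```python
# Sketch of an optimal information size (OIS) calculation using the
# conventional two-proportion sample-size formula. The exact answer
# depends on the formula variant; this one lands near the cited 6400.
from scipy.stats import norm

alpha, beta = 0.05, 0.20
control_rate = 0.07                     # assumed control event rate
rrr = 0.25                              # plausible, modest treatment effect
treat_rate = control_rate * (1 - rrr)   # 5.25%

z_a = norm.ppf(1 - alpha / 2)  # 1.96
z_b = norm.ppf(1 - beta)       # 0.84

variance = control_rate * (1 - control_rate) + treat_rate * (1 - treat_rate)
n_per_arm = (z_a + z_b) ** 2 * variance / (control_rate - treat_rate) ** 2

print(f"~{2 * n_per_arm:,.0f} participants in total")  # roughly 5,900
```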
One final consideration is the concept of fragility, which refers to how the inferences from a clinical trial might differ if just a few events were changed to nonevents or vice versa. One can apply the fragility concept to the second Leicester Intravenous Magnesium Intervention Trial (LIMIT-2), which assessed the effect of intravenous magnesium in 2316 participants with suspected acute MI.32 Of 1159 patients receiving the intervention, 90 died, compared with 118 of 1157 in the placebo group (RRR, 24%; 95% CI, 1%-43%). Although the treatment effect in this trial is relatively modest and the number of events is substantial (>100), the results still can be misleading, as a subsequent trial (the Fourth International Study of Infarct Survival) demonstrated.33
If one considers how the results of LIMIT-2 might change if only a few events were missed in the intervention group (eg, because of losses to follow-up, assessor bias, or chance), the CI would quickly move toward the null. In LIMIT-2, if only 2 events were missed in the intervention group, the results would lose their statistical significance. Therefore, when the number of events required to move the P value past the conventional threshold for statistical significance is small, one should be cautious about believing that a treatment effect truly exists.
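This fragility calculation can be reproduced directly. The sketch below is our own illustration; it uses an uncorrected χ2 test as the significance criterion, and the count can shift by 1 with other tests (eg, the Fisher exact test):

```python
# Fragility sketch for LIMIT-2 (90/1159 vs 118/1157 deaths): how many
# nonevents in the intervention arm must become events before P >= .05?
# Uses an uncorrected chi-square test; other tests may shift the count by 1.
from scipy.stats import chi2_contingency

def fragility(events_rx, n_rx, events_ctl, n_ctl, alpha=0.05):
    added = 0
    while True:
        table = [[events_rx + added, n_rx - events_rx - added],
                 [events_ctl, n_ctl - events_ctl]]
        if chi2_contingency(table, correction=False)[1] >= alpha:
            return added  # events needed to lose statistical significance
        added += 1

print(fragility(90, 1159, 118, 1157))  # prints 2, as stated in the text
```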
The implication is clear: beware of large effects with a small number of events because the results are likely to be misleading. Be careful even with larger numbers of events and a modest sample size because study results can still be fragile. Statistical simulations suggest that—in the face of substantial adverse effects, burden, or cost—changes in practice should generally wait until at least 1 replication has been reported and at least 300 events have accrued across the available studies (see also Guide 7 in this chapter).34
4. Beware Faulty Comparators
Industry-funded studies typically yield larger treatment effects than studies funded by nonprofit organizations.3,16,17,35,36 One major explanation is the choice of comparators.37 The use of placebo or no treatment as a comparator is common, even when RCTs have established the effectiveness of active treatments.38
This frequent use of placebo or no-treatment comparators when effective treatments are available results in very limited availability of head-to-head comparisons of what are otherwise considered first-choice treatments.39 The biased choice of comparators extends to meta-analyses of randomized trials, in which the focus may be on making a case to promote specific agents.40 Box 13.3-3 lists the types of faulty comparators to which clinicians should be alert.
BOX 13.3-3
Types of Faulty Comparators
Comparison with placebo when effective agents are available
Comparison with less effective agents when more effective comparators are available
Comparison with more toxic agents when less toxic comparators are available
Comparison with too low a dose (or inadequate dose titration) of an otherwise effective comparator, leading to misleading claims of effectiveness
Comparison with a too high (and thus toxic) dose (or inadequate dose titration) of an otherwise safe comparator, leading to misleading claims of lower toxicity
A study of 136 trials of new treatments for multiple myeloma provides an illustration of likely industry bias in the choice of comparators. Of the trials funded by for-profit entities, 60% compared their new interventions against placebo or no treatment; this was true of only 21% of trials funded by nonprofit organizations.35
In another example, 3 important trials of ARBs for patients with diabetic nephropathy used placebo—rather than ACE inhibitors, which have demonstrated effectiveness—as the control management strategy.41-43 The accompanying editorial suggested that the economic interests of the sponsor dictated that choice of comparator. The sponsors may have avoided an ACE inhibitor control group because “…sales of angiotensin-receptor blockers would be lower if the 2 classes of drugs proved equally effective.”44
Choice of dose and administration regimen also can result in misleading comparisons,45 such as when a study includes less effective or more toxic agents rather than the best available ones, or includes the best available agent at an excessively low or excessively high dose.
For example, Safer45 identified 8 trials sponsored by 3 drug companies that compared newer second-generation neuroleptic agents with a fixed high dose (20 mg/d; optimal dosing, <12 mg/d46) of haloperidol. Not surprisingly, these trials found that patients who used the new agents had fewer extrapyramidal adverse effects. Safer45 offers another example in which a study compared paroxetine against amitriptyline, a sedating tricyclic antidepressant. The trial administered amitriptyline twice daily, possibly leading to excessive daytime somnolence.47 In a separate example, Johansen and Gotzsche48 noted the use of an ineffective comparator (nystatin) and the use of an inadequate and unusual administration route (oral amphotericin B, poorly absorbed in the gastrointestinal tract) as comparators in RCTs of the efficacy of antifungals in patients with cancer and neutropenia.
When reading reports of RCTs, clinicians should ask whether the comparator should have been another active agent rather than placebo. If the comparator was an active agent, the question is whether the dose, formulation, and administration regimen were optimal.
5. Beware Small Treatment Effects and Extrapolation to Very Low-Risk Patients
Pharmaceutical companies are conducting very large RCTs to be able to exclude chance as an explanation for small treatment effects. Results are consistent with small treatment effects when either the point estimate is very close to no effect (an RRR or absolute risk reduction [ARR] close to 0; an RR or OR close to 1) or the CI includes values close to no effect.
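For reference, these measures relate to one another in a few simple ways. The sketch below is our own illustration, with invented event rates, making the relationships concrete:

```python
# Relationships among the effect measures used throughout this guide.
# The event rates below are invented purely for illustration.
cer = 0.10  # control event rate
eer = 0.05  # experimental event rate

arr = cer - eer   # absolute risk reduction: 0.05
rr = eer / cer    # relative risk: 0.50
rrr = 1 - rr      # relative risk reduction: 50% (equivalently, arr / cer)
nnt = 1 / arr     # number needed to treat: 20

print(f"ARR = {arr:.2f}, RR = {rr:.2f}, RRR = {rrr:.0%}, NNT = {nnt:.0f}")
# A "50% RRR" can describe anything from a dramatic benefit to a trivial
# one, depending entirely on the baseline risk (the CER).
```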
For example, in a very large trial of antihypertensive regimens, investigators randomly allocated more than 6000 individuals to receive ACE inhibitor therapy vs diuretic agents and concluded “initiation of antihypertensive treatment involving ACE inhibitors in older subjects … appears to lead to better outcomes than treatment with diuretic agents….”49 In absolute terms, however, the difference between the regimens was small: there were 4.2 events per 100 patient-years and 4.6 events per 100 patient-years in the ACE inhibitor and diuretic groups, respectively. The corresponding RRR of 11% had an associated 95% CI of –1% to 21%.
In this case, we have 2 reasons to doubt the importance of the apparent difference between treatment groups. First, the point estimate suggests a small absolute difference (0.4 events per 100 patient-years), and second, the CI suggests it may have been even smaller. Indeed, there may have been no true difference at all.
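Translating the rates above into absolute terms makes the point concrete. This sketch is our own illustration, taking the reported point estimates at face value:

```python
# The antihypertensive trial described above, in absolute terms,
# taking the reported point estimates at face value.
rate_ace, rate_diuretic = 4.2, 4.6   # events per 100 patient-years

abs_diff = rate_diuretic - rate_ace        # 0.4 per 100 patient-years
patient_years_per_event = 100 / abs_diff   # 250

print(f"About {patient_years_per_event:.0f} patient-years of ACE-inhibitor "
      "treatment per event avoided")
```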
There are a variety of strategies that investigators and sponsors use to create a spurious impression of a large treatment effect (Box 13.3-4). When the absolute risk of adverse outcomes in untreated patients—the baseline risk—is low, you are likely to see a presentation that focuses on the RRR and deemphasizes or ignores the ARR. The focus on RRR conveys a spurious sense of the importance of the result.
BOX 13.3-4
Strategies for Making a Treatment Effect Appear Larger Than It Is
Use relative rather than absolute risk; a 50% relative risk reduction may mean a decrease in risk from 1% to 0.5%.
Express risk during a long period; the reduction in risk from 1% to 0.5% may occur during 10 years.
For visual presentations, make sure the x-axis intersects the y-axis well above 0; if the x-axis intersects the y-axis at 60%, you can make an improvement from 70% to 75% appear as a 50% increase in survival (see the sketch after this box).
Include a few high-risk patients in a trial of predominantly low-risk patients; even though most events occur in high-risk individuals, claim important benefits for a large number of low-risk patients in the general population.
Ignore the lower boundary of the confidence interval (CI); when the lower boundary of the CI around the relative risk reduction approaches 0, declare significance and henceforth focus exclusively on the point estimate.
Focus on statistical significance; when a result achieves statistical significance but both relative and absolute effects are small, highlight the statistical significance and downplay or ignore the magnitude.
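The axis-truncation strategy in Box 13.3-4 is worth quantifying. In the sketch below (our own illustration, using the numbers from the box), the visual impression depends on where the axes cross, not on the data:

```python
# How y-axis truncation exaggerates a visual effect (Box 13.3-4):
# bar heights are read from where the x-axis crosses, not from 0.
baseline_cross = 0.60              # x-axis intersects the y-axis at 60%
survival_ctl, survival_rx = 0.70, 0.75

true_increase = survival_rx / survival_ctl - 1
visual_increase = ((survival_rx - baseline_cross)
                   / (survival_ctl - baseline_cross) - 1)

print(f"True relative increase:   {true_increase:.0%}")    # 7%
print(f"Apparent visual increase: {visual_increase:.0%}")  # 50%
```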
For instance, the European Trial on the Reduction of Cardiac Events with Perindopril in Stable Coronary Artery Disease (EUROPA) found a reduction in MI with perindopril in patients with stable coronary artery disease, and the result was hailed as a breakthrough. The RRR in MI of 22% (95% CI, 10%-33%) translates into an ARR of 1.4% during 4 years. Thus, clinicians must treat approximately 70 patients for 4 years to prevent a single MI. When one considers that most of these patients may already be taking aspirin or warfarin, a statin, and a β-blocker to reduce their MI risk, one may question the characterization of the incremental benefit as a breakthrough.
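The number needed to treat cited above follows directly from the ARR; a one-line check:

```python
# Worked check of the EUROPA arithmetic: from ARR to number needed to treat.
arr_4yr = 0.014  # 1.4% absolute risk reduction over 4 years
nnt_4yr = 1 / arr_4yr
print(f"Treat {nnt_4yr:.0f} patients for 4 years to prevent 1 MI")
# Prints 71, ie, approximately 70, as stated in the text.
```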
Other techniques complement the use of RRRs in making treatment effects appear large. For visual presentations, beware of survival curves in which the x-axis intersects the y-axis well above the 0 level, giving the visual impression of a large effect.50 Another technique relates to the choice of time span for presenting a treatment effect: long periods for effects that investigators or sponsors wish to make appear large and short periods for those they wish to make appear small.
For instance, McCormack and Greenhalgh51 pointed out that report 33 of the UK Prospective Diabetes Study trial52 expressed the risk of severe hypoglycemia as the percentage of participants per year (eg, 2.3% per year for patients receiving insulin). This contrasts with the expression of the benefits as the percentage of participants during 10 years (eg, 3.2% absolute reduction in the risk of any diabetes-related end points). By choosing to express harms during a short period (per year) and benefits during a long period (a decade), the presentation obscures the fact that the absolute increase in frequency of hypoglycemia with intensive glycemic control is approximately 7 times the absolute reduction in diabetes complications.
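Putting harms and benefits on the same time scale exposes the asymmetry. The sketch below is our own check of the arithmetic; multiplying the annual rate by 10 is a simple approximation that ignores compounding:

```python
# Worked check of the UKPDS framing asymmetry: put harms and benefits on
# the same 10-year time scale before comparing them.
harm_per_year = 0.023   # severe hypoglycemia, reported per year
benefit_10yr = 0.032    # diabetes-related end points, reported per decade

# Simple approximation (ignores compounding): ~23% over 10 years.
harm_10yr = harm_per_year * 10

print(f"Harm/benefit ratio over 10 years: {harm_10yr / benefit_10yr:.1f}")
# Prints 7.2, the "approximately 7 times" in the text.
```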
A shift of the target study population to include very low-risk patients means a potentially major expansion in market size for the agent and a consequently larger effect on health care costs associated with small and possibly marginal gains in health. In the past few years, several professional societies have decreased the thresholds for diagnosis and treatment of hypertension, diabetes, and hyperlipidemia, which has increased the proportion of people eligible for treatment.53,54 Even if RCTs reveal benefits in populations that include such very low-risk patients, the number of events in very low-risk patients is typically small, and the results of such trials are driven entirely by the few higher-risk patients.55
Whenever relative or absolute benefits are small or the lower boundary of the CI approaches no effect, the treatment benefits and the potential harm, inconveniences, and costs are likely to be, at best, finely balanced. Judicious rather than routine administration of new drugs under these circumstances is likely to best serve patient needs and represent prudent allocation of health care resources.
6. Beware Uneven Emphasis on Benefits and Harms
Clinical decision making requires a balanced interpretation of both benefits and harms associated with any intervention. Unfortunately, many clinical trials neglect even the minimal reporting of harm.56,57 In an analysis of trials from 7 areas, investigators found that the space allocated to harms was slightly less than the space allocated to the names of authors and their affiliations.56 Even when investigators report some information regarding harms, failure to present event rates in treatment and control groups, omission of severity of the events, or inappropriate combining of disparate events can compromise sensible interpretation. Despite some improvement in reporting of harms over time in some areas, most fields continue to devote suboptimal attention to intervention harms.58
For example, a trial of intravenous immunoglobulin in advanced human immunodeficiency virus infection that was stopped early because of efficacy failed to mention any adverse events.59 In this trial, omission of harm data compounds problems associated with early discontinuation (see Chapter 11.3, Randomized Trials Stopped Early for Benefit). In another example, a placebo-controlled trial of nabumetone for rheumatoid arthritis stated that “the adverse experience profiles were similar for both treatment groups,” with no further information concerning the nature of the adverse effects.60
7. Wait for the Overall Results to Emerge; Do Not Rush
Many clinical specialties move quickly in introducing new treatments, diagnostic tests, and other interventions. Although this is exciting and often may improve patient outcomes, problems arise when clinicians adopt interventions prematurely. The most common problem is that early claims of efficacy or effectiveness are exaggerated. As clinical studies accumulate, it is more common for effects to shrink than to increase.25,61-63
An initial study may reveal a very large effect, and when the next study reveals a negligible or even negative effect, the result is controversy. This scenario is most commonly observed in molecular medicine studies, in which turnaround of information can be fast and proposed hypotheses can be rejected rapidly. Subsequent studies of the same question may reveal intermediate results between these 2 extremes.64,65
For example, an article in 1994 reported that a variant of the vitamin D receptor gene explains most of the population risk for low bone-mineral density (ie, weak bones prone to fracture).66 The finding made the cover of Nature, which heralded the “osteoporosis gene.” Subsequent studies revealed an opposite effect, with the same variant predisposing to stronger bones. A later large-scale analysis, with 100-fold more participants than the original Nature study, revealed no effect at all.67
Another reason to wait for more evidence is that RCTs do not enroll sufficient patients or follow them long enough to permit detection of relatively uncommon, serious adverse events, particularly when those events also occur in the absence of the intervention (such as MIs that occur without exposure to cyclooxygenase 2 inhibitors).68 For example, approximately 20% of the drugs that the US Food and Drug Administration (FDA) licenses are either withdrawn from the market or have major safety warnings added to their labels within 25 years of initial licensing.69
In 2006, the DREAM (Diabetes Reduction Assessment with Ramipril and Rosiglitazone Medication) study reported that the use of rosiglitazone at 8 mg/d for 3 years, compared with placebo, “reduces incident type 2 diabetes (HR, 0.38; 95% CI, 0.33 to 0.44) and increases the likelihood of regression to normoglycemia in adults with impaired fasting glucose or impaired glucose tolerance, or both.” The study also reported, however, a higher, although not statistically significant, risk of MI (hazard ratio [HR], 1.66; 95% CI, 0.73-3.80).70 Two subsequent systematic reviews72,73 that included more than 35 000 patients provided additional evidence that rosiglitazone increases the risk of MI (OR, 1.43; 95% CI, 1.03-1.98; P = .03; and OR, 1.28; 95% CI, 1.02-1.63; P = .04).
A final reason to wait is that evidence of serious misrepresentation of results may emerge. For instance, the original published report of a trial that investigated the toxicity of anti-inflammatory drugs contained only 6-month data and indicated that celecoxib caused fewer symptomatic ulcers and ulcer complications than diclofenac or ibuprofen.74 However, when the FDA reviewed the 12-month data from the trial’s 2 component studies, the result was inconclusive: the RR for ulcer complications with celecoxib vs ibuprofen or diclofenac was 0.83 (95% CI, 0.46-1.50).75 The authors explained their omission on the basis of large differential loss to follow-up, particularly of high-risk patients in the diclofenac arm, after 6 months.76 Fortunately, such egregious instances of misleading presentation of evidence are rare.
Box 13.3-2 provides a number of reasons for caution in adopting new interventions. In every case in which new promising interventions are available, the clinician should balance the risk of offering potentially suboptimal management by using the established intervention vs prematurely offering the new intervention that may be less effective than advertised or may be associated with as yet undisclosed or unknown toxicity. The decision is not easy, particularly because clinicians face both marketing pressures and peer pressure to be up-to-date according to what is reported in scientific meetings and medical journals. Indeed, many may perceive themselves as practicing evidence-based medicine when they adopt the newest therapy tested in a recently published RCT.