# Chapter 12.1: Hypothesis Testing

For every treatment, there is a true, underlying effect that any individual experiment can only estimate (see Chapter 6, Why Study Results Mislead: Bias and Random Error). Investigators use statistical methods to advance their understanding of this true effect. This chapter explores the logic underlying one approach to statistical inquiry: hypothesis testing. Readers interested in how to teach the *concepts* reviewed in this chapter to clinical learners may wish to consult an interactive script we have developed for this purpose.^{1}

The hypothesis-testing approach to statistical exploration begins with what is called a *null hypothesis* and tries to disprove that hypothesis. Typically, the null hypothesis states that there is no difference between the interventions being compared. To start our discussion, we will focus on *dichotomous* (yes/no) *outcomes*, such as dead or alive, or hospitalized or not hospitalized.

For instance, in a comparison of vasodilator treatment in 804 men with heart failure, investigators compared the proportion of enalapril-treated patients who died with the proportion of patients who received a combination of hydralazine and nitrates who died.^{2} We start with the assumption that the treatments are equally effective, and we adhere to this position unless the results make it untenable. We could state the null hypothesis in the vasodilator trial more formally as follows: the true difference in the proportion of patients surviving between those treated with enalapril and those treated with hydralazine and nitrates is 0.

In this hypothesis-testing framework, the statistical analysis addresses the question of whether the observed data are consistent with the null hypothesis. Even if the treatment truly has no positive or negative effect on the outcome (ie, the *effect size* is 0), the results observed will rarely agree exactly with the null hypothesis. For instance, even if a treatment has no true effect on mortality, seldom will we see exactly the same proportion of deaths in treatment and *control groups*. As the results diverge farther and farther from the finding of “no difference,” however, the null hypothesis that there is no true difference between the treatments becomes progressively less credible. If the difference between results of the treatment and control groups becomes large enough, we abandon belief in the null hypothesis. We further develop the underlying logic by describing the role of chance in clinical research.

In Chapter 6, Why Study Results Mislead: Bias and Random Error, we considered a balanced coin with which the true *probability* of obtaining either heads or tails in any individual coin toss is 0.5. We noted that if we tossed such a coin 10 times, we would not be surprised if we did not see exactly 5 heads and 5 tails. Occasionally, we would get results quite divergent from the 5:5 split, such as 8:2 or even 9:1. Furthermore, very infrequently, the 10 coin tosses would result in 10 heads or tails.

Chance is responsible for this variability in results. Certain recreational games illustrate the way chance operates. On occasion, the roll of 2 unbiased dice (dice with an equal probability of rolling any number between 1 and 6) will yield 2 ones or 2 sixes. On occasion (much to the delight of the recipient), the dealer at a poker game will deal a hand that consists of 5 cards of a single suit. Even less frequently, the 5 cards will not only belong to a single suit but also have consecutive face values.

Chance is not restricted to the world of coin tosses, dice, and card games. If we take a sample of patients from a community, chance may result in unusual and potentially misleading distributions of chronic disease, such as hypertension or diabetes. Chance also may be responsible for substantial imbalance in *event rates* in 2 groups of patients given different treatments that are, in fact, equally effective. Much statistical inquiry is geared toward determining the extent to which unbalanced distributions could be attributed to chance and the extent to which we should invoke other explanations (differential *treatment effects*, for instance). As we discuss in this chapter, the size of the study (determining, in turn, the number of events) to a large extent determines the conclusions of its statistical inquiry.

One error that an investigator can make is to conclude that an outcome differs between a treatment group and a control group when, in fact, no such difference exists. In statistical terms, making the mistake of concluding that treatment and control differ when, in truth, they do not is called a *type I error,* and the probability of making such an error is referred to as the *α* *level*.

Imagine a situation in which we are uncertain whether a coin is biased. We could construct a null hypothesis that the true proportions of heads and tails are equal (ie, the coin is unbiased). With this scenario, the probability of any given toss landing heads is 50%, as is the probability of any given toss landing tails. We could test this hypothesis by an experiment in which we conduct a series of coin tosses. Statistical analysis of the results of the experiment would address the question of whether the results observed were consistent with chance.

Let us conduct a hypothetical experiment in which the suspect coin is tossed 10 times, and on all 10 occasions, the result is heads. How likely is this to have occurred if the coin were indeed unbiased? Most people would conclude that chance alone is highly unlikely to explain such an extreme result. We would therefore be ready to reject the hypothesis that the coin is unbiased (the null hypothesis) and conclude that the coin is biased toward a toss of heads.

Statistical methods allow us to be more precise by ascertaining just how likely such an unusual result is if the null hypothesis is true. The *law of multiplicative probabilities* for independent events (in which one event in no way influences the other) tells us that the probability of 10 consecutive heads can be found by multiplying the probability of a single head 10 times over; that is, (1/2) × (1/2) × (1/2), and so on, which yields a value of slightly less than 1 in a 1000. A series of 10 consecutive tails would be equally unusual and would also cause us to doubt that the coin was unbiased. The probability of getting 10 heads or 10 tails is just under 2/1000.
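As an illustration (this sketch is ours, not part of the original analysis), the multiplicative law above can be checked with a few lines of Python:

```python
# Probability of 10 consecutive heads with a fair coin,
# by the multiplicative law for independent events.
p_ten_heads = 0.5 ** 10          # (1/2) multiplied by itself 10 times
p_ten_tails = 0.5 ** 10          # 10 consecutive tails is equally unlikely
p_either = p_ten_heads + p_ten_tails

print(f"P(10 heads) = {p_ten_heads:.6f}")            # slightly less than 1 in 1000
print(f"P(10 heads or 10 tails) = {p_either:.6f}")   # just under 2 in 1000
```

The exact values are 1/1024 and 2/1024, which the text rounds to "slightly less than 1 in 1000" and "just under 2/1000."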

In a journal article, we would likely see this probability expressed as a *P* value, such as *P* = .002 (if the value was rounded to the third decimal). What is the precise meaning of this *P* value? If the coin were unbiased (ie, if the null hypothesis were true) and we were to repeat the experiment of the 10 coin tosses many times, by chance alone, we would get either 10 heads or 10 tails in approximately 2 per 1000 of these repetitions.
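The frequency interpretation of this *P* value lends itself to simulation. The following Python sketch (our illustration; the seed and number of repetitions are arbitrary choices) repeats the 10-toss experiment many times and counts how often all 10 tosses agree:

```python
import random

random.seed(1)  # fixed seed so the run is reproducible

n_experiments = 200_000
extreme = 0
for _ in range(n_experiments):
    heads = sum(random.random() < 0.5 for _ in range(10))  # one 10-toss experiment
    if heads in (0, 10):                                   # all tails or all heads
        extreme += 1

# The observed fraction hovers near the theoretical 2/1024, about 0.002
print(extreme / n_experiments)
```

With enough repetitions, the simulated fraction settles near the theoretical value of approximately 2 per 1000.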

The framework of hypothesis testing involves a yes/no decision. Are we willing to reject the null hypothesis? This choice involves a decision about how much risk or chance of making a type I error we are willing to accept. The reasoning implies a threshold value that demarcates a boundary. On one side of this boundary, we are unwilling to reject the null hypothesis; on the other side, we are ready to conclude that chance is no longer a plausible explanation for the results. The threshold chosen is the α level mentioned above.

To return to the example of 10 consecutive heads or tails, most people observing this distribution would be ready to reject the null hypothesis; such a result, it turns out, would be expected to occur by chance alone less than twice per 1000 experiments. What if we repeat the thought experiment, and this time we obtain 9 tails and 1 head? Once again, it is unlikely that the result is attributable to the play of chance alone. As shown in Figure 12.1-1 (which you will recognize from Chapter 6; the theoretical distribution of results on an infinite number of repetitions of the 10-coin-flip experiment when the coin is unbiased), the *P* value is .02, or 2 in 100. That is, if the coin were unbiased and the null hypothesis were true, we would expect results as extreme as—or more extreme than—those observed (ie, 10 heads or 10 tails, 9 heads and 1 tail, or 9 tails and 1 head) to occur by chance alone 2 times per 100 repetitions of the experiment.

Where we set this threshold or boundary is a matter of judgment. Statistical convention suggests a threshold that demarcates the plausible from the implausible at 5 times per 100, which is represented by an α value of .05. Once we have chosen our threshold (of α = .05, for example), we call a result that falls beyond this boundary (ie, the result gives *P* ≤ .05) *statistically significant*. The meaning of statistically significant, therefore, is “sufficiently unlikely to be due to chance alone that we are ready to reject the null hypothesis.”

Statistically significant findings occasionally happen by chance, and it is only convention that makes the .05 threshold sacrosanct. Suppose we set α = .01, so we reject the null hypothesis if *P* ≤ .01. A finding with a *P* < .01 will happen, simply by chance, 1% of the time if the null hypothesis is true; this means we would reject a true null hypothesis 1% of the time. If we wish to be more conservative (more sure when we reject the null hypothesis that chance cannot explain the difference observed), we might well choose a 1% threshold.

Let us repeat our experiment twice more, both times with a new coin. On the first repetition, we obtain 8 heads and 2 tails. Calculation of the *P* value associated with an 8/2 split tells us that, if the coin were unbiased, results as extreme as or more extreme than 8/2 (or 2/8) would occur solely as a result of the play of chance 11 times per 100 (*P* = .11) (Figure 12.1-1). We have crossed to the other side of the conventional boundary between what is plausible and what is implausible. If we accept the convention, the results are not statistically significant and we will not reject the null hypothesis.

On our final repetition of the experiment, we obtain 7 tails and 3 heads. Experience tells us that such a result, although not the most common, would not be unusual even if the coin were unbiased. The *P* value confirms our intuition: Results as extreme as, or more extreme than, this 7/3 split would occur under the null hypothesis 34 times per 100 (*P* = .34) (Figure 12.1-1). Again, we will not reject the null hypothesis.
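The *P* values quoted for the 9/1, 8/2, and 7/3 splits can be reproduced from the binomial distribution. Here is a minimal Python sketch (our illustration; the function name is ours):

```python
from math import comb

def two_sided_p(k_heads: int, n: int = 10) -> float:
    """P of a split as extreme as, or more extreme than, k_heads out of
    n tosses of a fair coin, counting both directions (two-sided)."""
    k = max(k_heads, n - k_heads)   # distance of the split from 50/50
    tail = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)       # the opposite split is equally extreme

print(round(two_sided_p(9), 2))   # 0.02  (9 heads / 1 tail, as in Figure 12.1-1)
print(round(two_sided_p(8), 2))   # 0.11  (8 heads / 2 tails)
print(round(two_sided_p(7), 2))   # 0.34  (7 heads / 3 tails)
```

Each value matches the corresponding figure in the text.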

When investigators compare 2 treatments, the question they ask is, how likely is it that the observed difference, or a larger one, could be a result of chance alone? If we accept the conventional boundary or threshold (*P* ≤ .05), we will reject the null hypothesis and conclude that the treatment has some effect when the answer to this question is that repetitions of the experiment would yield differences as extreme as or more extreme than those we have observed less than 5% of the time. The 5% refers to both the observed difference and an equally large difference in the opposite direction because both results will be equally implausible (ie, this is a 2-sided significance test). Investigators sometimes conduct 1-sided significance tests where they consider differences in only 1 direction.

Let us return to the example of the randomized trial in which investigators compared enalapril and the combination of hydralazine and nitrates in 804 men with heart failure. The results of this study illustrate hypothesis testing using a dichotomous (yes/no) outcome, in this case, mortality.^{2} During the *follow-up* period, which ranged from 6 months to 5.7 years, 132 of 403 patients (33%) assigned to receive enalapril died, as did 153 of 401 (38%) of those assigned to receive hydralazine and nitrates. Application of a statistical test that compares proportions (the *χ*^{2} *test*) reveals that if there were actually no underlying difference in mortality between the 2 groups, differences as large as or larger than those actually observed would be expected 11 times per 100 (*P* = .11). Using the hypothesis-testing framework and the conventional threshold of *P* ≤ .05, we would conclude that we cannot reject the null hypothesis and that the difference observed is compatible with chance.
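A stdlib-only Python sketch of the χ² test on this 2 × 2 mortality table (without the Yates continuity correction; the published analysis may have differed in detail) reproduces the reported *P* value:

```python
from math import erfc, sqrt

# Deaths / survivors in each arm of the vasodilator trial
a, b = 132, 403 - 132   # enalapril: 132 of 403 died
c, d = 153, 401 - 153   # hydralazine + nitrates: 153 of 401 died
n = a + b + c + d

# Pearson chi-square statistic for a 2x2 table (1 df, no continuity correction)
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# With 1 df, chi-square is the square of a standard normal deviate,
# so the two-sided P value follows from the normal distribution.
p = erfc(sqrt(chi2) / sqrt(2))
print(f"chi2 = {chi2:.2f}, P = {p:.2f}")   # P is approximately .11
```

The result agrees with the *P* = .11 reported in the text.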

Consider a woman who suspects she is pregnant and is undertaking a pregnancy test. The test has possible errors associated with its result. Figure 12.1-2 represents the 4 possible results: the woman is either pregnant or not pregnant, and the test result is either positive or negative. If the woman is pregnant, the test may be positive (*true positive*, cell a) or negative (*false negative*, cell c). If the woman is not pregnant, the test may be positive (*false positive*, cell b) or negative (*true negative*, cell d).

We can apply the same logic to the result of an experiment testing the effect of a treatment. The treatment either has an effect or it does not; the experiment is either positive (*P* ≤ .05) or negative (*P* > .05) (Figure 12.1-3). Here, a true-positive result occurs when there is a real treatment effect and the study results yield a *P* ≤ .05 (cell a), and a true-negative result occurs when treatment has no effect and the study yields a *P* > .05 (cell d). We refer to a false-positive result (no true treatment effect, *P* ≤ .05, cell b) as a type I error. When we set our threshold α at .05, we fix the probability of a type I error at 5%: 1 in 20 times we will be misled and true null hypotheses will be rejected.

Another type of error that an investigator can make is to conclude that an effective treatment is useless. We refer to such a false-negative result (treatment truly effective, *P* > .05, cell c) as a *type II error*. A type II error occurs when we erroneously dismiss an actual treatment effect—and a potentially useful treatment. The likelihood of making a type II error is called the β level. We expand on this logic in the following discussion.

A clinician might comment on the results of the comparison of treatment with enalapril with that of a combination of hydralazine and nitrates as follows: “Although I accept the 5% threshold and therefore agree that we cannot reject the null hypothesis, I am nevertheless still suspicious that enalapril results in a lower mortality than does the combination of hydralazine and nitrates. The experiment still leaves me in a state of uncertainty.” In making these statements, the clinician is recognizing the possibility of a second type of error, the type II error, in hypothesis testing.

In the comparison of enalapril with hydralazine and nitrates in which we have failed to reject the null hypothesis (*P* > .05), the question is whether this is a true-negative result (cell d) or a false-negative result, a type II error (cell c). The investigators found that 5% fewer patients receiving enalapril died than those receiving the alternative vasodilator regimen. If the true difference in mortality really were 5%, we would readily conclude that patients will receive an important benefit if we prescribe enalapril. Despite this, we were unable to reject the null hypothesis. Why is it that the investigators observed an important difference between the mortality rates and yet were unable to conclude that enalapril is superior to hydralazine and nitrates?

Whenever we observe a large difference between treatment and control groups and yet cannot reject the null hypothesis, we should consider the possibility that the problem is failure to enroll enough patients. The likelihood of missing an important difference (and, therefore, of making a type II error) decreases as the sample size, and thus the number of events, gets larger. The likelihood of avoiding a type II error (that is, of detecting a true difference when one exists) is referred to as *power*. When a study is at high risk of making a type II error, we say it has inadequate power to detect an important difference. The larger the sample size, the lower the risk of type II error and the greater the power of the study.
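The power problem can be made concrete with a simulation. The following Python sketch (our illustration, not the trial's analysis) treats the observed mortality proportions of 33% and 38% as if they were the true rates, assumes round group sizes of 400, and estimates how often a trial of this size would reach *P* ≤ .05:

```python
import random
from math import erfc, sqrt

random.seed(7)  # fixed seed so the run is reproducible

def two_prop_p(d1: int, n1: int, d2: int, n2: int) -> float:
    """Two-sided P value for a difference in proportions (normal approximation)."""
    p1, p2 = d1 / n1, d2 / n2
    pooled = (d1 + d2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return erfc(abs(p1 - p2) / se / sqrt(2))

# Assumed "true" rates: 33% vs 38% mortality, about 400 patients per arm
n1 = n2 = 400
sims, significant = 2000, 0
for _ in range(sims):
    d1 = sum(random.random() < 0.33 for _ in range(n1))  # deaths, arm 1
    d2 = sum(random.random() < 0.38 for _ in range(n2))  # deaths, arm 2
    if two_prop_p(d1, n1, d2, n2) <= 0.05:
        significant += 1

print(f"estimated power = {significant / sims:.2f}")  # well below the usual 80% target
```

Under these assumptions, most repetitions of such a trial would fail to reject the null hypothesis even though a clinically important 5% absolute mortality difference truly exists.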

Although the 804 patients recruited by the investigators conducting the vasodilator trial may sound like a substantial number, for dichotomous outcomes, such as mortality, even larger sample sizes are often required to detect small treatment effects. For example, researchers conducting the trials that established the optimal treatment of acute myocardial infarction with thrombolytic agents both anticipated and found *absolute differences* between treatment and control mortalities of less than 5%. Because of these small absolute differences between treatment and control, they required—and recruited—thousands of patients to ensure adequate power.

Whenever a trial has failed to reject the null hypothesis (ie, when *P* > .05), a possible interpretation is that the investigators may have missed a true treatment effect. In these negative studies, the larger the difference in effects in favor of the experimental treatment, the more likely it is that the investigators missed a true treatment effect.^{3} Another chapter in this book describes how to decide whether a study is large enough to provide a secure basis for clinical decisions (see Chapter 10, Confidence Intervals: Was the Single Study or Meta-analysis Large Enough?).

Thus, it is important to bear in mind that when a trial fails to reject the null hypothesis, this only means that there is no evidence of a difference between the interventions under comparison. This is different from concluding that the effects of the 2 interventions are the same.^{4}

Some studies are not designed to determine whether a new treatment is better than the current one, but rather whether a treatment that is less expensive, easier to administer, or less toxic is more or less as good, or at most only a little worse, than standard therapy. Such studies are often referred to as *equivalence trials* or *noninferiority trials* (see Chapter 8, How to Use a Noninferiority Trial).^{5}

In hypothesis testing, we are aiming to disprove the null hypothesis. In equivalence and noninferiority trials, the null hypotheses are different from those in superiority trials. The null hypothesis of an equivalence trial states that there is a true difference between the 2 treatments, whereas the null hypothesis of a noninferiority trial states that one treatment is better than the other. As a consequence, the interpretation of type I and type II errors changes.

Consider a noninferiority study of a new treatment that is in fact not worse than the standard. Consider further that the sample size (and thus the power) of the study is inadequate. If this is the case, the investigator runs the risk of a type II error: not rejecting the null hypothesis and thus failing to show that the new treatment is no worse than the previous standard. In these circumstances, patients who continue to receive standard therapy may miss important benefits of a noninferior and easier to administer, less expensive, or less toxic alternative.

To this point, our examples have used outcomes such as yes/no, heads or tails, and dying or not dying, all of which we can summarize as a proportion. Often, investigators compare the effects of 2 or more treatments using a variable, such as days in hospital or a score on a quality-of-life questionnaire. We call such variables, in which results can take a large number of values with small differences among those values, *continuous variables*. When we compare differences among groups using continuous outcomes, we typically ask whether we can exclude chance as the explanation of a difference in means.

The study of enalapril vs hydralazine and nitrates in patients with heart failure described previously^{2} provides an example of the use of a continuous variable as an outcome in a hypothesis test. The investigators compared the effect of the 2 regimens on exercise capacity. In contrast to the effect on mortality, which favored enalapril, exercise capacity improved with hydralazine and nitrates but not with enalapril. Using a test appropriate for continuous variables (eg, the *t* test), the investigators compared the changes in exercise capacity from baseline to 6 months in the patients receiving hydralazine and nitrates with those changes in the enalapril group during the same period. Exercise capacity in the hydralazine group improved more, and the differences between the 2 groups are unlikely to have occurred by chance (*P* = .02).

Suppose we have assembled a set of all 5 Canadian coins (nickel, dime, quarter, dollar, 2-dollar) and want to test the hypothesis that these 5 coins are unbiased. As in the earlier example, we toss each coin 10 times, counting the number of heads for each coin: 4, 7, 5, 9, and 4. Applying the reasoning from the single-coin experiment, we note that the 9 heads for the dollar coin would be extremely unlikely if the coin were unbiased, so we conclude that the dollar coin is biased (*P* = .02, as before). If we had specified that we were going to focus on only the dollar coin and ignored the results for the others, this experiment would have been identical to the single coin-tossing experiment.

However, we tossed 5 coins, and if any coin had shown 9 or more heads or 9 or more tails, we would have considered the result equally extreme. To calculate the *P* value for the observation that a single coin came up heads 9 times, we must therefore work out how unlikely it is, if all 5 coins are unbiased, that at least 1 coin shows 9 or more heads or 9 or more tails. Intuition tells us that 9 heads are more likely to occur when tossing 5 coins than when tossing only 1. Probability theory can tell us exactly how likely.

The chance of getting 9 or more identical results on 1 unbiased coin is 0.021 or 2.1%. This means that the chance of getting fewer than 9 identical results is 1 − 0.021 = 0.979. The chance that all 5 coins will have fewer than 9 identical results is 0.979 × 0.979 × 0.979 × 0.979 × 0.979 = 0.90. So there is a 1 − 0.90 = 0.10 or 10% chance that at least 1 coin will have 9 or more identical results if all 5 coins are unbiased.
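The arithmetic of this paragraph can be reproduced in a few lines of Python (our illustration):

```python
# P(9 or more identical results in 10 tosses of one fair coin):
# 9 or 10 heads, or 9 or 10 tails, out of the 1024 possible sequences
single = (10 + 1 + 10 + 1) / 2 ** 10
print(round(single, 3))            # 0.021, the 2.1% quoted in the text

none_extreme = (1 - single) ** 5   # chance that no coin shows so extreme a split
at_least_one = 1 - none_extreme    # chance that at least 1 of the 5 coins does
print(round(at_least_one, 2))      # 0.10, ie, the 10% quoted in the text
```

A result with a roughly 2% probability for a single coin thus has a roughly 10% probability of appearing somewhere among 5 coins.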

The example above illustrates that an outcome may be extremely unlikely in a single experiment but the same outcome would not be regarded as so unlikely in the context of repeated experiments. Consider a study that examined the effect of a treatment on 6 outcomes. To make the calculations easier, we will assume they are independent, meaning that one outcome on a patient does not depend in any way on the other outcomes.

Suppose we decided to test each outcome at the α = .05 level. For any single outcome, if the treatment is completely ineffective, there is indeed only a 5% chance that we will cross the significance threshold and reject the null hypothesis; there is a 95% chance that we will not reject it. What happens when we examine 6 outcomes? The chance of not crossing the threshold for the first 2 outcomes is 0.95 multiplied by 0.95; for all 6 outcomes, the probability that not a single outcome would cross the 5% threshold is 0.95 to the sixth power, or 0.74. The probability that at least 1 outcome has a result that crosses the significance threshold is therefore 1.0 – 0.74 = 26%, or approximately 1 in 4, rather than 1 in 20. If we wished to maintain our overall type I error rate of 0.05, we could divide the threshold α by 6, so that each of the 6 tests would use a boundary value of approximately 0.05 / 6 = 0.0083.^{6}
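The multiplicity calculation above, including the division of the α threshold across the 6 tests described in the text, can be sketched in Python (our illustration):

```python
alpha = 0.05
outcomes = 6   # number of independent outcomes tested

# Chance that none of the 6 tests crosses the threshold when the
# treatment is truly ineffective for every outcome
p_none = (1 - alpha) ** outcomes
p_at_least_one = 1 - p_none
print(f"P(at least 1 false-positive result) = {p_at_least_one:.2f}")  # about 0.26

# Sharing alpha across the 6 tests to keep the overall type I error near 5%
per_test_alpha = alpha / outcomes
print(f"per-test threshold = {per_test_alpha:.4f}")  # about 0.0083
```

With 6 independent tests at the .05 level, the chance of at least 1 spuriously "significant" result is roughly 1 in 4 rather than 1 in 20.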

Identifying the correct α level becomes appreciably more complicated when we are simultaneously considering more than 1 hypothesis. For instance, in the coin-tossing example above, we chose a single coin showing 9 heads as our measure of how extreme the results were. Faced with the same set of outcomes, someone else might have chosen the coins with 7 and 9 heads and asked how extreme that was. And someone else might wonder how extreme the entire set of 4, 7, 5, 9, and 4 heads was. We also need to decide on the relevant hypothesis or hypotheses to test. Are we interested in testing hypotheses about the unbiased nature of each coin and calculating a *P* value for each coin? Or are we interested in the single global null hypothesis that all of the coins are unbiased? If that is our null hypothesis and we reject it, we simply conclude that at least 1 coin is biased, without saying which coin it is.

We find an example of the dangers of using multiple outcomes in a randomized trial of the effect of rehabilitation on quality of life after myocardial infarction, in which investigators randomly assigned patients to receive standard care, an exercise program, or a counseling program. They obtained patient reports of 10 outcomes: work, leisure, quality of work and leisure, sexual activity, adherence with advice, cardiac symptoms, psychiatric symptoms, general health, and satisfaction with outcome.^{7} For almost all of these variables, there was no difference among the 3 groups. However, after 18 months of follow-up, patients were more satisfied with the exercise regimen than with the other 2 regimens, families in the counseling group were less protective than in the other groups, and patients participating in the counseling group worked more hours and had sexual intercourse more frequently.

Does this mean that both exercise and rehabilitation programs should be implemented because of the small number of outcomes that changed in their favor or that they should be rejected because most of the outcomes showed no difference? The authors themselves concluded that their results did not support the effectiveness of rehabilitation in improving quality of life. However, a program's advocate might argue that if even some of the ratings favored treatment, the intervention is worthwhile. The use of multiple instruments opens the door to such potential controversy.

We should be aware that multiple hypothesis testing may yield misleading results. A number of statistical strategies exist for dealing with the issue of multiple hypothesis testing on the same data set. We have illustrated a useful strategy for clinicians in a previous example: dividing the α threshold by the number of tests. One also can specify, before the study is undertaken, a single primary outcome on which the major conclusions of the study will hinge. Another approach when conducting a study is to derive a single global test statistic that effectively combines the multiple outcomes into a single measure.

Finally, we might argue that in some situations, we can conduct several hypothesis tests without adjusting for multiple comparisons. When the hypotheses being tested represent distinct scientific questions, each of interest in its own right, it may be that interpretation of each hypothesis should not be influenced by the number of other hypotheses being tested.^{6}

A full discussion of strategies for dealing with multiple outcomes is beyond the scope of the *Users' Guides to the Medical Literature*, but the interested reader can find a cogent discussion elsewhere.^{8}

At this point, you may be entertaining a number of questions that leave you uneasy. Why use a single cut point for rejecting the null hypothesis when the choice of a cut point is somewhat arbitrary? Why dichotomize the question of whether a treatment is effective into a yes/no issue when it may be viewed more appropriately as a continuum (for instance, from “very unlikely to be effective” to “almost certainly effective”)? See Chapter 10, Confidence Intervals: Was the Single Study or Meta-analysis Large Enough?, for an explanation of why we consider an alternative to hypothesis testing a superior approach.

*CMAJ*. 2004;171:online-1 to online-12. http://www.cmaj.ca/cgi/data/171/6/611/DC1/1. Accessed February 10, 2014.

*N Engl J Med*. 1991;325(5):303-310. [PubMed: 2057035]

*Arch Intern Med*. 1985;145(4):709-712. [PubMed: 3985731]

*J Clin Epidemiol*. 1991;44(8):839-849. [PubMed: 1941037]

*Encyclopedia of Biostatistics.* New York, NY: Wiley; 1999:2736-2746.

*Lancet*. 1981;2(8260-61):1399-1402. [PubMed: 6118768]

*Biometrics*. 1987;43(3):487-498. [PubMed: 3663814]