One error that an investigator can make is to conclude that an outcome differs between a treatment group and a control group when, in fact, no such difference exists. In statistical terms, this mistake of concluding that treatment and control differ when, in truth, they do not is called a type I error, and the probability of making such an error is referred to as the α level.
Imagine a situation in which we are uncertain whether a coin is biased. We could construct a null hypothesis that the true proportions of heads and tails are equal (ie, the coin is unbiased). With this scenario, the probability of any given toss landing heads is 50%, as is the probability of any given toss landing tails. We could test this hypothesis by an experiment in which we conduct a series of coin tosses. Statistical analysis of the results of the experiment would address the question of whether the results observed were consistent with chance.
Let us conduct a hypothetical experiment in which the suspect coin is tossed 10 times, and on all 10 occasions, the result is heads. How likely is this to have occurred if the coin were indeed unbiased? Most people would conclude that chance is a highly unlikely explanation for so extreme a result. We would therefore be ready to reject the hypothesis that the coin is unbiased (the null hypothesis) and conclude that the coin is biased toward a toss of heads.
Statistical methods allow us to be more precise by ascertaining just how likely such an unusual result is if the null hypothesis is true. The law of multiplicative probabilities for independent events (in which one event in no way influences the other) tells us that the probability of 10 consecutive heads can be found by multiplying the probability of a single head 10 times over; that is, (1/2) × (1/2) × (1/2), and so on, which yields a value of slightly less than 1 in 1000. A series of 10 consecutive tails would be equally unusual and would also cause us to doubt that the coin was unbiased. The probability of getting 10 heads or 10 tails is just under 2 in 1000.
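The multiplication described above takes only a few lines to verify; the following sketch (illustrative, not part of the original analysis) uses the toss count and per-toss probability from the example:

```python
# Probability of one specific sequence: 10 consecutive heads with a fair coin.
# By the multiplicative law for independent events, this is (1/2) ** 10.
p_ten_heads = (1 / 2) ** 10
print(p_ten_heads)         # 0.0009765625, slightly less than 1 in 1000

# Either 10 heads or 10 tails (two equally extreme, mutually exclusive outcomes)
p_either = 2 * p_ten_heads
print(round(p_either, 4))  # 0.002, just under 2 in 1000
```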
In a journal article, we would likely see this probability expressed as a P value, such as P = .002 (if the value was rounded to the third decimal). What is the precise meaning of this P value? If the coin were unbiased (ie, if the null hypothesis were true) and we were to repeat the experiment of the 10 coin tosses many times, by chance alone, we would get either 10 heads or 10 tails in approximately 2 per 1000 of these repetitions.
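The frequency interpretation of this P value can also be checked by simulating the many repetitions directly. In the sketch below, the seed and the number of repetitions are arbitrary choices of ours, not values from the text:

```python
import random

random.seed(2024)  # fixed seed so the run is reproducible


def heads_in_10_tosses():
    """Count heads in 10 tosses of an unbiased (fair) coin."""
    return sum(random.random() < 0.5 for _ in range(10))


reps = 100_000
# Count repetitions that yield the extreme results: all heads or all tails
extreme = sum(heads_in_10_tosses() in (0, 10) for _ in range(reps))
print(extreme / reps)  # should be close to 0.002, ie, about 2 per 1000
```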
The framework of hypothesis testing involves a yes/no decision. Are we willing to reject the null hypothesis? This choice involves a decision about how much risk or chance of making a type I error we are willing to accept. The reasoning implies a threshold value that demarcates a boundary. On one side of this boundary, we are unwilling to reject the null hypothesis; on the other side, we are ready to conclude that chance is no longer a plausible explanation for the results. The threshold chosen is the α level mentioned above.
To return to the example of 10 consecutive heads or tails, most people observing this result, which, it turns out, would be expected to occur by chance alone less than twice per 1000 experiments, would be ready to reject the null hypothesis. What if we repeat the thought experiment, and this time we obtain 9 tails and 1 head? Once again, it is unlikely that the result is due to the play of chance alone. As shown in Figure 12.1-1 (which you will recognize from Chapter 6; the theoretical distribution of results on an infinite number of repetitions of the 10-coin-flip experiment when the coin is unbiased), the P value is .02, or 2 in 100. That is, if the coin were unbiased and the null hypothesis were true, we would expect results as extreme as, or more extreme than, those observed (ie, 10 heads or 10 tails, 9 heads and 1 tail, or 9 tails and 1 head) to occur by chance alone 2 times per 100 repetitions of the experiment.
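The P value of .02 quoted above is the sum of the binomial probabilities of the four equally or more extreme outcomes (0, 1, 9, or 10 heads). A brief illustrative calculation:

```python
from math import comb

n = 10
# Probability of exactly k heads in 10 tosses of a fair coin: C(10, k) / 2**10
def p_heads(k):
    return comb(n, k) / 2 ** n

# Outcomes as extreme as or more extreme than a 9/1 split: 0, 1, 9, or 10 heads
p_value = sum(p_heads(k) for k in (0, 1, 9, 10))
print(round(p_value, 2))  # 0.02, ie, 2 in 100
```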
Figure 12.1-1 Theoretical Distribution of Results of an Infinite Number of Repetitions of 10 Tosses of an Unbiased Coin
Where we set this threshold or boundary is a matter of judgment. Statistical convention suggests a threshold that demarcates the plausible from the implausible at 5 times per 100, which is represented by an α value of .05. Once we have chosen our threshold (of α = .05, for example), we call a result that falls beyond this boundary (ie, the result gives P ≤ .05) statistically significant. The meaning of statistically significant, therefore, is “sufficiently unlikely to be due to chance alone that we are ready to reject the null hypothesis.”
Statistically significant findings occasionally happen by chance, and it is only convention that makes the .05 threshold sacrosanct. Suppose we set α = .01, so we reject the null hypothesis if P ≤ .01. A finding of P ≤ .01 will happen, simply by chance, 1% of the time if the null hypothesis is true; this means we would reject a true null hypothesis 1% of the time. If we wish to be more conservative (more certain, when we reject the null hypothesis, that chance cannot explain the observed difference), we might well choose a 1% threshold.
Let us repeat our experiment twice more, both times with a new coin. On the first repetition, we obtain 8 heads and 2 tails. Calculation of the P value associated with an 8/2 split tells us that, if the coin were unbiased, results as extreme as or more extreme than 8/2 (or 2/8) would occur solely as a result of the play of chance 11 times per 100 (P = .11) (Figure 12.1-1). We have crossed to the other side of the conventional boundary between what is plausible and what is implausible. If we accept the convention, the results are not statistically significant and we will not reject the null hypothesis.
On our final repetition of the experiment, we obtain 7 tails and 3 heads. Experience tells us that such a result, although not the most common, would not be unusual even if the coin were unbiased. The P value confirms our intuition: Results as extreme as, or more extreme than, this 7/3 split would occur under the null hypothesis 34 times per 100 (P = .34) (Figure 12.1-1). Again, we will not reject the null hypothesis.
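All four coin-toss experiments can be summarized with one small two-sided P value function. The binomial reasoning is taken from the text; the function itself is an illustrative sketch (the name and structure are ours):

```python
from math import comb


def two_sided_p(k_heads, n=10):
    """Two-sided P value: probability of a heads/tails split as extreme as or
    more extreme than k_heads / (n - k_heads) when the coin is unbiased."""
    extreme = max(k_heads, n - k_heads)
    # One-tail probability of at least `extreme` heads, then doubled to
    # count the equally implausible splits in the opposite direction
    one_tail = sum(comb(n, k) for k in range(extreme, n + 1)) / 2 ** n
    return 2 * one_tail


print(round(two_sided_p(10), 3))  # 0.002  (10/0 split)
print(round(two_sided_p(9), 2))   # 0.02   (9/1 split)
print(round(two_sided_p(8), 2))   # 0.11   (8/2 split: not significant)
print(round(two_sided_p(7), 2))   # 0.34   (7/3 split: not significant)
```

Only the 10/0 and 9/1 splits fall beyond the conventional α = .05 boundary.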
When investigators compare 2 treatments, the question they ask is, how likely is it that the observed difference, or a larger one, could be a result of chance alone? If we accept the conventional boundary or threshold (P ≤ .05), we will reject the null hypothesis and conclude that the treatment has some effect when the answer to this question is that repetitions of the experiment would yield differences as extreme as or more extreme than those we have observed less than 5% of the time. The 5% refers to both the observed difference and an equally large difference in the opposite direction because both results would be equally implausible (ie, this is a 2-sided significance test). Investigators sometimes conduct 1-sided significance tests where they consider differences in only 1 direction.
Let us return to the example of the randomized trial in which investigators compared enalapril and the combination of hydralazine and nitrates in 804 men with heart failure. The results of this study illustrate hypothesis testing using a dichotomous (yes/no) outcome, in this case, mortality.2 During the follow-up period, which ranged from 6 months to 5.7 years, 132 of 403 patients (33%) assigned to receive enalapril died, as did 153 of 401 (38%) of those assigned to receive hydralazine and nitrates. Application of a statistical test that compares proportions (the χ2 test) reveals that if there were actually no underlying difference in mortality between the 2 groups, differences as large as or larger than those actually observed would be expected 11 times per 100 (P = .11). Using the hypothesis-testing framework and the conventional threshold of P ≤ .05, we would conclude that we cannot reject the null hypothesis and that the difference observed is compatible with chance.
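The χ2 computation can be sketched using only the Python standard library. The counts are the trial results as reported above; we omit the continuity correction, since the quoted P = .11 appears to correspond to the uncorrected Pearson statistic (an assumption on our part):

```python
from math import erfc, sqrt

# Deaths and total patients in each arm, as reported in the trial
deaths = [132, 153]  # enalapril, hydralazine-nitrates
totals = [403, 401]

# 2x2 table of observed counts: [died, survived] for each arm
observed = [[d, t - d] for d, t in zip(deaths, totals)]

# Expected counts under the null hypothesis of no mortality difference:
# row total x column total / grand total
grand = sum(totals)
col_totals = [sum(col) for col in zip(*observed)]
expected = [[t * c / grand for c in col_totals] for t in totals]

# Pearson chi-square statistic (no continuity correction)
chi2 = sum((o - e) ** 2 / e
           for row_o, row_e in zip(observed, expected)
           for o, e in zip(row_o, row_e))

# For 1 degree of freedom, P(X >= chi2) = erfc(sqrt(chi2 / 2))
p_value = erfc(sqrt(chi2 / 2))
print(round(p_value, 2))  # 0.11, ie, 11 times per 100
```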