This JAMA Guide to Statistics and Methods explains when adjustment for multiple comparisons is appropriate and outlines the limitations, interpretations, and cautions to be aware of when using these adjustments.
Problems can arise when researchers try to assess the statistical significance of more than 1 test in a study. In a single test, statistical significance is often determined based on an observed effect or finding that is unlikely (<5%) to occur due to chance alone. When more than 1 comparison is made, the chance of falsely detecting a nonexistent effect increases. This is known as the problem of multiple comparisons (MCs), and adjustments can be made in statistical testing to account for this.1
Saitz et al2 reported results of a randomized trial evaluating the efficacy of 2 brief counseling interventions (ie, a brief negotiated interview and an adaptation of a motivational interview, referred to as MOTIV) in reducing drug use in primary care patients compared with no intervention. Because MCs were made, the authors adjusted how they determined statistical significance. In this chapter, we explain why adjustment for MCs is appropriate in this study and point out the limitations, interpretations, and cautions when using these adjustments.
Why Are Multiple Comparison Procedures Used?
When a single statistical test is performed at the 5% significance level, there is a 5% chance of falsely concluding that a supposed effect exists when in fact there is none. This is known as making a false discovery or having a false-positive inference. The significance level represents the risk of making a false discovery in an individual test, denoted as the individual error rate (IER). If 20 such tests are conducted, there is a 5% chance of making a false-positive inference with each test so that, on average, there will be 1 false discovery in the 20 tests.
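The expected number of false discoveries described above follows directly from multiplying the number of tests by the individual error rate. A minimal sketch of this arithmetic (variable names are illustrative, not from the study):

```python
# Expected number of false discoveries when K independent tests are each
# run at an individual error rate (IER) of 5%.
K = 20      # number of tests in the family
ier = 0.05  # individual error rate (significance level per test)

expected_false_discoveries = K * ier
print(expected_false_discoveries)  # → 1.0 (on average, 1 false discovery in 20 tests)
```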
Another way to view this is in terms of probabilities. If the probability of making a false conclusion (ie, false discovery) is 5% for a single test in which the effect does not exist, then 95% of the time, the test will arrive at the correct conclusion (ie, insignificant effect). With 2 such tests, the probability of finding an insignificant effect with the first test is 95%, as it is for the second. However, the probability of finding insignificant effects in both the first and the second test is 0.95 × 0.95, or 90%. With 20 such tests, the probability that all 20 tests correctly show insignificance is (0.95)^20, or 36%. So there is a 100% − 36%, or 64%, chance of at least 1 false-positive test occurring among the 20 tests. Because this probability quantifies the risk of making any false-positive inference by a group, or family, of tests, it is referred to as the family-wise error rate (FWER). The FWER generally increases as the number of tests performed increases. For example, assuming IER = 5% and denoting the number of multiple tests performed as K, then for K = 2 independent tests, FWER = 1 − (0.95)^2 = 10%; for K...
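The FWER calculation described above can be sketched as a short computation; the function name is illustrative, and the formula 1 − (1 − IER)^K assumes independent tests, as in the text:

```python
# Family-wise error rate (FWER) for K independent tests,
# each performed at an individual error rate (IER):
# FWER = 1 - (1 - IER)^K
def fwer(k, ier=0.05):
    return 1 - (1 - ier) ** k

print(round(fwer(1), 2))   # → 0.05 (a single test: FWER equals the IER)
print(round(fwer(2), 2))   # → 0.1  (2 tests: about 10%)
print(round(fwer(20), 2))  # → 0.64 (20 tests: about 64%)
```

As the printed values show, the FWER grows quickly with the number of tests, which is why MC adjustments are needed.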