Will the Reproducibility of the Test Result and Its Interpretation Be Satisfactory in My Clinical Setting?
The value of any test depends on its ability to yield the same result when reapplied to stable patients. Poor reproducibility can result from problems with the test itself (eg, variations in reagents in radioimmunoassay kits for determining hormone levels) or from its interpretation (eg, the extent of ST-segment elevation on an electrocardiogram). You can easily confirm this when you recall the clinical disagreements that arise when you and one or more colleagues examine the same electrocardiogram, ultrasonogram, or CT (even when all of you are experts).
Ideally, an article about a diagnostic test will address the reproducibility of the test results using a measure that corrects for agreement by chance (see Chapter 19.3, Measuring Agreement Beyond Chance), especially for issues that involve interpretation or judgment.
If the reported reproducibility of a test in the study setting is mediocre and disagreement between observers is common, and yet the test still discriminates well between those with and without the target condition, the test is likely to be very useful. Under these circumstances, there is a good chance that the test can be readily applied to your clinical setting.
If reproducibility of a diagnostic test is very high, either the test is simple and unambiguous or those interpreting the results are highly skilled. If the latter applies, less skilled interpreters in your own clinical setting may not do as well. You will either need to obtain appropriate training (or ensure that those interpreting the test in your setting have that training) or look for an easier and more robust test.
Are the Study Results Applicable to the Patients in My Practice?
Test properties may change with a different mix of disease severity or with a different distribution of competing conditions. When patients with the target disorder all have severe disease, LRs will move away from a value of 1.0 (ie, sensitivity increases). If patients are all mildly affected, LRs move toward a value of 1.0 (ie, sensitivity decreases). If patients without the target disorder have competing conditions that mimic the test results seen in patients who have the target disorder, the LRs will move closer to 1.0, and the test will appear less useful (ie, specificity decreases). In a different clinical setting in which fewer of the disease-free patients have these competing conditions, the LRs will move away from 1.0, and the test will appear more useful (ie, specificity increases). Differing prevalence in your setting may alert you to the possibility that the spectrum of target-positive and target-negative patients could differ in your practice.15
Investigators have reported the phenomenon of differing test properties in different subpopulations for exercise electrocardiography in the diagnosis of coronary artery disease. The more severe the coronary artery disease, the larger the LRs of abnormal exercise electrocardiography results for angiographic narrowing of the coronary arteries.16 Another example comes from the diagnosis of venous thromboembolism, where compression ultrasonography for proximal-vein thrombosis has proved more accurate in symptomatic outpatients than in asymptomatic postoperative patients.17
Sometimes, a test fails in just the patients one hopes it will best serve. The LR of a negative dipstick test result for the rapid diagnosis of urinary tract infection is approximately 0.2 in patients with clear symptoms and thus a high probability of urinary tract infection but is higher than 0.5 in those with low probability,18 rendering it of little help in ruling out infection in the latter situation.
If you practice in a setting similar to that of the study and if the patient under consideration meets all of the study eligibility criteria, you can be confident that the results are applicable. If not, you must make a judgment. As with therapeutic interventions, you should ask whether there are compelling reasons why the results should not be applied to the patients in your practice, either because of the severity of disease in those patients or because the mix of competing conditions is so different that generalization is unwarranted. You may resolve the issue of generalizability if you can find a systematic review that summarizes the results of a number of studies.19
Will the Test Results Change My Management Strategy?
It is useful, when making and communicating management decisions, to link them explicitly to the probability of the target disorder. For any target disorder there are probabilities below which a clinician would dismiss a diagnosis and order no further tests (ie, the test threshold). Similarly, there are probabilities above which a clinician would consider the diagnosis confirmed and would stop testing and initiate treatment (ie, the treatment threshold). When the probability of the target disorder lies between the test and treatment thresholds, further testing is mandated (see Chapter 16, The Process of Diagnosis).
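The threshold logic described above can be sketched as a simple decision function. The threshold values below are hypothetical placeholders; in practice, they depend on the dangers of the disease, the risks of the test, and the benefits and harms of treatment.

```python
# Sketch of test/treatment threshold logic (hypothetical threshold values;
# actual thresholds vary with the disease, test risks, and treatment trade-offs).

def management_decision(prob, test_threshold=0.10, treatment_threshold=0.70):
    """Map a probability of the target disorder to a management strategy."""
    if prob < test_threshold:
        return "dismiss diagnosis; no further testing"
    if prob > treatment_threshold:
        return "diagnosis confirmed; initiate treatment"
    return "order further tests"

print(management_decision(0.05))  # below the test threshold
print(management_decision(0.40))  # between the two thresholds
print(management_decision(0.90))  # above the treatment threshold
```

The point of the sketch is only that testing earns its keep when a result can move the probability across one of these two boundaries.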
If most patients have test results with LRs near 1.0, test results will seldom move us across the test or treatment threshold. Thus, the usefulness of a diagnostic test is strongly influenced by the proportion of patients suspected of having the target disorder whose test results have very high or very low LRs. Among the patients suspected of having dementia, a review of Table 18-1 allows us to determine the proportion of patients with extreme results (LR >10 or <0.1). The proportion can be calculated as (105 + 2 + 64 + 2 + 11 + 163)/(345 + 306), or 347/651 = 53%. The SIS is likely to move the posttest probability in a decisive manner in half of the patients examined for suspected dementia, a very impressive proportion and better than for most of our diagnostic tests.
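The arithmetic above can be verified directly; the counts are those quoted from Table 18-1.

```python
# Proportion of patients in Table 18-1 whose SIS results carry extreme
# likelihood ratios (LR > 10 or LR < 0.1), using the counts quoted in the text.
extreme_results = 105 + 2 + 64 + 2 + 11 + 163   # patients with extreme LRs
all_patients = 345 + 306                         # 345 with, 306 without dementia
proportion = extreme_results / all_patients
print(f"{extreme_results}/{all_patients} = {proportion:.0%}")  # 347/651 = 53%
```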
A final comment has to do with the use of sequential tests. The LR approach fits in particularly well in thinking about the diagnostic pathway. Each item of history—or each finding on physical examination—represents a diagnostic test in itself. We can use one test to get a certain posttest probability that can be further increased or decreased by using another, subsequent test. In general, we can also use laboratory tests or imaging procedures in the same way. If 2 tests are very closely related, however, application of the second test may provide little or no additional information, and the sequential application of LRs will yield misleading results. For example, once one has the results of the most powerful laboratory test for iron deficiency, serum ferritin, additional tests, such as serum iron or transferrin saturation, add no further useful information.20 Once one has conducted an SIS, additional information from the MMSE is likely to be minimal.
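The sequential use of LRs described above amounts to the odds form of Bayes theorem: convert the pretest probability to odds, multiply by each test's LR in turn, and convert back. The numbers in this sketch are illustrative, and the multiplication is valid only when the tests are independent, which is exactly the assumption that fails for closely related tests such as ferritin followed by transferrin saturation.

```python
# Sketch of applying likelihood ratios sequentially via the odds form of
# Bayes theorem (illustrative numbers; requires the tests to be independent).

def posttest_probability(pretest_prob, *likelihood_ratios):
    odds = pretest_prob / (1 - pretest_prob)    # probability -> odds
    for lr in likelihood_ratios:
        odds *= lr                              # each independent result updates the odds
    return odds / (1 + odds)                    # odds -> probability

# Two independent findings, each with LR 3, applied to a 20% pretest probability:
print(round(posttest_probability(0.20, 3, 3), 2))  # 0.69
```

If the second test largely duplicates the first, multiplying both LRs overstates the update, which is the pitfall the paragraph above warns against.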
Clinical prediction rules deal with the lack of independence of a series of tests and provide the clinician with a way of combining their results (see Chapter 19.4, Clinical Prediction Rules). For instance, one could use a rule that incorporates leg symptoms, heart rate, hemoptysis, and other aspects of the history and physical examination to accurately classify patients with suspected pulmonary embolism as having a high, moderate, or low probability of the diagnosis.21
Will Patients Be Better Off as a Result of the Test?
The ultimate criterion for the usefulness of a diagnostic test is whether the benefits that accrue to patients are greater than the associated risks.22 How can we establish the benefits and risks of applying a diagnostic test? The answer lies in thinking of a diagnostic test as a therapeutic maneuver (see Chapter 7, Therapy [Randomized Trials]). Establishing whether a test does more good than harm will involve (1) randomizing patients to a diagnostic strategy that includes the test under investigation and a management schedule linked to it, or to one in which the test is not available, and (2) following up patients in both groups forward in time to determine the frequency of patient-important outcomes.
When is demonstrating accuracy sufficient to mandate the use of a test, and when does one require a randomized clinical trial? The value of an accurate test will be undisputed when the target disorder is dangerous if left undiagnosed, the test has acceptable risks, and effective treatment exists. This is the case for CT angiography for suspected pulmonary embolism. A clearly positive result, or a normal or near-normal result, on CT angiography may well eliminate the need for further investigation and may result in anticoagulant agents being appropriately given or appropriately withheld (with either course of action having a substantial positive influence on patient outcome).
Sometimes, a test may be completely benign, represent a low resource investment, be evidently accurate, and clearly lead to useful changes in management. Such is the case for use of the SIS in patients with suspected dementia, when test results may dictate reassurance or extensive investigation and ultimately planning for a tragic deteriorating course.
In other clinical situations, tests may be accurate and management may even change as a result of their application, but their effect on patient outcome may be far less certain. Consider one of the issues we raised in our discussion of framing clinical questions (see Chapter 4, What Is the Question?). There, we considered a patient with apparently resectable non–small cell carcinoma of the lung and wondered whether the clinician should order a positron emission tomogram (PET)–CT and base further management on the results or use alternative diagnostic strategies. For this question, knowledge of the accuracy of PET-CT is insufficient. A randomized trial comparing PET-CT–directed management with an alternative strategy is warranted. Other examples include catheterization of the right side of the heart for critically ill patients with uncertain hemodynamic status and bronchoalveolar lavage for critically ill patients with possible pulmonary infection. For these tests, randomized trials have helped elucidate optimal management strategies.
CLINICAL SCENARIO RESOLUTION
Although the study itself does not report reproducibility, scoring of the SIS is simple and straightforward: you need only count the number of errors made on 6 questions. The SIS does not require any props or visual cues; it is therefore unobtrusive and easy to administer, taking only 1 to 2 minutes to complete (compared with 5 to 10 minutes for the MMSE). Although trained research staff administered the SIS in the study, the appendix of the article gives detailed, word-by-word instructions on how to administer it. You believe that you, too, can administer this scale reliably.
The patient in the clinical scenario is an older woman who was able to come to your clinic by herself but who no longer appeared as lucid as she used to be. The Alzheimer Disease Center cohort in the study we have been examining in this chapter consists of people suspected of having dementia by their caregivers and brought directly to a tertiary care center. Their test characteristics were reported to be similar to those observed in the general population cohort, that is, in a sample with less severe presentations. You decide that there is no compelling reason that the study results would not apply to your patient.
You invite your patient back to the office for a follow-up visit and administer the SIS. The result is a score of 4, which, given your pretest probability of 20%, increases the probability to more than 60%. After hearing that you are concerned about her memory and possibly about her function, she agrees to a referral to a geriatrician for more extensive investigation.
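The update in this scenario can be checked with the odds form of Bayes theorem. The LR of 6 used below is a hypothetical value inferred from the stated pretest (20%) and posttest (about 60%) probabilities, not a figure quoted from Table 18-1.

```python
# Checking the scenario's update: pretest probability 20%, SIS score of 4.
# The LR of 6 is an assumed illustrative value consistent with the stated
# posttest probability, not a number taken from the study.
pretest = 0.20
lr = 6.0
pretest_odds = pretest / (1 - pretest)          # 0.20/0.80 = 0.25
posttest_odds = pretest_odds * lr               # 0.25 * 6 = 1.5
posttest = posttest_odds / (1 + posttest_odds)  # 1.5/2.5 = 0.6
print(f"posttest probability = {posttest:.0%}")  # posttest probability = 60%
```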