+++
Updated Summary on Precision and Accuracy of the Clinical Examination
+++
What Is There to Update?
++
Each of the updates in The Rational Clinical Examination systematically evaluates the newly published literature on the topic, except this one. Updating the Primer requires a different approach to fulfill the original promise that the series would address methodologic concerns beyond precision and accuracy. What we will do is take a very utilitarian approach, driven by the topic updates themselves. The updates and our own lectures on the rational clinical examination unearthed topics that we need to address. Rather than conducting a systematic review of quality measures, sensitivity, specificity, likelihood ratios (LRs), and a plethora of related topics, we instead provide background information and answers to questions that our own authors required when preparing their reviews and updates.
++
Of course, the basic premise for diagnosis has not changed since the Primer (or since Thomas Bayes figured it out more than 2 centuries ago):
++
Prior odds × LR = Posterior odds
++
For the clinical examination, this means we (1) use information about the probability of a target disorder (frequently taken as the prevalence, which is then converted to the prior odds) and then (2) apply the results of symptoms or signs (in the form of an LR). After applying the LR associated with various symptoms and signs, we get the posterior odds of disease. The probability of disease increases when a clinical finding is more likely in a patient with the target disorder (reflected by an LR > 1). The probability of disease decreases when a clinical finding is more likely to occur in a patient without the target disorder (reflected by an LR < 1). The resultant probability becomes the “posterior” probability because the prior probability is established first and then modified with information from the medical history and physical examination quantitatively expressed in the form of the LR.∗ Keeping the simple equation in mind focuses the goal of The Rational Clinical Examination series articles on providing all the data needed to solve the posterior odds equation.
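++
To make the arithmetic concrete, here is a minimal sketch in Python (the pretest probability and LR below are our own illustrative values, not data from any study): it converts a pretest probability to prior odds, multiplies by an LR, and converts the posterior odds back to a probability.

def probability_to_odds(p):
    return p / (1.0 - p)

def odds_to_probability(odds):
    return odds / (1.0 + odds)

def posttest_probability(pretest_prob, lr):
    # Posterior odds = prior odds x LR, then convert back to a probability
    return odds_to_probability(probability_to_odds(pretest_prob) * lr)

# Example: pretest probability 25%, finding present with LR+ = 4
print(round(posttest_probability(0.25, 4.0), 2))  # 0.57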
++
In the Primer, we emphasized the role of the univariate LR for clinicians. The term univariate means the results for 1 finding, without regard to the findings of other historical or clinical features. We chose this route for a variety of reasons, the most important being the fundamental property that allows clinicians to apply the values to individual patients in a consistent pattern. LRs always convey the same information—they quantify the change in odds of disease for a particular test result. By tradition for dichotomous test results, we call the LR associated with a positive test the LR+ (positive LR), whereas the LR associated with a negative test is the LR– (negative LR). In either case, the actual LR value is related to the change in likelihood that the patient has the disease of interest. Thus, there can be no confusion, as is sometimes the case when physicians become overwhelmed with how to translate positive predictive value, true-positive rate, false-positive rate, negative predictive value, true-negative rate, or false-negative rate into a change in the likelihood of disease for an individual patient.
++
Many clinicians feel more comfortable with the terms sensitivity and specificity. However, these values in and of themselves have little application to the clinical setting. Sensitivity and specificity are values that apply to a screening test result before we know whether the patient has the target disorder. So which result do we use at the bedside? Sensitivity applies only to patients with disease, whereas specificity applies only to patients without disease. Because we use screening tests precisely because we do not know about the presence or absence of disease, how do we decide whether the value of sensitivity or the value of specificity applies to our patient? The simple answer is that we do not know. If we do know which result applies to our patient, then, by definition, we know the disease status, and the results of screening tests lose relevance. The true value of an LR comes from its mathematical definition that combines the values of sensitivity and specificity, making it applicable to each patient before we know whether disease is present or absent.
++
When evaluated in combination, the sensitivity and specificity are the building blocks of the LR for tests that are dichotomous (eg, “positive” or “negative,” “present” or “absent”). The LR for a positive result is sensitivity/(1 – specificity), whereas the LR for a negative result is (1 – sensitivity)/specificity. But what happens when a screening test has more than 2 outcomes (Table 1-3)?
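++
As a small sketch of these 2 definitions (the sensitivity and specificity values are arbitrary, chosen only for illustration):

def lr_positive(sensitivity, specificity):
    return sensitivity / (1.0 - specificity)

def lr_negative(sensitivity, specificity):
    return (1.0 - sensitivity) / specificity

# Example: sensitivity 0.80, specificity 0.90
print(round(lr_positive(0.80, 0.90), 2))  # 8.0
print(round(lr_negative(0.80, 0.90), 2))  # 0.22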
++
Traditional laboratory tests are measured on continuous scales, where the result intervals have a mathematical meaning, but the clinician could not possibly know the LR for every outcome. A clinical laboratory reports the raw result, along with a designator for whether the result is “high,” “normal,” or “low.” The report takes the raw value and transforms it to an ordinal scale, making it easier for clinicians to review a large amount of data. When there are more than 2 outcomes of a screening test, sensitivity and specificity cannot be directly calculated, so the clinician must rely on LRs that are usually given for ordinal results.
++
A simple quantitative explanation helps explain why the sensitivity and specificity lose meaning when there are more than 2 screening test results. The presence of a third heart sound (S3) suggests left ventricular (LV) systolic dysfunction. Sometimes, the clinician is uncertain whether the sound is present. To illustrate this point, we can make up some data that might apply to the clinician's interpretation of the S3 compared with a reference standard echocardiogram that quantified the LV function (Table 1-4).
++
Table 1-4. Hypothetical Interpretation of an S3 Compared With an Echocardiographic Reference Standard
S3 Interpretation    LV Systolic Dysfunction    Normal LV Function
Present                      30                          5
Uncertain                     5                         10
Absent                       10                         50
++
We can describe the sensitivity of the S3 as 30/(30 + 5 + 10) = 0.67 and the specificity as 50/(5 + 10 + 50) = 0.77. Although this may seem straightforward, closer inspection reveals some problems with that interpretation. First, the treatment of the “uncertain” results lacks consistency. For calculating the sensitivity, we “count” an uncertain S3 as if it were actually absent. But the clinical reality was that the physician could not state with certainty whether it was present or absent. When we calculate the specificity, we do the exact opposite and count the “uncertain” outcomes as if they were “positive.” How can one “uncertain” finding be considered “negative” for sensitivity but “positive” for specificity? This dual treatment creates problems that become even more pronounced as the number of results increases beyond 3 outcomes.
++
Second, even if we believed that the sensitivity and specificity captured the meaning of an S3 that is either present or absent, how do we describe the results for “uncertain?” Sensitivity provides an inadequate definition because sensitivity is the value that describes the percentage of patients with an abnormal result among all those with disease and “uncertain” is neither abnormal nor normal. A similar argument applies to the specificity, so that neither sensitivity nor specificity offers a reasonable description of the value of an uncertain result. The constructs just do not apply to a test result that is neither completely normal nor completely abnormal. The LR provides a way to describe not only the positive and negative results but also those that are uncertain.
++
At a fundamental level, the LR takes a given screening test result and for that outcome tells us the ratio of those with disease to those without disease. So once we know which row of the table a patient belongs in according to their test result (S3 present, S3 uncertain, or S3 absent), the LR tells us the likelihood that the patient will come from the first column vs the second column. We can calculate an LR for every row of an r × 2 table (where r represents the number of rows) (Table 1-5).
++
Table 1-5. LRs for Each Level of the S3 Interpretation
S3 Interpretation    LV Systolic Dysfunction    Normal LV Function    LR
Present                      30                          5           8.7
Uncertain                     5                         10           0.72
Absent                       10                         50           0.29
++
Thus, when we hear an S3 in the patient, we apply the value 8.7, which makes LV systolic dysfunction much more likely. When we feel confident that an S3 is absent, the likelihood of LV systolic dysfunction decreases. However, when we are “uncertain,” the LR we apply is 0.72, a value that approaches 1 and suggests that the “uncertain” result should not have a large effect on our estimate of the likelihood of disease. Oftentimes, it is useful to know that “uncertain” really means “not much information” with an LR approaching 1.
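++
A short sketch using the Table 1-4 counts shows how each row of the r × 2 table yields its own LR: the proportion of diseased patients with that result divided by the proportion of nondiseased patients with that result.

# Rows: S3 present, S3 uncertain, S3 absent
# Columns: (LV systolic dysfunction, normal LV function)
table = {"present": (30, 5), "uncertain": (5, 10), "absent": (10, 50)}

n_disease = sum(d for d, _ in table.values())    # 45
n_normal = sum(nd for _, nd in table.values())   # 65

for level, (d, nd) in table.items():
    lr = (d / n_disease) / (nd / n_normal)
    print(f"{level}: LR = {lr:.2f}")
# present: LR = 8.67 (8.7 after rounding); uncertain: 0.72; absent: 0.29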
+++
Isn’t All the Information in the Patient's Medical History?
++
We now need to address a common belief that the physical examination is not particularly helpful and, at best, only confirms the historical findings and symptoms. Oftentimes, a clinician takes a patient's medical history and makes a diagnosis before performing a physical examination. This process, although sometimes successful, leads to the inference that the physical examination was unnecessary. For a simple reason, the inference is not true: the physical examination begins from the moment the clinician meets a patient and before the patient utters a word! We observe body language, the patient's gait, vital signs (eg, tachypnea), and physical deformities, and we judge the acuity of illness. These findings derived from visual observations may be hard to quantify (eg, a sense that the quiet, sullen patient might be depressed), although most clinicians recognize the huge amount of information they collect in the first few moments of a patient interaction. Because describing and measuring the influence of our overall observations is difficult, researchers often overlook the clinical gestalt.
++
One way of isolating the clinical gestalt is to evaluate whether we can make a diagnosis in the absence of directly observing a patient. A symptom checklist (but not the patient's medical history) can be obtained through a completed patient self-administered questionnaire. Sometimes, we can infer a diagnosis from such questionnaires with our impression uncontaminated by physical findings, but the diagnosis typically requires confirmation obtained through a patient interview or physical examination. The ability to disentangle the history from the physical examination findings is often an illusion, leading to the inference that the patient's medical history (symptoms) dominates the clinical diagnostic process over the physical examination (signs).
++
The most important part of the clinical examination and the resulting diagnosis is typically not the symptoms or signs—it is the pretest probability, transformed to the prior odds, that dominates the equation. Simply put, if a condition is highly unlikely (or highly likely), then the presence or absence of any additional findings will typically not change things. As a corollary, when the probability of a target condition is less certain, the signs and symptoms have a potentially bigger effect on the prior probability.
++
So, where does the pretest probability come from? We establish the pretest probability in the course of our clinical examination, and that creates a bit of a problem (for both researchers and clinicians). In other words, as we learn more about the patient's medical history, symptoms, and signs, we orient our approach to a narrower spectrum of disease possibilities. This approach requires that we “waste” a few findings to establish the pretest probability. For example, most patients we examine do not have sinusitis, and we do not ask questions about symptoms related to sinusitis, nor do we transilluminate the sinuses during the course of a clinical examination unless we have a suspicion of the disease. We might constrain our evaluation for sinusitis to patients who claim nasal stuffiness, nasal discharge, or maxillary facial discomfort or who come right out and state, “I think I have a sinus infection.” Each of these findings would prompt an appropriate evaluation for sinusitis and in a research study create the “entrance criteria.” Thus, when we refer to the pretest probability of sinusitis, we most likely are referring to the prevalence of sinusitis among patients with any of those findings rather than to the prevalence of sinusitis among all patients in general. This pretest probability becomes the value we use in the equation and the anchor for applying other symptoms and signs we uncover during our clinical examination.
++
The establishment of the pretest probability is the problem most learners fear, representing their main “excuse” for not using the concepts in The Rational Clinical Examination. Frequently, learners claim “lack of experience.” When existing studies adequately describe their study population, the pretest probability is not difficult to understand. Experience becomes more valuable when the literature is less clear, and perhaps this is part of the “art” of the clinical examination. Trainees may be quite good at estimating the pretest probability of common conditions. However, both trainees and experienced clinicians tend to overestimate the prior probabilities of less common diseases. Trainees express discomfort when estimating the prior probability because (1) they do not practice quantifying and then validating their clinical impression and (2) they may recall their own cases in which they pursued an unlikely diagnosis for a seemingly “classic” presentation, only to find that the disease was not present. Although the second reason emanates from overlooking the importance of prior probability, it requires a reassessment of the role of symptoms and signs.
++
The presence of a “good” symptom or sign creates a large effect on the probability, convincing the clinician that the target condition is much more likely to be present than the prior probability suggests. The suggestion that some prespecified LR threshold defines a good clinical finding for all diseases is a myth so persistent that it represents a medical urban legend. Some researchers and clinicians define a “good” test result as one associated with an LR greater than 10 or an LR less than 0.1, but these results do not have intrinsic properties that are the sine qua non of high value. For example, a pretest probability of 10% and a positive test with an LR = 10 generate a posttest probability of 53%; this is a big increase in the probability of disease but hardly an increase that clinches the diagnosis. Furthermore, a similar posttest probability (56%) follows from a pretest probability of 20% and a positive test with an LR = 5. Thus, although positive test results are increasingly powerful as the LR increases and negative results are increasingly valuable as the LR decreases, the efficiency of the finding in making a diagnosis depends on the pretest probability.
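++
A quick sketch verifies the 2 scenarios just described; the helper simply repeats the posterior odds arithmetic shown earlier.

def posttest_probability(pretest_prob, lr):
    odds = pretest_prob / (1.0 - pretest_prob) * lr
    return odds / (1.0 + odds)

print(round(posttest_probability(0.10, 10.0), 2))  # 0.53
print(round(posttest_probability(0.20, 5.0), 2))   # 0.56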
++
When considering that multiple symptoms and signs are interpreted together, individual findings with much less impressive LRs alone (eg, LR+, 2-5; or LR–, 0.25-0.50) could prove useful when used in combination. If no LR threshold automatically qualifies a result as good, is there a way to compare the efficiency of different clinical findings?
++
A positive clinical finding with the highest LR+ or a negative finding with the lowest LR– will always have the greatest effect on posttest probability. Unfortunately, clinicians discover that a list of symptoms and signs for an individual patient sometimes simultaneously yields outcomes both suggesting (positive results) and pointing away from (negative findings) a target disorder. There is a way, though, to make sense of this. Rank ordering the LR+ associated with each result, along with the reciprocal of the LR– (1/LR–), reveals the single “best” clinical finding for a target condition. The finding with the highest LR+ or 1/LR– is the single best symptom or sign result. A single symptom or sign may be useful when present (high LR+) or absent (small LR–). Unfortunately, most symptoms and signs will not produce both the best finding when positive and the best finding when negative. For example, a clinical sign may have a low LR– when negative, whereas its positive result may have an LR+ that approaches 1. Creating a mental list of LR+ and 1/LR– values for a variety of symptoms and signs is not easy. Some clinicians want to identify the single finding that overall is the most likely to give them the right answer (ie, positive when the patient has disease and negative when the patient is not affected).
++
The diagnostic odds ratio (DOR) creates a single measure of accuracy that tells us which symptom or sign is most likely to correctly classify a patient as having the target disorder or not.1 The DOR is not difficult to calculate, as the DOR = LR+/LR–. The more accurate the symptom or sign, the higher the DOR. So when faced with a table of data on many clinical findings in which none distinguishes itself as the overwhelming favorite, the clinician should choose the finding with the highest DOR. Unfortunately, the DOR cannot be used like the LR for estimating the probability of a diagnosis, but it can help us choose the symptoms and signs of higher utility so that we can ignore those of lesser value. At this point, the skeptical reader might accept that there is a method for identifying better symptoms and signs in terms of their overall measurement properties (through the DOR) and better results applicable to individual patients (through the LR). However, a remaining question might be, How confident can I be that the symptoms and signs I think are the best really are the best?
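++
A brief sketch (the LR+ and LR– values for findings A, B, and C are invented) shows how ranking by DOR can reorder findings; note that C earns the highest DOR through its very low LR–, even though its LR+ is modest.

# Hypothetical findings: (LR+, LR-)
findings = {"A": (8.0, 0.5), "B": (5.0, 0.4), "C": (2.0, 0.1)}

for name, (lr_pos, lr_neg) in sorted(
        findings.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True):
    print(f"{name}: DOR = {lr_pos / lr_neg:.1f}")
# C: DOR = 20.0, A: DOR = 16.0, B: DOR = 12.5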
+++
The Confidence Interval
++
When The Rational Clinical Examination series began, we presented likelihood results as single point values as if they completely described a clinical finding—they do not. Like all statistical parameters, an LR has an associated confidence interval (CI) that helps us decide whether the data are sufficient for us to infer usefulness. These CIs are important because they provide transparency. An optimistic LR suggests a promising clinical finding, but a broad CI dampens the enthusiasm by implying that a small sample size leaves considerable uncertainty. We are particularly cautious when the 95% CI includes 1 because LR values of 1 add no information to the pretest probability. Broad CIs around LR– values, even when they do not include 1, are a particular problem: because LR– values are constrained between 0 and 1, their broad CIs can seem less alarming than a broad CI around a high LR+. To compare negative with positive findings, the clinical reader can use the technique we described above (ie, taking the value 1/LR–) to put the breadth of their CIs on a comparable scale.
++
Some readers will be surprised that there are different methods that yield slight (but clinically unimportant) differences in CIs. We prefer the easiest computational method that also works well in spreadsheets.2 One situation presents problems for researchers and clinical readers alike: what do we do when one cell of the 2 × 2 table is 0? When any single cell has a 0 value (typically, the cells for false positives or false negatives), adding 0.5 to each cell of the 2 × 2 table allows calculation of useful CIs.3 A sensitivity of 100% yields an LR– of 0, with the LR upper 95% CI obtained after adding 0.5 to each cell. A specificity of 100% yields an LR+ that is not calculable (∞), so we report both the LR+ and CI obtained after adding 0.5 to each cell. Although high-quality studies report both the sensitivity and specificity of clinical findings, not all of them calculate the LRs for us. When researchers provide the actual numbers of affected and unaffected patients, together with the sensitivity and specificity, we can generate the LRs and 95% CIs. Although it is sometimes easy to calculate CIs from individual research reports, meta-analysis offers us an even better way of describing the LRs of findings evaluated across several studies.
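++
As an illustration of the computations described above, here is a sketch of one common approach, the log method, with a 0.5 continuity correction applied when any cell is 0; treat the details as our assumptions rather than the exact published algorithm.

import math

def lr_with_ci(tp, fn, fp, tn, z=1.96):
    # 95% CI for LR+ and LR- via the log method; adds 0.5 to every
    # cell if any cell is 0 so all quantities remain defined
    if 0 in (tp, fn, fp, tn):
        tp, fn, fp, tn = (x + 0.5 for x in (tp, fn, fp, tn))
    sens = tp / (tp + fn)
    spec = tn / (fp + tn)
    results = {}
    for label, lr, a, b in (
        ("LR+", sens / (1 - spec), tp, fp),
        ("LR-", (1 - sens) / spec, fn, tn),
    ):
        # Standard error of ln(LR) from the cell counts
        se = math.sqrt(1 / a - 1 / (tp + fn) + 1 / b - 1 / (fp + tn))
        lo, hi = (math.exp(math.log(lr) + s * z * se) for s in (-1, 1))
        results[label] = (round(lr, 2), round(lo, 2), round(hi, 2))
    return results

print(lr_with_ci(45, 5, 10, 90))  # invented counts for illustration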
++
Meta-analysis of symptoms and signs combines the results described across several studies and summarizes them to get a single estimate and CI. Although some statisticians have a high degree of skepticism about the appropriateness of combining LRs, we take the position that summarizing results provides clarity for clinicians that at the very least allows them to assimilate data and decide whether a symptom or sign is useful, useless, or uncertain.
++
An important part of meta-analysis requires the investigator to make decisions about the appropriateness of combining data. Although statisticians often suggest a purely statistical approach (ie, studies that have statistically heterogeneous results should not be combined), we take a more pragmatic approach similar to that espoused by other clinical diagnosticians.4 First, we evaluate whether the universe of published studies represents the universe of patients for whom the target condition might be considered. When the studies reflect the population of patients for whom the symptoms and signs apply, we prefer to try combining the LRs. On the other hand, when studies use various definitions of disease or different thresholds for the symptoms and signs, we cannot combine the results in a meaningful way. When we cannot combine the results, we present ranges for the LRs. Second, we consider our target audience to be clinical readers. For a condition that might have a very different LR among different populations of patients (eg, findings for appendicitis among children vs geriatric patients), we avoid combining results or we at least show how they vary. Part of this approach requires common sense, and part of this is statistical, in which we examine the outlier results to deduce whether there is anything recognizable that accounts for the variant LR findings. Third, we examine the actual results with their CIs after we combine the data. We always use random-effects measures for generating the LR and CIs, rather than the fixed-effects approach. Random-effects measures generate broader CIs than the fixed effects, providing at least some assurance that we are not overstating the importance and confidence in our findings. If a study is a statistical LR outlier, we still include it in the combined data if it does not make a large clinical difference in the LRs. We suggest that the clinician use clinical judgment when deciding whether 2 LRs yield clinically important differences in the posttest probability. For example, for a pretest probability of 30%, an LR of 5.4 produces a posttest probability of 70%, whereas an LR of 3.5 produces a posttest probability of 60%. These LRs “look” different, but a clinician might take a similar action for a posttest probability of 70% vs 60%. Thus, the 2 LRs could be statistically different but provide clinically similar results. We always provide the results from each study, and astute readers can decide from the point estimates and CIs whether they believe a finding is useful or useless.
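++
To show the shape of the random-effects calculation, here is a hedged sketch of DerSimonian-Laird pooling on the log-LR scale (the study estimates and variances below are invented):

import math

# (log LR, variance of log LR) for each hypothetical study
studies = [(math.log(8.0), 0.04), (math.log(2.5), 0.06), (math.log(4.6), 0.09)]

w = [1 / v for _, v in studies]                       # fixed-effect weights
y_bar = sum(wi * y for wi, (y, _) in zip(w, studies)) / sum(w)
q = sum(wi * (y - y_bar) ** 2 for wi, (y, _) in zip(w, studies))
df = len(studies) - 1
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (q - df) / c)                         # between-study variance

w_star = [1 / (v + tau2) for _, v in studies]         # random-effects weights
pooled = sum(wi * y for wi, (y, _) in zip(w_star, studies)) / sum(w_star)
se = 1 / math.sqrt(sum(w_star))
lo, hi = (math.exp(pooled + s * 1.96 * se) for s in (-1, 1))
# CI is broader than the fixed-effects version because tau2 > 0 here
print(f"summary LR = {math.exp(pooled):.2f} (95% CI, {lo:.2f}-{hi:.2f})")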
++
More statistically experienced readers may recognize that meta-analysis of LRs differs from what they expect. Statisticians, when they accept meta-analysis of diagnostic tests at all, prefer summarizing the DOR as a global measure of test performance. We take a different approach because summarizing the DOR gives clinicians a value that they cannot use for individual patients. Although we do sometimes provide summary measures of the DOR, the summary measures of the prevalence of disease (pretest probability) and the LR are the values needed for solving the equation for posttest probability. Sometimes, we encounter studies that only provide sensitivity data. What do we do with studies that are case series of patients with disease and that do not have specificity values?
+++
“Sensitivity-Only” Studies
++
When conditions are less common, investigators recognize that enrolling consecutive patients at risk for the target disorder creates a study population overwhelmed by those without disease. This approach is costly and takes time, and the small number of patients with disease leads to broad CIs around the sensitivity and LR–. The alternate approach of studying only patients with disease, so that sensitivity can be defined, is pragmatic, and it may be the best the investigator can do. These studies typically come from a narrow spectrum of diseased patients, and often, the clinical finding is recorded when the clinician already knows that disease is present. In addition to understanding the potential biases in the data, we must understand the inferences made from the sensitivity of symptoms and signs without specificity values. The goal of sensitivity-only studies is to identify a group of symptoms and signs that are unlikely to all be negative in a patient with the target condition.
++
Symptoms and signs with high sensitivity are less likely to be negative in patients with disease. When presented with sensitivity data alone, clinicians can count the number of absent findings in their patients and deduce that those with normal findings on multiple high-sensitivity symptoms and signs are unlikely to have disease. For example, suppose we identify 2 symptoms and 1 sign, each of which has a sensitivity of 85% for the target condition. Each finding would be absent in 15% of patients with disease; all 3 would be absent in fewer than 1% of patients (0.15 × 0.15 × 0.15 ≈ 0.003, assuming the findings are independent).
+++
How Do We Use All the Symptoms and Signs?
++
Among several reasons for preferring LRs as our common statistical parameter, rather than the individual sensitivity and specificity values, the ability to multiply likelihood results from several findings is the most alluring. Unfortunately, a crucial assumption is not often fully addressed—sequentially multiplying LRs requires that the symptoms and signs be independent of one another.
++
Let us explain the independence concept with a simple example. Suppose you conduct a study of chest pain symptoms as a predictor of acute ischemia and you categorize words as having “physical” or “emotional” connotations. Words that describe location and radiation would be physical (eg, “center of the chest,” “in the neck”), whereas words that describe the interpretation of pain would be emotional (eg, “suffocating,” “crushing”). You decide to record whenever a patient refers to an “elephant” in describing their discomfort as emotional, as in, “It felt like an elephant stepped on my chest.” We suspect it is obvious that a patient who is “elephant-positive” is experiencing crushing pain, but if they report they are having “crushing pain that feels like an elephant on my chest,” should we report the findings separately for “crushing positive” and “elephant positive?” Multiplying the LRs together for “crushing,” “elephant-like” discomfort probably overstates the importance, producing posttest odds that are too high because elephant-like pain is not independent of crushing pain. Although common sense might work as an initial judge of independence, common sense should not be the only arbiter of independence. What should you do when presented with an array of findings for many symptoms and signs without any assessment of independence?
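++
A numeric sketch (with invented probabilities) makes the overstatement explicit: when one finding occurs only as a subset of another, the correct joint LR is far smaller than the naive product.

# Hypothetical conditional probabilities of each report
p_crushing_d, p_crushing_nd = 0.6, 0.3    # "crushing" pain
# "elephant" is reported by half of the crushing-pain patients and no one else
p_elephant_d, p_elephant_nd = 0.3, 0.15

lr_crushing = p_crushing_d / p_crushing_nd   # 2.0
lr_elephant = p_elephant_d / p_elephant_nd   # 2.0

# Naive product treats the findings as independent
print(lr_crushing * lr_elephant)             # 4.0

# But every elephant-positive patient is crushing-positive, so the
# correct joint LR for reporting both is just the elephant LR:
print(p_elephant_d / p_elephant_nd)          # 2.0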
++
To make teaching and performing the medical history and physical examination more efficient and accurate, we want parsimony. By “parsimony,” we mean the fewest number of symptoms and signs that yield the most accurate information. Parsimonious examinations force teachers to teach only the most relevant parts of the examination, allowing students to spend more time learning what is important while eliminating wasteful maneuvers. Of course, some of the eliminated waste consists of maneuvers that do not work well. For example, a Rinne test is interesting to teach, but it does not add useful diagnostic information to the symptom of “decreased hearing” reported by the patient.5 We eliminate additional wasted effort when we discard nonindependent findings.
++
A parsimonious examination should mathematically make us more accurate because a “complete” medical history and physical examination almost certainly produces nonindependent findings. “Positive” nonindependent findings confuse us and distort our probability estimates, typically making us infer a higher probability of disease than is justified. Most authors of The Rational Clinical Examination articles emphasize no more than 3 to 4 findings, even when additional symptoms and signs have useful LRs. Narrowing down the number of recommended findings requires “face validity,” by which we mean using common sense to recommend the items with the best, seemingly independent LRs. When we take this approach, experienced clinicians then use semiquantitative reasoning and deduce that the more findings present, the more likely the patient has disease (or vice versa).
++
When clinicians want to incorporate the results of diagnostic studies into their decision making, they can take 3 approaches to prevent errors created by lack of independence.6 Performing the clinical examination and then using only a single history or physical examination finding to adjust the prior odds will guarantee there is no problem with independence. (Of course, it also guarantees that the clinician might be ignoring a lot of useful clinical information!) Typically, the clinician will want to use the single finding that has the greatest effect on the prior odds, or the “best” finding that we described earlier. The approach is not difficult since simple math allows you to rank the findings in order from most useful to least useful. Suppose you have 3 findings (A, B, and C) that can each be positive or negative, with the LRs associated with each result shown in Table 1-6. Is the finding that “A” is present more diagnostically useful than “C's” absence? To determine this, you can rank order these by comparing the LR for the positive results to 1/LR for the negative results. Table 1-6 shows the relative value of each of the findings. If your patient had “A” absent, “C” present, and “B” present, then you would multiply the prior odds by the LR associated with the outcome for test “B” (LR = 5.0) because it had the most useful outcome for that individual.
++
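Because Table 1-6 is not reproduced here, the sketch below uses hypothetical LRs (only test B's LR+ of 5.0 comes from the example in the text) to show the bookkeeping: express each observed outcome as its distance from an LR of 1, taking 1/LR for results below 1, and keep the most extreme one.

# Hypothetical LRs; the patient's outcomes: A absent, B present, C present
findings = {"A": {"present": 8.0, "absent": 0.5},
            "B": {"present": 5.0, "absent": 0.4},
            "C": {"present": 2.0, "absent": 0.1}}
observed = {"A": "absent", "B": "present", "C": "present"}

def strength(lr):
    # Distance from LR = 1, expressed on the >= 1 scale
    return lr if lr >= 1 else 1 / lr

best = max(observed, key=lambda f: strength(findings[f][observed[f]]))
print(best, findings[best][observed[best]])  # B 5.0
++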
Although the above approach removes any concerns with independence, the clinician must collect many data that ultimately are discarded. At the very least, it is not efficient, and at worst, important information could be ignored. Not surprisingly, this approach lacks appeal because it ignores the way most clinicians incorporate many bits of information into their decision making.
++
Clinical researchers must analyze their data in a multivariate way to help clinicians. By “multivariate,” we mean that they must analyze combinations of findings so that there is less concern about independence. This can involve one of 2 general approaches. The easiest approach is to take the medical history and physical examination findings and perform logistic regression. Logistic regression takes a number of individual variables and determines their importance in predicting whether disease is present or absent. In the first strategy for assessing independence, logistic regression identifies variables that lack independence and that can be eliminated as redundant. For example, if all patients with wheezing were also dyspneic, then the “variable” dyspnea might be unimportant once we know the wheezing status. The logistic regression approach would identify dyspnea as nonsignificant, and the investigator would suggest we concentrate our efforts on assessing for wheezing. Used as a “data-reduction” step to achieve parsimony, this strategy lets the clinician use the simple, univariate LRs for any finding identified as being independently useful in the logistic model. This approach has a lot of appeal because it identifies the important and useful variables for the clinician, and it does not require that they understand the logistic model itself, because the univariate LRs are used. However, in using the simple, unadjusted LRs, we ignore the relationship between the various clinical findings in favor of simplicity.
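++
A hedged sketch of this data-reduction step on synthetic data (statsmodels is one of several libraries that can fit the model): dyspnea is constructed to mostly echo wheezing, so its coefficient contributes little once wheezing is in the model.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
disease = rng.binomial(1, 0.3, n)
# Wheezing is truly informative about disease status
wheeze = rng.binomial(1, np.where(disease == 1, 0.7, 0.2))
# Dyspnea usually just repeats the wheezing status
dyspnea = np.where(rng.random(n) < 0.9, wheeze, rng.binomial(1, 0.5, n))

X = sm.add_constant(np.column_stack([wheeze, dyspnea]))
fit = sm.Logit(disease, X).fit(disp=False)
print(fit.params)    # [intercept, wheeze, dyspnea]; dyspnea beta near 0
print(fit.pvalues)   # dyspnea typically nonsignificant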
++
The β parameters of a multivariate logistic analysis describe the relative importance of symptoms and signs. From algebra, you might remember that the equation for a straight line is y = mx + b. The m in the equation is the slope, and it quantifies how a change in x affects y.∗ A logistic model works similarly, except that now, rather than having 1 x, we have several symptoms and signs that we evaluate all at once. The equivalent of m in the logistic model is the β parameter for each symptom or sign, and its exponent (eβ) is the odds ratio associated with that finding; the higher the β parameter, the more important the finding. When investigators provide us the actual multivariate models, we can put the results of our own patient's clinical examination into the model, and the outcome is the individual patient's actual probability of disease.
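++
As a sketch of that final step (the intercept and β values below are invented, not from any published model):

import math

def logistic_probability(intercept, betas, x):
    # p = 1 / (1 + exp(-(b0 + sum(beta_i * x_i))))
    z = intercept + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical model: intercept -2.0; betas for wheeze and dyspnea;
# patient with wheezing present (1) and dyspnea absent (0)
print(round(logistic_probability(-2.0, [1.5, 0.3], [1, 0]), 2))  # 0.38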
++
+++
The Fuss About Precision
++
The Primer states, “for an item of the clinical history or physical examination to be accurate, it first must be precise.” By precision, we mean that 2 or more observers agree on the presence or absence of a finding in a patient who experienced no clinical changes.∗
++
When we measure precision, describing the percentage of time that 2 observers agree on a symptom or sign fails to consider simple luck. Instead of reporting simple agreement, investigators report precision as the agreement beyond that attributable to chance. For dichotomous findings (“yes” vs “no” or “present” vs “absent”) compared between 2 observers, we quantify this agreement beyond chance with the κ statistic.† The κ statistic varies from –1 (perfect disagreement) to 0 (chance agreement) to +1 (perfect agreement).
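++
A small sketch of the κ computation for 2 observers rating a dichotomous finding (the agreement counts are invented):

# Agreement table for 2 observers on one finding
#                 obs2 present  obs2 absent
table = [[40, 10],   # obs1 present
         [ 5, 45]]   # obs1 absent

n = sum(sum(row) for row in table)
p_observed = (table[0][0] + table[1][1]) / n          # raw agreement

row1, row2 = (sum(row) for row in table)              # obs1 marginals
col1 = table[0][0] + table[1][0]                      # obs2 marginals
col2 = table[0][1] + table[1][1]
p_chance = (row1 * col1 + row2 * col2) / n ** 2       # expected by luck

kappa = (p_observed - p_chance) / (1 - p_chance)
print(round(kappa, 2))  # 0.70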
++
Suppose we are interested in whether a third heart sound identifies patients with LV systolic dysfunction. It is easy to imagine that a cardiologist might be better at identifying this correctly than a generalist internist, suggesting that a κ statistic might show lower agreement beyond chance than if we were comparing 2 general physicians. Should we conclude that a third heart sound is not a good test from the precision between a cardiologist and a general internist? The answer, of course, is no because test accuracy depends on the quality of the observation—the cardiologist might be a better observer than a less experienced clinician. These seemingly imprecise symptoms and signs are potentially useful when certain providers get consistently good results because they represent opportunities for improved performance and accuracy.
++
A second type of precision is more important for identifying inaccurate findings. Although a low κ between observers points to opportunities for improving, poor intraobserver agreement precludes high accuracy unless the problem can be eliminated. Intraobserver agreement describes whether a clinician gets the same result when assessing a symptom or sign on a patient who is clinically unchanged. For example, when a clinician inquires about unilateral headaches as a symptom of migraine but the patient changes his or her answer, the finding can never be accurate or precise. Although the natural assumption might be to blame the patient for inconsistency, part of poor intraobserver agreement may be attributable to poor technique that can be improved. This is true even when applied to symptoms as reported by the patient because different answers follow when the information is solicited differently (eg, asking the patient a leading question about unilateral headaches vs an open-ended question). But if clinicians cannot assure the reliability of their own findings, they will never use the symptoms and signs accurately. If you cannot agree with yourself, the LR results will be random.
++
+++
A Brief Word About Quality
++
Every article in The Rational Clinical Examination series and the updates in this book use a standard process for assessing the quality of data. Although the Primer focuses mostly on the sensitivity, specificity, and LR results, it should be clear that narrow CIs around the results do not assure methodologic rigor of the studies that generated the results. At the inception of The Rational Clinical Examination series, the evidence-based medicine movement was in its infancy. An early article in the series heralded its entry into the mainstream thinking of clinical educators and investigators.7 Because standardized approaches had not been developed for assessing the quality of the medical history and physical examination, David L. Sackett, MD, and Charles H. Goldsmith, PhD, agreed on certain characteristics that they asked their reviewers to use when judging quality. The criteria were simplified and summarized in an early article of the series.8 Subsequently, several groups have published their criteria for the review of diagnostic accuracy studies, although none address the particular nuances of symptoms and signs.9-11 Perhaps it is not surprising that many clinical investigators and epidemiologists have reported on a large number of quality measures that describe what seem like innumerable potential biases in diagnostic test studies. Despite the increasing complexity of rating systems and quality measures, the original criteria for reviewing articles have stood the test of time and pragmatism. If anything, we made the process easier and reduced the number of quality levels a reviewer might assign an article. We reviewed the recommendations for diagnostic test studies9,10 and adapted them specifically for studies of the clinical examination.12 In the early articles appearing in The Rational Clinical Examination series, we assigned Grades for levels of evidence. However, this blurred the distinction between Levels 3, 4, and 5. Because Level 5 evidence is not accepted in making recommendations, we dropped the Grade designation and now report only the Levels, as shown in Table 1-7.13
++
Most of the important biases that compromise a study's results follow from the study population not being consecutive, prospective, or independently assessed with an appropriate blindly applied reference standard. By consecutive, we mean that the authors enrolled all patients for whom the target disorder was a reasonable consideration. Independent means that the symptom or sign under study was not used to select patients for the study. Blinded means that the symptoms and signs were applied without knowledge of the presence of disease determined by the reference standard, but also that the reference standard was interpreted without knowledge of the study questions. The size of a study (level 1 vs level 2) for quality assessment depends on the disease under consideration. The authors of The Rational Clinical Examination evaluate sample sizes according to their review of the literature because there is no uniform number that determines quality; for example, a large study of thoracic aortic aneurysms would likely not have as many patients as a large study of urinary tract infection in women.
++
One particular bias, verification bias, deserves special consideration because it can be insidious and have a big effect on the LR. Verification bias occurs when not all of the potentially eligible patients undergo confirmation of their disease status. Often, this happens for pragmatic reasons. An example might be a study of headache patients that seeks to describe whether asymmetric neurologic findings (eg, weakness) indicate serious intracranial abnormalities discovered through neuroimaging. Because it would be expensive and impractical to have every patient with headaches undergo imaging, an investigator typically chooses to maximize the chance of finding something by including all patients with asymmetric muscle strength but only a sample of those who are normal. We can highlight the effect of verification bias on the sensitivity, specificity, and LRs by examining tables of example data. Suppose an investigator reports the findings displayed in Table 1-8.
++
In the example, the finding looks excellent, with a sensitivity and specificity of 90%. However, because the investigator could not justify applying the reference standard universally (eg, neuroimaging on every patient with a headache), the investigative team referred all patients with positive clinical findings but only a sample of those with normal findings (for illustrative purposes, 10%). Had the investigator been evaluating every patient, the findings might have been as shown in Table 1-9.
++
The data demonstrate that verification bias tends to overestimate sensitivity while underestimating specificity.∗ When the bias is left unadjusted, the investigator will not recognize that the presence of the finding is actually better than suggested (the adjusted LR+ should be higher), whereas the absence of the finding is not as good as suggested (the adjusted LR– should be closer to 1). Astute investigators will recognize that if they collect complete data on all the potentially eligible patients, the bias is one of the few in diagnostic test research that can be mathematically corrected.
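++
A hedged sketch in the spirit of that correction (all counts are invented; Tables 1-8 and 1-9 are not reproduced here): if only a fraction f of finding-negative patients were verified, weight the verified negative row by 1/f before recomputing.

def corrected_accuracy(verified, f_negative):
    # verified = {"positive": (diseased, nondiseased),
    #             "negative": (diseased, nondiseased)} among verified patients
    # f_negative = fraction of finding-negative patients sent for verification
    d_pos, nd_pos = verified["positive"]
    d_neg, nd_neg = (x / f_negative for x in verified["negative"])  # reweight
    sens = d_pos / (d_pos + d_neg)
    spec = nd_neg / (nd_pos + nd_neg)
    return sens, spec

# Biased table: sensitivity = specificity = 90% among verified patients
sens, spec = corrected_accuracy(
    {"positive": (45, 5), "negative": (5, 45)}, f_negative=0.10)
print(round(sens, 2), round(spec, 2))  # 0.47 0.99
print(round(sens / (1 - spec), 1))     # corrected LR+ is higher (43.1 vs 9.0)
print(round((1 - sens) / spec, 2))     # corrected LR- is closer to 1 (0.53 vs 0.11)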
++
+++
References for the Update
1. Glas AS, Lijmer JG, Prins MH, Bonsel GJ, Bossuyt PMM. The diagnostic odds ratio: a single indicator of test performance. J Clin Epidemiol. 2003;56(11):1129-1135. [PubMed: 14615004]
2. Simel DL, Samsa GP, Matchar DB. Likelihood ratios with confidence: sample size estimation for diagnostic test studies. J Clin Epidemiol. 1991;44(8):763-770. [PubMed: 1877423]
4. Devillé WL, Buntinx F, de Vet R, Lijmer J, Montori V. Guidelines for conducting systematic reviews of studies evaluating the accuracy of diagnostic tests. In: Knottnerus JA, ed. The Evidence Base of Clinical Diagnosis. London, England: BMJ Books; 2002.
6. Holleman DR, Simel DL. Quantitative assessments from the clinical examination: how should clinicians integrate the numerous results? J Gen Intern Med. 1997;12(3):165-171. [PubMed: 9100141]
7. Guyatt G, Cairns J, Churchill D, et al; Evidence-Based Medicine Working Group. Evidence-based medicine: a new approach to teaching the practice of medicine. JAMA. 1992;268(17):2420-2425.
9. Bossuyt PMM, Reitsma JB, Bruns DE, et al; for the STARD Group. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Clin Chem. 2003;49(1):1-6.
10. Bossuyt PMM, Reitsma JB, Bruns DE, et al. The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Clin Chem. 2003;49(1):7-18. [PubMed: 12507954]
11. Whiting P, Rutjes AWS, Reitsma JB, et al. The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Med Res Methodol. 2003;3:25. [PubMed: 14606960]
12. Simel DL, Rennie D, Bossuyt PM. The STARD statement for reporting diagnostic accuracy studies: application to the history and physical examination. J Gen Intern Med. 2008;23(6):768-774. [PubMed: 18347878]