Limitations in the design of studies that measure patients' experience include issues beyond risk of bias. In this chapter, therefore, we continue to use the term "validity" to address both risk of bias and these additional issues.
Have the Investigators Measured Aspects of Patients' Lives That Patients Consider Important?
We have described how investigators often substitute end points that make intuitive sense to them for those that patients value. Clinicians can recognize these situations by asking themselves the following question: if the end points the investigators measured were the only things that changed, would patients be willing to take the treatment? Beyond changes in clinical or physiologic variables, patients would require that they feel better or live longer. For instance, if a treatment for osteoporosis increased bone density without preventing back pain, loss of height, or fractures, patients would not be interested in risking the adverse effects—or incurring the costs and inconvenience—of treatment. The extent to which an HRQL instrument comprehensively samples all the concepts or dimensions of health status important to patients reflects its content validity.
How can clinicians be sure that investigators have measured aspects of life that patients value? Investigators may find that the outcomes they have measured are important to patients by asking them directly.
For example, in a study that examined PROs in patients with chronic airflow limitation who were recruited from a secondary care respirology clinic, the investigators used a literature review and interviews with clinicians and patients to identify 123 items that reflected possible ways that patients' illness might affect their quality of life.7 The investigators then asked 100 patients to identify the items that were relevant to them and to indicate how important those items were. They found that the most important problem areas for patients were their dyspnea during day-to-day activities and their chronic fatigue. An additional area of difficulty was emotional function, including feelings of frustration and impatience.
If the authors do not present direct evidence that their outcome measures are important to patients, they may cite previous work. For example, researchers conducting a randomized trial of respiratory rehabilitation in patients with chronic lung disease used a PRO measure based on the responses of patients in the study described just above, and they referred to that study.17 Ideally, the report will include enough information about the questionnaire to obviate the need to review previous reports.
Another alternative is to describe the content of the outcome measures in detail. An adequate description of the content of a questionnaire allows clinicians to use their own experience to decide whether what is being measured is important to patients.
For instance, the authors of an article describing a randomized trial of surgery vs watchful waiting for benign prostatic hyperplasia assessed the degree to which urinary difficulties bothered the patients or interfered with their activities of daily living, sexual function, social activities, and general well-being.18 Few would doubt the importance of these items and the need to include them in the results of the trial.
USING THE GUIDE
The PANSS, used in the study of antipsychotics for chronic schizophrenia, covers a wide range of psychopathologic symptoms that patients with schizophrenia may experience, including the so-called positive symptoms (7 items covering delusions, hallucinations, and so on), the so-called negative symptoms (7 items covering blunted affect, withdrawal, and so on), and general psychopathology (16 items covering anxiety, depression, and so on).14 These items can capture the overall picture of a patient's symptoms well but may miss more general aspects of HRQL, such as a sense of well-being or satisfaction with life.
Is the Instrument Reliable (When Measuring Severity) or Responsive (When Measuring Change)?
There are 2 ways in which investigators use PROs. They may wish to help clinicians distinguish between people who have a better or worse level of HRQL or to measure whether people are feeling better or worse over time.19
For instance, suppose a trial of a new drug for patients with heart failure finds that it works best in patients with New York Heart Association (NYHA) functional class III or IV symptoms. We could use the NYHA classification for 2 purposes. First, for treatment decisions, we might use it as a tool to discriminate between patients who do and do not warrant therapy. Second, we might want to determine whether the drug was effective in improving an individual patient's functional status and, in so doing, monitor changes in the patient's NYHA functional class. For this second purpose, however, the NYHA classification, which has only 4 levels, would likely not perform adequately.
If, when we are trying to discriminate among people with differing levels of disease severity at a single point in time, everyone gets the same score, we will not be able to separate the severely diseased from those with minor disease. The differences in disease severity we are trying to detect—the signal—come from cross-sectional differences in scores among patients. The bigger these differences are, the better the instrument is in discriminating among patients with different levels of disease severity (ie, the better its performance).
At the same time, if scores recorded from the same stable patients on repeated measurements fluctuate wildly—we call this fluctuation the noise—we will not be able to determine, with any sense of certainty, the patients' relative well-being.20 The greater the noise, which comes from variability within patients, the more difficulty we will have detecting the signal.
The technical term usually used to describe the ratio of the variability between patients (the signal) to the total variability (the signal plus the noise) is reliability. If patients' scores change little over time when their status is in fact stable, but differ substantially from patient to patient according to disease severity, reliability will be high. If changes in scores within patients are large relative to differences among patients, reliability will be low.
The mathematical expression of reliability is the variance (or variability) among patients divided by the sum of the variance among patients and the variance within patients. One index of reliability measures the homogeneity, or internal consistency, of scores on the items constituting a scale, expressed by the Cronbach α coefficient. Cronbach α ranges from 0 to 1, and values of at least 0.7 are desirable.
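In symbols (a standard formulation consistent with the verbal definitions above; the notation is ours), for a scale of k items:

```latex
\text{Reliability} = \frac{\sigma^{2}_{\text{between}}}{\sigma^{2}_{\text{between}} + \sigma^{2}_{\text{within}}}
\qquad\qquad
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^{2}_{i}}{\sigma^{2}_{\text{total}}}\right)
```

where σ²between is the variance among patients, σ²within the variance within patients, σ²i the variance of scores on item i, and σ²total the variance of the summed scale score.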
A more useful index, test-retest reliability, refers to the reproducibility of measurements when the same instrument is applied to the same stable patients. Preferred mathematical expressions of this type of reliability are κ when the scale is dichotomous or categorical (see Chapter 19.3, Measuring Agreement Beyond Chance) and the intraclass correlation coefficient (ICC) when the scale is continuous. Both measures vary between −1 and 1. As a rough rule of thumb, values of κ or the ICC should exceed 0.7.
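As a concrete illustration (a minimal sketch on invented data, not an analysis from any study cited in this chapter), the following computes both indices: κ for a dichotomized score and a one-way ICC for a continuous score, each from 2 administrations of an instrument to stable patients.

```python
import numpy as np

def cohen_kappa(r1: np.ndarray, r2: np.ndarray) -> float:
    """Chance-corrected agreement between 2 categorical ratings."""
    categories = np.union1d(r1, r2)
    p_obs = np.mean(r1 == r2)                                   # observed agreement
    p_exp = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in categories)  # chance agreement
    return (p_obs - p_exp) / (1 - p_exp)

def icc_one_way(scores: np.ndarray) -> float:
    """One-way random-effects ICC(1,1); rows = stable patients,
    columns = repeated administrations of the same instrument."""
    n, k = scores.shape
    row_means = scores.mean(axis=1)
    ms_between = k * ((row_means - scores.mean()) ** 2).sum() / (n - 1)
    ms_within = ((scores - row_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

rng = np.random.default_rng(1)
true_score = rng.normal(50, 15, size=100)         # 100 hypothetical stable patients
test = true_score + rng.normal(0, 6, size=100)    # first administration, with measurement error
retest = true_score + rng.normal(0, 6, size=100)  # second administration

print(f"ICC:   {icc_one_way(np.column_stack([test, retest])):.2f}")  # rule of thumb: >= 0.7
print(f"kappa: {cohen_kappa(test >= 50, retest >= 50):.2f}")         # dichotomized at 50
```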
In patients with chronic heart failure, we might want to determine whether a new drug was effective in improving patients' functional status, and to achieve this goal we might monitor changes in patients' NYHA functional class. When we use instruments to evaluate change over time, they must be able to detect any important changes in the way patients are feeling, even if those changes are small. In this case, the signal comes from the difference in scores among patients whose status has improved or deteriorated, and the noise comes from the variability in scores among patients whose status has not changed. The term we use for this ability to detect change (the signal-to-noise ratio over time) is responsiveness, sometimes also referred to as sensitivity to change.
An unresponsive instrument can result in false-negative results, in which the intervention improves how patients feel, yet the instrument fails to detect the improvement. This problem may be particularly salient for questionnaires that have the advantage of covering all relevant areas of HRQL but the disadvantage of covering each area superficially. With only 4 categories, a crude instrument such as the NYHA functional classification may work well for stratifying patients according to their level of disability but is very unlikely to detect small but important improvements in health status that result from treatment.
There is no universally accepted mathematical expression for responsiveness. Some studies judge a scale to be responsive when it detects a statistically significant change after an intervention of known efficacy. For example, the Chronic Respiratory Questionnaire (CRQ), developed in the study of patients with chronic airflow limitation described earlier, was found to be responsive when all of the domain scores improved substantially after initiation or modification of treatment, despite only small improvements in spirometric values.7 Despite this high responsiveness, one of the CRQ subscales was subsequently found to have modest reliability (internal consistency, 0.53; test-retest reliability, 0.73).21
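Although no single expression is standard, one widely used index (our illustration, not the chapter's prescription) is the standardized response mean: the mean change in score divided by the standard deviation of those changes, computed in patients who received an intervention of known efficacy. A minimal sketch on invented data:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical domain scores (eg, a 1-7 dyspnea scale) before and after
# an intervention of known efficacy; all numbers are invented.
before = rng.normal(4.0, 1.0, size=40)
after = before + rng.normal(0.6, 0.8, size=40)    # true mean improvement of 0.6

change = after - before
srm = change.mean() / change.std(ddof=1)          # standardized response mean
print(f"Standardized response mean: {srm:.2f}")   # larger values = more responsive
```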
In studies that find no difference in change in PROs when patients are in a treatment group vs a control group, clinicians should look for evidence that the instruments have been able to detect small but important effects in previous investigations. In the absence of this evidence, instrument unresponsiveness becomes a plausible reason for the failure to detect differences in PROs between the treatment and the control groups.
For example, researchers who conducted a randomized trial of a diabetes education program reported no changes in 2 measures of well-being, attributing the result to, among other factors, lack of integration of the program with standard therapy.22 However, patients in the education program, compared with control patients who did not receive it, improved in knowledge and self-care and felt less dependent on physicians. Given these changes, another explanation for the negative result (no difference between treatments in well-being) is inadequate responsiveness of the 2 well-being measures the investigators used.
USING THE GUIDE
In the report of the CATIE trial,1 the authors do not address the responsiveness of the PANSS. A prior comparison of the PANSS with an independent global assessment of change, however, persuasively demonstrated its responsiveness.16
Does the Instrument Relate to Other Measurements in the Way It Should?
Validity has to do with whether the instrument is measuring what it is intended to measure. The absence of a reference standard for HRQL creates a challenge for anyone hoping to measure patients' experience. We can be more confident that an instrument is doing its job if the items appear to measure what is intended (the instrument's face validity), although face validity alone is of limited help. Empirical evidence that it measures the domains of interest allows stronger inferences.
To provide such evidence, investigators have borrowed validation strategies from psychologists, who for many years have thought carefully about how to best determine whether questionnaires that assess intelligence and attitudes really measure what is intended.
Establishing validity involves examining the logical associations that should exist among assessment measures. For example, we would expect that patients with a lower treadmill exercise capacity generally will have more dyspnea in daily life than those with a higher exercise capacity, and we would expect to see substantial correlations between a new measure of emotional function and existing emotional function questionnaires.
When we are interested in evaluating change over time, we examine correlations of changes in scores. For example, patients who deteriorate in their treadmill exercise capacity should, in general, experience increases in dyspnea, whereas those whose exercise capacity improves should experience less dyspnea, and a new emotional function measure should reveal improvement in patients who improve on existing measures of emotional function. The technical term for this process is testing an instrument's construct validity.
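As an illustration of this logic (a minimal sketch; the data, effect sizes, and variable names are invented), changes on a hypothetical new dyspnea measure should correlate negatively with changes in treadmill exercise capacity, and changes on a hypothetical new emotional-function measure should correlate positively with changes on an established one:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
n = 60
capacity_change = rng.normal(0, 1, size=n)                   # change in treadmill capacity
# Construct prediction 1: improved capacity should mean less dyspnea
dyspnea_change = -0.8 * capacity_change + rng.normal(0, 0.6, size=n)
# Construct prediction 2: new and established emotional-function measures move together
emotion_change = rng.normal(0, 1, size=n)                    # established measure
new_emotion_change = 0.8 * emotion_change + rng.normal(0, 0.6, size=n)

r_dyspnea, _ = pearsonr(dyspnea_change, capacity_change)     # expect substantially negative
r_emotion, _ = pearsonr(new_emotion_change, emotion_change)  # expect substantially positive
print(f"dyspnea vs capacity: r = {r_dyspnea:.2f}; "
      f"new vs established emotion: r = {r_emotion:.2f}")
```

Correlations in the predicted directions and of the predicted magnitudes support construct validity; unexpectedly weak or reversed correlations argue against it.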
Clinicians should look for evidence of the validity of PRO measures used in clinical studies. Reports of randomized trials that used PRO measures seldom review evidence of the validity of the instruments they use, but clinicians can gain some reassurance from statements (backed by citations) that the questionnaires have been validated previously. In the absence of evident face validity or empirical evidence of construct validity, clinicians are entitled to skepticism about the study's measurement of HRQL.23
A final concern arises when a measurement instrument is used in a cultural and linguistic environment different from the one in which it was developed—typically, use of a non-English version of an English-language questionnaire. Ideally, such non–English-language versions have undergone a translation process that ensures that the new version reflects the idiom and attitudes of the local population, a process called linguistic and cultural validation.24 At the very least, translation should follow a procedure known as back-translation, in which a first group of researchers translates the original into the new language, a second group blindly translates it back into English, and a third group ascertains the equivalence of the original and back-translated versions and resolves any discrepancies. If investigators provide no reassurance of appropriate linguistic validation, the clinician has another reason for caution regarding the results. For example, in a review of 44 versions of the McGill Pain Questionnaire representing 26 languages and cultures, clinimetric testing of the adapted questionnaires was generally poorly performed regardless of the method of cross-cultural adaptation: only 9 versions had undergone back-translation, and for 18 versions no testing at all had been undertaken.25
USING THE GUIDE
In the antipsychotics study,1 the investigators provide no citation to support the validity of the PANSS. As noted above, a quick search of PubMed (entering “PANSS” with no restriction) identified 2441 articles, showing that it is a widely used measure in psychiatry. Two reports describe extensive validation of the instrument.14,15
Are There Important Aspects of Health-Related Quality of Life That Have Been Omitted?
Although investigators may have addressed HRQL issues, they may not have done so comprehensively. When measuring patients' discomfort, distress, and disability, one can think of a hierarchy that begins with symptoms, moves on to the functional consequences of the symptoms, and ends with more complex elements, such as emotional function. Exhaustive measurement may be important in some contexts but not others.
If, as a clinician, you believe your patients' sole interest is in whether a treatment relieves the primary symptoms and most important functional limitations, you will be satisfied with a limited range of assessments. Randomized trials in patients with migraine26 and postherpetic neuralgia27 were restricted primarily to the measurement of pain, and studies of patients with rheumatoid arthritis28 and back pain29 measured pain and physical function but not emotional or social function. Depending on the magnitude of effect on pain, the adverse effects of the medication, and the circumstances of the patient (degree of pain, concern about toxicity, degree of impairment of function, or emotional distress), lack of comprehensiveness of outcome measurement may or may not be important.
Thus, as a clinician, you can judge whether these omissions are important to you or, more to the point, to your patients. Bear in mind that although such omissions may be unimportant to some patients, they may be critical to others (see Chapter 27, Decision Making and the Patient). We therefore encourage you to keep in mind the broader effect of disease on patients' lives.
Disease-specific HRQL measures that explore the full range of patients' problems and experience remind us of domains we might otherwise forget. We can trust these measures to be comprehensive if the developers have conducted a detailed survey of patients with the illness or condition.
For example, the American College of Rheumatology developed the 7-item core set of disease activity measures for rheumatoid arthritis, 3 of which represent patients' own reports of pain, global disease activity, and physical function.30 Despite the extensive and intensive development process of the 7 core items, the data set, when presented to patients, failed to include an important aspect of disease activity: fatigue.31
If you are interested in going beyond the specific illness and comparing the effect of treatments on PROs across diseases or conditions, you will look for a more comprehensive assessment. These comparisons require generic HRQL measures, covering all relevant areas of HRQL, that are designed for administration to people with any kind of underlying health problems (or no problem at all).
One type of generic measure, a health profile, yields scores for all domains of HRQL (eg, mobility, self-care, and physical, emotional, and social function). The most popular health profiles are short forms of the instruments used in the Medical Outcomes Study.32,33 Inevitably, such instruments cover each area superficially, which may limit their responsiveness. Indeed, generic instruments are less powerful in detecting treatment effects than specific instruments.34 Ironically, generic instruments also may not be sufficiently comprehensive; in certain cases, they may completely omit patients' primary symptoms. Even when investigators use both disease-specific and generic measures, these may still fail to adequately address adverse effects or toxicity of therapy.
For example, in a study of methotrexate for patients with inflammatory bowel disease,35 patients completed the Inflammatory Bowel Disease Questionnaire, which addresses patients' bowel function, emotional function, systemic symptoms, and social function. Coincidentally, it measures some adverse effects of methotrexate, including nausea and lethargy, because they also afflict patients with inflammatory bowel disease who are not taking methotrexate, but it fails to measure other adverse effects, such as rash or mouth ulcers.
The investigators could have administered a generic instrument to assess aspects of patients' health status not related to inflammatory bowel disease, but once again, such instruments would fail to directly address issues such as rash or mouth ulcers. The investigators chose a checklist approach to eliciting adverse effects and documented the frequency of adverse events, both those severe enough to warrant discontinuation of treatment and those that were not; such an approach, however, provides limited information about the influence of adverse effects on patients' lives.
USING THE GUIDE
In the CATIE trial,1 the investigators not only used the PANSS but also monitored adverse events through systematic queries, administered 3 rating scales of extrapyramidal signs, and measured changes in weight, electrocardiographic findings, and laboratory values. The assessment appears adequately comprehensive.