`
`
`
`
`
`
`
`
A Consumer’s Guide to Subgroup Analyses
Andrew D. Oxman, MD, and Gordon H. Guyatt, MD
`
`
`
`
`
`
`
`
`
`
`
`
The extent to which a clinician should believe and act on the results of subgroup analyses of data from randomized trials or meta-analyses is controversial. Guidelines are provided in this paper for making these decisions. The strength of inference regarding a proposed difference in treatment effect among subgroups is dependent on the magnitude of the difference, the statistical significance of the difference, whether the hypothesis preceded or followed the analysis, whether the subgroup analysis was one of a small number of hypotheses tested, whether the difference was suggested by comparisons within or between studies, the consistency of the difference, and the existence of indirect evidence that supports the difference. Application of these guidelines will assist clinicians in making decisions regarding whether to base a treatment decision on overall results or on the results of a subgroup analysis.
`
`
`
`
`
`
`
`
`
`
`
`
`
`
Annals of Internal Medicine. 1992;116:78-84.
`
`
`
`
`
`
`
From McMaster University Health Sciences Centre, Hamilton, Ontario. For current author addresses, see end of text.
`
`
`
`
`
`
`
`
`
`
`
`
Clinicians faced with a treatment decision about a particular patient are interested in the evidence that pertains most directly to that individual. Thus, it is frequently of interest to examine a particular category of participants in a clinical trial: for example, the women, those in a certain age group, or those with a specific pattern of disease. In observational studies, these examinations, or subgroup analyses, are routine. They are also frequently encountered in reports of clinical trials. In a survey of 45 clinical trials reported in three leading medical journals, Pocock and colleagues (1) found at least one subgroup analysis that compared the response to treatment in different categories of patients in 51% of the reports.
`
`
`
`
`
`
The results of subgroup analyses have had major effects, sometimes harmful, on treatment recommendations. For example, many patients with suspected myocardial infarction who could have benefited from thrombolytic therapy may not have received this treatment as a result of subgroup analyses based on the duration of symptoms before treatment (2) and the conclusion that streptokinase was only effective in patients treated within 6 hours after the onset of pain (3, 4). A later, larger trial showed that streptokinase was effective up to 24 hours after the onset of symptoms (5).
`
`
`
`
`
`
`
`
`
Conclusions based on subgroup analyses can have adverse consequences both when a particular category of patients is denied effective treatment (a "false-negative" conclusion), as in the above example, and when ineffective or even harmful treatment is given to a subgroup of patients (a "false-positive" conclusion). Because of these risks and their frequency, the appropriateness of drawing conclusions from subgroup analyses has been challenged (6, 7), and it has been argued that treatment recommendations based on subgroup analyses may do more harm than good. This hypothesis is currently being tested empirically by comparing treatment recommendations generated from early trials of new treatments based on subgroup analyses with treatment recommendations that would have been made had subgroup analyses been ignored, assessing "whether they lead to more patients receiving treatments that are worthwhile and fewer patients receiving treatments that are not." (Sackett DL. Personal communication.)

© 1992 American College of Physicians

Biogen Exhibit 2069
Mylan v. Biogen
IPR 2018-01403
`
`
`
`
`
`
Although we agree that subgroup analyses are potentially misleading and that there is a tendency to overemphasize the results of subgroup analyses, in this paper we will present an alternative point of view. The essence of our argument is that subgroup analysis is both informative and potentially misleading. Rather than arguing for or against the merits of subgroup analysis, we will present guidelines in this article for deciding how believable the results of subgroup analyses are and, consequently, when to act on recommendations based on subgroup analyses and when to ignore them.
`
`
`
`
`
`
`
`
`
Our discussion will focus on randomized trials and meta-analyses of randomized trials (systematic overviews), although the same principles apply to any other research design. The assumption from which we start in this discussion is that the underlying design of the studies being examined is sound. For treatment trials, sound design involves elements of randomization, masking, completeness of follow-up, and other strategies for minimizing both random error and bias (8, 9). If the study is not sound, the overall conclusion is suspect, let alone conclusions based on subgroup analyses.
`
`
`
`
`
Even given a rigorous study design, the extent to which subgroup analyses should be done, or be believed, is highly controversial. Although there are those who ignore scientific principles in the subgroup analyses they undertake and report, go on fishing expeditions, and indulge in data-dredging exercises (10, 11), there are also those who mix apples and oranges, drown in the data they pool (12), reach meaningless conclusions about "average" effects (13), and fail to detect clinically important effects because of the heterogeneity of their study groups (14). Although the debate between these two camps is entertaining and can lead to some useful insights, practical advice for assessing the strength of inferences based on subgroup analyses is also important. In providing such advice, we will build on criteria that have been suggested by other authors (15-18).
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
Our criteria are summarized in Table 1 and are described in detail below. An example of a hypothesized difference in subgroup response and the extent to which it meets our proposed criteria is given in Table 2. We will use this example in the text to highlight some of the relevant issues. It should be noted from the outset that our criteria, like any guidelines for making an inference, do not provide hard and fast rules; they simply represent an organized approach to making reasonable judgments.

Table 1. Guidelines for Deciding whether Apparent Differences in Subgroup Response Are Real

1. Is the magnitude of the difference clinically important?
2. Was the difference statistically significant?
3. Did the hypothesis precede rather than follow the analysis?
4. Was the subgroup analysis one of a small number of hypotheses tested?
5. Was the difference suggested by comparisons within rather than between studies?
6. Was the difference consistent across studies?
7. Is there indirect evidence that supports the hypothesized difference?

Guidelines for Deciding whether Apparent Differences in Subgroup Response Are Real

Conceptual Approach Underlying the Guidelines

Subgroup analyses of data from randomized trials or meta-analyses are undertaken to identify "effect modifiers," characteristics of the patients or treatment that modify the effect of the intervention under study. Statistical "interactions" in a set of data are measured to estimate effect modification (an epidemiologic concept) in the population represented by the study sample (19). The term interaction is sometimes (but not in this paper) also used to refer to the concept of synergism or antagonism, a biologic mechanism of action in which the combined effect of two or more factors differs from the sum of their solitary effects (20). In the following discussion, we use the term "interaction" to refer to situations in which the observed effectiveness of an intervention differs across subgroups.

The premise underlying the hypothesis that subgroup analyses do more harm than good is that "unanticipated qualitative interactions" are unusual and, when apparent unanticipated interactions are discovered, they are usually artifacts due to chance. The same position can be taken with respect to apparent differences between treatment effects in drugs of a single class: this would suggest that the best estimate of the effect of any one drug is the overall effect of the group of drugs across all methodologically adequate, randomized, controlled trials (21). There is confusion, however, over the fundamental distinction between a "qualitative interaction" and a "quantitative interaction" (22). Although a strict definition of a qualitative interaction would mean that there is a sign reversal (22) (meaning that the treatment is beneficial in one group and harmful in another), it is also used to refer to a substantial quantitative interaction (that is, a difference in the magnitude of effect that is clinically important). From a clinical point of view, it is important to recognize that a substantial quantitative interaction can be as important as a qualitative interaction. For instance, the side effects of a treatment may be such that it is worth administering to patients in whom the magnitude of the treatment effect is large, but not to patients in whom the treatment effect is small or moderate.

Having said this, it is still reasonable to distinguish between interactions that are clinically trivial and those that are clinically important. The former can be ignored, and that is the point at which our guidelines begin.

Once the clinician has decided that an interaction, if real, would be important, the subsequent six criteria can be used to help decide on the credibility of the proposed subgroup difference. Three of the criteria (2 to 4) are markers of the potential for random error (that is, mistakes due to chance); one (criterion 5) is a marker of the potential for systematic errors; and the last two address the consistency of the evidence (criterion 6) and its biologic plausibility (criterion 7).
`
`
`
`
`
`
`
`
`
`
`
`
The Guidelines

1. Is the Magnitude of the Difference Clinically Important?
`
Given the extent of biologic variability, it would be surprising not to find interactions between treatment effects and various other factors. Differences in the effect of treatment are likely to be associated with differences in patient characteristics, differences in the administration of the treatment (such as different surgeons or different drug doses), and differences in the primary end point. However, it is only when these differences or interactions are practically important (that is, when they are large enough that they would lead to different clinical decisions for different subgroups) that there is any point in considering them further.

As a rule, the larger the difference between the effect in a particular subgroup (or with a particular drug or dosage of drug) and the overall effect, the more plausible it is that the difference is real. At the same time, as the difference in effect size between the anomalous subgroup and the remainder of the patients becomes larger, the clinical importance of the difference increases.

Table 2. An Example of a Hypothesized Difference in Subgroup Response: Digoxin Is More Effective in Patients with More Severe Heart Failure

Criterion: Result
1. Magnitude of the difference: Clinically important differentiation between responders and nonresponders.
2. Statistical significance: Yes. P values were less than 0.01 in both studies.
3. A priori hypothesis: Yes. The hypothesis was suggested by results of one study and tested in a second study.
4. Small number of hypotheses: If viewed as severity of heart failure, yes. If viewed as components (for example, heart size, third heart sound, ejection fraction), no.
5. Within-study comparisons: Yes. In two crossover trials, comparisons were within studies.
6. Consistency across studies: Yes, in two studies tested. However, it was not tested in other trials, and this is necessary for confirmation.
7. Indirect evidence: Yes. Biologically plausible that clinically important response is restricted to those with more severe heart failure.

1 January 1992 - Annals of Internal Medicine - Volume 116 - Number 1
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
Unfortunately, if the results of subgroup analysis are only reported for the subgroups within which sizable treatment differences are found, the estimates of the magnitude of the interaction will be biased because only the extreme estimates are reported (23). This is analogous to regression to the mean (the tendency for extreme findings, such as unusually high blood pressure values, to revert toward less extreme values on repeated examination) (24). Moreover, when the overall treatment effect is modest, there is a good chance of finding a "qualitative" interaction even when only two subgroups are examined (17).

When they report the results of subgroup analyses, authors should make clear to readers how many comparisons were made and how it was decided which ones to report. Given current publication practices, however, were the reader simply to conclude that a reported interaction is real just because it is large, he or she would be wrong more often than right. Thus, having determined that an interaction, if real, is large enough to be important, it is essential to consider other criteria.
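The bias that arises when only the most striking subgroup is reported can be illustrated with a small simulation. The sketch below is not taken from the cited work: it assumes one true treatment effect shared by every subgroup, normally distributed estimation error, and hypothetical values for the effect size, standard error, and number of subgroups. It records the largest subgroup estimate from each simulated trial and averages those values.

```python
import random
import statistics


def mean_largest_subgroup_estimate(true_effect=0.2, standard_error=0.15,
                                   n_subgroups=10, n_trials=2000, seed=0):
    """Average of the largest subgroup estimate across simulated trials.

    All parameter values are hypothetical. Every subgroup shares the same
    true effect, so any spread among subgroup estimates is sampling error.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    largest = []
    for _ in range(n_trials):
        estimates = [rng.gauss(true_effect, standard_error)
                     for _ in range(n_subgroups)]
        largest.append(max(estimates))  # the "reported" subgroup
    return statistics.mean(largest)


if __name__ == "__main__":
    print("true effect in every subgroup: 0.2")
    print("average reported (largest) estimate:",
          round(mean_largest_subgroup_estimate(), 3))
```

Under these assumptions the average reported estimate exceeds the true effect even though no interaction exists, which is the pattern that regression to the mean would predict.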
`
`
`
`
`
2. Was the Difference Statistically Significant?
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
Any large data set has, embedded within it, a certain number of apparent, but in fact spurious, interactions. Statistical tests of significance can be used to assess the likelihood that a given interaction might have arisen due to chance alone. For example, Yusuf and colleagues (25), in an overview of randomized trials of beta blocker treatment for myocardial infarction, compared agents with and without intrinsic sympathomimetic activity (ISA) and found that the agents without ISA seemed to produce a larger effect than the ones with it. This difference was significant at the 0.01 level, indicating that it was unlikely to have occurred due to chance alone. Yet, two subsequent trials, one of an agent with ISA and one of an agent without ISA, showed the opposite result and, when added to the overview, eliminated the statistical significance of the interaction (26). There are several possible explanations for this, including chance. In other words, although events that occur one out of a hundred times might be considered rare, they do occur. Of course, the lower a P value is, the less likely it is that an observed interaction can be explained by chance alone.
`
Conversely, just as it is possible to observe spurious interactions, chance is likely to lead to some studies (among a large group) in which even a real interaction is not apparent. This is particularly true if the studies are small and the clinical end points of interest are infrequent. In this case, the power to detect an interaction would be low. Because subgroup analyses always include fewer patients than does the overall analysis, they carry a greater risk for making a type II error: falsely concluding that there is no difference.
`
`
`
`
`
Statistical techniques for conducting subgroup analysis include the Breslow-Day technique and regression approaches (27). With the Breslow-Day technique and similar approaches (28), it is possible to use a test for homogeneity to estimate the probability that an observed interaction might have arisen due to chance alone. More commonly, authors simply conduct a number of comparisons for different subgroups and apply chi-square tests or t-tests without formally testing for interactions.
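The difference between testing each subgroup separately and formally testing for an interaction can be sketched with a simple calculation. The example below is an illustration only and is not the Breslow-Day technique: it assumes two subgroups, uses a normal-approximation z-test on the difference between two log odds ratios, and uses hypothetical event counts. The function names are introduced here for illustration.

```python
import math


def log_odds_ratio(events_treated, n_treated, events_control, n_control):
    """Log odds ratio and its standard error from a 2x2 table.

    Assumes every cell of the table is non-zero.
    """
    a, b = events_treated, n_treated - events_treated
    c, d = events_control, n_control - events_control
    log_or = math.log((a * d) / (b * c))
    standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, standard_error


def two_sided_p(z):
    """Two-sided P value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def interaction_test(subgroup_1, subgroup_2):
    """z statistic and P value for the difference between two log odds ratios."""
    log_or_1, se_1 = log_odds_ratio(*subgroup_1)
    log_or_2, se_2 = log_odds_ratio(*subgroup_2)
    z = (log_or_1 - log_or_2) / math.sqrt(se_1 ** 2 + se_2 ** 2)
    return z, two_sided_p(z)


if __name__ == "__main__":
    # Hypothetical counts: (events treated, n treated, events control, n control)
    subgroup_a = (15, 200, 30, 200)
    subgroup_b = (25, 200, 30, 200)
    for name, table in (("A", subgroup_a), ("B", subgroup_b)):
        log_or, se = log_odds_ratio(*table)
        print("subgroup", name, "P =", round(two_sided_p(log_or / se), 3))
    z, p = interaction_test(subgroup_a, subgroup_b)
    print("interaction P =", round(p, 3))
```

With these hypothetical counts the treatment effect reaches conventional significance in one subgroup and not in the other, yet the test of the difference between the two subgroups does not. Reading the two separate tests as evidence of an interaction would therefore overstate the case.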
`
`
`
`
`
`
`
`
This practice, together with only reporting subgroups within which sizable treatment differences are found, can lead to an overestimate of the significance as well as the size of the difference. One way of adjusting for this bias is to use Bayes or empiric Bayes methods, which shrink the extreme estimates toward the overall estimate of treatment effect (23, 29, 30). Both a point estimate of the magnitude of the difference and a confidence interval can be obtained using these approaches.

Regression models, such as logistic regression (28), can also be used for analysis of interactions if the interactions are modeled by product terms. This approach allows for testing the significance of an interaction while controlling for other factors. If there are many subgroup factors, however, the number of product terms necessary for an adequate modeling of the interactions may be greater than the number of observations; an analysis of the interactions is then impossible. An additional problem with this approach is deciding which of many possible interaction terms to enter into the model as well as the potential for bias in their selection.

Methods for selecting factors to include have been proposed (31) in addition to other approaches to subgroup analysis (15, 18, 23, 27). Although it is not important for clinical readers to understand the details of these approaches, it is important to understand the concepts of statistical significance and power in subgroup analysis. Statistical analysis is a useful tool for assessing whether an observed interaction might have been due to chance alone, but it is not a substitute for clinical judgment.
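The idea of shrinking extreme subgroup estimates toward the overall estimate, mentioned above in connection with Bayes and empiric Bayes methods, can be sketched as follows. This is one simple variant chosen for illustration and is not necessarily the method used in the cited references: it assumes a normal model with known within-subgroup variances, a precision-weighted overall estimate, and a method-of-moments estimate of the between-subgroup variance. The function name and input values are hypothetical.

```python
def shrink_subgroup_estimates(estimates, variances):
    """Shrink subgroup effect estimates toward the overall estimate.

    estimates: subgroup effect estimates (for example, log odds ratios)
    variances: their sampling variances, assumed known and positive
    Returns (overall estimate, between-subgroup variance, shrunken estimates).
    """
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    overall = sum(w * y for w, y in zip(weights, estimates)) / total_weight

    # Method-of-moments estimate of the true between-subgroup variance,
    # truncated at zero when the observed spread is no more than chance predicts.
    q = sum(w * (y - overall) ** 2 for w, y in zip(weights, estimates))
    scale = total_weight - sum(w * w for w in weights) / total_weight
    between_variance = max(0.0, (q - (len(estimates) - 1)) / scale)

    # Each estimate keeps only the fraction of its deviation that the
    # between-subgroup variance can account for.
    shrunken = [
        overall + (between_variance / (between_variance + v)) * (y - overall)
        for y, v in zip(estimates, variances)
    ]
    return overall, between_variance, shrunken


if __name__ == "__main__":
    # Hypothetical subgroup log odds ratios and sampling variances
    raw = [-0.8, -0.2, -0.3, 0.1]
    var = [0.10, 0.05, 0.05, 0.10]
    overall, tau2, shrunk = shrink_subgroup_estimates(raw, var)
    print("overall:", round(overall, 3))
    print("shrunken:", [round(s, 3) for s in shrunk])
```

Imprecise subgroup estimates are pulled furthest toward the overall estimate. When the spread among subgroups is no larger than sampling error would produce, every subgroup estimate collapses to the overall estimate.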
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
3. Did the Hypothesis Precede Rather than Follow the Analysis?
`
`
`
`
`
`
`
`
Surveying patterns of data that suggest possible interactions may, in fact, prompt the analysis that "confirms" the existence of a possible interaction. As a result, the credibility of any apparent interaction that arises out of post-hoc exploration of a data set is questionable.
`
`
`
`
`
`
`
`
`
`
An example of this was the apparent finding that aspirin had a beneficial effect in preventing stroke in men with cerebrovascular disease but not in women (32). This interaction, which was "discovered" in the first large trial of aspirin in patients with transient ischemic attacks, was subsequently found, in other studies and in a meta-analysis summarizing these studies (33), to be spurious. This finding, like the streptokinase example, is an example of a "false negative" subgroup analysis. In this instance, many physicians withheld aspirin for women with cerebrovascular disease for a considerable period.
`
`
`
`
`
`
`
`
`
Whether a hypothesis preceded analysis of a data set is not necessarily a black or white issue. At one extreme, unexpected results might be clearly responsible for generating a new hypothesis. At the other extreme, a subgroup analysis might be clearly planned for in a study protocol to test a hypothesis suggested by previous research. Between these two extremes lies a range of possibilities, and the extent to which a hypothesis arose before, during, or after the data were collected and analyzed is frequently not clear. For example, if data monitoring detects a seeming interaction in a long-term study, it may be possible to state the hypothesis and then test it in future analyses (34). This technique may be most appropriate if additional study patients are still to be accrued.
`
`
`
`
`
`
`
`
`
`
Although post-hoc analyses will sometimes yield plausible results, they should generally be viewed as hypothesis-generating exercises rather than as hypothesis testing. Decisions about which analyses to do and which ones to report are much more likely to be data driven with post-hoc analyses and thereby more likely to be spurious. On the other hand, when a hypothesis has been clearly and unequivocally suggested by a different data set, it moves from a hypothesis-generating toward a hypothesis-testing framework. In Bayesian terms, the higher prior probability increases the posterior probability (after the subgroup analysis) of an interaction being real (29, 30).
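The Bayesian point above can be made concrete with the odds form of Bayes' theorem. The sketch below is an illustration under stated assumptions: the strength of the subgroup finding is summarized by a single likelihood ratio, and the prior probabilities and likelihood ratio shown are hypothetical numbers rather than values from any study.

```python
def posterior_probability(prior_probability, likelihood_ratio):
    """Posterior probability that an interaction is real.

    prior_probability: belief before the subgroup analysis (between 0 and 1)
    likelihood_ratio: how much more likely the observed result is if the
        interaction is real than if it is not (a hypothetical summary)
    """
    prior_odds = prior_probability / (1.0 - prior_probability)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    # Same evidence, two hypothetical priors: a post-hoc finding versus a
    # hypothesis already suggested by an independent data set.
    for label, prior in (("post-hoc", 0.05), ("a priori", 0.50)):
        print(label, round(posterior_probability(prior, 10.0), 2))
```

The same subgroup result leaves a post-hoc hypothesis much less credible than one that was plausible beforehand, which is why the timing of the hypothesis matters.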
`
`
`
`
`
`
`
`
`
`
`
`
`
If a hypothesis about an interaction has arisen from exploration of a data set from a study, then an argument can be made for excluding that study from a meta-analysis in which the hypothesis is tested. Certainly, if the hypothesis is confirmed in a meta-analysis that excludes data from the study that originally suggested the interaction, the inference rests on stronger ground. If the statistical significance of the interaction disappears or is substantially weakened when data from the original study are excluded, the strength of inference is reduced.
`
`
`
When considering post-hoc analyses, it should be kept in mind that they are more susceptible to bias as well as to spurious results. The reader should be particularly cautious about analysis of subgroups of patients that are delineated by variables measured after baseline, even if the hypothesis preceded the analysis. If the treatment can influence whether a participant becomes a member of a particular subgroup, the conclusions of the analysis are open to bias. For instance, one might hypothesize that compliers will do better if they are in the treatment group than in the control group but that noncompliers will do equally well in both groups. The reasons for compliance and noncompliance, however, probably differ in the treatment and control groups. As a result, in this comparison, the advantages of randomization (and with it, the validity of the analysis) are lost.
`
`
`
`
`
`
`
`
`
`
`
An example of the evolution of a hypothesis concerning responsive subgroups comes from the investigation of the efficacy of digoxin in preventing clinically important exacerbations of heart failure in heart-failure patients in sinus rhythm (see Table 2). Lee and colleagues (35) conducted a crossover study in which they found the drug to be effective. They did a regression analysis that suggested that only one factor, the presence of a third heart sound, predicted who would benefit from the drug. Only patients with a third heart sound were better off while taking digoxin. The hypothesis that this might be one of the predictors appears to have preceded the study. Nevertheless, on the basis of the foregoing discussion, the investigators were perhaps too ready to conclude that digoxin use in heart-failure patients in sinus rhythm should be restricted to those with a third heart sound.
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
4. Was the Subgroup Analysis One of a Small Number of Hypotheses Tested?
`
`
`
`
`
`
`
`
`
Post-hoc hypotheses based on subgroup analysis often arise from exploration of a data set in which many such hypotheses are considered. The greater the number of hypotheses tested, the greater the number of interactions that will be discovered by chance. Even if investigators have clearly specified their hypotheses in advance, the strength of inference associated with the apparent confirmation of any single hypothesis will decrease if it is one of a large number that have been tested. In their regression analysis, Lee and colleagues (35) included 16 variables. This relatively large number increases the level of skepticism with which the presence of a third heart sound as an important predictor of response to digoxin should be viewed.
`
`
`
`
`
`
`
Unfortunately, as noted above, the reader may not always be sure about the number of possible interactions that were tested. If the investigators chose to withhold this information, despite admonitions not to do so, and reported only those that were "significant," the reader is likely to be misled.
`
`
`
`
`
`
The Beta-Blocker Heart Attack Trial (BHAT) randomized approximately 4000 patients to propranolol or placebo after a myocardial infarction (36). Subsequently, 146 subgroup comparisons were done (37). Although the estimated effects of the treatment clustered around the overall effect, the treatment in some small subgroups appeared to be either much more effective or ineffective. The overall pattern, which approximated a "normal" distribution, would suggest that most of the observed difference in effect among the various subgroups was due to sampling error rather than to true interactions.
`
`
`
`
`
`
`
`
`
`
`
Another way to consider this is in terms of the effect of multiple comparisons on P values. The more hypotheses that are tested, the more likely one is to make a type I error, that is, to reject one of the null hypotheses even if all are actually true. Assuming that no true differences exist, if 100 different comparisons are made, five can be expected to yield a P value of 0.05 or less by chance alone. In this situation, a more appropriate analysis would account for the number of subgroups, their relation to other subgroups, and the size of the effect within subgroups and overall (23).
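The arithmetic behind the multiple-comparisons problem can be checked directly. The sketch below reproduces the expected count of five chance findings among 100 comparisons and adds two quantities that are not in the text: the probability of at least one chance finding, which assumes the comparisons are independent, and a Bonferroni-adjusted threshold, a common simple correction that is not the adjustment described in reference 23. The function names are introduced here for illustration.

```python
def expected_chance_findings(n_comparisons, alpha=0.05):
    """Expected number of comparisons with P <= alpha when no true differences exist."""
    return n_comparisons * alpha


def probability_of_any_chance_finding(n_comparisons, alpha=0.05):
    """Probability of at least one P <= alpha, assuming independent comparisons."""
    return 1.0 - (1.0 - alpha) ** n_comparisons


def bonferroni_threshold(n_comparisons, alpha=0.05):
    """Per-comparison threshold that holds the overall type I error near alpha."""
    return alpha / n_comparisons


if __name__ == "__main__":
    for n in (1, 10, 100, 146):  # 146 is the number of BHAT subgroup comparisons
        print(n,
              round(expected_chance_findings(n), 2),
              round(probability_of_any_chance_finding(n), 3),
              bonferroni_threshold(n))
```

Subgroup comparisons within a single trial are rarely independent, so the second quantity is best read as a rough guide rather than an exact figure.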
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
5. Was the Difference Suggested by Comparisons within Rather than between Studies?
`
`
`
`
`
Making inferences about different effect sizes in different groups on the basis of between-study differences entails a high risk compared with inferences made on the basis of within-study differences. For instance, one would be reluctant to conclude that propranolol results in a different magnitude of risk reduction for death after myocardial infarction than does metoprolol on the basis
`
`
`
`
`
`
`
`
`
`
`
`
`
`