`
A Consumer's Guide to Subgroup Analyses
Andrew D. Oxman, MD, and Gordon H. Guyatt, MD
`
• The extent to which a clinician should believe and act on the results of subgroup analyses of data from randomized trials or meta-analyses is controversial. Guidelines are provided in this paper for making these decisions. The strength of inference regarding a proposed difference in treatment effect among subgroups depends on the magnitude of the difference, the statistical significance of the difference, whether the hypothesis preceded or followed the analysis, whether the subgroup analysis was one of a small number of hypotheses tested, whether the difference was suggested by comparisons within or between studies, the consistency of the difference, and the existence of indirect evidence that supports the difference. Application of these guidelines will assist clinicians in deciding whether to base a treatment decision on the overall results or on the results of a subgroup analysis.
`
Annals of Internal Medicine. 1992;116:78-84.
`
From McMaster University Health Sciences Centre, Hamilton, Ontario. For current author addresses, see end of text.
`
Clinicians faced with a treatment decision about a particular patient are interested in the evidence that pertains most directly to that individual. Thus, it is frequently of interest to examine a particular category of participants in a clinical trial: for example, the women, those in a certain age group, or those with a specific pattern of disease. In observational studies, these examinations, or subgroup analyses, are routine. They are also frequently encountered in reports of clinical trials. In a survey of 45 clinical trials reported in three leading medical journals, Pocock and colleagues (1) found at least one subgroup analysis that compared the response to treatment in different categories of patients in 51% of the reports.
The results of subgroup analyses have had major effects, sometimes harmful, on treatment recommendations. For example, many patients with suspected myocardial infarction who could have benefited from thrombolytic therapy may not have received this treatment as a result of subgroup analyses based on the duration of symptoms before treatment (2) and the conclusion that streptokinase was only effective in patients treated within 6 hours after the onset of pain (3, 4). A later, larger trial showed that streptokinase was effective up to 24 hours after the onset of symptoms (5).
© 1992 American College of Physicians

Conclusions based on subgroup analyses can have adverse consequences both when a particular category of patients is denied effective treatment (a "false-negative" conclusion), as in the above example, and when ineffective or even harmful treatment is given to a subgroup of patients (a "false-positive" conclusion). Because of these risks and their frequency, the appropriateness of drawing conclusions from subgroup analyses has been challenged (6, 7), and it has been argued that treatment recommendations based on subgroup analyses may do more harm than good. This hypothesis is currently being tested empirically by comparing treatment recommendations generated from early trials of new treatments based on subgroup analyses with treatment recommendations that would have been made had subgroup analyses been ignored, assessing "whether they lead to more patients receiving treatments that are worthwhile and fewer patients receiving treatments that are not." (Sackett DL. Personal communication.)
Although we agree that subgroup analyses are potentially misleading and that there is a tendency to overemphasize the results of subgroup analyses, in this paper we will present an alternative point of view. The essence of our argument is that subgroup analysis is both informative and potentially misleading. Rather than arguing for or against the merits of subgroup analysis, we will present guidelines in this article for deciding how believable the results of subgroup analyses are and, consequently, when to act on recommendations based on subgroup analyses and when to ignore them.

Our discussion will focus on randomized trials and meta-analyses of randomized trials (systematic overviews), although the same principles apply to any other research design. The assumption from which we start in this discussion is that the underlying design of the studies being examined is sound. For treatment trials, sound design involves elements of randomization, masking, completeness of follow-up, and other strategies for minimizing both random error and bias (8, 9). If the study is not sound, the overall conclusion is suspect, let alone conclusions based on subgroup analyses.
Even given a rigorous study design, the extent to which subgroup analyses should be done, or believed, is highly controversial. Although there are those who ignore scientific principles in the subgroup analyses they undertake and report, go on fishing expeditions, and indulge in data-dredging exercises (10, 11), there are also those who mix apples and oranges, drown in the data they pool (12), reach meaningless conclusions about "average" effects (13), and fail to detect clinically important effects because of the heterogeneity of their study groups (14). Although the debate between these two camps is entertaining and can lead to some useful insights, practical advice for assessing the strength of inferences based on subgroup analyses is also important. In providing such advice, we will build on criteria that have been suggested by other authors (15-18).
`
`
`
`
Table 1. Guidelines for Deciding whether Apparent Differences in Subgroup Response Are Real

1. Is the magnitude of the difference clinically important?
2. Was the difference statistically significant?
3. Did the hypothesis precede rather than follow the analysis?
4. Was the subgroup analysis one of a small number of hypotheses tested?
5. Was the difference suggested by comparisons within rather than between studies?
6. Was the difference consistent across studies?
7. Is there indirect evidence that supports the hypothesized difference?
`
Our criteria are summarized in Table 1 and are described in detail below. An example of a hypothesized difference in subgroup response and the extent to which it meets our proposed criteria is given in Table 2. We will use this example in the text to highlight some of the relevant issues. It should be noted from the outset that our criteria, like any guidelines for making an inference, do not provide hard and fast rules; they simply represent an organized approach to making reasonable judgments.
`
`Guidelines for Deciding whether Apparent Differences in
`Subgroup Response Are Real
`
`Conceptual Approach Underlying the Guidelines
`
Subgroup analyses of data from randomized trials or meta-analyses are undertaken to identify "effect modifiers," characteristics of the patients or treatment that modify the effect of the intervention under study. Statistical "interactions" in a set of data are measured to estimate effect modification (an epidemiologic concept) in the population represented by the study sample (19). The term interaction is sometimes (but not in this paper) also used to refer to the concept of synergism or antagonism, a biologic mechanism of action in which the combined effect of two or more factors differs from the sum of their solitary effects (20). In the following discussion, we use the term "interaction" to refer to situations in which the observed effectiveness of an intervention differs across subgroups.
The premise underlying the hypothesis that subgroup analyses do more harm than good is that "unanticipated qualitative interactions" are unusual and, when apparent unanticipated interactions are discovered, they are usually artifacts due to chance. The same position can be taken with respect to apparent differences between treatment effects in drugs of a single class; this would suggest that the best estimate of the effect of any one drug is the overall effect of the group of drugs across all methodologically adequate, randomized, controlled trials (21). There is confusion, however, over the fundamental distinction between a "qualitative interaction" and a "quantitative interaction" (22). Although a strict definition of a qualitative interaction would mean that there is a sign reversal (22) (meaning that the treatment is beneficial in one group and harmful in another), the term is also used to refer to a substantial quantitative interaction (that is, a difference in the magnitude of effect that is clinically important). From a clinical point of view, it is important to recognize that a substantial quantitative interaction can be as important as a qualitative interaction. For instance, the side effects of a treatment may be such that it is worth administering to patients in whom the magnitude of the treatment effect is large, but not to patients in whom the treatment effect is small or moderate.
Having said this, it is still reasonable to distinguish between interactions that are clinically trivial and those that are clinically important. The former can be ignored, and that is the point at which our guidelines begin. Once the clinician has decided that an interaction, if real, would be important, the subsequent six criteria can be used to help decide on the credibility of the proposed subgroup difference. Three of the criteria (2 to 4) are markers of the potential for random error (that is, mistakes due to chance); one (criterion 5) is a marker of the potential for systematic error; and the last two address the consistency of the evidence (criterion 6) and its biologic plausibility (criterion 7).
`
The Guidelines

1. Is the Magnitude of the Difference Clinically Important?
Given the extent of biologic variability, it would be surprising not to find interactions between treatment effects and various other factors. Differences in the effect of treatment are likely to be associated with differences in patient characteristics, differences in the administration of the treatment (such as different surgeons or different drug doses), and differences in the primary end point. However, it is only when these
`
Table 2. An Example of a Hypothesized Difference in Subgroup Response: Digoxin Is More Effective in Patients with More Severe Heart Failure

1. Magnitude of the difference: Clinically important differentiation between responders and nonresponders.
2. Statistical significance: Yes, P values were less than 0.01 in both studies.
3. A priori hypothesis: Yes, the hypothesis was suggested by results of one study and tested in a second study.
4. Small number of hypotheses: If viewed as severity of heart failure, yes. If viewed as components (for example, heart size, third heart sound, ejection fraction), no.
5. Within-study comparisons: Yes, in two crossover trials, comparisons were within studies.
6. Consistency across studies: Yes, in the two studies tested. However, it was not tested in other trials, and this is necessary for confirmation.
7. Indirect evidence: Yes, biologically plausible that clinically important response is restricted to those with more severe heart failure.
`
1 January 1992 • Annals of Internal Medicine • Volume 116 • Number 1
`
`
`
differences or interactions are practically important, that is, when they are large enough that they would lead to different clinical decisions for different subgroups, that there is any point in considering them further.

As a rule, the larger the difference between the effect in a particular subgroup (or with a particular drug or dosage of drug) and the overall effect, the more plausible it is that the difference is real. At the same time, as the difference in effect size between the anomalous subgroup and the remainder of the patients becomes larger, the clinical importance of the difference increases.

Unfortunately, if the results of subgroup analysis are reported only for the subgroups within which sizable treatment differences are found, the estimates of the magnitude of the interaction will be biased because only the extreme estimates are reported (23). This is analogous to regression to the mean (the tendency for extreme findings, such as unusually high blood pressure values, to revert toward less extreme values on repeated examination) (24). Moreover, when the overall treatment effect is modest, there is a good chance of finding a "qualitative" interaction even when only two subgroups are examined (17).
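The selective-reporting bias described above can be illustrated with a small simulation (our hypothetical sketch, not data from any trial): even when every subgroup shares the same true treatment effect, reporting only the most extreme subgroup estimate systematically overstates the apparent interaction.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical setup: 20 subgroups, all with the SAME true treatment
# effect (0.3); each subgroup estimate carries noise (SE = 0.2).
true_effect = 0.3
n_trials, n_subgroups = 5000, 20
estimates = true_effect + 0.2 * rng.standard_normal((n_trials, n_subgroups))

# Suppose each trial reports only its most extreme subgroup estimate.
reported = estimates.max(axis=1)

print(f"true effect: {true_effect}")
print(f"mean reported 'anomalous subgroup' effect: {reported.mean():.2f}")
# Selecting on extremity inflates the estimate; on re-examination the
# extreme subgroup would revert toward the overall effect, the same
# phenomenon as regression to the mean.
```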
When they report the results of subgroup analyses, authors should make clear to readers how many comparisons were made and how it was decided which ones to report. Given current publication practices, however, were the reader simply to conclude that a reported interaction is real just because it is large, he or she would be wrong more often than right. Thus, having determined that an interaction, if real, is large enough to be important, it is essential to consider other criteria.
`
`2. Was the Difference Statistically Significant?
Any large data set has, embedded within it, a certain number of apparent, but in fact spurious, interactions. Statistical tests of significance can be used to assess the likelihood that a given interaction might have arisen due to chance alone. For example, Yusuf and colleagues (25), in an overview of randomized trials of beta-blocker treatment for myocardial infarction, compared agents with and without intrinsic sympathomimetic activity (ISA) and found that the agents without ISA seemed to produce a larger effect than the ones with it. This difference was significant at the 0.01 level, indicating that it was unlikely to have occurred due to chance alone. Yet, two subsequent trials, one of an agent with ISA and one of an agent without ISA, showed the opposite result and, when added to the overview, eliminated the statistical significance of the interaction (26). There are several possible explanations for this, including chance. In other words, although events that occur one out of a hundred times might be considered rare, they do occur. Of course, the lower a P value is, the less likely it is that an observed interaction can be explained by chance alone.
Conversely, just as it is possible to observe spurious interactions, chance is likely to lead to some studies (among a large group) in which even a real interaction is not apparent. This is particularly true if the studies are small and the clinical end points of interest are infrequent. In this case, the power to detect an interaction would be low. Because subgroup analyses always include fewer patients than does the overall analysis, they carry a greater risk for making a type II error, falsely concluding that there is no difference.
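The power problem can be made concrete with a rough calculation (our illustration, not taken from the paper): to detect a difference between two subgroup effects, the difference itself must be estimated, and its standard error exceeds that of either subgroup estimate, so power falls well below that of the overall analysis.

```python
from scipy.stats import norm

def interaction_power(delta, se1, se2, alpha=0.05):
    """Approximate power of a two-sided z test to detect a true
    difference `delta` between two subgroup effect estimates whose
    standard errors are se1 and se2 (a rough, illustrative formula)."""
    se_diff = (se1**2 + se2**2) ** 0.5      # standard errors add in quadrature
    z_crit = norm.ppf(1 - alpha / 2)
    z = abs(delta) / se_diff
    return norm.sf(z_crit - z) + norm.cdf(-z_crit - z)

# A subgroup difference of 0.5, with each subgroup effect estimated
# with a standard error of 0.25, is detected less than a third of the
# time at the conventional 0.05 level:
print(f"power = {interaction_power(0.5, 0.25, 0.25):.2f}")
```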
Statistical techniques for conducting subgroup analysis include the Breslow-Day technique and regression approaches (27). With the Breslow-Day technique and similar approaches (28), it is possible to use a test for homogeneity to estimate the probability that an observed interaction might have arisen due to chance alone. More commonly, authors simply conduct a number of comparisons for different subgroups and apply chi-square tests or t-tests without formally testing for interactions.
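As a concrete illustration of a formal homogeneity test, here is a sketch using Woolf's classical chi-square test, a simpler relative of the Breslow-Day approach; the subgroup tables are hypothetical, not from any study cited here.

```python
import numpy as np
from scipy.stats import chi2

def log_or_and_var(a, b, c, d):
    """Log odds ratio and its variance for one 2x2 table
    (a, b = treated events/non-events; c, d = control)."""
    lor = np.log((a * d) / (b * c))
    var = 1/a + 1/b + 1/c + 1/d
    return lor, var

def woolf_homogeneity(tables):
    """Woolf's chi-square test that the odds ratio is the same in
    every stratum (subgroup). Returns (chi-square statistic, P value)."""
    lors, ws = [], []
    for t in tables:
        lor, var = log_or_and_var(*t)
        lors.append(lor)
        ws.append(1.0 / var)
    lors, ws = np.array(lors), np.array(ws)
    pooled = np.sum(ws * lors) / np.sum(ws)    # inverse-variance pooled log OR
    stat = np.sum(ws * (lors - pooled) ** 2)   # heterogeneity chi-square
    p = chi2.sf(stat, df=len(tables) - 1)
    return stat, p

# Hypothetical subgroup tables: (events_tx, no_events_tx, events_ctl, no_events_ctl)
mild   = (10, 90, 12, 88)   # little apparent treatment effect in mild disease
severe = (10, 90, 30, 70)   # large apparent effect in severe disease
stat, p = woolf_homogeneity([mild, severe])
print(f"chi-square = {stat:.2f}, P = {p:.3f}")
```

A small P value here would suggest that the odds ratios genuinely differ across subgroups; a large one is consistent with a common effect plus sampling variation.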
This practice, together with only reporting subgroups within which sizable treatment differences are found, can lead to an overestimate of the significance as well as the size of the difference. One way of adjusting for this bias is to use Bayes or empiric Bayes methods, which shrink the extreme estimates toward the overall estimate of treatment effect (23, 29, 30). Both a point estimate of the magnitude of the difference and a confidence interval can be obtained using these approaches.
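The shrinkage idea can be sketched as follows; this is a minimal empirical Bayes illustration with made-up numbers, not the specific methods of references 23, 29, and 30.

```python
import numpy as np

def eb_shrink(estimates, variances):
    """Empirical Bayes shrinkage of subgroup treatment effects toward
    the overall (precision-weighted) mean. Uses a method-of-moments
    estimate of the between-subgroup variance tau^2."""
    est = np.asarray(estimates, float)
    var = np.asarray(variances, float)
    w = 1.0 / var
    overall = np.sum(w * est) / np.sum(w)
    q = np.sum(w * (est - overall) ** 2)        # heterogeneity statistic
    k = len(est)
    tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    b = tau2 / (tau2 + var)    # in [0, 1]: how much of each estimate is "signal"
    return overall + b * (est - overall)

# Hypothetical subgroup log odds ratios and their variances:
raw = [0.10, -0.40, 0.05, 0.60]
se2 = [0.04, 0.09, 0.04, 0.16]
shrunk = eb_shrink(raw, se2)
print(np.round(shrunk, 3))
# Each shrunk estimate lies between its raw value and the overall mean;
# the noisiest, most extreme subgroups are pulled in the furthest.
```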
Regression models, such as logistic regression (28), can also be used for analysis of interactions if the interactions are modeled by product terms. This approach allows for testing the significance of an interaction while controlling for other factors. If there are many subgroup factors, however, the number of product terms necessary for an adequate modeling of the interactions may be greater than the number of observations; an analysis of the interactions is then impossible. An additional problem with this approach is deciding which of many possible interaction terms to enter into the model, as well as the potential for bias in their selection.
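To make the product-term idea concrete, here is a minimal sketch with simulated data and a hand-rolled Newton-Raphson fit (purely illustrative; in practice one would use a statistical package): the coefficient on the treatment-by-subgroup product estimates the interaction on the log-odds scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: outcome depends on treatment, subgroup membership
# (e.g. severe disease), and their product (true interaction = 1.0).
n = 2000
treat = rng.integers(0, 2, n)
sub = rng.integers(0, 2, n)
logit_true = -1.0 + 0.2*treat + 0.3*sub + 1.0*treat*sub
y = (rng.random(n) < 1/(1 + np.exp(-logit_true))).astype(float)

# Design matrix: intercept, main effects, and the product (interaction) term.
X = np.column_stack([np.ones(n), treat, sub, treat*sub])

def fit_logistic(X, y, iters=25):
    """Logistic regression by Newton-Raphson (IRLS); returns
    coefficient estimates and their standard errors."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1/(1 + np.exp(-X @ beta))
        W = p * (1 - p)
        H = X.T @ (X * W[:, None])                    # observed information
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    se = np.sqrt(np.diag(np.linalg.inv(H)))
    return beta, se

beta, se = fit_logistic(X, y)
z_interaction = beta[3] / se[3]   # Wald z statistic for the product term
print(f"interaction estimate = {beta[3]:.2f}, z = {z_interaction:.1f}")
```

The Wald z for the product term tests the interaction directly, while the main-effect terms keep the other factors controlled for.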
Methods for selecting factors to include have been proposed (31) in addition to other approaches to subgroup analysis (15, 18, 23, 27). Although it is not important for clinical readers to understand the details of these approaches, it is important to understand the concepts of statistical significance and power in subgroup analysis. Statistical analysis is a useful tool for assessing whether an observed interaction might have been due to chance alone, but it is not a substitute for clinical judgment.
`
3. Did the Hypothesis Precede Rather than Follow the Analysis?

Surveying patterns of data that suggest possible interactions may, in fact, prompt the analysis that "confirms" the existence of a possible interaction. As a result, the credibility of any apparent interaction that arises out of post-hoc exploration of a data set is questionable.
An example of this was the apparent finding that aspirin had a beneficial effect in preventing stroke in men with cerebrovascular disease but not in women (32). This interaction, which was "discovered" in the first large trial of aspirin in patients with transient ischemic attacks, was subsequently found, in other studies and in a meta-analysis summarizing these studies (33), to be spurious. This finding, like the streptokinase example, is an example of a "false-negative" subgroup analysis. In this instance, many physicians withheld
`
`
`
`
aspirin for women with cerebrovascular disease for a considerable period.
Whether a hypothesis preceded analysis of a data set is not necessarily a black or white issue. At one extreme, unexpected results might be clearly responsible for generating a new hypothesis. At the other extreme, a subgroup analysis might be clearly planned for in a study protocol to test a hypothesis suggested by previous research. Between these two extremes lies a range of possibilities, and the extent to which a hypothesis arose before, during, or after the data were collected and analyzed is frequently not clear. For example, if data monitoring detects a seeming interaction in a long-term study, it may be possible to state the hypothesis and then test it in future analyses (34). This technique may be most appropriate if additional study patients are still to be accrued.
Although post-hoc analyses will sometimes yield plausible results, they should generally be viewed as hypothesis-generating exercises rather than as hypothesis testing. Decisions about which analyses to do and which ones to report are much more likely to be data driven with post-hoc analyses and thereby more likely to be spurious. On the other hand, when a hypothesis has been clearly and unequivocally suggested by a different data set, it moves from a hypothesis-generating toward a hypothesis-testing framework. In Bayesian terms, the higher prior probability increases the posterior probability (after the subgroup analysis) of an interaction being real (29, 30).
If a hypothesis about an interaction has arisen from exploration of a data set from a study, then an argument can be made for excluding that study from a meta-analysis in which the hypothesis is tested. Certainly, if the hypothesis is confirmed in a meta-analysis that excludes data from the study that originally suggested the interaction, the inference rests on stronger ground. If the statistical significance of the interaction disappears or is substantially weakened when data from the original study are excluded, the strength of inference is reduced.
When considering post-hoc analyses, it should be kept in mind that they are more susceptible to bias as well as to spurious results. The reader should be particularly cautious about analysis of subgroups of patients that are delineated by variables measured after baseline, even if the hypothesis preceded the analysis. If the treatment can influence whether a participant becomes a member of a particular subgroup, the conclusions of the analysis are open to bias. For instance, one might hypothesize that compliers will do better in the treatment group than in the control group but that noncompliers will do equally well in both groups. The reasons for compliance and noncompliance, however, probably differ in the treatment and control groups. As a result, in this comparison, the advantages of randomization (and with it, the validity of the analysis) are lost.
An example of the evolution of a hypothesis concerning responsive subgroups comes from the investigation of the efficacy of digoxin in preventing clinically important exacerbations of heart failure in heart-failure patients in sinus rhythm (see Table 2). Lee and colleagues (35) conducted a crossover study in which they found the drug to be effective. They did a regression analysis that suggested that only one factor, the presence of a third heart sound, predicted who would benefit from the drug. Only patients with a third heart sound were better off while taking digoxin. The hypothesis that this might be one of the predictors appears to have preceded the study. Nevertheless, on the basis of the foregoing discussion, the investigators were perhaps too ready to conclude that digoxin use in heart-failure patients in sinus rhythm should be restricted to those with a third heart sound.
`
4. Was the Subgroup Analysis One of a Small Number of Hypotheses Tested?

Post-hoc hypotheses based on subgroup analysis often arise from exploration of a data set in which many such hypotheses are considered. The greater the number of hypotheses tested, the greater the number of interactions that will be discovered by chance. Even if investigators have clearly specified their hypotheses in advance, the strength of inference associated with the apparent confirmation of any single hypothesis will decrease if it is one of a large number that have been tested. In their regression analysis, Lee and colleagues (35) included 16 variables. This relatively large number increases the level of skepticism with which the presence of a third heart sound as an important predictor of response to digoxin should be viewed.
Unfortunately, as noted above, the reader may not always be sure about the number of possible interactions that were tested. If the investigators chose to withhold this information, despite admonitions not to do so, and reported only those that were "significant," the reader is likely to be misled.
The Beta-Blocker Heart Attack Trial (BHAT) randomized approximately 4000 patients to propranolol or placebo after a myocardial infarction (36). Subsequently, 146 subgroup comparisons were done (37). Although the estimated effects of the treatment clustered around the overall effect, the treatment appeared to be either much more effective or ineffective in some small subgroups. The overall pattern, which approximated a "normal" distribution, would suggest that most of the observed difference in effect among the various subgroups was due to sampling error rather than to true interactions.
Another way to consider this is in terms of the effect of multiple comparisons on P values. The more hypotheses that are tested, the more likely it is to make a type I error, that is, to reject one of the null hypotheses even if all are actually true. Assuming that no true differences exist, if 100 different comparisons are made, five can be expected to yield a P value of 0.05 or less by chance alone. In this situation, a more appropriate analysis would account for the number of subgroups, their relation to other subgroups, and the size of the effect within subgroups and overall (23).
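The arithmetic of multiple comparisons is easy to reproduce with an illustrative simulation under the null hypothesis (our sketch, not data from any trial):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

# Simulate 10 000 subgroup comparisons with NO true difference anywhere:
# the test statistics are standard normal z values.
n_tests = 10_000
z = rng.standard_normal(n_tests)
p_values = 2 * norm.sf(np.abs(z))          # two-sided P values

false_positives = np.sum(p_values <= 0.05)
print(f"{false_positives} of {n_tests} null comparisons reached P <= 0.05")
# Roughly 5% of the comparisons are "significant", all of them spurious.

# The probability of at least one type I error grows quickly with the
# number of independent hypotheses tested: 1 - 0.95**k.
for k in (1, 10, 100):
    print(f"k = {k:3d}: P(at least one false positive) = {1 - 0.95**k:.3f}")
```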
`
5. Was the Difference Suggested by Comparisons within Rather than between Studies?
Making inferences about different effect sizes in different groups on the basis of between-study differences entails a high risk compared with inferences made on the basis of within-study differences. For instance, one would be reluctant to conclude that propranolol results in a different magnitude of risk reduction for death after myocardial infarction than does metoprolol on the basis of data from two studies, one that compared propranolol with placebo and another that compared metoprolol with placebo. This could be thought of as an indirect comparison. A direct comparison would involve, in a single study, patients being randomized to receive either placebo, propranolol, or metoprolol. If, in such a direct comparison, clinically important and statistically significant differences in magnitude of effect between the two active treatments were demonstrated, the inference would be quite strong.
An example that illustrates this point comes from an overview examining the effectiveness of prophylaxis for gastrointestinal bleeding in critically ill patients (38). Histamine2-receptor (H2) antagonists and antacids, when individually compared with placebo, had comparable effects in reducing overt bleeding (common odds ratios of 0.35 in both cases). In contrast, direct comparisons from studies in which patients were randomized to receive H2 antagonists or antacids have shown a statistically significantly greater reduction in bleeding with the latter (common odds ratio, 0.56).
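The arithmetic behind an indirect comparison can be sketched as follows (our illustration, using the odds ratios quoted above): dividing each treatment's odds ratio against placebo yields the head-to-head odds ratio the between-study data imply.

```python
def indirect_odds_ratio(or_a_vs_placebo, or_b_vs_placebo):
    """Odds ratio of treatment A versus treatment B implied by separate
    placebo-controlled trials of each (an adjusted indirect comparison)."""
    return or_a_vs_placebo / or_b_vs_placebo

# With identical common odds ratios versus placebo (0.35 for both
# antacids and H2 antagonists, as quoted in the text):
implied = indirect_odds_ratio(0.35, 0.35)
print(f"implied antacid vs H2-antagonist odds ratio: {implied}")
# The indirect comparison implies no difference between the treatments
# (odds ratio 1.0), whereas the direct, randomized comparison found
# 0.56: between-study comparisons can miss real differences.
```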
The reason that inference on the basis of between-study differences is so potentially misleading is that there may be a myriad of factors, aside from the most salient difference that is the basis of the inference being made, that could explain the interaction. For instance, aside from differences in the specific drugs used, different populations (varying in risk for adverse outcomes, for example), varying degrees of co-intervention, or varying criteria for gastrointestinal bleeding each could explain the results. These differences would not be plausible explanations if the inference were based on within-study differences in randomized trials in which the populations studied, control of co-intervention, and outcome criteria were all identical.

Stated simply, between-study inferences are based on comparisons between noncomparable groups: even when all of the individual studies were randomized, patients were not randomized to one study or another. Clinical decisions based on between-study comparisons should be made cautiously, if at all. As a rule, inferences based on between-study comparisons should be viewed as preliminary and as requiring confirmation from direct within-study comparisons. This is true whether the between-study comparison has to do with different groups or different interventions.
`
6. Was the Difference Consistent across Studies?

A hypothesis concerning differential response in a subgroup of patients may be generated by examination of data from a single study. The interaction becomes far more credible if it is also found in other studies. The extent to which a comprehensive scientific overview of the relevant literature finds an interaction to be consistently present is probably the best single index as to whether it should be believed.
In other words, the replication of an interaction in independent, unbiased studies provides strong support for its believability. On the other hand, there are two reasons to be cautious in applying this criterion. The first goes back to sample size. Because subgroup analyses often include small numbers of patients, the results tend to be imprecise, and the extent to which results from different studies are consistent can be uncertain. The second caution relates to making between-study comparisons. For the same reason that it is risky to base conclusions on between-study differences, it is only reasonable to expect variation in the results of trials of the same therapy, due to differences in the study populations, the interventions, the outcomes, and the study designs, as well as the play of chance. Thus, when assessing the consistency of results, it is important to consider both the power of the comparisons (or their statistical certainty) and other differences between studies that might influence the results.
The hypothesis concerning a third heart sound as a predictor of response to digoxin in heart-failure patients in sinus rhythm was tested in a second crossover, randomized trial (39). The presence of a third heart sound proved a weaker predictor than in the initial study, although its association with response to digoxin did reach conventional levels of statistical significance. However, a number of factors that, like a third heart sound, reflect greater severity of heart failure were associated with response to digoxin. Thus, support for a more general hypothesis, that response is related to the severity of heart failure, was provided by the second study.
Other studies have examined the efficacy of digoxin in heart-failure patients in sinus rhythm, and these have been summarized in a meta-analysis (40). Unfortunately, none of these studies has conducted subgroup analyses addressing the issue of differential response according to different severity of heart failure. Had these analyses been done in the other studies, the hypothesis would likely have been confirmed or refuted with substantially greater confidence. As it is, we would be inclined to view the conclusion as tentative; the strength of inference is only moderate.
`
7. Is There Indirect Evidence to Support the Hypothesized Difference?

We are generally more ready to believe a hypothesized interaction if indirect evidence (such as from animal studies or analogous situations in human biology) makes the interaction more plausible. That is, to the extent that a hypothesis is consistent with our current understanding of the biologic mechanisms of disease, we are more likely to believe it. Such understanding comes from three types of indirect evidence: from studies of different populations (including animal studies); from observations of interactions for similar interventions; and from results of studies of other, related outcomes (particularly intermediary outcomes).

The extent to which indirect evidence strengthens an inference about a hypothesized interaction varies substantially. In general, evidence from intermediary outcomes is the strongest type of indirect evidence. Evidence of differences in immune response, for example, can provide strong support for a conclusion that there is an imp