`
Secondary Endpoints Cannot Be Validly Analyzed if the Primary Endpoint Does Not Demonstrate Clear Statistical Significance

Robert T. O’Neill, PhD
Office of Epidemiology and Biostatistics, Center for Drug Evaluation and Research/FDA, Rockville, Maryland
`
ABSTRACT: There is a lack of consensus surrounding the interpretation of observed treatment effects for secondary clinical endpoints when the primary endpoint for which the clinical trial was initially designed does not meet the objective of a demonstrated effect. We provide some arguments to support caution in making inferences for secondary endpoints in this situation. We examine the definitions of primary and secondary endpoints within the context of a hypothesis-testing framework for multiple endpoints, and we address the relationship of the correlation structure of these endpoints and the statistical adjustments needed to preserve experiment-wise type I error for a valid inference. We also address the hypothesis-testing framework and the estimation framework for valid inference, focusing on the interpretation of p-values associated with differentially powered hypothesis tests for each endpoint to detect an important clinical effect. We point out the limitations on the strength of evidence (and quantification of uncertainty) for a secondary endpoint effect that can be derived from only one study and introduce the likelihood of replication of the finding in another study of identical size and design as a useful concept to guide this interpretation. Controlled Clin Trials 1997;18:550-556 © Elsevier Science Inc. 1997
`
KEY WORDS: Primary endpoints, secondary endpoints, statistical adjustments, valid inference, p-values, hypothesis tests

Address reprint requests to: Robert T. O’Neill, PhD, Office of Epidemiology and Biostatistics, Center for Drug Evaluation and Research/FDA, Room 158-45, HFD-700, 5600 Fishers Lane, Rockville, MD 20857.
Received February 17, 1997; revised April 7, 1997; accepted April 22, 1997.

Controlled Clinical Trials 18:550-556 (1997)
© Elsevier Science Inc. 1997
655 Avenue of the Americas, New York, NY 10010
0197-2456/97/$17.00
PII S0197-2456(97)00075-5

YEDA EXHIBIT NO. 2081
MYLAN PHARM. v YEDA
IPR2015-00644

This article argues in favor of the provocative premise that secondary endpoints cannot be validly analyzed if the primary endpoint does not demonstrate clear statistical significance. I ask the following questions: What are the definitions of, and what is the difference between, a primary and a secondary endpoint, and what are the other categories of endpoint definitions? What is the impact of the correlation structure among these endpoints? What is meant by a valid analysis within the context of a hypothesis-testing decision rule for evidence? What is meant by the concept of clear statistical significance? Also, I will discuss the relationship between the sample size of a clinical trial planned from the perspective of power against a particular alternative to test a hypothesis and the corresponding precision of the estimate of treatment effects derived from that sample size. This discussion allows us to contrast the differing concepts of clear statistical significance and precise estimates of treatment effects. Finally, as another comment on the notion of clear statistical significance and how much evidence we may derive from a single clinical trial, I will briefly discuss the concept of the chance of replication of statistically significant results (i.e., p-values less than a prespecified level, say 0.05) in a second clinical trial as a basis of confirmatory evidence of a potentially serendipitous secondary endpoint finding observed in a single initial study.

First, to clarify the distinction between a primary endpoint and a secondary endpoint, I define a primary endpoint as a clinical endpoint that provides evidence sufficient to fully characterize clinically the effect of a treatment in a manner that would support a regulatory claim for the treatment. Because evaluation of the impact of treatment on a primary endpoint is the major purpose of a clinical trial, the sample size of the trial is based upon the power of the trial to detect a specified clinical benefit on the primary endpoint. A secondary endpoint is a clinical endpoint that provides additional clinical characterization of treatment effect but that is not sufficient to characterize fully the benefit or to support a claim for a treatment effect. By definition, a secondary endpoint could not, by itself, be convincing of clinically significant treatment effects, even if it were observed to be statistically significant. Defined in this way, a secondary endpoint could not become a primary endpoint after the fact.

This distinction in definitions does not, however, illustrate why controversy exists concerning whether a statistically significant secondary endpoint should be considered valid. I believe the controversy arises when there is a multiplicity of endpoints whose collective use has not been considered in advance and when none of these endpoints may fully characterize a treatment effect. For example, the validity of a secondary endpoint analysis becomes difficult to interpret when composite endpoints, composed of both primary and secondary endpoints, are themselves considered as primary or secondary endpoints. Another example that is difficult to interpret occurs when an endpoint is categorized as secondary not because it would not characterize the clinical benefit of treatment, but because the planned size of the clinical trial gives a priori low statistical power to detect treatment-induced changes. If we permit a secondary endpoint to become a primary endpoint solely on the basis of its observed statistical significance, then it is very important to formulate, in advance, the statistical structure of the decision rule for judging clear statistical evidence.
The clinical trial literature supports the principle of parsimony in the choice and selection of clinical endpoints used to characterize and test the effects of treatment on disease [1]. Trialists recognize that the multiplicity of treatment endpoints within a confirmatory hypothesis-testing paradigm can impact interpretation of results. The quantification of statistical uncertainty of any conclusion of treatment benefit (e.g., a valid analysis) should weigh the possible scenarios for judging a result of a clinical trial as successful. Thus, clinical trial investigators have usually chosen a few primary endpoints on which to base the design of the trial and relegated to secondary status other endpoints that are clinically interesting, corroborative, or suggestive but not of convincing clinical importance. Mortality, when considered as a secondary endpoint, seems to be one of the few exceptions to this strategy because of the clinical impact of a statistically significant finding. One of the usual reasons for designating mortality as a secondary endpoint is that the trialist believes a priori that there is little chance a treatment effect will be observed, given the sample sizes and the power to detect a clinically important effect on mortality.

Further, when a clinical trial employs secondary endpoints that incorporate all-cause and cause-specific mortality, nonfatal events, composites of fatal and nonfatal events, and composites of highly correlated endpoints and competing multiple risk endpoints, the interpretation of the multiplicity of outcomes becomes complex. This is especially true if the decision rules and multiplicity adjustments to control false conclusions are not properly planned in advance as objectives of a clinical trial.

We assume, as most clinical trial protocols do, a hypothesis-testing formulation, and it is within this framework that statistical significance of treatment effects associated with the primary and secondary endpoints is the criterion used in judging the uncertainty of the result. We interpret a valid analysis to mean that the observed strength of evidence, as represented by a p-value and a confidence interval, is considered well within acceptable bounds for controlling overall type I error for the trial, and other hypothesis-testing considerations, such as adjustments for multiplicity of endpoints, are satisfied.

Table 1  Overall and Adjusted Type I Errors for Two Decision Criteria Designed to Evaluate Four Equally Correlated Endpoints*

                      Criterion 1: At Least One of the      Criterion 2: Each of Four
                      Four Endpoints Must Be                Endpoints Must Be
                      Significant at 0.05                   Significant at 0.05
Correlation Among     Overall    Adjusted Type I for        Overall    Adjusted Type I for
the Four Endpoints    Type I     Endpoints to Maintain      Type I     Endpoints to Maintain
                                 Overall 0.05                          Overall 0.05
0.0                   0.186      0.0127                     <0.001     0.473
0.2                   0.173      0.013                      0.0002     0.376
0.4                   0.155      0.014                      0.0014     0.289
0.6                   0.133      0.017                      0.005      0.209
0.8                   0.106      0.022                      0.013      0.136
1.0                   0.05       0.05                       0.05       0.05

* Capizzi and Zhang [2].
`
MULTIPLE CORRELATED ENDPOINTS: THE DECISION RULE FOR A “WIN” IN A CLINICAL TRIAL

Primary and secondary endpoints may be considered as a collection of multiple endpoints, each of which, if not held to a protocol-defined criterion for valid interpretation, could produce a variety of outcomes whose uncertainty is difficult to quantify. Table 1, adapted from Capizzi and Zhang [2], illustrates the impact that two different decision rules have on the overall type I error in a hypothesis-testing framework when there are four correlated multiple endpoints in a trial.
The table considers four endpoints with an equicorrelation structure in which the correlation ranges from 0 to 1. We consider two decision criteria within the hypothesis-testing framework. Criterion 1 requires that at least one of the four clinical endpoints demonstrate a statistically significant finding at a 0.05 type I level. Depending upon the correlation among the four endpoints, the overall type I error for the decision rule can range from 0.05 to 0.186. To maintain an overall 0.05 error rate, the adjusted type I levels for the individual endpoints can range between 0.05 and 0.0127. Clearly, the validity of the inference is sensitive to the correlation structure for this decision rule.

The other decision rule, criterion 2, requires that all four of the clinical endpoints demonstrate a statistically significant result at the 0.05 level. The overall type I error rate for this decision rule ranges from 0.05 when correlation is 1 to less than 0.0001 when the endpoints are uncorrelated. To maintain an overall 0.05 error rate, the adjusted levels for the individual endpoints range from 0.05 to 0.473. Thus, both decision rules are valid for their intended purposes. The inferences made from each are valid, but they differ in terms of both clinical and statistical interpretation.
The message derived from the information in Table 1 is that the number of endpoints, the correlation structure among the endpoints, and the decision rule for a “win” all matter in judging the validity of the inference. When the criterion for that win is that at least one endpoint must be statistically significant at the 0.05 level, then the need for statistical adjustments to maintain a constant overall 0.05 level decreases as the correlation among the four endpoints increases toward 1. On the other hand, no statistical adjustments may be needed and, in fact, the overall type I error of 0.05 is conservative when the win criterion specifies that each of four clinical endpoints must demonstrate statistical significance at the 0.05 level.
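The dependence of the overall type I error on the decision rule and the correlation can be sketched numerically. The following is a minimal illustration, not the method of Capizzi and Zhang [2]: it assumes four two-sided z-tests whose statistics share one common factor (which yields an equicorrelation structure), and the function names `overall_type_one_error` and `adjusted_level` are hypothetical. At intermediate correlations its output need not agree exactly with Table 1, since the table's underlying test assumptions are not stated here.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def overall_type_one_error(alpha, rho, k=4, rule="any", steps=2000):
    """Overall type I error for k equicorrelated two-sided z-tests at level alpha.

    rule="any": a win requires at least one significant endpoint.
    rule="all": a win requires every endpoint to be significant.
    Assumed model: Z_i = sqrt(rho) * Z0 + sqrt(1 - rho) * E_i, all standard normal.
    """
    if rho >= 1.0:
        return alpha  # perfectly correlated endpoints behave as a single test
    c = STD_NORMAL.inv_cdf(1 - alpha / 2)
    a, b = rho ** 0.5, (1 - rho) ** 0.5
    lo, hi = -8.0, 8.0
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):  # trapezoid rule over the shared factor Z0
        z0 = lo + i * h
        # Probability that one endpoint is NOT significant, given Z0 = z0.
        p_in = STD_NORMAL.cdf((c - a * z0) / b) - STD_NORMAL.cdf((-c - a * z0) / b)
        value = (1 - p_in) ** k if rule == "all" else 1 - p_in ** k
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * value * STD_NORMAL.pdf(z0)
    return total * h


def adjusted_level(target, rho, k=4, rule="any"):
    """Per-endpoint level that yields the target overall type I error (bisection)."""
    lo, hi = 1e-9, 1 - 1e-9
    for _ in range(40):
        mid = (lo + hi) / 2
        if overall_type_one_error(mid, rho, k, rule) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    for rho in (0.0, 0.4, 0.8, 1.0):
        print(rho,
              round(overall_type_one_error(0.05, rho, rule="any"), 3),
              round(adjusted_level(0.05, rho, rule="any"), 4))
```

For uncorrelated endpoints the sketch reduces to the closed forms 1 - 0.95^4 (at least one significant) and 0.05^4 (all four significant), which correspond to the first row of Table 1.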
`
INFERENCE ON THE SECONDARY ENDPOINT IS CONDITIONAL ON WHETHER THE PRIMARY ENDPOINT IS OR IS NOT STATISTICALLY SIGNIFICANT

Consider the implications of making inference on the secondary endpoint conditional on what has occurred with the primary endpoint. When there is correlation between the primary and secondary endpoints, as indeed there would be when some endpoints are functions or composites of other endpoints, the information conveyed in the secondary endpoint differs depending on whether the primary endpoint is or is not significant. The conditional nature of this inference raises some interesting issues. For example, the primary endpoint usually forms the basis for the design, sample size, and power of a clinical trial. Thus, it should be more likely to observe significant p-values for the primary endpoint when the trial is well powered and when the alternative hypothesis (a specified treatment effect) is true. If the clinical trial is underpowered for the secondary endpoint, as it might be when mortality is categorized as a secondary endpoint, we should expect to observe significant p-values associated with the secondary endpoint to a lesser extent, even if the alternative for the secondary endpoint is true. Thus, differentially powered tests for each of the multiple endpoints play some role in the interpretation of observed outcomes.

An example that may be more problematic occurs when the secondary endpoint is a composite endpoint such as total mortality, which also includes a primary endpoint like cardiovascular mortality. The validity of the analysis should be influenced by the correlation structure induced by inclusion of the primary endpoint into the secondary endpoint and by the proportional contribution of the primary endpoint treatment effect to the secondary composite endpoint treatment effect. If the treatment effect on the primary endpoint is not observed to be statistically significant, then the interpretation of a valid inference for the secondary (composite) endpoint should address the conditional nature of the criteria. For example, if a primary endpoint that comprises 75% of a composite secondary endpoint is not statistically significant, even when the trial is powered for that primary endpoint, there is now information that a component of the composite secondary endpoint is not likely to be impacted by treatment. The clinical and statistical interpretation of a valid inference for the secondary endpoint seems to me to involve a conditional inference, the statistical adjustments for which can be viewed in several ways. I am unaware of research that directly sheds light on the properties of the inference for these situations, but I question the validity of the inference.
`
SAMPLE SIZE AND THE PRECISION OF THE EXPECTED TREATMENT EFFECT: ANOTHER MEASURE OF VALIDITY

While most clinical trialists recognize that the precision of the estimate of treatment effect is important to the characterization of the treatment effect, clinical trials are usually not designed to estimate precisely a treatment effect. Rather, the precision of the estimate of the treatment effect is a byproduct of the clinical trial, the size of which is based on detecting a posited clinically important treatment effect in a hypothesis-testing framework. A valid inference may consider the clinical utility of the precision of the estimate of the treatment effect on the secondary endpoint as compared with that on the primary endpoint or a multiplicity of endpoints of different variabilities.
Figure 1 illustrates the expected width of a 95% confidence interval for the planned treatment effect size associated with a sample size designed to detect a univariate endpoint effect size with 90% power, using a hypothesis test as the method of inference [3]. The sample sizes, calculated according to the traditional hypothesis-testing paradigm, are designed to detect important treatment effect sizes ranging from 0.1 to 1.0, with 90% study power. A treatment effect size is defined as the clinically important difference in average endpoint response between the test and control groups divided by the standard deviation of the endpoint response. Figure 1 may provide insight into the extent to which different endpoints with different treatment effect sizes will have, for a fixed-sample-size trial, different precisions for the estimates. A given trial sample size may be more or less sufficient to provide the necessary power for any primary or secondary endpoint, depending upon the variations among each of the endpoints in control group response and its variability and upon the relative treatment impact on each of the endpoints.
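The link between a power-based sample size and the resulting precision can be sketched with the usual normal-approximation formulas. This is an illustration under stated assumptions, not necessarily the calculation behind Figure 1: it assumes a two-arm trial with equal allocation, a two-sided test at the 0.05 level, a known common standard deviation, and sample sizes counted per group. The function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def per_group_sample_size(effect_size, power=0.90, alpha=0.05):
    """Per-group n to detect a standardized effect size with the given power."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


def expected_confidence_interval(effect_size, power=0.90, alpha=0.05):
    """Expected CI (in standard deviation units) centered at the planned effect size."""
    n = per_group_sample_size(effect_size, power, alpha)
    # Standard error of a difference in means, in SD units, is sqrt(2 / n).
    half_width = STD_NORMAL.inv_cdf(1 - alpha / 2) * math.sqrt(2 / n)
    return effect_size - half_width, effect_size + half_width


if __name__ == "__main__":
    for es in (0.1, 0.5, 1.0):
        low, high = expected_confidence_interval(es)
        print(es, per_group_sample_size(es), round(low, 2), round(high, 2))
```

Under these assumptions the half-width of the interval is roughly 0.6 times the planned effect size whatever that effect size is, so a trial sized for 90% power is expected to yield an interval running from about 0.4 to 1.6 times the planned effect.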
`
[Figure 1: horizontal bars showing the expected 95% confidence interval, one per planned sample size (N = 22, 26, 33, 43, 59, 85, 132, 234, 528, 2102); horizontal axis: Effect Size Δ/σ, 0 to 1.7.]

Figure 1  Expected width of 95% confidence interval for studies powered at 90% for an assumed treatment effect size, ES.

INTERPRETATION OF THE P-VALUE FOR PRIMARY AND SECONDARY ENDPOINTS

A clinical trial may be substantially underpowered for a secondary endpoint relative to the primary endpoint for their respective treatment effects. For a given fixed sample size of a trial, assuming a true treatment effect, D, exists for a secondary endpoint, the expected distribution of the p-values associated with the test of treatment effect on the secondary endpoint may differ from the distribution of expected p-values for the primary endpoints. Statistical adjustments used in multiple testing situations are designed to preserve the overall experiment-wise type I error rate for the variety of comparisons envisioned. Statistical adjustment procedures for multiple comparisons assume the null hypothesis is true. The expected distribution of p-values for primary and secondary endpoints should differ as a function of the power to detect a specific alternate treatment effect size for each endpoint [4]. An observed p-value of 0.05 for an underpowered (secondary) endpoint may be more impressive than an observed p-value of 0.05 for a substantially overpowered (primary) endpoint.

The interpretation of the strength of the evidence on behalf of a treatment effect on a secondary or primary endpoint should involve knowledge of the distribution of the p-value under both the null and the alternative hypotheses for each of the endpoints.
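How the distribution of the p-value shifts with power can be sketched for a simple case. The code below is a hedged illustration rather than the analysis of Hung et al. [4]: it assumes a two-arm, two-sided z-test with equal group sizes and a known standard deviation, and the function name and the example effect sizes (0.5 for a primary and 0.2 for a secondary endpoint, with 85 subjects per group) are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def prob_p_value_at_most(threshold, effect_size, n_per_group):
    """P(two-sided p-value <= threshold) when the true standardized effect is effect_size."""
    shift = effect_size * (n_per_group / 2) ** 0.5  # mean of the z statistic
    z_crit = STD_NORMAL.inv_cdf(1 - threshold / 2)
    # Both rejection tails are counted; the lower tail is negligible for large shifts.
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


if __name__ == "__main__":
    n = 85  # hypothetical per-group size, chosen to power the primary endpoint
    for label, es in (("null", 0.0), ("secondary", 0.2), ("primary", 0.5)):
        row = [round(prob_p_value_at_most(t, es, n), 3) for t in (0.01, 0.05, 0.10)]
        print(label, row)
```

Under the null hypothesis the p-value is uniform, so the probability of falling at or below any threshold equals that threshold. Under an alternative the probability depends on the power for that endpoint, which is why the same observed p-value carries different weight for endpoints powered differently within one trial.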
A final point: Under certain situations, a secondary endpoint finding may be considered an exploratory result, especially when criteria for assessing its importance have not been clearly specified in the protocol in advance. If we use the observed magnitude of the p-value as a measure of the validity of the inference, it is worth considering how likely it is, given the p-value observed in that initial study, that a second study would replicate the statistical evidence observed in the initial study. This is one conceptual approach to verifying whether a statistically significant finding for a secondary endpoint is serendipitous, especially if considered in light of an exploratory or hypothesis-generating finding. Exploratory findings, especially those derived from analyses not prespecified in a protocol, should be confirmed in a second study designed in part to replicate or confirm the “valid” analysis from the first study. Some ideas of Goodman [5] can be adapted to illustrate that if a second trial were conducted in a manner identical to that of the initial trial that produced the secondary endpoint result, and if one used the same sample size as was used in the initial study, and if one assumed that the treatment effect size observed in the first trial was the true effect size against which to calculate power of the second study, the chance of having a statistically significant result in the repeat study (i.e., p-value less than 0.05) can be calculated. Table 2 presents the probability of a statistically significant result in the second study (power) as a function of the observed p-value in the first study. For an initial study with an observed p-value of 0.10, the chance of an observed p-value of 0.05 or less in a repeat study of the same sample size is about 37%. It is not until one observes a p-value of 0.001 that the chance of observing a p-value of 0.05 or less in the second study reaches about 90%. Viewed in this manner, an observed p = 0.05 for a secondary endpoint might not be convincing evidence against the null hypothesis.

Table 2  Probability of Observing a Statistically Significant Result (p < 0.05) upon Repetition of a Clinical Trial when the Effect, ES, Observed in the First Trial Is Assumed to Be the True Effect

Observed p-Value    Probability of a Significant Result (Power)
0.10                0.37
0.05                0.57
0.03                0.58
0.01                0.73
0.005               0.80
0.001               0.91
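The replication calculation described above can be sketched with a normal approximation. This is an assumption-laden illustration in the spirit of Goodman [5], not a reproduction of the original computation: it assumes a two-sided z-test, treats the z statistic observed in the first trial as the true mean of the statistic in an identical second trial, and ignores the negligible chance of significance in the opposite direction. The function name is hypothetical, and values from this approximation may differ slightly from individual table entries.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def replication_probability(p_observed, alpha=0.05):
    """Chance that an identical second trial gives p < alpha, taking the first
    trial's observed effect as the true effect (normal approximation)."""
    z_observed = STD_NORMAL.inv_cdf(1 - p_observed / 2)
    z_critical = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(z_observed - z_critical)


if __name__ == "__main__":
    for p in (0.10, 0.05, 0.03, 0.01, 0.005, 0.001):
        print(p, round(replication_probability(p), 2))
```

The design choice here is to condition on the observed effect as if it were known exactly; allowing for uncertainty in that estimate would generally pull the replication probabilities closer to 50%.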
Finally, after presenting these arguments to support the position that secondary endpoints cannot be validly interpreted when the primary endpoints are not statistically significant, I would conclude by saying, “Never say never.” There are too many situations that have not been thoroughly explored.
`
REFERENCES

1. CPMP Working Party on Efficacy of Medicinal Products. Note for guidance: biostatistical methodology in clinical trials in applications for marketing authorization for medical products. Stat Med 1995;14:1659-1682.
2. Capizzi T, Zhang JI. Testing the hypothesis that matters for multiple primary endpoints. Drug Info J 1996;30:949-956.
3. Bristol DR. Sample sizes for constructing confidence intervals and testing hypotheses. Stat Med 1989;8:803-811.
4. Hung HMJ, O’Neill R, Bauer P, Kohne K. The behavior of the p-value when the alternative hypothesis is true. Biometrics 1997;53:11-22.
5. Goodman SN. A comment on replication, p-values and evidence. Stat Med 1992;11:875-879.