Recent developments in the design of phase II clinical trials
`
`Peter F. Thall and Richard M. Simon
`
`Introduction
`
`Clinical trials of new medical treatments may be classified into three succes-
`sive phases. Phase I trials typically are small pilot studies to determine the
`therapeutic dose of a drug, biological agent, radiation schedule, or a com-
`bination of these regimens (cf. [1]). In cancer therapeutics, the underlying
`idea is that a higher dose of the therapeutic agent kills more cancer cells but
`also is more likely to harm and possibly kill the patient. Consequently,
`toxicity is the usual criterion for determining a maximum tolerable dose
`(MTD), and most phase I cancer trials involve very small groups of patients,
`usually three to six patients per dose, with each successive group receiving a
`higher dose until it is likely that the MTD has been reached. A more refined
`approach that continually updates an estimate of the probability of toxicity
`has also been proposed by O’Quigley, Pepe and Fisher [2].
`Once a dose and schedule of a new experimental regimen E have been
`determined, its therapeutic efficacy is evaluated in a phase II trial. Phase II
trials are usually single-arm studies involving roughly n = 14 to 90 patients
treated with E, with n usually well under 60. These studies typically are
`carried out within a single institution and are most prominent in clinical
`environments where there are many new treatments to be tested. The
`primary goal is to determine whether E has a level of antidisease activity
`sufficiently promising to warrant its evaluation in a subsequent phase III
`trial (described below). Phase II results also frequently serve as the basis for
`additional single-arm studies involving E in other combination regimens or
`dosage schedules. The main statistical objective of a phase II trial thus is to
`provide an estimator of the response rate associated with E (cf. [3]). Treat-
`ment success generally is characterized by a binary patient response, such as
`50% or more shrinkage of a solid tumor or complete remission of leukemia,
`and the scientific focus is p, the probability of response with E. Patient
response usually is defined over a relatively short time period in phase II,
`based on the underlying idea that short-term response is a necessary pre-
`cursor to improved long-term survival and reduction in morbidity. Phase
`II trials are important because they are the primary means of selecting
`
Peter F. Thall (ed.), RECENT ADVANCES IN CLINICAL TRIAL DESIGN AND ANALYSIS.
Copyright © 1995 Kluwer Academic Publishers. Boston. All rights reserved. ISBN 9784-4613-5505.
`
`
`
`
treatments for phase III evaluation, and, moreover, many patients receive
treatment within the context of a phase II trial.
The ultimate standard for evaluation of medical treatments is the randomized
comparative phase III clinical trial. Phase III trials generally are large,
multi-institutional studies with treatments evaluated in terms of long-term
patient response, such as survival or time to disease progression. Phase III
trials are designed and conducted to evaluate the effectiveness of a treatment
relative to an appropriate control and with regard to endpoints that represent
patient benefit, such as survival. To achieve such objectives, the trial design
is based on statistical tests of one or more hypotheses and may require
approximate balance and minimal sample size within important patient subgroups.
Because they are larger and of longer duration than phase II trials, and
typically involve multiple institutions, phase III trials are usually much more
costly and logistically complicated. The results of phase III trials are
broadly disseminated within the medical community and form the basis for
changes and advances in general medical practice.
The simplest phase II design is a single-arm, single-stage trial in which n
patients are treated with E. The data consist of the random variable Yn,
namely, the number of successes after n patients are evaluated, which is
binomial in n and p. The sample size is determined so that, given a fixed
standard rate p0 that is of no clinical interest, a test of H0: p ≤ p0 versus
H1: p ≥ p1 has type I error probability (significance level) ≤ α and type II
error probability ≤ β for a given target response probability p1 = p0 + δ.
The test is determined by a cutoff r, with H0 rejected if Yn ≥ r and H1
rejected if Yn < r. A type I error occurs if it is concluded that E is promising
compared to standard therapy, i.e., if H1 is accepted, when in fact p ≤ p0.
The consequences of this are that an uninteresting or even inferior treatment
is likely to become the basis for a phase III trial, and that if future
phase II trials using a combination therapy based on E are conducted, the
patients in those trials will be treated with an inferior agent while phase II
trials of other potentially promising new treatments are delayed. A type II
error occurs if it is concluded that E is not promising compared to standard
therapy, i.e., if H0 is accepted when in fact p ≥ p0 + δ. The power of the
test is 1 − β, the probability of correctly accepting H1 when E really has
success rate p0 + δ. The consequence of a type II error is that a promising
treatment has been lost or its detection delayed. The required sample size n
and test cutoff r are determined by specifying α, β, p0, and δ. Since there is a
trade-off between type I and type II error, in practice typically (α, β) =
(0.10, 0.10), (0.05, 0.20), or (0.05, 0.10). We shall refer to α and β, and more
generally any parameters that describe a design's behavior, as its operating
characteristics.
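To make the computation concrete, here is a minimal sketch (not software from the chapter; `single_stage` is an illustrative name) that finds the smallest n and cutoff r by an exact binomial search over the two error constraints:

```python
from scipy.stats import binom

def single_stage(p0, p1, alpha, beta, n_max=200):
    """Smallest n, with cutoff r, such that a test rejecting H0 when
    Yn >= r has type I error <= alpha at p0 and type II error <= beta at p1."""
    for n in range(1, n_max + 1):
        for r in range(n + 1):
            type1 = binom.sf(r - 1, n, p0)   # P(Yn >= r | p0)
            type2 = binom.cdf(r - 1, n, p1)  # P(Yn <  r | p1)
            if type1 <= alpha and type2 <= beta:
                return n, r
    raise ValueError("no design within n_max patients")

n, r = single_stage(0.10, 0.30, alpha=0.10, beta=0.10)
```

For p0 = 0.10, p1 = 0.30, and (α, β) = (0.10, 0.10), this search returns r/n = 5/25, consistent with table 1.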
`
Smaller treatment advances δ are harder to detect, i.e., they require a
larger sample size for given p0, α, and β. A very large δ requires a trivially
small sample size, i.e., it is easy to detect a large treatment advance.
Reasonable values are thus δ = 0.15 to 0.20, since δ < 0.15 usually leads to
unrealistically large n, while δ > 0.20 leads to a trial yielding very little
information about E and in many cases is intellectually dishonest. Parameters
of some typical single-stage designs are given in table 1.
An alternative to designing a single-stage trial in terms of hypothesis
testing, which is a formal method for deciding whether E is promising
compared to the fixed standard success probability p0, is to choose n to
obtain a confidence interval of given width and level (coverage probability)
to estimate p. A good approximate confidence interval, due to Ghosh [4], is

    [p̂ + A/2 ± z{p̂(1 − p̂)/n + A/(4n)}^{1/2}] / (1 + A),

where p̂ = Yn/n, z = 1.645, 1.96, or 2.576 for a 90%, 95%, or 99% coverage
probability, respectively, and A = z²/n. The exact binomial confidence
interval of Clopper and Pearson [5] also may be used, although the above
approximation is quite adequate for planning purposes. An important caveat
is that the commonly used approximate interval p̂ ± z{p̂(1 − p̂)/n}^{1/2} is
rather inaccurate for many values of n and p encountered in phase II trials
[4] and is not recommended. Table 2 gives the sample sizes needed to obtain
90% or 95% confidence intervals for p of given width, based on values of p̂
from 0.20 to 0.50. The sample sizes for p̂ = 0.50 + Δ and 0.50 − Δ are
identical. For example, if it is anticipated that the empirical rate Yn/n will be
approximately 0.30 or 0.70, then a sample of 34 patients is required to
obtain a 90% confidence interval for p having width at most 0.25. Given an
observed rate of 10/34, one could be 90% certain that the true success
probability of E is somewhere between 0.185 and 0.434.
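Ghosh's interval above is easy to compute directly; the sketch below (with the hypothetical helper name `ghosh_ci`) reproduces the 10/34 worked example:

```python
import math

def ghosh_ci(y, n, z=1.645):
    """Ghosh's approximate CI for a binomial proportion (z = 1.645 -> 90%)."""
    p = y / n
    A = z * z / n
    half = z * math.sqrt(p * (1 - p) / n + A / (4 * n))
    lower = (p + A / 2 - half) / (1 + A)
    upper = (p + A / 2 + half) / (1 + A)
    return lower, upper

lo, hi = ghosh_ci(10, 34)   # the observed rate 10/34 from the text
```

This yields the interval (0.185, 0.434) quoted above.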
`Although the single-stage design is easy to understand and implement, it
`has several severe practical limitations. Each of the designs described in the
`following sections was created to address one or more of the following
`problems.
`
Table 1. Single-stage designs: conclude p ≥ p1 at level α and
power 1 − β if Yn ≥ r. Entries are r/n.

                          (α, β)
δ      p0     p1    (0.10, 0.10)   (0.05, 0.20)   (0.05, 0.10)
0.20   0.10   0.30      5/25           6/25           7/33
0.20   0.20   0.40     11/36          12/35          15/47
0.20   0.30   0.50     16/39          17/39          22/53
0.20   0.40   0.60     21/41          23/42          29/56
0.15   0.10   0.25      7/40           8/40          10/55
0.15   0.20   0.35     17/61          17/56          22/77
0.15   0.30   0.45     27/71          27/67          36/93
0.15   0.40   0.55     36/75          36/71          46/94
`
`
`
`
Table 2. Single-stage n to obtain a confidence interval of given
level and width ≤ W.

                     Anticipated p̂ = Yn/n
Level    W       0.20    0.30    0.40    0.50
90%      0.20      44      55      63      66
90%      0.25      26      34      40      42
90%      0.30      19      24      26      28
95%      0.20      64      78      89      94
95%      0.25      39      48      56      58
95%      0.30      26      33      38      40
`
1. The most serious limitation of the single-stage design is that it ignores
all data prior to observation of Yn, and in particular has no provision for
early termination if the interim observed response rate is unacceptably low.
For example, if p0 = 0.30 is the established response rate with standard
treatment and E also has rate p = 0.30, then an initial run of 12 failures
should occur with probability 0.014, and if p > 0.30 then such a run has
probability close to 0. Most clinicians would be strongly inclined to dis-
continue use of E at or before this point, especially in trials of treatments for
rapidly fatal diseases or other circumstances where early failure increases
morbidity or reduces survival. Designs with early stopping rules address this
problem (cf. [6–14]).
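The 0.014 figure can be checked with simple arithmetic (this is an illustration, not code from the chapter):

```python
p0 = 0.30                   # response rate of standard therapy
p_run = (1 - p0) ** 12      # P(12 consecutive initial failures) when p = p0
```

This gives p_run ≈ 0.0138, matching the probability quoted above.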
2. Reporting results of a phase II trial entails augmenting or replacing
significance test results with a confidence interval for p, since the real goal of
a phase II trial is estimation [3]. If rules for early stopping are included in
the design, however, then computation of the confidence interval for p
based on the final data must account for the fact that the trial continued
through its intermediate stages, since the usual unadjusted confidence
intervals are biased in this case. Methods for computing a confidence interval
for p after a multistage trial have been given by numerous authors, including
Jennison and Turnbull [15], Tsiatis, Rosner, and Mehta [16], Atkinson and
Brown [17], and Duffy and Santner [18].
3. Another problem, addressed by Thall and Simon [19], is that p0 often
is estimated from historical data and hence is a statistic p̂0, not a fixed value.
Since this estimator has an associated variance, the usual test statistic Yn/n
− p̂0 has variance p(1 − p)/n + var(p̂0). The sample size computation that
ignores var(p̂0) is incorrect, and the actual type I and type II error rates are
larger than their nominal values.
4. In some settings several new treatments may be ready simultaneously
for phase II evaluation. The question then arises of whether to carry out a
sequence of single-arm trials or one randomized trial, and in either case
strategies are needed for prioritizing treatments and for selecting one or
more promising treatments from those tested. Several approaches to this
general problem have been proposed. Simon, Wittes, and Ellenberg [20]
propose a randomized phase II trial; Whitehead [21] proposes a combined
phase II–III strategy; Thall, Simon, and Ellenberg [22,23] propose ‘select
then test’ designs for comparing the best of several experimental treatments
to a standard; and Strauss and Simon [24] examine properties of a sequence
of ‘play the winner’ randomized phase II trials.
5. The assumption that patient response can be characterized effectively
by a single variable is rather strong, even for short-term response, and it
may be necessary to monitor more than one patient outcome. For example,
in most cancer chemotherapy trials, toxicity is an important issue, and it is
highly desirable to have an early stopping rule to protect future patients
from unacceptably high rates of toxicity. Many phase II trials include such a
rule either formally or informally in their protocols, but they ignore the
interdependence between toxicity and response in the design. Designs
accounting for multiple outcomes have been proposed by Etzioni and Pepe
[25] and Thall, Simon, and Estey [26].
6. Patient-to-patient variability is often high, even in clinical trials
with very specific entry criteria. Since phase II trials are relatively small,
a study with an unusually high proportion of either poor-prognosis or
good-prognosis patients may give a misleadingly pessimistic or optimistic
indication of how E would behave in the general patient population.
7. Although most phase II designs regard treatment response rate p
as a fixed unknown quantity, many clinicians regard p as random. For
example, when asked to specify p0, the clinician may respond by giving a
range rather than a single value, and may even describe the probability
distribution of p0 within that range. In such circumstances, a Bayesian
design, based on random values of p0 and p, may be more appropriate.
Bayesian phase II designs have been proposed by Sylvester and Staquet
[27,28], Sylvester [29], Etzioni and Pepe [25], Thall and Simon [12–14],
and Thall, Simon, and Estey [30].
`
Refinements of the phase I–II–III paradigm
`
When the best available therapy has little or no effect against the disease,
the phase II trial's objective is to determine whether E has any antidisease
activity at all. This is a phase IIA trial. Since p0 = 0 or possibly 0.05 in this
case, type II error is the main consideration. Gehan [6] proposed the first
phase IIA design, a two-stage design in which n1 patients are treated at stage
1, the trial is stopped if Yn1 = 0, and an additional n2 patients are treated in
stage 2 if Yn1 > 0. The stage 1 sample size is chosen to control type II error,
specifically n1 ≥ log(β)/log(1 − p1) for targeted success rate p1. The stage 2
sample size is chosen to obtain p̂ having standard error no larger than a
given magnitude, and n2 also depends on Yn1. For example, if β = 0.05 and
p1 = 0.20, then n1 = 14 patients are required at stage 1. If Y14 > 0, then to
obtain an estimate of p having standard error 0.10 requires n2 = 1, 6, 9, or
11 if Y14 is 1, 2, 3, or ≥4, respectively.
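The stage 1 formula can be sketched directly (the helper name `gehan_stage1_size` is illustrative):

```python
import math

def gehan_stage1_size(beta, p1):
    """Smallest n1 with (1 - p1)**n1 <= beta, i.e., the chance of seeing
    0 responses in stage 1 is at most beta when the true rate is p1."""
    return math.ceil(math.log(beta) / math.log(1 - p1))

n1 = gehan_stage1_size(0.05, 0.20)   # 14, as in the example above
```

With β = 0.05 and p1 = 0.20 this reproduces n1 = 14; relaxing β to 0.10 would reduce the requirement to 11 patients.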
When there exists a standard treatment, say S, having some level of
activity (i.e., when p0 > 0), then the goal is to identify new treatments that
are promising compared to p0. This is a phase IIB trial. In this case, there
are compelling data, arising from clinical trials or in vitro testing, indicating
that E is likely to be active at a level exceeding p0. An important considera-
tion in IIB trials is that it is clinically undesirable to continue a trial of an
experimental treatment that proves to be not promising compared to S. For
example, when p0 = 0.40 and p1 = 0.55, if interim trial results strongly
indicate that p < 0.40, then it is unethical to continue; if it is likely that 0.40
≤ p < 0.55, then it may be desirable to terminate the trial to make way for
other, potentially more promising new treatments. It is also important to
recognize the comparative aspect of phase IIB trials, which may lead to
formal use of historical data on S in the evaluation of E, and possibly to a
randomized trial [19]. This issue will be discussed below.
If several new treatments are simultaneously available for phase II test-
ing, then the problem of choosing among them arises. Since the number of
patients in any clinic is limited, this situation frequently occurs in institutions
with high levels of research activity in growth factors or pharmacologic
agents. Thall and Estey [30] propose a pre-phase II Bayesian strategy in
which patients having a prognosis more favorable than that of phase I
patients but less favorable than that of the target group of the subsequent
phase II trial are randomized among several experimental treatments. The
response rate distribution in each treatment arm is updated continually
during the trial and is compared to early termination cutoffs, and the best
final treatment must satisfy a minimal posterior efficacy criterion before it is
evaluated in a subsequent phase II trial. This type of study, the phase 1.5
trial, bridges the gap between phase I and phase IIB. It provides an ethical
means of giving poor-prognosis patients experimental treatments while
replacing the usual informal pre-phase II treatment selection process with a
fair comparison formally based on a combination of prior opinion and
clinical data.
`
As an example, a phase 1.5 trial might be carried out in patients who have
acute myelogenous leukemia (AML) with ≥1 prior relapse and poor-
prognosis cytogenetic characteristics, in order to select a treatment for phase
II testing in untreated AML patients who have good-prognosis cytogenetics.
If the accrual rate is 40 per year in the poor-prognosis group, then a phase
1.5 trial of three treatments with up to 10 patients per treatment arm could
be carried out in nine months. Assuming a prior mean response rate of 0.40
for all three arms, Thall and Estey [30] recommend a design in which a
treatment arm is terminated if there are 0 responses in the first 4 patients;
otherwise, 10 patients are accrued in that arm. The best treatment, among
those not terminated, must have ≥4 responses to be selected for the phase
II trial.
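The continual posterior updating described above can be sketched with a conjugate beta model; the Beta(2, 3) prior below (mean 0.40) is an illustrative assumption, not the prior actually used by Thall and Estey [30]:

```python
def beta_posterior(a, b, responses, patients):
    """Posterior Beta parameters after observing the given arm's data."""
    return a + responses, b + patients - responses

# Illustrative Beta(2, 3) prior, mean 2/(2 + 3) = 0.40.
# An arm with 0 responses in its first 4 patients:
a_post, b_post = beta_posterior(2, 3, responses=0, patients=4)
post_mean = a_post / (a_post + b_post)   # posterior mean response rate
```

After 0 responses in 4 patients, the posterior mean drops from 0.40 to 2/9 ≈ 0.22, illustrating why such an arm is a candidate for early termination.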
`
The response rates obtained in different phase II trials of the same
treatment often vary widely. Simon, Wittes, and Ellenberg [20] cite a
number of factors as the sources of this variability, including patient selec-
tion, definition of response, interobserver variability in response evaluation,
drug dosage and schedule, reporting procedures, and sample size. To deal
with these problems, these authors propose randomizing patients among
several experimental treatments in phase II, with ranking and selection
methods rather than hypothesis testing used to evaluate treatments. They
recommend the use of conventional phase II sample sizes and early stopping
criteria in each treatment arm, and that a standard treatment arm not be
included. Specifically, they propose that sample size be computed to ensure
that, if one group of treatments has response rate p0 + δ and the rest have
rate p0, then a ‘select the best’ strategy will choose one of the superior
treatments with a desired probability. For example, if p0 = 0.20 and δ =
0.15, then 44 patients in each of three arms will ensure a 90% chance of
choosing a treatment with response rate 0.35.
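The correct-selection probability can be sketched as follows; this is an illustrative reconstruction (with random tie-breaking), not the authors' code:

```python
from math import comb
from scipy.stats import binom

def prob_correct_selection(n, p_best, p_other, k):
    """P(the arm with rate p_best has the highest response count among
    k + 1 arms of n patients each; ties broken uniformly at random)."""
    total = 0.0
    for y in range(n + 1):
        f_best = binom.pmf(y, n, p_best)
        below = binom.cdf(y - 1, n, p_other)   # an inferior arm falls below y
        at = binom.pmf(y, n, p_other)          # an inferior arm ties at y
        inner = sum(comb(k, j) * at**j * below**(k - j) / (j + 1)
                    for j in range(k + 1))
        total += f_best * inner
    return total

p_sel = prob_correct_selection(44, 0.35, 0.20, k=2)
```

For n = 44, one arm at 0.35, and two arms at 0.20, this probability is approximately 0.90, in line with the example above; as a sanity check, with all arms equal the formula reduces to 1/(k + 1).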
Strategies for phase II evaluation of new treatments that become available
sequentially over time have been considered by Whitehead [31] and by
Strauss and Simon [24]. Whitehead is motivated in part by the desire to
examine the properties of small sample sizes for phase II studies. He as-
sumes that the success rates of the experimental treatments are random and
may be considered as independent draws from a beta prior distribution.
Given N equal to the total number of patients for all the trials, he derives
the number of trials k and number of patients per trial n that maximize the
expected success probability E(π) of the selected treatment, subject to nk =
N. For example, if N = 60 and the mean experimental success rate is 0.20,
then depending upon prior variability, the optimal integer values of (n, k)
and E(π) vary from (4, 15) with E(π) = 0.426, to (6, 10) with E(π) = 0.292.
Strauss and Simon [24] study properties of a sequence of comparative
phase II trials. At each of k stages, 2n patients are randomized between a
new experimental treatment and the better of the two treatments from the
previous stage, starting with a known standard S at stage 1. The better of
the two treatments at each stage (the ‘winner’) thus becomes the new
standard, and is then compared to the next experimental treatment. The
goal is to select a single treatment for phase III evaluation. Similar to
Whitehead [31], Strauss and Simon assume that the success probabilities
of the experimental treatments are independent draws from a beta prior
distribution, either with fixed mean equal to that of S or with distribution
adapted to the data in that its mean equals that of the latest winner. This
approach, however, is more robust against time trends in the selection of
patients. Given a total of N = nk patients, they examine the manner in
which the expected success probability E(p) of the final selected treatment
`
`
`
`varies with n, k, and N. They identify conditions under which such a
`sequence of phase II trials is more likely than a single phase II trial to
`identify a promising experimental treatment.
Whitehead [21] also proposes an integrated approach to the problem of
evaluating several new treatments. A sequence of single-arm phase II trials
is conducted; the most promising experimental treatment among them is
selected, and it is then compared to the standard in a phase III trial.
Assuming that the success rates of the experimental treatments are random
and may be considered as independent draws from a beta prior distribution,
Whitehead derives strategies for dividing patients between the two phases
(given the number of phase II trials and the total number of patients) that
maximize the probability π of obtaining a significant result in the phase III
trial. For example, if N = 300 patients are available and there are five new
agents to be tested, then allocating 18 patients to each of the five phase II
trials and 210 to phase III ensures that π = 0.52. If instead N = 500, then
the optimal allocation is 31 with 345 in phase III, which ensures that π =
0.63. Whitehead notes that, when using this strategy, the main trade-off is
between the total numbers of patients allocated to the two phases.
`
`Some practical considerations
`
`Because phase II trials are developmental, their design and conduct must
`include several ethical and logistical considerations. These include the
`appropriateness of treating patients with E, the relevance of the trial within
`the larger context of treatment development, the patient accrual rate, defini-
`tion of patient response, and the monetary cost of the trial. In any phase II
`setting, a priori there must be a reasonable basis for the belief that E may
`provide an improvement over the standard, whether p0 = 0 or p0 > 0. If in
`the course of the trial it becomes clear that this is unlikely, then it may be
`desirable to terminate early, and here the unavoidable conflict between type
`I and type II error comes into play. The trade-off is between protecting
`patients from an ineffective or dangerous experimental regimen and risking
`the loss of a treatment advance. If an adverse outcome, such as toxicity,
`is monitored along with the usual efficacy outcome,
`then an alternative
`goal may be to decrease the adverse event rate while maintaining a given
`response rate. Designs which monitor multiple events, such as response and
`toxicity, are discussed in a later section.
`Ethical considerations are most pressing for rapidly fatal diseases, and the
`standards of clinical conduct for such diseases may provide a basis for
`analogous decisions in less extreme circumstances. The desirability of a
particular treatment E in a phase II trial must be assessed from the
viewpoints of the individual patient, all patients in the trial taken as a group, and
`future patients after the trial is completed. A general consideration is that
`patients are more likely to choose a physician rather than a treatment and to
`
`
`
`
`rely on their physician’s advice regarding treatment choice. The centuries-
`old process of entrusting one’s life and well-being to one’s physician is a
`fundamental part of medicine, informed consent notwithstanding. Thus, the
`trial must be designed so that trial objectives and individual patient benefit
`are not in conflict. The situation is most desperate in phase IIA trials of
`treatments for rapidly fatal diseases for which no effective treatment exists.
`The trade-off for both the individual patient and for the trial is between the
`risk of adverse treatment effects and the likelihood of any therapeutic
`benefit. For nonfatal diseases, the potential severity of adverse effects first
`must be weighed against the effects of the disease itself, and it is inappro-
`priate to conduct a trial of E if its effects are likely to be worse than those of
`the disease. Phase IIB trials often evaluate combination therapies whose
`components are already known to have antidisease activity. Consequently, a
`new combination regimen with an activity level below that of the standard is
`usually not promising for future development. Two exceptions are a trial in
`which a reduced likelihood of early response may be an acceptable trade-off
`for improved overall survival, and a trial in which the real goal is to reduce
`toxicity and a small reduction in response rate is considered an acceptable
trade-off. Examples of such trials are given in a later section.
`Patient accrual and monetary cost are absolute limits on the size of any
`clinical trial. If either the number of patients or the available resources are
`insufficient to achieve initial goals, then a smaller trial may be appropriate.
However, the magnitudes of α and β and the reliability of the final estimate
of p should be kept in mind when reducing sample size due to low accrual
rate or limited resources. The results of very small trials often are of limited
value and, due to their high variability, are potentially misleading. If re-
sources are inadequate to conduct a trial that will produce useful results,
then it is inappropriate to conduct the trial.
A simple but critical issue in trial design and conduct is definition of
patient outcome. For example, in AML, treatment response is typically
complete remission (CR), which is defined in terms of several parameters
(e.g., blast count, platelet recovery, white cell count, etc.), as measured
within a given timeframe. It is essential that CR be defined formally in the
protocol and that, however CR is defined, all clinicians involved in the trial
adhere to that definition. Otherwise, one clinician's CR may be another's
failure, which renders the recorded trial results virtually meaningless. The
same considerations apply to definition of adverse outcomes, since there are
various grades of toxicity, etc. This problem is potentially more severe in
multi-institutional phase II trials; hence, an even stronger effort must be
made to define and score patient outcomes consistently.
Short-term response in a phase II trial is used as the measure of treatment
effect. For solid tumors, however, partial response often is not a validated
measure of patient benefit. In general, the comparison of survival between
responders and nonresponders is not valid for demonstrating that treatment
has extended survival for responders [32]. Because response is often viewed
as a necessary but not sufficient condition for extending survival, response
may be used in phase II trials for screening promising treatments. To
evaluate the effectiveness of a regimen in prolonging survival, however, a
phase III trial of survival is required.
`
`Historical data and Bayesian designs
`
Most phase II trials evaluate one or more new treatments relative to a
standard therapy S; hence, they are inherently comparative, even though a
standard treatment arm usually is not included. In designing the single-
stage, single-arm trial described in the introduction to this chapter, a common
practice is to assume that p0 is a known constant (and hence that the statistic
p̂ − p0 = (Yn/n) − p0 has variance var(p̂) = p(1 − p)/n) and to determine n
to obtain a test of p = p0 versus p = p0 + δ having given type I and type II
error rates α and β. For phase IIB trials, where p0 represents the activity
level of available regimens, the numerical value of p0 used in this computa-
tion is often a statistical estimate p̂0 based on historical data, rather than a
known constant. The empirical difference p̂1 − p̂0, which is the basis for the
test, is thus the difference between two statistics and has variance larger
than the assumed p(1 − p)/n. Consequently, the sample size computed
under a model ignoring the fact that p̂0 is a statistic is incorrect. This
common practice may be due to the belief that the variability of p̂0 is of no
practical consequence or to the absence of a theoretical basis and associated
statistical software for computing sample sizes correctly.
Thall and Simon [19] derive optimal single-stage phase II designs that
incorporate historical data from one or more trials of S and account for the
variability inherent in p̂0. They consider both binary and normally distributed
responses. Because the variability between historical pilot studies sometimes
exceeds what is predicted by a binomial model for binary responses, they
use a beta-binomial model to account for possible extrabinomial variation.
Their results indicate that it is sometimes best to randomize a proportion of
patients to S, and they derive the total sample size and optimal proportions
for allocation to E and S that minimize var(p̂1 − p̂0). They also find
that an unbalanced randomization may be superior to a single-arm trial of
E alone, and that ignoring var(p̂0) may lead to trials with actual values of
α and β much higher than their nominal values. For example, consider a
trial in which p̂0 = 0.20 is based on three historical trials of 20 patients
each. To obtain a test that detects an improvement of δ = 0.20, i.e., for
alternative p1 = 0.40, with α = 0.05 and β = 0.20, the optimal design
requires 85 patients with 27 allocated to S and 58 to E. If the variability
in p̂0 is ignored and a single-arm trial of E is conducted, the standard
computation yields n = 35, and the resulting test will have actual α = 0.14
and β = 0.27. Since the numerical computations to incorporate the his-
torical data and obtain the optimal design are somewhat complicated,
a menu-driven computer program written in S-Plus has been made
available.
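The direction of this effect is easy to see with a rough normal-approximation sketch. This is my simplified illustration, not the exact beta-binomial computation of [19]: it uses pure binomial variance for the historical estimate, so it understates the inflation relative to the α = 0.14 figure quoted above, but the actual error rate still exceeds the nominal 0.05:

```python
from math import sqrt
from scipy.stats import norm

p0, n_hist, n_new = 0.20, 60, 35        # three historical trials of 20 patients
z = norm.ppf(0.95)                      # nominal one-sided alpha = 0.05

v_assumed = p0 * (1 - p0) / n_new               # variance if p0 were known
v_actual = v_assumed + p0 * (1 - p0) / n_hist   # adds var of the p0 estimate
actual_alpha = norm.sf(z * sqrt(v_assumed / v_actual))
```

Even under this optimistic binomial model, the actual type I error is roughly twice the nominal 0.05.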
`
The above method for dealing with the variability of an estimate of p0
may be regarded as a particular approach to a more general problem. Given
that in a phase II trial the success rate of E ultimately must be compared
to that of S, and that uncertainty regarding the response rate of S will always
exist, the general problem is to account for this uncertainty when planning
the trial and interpreting its results. A different statistical approach is based
on the Bayesian framework, in which the success probabilities of E and S
are regarded as random rather than fixed parameters. To underscore this
distinction, we denote the random response probabilities by θE and θS.
Although the theoretical basis for Bayesian methods is well established,
practical methods for clinical trials have been proposed only recently, notably
by Freedman and Spiegelhalter [33,34], Spiegelhalter and Freedman [35,36],
Racine et al. [37], and Berry [38,39].

Sylvester and Staquet [28] and Sylvester [29] propose decision-theoretic
Bayesian methods for phase II clinical trials. They optimize the sample size
and decision cutoff of a single-stage design where n is fixed, to determine
whether a new drug is active, by minimizing the Bayes risk. Their approach
assumes that Pr[θE = p1] = 1 − Pr[θE = p2], with p2 > p1, where p2 and
p1 are response rates at which E would and would not be considered
promising, respectively; i.e., they assume that θE may take on two pos-
sible values.
`
Herson [7] proposes the use of predictive probability (PP) as a criterion
for early termination of phase II trials to minimize the number of patients
exposed to an ineffective therapy. The PP of an event, such as concluding
that E is or is not promising according to some decision rule, is the condi-
tional probability of that event given the current data, computed by first
averaging over the prior distributions of the parameters, which are θS and
θE in the present context. Mehta and Cain [9] provide charts of early
stopping rules based on the posterior probability of [θE > p0], where p0 is a
fixed level at which E would be considered active.
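A predictive probability rule of this general kind can be sketched with a beta-binomial posterior predictive distribution; the function name, the cutoff r = 15 out of n = 40, and the Beta(1, 1) prior below are illustrative assumptions, not values from Herson [7]:

```python
from scipy.stats import betabinom

def predictive_prob(y, m, n, r, a=1.0, b=1.0):
    """PP that the final count Yn reaches the cutoff r, given y responses
    in the first m of n patients and a Beta(a, b) prior on the rate."""
    remaining = n - m
    needed = max(r - y, 0)
    if needed > remaining:
        return 0.0
    post = betabinom(remaining, a + y, b + m - y)   # posterior predictive
    return float(sum(post.pmf(x) for x in range(needed, remaining + 1)))

pp_low = predictive_prob(4, 20, 40, 15)    # 4/20 so far: PP of success is small
pp_high = predictive_prob(8, 20, 40, 15)   # 8/20 so far: PP is larger
```

A trial would be stopped early when this probability falls below a prespecified threshold; note that PP increases with the interim response count, as expected.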
`
Palmer [40] proposes a Bayesian procedure for identifying the best
of three treatments E1, E2, E3. He assumes that their respective success
probabilities are π1 = (a, b, b), π2 = (b, a, b), or π3 = (b, b, a) with prior pro-
bability 1/3 each, where b < a are known fixed standards, analogous to p0
and p0 + δ in the hypothesis-testing context. Given a maximum sample size
N, patients are first randomized among the treatments in triplets, and based
on the posterior probabilities of {π1, π2, π3} the worst treatment may be
dropped. Patients are then randomized between the two remaining treat-
ments in pairs, and the worse of the two is subsequently dropped based
on the posterior distribution. The optimality criterion is to maximize the
expected number of future treatment successes. Palmer