How meaningful are studies that are stopped early?

arznei-telegramm 2005; 36: 107

Randomised clinical studies that are stopped prematurely because a benefit of the tested intervention is recognised generally attract special attention and not infrequently affect treatment standards. The most recent example is the situation with trastuzumab (HERCEPTIN) in the adjuvant treatment of breast cancer (a-t 2005; 36: 96-8). However, the authors of a systematic review are now calling for the results of such studies to be viewed with scepticism (1). They assessed studies stopped early due to apparent benefit for their frequency, the extent and plausibility of the treatment effect, and the quality of the published information. They found 143 studies stopped early between 1975 and 2004, over half in the fields of cardiology, cancer and HIV/AIDS. More than one in two came from the New England Journal of Medicine (55) or The Lancet (27). In the period studied, the proportion that such reports account for out of all the randomised studies listed on Medline increased ten-fold from 0.01% to 0.1%. The quality of the publications in terms of information relevant to the early discontinuation (e.g. planned sample size or the interim analysis after which the study was ended) also falls short: only eight studies (6%) reported criteria of importance for assessment. At the time the study was stopped, an average of 63% of planned participants had been recruited, the median follow-up time was 13 months, and the analysis was based on a median of 66 events (1).

Studies stopped prematurely for apparent benefit frequently show a strikingly large treatment effect (ratio of the event rate in the intervention group to that in the control group). To justify the stoppage and to exclude chance effects as reliably as possible despite multiple analyses, stringent criteria are set by choosing a very low p-value threshold, for example p < 0.001. Particularly in studies with few events, the risk reduction must be 50% or more to achieve a p-value of this kind (2). Compared with the therapeutic effect of 25% to 30% usually achieved, such a result often does not seem plausible. Furthermore, there is always the danger that, despite every precaution, it is a random high. The likelihood of early termination rises whenever the observed effect is large, whether it is real or a purely chance fluctuation, so studies stopped early for benefit carry an increased risk that random effects are involved (3).
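A rough calculation illustrates why so large a risk reduction is needed when events are few. The sketch below is only an illustration under stated assumptions: 1:1 randomisation with equal follow-up, and a normal approximation to the test of how the events split between the two arms. The function name and these simplifications are ours and are not taken from the cited editorial (2).

```python
import math
from statistics import NormalDist


def required_risk_reduction(total_events, p_threshold=0.001):
    """Smallest relative risk reduction giving a two-sided p below p_threshold.

    Illustrative assumptions (not from the source): equal arm sizes and
    follow-up, and the approximation z = (c - t) / sqrt(c + t), where c and t
    are the event counts in the control and treatment arms.
    """
    # Critical z value for the two-sided threshold (about 3.29 for p = 0.001).
    z = NormalDist().inv_cdf(1 - p_threshold / 2)
    # With n = c + t events and rate ratio r = t / c, z = sqrt(n) * (1 - r) / (1 + r).
    k = z / math.sqrt(total_events)
    if k >= 1:
        return None  # too few events: the threshold cannot be reached at all
    rate_ratio = (1 - k) / (1 + k)
    return 1 - rate_ratio


for n in (30, 66, 100, 200, 500):
    rrr = required_risk_reduction(n)
    print(f"{n:4d} events: risk reduction of at least {rrr:.0%}")
```

Under these assumptions, the median of 66 events reported in the review (1) corresponds to a required risk reduction of roughly 58%, and several hundred events are needed before an effect of 25% to 30% could cross a threshold of p < 0.001.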

Interim analyses can therefore be misleading despite highly significant results. This is made clear by studies that were continued and at the end show only weakly positive or even negative results:

In the CHARM* study, which consisted of three separate trials and investigated the benefit of the angiotensin-II-blocker candesartan (ATACAND, BLOPRESS) in heart failure (a-t 2003; 34: 81-2), all-cause mortality in all the participants was calculated every six months. At the fourth interim analysis, when patient recruitment was nearly complete, the boundary set by the pre-determined threshold value of p < 0.001, from which early termination was to be considered, had been crossed (260 vs. 339 deaths, hazard ratio [HR] 0.76; 95% confidence interval [CI] 0.64-0.87; p = 0.0006). Nevertheless, the study was continued because, amongst other reasons, the results in two of the three separate trials did not even reach the usual significance level of p = 0.05 and the follow-up period was short. Furthermore, the members of the independent data-monitoring committee were aware that the treatment effects of trials stopped early are often greatly exaggerated and that a "regression to the truth" is possible with continued observation. In subsequent analyses, the benefit of candesartan steadily diminished. In the final analysis two years later, a significant difference in all-cause mortality could no longer be found (886 vs. 945 deaths; HR 0.91; 95% CI 0.83-1.00; p = 0.055) (4).

In the OPTIMIST* study in patients with severe sepsis, the second interim analysis for tifacogin also showed enhanced survival versus placebo (mortality 29.1% vs. 38.9%; p = 0.006). At the end of the study, the mortality rate was numerically higher on tifacogin (34.2% vs. 33.9%) (5).

The twelfth Medical Research Council acute myeloid leukaemia trial showed a fifth cycle of chemotherapy to offer no survival advantage over a four-cycle regimen (HR 1.09; 95% CI 0.87-1.37; p = 0.4). Two interim analyses had previously shown highly significant effects in favour of the additional cycle (HR 0.47; p = 0.003 and HR 0.53; p = 0.002). The reporting authors, the head of the data-monitoring committee and the study statisticians warned against deciding to stop a study early solely on the basis of rigid threshold values, without considering the context. They pointed out that chance effects do occur and happen "more frequently than many clinicians realize" (6).
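The pattern in these three trials, a striking interim effect that shrinks with continued follow-up, can be reproduced in a small simulation. The following is a minimal sketch, not a model of any of the trials cited: it assumes a binary outcome, a true relative risk of 0.75, a control-arm risk of 30%, 400 patients per arm, and three interim looks with a stopping boundary equivalent to p < 0.001. All names and parameter values are hypothetical choices made for illustration.

```python
import math
import random
from statistics import mean

TRUE_RR = 0.75       # assumed true relative risk (a 25% risk reduction)
CONTROL_RISK = 0.30  # assumed event risk in the control arm
Z_STOP = 3.29        # boundary corresponding to a two-sided p < 0.001


def z_statistic(events_t, events_c, n_per_arm):
    """Pooled two-proportion z statistic; positive values favour treatment."""
    pooled = (events_t + events_c) / (2 * n_per_arm)
    se = math.sqrt(2 * pooled * (1 - pooled) / n_per_arm)
    return (events_c - events_t) / n_per_arm / se if se > 0 else 0.0


def simulate_trial(rng, n_per_arm=400, looks=4):
    """Return (stopped_early, observed relative risk) for one simulated trial."""
    events_t = events_c = 0
    step = n_per_arm // looks
    for look in range(1, looks + 1):
        # Recruit the next block of patients in each arm (True counts as 1).
        for _ in range(step):
            events_t += rng.random() < CONTROL_RISK * TRUE_RR
            events_c += rng.random() < CONTROL_RISK
        recruited = look * step
        is_interim = look < looks
        # Stop only at an interim look, and only for apparent benefit.
        if is_interim and z_statistic(events_t, events_c, recruited) > Z_STOP:
            return True, events_t / events_c
    return False, events_t / max(events_c, 1)


def run_simulation(n_trials=2000, seed=1):
    """Simulate many identical trials; split observed RRs by stopping status."""
    rng = random.Random(seed)
    early, completed = [], []
    for _ in range(n_trials):
        stopped, observed_rr = simulate_trial(rng)
        (early if stopped else completed).append(observed_rr)
    return early, completed


early, completed = run_simulation()
print(f"true relative risk:                  {TRUE_RR:.2f}")
print(f"trials stopped early:                {len(early)} of {len(early) + len(completed)}")
print(f"mean observed RR, stopped early:     {mean(early):.2f}")
print(f"mean observed RR, run to completion: {mean(completed):.2f}")
```

Every simulated trial has the same true effect, so any difference between the two groups is produced by the stopping rule alone: the trials that cross the boundary at an interim look are those that happened to be running above the true effect at that moment.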

Besides fundamental reservations about the results of trials stopped early, the choice of the endpoint on which discontinuation is decided also plays an important role. In cancer, for example, treatment interventions are supposed to prolong life and/or improve quality of life. However, in adjuvant treatment situations, such as the trials of trastuzumab or aromatase inhibitors in breast cancer, the decision to stop the trial early is usually based on a benefit in terms of disease-free survival (7). An effect on all-cause mortality, by contrast, is not sufficiently proven and in some circumstances cannot be explained. Crucial safety questions also remain open, not least because the short follow-up period yields too few adverse events to allow sufficiently reliable statements. Similar reservations apply to the endpoint of "progression-free survival" in advanced cancer (1,7). Where there are combined endpoints, it matters that the benefit does not rest solely on the event of less importance to patients (e.g. a fall in the angina pectoris rate with a combined endpoint of death, myocardial infarction or angina) (1).

In most cases, ethical reasons are cited as a justification for early stoppage: since the "benefit" of the intervention has been proven, it can no longer be withheld from the control group. However, the interests of patients in randomised trials have to be safeguarded while also protecting society (and, of course, the participants) from "overzealous premature claims" of apparent treatment successes (2).

What are the implications of this for trastuzumab in adjuvant breast cancer treatment? "The best that can be said about Herceptin's efficacy and safety for the treatment of early breast cancer is that the available evidence is insufficient to make reliable judgments", commented The Lancet, which also noted that the manufacturer, Roche, does not yet have sufficient data to submit to the licensing authorities (8).

	Controlled clinical trials are increasingly being stopped early due to highly significant positive interim results.
	There is a danger that the - often strikingly large - treatment effect is a random high that will become smaller if the study is continued.
	Where premature stoppage of a trial is being considered, a statistical "threshold value" should therefore be regarded as only one aspect among others.
	The availability of sufficient data on which to assess patient-relevant endpoints such as overall survival or safety is more crucial.

		(R = randomised study, M = meta-analysis)
M	1	MONTORI, V.M. et al.: JAMA 2005; 294: 2203-9
	2	POCOCK, S.J.: JAMA 2005; 294: 2228-30
	3	SCHULZ, K.F., GRIMES, D.A.: Lancet 2005; 365: 1657-61
	4	POCOCK, S.: Am. Heart J. 2005; 149: 939-43
R	5	ABRAHAM, E. et al.: JAMA 2003; 290: 238-47
	6	WHEATLEY, K., CLAYTON, D.: Contr. Clin. Trials 2003; 24: 66-70
	7	CANNISTRA, S.A.: J. Clin. Oncol. 2004; 22: 1542-5
	8	The Lancet: Lancet 2005; 366: 1673
*		CHARM = Candesartan in Heart Failure Assessment of Reduction in Mortality and Morbidity; OPTIMIST = Optimized Phase 3 Tifacogin in Multicenter International Sepsis Trial

© arznei-telegramm 12/05