Editorial Note: This post offers Jon Jureidini, David Healy, Mickey Nardo, Melissa Raven, Elia Abi Jaoude, Catalin Tufanaru, and Joanna Le Noury’s response to the Keller letter. It was published in BMJ in early February. All material is available on Study329.org.
Re: Restoring Study 329: Response to Keller
The response by Keller and selected colleagues [1] to our Restoring Study 329 article [2] alleges three overarching faults: bias and lack of blind ratings in relation to harms; lack of detailed methodology; and failure to consider the available methodological knowledge regarding paediatric depression from twenty-four years ago. Regarding the second issue, there is in fact a detailed explanation of all the methods in our paper and its RIAT Audit Record (appendix 1). We tackle the first and third issues below.
While there was uncertainty twenty-four years ago about the appropriate rating scale to use in paediatric depression trials, there were serious methodological problems in the conduct and reporting of Study 329 that have nothing to do with that uncertainty. Instead, in their reporting of efficacy in Study 329 [3], and in their defence of it, Keller and colleagues have asked that the field suspend many widely held tenets about clinical trial analysis, by asking us to do the following:
- accept that the a priori protocol is not binding, and that changes can be made to the outcome variables while the study is ongoing, without amending the protocol with the IRB or documenting the rationale for the change
- ignore the requirement to correct the threshold of significance for the analysis of multiple variables
- ignore the requirement that when there are more than two groups, preliminary omnibus statistical analysis needs to be done prior to making any pairwise comparisons between groups – an integral part of the ANOVA analysis declared in the Study 329 protocol
- allow the parametric analysis of rank-order, ordinal rating scales [CGI, HAM-D and K-SADS-L Depressed Mood Items] rather than the expected non-parametric methods specifically derived for this kind of data
- allow 19 outcome measures to be added to the original eight at various times up to and after the breaking of the blind, purportedly according to an analytical plan ‘developed prior to opening of the blind’ (In spite of multiple requests, neither GSK nor Keller and colleagues have ever produced this analytic plan, suggesting that either it does not exist, or that it contains information unsympathetic to their claims.)
- accept the dismissal of protocol-specified secondary outcomes and the introduction of rogue variables on the grounds that ‘the Hamilton Depression Rating Scale (our primary outcome measure) had significant limitations in assessing mood disturbance in younger patients’, when none of the protocol-specified secondary outcome measures that they discarded were based on the HAM-D, and two of the rogue measures that they introduced were HAM-D measures
- accept the clinically dubious improvements in four of these rogue variables as evidence of efficacy. (Although these measures achieved statistical significance in the pre-defined eighth (final) week of the acute phase of the study, they did not do so in the weekly assessments over the previous seven weeks, a pattern unseen in any known antidepressant; we are working on another manuscript analysing Keller et al’s rogue variables.)
There was no ambiguity about the appropriateness of these methodological manoeuvres when Study 329 was conducted and reported. However, although some of these problems were obvious when the paper was first published, others were not apparent until we had access to the raw clinical data. This lack of transparency erodes confidence that RCTs will be conducted, analysed and reported free from covert manipulation.
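The multiple-comparisons concern in the list above can be illustrated with a minimal sketch. It assumes a conventional significance threshold of 0.05 and, for simplicity, statistically independent tests across all 27 outcome measures (the original eight plus the 19 added). Both are illustrative assumptions rather than details taken from the Study 329 protocol, and the function names are hypothetical.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across n independent tests.

    Independence is an assumption made for illustration; outcome measures
    in a real trial are usually correlated.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    # Original eight outcomes plus the 19 added later (figures from the text).
    n_outcomes = 8 + 19
    fwer = familywise_error_rate(n_outcomes)
    threshold = bonferroni_threshold(n_outcomes)
    print(f"Uncorrected family-wise error rate: {fwer:.2f}")
    print(f"Bonferroni per-test threshold: {threshold:.5f}")
```

Under these assumptions, about three in four analyses of a treatment with no true effect would yield at least one nominally significant result by chance, and the corrected per-test threshold is roughly 0.0019. Because real outcome measures are correlated, Bonferroni is conservative, but the direction of the argument is unchanged.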
Furthermore, Keller and colleagues failed to report on the continuation phase of Study 329, even though that was a protocol-specified outcome. Our report of this phase is almost ready for submission.
With regard to harms, Keller and colleagues are simply incorrect in many of their claims about our purported bias and lack of blind ratings.
First, our paper makes it clear that both coders in the re-analysis were blind to randomisation status.
Second, there was no ‘re-scoring’. This odd choice of words raises doubts that Keller et al have much expertise in analysing harms. We used a dictionary that adhered much more closely to the verbatim terms used by the face-to-face interviewers. The fact that Keller and colleagues say that we have labelled emotional lability as suicidality makes us wonder if they have seen the individual patient level data; it was the SKB coders who came up with the term ‘emotional lability’, not the face-to-face interviewers, whose verbatim terms were of suicidal thoughts and behaviour. Simply using the verbatim terms that the named authors or their colleagues had used when faced with these adolescents reveals a striking rate of suicidal events. To argue that our return to these verbatim terms was arbitrary is bizarre.
Third, we made it clear there is unavoidable uncertainty in coding, and we invited others to download the data we have made available and juggle it to see if they can improve on our categorisation of the data. In our correspondence with BMJ, we made it clear that there are items that GSK could argue are more appropriately coded differently. We would be receptive to a rationale for alternate coding of certain items that is cogently argued rather than simply asserted, but our hunch is that a disinterested observer reviewing the coding as presented by GSK across all 1500 adverse effects in this study (or 2000+ if we include the continuation phase) would conclude that our efforts are a better representation of the data.
Fourth, reading our paper makes it clear why we reviewed the clinical records of 93 subjects; these were the subjects who dropped out or became suicidal. Our claims about underreporting of adverse events stand independently of that non-random sub-sample.
With regard to suicidal ideation and attempts, Keller et al. refer to a reanalysis by Bridge and colleagues, which found that there was no significant difference in suicidality between paroxetine and placebo. But Bridge et al. relied on Keller et al.’s misleading 2001 report.
With regard to bias, our point was that the best protection against bias is rigorous adherence to predetermined protocols and making data freely available. We, like everyone, are subject to the unwitting influence of our bias. The question is whether the Keller et al publication of 2001 manifests unconscious bias or deliberate misrepresentation.
The original and restored studies, the study data, reviews and responses are all available at Study329.org, offering a broad range of options when it comes to consideration of authorship, research misconduct and the newly described species, ‘research parasite’ [4].
1. Keller MB, Birmaher B, Carlson GA, Clarke GN, Emslie GJ, Koplewicz H, Kutcher S, Ryan N, Sack WH, Strober M. Restoring Study 329: efficacy and harms of paroxetine and imipramine in treatment of major depression in adolescence. Response from the authors of the original Study 329. BMJ 2015;351:h4320.
2. Le Noury J, Nardo JM, Healy D, Jureidini J, Raven M, Tufanaru C, Abi-Jaoude E. Restoring Study 329: efficacy and harms of paroxetine and imipramine in treatment of major depression in adolescence. BMJ 2015;351:h4320.
3. Keller MB, Ryan ND, Strober M, et al. Efficacy of paroxetine in the treatment of adolescent major depression: a randomized, controlled trial. J Am Acad Child Adolesc Psychiatry 2001;40:762-72.
4. Longo DL, Drazen JM. Data sharing. N Engl J Med 2016;374:276-7.