Propensity weighting to generate effectiveness estimates


Randomised controlled trials (RCTs) of new medicines in oncology and other disease areas are often undertaken in study populations drawn from many centres, satisfying strict inclusion and exclusion criteria. These trials are primarily designed to meet the regulatory requirements for marketing authorisation, with tight protocols and high internal validity. However, generalisability of trial results to specific national or local ‘reimbursable’ populations may be questioned, in particular when results are presented for health technology assessment (HTA).

What is it?

The propensity weighting method uses an external real-world data (RWD) source to derive propensity scores. These are used to reweight data available from an RCT to provide estimates of relative effectiveness that may be more relevant for decision makers.

In this context the ‘propensity score’ denotes the estimated probability that a patient with given baseline characteristics would be enrolled in the trial of interest, rather than belonging to the real-world population.
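
In symbols, and using notation that is an assumption of this sketch rather than taken from the case study: let S = 1 indicate enrolment in the trial and S = 0 membership of the RWD source, and let X denote the baseline covariates. If a logistic model is used, the propensity score can be written as:

```latex
e(x) = \Pr(S = 1 \mid X = x)
     = \frac{1}{1 + \exp\{-(\beta_0 + \beta^{\top} x)\}}
```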

In a GetReal case study, the method was explored using RCT data from the JMDB trial (Scagliotti et al., 2008) and RWD from the FRAME study (Schnabel et al., 2012) for pemetrexed in stage IIIB/IV non-small-cell lung cancer (NSCLC). Data were made available to GetReal by Eli Lilly and Co. (a GetReal partner). Estimates of the relative effectiveness of pemetrexed + cisplatin vs. gemcitabine + cisplatin in stage IIIB/IV NSCLC were generated for the population described by the FRAME study, which was assumed to represent a target population for reimbursement. Although NSCLC was chosen as a suitable disease for this case study, the same method can in principle be applied in other disease areas.

The analysis was carried out as follows:

  • Patient data from the JMDB and FRAME studies were combined, using a membership indicator to denote each study.
  • A logistic regression model was constructed using the membership indicator as the dependent variable and the other covariates as the independent variables.
  • Standardised differences were used to examine the propensity-adjusted balance between the source studies and between the weighted treatment cohorts in the RCT (JMDB study).
  • Each patient’s propensity score was calculated by applying the logistic model to the given covariates for that patient.
  • Efficacy in the RCT was re-calibrated relative to the FRAME population using the general approach proposed by Cole & Stuart (2010). Each patient in the JMDB trial was assigned a weight based on the inverse of their propensity score, so that the weighted trial population resembles a random sample from the FRAME study.
  • The quality of the calibration was assessed by comparing demographic characteristics, and the balance in baseline covariates after weighting, between the RCT study groups.
  • Overall and progression-free survival were computed for each patient in the trial, and calibrated efficacy was estimated as the hazard ratio (HR) from a weighted Cox proportional hazards model comparing the study arms.
  • Adjusted outcomes for treatments in the reweighted RCT were also compared with outcomes in FRAME for the same treatments, using Cox proportional hazards and Kaplan-Meier methods.
  • The variability and confidence interval of the calibrated efficacy were assessed using a non-parametric bootstrap procedure.
  • As a sensitivity analysis, a more refined weighting algorithm was also applied, using entropy weights (Hainmueller, 2012).
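
The modelling, weighting and balance-checking steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the GetReal analysis code: the data are invented, the single covariate (an indicator for multiple metastases) and all function names are hypothetical, the model is fitted by simple gradient ascent, and the weights use the inverse odds (1 - p) / p, one common way of weighting trial patients towards an external population. The exact form of inverse weighting used in the case study is not stated here.

```python
import math


def predict(beta, x):
    """Logistic model: estimated probability of trial membership given x."""
    z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, s, lr=0.5, epochs=2000):
    """Fit a logistic regression by gradient ascent (toy optimiser)."""
    beta = [0.0] * (len(X[0]) + 1)  # intercept followed by coefficients
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(beta)
        for x, si in zip(X, s):
            err = si - predict(beta, x)
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def weighted_mean_var(values, weights):
    """Weighted mean and (population) variance of a covariate."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, var


def standardised_difference(a, wa, b, wb):
    """Difference in weighted means divided by the pooled standard deviation."""
    ma, va = weighted_mean_var(a, wa)
    mb, vb = weighted_mean_var(b, wb)
    return (ma - mb) / math.sqrt((va + vb) / 2.0)


# Hypothetical data: 1 = multiple metastases, 0 = otherwise.
trial = [[0], [0], [0], [0], [1], [1]]  # membership indicator s = 1
rwd = [[0], [0], [1], [1], [1], [1]]    # membership indicator s = 0
X = trial + rwd
s = [1] * len(trial) + [0] * len(rwd)

beta = fit_logistic(X, s)
scores = [predict(beta, x) for x in trial]   # propensity scores, trial patients
weights = [(1.0 - p) / p for p in scores]    # assumption: inverse-odds weights

trial_m = [x[0] for x in trial]
rwd_m = [x[0] for x in rwd]
ones_trial = [1.0] * len(trial_m)
ones_rwd = [1.0] * len(rwd_m)
smd_before = standardised_difference(trial_m, ones_trial, rwd_m, ones_rwd)
smd_after = standardised_difference(trial_m, weights, rwd_m, ones_rwd)
print(round(smd_before, 3), round(smd_after, 3))
```

With these toy data the standardised difference for the covariate is about -0.71 before weighting and close to zero afterwards, which is the kind of improvement in balance the calibration check looks for.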

What were the results?

The figure below shows the distribution of propensity scores for the JMDB trial (in red) and the FRAME study (in black). The difference between the two distributions is largely driven by the number of metastases present.

Figure. Distribution of propensity scores for JMDB and FRAME populations


Results of propensity-weighted re-analysis of JMDB RCT using RWD from FRAME
The reweighted analysis of the clinical trial yielded an HR closer to 1, with greater uncertainty (HR: 0.91, 95% CI: 0.60 to 1.33), than the original analysis in a similar population in the clinical trial (HR: 0.85, 95% CI: 0.75 to 0.97). The difference in median overall survival between the arms appeared more pronounced after weighting (15.57 vs. 10.15 months, compared with 11.14 vs. 10.15 months without weighting). Sensitivity analyses of both the reweighting method and the inclusion of baseline covariates gave broadly similar results.
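
To make the weighted estimate concrete, the sketch below fits a weighted Cox proportional hazards model with a single binary covariate (study arm) by Newton-Raphson, using the Breslow form of the partial likelihood. It is a minimal illustration on invented data, not the case study code: the function name, the toy survival times and the unit weights are all assumptions, and in practice an established survival analysis package would normally be used.

```python
import math


def weighted_cox_hr(times, events, arm, weights, iters=25):
    """Hazard ratio for arm = 1 vs. arm = 0 from a weighted Cox model.

    Maximises the weighted Breslow partial likelihood for one binary
    covariate by Newton-Raphson and returns exp(beta).
    """
    beta = 0.0
    n = len(times)
    for _ in range(iters):
        score, info = 0.0, 0.0
        for i in range(n):
            if not events[i]:
                continue  # censored patients only contribute to risk sets
            s0 = s1 = 0.0
            for j in range(n):
                if times[j] >= times[i]:  # patient j is still at risk
                    r = weights[j] * math.exp(beta * arm[j])
                    s0 += r
                    s1 += r * arm[j]
            mean = s1 / s0  # weighted share of arm 1 in the risk set
            score += weights[i] * (arm[i] - mean)
            info += weights[i] * mean * (1.0 - mean)  # valid for a 0/1 covariate
        if info == 0.0:
            break
        step = score / info
        beta += step
        if abs(step) < 1e-10:
            break
    return math.exp(beta)


# Hypothetical survival times (months); every patient has an observed event.
times = [1, 3, 5, 7, 9, 2, 4, 6, 8, 10]
events = [1] * 10
arm = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # 0 = comparator, 1 = new treatment
weights = [1.0] * 10                   # placeholder for propensity-based weights

hr = weighted_cox_hr(times, events, arm, weights)
print(round(hr, 3))
```

With equal weights, as here, this reduces to an ordinary Cox fit. In a reweighting analysis the weights would instead come from the propensity model, and the resulting HR would describe the reweighted trial population.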

Table. Results with propensity weighting compared to the original analysis.

Category      Treatment    N    Median OS (months)  HR     95% LCL                   95% UCL
No weighting  Gemcitabine  608  10.15               0.851  0.746                     0.972
              Pemetrexed   614  11.14
                                                           Bootstrap 2.5 percentile  Bootstrap 97.5 percentile
Weighted      Gemcitabine  593  10.15               0.915  0.599                     1.333
              Pemetrexed   616  15.57
Abbreviations: HR – hazard ratio, OS – overall survival, LCL – lower confidence limit, UCL – upper confidence limit.
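
The bootstrap percentile limits reported for the weighted analysis can be illustrated with a generic non-parametric bootstrap. The sketch below is an assumption-laden toy: the function name and data are hypothetical, and the statistic is a simple weighted mean rather than a hazard ratio. In a fuller analysis the statistic would presumably be the weighted HR, possibly with the propensity model refitted in each resample; the source does not specify this.

```python
import random


def bootstrap_percentile_ci(rows, statistic, n_boot=2000, seed=0):
    """2.5th and 97.5th percentiles of a statistic over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(rows)
    stats = []
    for _ in range(n_boot):
        # Resample patients with replacement, keeping each row intact.
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        stats.append(statistic(sample))
    stats.sort()
    lower = stats[int(0.025 * (n_boot - 1))]
    upper = stats[int(0.975 * (n_boot - 1))]
    return lower, upper


def weighted_mean(rows):
    """Toy statistic: weighted mean of the outcome column."""
    return sum(v * w for v, w in rows) / sum(w for _, w in rows)


# Hypothetical (outcome, weight) pairs for ten patients.
rows = [(10.2, 0.5), (11.4, 0.5), (9.1, 2.0), (15.3, 2.0), (12.7, 1.0),
        (8.4, 1.0), (14.0, 0.8), (13.1, 1.2), (7.9, 0.6), (16.2, 1.5)]

point = weighted_mean(rows)
lower, upper = bootstrap_percentile_ci(rows, weighted_mean)
print(round(point, 2), round(lower, 2), round(upper, 2))
```

Resampling whole patients keeps each outcome paired with its weight, so the interval reflects the extra variability that unequal weights introduce.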

Effectiveness challenge addressed by the method

This analytical method allows a real-world treatment effect to be predicted based on trial efficacy data, while accounting for a possible efficacy-effectiveness gap. The gap here is due to the trial population having different characteristics to the target population for (local) reimbursement.

When is it useful?

  • For many disease areas: this method can in principle be applied to medicines in a variety of disease areas.
  • After phase 3 trials: its most obvious use is after pivotal phase 3 trials have concluded, as a supplementary analysis for HTA submissions, where it can be used to demonstrate the potential difference in effectiveness that might be expected between the clinical trial population and the local target population. However, it is unlikely that HTA agencies will accept results using this method in preference to unadjusted trial results.
  • To project overall survival: as demonstrated in the case study, in the case of cancer medicines the method can be combined with modelling techniques to project overall survival, the effectiveness measure favoured by HTA agencies such as NICE in the UK.
  • To support trial effectiveness results: the method might be used to support high levels of effectiveness reported in a trial; in this case by demonstrating a likely small efficacy-effectiveness gap.
  • For single-arm trials: it may be possible to use the method to create comparisons with other treatments when data are available from a single-arm trial only (for example, in rare diseases).
  • To support research and development (R&D): this method may also be of value to pharmaceutical R&D when planning trials. An early analysis of this type may inform decisions on the late development programme, for example the design of the phase 3a trials (site and patient selection criteria) and the mix of phase 3a and other studies such as pragmatic controlled trials (PCTs). If the analysis based on phase 2b or 3a data suggests that expected effectiveness is greater than observed clinical trial efficacy, this may indicate the value of conducting a further study such as a PCT.

What are the limitations?

  • Availability of RWD: observational data for a real-world population need to be available, containing the important demographic variables and risk factors as well as health outcomes of interest.
  • RWD on the target population before authorisation: although the RWD used in the NSCLC case study included data on the treatment comparator of interest, such data are unlikely to be available for many medicines before marketing authorisation. However, RWD can still provide information on the target population, which can be used to derive propensity scores.
  • Unmeasured covariates: the analysis may be invalidated if there are important unmeasured covariates. This may be tested by comparing the outcome in the RWD dataset with the outcome in the reweighted placebo/control arm of the trial.
  • Similar RCT and RWD populations: this method may not be feasible if the trial population differs too substantially from the real-world patient population that is used for the reweighting.
  • Similar RWD and target population: the population providing RWD needs to be similar to the target population required by decision makers.
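
One simple diagnostic for the overlap limitation, offered here as an illustration rather than as part of the case study, is the Kish effective sample size of the weighted trial. When the trial and real-world populations differ substantially, a few patients receive very large weights and the effective sample size falls well below the number of patients. The function name and the weights below are hypothetical.

```python
def effective_sample_size(weights):
    """Kish effective sample size: (sum of weights)^2 / sum of squared weights."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


# Hypothetical weights for six trial patients.
weights = [0.5, 0.5, 0.5, 0.5, 2.0, 2.0]
ess = effective_sample_size(weights)
print(ess)  # 4.0, compared with 6 patients
```

An effective sample size far below the trial size suggests that the weighted estimate rests on a small number of patients and that its confidence interval is likely to be wide.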

What do stakeholders say?

Stakeholder views on this analysis were sought through a GetReal workshop held in Frankfurt (10th Sept 2015). For more information, see a summary of the GetReal case study for propensity weighting and extrapolation.

Key contributors

Michael Happich and Mark Belger, Lilly
Prof. Keith Abrams, University of Leicester
Mike Chambers, GSK