Producing out-of-sample predictions from a flexible parametric survival model

Download Stata Do file here

By Sarah Booth (sarah.booth@le.ac.uk)

Background

After fitting a flexible parametric survival model (FPM) it is easy to produce out-of-sample predictions in a new dataset as the model is stored in memory. However, what if you didn’t have access to the original dataset? For example, you may want to test how well a published prognostic model works in your own dataset.

The easiest way to reproduce an FPM is if you have access to the saved Stata model estimates in an .ster file. This file doesn’t include anything about the original data used to fit the model and can therefore be shared freely. The required predictions can then be produced after using estimates use to load the model.

However, as this file may not be available, in this tutorial I’ll also show how an FPM can be reconstructed using only the model coefficients and knot locations.

Fitting an example model

As an example, I’ll use a simulated dataset that is loosely based on survival following a diagnosis of cancer. It includes the following variables: ID number (id), year of diagnosis (yydx, 2009-2019), date of diagnosis (dx), date of death/censoring (exit), survival status (status, 0 = Alive, 1 = Dead), age at diagnosis (age, 44-93), sex (sex, 0 = Male, 1 = Female) and stage of tumour at diagnosis (stage, 1-3).

I’ll start by creating some dummy variables for sex and stage of tumour at diagnosis and fit a model that includes age, sex and stage of tumour as covariates. I’ll centre age on 70 so that the baseline has a meaningful interpretation.

use https://www.pclambert.net/data/simulated_survival, clear
tab stage, gen(stage)
gen female = sex == 1 
gen age_centre = age - 70

When fitting the model, I’ll use the noorthog option so that the baseline splines won’t be orthogonalized which makes it easier to re-construct the FPM without the need for any matrix algebra.

. stset exit, origin(dx) fail(status==1) scale(365.25) ///
> exit(time dx+10*365.25)

Survival-time data settings

         Failure event: status==1
Observed time interval: (origin, exit]
     Exit on or before: time dx+10*365.25
     Time for analysis: (time-origin)/365.25
                Origin: time dx

--------------------------------------------------------------------------
     55,050  total observations
          0  exclusions
--------------------------------------------------------------------------
     55,050  observations remaining, representing
     22,180  failures in single-record/single-failure data
 397,922.36  total analysis time at risk and under observation
                                                At risk from t =         0
                                     Earliest observed entry t =         0
                                          Last observed exit t =        10

. 
. stpm2 age_centre female stage2 stage3, df(5) scale(hazard) noorthog

Iteration 0:   log likelihood = -58450.715  
Iteration 1:   log likelihood = -58447.706  
Iteration 2:   log likelihood = -58447.701  

Log likelihood = -58447.701                             Number of obs = 55,050

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
xb           |
  age_centre |   .0288174   .0011243    25.63   0.000     .0266138    .0310211
      female |   -.077212   .0134341    -5.75   0.000    -.1035424   -.0508816
      stage2 |   1.211915   .0217077    55.83   0.000     1.169368    1.254461
      stage3 |   3.078478   .0214495   143.52   0.000     3.036438    3.120519
       _rcs1 |   .8072818   .0240466    33.57   0.000     .7601513    .8544123
       _rcs2 |   .0093188   .0062436     1.49   0.136    -.0029184     .021556
       _rcs3 |  -.0225463   .0219503    -1.03   0.304    -.0655681    .0204754
       _rcs4 |   .0177935   .0393819     0.45   0.651    -.0593935    .0949805
       _rcs5 |   -.003851   .0377596    -0.10   0.919    -.0778584    .0701564
       _cons |  -3.654475   .0716417   -51.01   0.000     -3.79489    -3.51406
------------------------------------------------------------------------------

Producing the predictions using the model stored in memory

I’ll now use the model stored in memory to produce 10-year survival curves for the following covariate patterns:

  • 70 years old (age_centre = 70 - 70 = 0) male with a stage 1 tumour (baseline)

  • 80 years old (age_centre = 80 - 70 = 10) female with a stage 2 tumour

  • 85 years old (age_centre = 85 - 70 = 15) male with a stage 3 tumour

range timevar10 0 10 100
predict s0, zeros surv timevar(timevar10)
predict s1, at(age_centre 10 female 1 stage2 1) zeros surv timevar(timevar10)
predict s2, at(age_centre 15 stage3 1) zeros surv timevar(timevar10)

Producing the predictions using an .ster file

If the .ster file is available, this can be loaded in to Stata in order to produce the predictions.

. estimates use https://www.pclambert.net/data/oosmodel.ster

. predict s0_ster, zeros surv timevar(timevar10)

. predict s1_ster, at(age_centre 10 female 1 stage2 1) zeros surv timevar(timevar10)

. predict s2_ster, at(age_centre 15 stage3 1) zeros surv timevar(timevar10)

Producing the predictions using the model coefficients

A PH flexible parametric survival model is fitted on the log cumulative hazard scale where a restricted cubic spline function $\zeta (\ln(t)|\gamma,k_0)$ is used to model the baseline. Here, $k_0$ is the vector of knot locations, the $z$ terms are the derived variables (basis functions) and the $\gamma$ terms are the model coefficients for these derived variables. The linear predictor $\beta x_i$ contains the log hazard ratios. Using this information, I’ll now show how these predictions can be produced when the model is not stored in memory.

$$ \ln[H(t|x_i)] = \ln[H_0(t)] + \beta x_i = \zeta (\ln(t)|\gamma,k_0) + \beta x_i $$

$$ \zeta (\ln(t)|\gamma,k_0) = \gamma_0 + \gamma_1 z_{1} + \gamma_2 z_{2} + … + \gamma_{K-1} z_{K-1} $$

The table below displays the model coefficients (log hazard ratios) from the model. The knot locations on the log time scale (stored in e(ln_bhknots)) are also required to produce the predictions. For this model the knots are positioned at: -5.875508413078693 -0.6125073943959456 0.4430906852086161 1.156903994968461 1.750977782183514 2.302560494813347 on the log time scale.

Variable Coefficient
Age (centred) 0.02881744
Female -0.07721197
Stage 2 1.21191450
Stage 3 3.07847830
$\gamma_0$ (_cons) -3.65447470
$\gamma_1$ (_rcs1) 0.80728180
$\gamma_2$ (_rcs2) 0.00931881
$\gamma_3$ (_rcs3) -0.02254634
$\gamma_4$ (_rcs4) 0.01779351
$\gamma_5$ (_rcs5) -0.00385097

The first step is to generate the restricted cubic spline functions of log time which can be achieved using rcsgen and specifying the list of knot locations.

gen double lntime = ln(timevar10)
rcsgen lntime, gen(z) knots(-5.875508413078693 -0.6125073943959456 ///
               0.4430906852086161 1.156903994968461 1.750977782183514 2.302560494813347)

The log cumulative hazard can then be calculated for each of the covariate patterns using the above equations. These can then be transformed to the survival scale using the following equation:

$$ S(t|x_i) = \exp \left[ - [H(t|x_i)] \right] = \exp \left[ - \exp \left[\ln[H(t|x_i)] \right] \right] $$

The survival predictions for time zero will be undefined as the log of zero is required in this calculation. However, by definition S(0) = 1, and therefore this can be added manually.

// Calculate the log cumulative hazard
gen double logCH_0 = -3.6544747 + 0.8072818*z1 + 0.00931881*z2 + -0.02254634*z3 + ///
                     0.01779351*z4 + -0.00385097*z5
gen double logCH_1 = logCH_0 + 10*0.02881744 + -0.07721197 + 1.2119145  
gen double logCH_2 = logCH_0 + 15*0.02881744 + 3.0784783 

// Transform to the survival scale
gen double s0_rcs = exp(-exp(logCH_0))
gen double s1_rcs = exp(-exp(logCH_1))
gen double s2_rcs = exp(-exp(logCH_2))

// S(0) = 1
forvalues i = 0/2 {
	replace s`i'_rcs = 1 in 1
}

Finally, we can check that the survival predictions match those that we originally calculated when the model was stored in memory. Here there is agreement to at least 5 decimal places but the predictions that were produced using the model coefficients are slightly different due to the precision that the model coefficients, knots and variables were stored to.

. summ s0* s1* s2*

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          s0 |        100     .922136    .0380317   .8633618          1
     s0_ster |        100     .922136    .0380317   .8633618          1
      s0_rcs |        100     .922136    .0380316   .8633618          1
          s1 |        100    .7223094    .1249454   .5435774          1
     s1_ster |        100    .7223094    .1249454   .5435774          1
-------------+---------------------------------------------------------
      s1_rcs |        100    .7223094    .1249453   .5435776          1
          s2 |        100    .1542438    .2111177   .0073144          1
     s2_ster |        100    .1542438    .2111177   .0073144          1
      s2_rcs |        100    .1542438    .2111177   .0073144          1

References

Royston, P. & Lambert, P. C. Flexible Parametric Survival Analysis Using Stata: Beyond the Cox Model, 2011

Biostatistician