Producing out-of-sample predictions from a flexible parametric survival model
By Sarah Booth (sarah.booth@le.ac.uk)
Background
After fitting a flexible parametric survival model (FPM) it is easy to produce out-of-sample predictions in a new dataset as the model is stored in memory. However, what if you didn’t have access to the original dataset? For example, you may want to test how well a published prognostic model works in your own dataset.
The easiest way to reproduce an FPM is if you have access to the saved Stata model estimates in an .ster
file. This file doesn’t include anything about the original data used to fit the model and can therefore be shared freely. The required predictions can then be produced after using estimates use
to load the model.
However, as this file may not be available, in this tutorial I’ll also show how an FPM can be reconstructed using only the model coefficients and knot locations.
Fitting an example model
As an example, I’ll use a simulated dataset that is loosely based on survival following a diagnosis of cancer. It includes the following variables: ID number (id
), year of diagnosis (yydx
, 2009-2019), date of diagnosis (dx
), date of death/censoring (exit
), survival status (status
, 0 = Alive, 1 = Dead), age at diagnosis (age
, 44-93), sex (sex
, 0 = Male, 1 = Female) and stage of tumour at diagnosis (stage
, 1-3).
I’ll start by creating some dummy variables for sex and stage of tumour at diagnosis and fit a model that includes age, sex and stage of tumour as covariates. I’ll centre age on 70 so that the baseline has a meaningful interpretation.
use https://www.pclambert.net/data/simulated_survival, clear
tab stage, gen(stage)
gen female = sex == 1
gen age_centre = age - 70
When fitting the model, I’ll use the noorthog
option so that the baseline splines won’t be orthogonalized which makes it easier to re-construct the FPM without the need for any matrix algebra.
. stset exit, origin(dx) fail(status==1) scale(365.25) ///
> exit(time dx+10*365.25)
Survival-time data settings
Failure event: status==1
Observed time interval: (origin, exit]
Exit on or before: time dx+10*365.25
Time for analysis: (time-origin)/365.25
Origin: time dx
--------------------------------------------------------------------------
55,050 total observations
0 exclusions
--------------------------------------------------------------------------
55,050 observations remaining, representing
22,180 failures in single-record/single-failure data
397,922.36 total analysis time at risk and under observation
At risk from t = 0
Earliest observed entry t = 0
Last observed exit t = 10
.
. stpm2 age_centre female stage2 stage3, df(5) scale(hazard) noorthog
Iteration 0: log likelihood = -58450.715
Iteration 1: log likelihood = -58447.706
Iteration 2: log likelihood = -58447.701
Log likelihood = -58447.701 Number of obs = 55,050
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
xb |
age_centre | .0288174 .0011243 25.63 0.000 .0266138 .0310211
female | -.077212 .0134341 -5.75 0.000 -.1035424 -.0508816
stage2 | 1.211915 .0217077 55.83 0.000 1.169368 1.254461
stage3 | 3.078478 .0214495 143.52 0.000 3.036438 3.120519
_rcs1 | .8072818 .0240466 33.57 0.000 .7601513 .8544123
_rcs2 | .0093188 .0062436 1.49 0.136 -.0029184 .021556
_rcs3 | -.0225463 .0219503 -1.03 0.304 -.0655681 .0204754
_rcs4 | .0177935 .0393819 0.45 0.651 -.0593935 .0949805
_rcs5 | -.003851 .0377596 -0.10 0.919 -.0778584 .0701564
_cons | -3.654475 .0716417 -51.01 0.000 -3.79489 -3.51406
------------------------------------------------------------------------------
Producing the predictions using the model stored in memory
I’ll now use the model stored in memory to produce 10-year survival curves for the following covariate patterns:
-
70 years old (
age_centre
= 70 - 70 = 0) male with a stage 1 tumour (baseline) -
80 years old (
age_centre
= 80 - 70 = 10) female with a stage 2 tumour -
85 years old (
age_centre
= 85 - 70 = 15) male with a stage 3 tumour
range timevar10 0 10 100
predict s0, zeros surv timevar(timevar10)
predict s1, at(age_centre 10 female 1 stage2 1) zeros surv timevar(timevar10)
predict s2, at(age_centre 15 stage3 1) zeros surv timevar(timevar10)
Producing the predictions using an .ster
file
If the .ster
file is available, this can be loaded in to Stata in order to produce the predictions.
. estimates use https://www.pclambert.net/data/oosmodel.ster
. predict s0_ster, zeros surv timevar(timevar10)
. predict s1_ster, at(age_centre 10 female 1 stage2 1) zeros surv timevar(timevar10)
. predict s2_ster, at(age_centre 15 stage3 1) zeros surv timevar(timevar10)
Producing the predictions using the model coefficients
A PH flexible parametric survival model is fitted on the log cumulative hazard scale where a restricted cubic spline function $\zeta (\ln(t)|\gamma,k_0)$ is used to model the baseline. Here, $k_0$ is the vector of knot locations, the $z$ terms are the derived variables (basis functions) and the $\gamma$ terms are the model coefficients for these derived variables. The linear predictor $\beta x_i$ contains the log hazard ratios. Using this information, I’ll now show how these predictions can be produced when the model is not stored in memory.
$$ \ln[H(t|x_i)] = \ln[H_0(t)] + \beta x_i = \zeta (\ln(t)|\gamma,k_0) + \beta x_i $$
$$ \zeta (\ln(t)|\gamma,k_0) = \gamma_0 + \gamma_1 z_{1} + \gamma_2 z_{2} + … + \gamma_{K-1} z_{K-1} $$
The table below displays the model coefficients (log hazard ratios) from the model. The knot locations on the log time scale (stored in e(ln_bhknots)
) are also required to produce the predictions. For this model the knots are positioned at: -5.875508413078693 -0.6125073943959456 0.4430906852086161 1.156903994968461 1.750977782183514 2.302560494813347 on the log time scale.
Variable | Coefficient |
---|---|
Age (centred) | 0.02881744 |
Female | -0.07721197 |
Stage 2 | 1.21191450 |
Stage 3 | 3.07847830 |
$\gamma_0$ (_cons) | -3.65447470 |
$\gamma_1$ (_rcs1) | 0.80728180 |
$\gamma_2$ (_rcs2) | 0.00931881 |
$\gamma_3$ (_rcs3) | -0.02254634 |
$\gamma_4$ (_rcs4) | 0.01779351 |
$\gamma_5$ (_rcs5) | -0.00385097 |
The first step is to generate the restricted cubic spline functions of log time which can be achieved using rcsgen
and specifying the list of knot locations.
gen double lntime = ln(timevar10)
rcsgen lntime, gen(z) knots(-5.875508413078693 -0.6125073943959456 ///
0.4430906852086161 1.156903994968461 1.750977782183514 2.302560494813347)
The log cumulative hazard can then be calculated for each of the covariate patterns using the above equations. These can then be transformed to the survival scale using the following equation:
$$ S(t|x_i) = \exp \left[ - [H(t|x_i)] \right] = \exp \left[ - \exp \left[\ln[H(t|x_i)] \right] \right] $$
The survival predictions for time zero will be undefined as the log of zero is required in this calculation. However, by definition S(0) = 1, and therefore this can be added manually.
// Calculate the log cumulative hazard
gen double logCH_0 = -3.6544747 + 0.8072818*z1 + 0.00931881*z2 + -0.02254634*z3 + ///
0.01779351*z4 + -0.00385097*z5
gen double logCH_1 = logCH_0 + 10*0.02881744 + -0.07721197 + 1.2119145
gen double logCH_2 = logCH_0 + 15*0.02881744 + 3.0784783
// Transform to the survival scale
gen double s0_rcs = exp(-exp(logCH_0))
gen double s1_rcs = exp(-exp(logCH_1))
gen double s2_rcs = exp(-exp(logCH_2))
// S(0) = 1
forvalues i = 0/2 {
replace s`i'_rcs = 1 in 1
}
Finally, we can check that the survival predictions match those that we originally calculated when the model was stored in memory. Here there is agreement to at least 5 decimal places but the predictions that were produced using the model coefficients are slightly different due to the precision that the model coefficients, knots and variables were stored to.
. summ s0* s1* s2*
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
s0 | 100 .922136 .0380317 .8633618 1
s0_ster | 100 .922136 .0380317 .8633618 1
s0_rcs | 100 .922136 .0380316 .8633618 1
s1 | 100 .7223094 .1249454 .5435774 1
s1_ster | 100 .7223094 .1249454 .5435774 1
-------------+---------------------------------------------------------
s1_rcs | 100 .7223094 .1249453 .5435776 1
s2 | 100 .1542438 .2111177 .0073144 1
s2_ster | 100 .1542438 .2111177 .0073144 1
s2_rcs | 100 .1542438 .2111177 .0073144 1
References
Royston, P. & Lambert, P. C. Flexible Parametric Survival Analysis Using Stata: Beyond the Cox Model, 2011