title: “Standardized cumulative incidence functions” |
Originally written June 2019, updated to
stpm3
August 2024.
You will need to install
standsurv
andstpm3
to run the example.
Background
I have been meaning to write about using standsurv
for standardized measures in competing risks for a while and how many of the ideas of standardization in a standard survival carry over to competing risks. Last week a very nice paper by Kipourou et al. was published that decribes using cause-specific flexible parametric models on the log-hazard scale and then using these to derive standardized cause-specific cumulative incidence functions. It is great that they include R code for their analysis and as they used the publically available MGUS2 data set available from the survival library in R, I thought I would try and do a similar analysis using standsurv
.
In competing risks there is interest in the time to more than one type of event. A common example is different causes of death. For example, for patients diagnosed with a type of cancer, they at risk from dying from their cancer or from some other cause, i.e. there are competing events. An indvidual can only experience one of these events. In such situations we can think of a separate hazard (mortality) rate for each of the different causes.
So for two causes we could have two proportional hazards models,
Model 1: \(h_1(t) = h_{01}(t) \exp\left(\boldsymbol{\beta_1} \mathbf{Z_1}\right)\)
Model 2: \(h_2(t) = h_{02}(t) \exp\left(\boldsymbol{\beta_2} \mathbf{Z_2} \right)\)
It is well known that transforming the cause-specific hazard function to a cause-specific survival function does not give a “real” world probability of death, i.e. using the transformation
\[ S_k(t) = \exp\left(-\int_0^t h_k(u) du\right) \]
If one is willing to assume conditional independence between the times to each event then \(S_k(t)\) can be interpreted as a net probability, that is the probability of still being alive in the hypothetical situation where it is only possible to die from cause \(k\). The independence asumption cannot be assessed from the data as we of course never observe the time to two events on the same individual.
If interest lies in the probability of dying of cause \(k\) in the situation where dying from another cause first will make it impossible to die from cause \(k\), then the cause-specific cumulative incidence function, \(F_k(t)\) should be estimated. This is defined as follows
\[ F_{k}(t) = \int_0^t S(u) h_k(u) du \]
where \(S(t)\) is the overall survival function,
\[ S(t) = S_1(t) S_2(t) = \exp\left(-\int_0^t {h_1(u) + h_2(u) du}\right) \]
This is for two competing risks, but of course it can be extended to any number. For more detail on competing risks see Andersen et al. 2012, Geskus 2016.
Example
First I load the mgus2 data. This is the same data used by Kipourou et al in their example. This is a dataset that comes with the survival package in R. The two events of interest are time to plasma cell malignacy (PCM) or death (before PCM). I will fit similar models to those in the Kipourou paper.
First I will load the data and drop rows with missing values for the mspike
variable
use https://www.pclambert.net/data/mgus2, clear
.
drop if mspike == .
. (11 observations deleted)
After dropping the missing values of mspike
we are left with 1373 individuals. I will now tabulate the event
variable,
tab event
.
event | Freq. Percent Cum.
------------+-----------------------------------
0 | 404 29.42 29.42
1 | 115 8.38 37.80
2 | 854 62.20 100.00
------------+----------------------------------- Total | 1,373 100.00
There are 404 individuals who are censored (event=0
), 115 had a PCM event (event=1
) and 854 died before PCM (event=2
).
I will now fit the models. As there are two events we fit two separate models, one for PCM and one for death. I will fit similar models to Kipourou et al, but the models here will be on the log cumulative hazard scale rather than the log hazard scale.
PCM model
For the PCM model we need to use, event=1
, when using stset
stset survtime, failure(event=1)
.
data settings
Survival-time
Failure event: event==1
Observed time interval: (0, survtime]on or before: failure
Exit
--------------------------------------------------------------------------total observations
1,373
0 exclusions
--------------------------------------------------------------------------
1,373 observations remaining, representingin single-record/single-failure data
115 failures total analysis time at risk and under observation
10,739.583
At risk from t = 0
Earliest observed entry t = 0exit t = 35.33333
Last observed
scale(lncumhazard) df(4)
. stpm3 age mspike i.male,
Iteration 0: Log likelihood = -436.35927
Iteration 1: Log likelihood = -435.78559
Iteration 2: Log likelihood = -435.78118
Iteration 3: Log likelihood = -435.78118
of obs = 1,373
Number chi2(3) = 31.52
Wald chi2 = 0.0000
Log likelihood = -435.78118 Prob >
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------xb |
age | .016625 .0083603 1.99 0.047 .0002391 .0330108
mspike | .883143 .1645591 5.37 0.000 .5606132 1.205673
1.male | -.0255884 .1877314 -0.14 0.892 -.3935353 .3423584
-------------+----------------------------------------------------------------
time |
_ns1 | -19.2611 2.240876 -8.60 0.000 -23.65313 -14.86906
_ns2 | 4.240545 1.148641 3.69 0.000 1.98925 6.49184
_ns3 | -2.345719 .2692337 -8.71 0.000 -2.873408 -1.818031
_ns4 | -1.711317 .4184964 -4.09 0.000 -2.531555 -.891079_cons | -3.02243 .6282616 -4.81 0.000 -4.2538 -1.79106
------------------------------------------------------------------------------
estimates store pcm .
This is a proportional hazards model including the effects of age at diagnosis, age
, size of the monoclonal serum spike (mspike
) and sex (male
). The effects of age
and mspike
are assumed to be linear. I have used 4 degrees of freedom, i.e. 5 knots, for the effect of time since diagnosis for the restricted cubic splines.
Death before PCM model
For the death before PCM model we to need use, event=2
, when using stset
.
stset survtime, failure(event=2)
.
data settings
Survival-time
Failure event: event==2
Observed time interval: (0, survtime]on or before: failure
Exit
--------------------------------------------------------------------------total observations
1,373
0 exclusions
--------------------------------------------------------------------------
1,373 observations remaining, representingin single-record/single-failure data
854 failures total analysis time at risk and under observation
10,739.583
At risk from t = 0
Earliest observed entry t = 0exit t = 35.33333
Last observed
scale(lncumhazard) df(4) tvc(age) dftvc(2)
. stpm3 age mspike i.male,
Iteration 0: Log likelihood = -1820.2633
Iteration 1: Log likelihood = -1797.1359
Iteration 2: Log likelihood = -1794.6344
Iteration 3: Log likelihood = -1794.5668
Iteration 4: Log likelihood = -1794.5667
of obs = 1,373
Number chi2(3) = 195.73
Wald chi2 = 0.0000
Log likelihood = -1794.5667 Prob >
----------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-----------------+----------------------------------------------------------------xb |
age | .0893457 .0067803 13.18 0.000 .0760566 .1026348
mspike | -.0445153 .0638026 -0.70 0.485 -.1695661 .0805354
1.male | .4041454 .0699583 5.78 0.000 .2670297 .5412612
-----------------+----------------------------------------------------------------
time |
_ns1 | -3.526569 2.605836 -1.35 0.176 -8.633914 1.580777
_ns2 | 1.730746 1.163022 1.49 0.137 -.5487345 4.010226
_ns3 | -.3259044 .5716026 -0.57 0.569 -1.446225 .7944161
_ns4 | -.3377873 .514584 -0.66 0.512 -1.346353 .6707788
|
c.age#c._ns_tvc1 | -.102553 .0189964 -5.40 0.000 -.1397854 -.0653207
|
c.age#c._ns_tvc2 | -.0814865 .020701 -3.94 0.000 -.1220596 -.0409133
|_cons | -4.912348 .4686894 -10.48 0.000 -5.830962 -3.993734
----------------------------------------------------------------------------------
estimates store death .
This model has relaxed the assumption of proportional hazards for age through use of the tvc()
and dftvc()
options.
I have stored both models using estimates store
as these will need to be passed to to standsurv
when making predictions.
Estimating the Marginal Cause-Specific CIFs
The marginal cause-specific CIFs is the expection of the CIF over covariates \(Z\), i.e.
\[ E\left[F_k(t|Z)\right] \]
This can be estimated by predicting the cause specific CIF for each individual and then taking the mean, i.e.
\[ \widehat{F}_{k}(t|Z) = \frac{1}{N} \sum_{i=1}^{N} \int_0^t \widehat{S}_1(u|Z_i) \widehat{S}_2(u|Z_i) \widehat{h}_k(u|Z_i) du \]
In this case \(Z_i=(\mbox{age}\_i,\mbox{mspike}_i,\mbox{male}_i)\)
I now create the times I want to predict the cause-specific CIF at
range tt 0 30 31
. missing values generated) (1,342
This creates 31 equally spaced values between 0 and 30, i.e. increasing in steps of 1 year, and stores in a new variable, tt
. I then use standsurv
.
ci frame(cif,replace) atvar(F) . standsurv, crmodels(pcm death) cif timevar(tt)
The crmodels(pcm death)
gives the names of the two models, the cif
option requests the cause-specific CIFs are estimated (the default is overall survival), the ci
option means that confidence intervals will be calculated (using the delta method), the timevar(tt)
option gives the times to predict at and the atvar(F)
option gives the stub name of the new variables. Here the defaults of using the model names will be used so variables F_pcm
and F_death
will be created.
We can now plot the results,
. frame cif {twoway (rarea F_pcm_lci F_pcm_uci tt, color(red%30)) ///
. line F_pcm tt, color(red)) ///
> (color(blue%30)) ///
> (rarea F_death_lci F_death_uci tt, line F_death tt, color(blue)) ///
> (legend(order(2 "PCM" 4 "Death") cols(1) ring(0) pos(11)) ///
> , ylabel(,angle(h) format(%3.2f)) ///
> xtitle("Time from diagnosis (years)") ///
> ytitle("cause-specific CIF") ///
> name(cifs, replace)
> . }
This is similar to Figure 3 of the Kipourou et al. paper.
Contrasts
Contrasts are of more interest and the general idea is essentially the same as when making contrasts of survival functions. For a binary exposure \(X\) and confounders \(Z\) we want to estimate
\[ E\left[F_k(t|X=1,Z)\right] - E\left[F_k(t|X=0,Z)\right] \]
\(F_k(t|X=x,Z)\) is estimated by,
\[ \widehat{F}_{k}(t|X=x,Z) = \frac{1}{N} \sum_{i=1}^{N} \int_0^t \widehat{S}_1(u|X=x,Z_i) \widehat{S}_2(u|X=x,Z_i) \widehat{h}_k(u|X=x,Z_i) du \]
Here we are interested in the effect of sex, so here our exposure is male
and the potential confounders are age
and mspike
. The standsurv
command is shown below.
ci timevar(tt) frame(cif2, replace) ///
. standsurv, crmodels(pcm death) cif ///
> at1(male 1) at2(male 0) atvar(F_male F_female) > contrast(difference) contrastvar(cif_diff)
The options that are different to the previous standsurv
command are that I have used at1(male 1)
and at2(male 0)
. When estimating the standardized CIFs it first forces the covariate male
to be set to 1 for all subjects in the individual predictions and then to 0. As in the other standsurv
examples the key point here is that the distribution of confounders is forced to be the same for males and females when obtaining the the estimates. As there are two at()
options, two new variables are listed using atvar(F_male F_female)
. If atvar()
is not specified the default names would be _at2
and _at1
. The contrast(difference)
option means that the difference between the at
options will be taken. By default at1
is the reference, but this can be changed using atreference()
. The contrastvar(cif_diff)
option gives the variable name for the difference. Note that two new variables will be created as there are two competing events, cif_diff_pcm
and cif_diff_death
.
Having created the standardized cause-specific CIFs we can now plot them.
. frame cif2 {twoway (rarea F_male_pcm_lci F_male_pcm_uci tt, color(red%30)) ///
. line F_male_pcm tt, color(red)) ///
> (color(blue%30)) ///
> (rarea F_female_pcm_lci F_female_pcm_uci tt, line F_female_pcm tt, color(blue)) ///
> (legend(order(2 "Males" 4 "Females") cols(1) ring(0) pos(11)) ///
> , ylabel(, angle(h) format(%3.2f)) ///
> xtitle("Time from diagnosis (years)") ///
> ytitle("cause-specific CIF") ///
> title("PCM") ///
> name(pcm, replace)
>
. twoway (rarea F_male_death_lci F_male_death_uci tt, color(red%30)) ///
. line F_male_death tt, color(red)) ///
> (color(blue%30)) ///
> (rarea F_female_death_lci F_female_death_uci tt, line F_female_death tt, color(blue)) ///
> (legend(order(2 "Males" 4 "Females") cols(1) ring(0) pos(11)) ///
> , ylabel(, angle(h) format(%3.2f)) ///
> xtitle("Time from diagnosis (years)") ///
> ytitle("cause-specific CIF") ///
> title("Death") ///
> name(death, replace)
>
. }
graph combine pcm death, nocopies ycommon .
This is similar to the top row of Figure 4 in the Kipourou et al. paper although I have shown both cause-specific CIFs with the y-axes over the same range.
For the contrast I just need to plot the new variables cif_diff_pcm
and cif_diff_death
together with their confidence limits.
. frame cif2 {twoway (rarea cif_diff_pcm_lci cif_diff_pcm_uci tt, color(red%30)) ///
. line cif_diff_pcm tt, color(red)) ///
> (legend(off) ///
> , ylabel(, angle(h) format(%3.2f)) ///
> xtitle("Time from diagnosis (years)") ///
> ytitle("cause-specific CIF") ///
> title("PCM") ///
> name(pcm_diff, replace)
>
. twoway (rarea cif_diff_death_lci cif_diff_death_uci tt, color(red%30)) ///
. line cif_diff_death tt, color(red)) ///
> (legend(off) ///
> , ylabel(, angle(h) format(%3.2f)) ///
> xtitle("Time from diagnosis (years)") ///
> ytitle("cause-specific CIF") ///
> title("Death") ///
> name(death_diff, replace)
>
. }
graph combine pcm_diff death_diff, nocopies ycommon .
This is similar to the bottom row of Figure 4 in the Kipourou et al. paper although, as above, I have shown both cause-specific CIFs with the y-axes over the same range.
Using different survival models
standsurv
is a general command and it possible to use a variety of different survival models. It is also possible to use different survival distributions for different causes. In order to illustrate this I will use a weibull model for PCM where the shape parameter is a function of age and a log-logistic accelerated failure time model for death. I am not claiming these are sensible models, but just aim to show the versatility of standsurv
I fit the two models below.
stset survtime, failure(event=1)
.
data settings
Survival-time
Failure event: event==1
Observed time interval: (0, survtime]on or before: failure
Exit
--------------------------------------------------------------------------total observations
1,373
0 exclusions
--------------------------------------------------------------------------
1,373 observations remaining, representingin single-record/single-failure data
115 failures total analysis time at risk and under observation
10,739.583
At risk from t = 0
Earliest observed entry t = 0exit t = 35.33333
Last observed
streg age mspike male, dist(weibull) anc(age)
.
Failure _d: event==1
Analysis time _t: survtime
model:
Fitting constant-only
Iteration 0: Log likelihood = -452.48861
Iteration 1: Log likelihood = -452.44835
Iteration 2: Log likelihood = -452.44832
model:
Fitting full
Iteration 0: Log likelihood = -452.44832
Iteration 1: Log likelihood = -437.70531
Iteration 2: Log likelihood = -435.60846
Iteration 3: Log likelihood = -435.60532
Iteration 4: Log likelihood = -435.60532
Weibull PH regression
of subjects = 1,373 Number of obs = 1,373
No. of failures = 115
No. at risk = 10,739.5833
Time chi2(3) = 33.69
LR chi2 = 0.0000
Log likelihood = -435.60532 Prob >
------------------------------------------------------------------------------
_t | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
_t |
age | .0429871 .0171147 2.51 0.012 .0094429 .0765314
mspike | .8710854 .1643031 5.30 0.000 .5490573 1.193114
male | -.0308749 .1875072 -0.16 0.869 -.3983823 .3366325_cons | -9.181064 1.316739 -6.97 0.000 -11.76183 -6.600303
-------------+----------------------------------------------------------------
ln_p |
age | -.0088294 .0042727 -2.07 0.039 -.0172039 -.000455_cons | .802942 .3018274 2.66 0.008 .2113712 1.394513
------------------------------------------------------------------------------
estimates store pcm2
.
. stset survtime, failure(event=2)
.
data settings
Survival-time
Failure event: event==2
Observed time interval: (0, survtime]on or before: failure
Exit
--------------------------------------------------------------------------total observations
1,373
0 exclusions
--------------------------------------------------------------------------
1,373 observations remaining, representingin single-record/single-failure data
854 failures total analysis time at risk and under observation
10,739.583
At risk from t = 0
Earliest observed entry t = 0exit t = 35.33333
Last observed
streg age mspike male, dist(llogistic)
.
Failure _d: event==2
Analysis time _t: survtime
model:
Fitting constant-only
Iteration 0: Log likelihood = -2056.9536
Iteration 1: Log likelihood = -2051.5925
Iteration 2: Log likelihood = -2051.5305
Iteration 3: Log likelihood = -2051.5305
model:
Fitting full
Iteration 0: Log likelihood = -2051.5305
Iteration 1: Log likelihood = -1936.9223
Iteration 2: Log likelihood = -1924.5341
Iteration 3: Log likelihood = -1924.4786
Iteration 4: Log likelihood = -1924.4786
Loglogistic AFT regression
of subjects = 1,373 Number of obs = 1,373
No. of failures = 854
No. at risk = 10,739.5833
Time chi2(3) = 254.10
LR chi2 = 0.0000
Log likelihood = -1924.4786 Prob >
------------------------------------------------------------------------------
_t | Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
age | -.0590111 .0039933 -14.78 0.000 -.0668378 -.0511844
mspike | .1093803 .075813 1.44 0.149 -.0392104 .2579711
male | -.4249628 .0857712 -4.95 0.000 -.5930714 -.2568543_cons | 6.452482 .3170312 20.35 0.000 5.831113 7.073852
-------------+----------------------------------------------------------------lngamma | -.1515847 .0296779 -5.11 0.000 -.2097524 -.0934171
/
-------------+----------------------------------------------------------------gamma | .8593451 .0255036 .810785 .9108135
------------------------------------------------------------------------------
estimates store death2 .
I have stored the models using estimates store
and just need to pass these new models to standsurv
. The rest of the standsurv
code is the same, with the exception of creating new variables names.
ci timevar(tt) frame(cif3, replace) ///
. standsurv, crmodels(pcm2 death2) cif ///
> at1(male 1) at2(male 0) atvar(F_male F_female) > contrast(difference) contrastvar(cif_diff)
The plots can be reproduced the same way and are shown in a panel graph below.
They are reasonably similar to the previous plot, with some small differences.
Some comments
For the flexible parametric models I have modelled on the log cumulative hazard scale, while Kipourou et al fitted models on the log hazard scale. In stpm3
you can fit a model on log hazard scale using scale(lnhazard)
for standard survival models and relative survival models these are compatable with standsurv
. However, computations are easier and faster on the log cumulative hazard scale so unless there is a good reason for not doing so, I prefer models on this scale. In the MGUS2 example, the models here have very similar fits (BIC = 925.6 for stpm3
with scale(lncumhazard) PCM model and 926.7 for stpm3
with scale(lnhazard)
for the PCM model and BIC = 3656.6 vs 3656.4 for the death model.
References
Andersen, P. K., Geskus, R. B., de Witte, T., Putter, H. Competing risks in epidemiology: possibilities and pitfalls. Int J Epidemiol 2012;41:861-70
Geskus, R. B. Data analysis with competing risks and intermediate states. Chapman and Hall 2016
Kipourou, D.-K., Charvat, H., Rachet, B., Belot, A. Estimation of the adjusted cause-specific cumulative probability using flexible regression models for the cause-specific hazards.
Statistics in Medicine 2019. DOI: 10.1002/sim.8209