Standardized cumulative incidence functions
You will need to install
standsurv
to run the example. Details here
Background
I have been meaning to write about using standsurv
for standardized measures in competing risks for a while and how many of the
ideas of standardization in a standard survival carry over to competing risks. Last week a very nice
paper by Kipourou et al. was published that decribes using cause-specific flexible parametric models on the log-hazard scale and then using these to
derive standardized cause-specific cumulative incidence functions. It is great that they include R code for their analysis and as they used the
publically available MGUS2 data set available from the survival library in R, I thought I would try and do a similar analysis using standsurv
.
In competing risks there is interest in the time to more than one type of event. A common example is different causes of death. For example, for patients diagnosed with a type of cancer, they at risk from dying from their cancer or from some other cause, i.e. there are competing events. An invidual can only experience one of these events. In such situations we can think of a separate hazard (mortality) rate for each of the different causes.
So for two causes we could have two proportional hazards models,
Model 1: $h_1(t) = h_{01}(t) \exp\left(\boldsymbol{\beta_1} \mathbf{Z_1}\right)$
Model 2: $h_2(t) = h_{02}(t) \exp\left(\boldsymbol{\beta_2} \mathbf{Z_2} \right)$
It is well known that transforming the cause-specific hazard function to a cause-specific survival function does not give a “real” world probability of death, i.e. using the transformation
$$ S_k(t) = \exp\left(-\int_0^t h_k(u) du\right) $$
If one is willing to assume conditional independence between the times to each event then $S_k(t)$ can be interpreted as a net probability, that is the probability of still being alive in the hypothetical situation where it is only possible to die from cause $k$. The independence asumption cannot be assessed from the data as we of course never observe the time to two events on the same individual.
If interest lies in the probability of dying of cause $k$ in the situation where dying from another cause first will make it impossible to die from cause $k$, then the cause-specific cumulative incidence function, $F_k(t)$ should be estimated. This is defined as follows
$$ F_{k}(t) = \int_0^t S(u) h_k(u) du $$
where $S(t)$ is the overall survival function,
$$ S(t) = S_1(t) S_2(t) = \exp\left(-\int_0^t {h_1(u) + h_2(u) du}\right) $$
This is for two competing risks, but of course it can be extended to any number. For more detail on competing risks see Andersen et al. 2012, Geskus 2016.
Example
First I load the mgus2 data. This is the same data used by Kipourou et al in their example. This is a dataset that comes with the survival package in R. The two events of interest are time to plasma cell malignacy (PCM) or death (before PCM). I will fit similar models to those in the Kipourou paper.
First I will load the data and drop rows with missing values for the mspike
variable
. use https://www.pclambert.net/data/mgus2, clear
. drop if mspike == .
(11 observations deleted)
After dropping the missing values of mspike
we are left with 1373 individuals. I will now tabulate the event
variable,
. tab event
event | Freq. Percent Cum.
------------+-----------------------------------
0 | 404 29.42 29.42
1 | 115 8.38 37.80
2 | 854 62.20 100.00
------------+-----------------------------------
Total | 1,373 100.00
There are 404 individuals who are censored (event=0
), 115 had a PCM event (event=1
) and 854 died before PCM (event=2
).
I will now fit the models. As there are two events we fit two separate models, one for PCM and one for death. I will fit similar models to Kipourou et al, but the models here will be on the log cumulative hazard scale rather than the log hazard scale.
PCM model
For the PCM model we need use, event=1
, when using stset
. stset survtime, failure(event=1)
failure event: event == 1
obs. time interval: (0, survtime]
exit on or before: failure
------------------------------------------------------------------------------
1,373 total observations
0 exclusions
------------------------------------------------------------------------------
1,373 observations remaining, representing
115 failures in single-record/single-failure data
10,739.583 total analysis time at risk and under observation
at risk from t = 0
earliest observed entry t = 0
last observed exit t = 35.33333
. stpm2 age mspike male, scale(h) df(4)
Iteration 0: log likelihood = -436.35927
Iteration 1: log likelihood = -435.78559
Iteration 2: log likelihood = -435.78118
Iteration 3: log likelihood = -435.78118
Log likelihood = -435.78118 Number of obs = 1,373
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
xb |
age | .016625 .0083603 1.99 0.047 .0002391 .0330108
mspike | .883143 .1645591 5.37 0.000 .5606132 1.205673
male | -.0255884 .1877314 -0.14 0.892 -.3935353 .3423584
_rcs1 | 1.559546 .1860716 8.38 0.000 1.194852 1.924239
_rcs2 | .0055689 .1146625 0.05 0.961 -.2191654 .2303031
_rcs3 | -.0861545 .0539364 -1.60 0.110 -.1918679 .019559
_rcs4 | .0222621 .0326597 0.68 0.495 -.0417498 .0862739
_cons | -5.437525 .6764209 -8.04 0.000 -6.763286 -4.111765
------------------------------------------------------------------------------
. estimates store pcm
This is a proportional hazards model including the effects of age at diagnosis,
age
, size of the monoclonal serum spike (mspike
) and sex (male
).
The effects of age
and mspike
are assumed to be linear.
I have used 4 degrees of freedom, i.e. 5 knots, for the effect of time since diagnosis for the restricted cubic splines.
Death before PCM model
For the death before PCM model we need use, event=2
, when using stset
.
. stset survtime, failure(event=2)
failure event: event == 2
obs. time interval: (0, survtime]
exit on or before: failure
------------------------------------------------------------------------------
1,373 total observations
0 exclusions
------------------------------------------------------------------------------
1,373 observations remaining, representing
854 failures in single-record/single-failure data
10,739.583 total analysis time at risk and under observation
at risk from t = 0
earliest observed entry t = 0
last observed exit t = 35.33333
. stpm2 age mspike male, scale(h) df(4) tvc(age) dftvc(2)
Iteration 0: log likelihood = -1820.2633
Iteration 1: log likelihood = -1797.1359
Iteration 2: log likelihood = -1794.6344
Iteration 3: log likelihood = -1794.5668
Iteration 4: log likelihood = -1794.5667
Log likelihood = -1794.5667 Number of obs = 1,373
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
xb |
age | .0583808 .0038839 15.03 0.000 .0507684 .0659932
mspike | -.0445153 .0638026 -0.70 0.485 -.1695661 .0805354
male | .4041454 .0699583 5.78 0.000 .2670297 .5412612
_rcs1 | .1601389 .188315 0.85 0.395 -.2089519 .5292296
_rcs2 | .0821883 .1468049 0.56 0.576 -.2055441 .3699206
_rcs3 | -.0809094 .01913 -4.23 0.000 -.1184035 -.0434154
_rcs4 | .0073994 .0133311 0.56 0.579 -.018729 .0335279
_rcs_age1 | .0139219 .0025865 5.38 0.000 .0088524 .0189914
_rcs_age2 | -.0034103 .0020469 -1.67 0.096 -.0074221 .0006014
_cons | -5.19614 .3131461 -16.59 0.000 -5.809895 -4.582385
------------------------------------------------------------------------------
. estimates store death
This model has relaxed the assumption of proportional hazards for age through use of the tvc()
and dftvc()
options.
I have stored both models using estimates store
as these will need to be passed to to standsurv
when making predictions.
Estimating the Marginal Cause-Specific CIFs
The marginal cause-specific CIFs is the expection of the CIF over covariates $Z$, i.e.
$$ E\left[F_k(t|Z)\right] $$
This can be estimated by predicting the cause specific CIF for each individual and then taking the mean, i.e.
$$ \widehat{F}_{k}(t|Z) = \frac{1}{N} \sum_{i=1}^{N} \int_0^t \widehat{S}_1(u|Z_i) \widehat{S}_2(u|Z_i) \widehat{h}_k(u|Z_i) du $$
In this case $Z_i=(\mbox{age}_i,\mbox{mspike}_i,\mbox{male}_i)$
I now create the times I want to predict the cause-specific CIF at
. range tt 0 30 31
(1,342 missing values generated)
This creates 31 equally spaced values between 0 and 30, i.e. increasing in steps of 1 year, and stores
in a new variable, tt
. I then use standsurv
.
. standsurv, crmodels(pcm death) cif ci timevar(tt) atvar(F) verbose
Calling main mata program
Reading in things to set up structure
Finished setting up structure
...............................
The crmodels(pcm death)
gives the names of the two models, the cif
option requests the cause-specific CIFs are estimated
(the default is overall survival), the ci
option means that confidence intervals
will be calculated (using the delta method), the timevar(tt)
option gives the
times to predict at and the atvar(F)
option gives the stub name of the new variables.
Here the defaults of using the model names will be used so variables F_pcm
and F_death
will be created. The verbose option is not necessary, but gives some information on the progress of calculations.
We can now plot the results,
. twoway (rarea F_pcm_lci F_pcm_uci tt, color(red%30)) ///
> (line F_pcm tt, color(red)) ///
> (rarea F_death_lci F_death_uci tt, color(blue%30)) ///
> (line F_death tt, color(blue)) ///
> , legend(order(2 "PCM" 4 "Death") cols(1) ring(0) pos(11)) ///
> ylabel(,angle(h) format(%3.2f)) ///
> xtitle("Time from diagnosis (years)") ytitle("cause-specific CIF") ///
> name(cifs, replace)
This is similar to Figure 3 of the Kipourou et al. paper.
Contrasts
Contrasts are of more interest and the general idea is essentially the same as when making contrasts of survival functions. For a binary exposure $X$ and confounders $Z$ we want to estimate
$$ E\left[F_k(t|X=1,Z)\right] - E\left[F_k(t|X=0,Z)\right] $$
$F_k(t|X=x,Z)$ is estimated by,
$$ \widehat{F}_{k}(t|X=x,Z) = \frac{1}{N} \sum_{i=1}^{N} \int_0^t \widehat{S}_1(u|X=x,Z_i) \widehat{S}_2(u|X=x,Z_i) \widehat{h}_k(u|X=x,Z_i) du $$
Here we are interestes in the effect of sex, so here our exposure is male
and the potential confounders are age
and mspike
. The standsurv
command
is shown below.
. standsurv, crmodels(pcm death) cif ci timevar(tt) verbose ///
> at1(male 1) at2(male 0) atvar(F_male F_female) ///
> contrast(difference) contrastvar(cif_diff)
Calling main mata program
Reading in things to set up structure
Finished setting up structure
...............................
The options that are different to the previous standsurv
command
are that I have used at1(male 1)
and at2(male 0)
.
When estimating the standardized CIFs it first forces the covariate
male
to be set to 1 for all subjects in the
individual predictions and then to 0.
As in the other standsurv
examples the key point here is that the
distribution of confounders is forced to be the same for males and females
when obtaining the the estimates.
As there are two at()
options, two new variables are listed using
atvar(F_male F_female)
.
If atvar()
is not specified the default names would be _at2
and _at1
.
The contrast(difference)
option means that the difference between the at
options will be taken.
By default at1
is the reference, but this can be changed using atref()
.
The contrastvar(cif_diff)
option gives the variable name for the difference.
Note that two new variables will be created as there are two competing events,
cif_diff_pcm
and cif_diff_death
.
Having created the standardized cause-specific CIFs we can now plot them.
. twoway (rarea F_male_pcm_lci F_male_pcm_uci tt, color(red%30)) ///
> (line F_male_pcm tt, color(red)) ///
> (rarea F_female_pcm_lci F_female_pcm_uci tt, color(blue%30)) ///
> (line F_female_pcm tt, color(blue)) ///
> , legend(order(2 "Males" 4 "Females") cols(1) ring(0) pos(11)) ///
> ylabel(, angle(h) format(%3.2f)) ///
> xtitle("Time from diagnosis (years)") ytitle("cause-specific CIF") ///
> title("PCM") ///
> name(pcm, replace)
.
. twoway (rarea F_male_death_lci F_male_death_uci tt, color(red%30)) ///
> (line F_male_death tt, color(red)) ///
> (rarea F_female_death_lci F_female_death_uci tt, color(blue%30)) ///
> (line F_female_death tt, color(blue)) ///
> , legend(order(2 "Males" 4 "Females") cols(1) ring(0) pos(11)) ///
> ylabel(, angle(h) format(%3.2f)) ///
> xtitle("Time from diagnosis (years)") ytitle("cause-specific CIF") ///
> title("Death") ///
> name(death, replace)
.
. graph combine pcm death, nocopies ycommon
This is similar to the top row of Figure 4 in the Kipourou et al. paper although I have shown both cause-specific CIFs with the y-axes over the same range.
For the contrast I just need to plot the new variables cif_diff_pcm
and cif_diff_death
together with their confidence limits.
. twoway (rarea cif_diff_pcm_lci cif_diff_pcm_uci tt, color(red%30)) ///
> (line cif_diff_pcm tt, color(red)) ///
> , legend(off) ///
> ylabel(, angle(h) format(%3.2f)) ///
> xtitle("Time from diagnosis (years)") ytitle("cause-specific CIF") ///
> title("PCM") ///
> name(pcm_diff, replace)
.
. twoway (rarea cif_diff_death_lci cif_diff_death_uci tt, color(red%30)) ///
> (line cif_diff_death tt, color(red)) ///
> , legend(off) ///
> ylabel(, angle(h) format(%3.2f)) ///
> xtitle("Time from diagnosis (years)") ytitle("cause-specific CIF") ///
> title("Death") ///
> name(death_diff, replace)
.
. graph combine pcm_diff death_diff, nocopies ycommon
This is similar to the bottom row of Figure 4 in the Kipourou et al. paper although, as above, I have shown both cause-specific CIFs with the y-axes over the same range.
Using different survival models
standsurv
is a general command and it possible to use a variety of different survival models.
It is also possible to use different survival distributions for different causes.
In order to illustrate this I will use a weibull model for PCM where the shape parameter
is a function of age and a log-logistic accelerated failure time model for death.
I am not claiming these are sensible models, but just aim to show the versatility of standsurv
I fit the two models below.
. stset survtime, failure(event=1)
failure event: event == 1
obs. time interval: (0, survtime]
exit on or before: failure
------------------------------------------------------------------------------
1,373 total observations
0 exclusions
------------------------------------------------------------------------------
1,373 observations remaining, representing
115 failures in single-record/single-failure data
10,739.583 total analysis time at risk and under observation
at risk from t = 0
earliest observed entry t = 0
last observed exit t = 35.33333
. streg age mspike male, dist(weibull) anc(age)
failure _d: event == 1
analysis time _t: survtime
Fitting constant-only model:
Iteration 0: log likelihood = -452.48861
Iteration 1: log likelihood = -452.44835
Iteration 2: log likelihood = -452.44832
Fitting full model:
Iteration 0: log likelihood = -452.44832
Iteration 1: log likelihood = -437.70531
Iteration 2: log likelihood = -435.60846
Iteration 3: log likelihood = -435.60532
Iteration 4: log likelihood = -435.60532
Weibull PH regression
No. of subjects = 1,373 Number of obs = 1,373
No. of failures = 115
Time at risk = 10739.58334
LR chi2(3) = 33.69
Log likelihood = -435.60532 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
_t | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_t |
age | .0429871 .0171147 2.51 0.012 .0094429 .0765314
mspike | .8710854 .1643031 5.30 0.000 .5490573 1.193114
male | -.0308749 .1875072 -0.16 0.869 -.3983823 .3366325
_cons | -9.181064 1.316739 -6.97 0.000 -11.76183 -6.600303
-------------+----------------------------------------------------------------
ln_p |
age | -.0088294 .0042727 -2.07 0.039 -.0172039 -.000455
_cons | .802942 .3018274 2.66 0.008 .2113712 1.394513
------------------------------------------------------------------------------
. estimates store pcm2
.
. stset survtime, failure(event=2)
failure event: event == 2
obs. time interval: (0, survtime]
exit on or before: failure
------------------------------------------------------------------------------
1,373 total observations
0 exclusions
------------------------------------------------------------------------------
1,373 observations remaining, representing
854 failures in single-record/single-failure data
10,739.583 total analysis time at risk and under observation
at risk from t = 0
earliest observed entry t = 0
last observed exit t = 35.33333
. streg age mspike male, dist(llogistic)
failure _d: event == 2
analysis time _t: survtime
Fitting constant-only model:
Iteration 0: log likelihood = -2056.9536
Iteration 1: log likelihood = -2051.5925
Iteration 2: log likelihood = -2051.5305
Iteration 3: log likelihood = -2051.5305
Fitting full model:
Iteration 0: log likelihood = -2051.5305
Iteration 1: log likelihood = -1936.9223
Iteration 2: log likelihood = -1924.5341
Iteration 3: log likelihood = -1924.4786
Iteration 4: log likelihood = -1924.4786
Loglogistic AFT regression
No. of subjects = 1,373 Number of obs = 1,373
No. of failures = 854
Time at risk = 10739.58334
LR chi2(3) = 254.10
Log likelihood = -1924.4786 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
_t | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | -.0590111 .0039933 -14.78 0.000 -.0668378 -.0511844
mspike | .1093803 .075813 1.44 0.149 -.0392104 .2579711
male | -.4249628 .0857712 -4.95 0.000 -.5930714 -.2568543
_cons | 6.452482 .3170312 20.35 0.000 5.831113 7.073852
-------------+----------------------------------------------------------------
/lngamma | -.1515847 .0296779 -5.11 0.000 -.2097524 -.0934171
-------------+----------------------------------------------------------------
gamma | .8593451 .0255036 .810785 .9108135
------------------------------------------------------------------------------
. estimates store death2
I have stored the models using estimates store
and just need to pass these new
models to standsurv
. The rest of the standsurv
code is the same, with the
exception of creating new variables names.
. standsurv, crmodels(pcm2 death2) cif ci timevar(tt) verbose ///
> at1(male 1) at2(male 0) atvar(F2_male F2_female) ///
> contrast(difference) contrastvar(cif2_diff)
Calling main mata program
Reading in things to set up structure
Finished setting up structure
...............................
The plots can be reproduced the same way and are shown in a panel graph below.
They are reasonably similar to the previous plot, with some small differences.
Some comments
At the moment my code is a bit inefficient, it loops over time points and repeates the numerical integration for each time point. Thus it will be quick for a single time point, but will take some time for more time points (needed for plotting). When I have some time over the summer, I will increase computational time considerably.
For the flexible parametric models I have modelled on the log cumulative hazard scale,
while Kipourou et al fitted models on the log hazard scale.
Our strcs
Stata command enables models on the log hazard scale to be fitted
and for standard survival models and relative survival models these are compatable
with standsurv
. I will also enable these to work for competing risks model after my vacation.
However, computations are easier and faster on the log cumulative hazard scale so unless
there is a good reason for not doing so, I prefer models on this scale. In the MGUS2 example, the
models here have very similar fits (BIC = 925.6 for stpm2
PCM model and 926.7 for
strcs
PCM model and BIC = 3656.6 for stpm2
death model and 3656.4 for
strcs
death model.
References
Andersen, P. K., Geskus, R. B., de Witte, T., Putter, H. Competing risks in epidemiology: possibilities and pitfalls. Int J Epidemiol 2012;41:861-70
Geskus, R. B. Data analysis with competing risks and intermediate states. Chapman and Hall 2016
Kipourou, D.-K., Charvat, H., Rachet, B., Belot, A. Estimation of the adjusted cause-specific cumulative probability using flexible regression models for the cause-specific hazards.
Statistics in Medicine 2019. DOI: 10.1002/sim.8209