# Standardized cumulative incidence functions

You will need to install

`standsurv`

to run the example. Details here

## Background

I have been meaning to write about using `standsurv`

for standardized measures in competing risks for a while and how many of the
ideas of standardization in a standard survival carry over to competing risks. Last week a very nice
paper by Kipourou *et al*. was published that decribes using cause-specific flexible parametric models on the log-hazard scale and then using these to
derive standardized cause-specific cumulative incidence functions. It is great that they include R code for their analysis and as they used the
publically available MGUS2 data set available from the survival library in R, I thought I would try and do a similar analysis using `standsurv`

.

In competing risks there is interest in the time to more than one type of event. A common example is different causes of death. For example, for patients diagnosed with a type of cancer, they at risk from dying from their cancer or from some other cause, i.e. there are competing events. An invidual can only experience one of these events. In such situations we can think of a separate hazard (mortality) rate for each of the different causes.

So for two causes we could have two proportional hazards models,

**Model 1:** $h_1(t) = h_{01}(t) \exp\left(\boldsymbol{\beta_1} \mathbf{Z_1}\right)$

**Model 2:** $h_2(t) = h_{02}(t) \exp\left(\boldsymbol{\beta_2} \mathbf{Z_2} \right)$

It is well known that transforming the cause-specific hazard function to a cause-specific survival function does not give a “real” world probability of death, i.e. using the transformation

$$ S_k(t) = \exp\left(-\int_0^t h_k(u) du\right) $$

If one is willing to assume conditional independence between the times to each event then $S_k(t)$ can be interpreted as a net probability, that is the probability of still being alive in the hypothetical situation where it is only possible to die from cause $k$. The independence asumption cannot be assessed from the data as we of course never observe the time to two events on the same individual.

If interest lies in the probability of dying of cause $k$ in the situation where dying from another cause first will make it impossible to die from cause $k$, then the cause-specific cumulative incidence function, $F_k(t)$ should be estimated. This is defined as follows

$$ F_{k}(t) = \int_0^t S(u) h_k(u) du $$

where $S(t)$ is the overall survival function,

$$ S(t) = S_1(t) S_2(t) = \exp\left(-\int_0^t {h_1(u) + h_2(u) du}\right) $$

This is for two competing risks, but of course it can be extended to any number.
For more detail on competing risks see Andersen *et al.* 2012, Geskus 2016.

## Example

First I load the mgus2 data. This is the same data used by Kipourou *et al* in their example. This is a dataset that comes with the survival
package in R. The two events of interest are time to plasma cell malignacy (PCM) or death
(before PCM). I will fit similar models to those in the Kipourou paper.

First I will load the data and drop rows with missing values for the `mspike`

variable

```
. use https://www.pclambert.net/data/mgus2, clear
. drop if mspike == .
(11 observations deleted)
```

After dropping the missing values of `mspike`

we are left with 1373 individuals. I will now tabulate the `event`

variable,

```
. tab event
event | Freq. Percent Cum.
------------+-----------------------------------
0 | 404 29.42 29.42
1 | 115 8.38 37.80
2 | 854 62.20 100.00
------------+-----------------------------------
Total | 1,373 100.00
```

There are 404 individuals who are censored (`event=0`

), 115 had a PCM event (`event=1`

) and 854 died before PCM (`event=2`

).

I will now fit the models. As there are two events we fit two separate models, one for PCM and one for death. I will fit
similar models to Kipourou *et al*, but the models here will be on the log cumulative hazard scale rather than the log hazard scale.

#### PCM model

For the PCM model we need use, `event=1`

, when using `stset`

```
. stset survtime, failure(event=1)
failure event: event == 1
obs. time interval: (0, survtime]
exit on or before: failure
------------------------------------------------------------------------------
1,373 total observations
0 exclusions
------------------------------------------------------------------------------
1,373 observations remaining, representing
115 failures in single-record/single-failure data
10,739.583 total analysis time at risk and under observation
at risk from t = 0
earliest observed entry t = 0
last observed exit t = 35.33333
. stpm2 age mspike male, scale(h) df(4)
Iteration 0: log likelihood = -436.35927
Iteration 1: log likelihood = -435.78559
Iteration 2: log likelihood = -435.78118
Iteration 3: log likelihood = -435.78118
Log likelihood = -435.78118 Number of obs = 1,373
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
xb |
age | .016625 .0083603 1.99 0.047 .0002391 .0330108
mspike | .883143 .1645591 5.37 0.000 .5606132 1.205673
male | -.0255884 .1877314 -0.14 0.892 -.3935353 .3423584
_rcs1 | 1.559546 .1860716 8.38 0.000 1.194852 1.924239
_rcs2 | .0055689 .1146625 0.05 0.961 -.2191654 .2303031
_rcs3 | -.0861545 .0539364 -1.60 0.110 -.1918679 .019559
_rcs4 | .0222621 .0326597 0.68 0.495 -.0417498 .0862739
_cons | -5.437525 .6764209 -8.04 0.000 -6.763286 -4.111765
------------------------------------------------------------------------------
. estimates store pcm
```

This is a proportional hazards model including the effects of age at diagnosis,
`age`

, size of the monoclonal serum spike (`mspike`

) and sex (`male`

).
The effects of `age`

and `mspike`

are assumed to be linear.
I have used 4 degrees of freedom, i.e. 5 knots, for the effect of time since diagnosis for the restricted cubic splines.

#### Death before PCM model

For the death before PCM model we need use, `event=2`

, when using `stset`

.

```
. stset survtime, failure(event=2)
failure event: event == 2
obs. time interval: (0, survtime]
exit on or before: failure
------------------------------------------------------------------------------
1,373 total observations
0 exclusions
------------------------------------------------------------------------------
1,373 observations remaining, representing
854 failures in single-record/single-failure data
10,739.583 total analysis time at risk and under observation
at risk from t = 0
earliest observed entry t = 0
last observed exit t = 35.33333
. stpm2 age mspike male, scale(h) df(4) tvc(age) dftvc(2)
Iteration 0: log likelihood = -1820.2633
Iteration 1: log likelihood = -1797.1359
Iteration 2: log likelihood = -1794.6344
Iteration 3: log likelihood = -1794.5668
Iteration 4: log likelihood = -1794.5667
Log likelihood = -1794.5667 Number of obs = 1,373
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
xb |
age | .0583808 .0038839 15.03 0.000 .0507684 .0659932
mspike | -.0445153 .0638026 -0.70 0.485 -.1695661 .0805354
male | .4041454 .0699583 5.78 0.000 .2670297 .5412612
_rcs1 | .1601389 .188315 0.85 0.395 -.2089519 .5292296
_rcs2 | .0821883 .1468049 0.56 0.576 -.2055441 .3699206
_rcs3 | -.0809094 .01913 -4.23 0.000 -.1184035 -.0434154
_rcs4 | .0073994 .0133311 0.56 0.579 -.018729 .0335279
_rcs_age1 | .0139219 .0025865 5.38 0.000 .0088524 .0189914
_rcs_age2 | -.0034103 .0020469 -1.67 0.096 -.0074221 .0006014
_cons | -5.19614 .3131461 -16.59 0.000 -5.809895 -4.582385
------------------------------------------------------------------------------
. estimates store death
```

This model has relaxed the assumption of proportional hazards for age through use of the `tvc()`

and `dftvc()`

options.

I have stored both models using `estimates store`

as these will need to be passed to to `standsurv`

when making predictions.

#### Estimating the Marginal Cause-Specific CIFs

The marginal cause-specific CIFs is the expection of the CIF over covariates $Z$, i.e.

$$ E\left[F_k(t|Z)\right] $$

This can be estimated by predicting the cause specific CIF for each individual and then taking the mean, i.e.

$$ \widehat{F}_{k}(t|Z) = \frac{1}{N} \sum_{i=1}^{N} \int_0^t \widehat{S}_1(u|Z_i) \widehat{S}_2(u|Z_i) \widehat{h}_k(u|Z_i) du $$

In this case $Z_i=(\mbox{age}_i,\mbox{mspike}_i,\mbox{male}_i)$

I now create the times I want to predict the cause-specific CIF at

```
. range tt 0 30 31
(1,342 missing values generated)
```

This creates 31 equally spaced values between 0 and 30, i.e. increasing in steps of 1 year, and stores
in a new variable, `tt`

. I then use `standsurv`

.

```
. standsurv, crmodels(pcm death) cif ci timevar(tt) atvar(F) verbose
Calling main mata program
Reading in things to set up structure
Finished setting up structure
...............................
```

The `crmodels(pcm death)`

gives the names of the two models, the `cif`

option requests the cause-specific CIFs are estimated
(the default is overall survival), the `ci`

option means that confidence intervals
will be calculated (using the delta method), the `timevar(tt)`

option gives the
times to predict at and the `atvar(F)`

option gives the stub name of the new variables.
Here the defaults of using the model names will be used so variables `F_pcm`

and `F_death`

will be created. The verbose option is not necessary, but gives some information on the progress of calculations.

We can now plot the results,

```
. twoway (rarea F_pcm_lci F_pcm_uci tt, color(red%30)) ///
> (line F_pcm tt, color(red)) ///
> (rarea F_death_lci F_death_uci tt, color(blue%30)) ///
> (line F_death tt, color(blue)) ///
> , legend(order(2 "PCM" 4 "Death") cols(1) ring(0) pos(11)) ///
> ylabel(,angle(h) format(%3.2f)) ///
> xtitle("Time from diagnosis (years)") ytitle("cause-specific CIF") ///
> name(cifs, replace)
```

This is similar to Figure 3 of the Kipourou *et al*. paper.

### Contrasts

Contrasts are of more interest and the general idea is essentially the same as when making contrasts of survival functions. For a binary exposure $X$ and confounders $Z$ we want to estimate

$$ E\left[F_k(t|X=1,Z)\right] - E\left[F_k(t|X=0,Z)\right] $$

$F_k(t|X=x,Z)$ is estimated by,

$$ \widehat{F}_{k}(t|X=x,Z) = \frac{1}{N} \sum_{i=1}^{N} \int_0^t \widehat{S}_1(u|X=x,Z_i) \widehat{S}_2(u|X=x,Z_i) \widehat{h}_k(u|X=x,Z_i) du $$

Here we are interestes in the effect of sex, so here our exposure is `male`

and the potential confounders are `age`

and `mspike`

. The `standsurv`

command
is shown below.

```
. standsurv, crmodels(pcm death) cif ci timevar(tt) verbose ///
> at1(male 1) at2(male 0) atvar(F_male F_female) ///
> contrast(difference) contrastvar(cif_diff)
Calling main mata program
Reading in things to set up structure
Finished setting up structure
...............................
```

The options that are different to the previous `standsurv`

command
are that I have used `at1(male 1)`

and `at2(male 0)`

.
When estimating the standardized CIFs it first forces the covariate
`male`

to be set to 1 for all subjects in the
individual predictions and then to 0.
As in the other `standsurv`

examples the key point here is that the
distribution of confounders is forced to be the same for males and females
when obtaining the the estimates.
As there are two `at()`

options, two new variables are listed using
`atvar(F_male F_female)`

.
If `atvar()`

is not specified the default names would be `_at2`

and `_at1`

.
The `contrast(difference)`

option means that the difference between the `at`

options will be taken.
By default `at1`

is the reference, but this can be changed using `atref()`

.
The `contrastvar(cif_diff)`

option gives the variable name for the difference.
Note that two new variables will be created as there are two competing events,
`cif_diff_pcm`

and `cif_diff_death`

.

Having created the standardized cause-specific CIFs we can now plot them.

```
. twoway (rarea F_male_pcm_lci F_male_pcm_uci tt, color(red%30)) ///
> (line F_male_pcm tt, color(red)) ///
> (rarea F_female_pcm_lci F_female_pcm_uci tt, color(blue%30)) ///
> (line F_female_pcm tt, color(blue)) ///
> , legend(order(2 "Males" 4 "Females") cols(1) ring(0) pos(11)) ///
> ylabel(, angle(h) format(%3.2f)) ///
> xtitle("Time from diagnosis (years)") ytitle("cause-specific CIF") ///
> title("PCM") ///
> name(pcm, replace)
.
. twoway (rarea F_male_death_lci F_male_death_uci tt, color(red%30)) ///
> (line F_male_death tt, color(red)) ///
> (rarea F_female_death_lci F_female_death_uci tt, color(blue%30)) ///
> (line F_female_death tt, color(blue)) ///
> , legend(order(2 "Males" 4 "Females") cols(1) ring(0) pos(11)) ///
> ylabel(, angle(h) format(%3.2f)) ///
> xtitle("Time from diagnosis (years)") ytitle("cause-specific CIF") ///
> title("Death") ///
> name(death, replace)
.
. graph combine pcm death, nocopies ycommon
```

This is similar to the top row of Figure 4 in the Kipourou *et al*. paper although I have shown both cause-specific CIFs with the y-axes over the same range.

For the contrast I just need to plot the new variables `cif_diff_pcm`

and `cif_diff_death`

together with their confidence limits.

```
. twoway (rarea cif_diff_pcm_lci cif_diff_pcm_uci tt, color(red%30)) ///
> (line cif_diff_pcm tt, color(red)) ///
> , legend(off) ///
> ylabel(, angle(h) format(%3.2f)) ///
> xtitle("Time from diagnosis (years)") ytitle("cause-specific CIF") ///
> title("PCM") ///
> name(pcm_diff, replace)
.
. twoway (rarea cif_diff_death_lci cif_diff_death_uci tt, color(red%30)) ///
> (line cif_diff_death tt, color(red)) ///
> , legend(off) ///
> ylabel(, angle(h) format(%3.2f)) ///
> xtitle("Time from diagnosis (years)") ytitle("cause-specific CIF") ///
> title("Death") ///
> name(death_diff, replace)
.
. graph combine pcm_diff death_diff, nocopies ycommon
```

This is similar to the bottom row of Figure 4 in the Kipourou *et al*. paper although, as above, I have shown both cause-specific CIFs with the y-axes over the same range.

## Using different survival models

`standsurv`

is a general command and it possible to use a variety of different survival models.
It is also possible to use different survival distributions for different causes.
In order to illustrate this I will use a weibull model for PCM where the shape parameter
is a function of age and a log-logistic accelerated failure time model for death.
I am not claiming these are sensible models, but just aim to show the versatility of `standsurv`

I fit the two models below.

```
. stset survtime, failure(event=1)
failure event: event == 1
obs. time interval: (0, survtime]
exit on or before: failure
------------------------------------------------------------------------------
1,373 total observations
0 exclusions
------------------------------------------------------------------------------
1,373 observations remaining, representing
115 failures in single-record/single-failure data
10,739.583 total analysis time at risk and under observation
at risk from t = 0
earliest observed entry t = 0
last observed exit t = 35.33333
. streg age mspike male, dist(weibull) anc(age)
failure _d: event == 1
analysis time _t: survtime
Fitting constant-only model:
Iteration 0: log likelihood = -452.48861
Iteration 1: log likelihood = -452.44835
Iteration 2: log likelihood = -452.44832
Fitting full model:
Iteration 0: log likelihood = -452.44832
Iteration 1: log likelihood = -437.70531
Iteration 2: log likelihood = -435.60846
Iteration 3: log likelihood = -435.60532
Iteration 4: log likelihood = -435.60532
Weibull PH regression
No. of subjects = 1,373 Number of obs = 1,373
No. of failures = 115
Time at risk = 10739.58334
LR chi2(3) = 33.69
Log likelihood = -435.60532 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
_t | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_t |
age | .0429871 .0171147 2.51 0.012 .0094429 .0765314
mspike | .8710854 .1643031 5.30 0.000 .5490573 1.193114
male | -.0308749 .1875072 -0.16 0.869 -.3983823 .3366325
_cons | -9.181064 1.316739 -6.97 0.000 -11.76183 -6.600303
-------------+----------------------------------------------------------------
ln_p |
age | -.0088294 .0042727 -2.07 0.039 -.0172039 -.000455
_cons | .802942 .3018274 2.66 0.008 .2113712 1.394513
------------------------------------------------------------------------------
. estimates store pcm2
.
. stset survtime, failure(event=2)
failure event: event == 2
obs. time interval: (0, survtime]
exit on or before: failure
------------------------------------------------------------------------------
1,373 total observations
0 exclusions
------------------------------------------------------------------------------
1,373 observations remaining, representing
854 failures in single-record/single-failure data
10,739.583 total analysis time at risk and under observation
at risk from t = 0
earliest observed entry t = 0
last observed exit t = 35.33333
. streg age mspike male, dist(llogistic)
failure _d: event == 2
analysis time _t: survtime
Fitting constant-only model:
Iteration 0: log likelihood = -2056.9536
Iteration 1: log likelihood = -2051.5925
Iteration 2: log likelihood = -2051.5305
Iteration 3: log likelihood = -2051.5305
Fitting full model:
Iteration 0: log likelihood = -2051.5305
Iteration 1: log likelihood = -1936.9223
Iteration 2: log likelihood = -1924.5341
Iteration 3: log likelihood = -1924.4786
Iteration 4: log likelihood = -1924.4786
Loglogistic AFT regression
No. of subjects = 1,373 Number of obs = 1,373
No. of failures = 854
Time at risk = 10739.58334
LR chi2(3) = 254.10
Log likelihood = -1924.4786 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
_t | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | -.0590111 .0039933 -14.78 0.000 -.0668378 -.0511844
mspike | .1093803 .075813 1.44 0.149 -.0392104 .2579711
male | -.4249628 .0857712 -4.95 0.000 -.5930714 -.2568543
_cons | 6.452482 .3170312 20.35 0.000 5.831113 7.073852
-------------+----------------------------------------------------------------
/lngamma | -.1515847 .0296779 -5.11 0.000 -.2097524 -.0934171
-------------+----------------------------------------------------------------
gamma | .8593451 .0255036 .810785 .9108135
------------------------------------------------------------------------------
. estimates store death2
```

I have stored the models using `estimates store`

and just need to pass these new
models to `standsurv`

. The rest of the `standsurv`

code is the same, with the
exception of creating new variables names.

```
. standsurv, crmodels(pcm2 death2) cif ci timevar(tt) verbose ///
> at1(male 1) at2(male 0) atvar(F2_male F2_female) ///
> contrast(difference) contrastvar(cif2_diff)
Calling main mata program
Reading in things to set up structure
Finished setting up structure
...............................
```

The plots can be reproduced the same way and are shown in a panel graph below.

They are reasonably similar to the previous plot, with some small differences.

## Some comments

At the moment my code is a bit inefficient, it loops over time points and repeates the numerical integration for each time point. Thus it will be quick for a single time point, but will take some time for more time points (needed for plotting). When I have some time over the summer, I will increase computational time considerably.

For the flexible parametric models I have modelled on the log cumulative hazard scale,
while Kipourou *et al* fitted models on the log hazard scale.
Our `strcs`

Stata command enables models on the log hazard scale to be fitted
and for standard survival models and relative survival models these are compatable
with `standsurv`

. I will also enable these to work for competing risks model after my vacation.
However, computations are easier and faster on the log cumulative hazard scale so unless
there is a good reason for not doing so, I prefer models on this scale. In the MGUS2 example, the
models here have very similar fits (BIC = 925.6 for `stpm2`

PCM model and 926.7 for
`strcs`

PCM model and BIC = 3656.6 for `stpm2`

death model and 3656.4 for
`strcs`

death model.

## References

Andersen, P. K., Geskus, R. B., de Witte, T., Putter, H. Competing risks in epidemiology: possibilities and pitfalls.
*Int J Epidemiol* 2012;41:861-70

Geskus, R. B. *Data analysis with competing risks and intermediate states.* Chapman and Hall 2016

Kipourou, D.-K., Charvat, H., Rachet, B., Belot, A. Estimation of the adjusted cause-specific cumulative probability using flexible regression models for the cause-specific hazards.

*Statistics in Medicine* 2019. DOI: 10.1002/sim.8209