# stcrprep - non parametric cause-specific CIFs

I will use the same data set I use in the Stata Journal article on stcrprep. This comprises of 1977 patients from the European Blood and Marrow Transplantation (EBMT) registry who received an allogeneic bone marrow transplantation. Time is measured in days from transplantation to either relapse or death. There is only one covariate of interest, the EBMT risk score, which has been categorized into 3 groups (low, medium and high risk). The data is available as part of the mstate R package (de Wreede et al. 2011).

. use http://www.pclambert.net/data/ebmt1_stata.dta, clear
(Written by R.              )

. tab status

status |      Freq.     Percent        Cum.
------------+-----------------------------------
censored |        836       42.29       42.29
relapse |        456       23.07       65.35
died |        685       34.65      100.00
------------+-----------------------------------
Total |      1,977      100.00



The tabulation shows that of the 1,977 subjects, 836 were censored, 456 had a relapse and 686 had a death before relapse. Now we can stset the data declaring both relapse and death as an event in the failure() option.

. stset time, failure(status==1,2) scale(365.25) id(patid)

id:  patid
failure event:  status == 1 2
obs. time interval:  (time[_n-1], time]
exit on or before:  failure
t for analysis:  time/365.25

------------------------------------------------------------------------------
1,977  total observations
0  exclusions
------------------------------------------------------------------------------
1,977  observations remaining, representing
1,977  subjects
1,141  failures in single-failure-per-subject data
3,796.057  total analysis time at risk and under observation
at risk from t =         0
earliest observed entry t =         0
last observed exit t =  8.454483



In order to show how stcrprep expands the data and calculates the probability of censoring weights for those with a competing event, I will list the data of a single individual before and after using stcrprep. The listing is for subject 17 (patid==17).

. list patid status _t0 _t _d if patid==17, noobs

+---------------------------------------+
| patid   status   _t0          _t   _d |
|---------------------------------------|
|    17     died     0   2.2888433    1 |
+---------------------------------------+



This subject died after 2.29 years and before using stcrprep has just has one row of data.

Next I use stcrprep to restructure the data. The events() option requires the variable defining all possible events and the censored value. The trans() option gives the transitions of the events of interest; here we are interested in the transitions to both relapse(status=1) and death (status=2); this is actually the default, but is shown here for clarity. The keep() option is used to list variables to retain in the expanded data; usually any covariates that will be later analysed are included here. The byg() option requests the censoring distribution to be estimated separately for the given groups. Since we are first going to obtain a separate non-parametric estimate of the cause-specific CIF in each group, the byg() option will estimate the censoring distribution separately in each group.

. stcrprep, events(status) keep(score) trans(1 2) byg(score)

. di "There are " _N " observations"
There are 70262 observations

. format tstart %6.5f

. format tstop %6.5f

. format weight_c %6.5f

. list failcode patid status tstart tstop weight_c weight_t status if patid==17, ///
>          sepby(failcode) noobs

+------------------------------------------------------------------------------+
| failcode   patid   status    tstart     tstop   weight_c   weight_t   status |
|------------------------------------------------------------------------------|
|  relapse      17     died   0.00000   2.28884    1.00000          1     died |
|  relapse      17     died   2.28884   2.31622    0.99000          1     died |
|  relapse      17     died   2.31622   2.32717    0.98497          1     died |
|  relapse      17     died   2.32717   2.36003    0.97992          1     died |
|  relapse      17     died   2.36003   2.55441    0.91392          1     died |
|  relapse      17     died   2.55441   2.65845    0.89843          1     died |
|  relapse      17     died   2.65845   2.89938    0.85142          1     died |
|  relapse      17     died   2.89938   3.02806    0.80937          1     died |
|  relapse      17     died   3.02806   3.18960    0.76176          1     died |
|  relapse      17     died   3.18960   3.26626    0.74578          1     died |
|  relapse      17     died   3.26626   3.62765    0.63847          1     died |
|  relapse      17     died   3.62765   3.89870    0.59519          1     died |
|  relapse      17     died   3.89870   3.97536    0.57881          1     died |
|  relapse      17     died   3.97536   4.10951    0.55124          1     died |
|  relapse      17     died   4.10951   4.39425    0.51163          1     died |
|  relapse      17     died   4.39425   4.50103    0.47714          1     died |
|  relapse      17     died   4.50103   4.69815    0.45968          1     died |
|  relapse      17     died   4.69815   5.08419    0.37101          1     died |
|  relapse      17     died   5.08419   5.22656    0.32235          1     died |
|  relapse      17     died   5.22656   5.33607    0.30995          1     died |
|  relapse      17     died   5.33607   5.97673    0.22772          1     died |
|  relapse      17     died   5.97673   6.27515    0.20170          1     died |
|------------------------------------------------------------------------------|
|     died      17     died   0.00000   2.28884    1.00000          1     died |
+------------------------------------------------------------------------------+



After using stcrprep the number of rows has increased from 1977 to 70262. The rows have been divided based on the failure of the newly created variable failcode. This variable will be used to fit different models depending on the event of interest. The variables patid and status are the same as in the non expanded data. The variables tstart and tstop give the times an individual starts and stops being at risk. They change within an individual when their weight, defined by variable weight_c, changes value. The weight_t gives the weights when there is left trunction. As there is no left truncation in this data, it takes the value 1 for all subjects at all times.

When failcode==1 this corresponds to when a relapse is the event of interest. As the subject with patid==17 died after 2.29 years (i.e. had a competing event), they are initially at risk until this time and they should receive a weight of 1 in the analysis. After their death they are still kept in the risk set, but their weight decreases. The decrease is based on the conditional probability of being censored which is estimated using a non-parametric (Kaplan-Meier) estimate of the censoring distribution. The weights only change at times when there is a failure for the event of interest and the value of censoring distribution has changed.

When failcode==2 this corresponds to when death is the event of interest. Since this patient experienced the event of interest (i.e. they died) rather than the competing event, they only require one row of data.

We can use sts graph to give a plot of the cause-specific CIF. We first need to stset the data utilizing the information on the weights contained in variable weights_c by specifiying iweights.

. gen event = status == failcode

. stset tstop [iw=weight_c], failure(event) enter(tstart) noshow                                          // stset using weights

failure event:  event != 0 & event < .
obs. time interval:  (0, tstop]
enter on or after:  time tstart
exit on or before:  failure
weight:  [iweight=weight_c]

------------------------------------------------------------------------------
70,262  total observations
0  exclusions
------------------------------------------------------------------------------
70,262  observations remaining, representing
1,141  failures in single-record/single-failure data
13,820.402  total analysis time at risk and under observation
at risk from t =         0
earliest observed entry t =         0
last observed exit t =  8.454483



We first create the variable, event. This is defined as 1 if the event of interest occurs and zero otherwise. As we have split time data, we need to give information on the start time (tstart) and stop time (tstop) of each row of data. We use sts graph in the usual way, but use the failure option as we are interested in the probability of relapse as opposed to the probability of not having a relapse (which includes the probability of death). For example, the cause-specific CIF for relapse can be plotted as follows,

. sts graph if failcode==1, by(score) failure ///
>         ytitle("Probability of Relapse") ///
>         xtitle("Years since transplanation") ///
>         ylabel(0(0.1)0.5, angle(h) format(%3.1f)) ///
>         legend(order(1 "Low Risk" 2 "Medium Risk" 3 "High Risk") ///
>         cols(1) ring(0) pos(5)) ///
>         scheme(sj) name(cif_relapse, replace)



Note that the lines are extended to the maximum censoring time in each group, rather than the maximum event time. Alternatively, sts gen can be used to generate the cause-specific CIF and this can be plotted with appropriate if statements to control the maximum follow-up time for each line.

It is also possible to list the CIF at specific time points using sts list. For example, the cause-specific CIF at 1 and 5 years by risk group and for each cause can be obtained as follows,

. sts list, at(1 5) failure by(failcode score)

Beg.                      Failure       Std.
Time     Total     Fail             Function     Error     [95% Conf. Int.]
-------------------------------------------------------------------------------
relapse Low risk
1   348.001       38              0.0959    0.0148     0.0707    0.1295
5   94.7875       36              0.2268    0.0250     0.1821    0.2805
relapse Medium risk
1   1125.93      225              0.1636    0.0100     0.1451    0.1843
5   268.081      100              0.2594    0.0131     0.2347    0.2861
relapse High risk
1   116.387       39              0.2417    0.0338     0.1827    0.3156
5         6       10              0.3306    0.0410     0.2574    0.4181
died Low risk
1   306.828       81              0.2032    0.0202     0.1669    0.2462
5   94.9111       10              0.2368    0.0223     0.1964    0.2839
died Medium risk
1   915.771      441              0.3189    0.0126     0.2950    0.3442
5   209.617       70              0.3829    0.0137     0.3566    0.4104
died High risk
1   84.7723       73              0.4494    0.0392     0.3764    0.5296
5         6        7              0.5160    0.0452     0.4310    0.6071
-------------------------------------------------------------------------------
Note: Failure function is calculated over full data and evaluated at indicated
times; it is not calculated from aggregates shown at left.



Now, we can test for differences in the cause-specific CIF using sts test. Note that is slightly different to the modified log rank test defined by Gray (1988).

. sts test score if failcode==1

Log-rank test for equality of survivor functions

|   Events         Events
score       |  observed       expected
------------+-------------------------
Low risk    |        79          99.64
Medium risk |       328         324.33
High risk   |        49          32.04
------------+-------------------------
Total       |       456         456.00

chi2(2) =      13.37
Pr>chi2 =     0.0012



## References

de Wreede, L.; Fiocco, M. & Putter, H. mstate: An R package for the analysis of competing risks and multi-state models. Journal of Statistical Software 2011;38.

Gray, R. A class of K-sample tests for comparing the cumulative incidence of a competing risk. The Annals of Statistics 1988;16:1141-1154.