stcrprep - non parametric cause-specific CIFs
I will use the same data set I use in the Stata Journal article on stcrprep
.
This comprises of 1977 patients from the European Blood and Marrow Transplantation (EBMT) registry who received an allogeneic bone
marrow transplantation. Time is measured in days from transplantation to either relapse
or death. There is only one covariate of interest, the EBMT risk score, which has been
categorized into 3 groups (low, medium and high risk). The data is available as part of
the mstate R package (de Wreede et al. 2011).
First I load the data,
. use http://www.pclambert.net/data/ebmt1_stata.dta, clear
(Written by R. )
. tab status
status | Freq. Percent Cum.
------------+-----------------------------------
censored | 836 42.29 42.29
relapse | 456 23.07 65.35
died | 685 34.65 100.00
------------+-----------------------------------
Total | 1,977 100.00
The tabulation shows that of the 1,977 subjects, 836 were censored, 456 had a relapse and 686 had a death before relapse. Now we can stset
the data declaring both relapse and death as an event in the failure()
option.
. stset time, failure(status==1,2) scale(365.25) id(patid)
id: patid
failure event: status == 1 2
obs. time interval: (time[_n-1], time]
exit on or before: failure
t for analysis: time/365.25
------------------------------------------------------------------------------
1,977 total observations
0 exclusions
------------------------------------------------------------------------------
1,977 observations remaining, representing
1,977 subjects
1,141 failures in single-failure-per-subject data
3,796.057 total analysis time at risk and under observation
at risk from t = 0
earliest observed entry t = 0
last observed exit t = 8.454483
In order to show how stcrprep
expands the data and calculates the probability of censoring weights for those with a competing event, I will list the data of a single individual before and after using stcrprep
. The listing is for subject 17 (patid==17
).
. list patid status _t0 _t _d if patid==17, noobs
+---------------------------------------+
| patid status _t0 _t _d |
|---------------------------------------|
| 17 died 0 2.2888433 1 |
+---------------------------------------+
This subject died after 2.29 years and before using stcrprep
has just has one row of
data.
Next I use stcrprep
to restructure the data. The events()
option requires the variable defining all possible events and the censored value. The trans()
option gives the transitions of the events of interest; here we
are interested in the transitions to both relapse(status=1
) and death (status=2
); this is actually the default, but is shown here for clarity. The keep()
option is used to list variables to retain in the expanded data; usually any covariates that will be later analysed are included here. The byg()
option requests the censoring distribution to be estimated separately for the given groups. Since we are first going to obtain a separate
non-parametric estimate of the cause-specific CIF in each group, the byg() option will estimate the censoring distribution separately in each group.
. stcrprep, events(status) keep(score) trans(1 2) byg(score)
. di "There are " _N " observations"
There are 70262 observations
. format tstart %6.5f
. format tstop %6.5f
. format weight_c %6.5f
. list failcode patid status tstart tstop weight_c weight_t status if patid==17, ///
> sepby(failcode) noobs
+------------------------------------------------------------------------------+
| failcode patid status tstart tstop weight_c weight_t status |
|------------------------------------------------------------------------------|
| relapse 17 died 0.00000 2.28884 1.00000 1 died |
| relapse 17 died 2.28884 2.31622 0.99000 1 died |
| relapse 17 died 2.31622 2.32717 0.98497 1 died |
| relapse 17 died 2.32717 2.36003 0.97992 1 died |
| relapse 17 died 2.36003 2.55441 0.91392 1 died |
| relapse 17 died 2.55441 2.65845 0.89843 1 died |
| relapse 17 died 2.65845 2.89938 0.85142 1 died |
| relapse 17 died 2.89938 3.02806 0.80937 1 died |
| relapse 17 died 3.02806 3.18960 0.76176 1 died |
| relapse 17 died 3.18960 3.26626 0.74578 1 died |
| relapse 17 died 3.26626 3.62765 0.63847 1 died |
| relapse 17 died 3.62765 3.89870 0.59519 1 died |
| relapse 17 died 3.89870 3.97536 0.57881 1 died |
| relapse 17 died 3.97536 4.10951 0.55124 1 died |
| relapse 17 died 4.10951 4.39425 0.51163 1 died |
| relapse 17 died 4.39425 4.50103 0.47714 1 died |
| relapse 17 died 4.50103 4.69815 0.45968 1 died |
| relapse 17 died 4.69815 5.08419 0.37101 1 died |
| relapse 17 died 5.08419 5.22656 0.32235 1 died |
| relapse 17 died 5.22656 5.33607 0.30995 1 died |
| relapse 17 died 5.33607 5.97673 0.22772 1 died |
| relapse 17 died 5.97673 6.27515 0.20170 1 died |
|------------------------------------------------------------------------------|
| died 17 died 0.00000 2.28884 1.00000 1 died |
+------------------------------------------------------------------------------+
After using stcrprep
the number of rows has increased from 1977 to 70262. The rows have been divided based on the failure of the newly created variable failcode
. This variable will be used to fit different models depending on the event of interest. The variables patid
and status
are the same as in the non expanded data. The variables tstart
and tstop
give the times an individual starts and stops being at risk. They
change within an individual when their weight, defined by variable weight_c
, changes value. The weight_t
gives the weights when there is left trunction. As there is no left truncation in this data, it takes the value 1 for all subjects at all times.
When failcode==1
this corresponds to when a relapse is the event of interest. As
the subject with patid==17
died after 2.29 years (i.e. had a competing event), they
are initially at risk until this time and they should receive a weight of 1 in the analysis.
After their death they are still kept in the risk set, but their weight decreases. The
decrease is based on the conditional probability of being censored which is estimated
using a non-parametric (Kaplan-Meier) estimate of the censoring distribution. The
weights only change at times when there is a failure for the event of interest and the
value of censoring distribution has changed.
When failcode==2
this corresponds to when death is the event of interest. Since
this patient experienced the event of interest (i.e. they died) rather than the competing event, they only require one row of data.
We can use sts graph
to give a plot of the cause-specific CIF. We first need to stset
the data utilizing the information on the weights contained in variable weights_c
by
specifiying iweights
.
. gen event = status == failcode
. stset tstop [iw=weight_c], failure(event) enter(tstart) noshow // stset using weights
failure event: event != 0 & event < .
obs. time interval: (0, tstop]
enter on or after: time tstart
exit on or before: failure
weight: [iweight=weight_c]
------------------------------------------------------------------------------
70,262 total observations
0 exclusions
------------------------------------------------------------------------------
70,262 observations remaining, representing
1,141 failures in single-record/single-failure data
13,820.402 total analysis time at risk and under observation
at risk from t = 0
earliest observed entry t = 0
last observed exit t = 8.454483
We first create the variable, event
. This is defined as 1 if the event of interest occurs
and zero otherwise. As we have split time data, we need to give information on the start
time (tstart
) and stop time (tstop
) of each row of data.
We use sts graph
in the usual way, but use the failure option as we are interested
in the probability of relapse as opposed to the probability of not having a relapse (which
includes the probability of death). For example, the cause-specific CIF for relapse can
be plotted as follows,
. sts graph if failcode==1, by(score) failure ///
> ytitle("Probability of Relapse") ///
> xtitle("Years since transplanation") ///
> ylabel(0(0.1)0.5, angle(h) format(%3.1f)) ///
> legend(order(1 "Low Risk" 2 "Medium Risk" 3 "High Risk") ///
> cols(1) ring(0) pos(5)) ///
> scheme(sj) name(cif_relapse, replace)
Note that the lines are extended to the maximum censoring time in each group, rather than the maximum event time.
Alternatively, sts gen
can be used to generate the cause-specific CIF and this can be
plotted with appropriate if statements to control the maximum follow-up time for each line.
It is also possible to list the CIF at specific time points using sts list
. For example, the cause-specific CIF at 1 and 5 years by risk group and for each cause can be obtained as follows,
. sts list, at(1 5) failure by(failcode score)
Beg. Failure Std.
Time Total Fail Function Error [95% Conf. Int.]
-------------------------------------------------------------------------------
relapse Low risk
1 348.001 38 0.0959 0.0148 0.0707 0.1295
5 94.7875 36 0.2268 0.0250 0.1821 0.2805
relapse Medium risk
1 1125.93 225 0.1636 0.0100 0.1451 0.1843
5 268.081 100 0.2594 0.0131 0.2347 0.2861
relapse High risk
1 116.387 39 0.2417 0.0338 0.1827 0.3156
5 6 10 0.3306 0.0410 0.2574 0.4181
died Low risk
1 306.828 81 0.2032 0.0202 0.1669 0.2462
5 94.9111 10 0.2368 0.0223 0.1964 0.2839
died Medium risk
1 915.771 441 0.3189 0.0126 0.2950 0.3442
5 209.617 70 0.3829 0.0137 0.3566 0.4104
died High risk
1 84.7723 73 0.4494 0.0392 0.3764 0.5296
5 6 7 0.5160 0.0452 0.4310 0.6071
-------------------------------------------------------------------------------
Note: Failure function is calculated over full data and evaluated at indicated
times; it is not calculated from aggregates shown at left.
Now, we can test for differences in the cause-specific CIF using sts test
. Note that is slightly different to the modified log rank test defined by Gray (1988).
. sts test score if failcode==1
Log-rank test for equality of survivor functions
| Events Events
score | observed expected
------------+-------------------------
Low risk | 79 99.64
Medium risk | 328 324.33
High risk | 49 32.04
------------+-------------------------
Total | 456 456.00
chi2(2) = 13.37
Pr>chi2 = 0.0012
References
de Wreede, L.; Fiocco, M. & Putter, H. mstate
: An R package for the analysis of competing risks and multi-state models. Journal of Statistical Software 2011;38.
Gray, R. A class of K-sample tests for comparing the cumulative incidence of a competing risk. The Annals of Statistics 1988;16:1141-1154.