stcrprep - non parametric cause-specific CIFs
clear frames
.
set scheme fpsaus_c .
I will use the same data set I use in the Stata Journal article on stcrprep
. This comprises of 1977 patients from the European Blood and Marrow Transplantation (EBMT) registry who received an allogeneic bone marrow transplantation. Time is measured in days from transplantation to either relapse or death. There is only one covariate of interest, the EBMT risk score, which has been categorized into 3 groups (low, medium and high risk). The data is available as part of the mstate R package (de Wreede et al. 2011).
First I load the data,
use http://www.pclambert.net/data/ebmt1_stata.dta, clear
. by R. )
(Written
tab status
.
status | Freq. Percent Cum.
------------+-----------------------------------
censored | 836 42.29 42.29
relapse | 456 23.07 65.35
died | 685 34.65 100.00
------------+----------------------------------- Total | 1,977 100.00
The tabulation shows that of the 1,977 subjects, 836 were censored, 456 had a relapse and 686 had a death before relapse. Now we can stset
the data declaring both relapse and death as an event in the failure()
option.
stset time, failure(status==1,2) scale(365.25) id(patid)
.
data settings
Survival-time
variable: patid
ID
Failure event: status==1 2_n-1], time]
Observed time interval: (time[on or before: failure
Exit for analysis: time/365.25
Time
--------------------------------------------------------------------------total observations
1,977
0 exclusions
--------------------------------------------------------------------------
1,977 observations remaining, representing
1,977 subjectsin single-failure-per-subject data
1,141 failures total analysis time at risk and under observation
3,796.057
At risk from t = 0
Earliest observed entry t = 0exit t = 8.454483 Last observed
In order to show how stcrprep
expands the data and calculates the probability of censoring weights for those with a competing event, I will list the data of a single individual before and after using stcrprep
. The listing is for subject 17 (patid==17
).
list patid status _t0 _t _d if patid==17, noobs
.
+---------------------------------------+
| patid status _t0 _t _d |
|---------------------------------------|
| 17 died 0 2.2888433 1 | +---------------------------------------+
This subject died after 2.29 years and before using stcrprep
has just has one row of data.
Next I use stcrprep
to restructure the data. The events()
option requires the variable defining all possible events and the censored value. The trans()
option gives the transitions of the events of interest; here we are interested in the transitions to both relapse(status=1
) and death (status=2
); this is actually the default, but is shown here for clarity. The keep()
option is used to list variables to retain in the expanded data; usually any covariates that will be later analysed are included here. The byg()
option requests the censoring distribution to be estimated separately for the given groups. Since we are first going to obtain a separate non-parametric estimate of the cause-specific CIF in each group, the byg() option will estimate the censoring distribution separately in each group.
keep(score) trans(1 2) byg(score)
. stcrprep, events(status)
di "There are " _N " observations"
.
There are 70262 observations
format tstart %6.5f
.
format tstop %6.5f
.
format weight_c %6.5f
.
list failcode patid status tstart tstop weight_c weight_t status if patid==17, ///
. noobs
> sepby(failcode)
+------------------------------------------------------------------------------+
| failcode patid status tstart tstop weight_c weight_t status |
|------------------------------------------------------------------------------|
| relapse 17 died 0.00000 2.28884 1.00000 1 died |
| relapse 17 died 2.28884 2.31622 0.99000 1 died |
| relapse 17 died 2.31622 2.32717 0.98497 1 died |
| relapse 17 died 2.32717 2.36003 0.97992 1 died |
| relapse 17 died 2.36003 2.55441 0.91392 1 died |
| relapse 17 died 2.55441 2.65845 0.89843 1 died |
| relapse 17 died 2.65845 2.89938 0.85142 1 died |
| relapse 17 died 2.89938 3.02806 0.80937 1 died |
| relapse 17 died 3.02806 3.18960 0.76176 1 died |
| relapse 17 died 3.18960 3.26626 0.74578 1 died |
| relapse 17 died 3.26626 3.62765 0.63847 1 died |
| relapse 17 died 3.62765 3.89870 0.59519 1 died |
| relapse 17 died 3.89870 3.97536 0.57881 1 died |
| relapse 17 died 3.97536 4.10951 0.55124 1 died |
| relapse 17 died 4.10951 4.39425 0.51163 1 died |
| relapse 17 died 4.39425 4.50103 0.47714 1 died |
| relapse 17 died 4.50103 4.69815 0.45968 1 died |
| relapse 17 died 4.69815 5.08419 0.37101 1 died |
| relapse 17 died 5.08419 5.22656 0.32235 1 died |
| relapse 17 died 5.22656 5.33607 0.30995 1 died |
| relapse 17 died 5.33607 5.97673 0.22772 1 died |
| relapse 17 died 5.97673 6.27515 0.20170 1 died |
|------------------------------------------------------------------------------|
| died 17 died 0.00000 2.28884 1.00000 1 died | +------------------------------------------------------------------------------+
After using stcrprep
the number of rows has increased from 1977 to 70262. The rows have been divided based on the failure of the newly created variable failcode
. This variable will be used to fit different models depending on the event of interest. The variables patid
and status
are the same as in the non expanded data. The variables tstart
and tstop
give the times an individual starts and stops being at risk. They change within an individual when their weight, defined by variable weight_c
, changes value. The weight_t
gives the weights when there is left trunction. As there is no left truncation in this data, it takes the value 1 for all subjects at all times.
When failcode==1
this corresponds to when a relapse is the event of interest. As the subject with patid==17
died after 2.29 years (i.e. had a competing event), they are initially at risk until this time and they should receive a weight of 1 in the analysis. After their death they are still kept in the risk set, but their weight decreases. The decrease is based on the conditional probability of being censored which is estimated using a non-parametric (Kaplan-Meier) estimate of the censoring distribution. The weights only change at times when there is a failure for the event of interest and the value of censoring distribution has changed.
When failcode==2
this corresponds to when death is the event of interest. Since this patient experienced the event of interest (i.e. they died) rather than the competing event, they only require one row of data.
We can use sts graph
to give a plot of the cause-specific CIF. We first need to stset
the data utilizing the information on the weights contained in variable weights_c
by specifiying iweights
.
gen event = status == failcode
.
stset tstop [iw=weight_c], failure(event) enter(tstart) noshow
. // stset using weights
>
data settings
Survival-time
Failure event: event!=0 & event<.
Observed time interval: (0, tstop]on or after: time tstart
Enter on or before: failure
Exit iweight=weight_c]
Weight: [
--------------------------------------------------------------------------total observations
70,262
0 exclusions
--------------------------------------------------------------------------
70,262 observations remaining, representingin single-record/single-failure data
1,141 failures total analysis time at risk and under observation
13,820.402
At risk from t = 0
Earliest observed entry t = 0exit t = 8.454483 Last observed
We first create the variable, event
. This is defined as 1 if the event of interest occurs and zero otherwise. As we have split time data, we need to give information on the start time (tstart
) and stop time (tstop
) of each row of data. We use sts graph
in the usual way, but use the failure option as we are interested in the probability of relapse as opposed to the probability of not having a relapse (which includes the probability of death). For example, the cause-specific CIF for relapse can be plotted as follows,
sts graph if failcode==1, by(score) failure ///
. ytitle("Probability of Relapse") ///
> xtitle("Years since transplanation") ///
> ylabel(0(0.1)0.5, angle(h) format(%3.1f)) ///
> legend(order(1 "Low Risk" 2 "Medium Risk" 3 "High Risk") ///
> cols(1) ring(0) pos(5)) ///
> name(cif_relapse, replace) >
Note that the lines are extended to the maximum censoring time in each group, rather than the maximum event time. Alternatively, sts gen
can be used to generate the cause-specific CIF and this can be plotted with appropriate if statements to control the maximum follow-up time for each line.
It is also possible to list the CIF at specific time points using sts list
. For example, the cause-specific CIF at 1 and 5 years by risk group and for each cause can be obtained as follows,
sts list, at(1 5) failure by(failcode score)
.
function
Kaplan–Meier failure score
By variables: failcode
Beg. Failure Std.total Fail function error [95% conf. int.]
Time
----------------------------------------------------------------------
relapse Low risk
1 348.001 38 0.0959 0.0148 0.0707 0.1295
5 94.7875 36 0.2268 0.0250 0.1821 0.2805
relapse Medium risk
1 1125.93 225 0.1636 0.0100 0.1451 0.1843
5 268.081 100 0.2594 0.0131 0.2347 0.2861
relapse High risk
1 116.387 39 0.2417 0.0338 0.1827 0.3156
5 6 10 0.3306 0.0410 0.2574 0.4181
died Low risk
1 306.828 81 0.2032 0.0202 0.1669 0.2462
5 94.9111 10 0.2368 0.0223 0.1964 0.2839
died Medium risk
1 915.771 441 0.3189 0.0126 0.2950 0.3442
5 209.617 70 0.3829 0.0137 0.3566 0.4104
died High risk
1 84.7723 73 0.4494 0.0392 0.3764 0.5296
5 6 7 0.5160 0.0452 0.4310 0.6071
----------------------------------------------------------------------function is calculated over full data and evaluated at
Note: Failure not calculated from aggregates shown at left. indicated times; it is
Now, we can test for differences in the cause-specific CIF using sts test
. Note that is slightly different to the modified log rank test defined by Gray (1988).
sts test score if failcode==1
.
of survivor functions
Equality rank test
Log-
| Observed Expectedscore | events events
------------+-------------------------
Low risk | 79 99.64
Medium risk | 328 324.33
High risk | 49 32.04
------------+-------------------------
Total | 456 456.00
chi2(2) = 13.37
chi2 = 0.0012 Pr>
References
de Wreede, L.; Fiocco, M. & Putter, H. mstate
: An R package for the analysis of competing risks and multi-state models. Journal of Statistical Software 2011;38.
Gray, R. A class of K-sample tests for comparing the cumulative incidence of a competing risk. The Annals of Statistics 1988;16:1141-1154.