stcrprep - computational benefits
When using stcrprep there are some computational benefits compared to using Stata's inbuilt stcrreg. One reason for this is that every time you fit a model using stcrreg, the probability of censoring weights are calculated and the data must be expanded (in the background) when maximising the likelihood. When using stcrprep the data is expanded once and then different models can be fitted to this expanded data.
I have run some timings. If I fit a simple model to the ebmt1 data with risk score as the only covariate (2 dummy variables), then these are the timings on my current work laptop (Intel i5, running Stata 15 MP4).
First I load and stset the data.
```stata
. use https://www.pclambert.net/data/ebmt1_stata.dta, clear
(Written by R.)

. stset time, failure(status==1) scale(365.25) id(patid) noshow

Survival-time data settings

           ID variable: patid
         Failure event: status==1
Observed time interval: (time[_n-1], time]
     Exit on or before: failure
     Time for analysis: time/365.25

--------------------------------------------------------------------------
      1,977  total observations
          0  exclusions
--------------------------------------------------------------------------
      1,977  observations remaining, representing
      1,977  subjects
        456  failures in single-failure-per-subject data
  3,796.057  total analysis time at risk and under observation
                                                At risk from t =         0
                                     Earliest observed entry t =         0
                                          Last observed exit t =  8.454483
```
Now, I fit the model using Stata's inbuilt stcrreg, timing how long it takes.
```stata
. timer clear

. timer on 1

. stcrreg i.score, compete(status==2) nolog noshow

Competing-risks regression                      No. of obs      =       1,977
                                                No. of subjects =       1,977
Failure event:   status == 1                    No. failed      =         456
Competing event: status == 2                    No. competing   =         685
                                                No. censored    =         836

                                                Wald chi2(2)    =        9.87
Log pseudolikelihood = -3333.3217               Prob > chi2     =      0.0072

                             (Std. err. adjusted for 1,977 clusters in patid)
------------------------------------------------------------------------------
             |               Robust
          _t |        SHR   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       score |
Medium risk  |   1.271221   .1554323     1.96   0.050     1.000333    1.615465
  High risk  |   1.769853   .3238535     3.12   0.002     1.236465    2.533337
------------------------------------------------------------------------------

. timer off 1

. timer list
   1:     12.64 /        1 =      12.6400
```
This takes 12.6 seconds to fit.
I now reload and stset the data, but this time declaring both status=1 and status=2 as events.
```stata
. use https://www.pclambert.net/data/ebmt1_stata.dta, clear

. stset time, failure(status==1,2) scale(365.25) id(patid)
```
We can now run stcrprep.
```stata
. timer on 2

. stcrprep, events(status) keep(score) trans(1)

. timer off 2

. timer list 2
   2:      0.62 /        1 =       0.6230
```
This takes 0.6 seconds to run. However, this only restructures the data and calculates the weights. To fit the model, we first generate the event indicator and use stset.
```stata
. gen event = status == failcode

. stset tstop [iw=weight_c], failure(event) enter(tstart)
```
We use stcox to fit the proportional subhazards model to the expanded data.
```stata
. timer on 3

. stcox i.score

        Failure _d: event
  Analysis time _t: tstop
  Enter on or after: time tstart
            Weight: [iweight=weight_c]

Iteration 0:  Log likelihood = -3338.1244
Iteration 1:  Log likelihood = -3333.4173
Iteration 2:  Log likelihood = -3333.3113
Iteration 3:  Log likelihood = -3333.3112
Refining estimates:
Iteration 0:  Log likelihood = -3333.3112

Cox regression with Breslow method for ties

No. of subjects =      72,880                  Number of obs    =      72,880
No. of failures =         456
Time at risk    =  6,026.2743
                                               LR chi2(2)       =        9.63
Log likelihood = -3333.3112                    Prob > chi2      =      0.0081

------------------------------------------------------------------------------
          _t | Haz. ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       score |
Medium risk  |   1.271235   .1593392     1.91   0.056     .9943389    1.625238
  High risk  |   1.769899   .3219273     3.14   0.002     1.239148     2.52798
------------------------------------------------------------------------------

. timer off 3

. timer list
   1:     12.64 /        1 =      12.6400
   2:      0.62 /        1 =       0.6230
   3:      0.67 /        1 =       0.6730
```
This takes 0.7 seconds to run, giving a combined total of 1.3 seconds. What is important is that if we want to fit other models (including models with other covariates, etc.), we do not need to run stcrprep again.
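For example, any covariates that might be needed later can be listed in the keep() option when stcrprep is first run, and then several different models can be fitted to the one expanded dataset. Below is a minimal sketch of this idea; age is a hypothetical covariate used purely for illustration and is not part of the analysis above.

```stata
* A sketch only: expand the data once, keeping every covariate that may be
* needed later (age is a hypothetical covariate, not used elsewhere in this post)
stcrprep, events(status) keep(score age) trans(1)
gen event = status == failcode
stset tstop [iw=weight_c], failure(event) enter(tstart)

* ...then fit as many models as required without re-running stcrprep
stcox i.score
stcox i.score age
stcox i.score##c.age
```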
To assess the timings on larger data, I expanded the data by 20 times and added a small random number to each survival time so that there are no ties. I used the following code.
```stata
expand 20
replace time = time + runiform()*0.0001
replace patid = _n
```
This leads to 19,770 individuals in the analysis. The fact that there are no ties is perhaps a little unrealistic in a dataset of this size, but it is still a useful assessment of computational speed. The same analysis as above on this larger dataset gave the following times (a sketch of how the comparison can be re-run is shown after the table).
| Command | Time |
|---|---|
| stcrreg | 2066.3 seconds |
| stcrprep | 890.2 seconds |
| stcox | 46.1 seconds |
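For reference, the comparison on the larger dataset can be repeated along the same lines as before. The sketch below assumes the data have just been expanded using the code above; the timer numbers 4 to 6 are arbitrary.

```stata
* Re-stset the expanded data and time stcrreg
stset time, failure(status==1) scale(365.25) id(patid) noshow
timer on 4
stcrreg i.score, compete(status==2) nolog noshow
timer off 4

* Restructure once with stcrprep, then fit the same model with stcox
stset time, failure(status==1,2) scale(365.25) id(patid)
timer on 5
stcrprep, events(status) keep(score) trans(1)
timer off 5

capture drop event              // in case event remains from an earlier run
gen event = status == failcode
stset tstop [iw=weight_c], failure(event) enter(tstart)
timer on 6
stcox i.score
timer off 6

timer list
```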
I think this really highlights the benefits of restructuring the data and using stcox in terms of computational time. Unless the probability of censoring weights needs to be recalculated, there is no need to repeat that step every time you fit a model. Thus, in this case the stcrreg model takes almost 35 minutes, whilst the same model using stcox, after using stcrprep, takes only 46 seconds.
It is worth noting that Stata's implementation of Fine and Gray's proportional subhazards model using stcrreg seems particularly slow. If I fit the model in R using crr, it takes only 370 seconds compared to 2066 seconds in Stata.
There are other benefits of using stcox to fit the subhazards model, mainly because we can now use many of the other commands and extensions associated with stcox. I will discuss these in other tutorials.
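As one small illustration (a sketch rather than part of this post's analysis, and assuming predict behaves as usual with the weights created by stcrprep), the baseline cumulative subhazard can be predicted after stcox and transformed to a cumulative incidence function, since CIF(t) = 1 - exp(-H(t)) on the subdistribution scale; the variable names below are hypothetical.

```stata
* A sketch only: predict the baseline cumulative subhazard (the baseline here
* is the Low risk group) and convert it to a cumulative incidence function
predict cumsubhaz, basechazard
gen cif_lowrisk = 1 - exp(-cumsubhaz)
```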