stcrprep - computational benefits

When using stcrprep there are some computational benefits compared to using Stata's built-in stcrreg. One reason for this is that every time you fit a model using stcrreg the probability of censoring weights are calculated and the data are expanded (in the background) when maximising the likelihood. When using stcrprep you only need to do this once.

I have run some timings. If I fit a simple model to the ebmt1 data with risk score as the only covariate (2 dummy variables), then these are the timings on my current work laptop (Intel i5, running Stata 15 MP2).

First I load and stset the data.

. use https://www.pclambert.net/data/ebmt1_stata.dta, clear

. stset time, failure(status==1) scale(365.25) id(patid) noshow

Now, stcrreg can be used.

. timer clear

. timer on 1

. stcrreg i.score, compete(status==2) nolog noshow

Competing-risks regression                       No. of obs       =      1,977
                                                 No. of subjects  =      1,977
Failure event  : status == 1                     No. failed       =        456
Competing event: status == 2                     No. competing    =        685
                                                 No. censored     =        836

                                                 Wald chi2(2)     =       9.87
Log pseudolikelihood = -3333.3217                Prob > chi2      =     0.0072

                              (Std. Err. adjusted for 1,977 clusters in patid)
------------------------------------------------------------------------------
             |               Robust
          _t |        SHR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       score |
Medium risk  |   1.271221   .1554323     1.96   0.050     1.000333    1.615465
  High risk  |   1.769853   .3238535     3.12   0.002     1.236465    2.533337
------------------------------------------------------------------------------

. timer off 1

. timer list
   1:     23.32 /        1 =      23.3150

This takes 23.3 seconds to fit.

I now reload and stset the data, but this time declaring both status=1 and status=2 as events.

. use https://www.pclambert.net/data/ebmt1_stata.dta, clear

. stset time, failure(status==1,2) scale(365.25) id(patid)

We can now run stcrprep.

. timer on 2

. stcrprep, events(status) keep(score) trans(1)   

. timer off 2

. timer list 2
   2:      6.22 /        1 =       6.2240

This takes 6.2 seconds to run. However, this only restructures the data and calculates the weights. To fit the model, we first generate the event indicator and use stset.

. gen      event = status == failcode

. stset tstop [iw=weight_c], failure(event) enter(tstart) 

We use stcox to fit the model.

. timer on 3

. stcox i.score

         failure _d:  event
   analysis time _t:  tstop
  enter on or after:  time tstart
             weight:  [iweight=weight_c]

Iteration 0:   log likelihood = -3338.1244
Iteration 1:   log likelihood = -3333.4173
Iteration 2:   log likelihood = -3333.3113
Iteration 3:   log likelihood = -3333.3112
Refining estimates:
Iteration 0:   log likelihood = -3333.3112

Cox regression -- Breslow method for ties

No. of subjects =       72,880                  Number of obs    =      72,880
No. of failures =          456
Time at risk    =   6026.27434
                                                LR chi2(2)       =        9.63
Log likelihood  =   -3333.3112                  Prob > chi2      =      0.0081

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       score |
Medium risk  |   1.271235   .1593392     1.91   0.056     .9943389    1.625238
  High risk  |   1.769899   .3219273     3.14   0.002     1.239148     2.52798
------------------------------------------------------------------------------

. timer off 3

. timer list
   1:     23.32 /        1 =      23.3150
   2:      6.22 /        1 =       6.2240
   3:      1.41 /        1 =       1.4060

This takes 1.4 seconds to run, giving a combined total of 7.6 seconds. What is important is that if we want to fit other models (including other covariates, etc.), we do not need to run stcrprep again.
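For example (a sketch only; the timer number is arbitrary, and any additional covariates would need to have been listed in the keep() option of stcrprep), a further model treating the risk score as a linear trend can be fitted straight away on the already prepared and weighted data:

* refit using the prepared data and weights - no need to rerun stcrprep
timer on 4
stcox score
timer off 4
timer list 4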

To assess the timings on larger data I have expanded the dataset by 20 times and added a small random number to each survival time, so that there are no ties. I used the following code.

expand 20
replace time = time + runiform()*0.0001
replace patid = _n
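After expanding, the data also need to be stset again before repeating the two approaches; a minimal sketch, mirroring the two stset calls used above:

* before rerunning stcrreg
stset time, failure(status==1) scale(365.25) id(patid) noshow

* before rerunning stcrprep and stcox (on a freshly expanded copy of the data)
stset time, failure(status==1,2) scale(365.25) id(patid)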

This leads to 19,770 individuals in the analysis. The fact that there are no ties is perhaps a little unrealistic in a dataset of this size, but it is still a useful assessment of computational speed. The same analysis as above on this larger dataset gave the following times.

Command      Time (seconds)
stcrreg            2066.3
stcrprep            890.2
stcox                46.1

I think this really highlights the benefits of restructuring the data and using stcox in terms of computational time. Unless there is a need to recalculate the probability of censoring weights, there is no need to do this every time you fit a model. Thus, in this case an stcrreg model takes almost 35 minutes, whilst the same model using stcox after stcrprep takes only 46 seconds.

It is worth noting that Stata's implementation of Fine and Gray's proportional subhazards model using stcrreg seems particularly slow. If I fit the model to the expanded data in R using crr, it takes only 370 seconds compared to 2066 seconds in Stata.

There are other benefits of using stcox to fit the subhazards model, mainly because we can now use many of the other commands and extensions associated with stcox. I will discuss these in other tutorials.
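As one small taster (a sketch only, evaluated by default at the means of the score dummies, and not something I have timed here): because under the Fine and Gray model the cumulative incidence function is one minus exp(minus the cumulative subhazard), it can be plotted directly after the weighted stcox fit using stcurve.

* cumulative incidence function implied by the fitted subhazards model
stcurve, failure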
