stcrprep - computational benefits
When using stcrprep there are some computational benefits compared to using Stata's inbuilt stcrreg. One reason for this is that every time you fit a model using stcrreg, the probability of censoring weights are calculated and the data must be expanded (in the background) when maximising the likelihood. When using stcrprep the data is expanded once and then different models can be fitted to this expanded data.
I have run some timings. If I fit a simple model to the ebmt1 data with risk score as the only covariate (2 dummy variables), then these are the timings on my current work laptop (Intel i5, running Stata 15 MP4).
First I load and stset the data.
```stata
. use https://www.pclambert.net/data/ebmt1_stata.dta, clear
(Written by R.)

. stset time, failure(status==1) scale(365.25) id(patid) noshow

Survival-time data settings

           ID variable: patid
         Failure event: status==1
Observed time interval: (time[_n-1], time]
     Exit on or before: failure
     Time for analysis: time/365.25

--------------------------------------------------------------------------
      1,977  total observations
          0  exclusions
--------------------------------------------------------------------------
      1,977  observations remaining, representing
      1,977  subjects
        456  failures in single-failure-per-subject data
  3,796.057  total analysis time at risk and under observation
                                                At risk from t =         0
                                     Earliest observed entry t =         0
                                          Last observed exit t =  8.454483
```
Now, I fit the model using Stata's inbuilt stcrreg, timing how long it takes.
```stata
. timer clear

. timer on 1

. stcrreg i.score, compete(status==2) nolog noshow

Competing-risks regression                      No. of obs      =       1,977
                                                No. of subjects =       1,977
Failure event:   status == 1                    No. failed      =         456
Competing event: status == 2                    No. competing   =         685
                                                No. censored    =         836

                                                Wald chi2(2)    =        9.87
Log pseudolikelihood = -3333.3217               Prob > chi2     =      0.0072

                             (Std. err. adjusted for 1,977 clusters in patid)
------------------------------------------------------------------------------
             |               Robust
          _t |        SHR   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       score |
Medium risk  |   1.271221   .1554323     1.96   0.050     1.000333    1.615465
  High risk  |   1.769853   .3238535     3.12   0.002     1.236465    2.533337
------------------------------------------------------------------------------

. timer off 1

. timer list
   1:     12.64 /        1 =      12.6400
```
This takes 12.6 seconds to fit.
I now reload and stset the data, but this time declaring both status=1 and status=2 as events.
```stata
. use https://www.pclambert.net/data/ebmt1_stata.dta, clear

. stset time, failure(status==1,2) scale(365.25) id(patid)
```
We can now run stcrprep.
```stata
. timer on 2

. stcrprep, events(status) keep(score) trans(1)

. timer off 2

. timer list 2
   2:      0.62 /        1 =       0.6230
```
This takes 0.6 seconds to run. However, this only restructures the data and calculates the weights. To fit the model, we first generate the event indicator and use stset.
```stata
. gen event = status == failcode

. stset tstop [iw=weight_c], failure(event) enter(tstart)
```
We use stcox to fit the proportional subhazards model to the expanded data.
```stata
. timer on 3

. stcox i.score

        Failure _d: event
  Analysis time _t: tstop
  Enter on or after: time tstart
            Weight: [iweight=weight_c]

Iteration 0:  Log likelihood = -3338.1244
Iteration 1:  Log likelihood = -3333.4173
Iteration 2:  Log likelihood = -3333.3113
Iteration 3:  Log likelihood = -3333.3112
Refining estimates:
Iteration 0:  Log likelihood = -3333.3112

Cox regression with Breslow method for ties

No. of subjects =      72,880                  Number of obs    =      72,880
No. of failures =         456
Time at risk    =  6,026.2743
                                               LR chi2(2)       =        9.63
Log likelihood = -3333.3112                    Prob > chi2      =      0.0081

------------------------------------------------------------------------------
          _t | Haz. ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       score |
Medium risk  |   1.271235   .1593392     1.91   0.056     .9943389    1.625238
  High risk  |   1.769899   .3219273     3.14   0.002     1.239148     2.52798
------------------------------------------------------------------------------

. timer off 3

. timer list
   1:     12.64 /        1 =      12.6400
   2:      0.62 /        1 =       0.6230
   3:      0.67 /        1 =       0.6730
```
This takes 0.7 seconds to run, giving a combined total of 1.3 seconds. What is important is that if we want to fit other models (including models with other covariates, etc.), we do not need to run stcrprep again.
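For example, any covariates that might be needed later can be listed in the keep() option when stcrprep is first run, and then several different models can be fitted to the one expanded dataset. Below is a minimal sketch of this idea; age is a hypothetical covariate used purely for illustration and is not part of the analysis above.

```stata
* A sketch only: expand the data once, keeping every covariate that may be
* needed later (age is a hypothetical covariate, not used elsewhere in this post)
stcrprep, events(status) keep(score age) trans(1)
gen event = status == failcode
stset tstop [iw=weight_c], failure(event) enter(tstart)

* ...then fit as many models as required without re-running stcrprep
stcox i.score
stcox i.score age
stcox i.score##c.age
```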
To assess the timings on larger data, I expanded the data by 20 times and added a small random number to each survival time so that there are no ties. I used the following code.
```stata
expand 20
replace time = time + runiform()*0.0001
replace patid = _n
```
This leads to 19,770 individuals in the analysis. The fact that there are no ties is perhaps a little unrealistic in a dataset of this size, but it is still a useful assessment of computational speed. The same analysis as above on this larger dataset gave the following times (a sketch of how the comparison can be re-run is shown after the table).
| Command | Time |
|---|---|
| stcrreg | 2066.3 seconds |
| stcrprep | 890.2 seconds |
| stcox | 46.1 seconds |
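For reference, the comparison on the larger dataset can be repeated along the same lines as before. The sketch below assumes the data have just been expanded using the code above; the timer numbers 4 to 6 are arbitrary.

```stata
* Re-stset the expanded data and time stcrreg
stset time, failure(status==1) scale(365.25) id(patid) noshow
timer on 4
stcrreg i.score, compete(status==2) nolog noshow
timer off 4

* Restructure once with stcrprep, then fit the same model with stcox
stset time, failure(status==1,2) scale(365.25) id(patid)
timer on 5
stcrprep, events(status) keep(score) trans(1)
timer off 5

capture drop event              // in case event remains from an earlier run
gen event = status == failcode
stset tstop [iw=weight_c], failure(event) enter(tstart)
timer on 6
stcox i.score
timer off 6

timer list
```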
I think this really highlights the benefits of restructuring the data and using stcox in terms of computational time. Unless the probability of censoring weights needs to be recalculated, there is no need to repeat that step every time you fit a model. Thus, in this case the stcrreg model takes almost 35 minutes, whilst the same model using stcox, after using stcrprep, takes only 46 seconds.
It is worth noting that Stata's implementation of Fine and Gray's proportional subhazards model using stcrreg seems particularly slow. If I fit the model in R using crr, it takes only 370 seconds compared to 2066 seconds in Stata.
There are other benefits of using stcox to fit the subhazards model, mainly because we can now use many of the other commands and extensions associated with stcox. I will discuss these in other tutorials.
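As one small illustration (a sketch rather than part of this post's analysis, and assuming predict behaves as usual with the weights created by stcrprep), the baseline cumulative subhazard can be predicted after stcox and transformed to a cumulative incidence function, since CIF(t) = 1 - exp(-H(t)) on the subdistribution scale; the variable names below are hypothetical.

```stata
* A sketch only: predict the baseline cumulative subhazard (the baseline here
* is the Low risk group) and convert it to a cumulative incidence function
predict cumsubhaz, basechazard
gen cif_lowrisk = 1 - exp(-cumsubhaz)
```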