Extended Functions

Include non-linear function in a varlist

A new addition to stpm3 is the use of extended functions within a model varlist. This allows you to specify various spline, polynomials, fractional polynomials functions directly when specifying the model. This makes fitting the model slightly easier, but the main advantage is the ease at which you can get predictions from complex models.

We first load the Rotterdam 2 breast cancer data and then use stset to declare the survival time and event indicator.

. use https://www.pclambert.net/data/rott2b, clear
(Rotterdam breast cancer data (augmented with cause of death))

. stset os, f(osi==1) scale(12) exit(time 60)

Survival-time data settings

         Failure event: osi==1
Observed time interval: (0, os]
     Exit on or before: time 60
     Time for analysis: time/12

--------------------------------------------------------------------------
      2,982  total observations
          0  exclusions
--------------------------------------------------------------------------
      2,982  observations remaining, representing
        753  failures in single-record/single-failure data
 13,038.968  total analysis time at risk and under observation
                                                At risk from t =         0
                                     Earliest observed entry t =         0
                                          Last observed exit t =         5

If you want to include a non-linear effect for a covariate in a model, the usual approach would be to generate some new derived variables and then include these in the model. For example, generate a quadratic and cubic term or generate spline basis functions using mkspline or rcsgen.

The code below shows how to include non-linear effects using restricted cubic splines with rcsgen with 4 knots, which equates to 3 spline variables.

. rcsgen age, gen(agercs) df(3)
Variables agercs1 to agercs3 were created

. global ageknots `r(knots)'

. stpm3 i.hormon agercs1-agercs3, scale(lncumhazard) df(5) 

Iteration 0:   log likelihood = -2120.3313  
Iteration 1:   log likelihood = -2120.0705  
Iteration 2:   log likelihood = -2120.0694  
Iteration 3:   log likelihood = -2120.0694  

                                                        Number of obs =  2,982
                                                        Wald chi2(4)  =  69.03
Log likelihood = -2120.0694                             Prob > chi2   = 0.0000

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
xb           |
      hormon |
        yes  |   .3966145   .1041656     3.81   0.000     .1924536    .6007754
     agercs1 |  -.0430389   .0160545    -2.68   0.007    -.0745051   -.0115727
     agercs2 |  -.0000314    .000038    -0.82   0.410    -.0001059    .0000432
     agercs3 |  -1.70e-07   .0000357    -0.00   0.996    -.0000701    .0000697
-------------+----------------------------------------------------------------
time         |
        _ns1 |  -22.10362   2.190797   -10.09   0.000    -26.39751   -17.80974
        _ns2 |   6.252487    1.18838     5.26   0.000     3.923304     8.58167
        _ns3 |  -1.093169   .0583025   -18.75   0.000    -1.207439   -.9788977
        _ns4 |  -.5425182   .0376756   -14.40   0.000    -.6163609   -.4686754
        _ns5 |   -.376672   .0361107   -10.43   0.000    -.4474477   -.3058964
       _cons |   .3065984    .606017     0.51   0.613    -.8811731     1.49437
------------------------------------------------------------------------------

. estimates store stpm3_rcsgen

I have stored the knot locations as these will be needed for certain predictions.

The same model can be fitted by using @rcs(age, df(3)) within an stpm3 varlist.

. stpm3 i.hormon @rcs(age, df(3)), scale(lncumhazard) df(5) 

Iteration 0:   log likelihood = -2120.3313  
Iteration 1:   log likelihood = -2120.0705  
Iteration 2:   log likelihood = -2120.0694  
Iteration 3:   log likelihood = -2120.0694  

                                                        Number of obs =  2,982
                                                        Wald chi2(4)  =  69.03
Log likelihood = -2120.0694                             Prob > chi2   = 0.0000

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
xb           |
      hormon |
        yes  |   .3966145   .1041656     3.81   0.000     .1924536    .6007754
_rcs_f1_age1 |  -.0430389   .0160545    -2.68   0.007    -.0745051   -.0115727
_rcs_f1_age2 |  -.0000314    .000038    -0.82   0.410    -.0001059    .0000432
_rcs_f1_age3 |  -1.70e-07   .0000357    -0.00   0.996    -.0000701    .0000697
-------------+----------------------------------------------------------------
time         |
        _ns1 |  -22.10362   2.190797   -10.09   0.000    -26.39751   -17.80974
        _ns2 |   6.252487    1.18838     5.26   0.000     3.923304     8.58167
        _ns3 |  -1.093169   .0583025   -18.75   0.000    -1.207439   -.9788977
        _ns4 |  -.5425182   .0376756   -14.40   0.000    -.6163609   -.4686754
        _ns5 |   -.376672   .0361107   -10.43   0.000    -.4474477   -.3058964
       _cons |   .3065984    .606017     0.51   0.613    -.8811731     1.49437
------------------------------------------------------------------------------
Extended functions
 (1) @rcs(age, df(3))

If you compare the model coefficients and log-likelhoods you will see they are identical.

From here on I will use natural splines - these will give the same predicted values as when using restricted cubic splines, but have some useful additional properties.

. stpm3 i.hormon @ns(age, df(3)), scale(lncumhazard) df(5) 

Iteration 0:   log likelihood = -2120.3313  
Iteration 1:   log likelihood = -2120.0705  
Iteration 2:   log likelihood = -2120.0694  
Iteration 3:   log likelihood = -2120.0694  

                                                        Number of obs =  2,982
                                                        Wald chi2(4)  =  69.03
Log likelihood = -2120.0694                             Prob > chi2   = 0.0000

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
xb           |
      hormon |
        yes  |   .3966145   .1041656     3.81   0.000     .1924536    .6007754
 _ns_f1_age1 |  -1.769774   1.089136    -1.62   0.104    -3.904441    .3648924
 _ns_f1_age2 |  -1.233782   .4602279    -2.68   0.007    -2.135812   -.3317519
 _ns_f1_age3 |  -1.709216   .4899396    -3.49   0.000     -2.66948   -.7489521
-------------+----------------------------------------------------------------
time         |
        _ns1 |  -22.10362   2.190797   -10.09   0.000    -26.39751   -17.80974
        _ns2 |   6.252487    1.18838     5.26   0.000     3.923304     8.58167
        _ns3 |  -1.093169   .0583025   -18.75   0.000    -1.207439   -.9788977
        _ns4 |  -.5425182   .0376756   -14.40   0.000    -.6163609   -.4686754
        _ns5 |   -.376672   .0361107   -10.43   0.000    -.4474477   -.3058964
       _cons |  -.1364105   .2354592    -0.58   0.562     -.597902     .325081
------------------------------------------------------------------------------
Extended functions
 (1) @ns(age, df(3))

. estimates store stpm3_ns

The coefficients for the spline variables are now different, but if you predict for a specified covariate pattern, the predictions will be identical.

Predictions with extended functions

If we generated the spline variables for age at diagnosis and then included these in the model, we would have to work out the values of the spline variables at the age at diagnosis of interest. For example, using the scalar option of rcsgen to predict survival for a 70 year old on hormonal treatment the code below can be used.

. estimates restore stpm3_rcsgen
(results stpm3_rcsgen are active now)

. rcsgen, scalar(70) gen(c) knots(${ageknots})
Scalars c1 to c3 were created

. predict S70a, survival ci                                             ///
>               at1(agercs1 `=c1' agercs2 `=c2' agercs3 `=c3' hormon 1) ///
>               timevalues(0 5, step(0.1))                             ///
>               frame(f1)
Predictions are stored in frame - f1

By storing the knots and passing these to rcsgen combined with the scalar option, the values of the restricted cubic spline variables for a 70 year old are obtained. These can then be passed to the predict command.

If you use an extended function to fit an equivalent model this simplifies the predictions. For example predicting for a 70 year old woman on hormonal treatment we can use the following predict command.

. estimates restore stpm3_ns
(results stpm3_ns are active now)

. predict S70b, survival ci          ///
>               at1(age 70 hormon 1) ///
>               frame(f1, merge)
Predictions are stored in frame - f1

These predictions have been merged into frame f1, so we can compare the predictions.

. frame f1: list in 1/21

     +-----------------------------------------------------------------------------+
     |  tt        S70a    S70a_lci    S70a_uci        S70b    S70b_lci    S70b_uci |
     |-----------------------------------------------------------------------------|
  1. |   0           1           1           1           1           1           1 |
  2. |  .1   .99980531   .99901994   .99996134   .99980531   .99901994   .99996134 |
  3. |  .2    .9990375   .99724993   .99966333    .9990375   .99724993   .99966333 |
  4. |  .3   .99757461   .99484041   .99886071   .99757461   .99484041   .99886071 |
  5. |  .4   .99537568   .99175595   .99740816   .99537568   .99175595   .99740816 |
     |-----------------------------------------------------------------------------|
  6. |  .5   .99243419   .98793852   .99525821   .99243419   .98793852   .99525821 |
  7. |  .6   .98876195   .98333349   .99242911   .98876195   .98333349   .99242911 |
  8. |  .7   .98438145   .97790805    .9889687   .98438145   .97790805    .9889687 |
  9. |  .8   .97932165   .97166223   .98492686   .97932165   .97166223   .98492686 |
 10. |  .9   .97361549   .96462911   .98034204   .97361549   .96462911   .98034204 |
     |-----------------------------------------------------------------------------|
 11. |   1   .96729824   .95686644   .97523978   .96729824   .95686644   .97523978 |
 12. | 1.1   .96040644   .94844522     .969637   .96040644   .94844522     .969637 |
 13. | 1.2   .95297712   .93943933   .96354746   .95297712   .93943933   .96354746 |
 14. | 1.3   .94504732   .92991782   .95698631   .94504732   .92991782   .95698631 |
 15. | 1.4   .93665358   .91993953    .9499734   .93665358   .91993953    .9499734 |
     |-----------------------------------------------------------------------------|
 16. | 1.5   .92783183   .90954967   .94253584   .92783183   .90954967   .94253584 |
 17. | 1.6   .91861703   .89877754   .93470981   .91861703   .89877754   .93470981 |
 18. | 1.7   .90904198   .88763238   .92654197   .90904198   .88763238   .92654197 |
 19. | 1.8   .89913147   .87609834   .91808407   .89913147   .87609834   .91808407 |
 20. | 1.9   .88890629   .86417228   .90937397   .88890629   .86417228   .90937397 |
     |-----------------------------------------------------------------------------|
 21. |   2   .87838596   .85187732   .90042978   .87838596   .85187732   .90042978 |
     +-----------------------------------------------------------------------------+

The predictions are identical, but obtaining the predictions using extended functions was much easier. If the model includes interactions then I would say it is much, much easier and less prone to coding errors.

The current extended functions are

@bs() B-splines
@fp() fractional polynomials
@ns() natural cubic splines
@poly() polynomials
@rcs() restricted cubic splines

See the help file for options for each of these functions. For example, with the spline functions you specify knots positions or the degree for B-splines. For fractional polynomials there are scale and center options. For all extended functions, except fractional polynomials, it is possible to perform winsorising before the derived variables are generated using the winsor option.

Predictions in more complex models

If the model get complex with numerous interactions and non-linear effects and interactions with time, the predictions remain simple.

Consider the model below with main effects and an interaction between hormon and a natural spline function for age. stpm3 stores the details of the extended function and so when predictions are made the appropriate values of the derived variables can be obtained.

In the code below, I obtain predictions for 70 year old women and without hormonal treatment.

. stpm3 i.hormon##@ns(age, df(3)), scale(lncumhazard) df(5) 

Iteration 0:   log likelihood = -2117.3748  
Iteration 1:   log likelihood = -2117.1157  
Iteration 2:   log likelihood = -2117.1148  
Iteration 3:   log likelihood = -2117.1148  

                                                        Number of obs =  2,982
                                                        Wald chi2(7)  =  77.72
Log likelihood = -2117.1148                             Prob > chi2   = 0.0000

--------------------------------------------------------------------------------------
                     | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
---------------------+----------------------------------------------------------------
xb                   |
              hormon |
                yes  |   1.295954   .5457098     2.37   0.018      .226382    2.365525
         _ns_f1_age1 |  -1.006332    1.17959    -0.85   0.394    -3.318285    1.305621
         _ns_f1_age2 |  -1.519938    .473222    -3.21   0.001    -2.447436   -.5924397
         _ns_f1_age3 |  -1.322119   .5553445    -2.38   0.017    -2.410574   -.2336639
                     |
hormon#c._ns_f1_age1 |
                yes  |  -6.312222    4.38403    -1.44   0.150    -14.90476    2.280319
                     |
hormon#c._ns_f1_age2 |
                yes  |   3.192456   2.152641     1.48   0.138    -1.026642    7.411555
                     |
hormon#c._ns_f1_age3 |
                yes  |  -2.436599   1.186322    -2.05   0.040    -4.761746   -.1114511
---------------------+----------------------------------------------------------------
time                 |
                _ns1 |  -22.10808   2.190478   -10.09   0.000    -26.40134   -17.81483
                _ns2 |   6.252367   1.188247     5.26   0.000     3.923447    8.581288
                _ns3 |  -1.094601   .0583504   -18.76   0.000    -1.208966   -.9802364
                _ns4 |  -.5429931   .0377085   -14.40   0.000    -.6169004   -.4690859
                _ns5 |  -.3770147   .0361414   -10.43   0.000    -.4478506   -.3061788
               _cons |  -.2889299   .2699311    -1.07   0.284    -.8179851    .2401253
--------------------------------------------------------------------------------------
Extended functions
 (1) @ns(age, df(3))

. 
. predict S70h0 S70h1, survival ci                  ///
>                      at1(age 70 hormon 0)         ///
>                      at2(age 70 hormon 1)         ///
>                      timevalues(0 5, step(0.1))  ///
>                      frame(f2)
Predictions are stored in frame - f2

The predict command would be identical if there was no interaction.

Even if there are time-dependent effects, the predict statement does not change. The model below adds an interaction for the time-dependent effects for hormon and age.

. stpm3 i.hormon##@ns(age, df(3)),                 ///
>       tvc(i.hormon##@ns(age, df(2))) dftvc(2)    ///
>       scale(lncumhazard) df(5)                           

Iteration 0:   log likelihood = -2116.6285  
Iteration 1:   log likelihood = -2107.5574  
Iteration 2:   log likelihood = -2103.0052  
Iteration 3:   log likelihood = -2102.7883  
Iteration 4:   log likelihood = -2102.7869  
Iteration 5:   log likelihood = -2102.7869  

                                                        Number of obs =  2,982
                                                        Wald chi2(7)  =  74.61
Log likelihood = -2102.7869                             Prob > chi2   = 0.0000

-------------------------------------------------------------------------------------------------
                                | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
--------------------------------+----------------------------------------------------------------
xb                              |
                         hormon |
                           yes  |     1.2372    .559703     2.21   0.027     .1402021    2.334198
                    _ns_f1_age1 |  -.9347474    1.18498    -0.79   0.430    -3.257267    1.387772
                    _ns_f1_age2 |  -1.538365   .4738744    -3.25   0.001    -2.467142   -.6095881
                    _ns_f1_age3 |  -1.308583   .5588214    -2.34   0.019    -2.403853   -.2133128
                                |
           hormon#c._ns_f1_age1 |
                           yes  |  -5.991194   4.385156    -1.37   0.172    -14.58594    2.603554
                                |
           hormon#c._ns_f1_age2 |
                           yes  |   3.188885   2.145601     1.49   0.137    -1.016415    7.394185
                                |
           hormon#c._ns_f1_age3 |
                           yes  |  -2.318151   1.205093    -1.92   0.054    -4.680089    .0437866
--------------------------------+----------------------------------------------------------------
time                            |
                           _ns1 |  -19.18669   7.212764    -2.66   0.008    -33.32345   -5.049932
                           _ns2 |   5.607982   3.176181     1.77   0.077    -.6172182    11.83318
                           _ns3 |   -1.05599   .2202421    -4.79   0.000    -1.487656    -.624323
                           _ns4 |  -.5553481   .1523431    -3.65   0.000    -.8539351   -.2567611
                           _ns5 |  -.3960238   .1116303    -3.55   0.000    -.6148151   -.1772324
                                |
              hormon#c._ns_tvc1 |
                           yes  |   4.903816   4.856667     1.01   0.313    -4.615075    14.42271
                                |
              hormon#c._ns_tvc2 |
                           yes  |   .0799753   1.642765     0.05   0.961    -3.139786    3.299736
                                |
       c._ns_f2_age1#c._ns_tvc1 |  -29.15335   8.875221    -3.28   0.001    -46.54847   -11.75824
                                |
       c._ns_f2_age1#c._ns_tvc2 |   .6654049   1.325155     0.50   0.616     -1.93185     3.26266
                                |
       c._ns_f2_age2#c._ns_tvc1 |  -.8954604   11.34532    -0.08   0.937    -23.13188    21.34096
                                |
       c._ns_f2_age2#c._ns_tvc2 |   .3600924   2.127002     0.17   0.866    -3.808755     4.52894
                                |
hormon#c._ns_f2_age1#c._ns_tvc1 |
                           yes  |  -117.1348   72.85753    -1.61   0.108    -259.9329    25.66332
                                |
hormon#c._ns_f2_age1#c._ns_tvc2 |
                           yes  |  -.4836296   6.116531    -0.08   0.937    -12.47181    11.50455
                                |
hormon#c._ns_f2_age2#c._ns_tvc1 |
                           yes  |  -5.302671   20.04585    -0.26   0.791    -44.59182    33.98648
                                |
hormon#c._ns_f2_age2#c._ns_tvc2 |
                           yes  |  -1.511553   5.134032    -0.29   0.768    -11.57407    8.550965
                                |
                          _cons |  -.3015535   .2724136    -1.11   0.268    -.8354743    .2323674
-------------------------------------------------------------------------------------------------
Extended functions
 (1) @ns(age, df(3))
 (2) @ns(age, df(2))

.       
. predict S70h0 S70h1, survival ci                ///
>                      at1(age 70 hormon 0)       ///
>                      at2(age 70 hormon 1)       ///
>                      timevalues(0 5, step(0.1)) ///
>                      frame(f3)
Predictions are stored in frame - f3

The key point is that the predict command stays simple. Note that I use a different function for the time-dependent effect.

Finally I will fit a model with a spline function that winsorises the variables before the spline basis functions are generated. This can improve model stability in the tails in complex models.

. stpm3 i.hormon##@bs(age, df(3) degree(2) winsor(2 98)), ///
>       scale(lncumhazard) df(5)                           

Iteration 0:   log likelihood = -2117.8274  
Iteration 1:   log likelihood = -2117.5679  
Iteration 2:   log likelihood = -2117.5669  
Iteration 3:   log likelihood = -2117.5669  

                                                        Number of obs =  2,982
                                                        Wald chi2(7)  =  75.27
Log likelihood = -2117.5669                             Prob > chi2   = 0.0000

--------------------------------------------------------------------------------------
                     | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
---------------------+----------------------------------------------------------------
xb                   |
              hormon |
                yes  |  -.2013625   .8479376    -0.24   0.812     -1.86329    1.460565
         _bs_f1_age1 |   -.789629   .2399149    -3.29   0.001    -1.259854   -.3194044
         _bs_f1_age2 |  -.3251852   .1782746    -1.82   0.068     -.674597    .0242265
         _bs_f1_age3 |   .2358417   .2106851     1.12   0.263    -.1770935     .648777
                     |
hormon#c._bs_f1_age1 |
                yes  |    1.71135   1.096385     1.56   0.119    -.4375242    3.860225
                     |
hormon#c._bs_f1_age2 |
                yes  |  -.2149046   .8373336    -0.26   0.797    -1.856048    1.426239
                     |
hormon#c._bs_f1_age3 |
                yes  |   .9610499   .9192863     1.05   0.296    -.8407181    2.762818
---------------------+----------------------------------------------------------------
time                 |
                _ns1 |  -22.10814   2.191013   -10.09   0.000    -26.40245   -17.81384
                _ns2 |   6.253144   1.188507     5.26   0.000     3.923714    8.582574
                _ns3 |  -1.093972   .0583289   -18.76   0.000    -1.208295   -.9796495
                _ns4 |  -.5428561   .0376962   -14.40   0.000    -.6167393   -.4689729
                _ns5 |  -.3769876   .0361375   -10.43   0.000    -.4478158   -.3061593
               _cons |  -.9075211   .1468767    -6.18   0.000    -1.195394   -.6196481
--------------------------------------------------------------------------------------
Extended functions
 (1) @bs(age, df(3) degree(2) winsor(2 98))

.       
. predict S70h0 S70h1, survival ci                 ///
>                      at1(age 70 hormon 0)        ///
>                      at2(age 70 hormon 1)        ///
>                      timevalues(0 5, step(0.1)) ///
>                      frame(f4)
Predictions are stored in frame - f4

The winsor(2 98) option replaces values less than the 2nd centile with the value at the 2nd centile and values greater than the 98th centile with the values at the 98th centile. There is a values suboption where you can directly specifying values for the cutoffs rather than centiles.

Biostatistician