Extended Functions
Include non-linear function in a varlist
A new addition to stpm3
is the use of extended functions within a model varlist
.
This allows you to specify various spline, polynomials, fractional polynomials functions
directly when specifying the model. This makes fitting the model slightly easier,
but the main advantage is the ease at which you can get predictions from complex models.
We first load the Rotterdam 2 breast cancer data and then use stset
to declare the survival time and event indicator.
. use https://www.pclambert.net/data/rott2b, clear
(Rotterdam breast cancer data (augmented with cause of death))
. stset os, f(osi==1) scale(12) exit(time 60)
Survival-time data settings
Failure event: osi==1
Observed time interval: (0, os]
Exit on or before: time 60
Time for analysis: time/12
--------------------------------------------------------------------------
2,982 total observations
0 exclusions
--------------------------------------------------------------------------
2,982 observations remaining, representing
753 failures in single-record/single-failure data
13,038.968 total analysis time at risk and under observation
At risk from t = 0
Earliest observed entry t = 0
Last observed exit t = 5
If you want to include a non-linear effect for a covariate in a model,
the usual approach would be to generate some new derived variables
and then include these in the model. For example, generate a quadratic and cubic term or
generate spline basis functions using mkspline
or rcsgen
.
The code below shows how to include non-linear effects using
restricted cubic splines with rcsgen
with 4 knots,
which equates to 3 spline variables.
. rcsgen age, gen(agercs) df(3)
Variables agercs1 to agercs3 were created
. global ageknots `r(knots)'
. stpm3 i.hormon agercs1-agercs3, scale(lncumhazard) df(5)
Iteration 0: log likelihood = -2120.3313
Iteration 1: log likelihood = -2120.0705
Iteration 2: log likelihood = -2120.0694
Iteration 3: log likelihood = -2120.0694
Number of obs = 2,982
Wald chi2(4) = 69.03
Log likelihood = -2120.0694 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
xb |
hormon |
yes | .3966145 .1041656 3.81 0.000 .1924536 .6007754
agercs1 | -.0430389 .0160545 -2.68 0.007 -.0745051 -.0115727
agercs2 | -.0000314 .000038 -0.82 0.410 -.0001059 .0000432
agercs3 | -1.70e-07 .0000357 -0.00 0.996 -.0000701 .0000697
-------------+----------------------------------------------------------------
time |
_ns1 | -22.10362 2.190797 -10.09 0.000 -26.39751 -17.80974
_ns2 | 6.252487 1.18838 5.26 0.000 3.923304 8.58167
_ns3 | -1.093169 .0583025 -18.75 0.000 -1.207439 -.9788977
_ns4 | -.5425182 .0376756 -14.40 0.000 -.6163609 -.4686754
_ns5 | -.376672 .0361107 -10.43 0.000 -.4474477 -.3058964
_cons | .3065984 .606017 0.51 0.613 -.8811731 1.49437
------------------------------------------------------------------------------
. estimates store stpm3_rcsgen
I have stored the knot locations as these will be needed for certain predictions.
The same model can be fitted by using @rcs(age, df(3))
within an stpm3
varlist
.
. stpm3 i.hormon @rcs(age, df(3)), scale(lncumhazard) df(5)
Iteration 0: log likelihood = -2120.3313
Iteration 1: log likelihood = -2120.0705
Iteration 2: log likelihood = -2120.0694
Iteration 3: log likelihood = -2120.0694
Number of obs = 2,982
Wald chi2(4) = 69.03
Log likelihood = -2120.0694 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
xb |
hormon |
yes | .3966145 .1041656 3.81 0.000 .1924536 .6007754
_rcs_f1_age1 | -.0430389 .0160545 -2.68 0.007 -.0745051 -.0115727
_rcs_f1_age2 | -.0000314 .000038 -0.82 0.410 -.0001059 .0000432
_rcs_f1_age3 | -1.70e-07 .0000357 -0.00 0.996 -.0000701 .0000697
-------------+----------------------------------------------------------------
time |
_ns1 | -22.10362 2.190797 -10.09 0.000 -26.39751 -17.80974
_ns2 | 6.252487 1.18838 5.26 0.000 3.923304 8.58167
_ns3 | -1.093169 .0583025 -18.75 0.000 -1.207439 -.9788977
_ns4 | -.5425182 .0376756 -14.40 0.000 -.6163609 -.4686754
_ns5 | -.376672 .0361107 -10.43 0.000 -.4474477 -.3058964
_cons | .3065984 .606017 0.51 0.613 -.8811731 1.49437
------------------------------------------------------------------------------
Extended functions
(1) @rcs(age, df(3))
If you compare the model coefficients and log-likelhoods you will see they are identical.
From here on I will use natural splines - these will give the same predicted values as when using restricted cubic splines, but have some useful additional properties.
. stpm3 i.hormon @ns(age, df(3)), scale(lncumhazard) df(5)
Iteration 0: log likelihood = -2120.3313
Iteration 1: log likelihood = -2120.0705
Iteration 2: log likelihood = -2120.0694
Iteration 3: log likelihood = -2120.0694
Number of obs = 2,982
Wald chi2(4) = 69.03
Log likelihood = -2120.0694 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
xb |
hormon |
yes | .3966145 .1041656 3.81 0.000 .1924536 .6007754
_ns_f1_age1 | -1.769774 1.089136 -1.62 0.104 -3.904441 .3648924
_ns_f1_age2 | -1.233782 .4602279 -2.68 0.007 -2.135812 -.3317519
_ns_f1_age3 | -1.709216 .4899396 -3.49 0.000 -2.66948 -.7489521
-------------+----------------------------------------------------------------
time |
_ns1 | -22.10362 2.190797 -10.09 0.000 -26.39751 -17.80974
_ns2 | 6.252487 1.18838 5.26 0.000 3.923304 8.58167
_ns3 | -1.093169 .0583025 -18.75 0.000 -1.207439 -.9788977
_ns4 | -.5425182 .0376756 -14.40 0.000 -.6163609 -.4686754
_ns5 | -.376672 .0361107 -10.43 0.000 -.4474477 -.3058964
_cons | -.1364105 .2354592 -0.58 0.562 -.597902 .325081
------------------------------------------------------------------------------
Extended functions
(1) @ns(age, df(3))
. estimates store stpm3_ns
The coefficients for the spline variables are now different, but if you predict for a specified covariate pattern, the predictions will be identical.
Predictions with extended functions
If we generated the spline variables for age at diagnosis
and then included these in the model, we would have to work out the values of the spline variables
at the age at diagnosis of interest. For example, using the scalar
option of rcsgen
to predict
survival for a 70 year old on hormonal treatment the code below can be used.
. estimates restore stpm3_rcsgen
(results stpm3_rcsgen are active now)
. rcsgen, scalar(70) gen(c) knots(${ageknots})
Scalars c1 to c3 were created
. predict S70a, survival ci ///
> at1(agercs1 `=c1' agercs2 `=c2' agercs3 `=c3' hormon 1) ///
> timevalues(0 5, step(0.1)) ///
> frame(f1)
Predictions are stored in frame - f1
By storing the knots and passing these to rcsgen
combined with the scalar
option,
the values of the restricted cubic spline variables for a 70 year old are obtained.
These can then be passed to the predict
command.
If you use an extended function to fit an equivalent model this simplifies the predictions. For example predicting for a 70 year old woman on hormonal treatment we can use the following predict command.
. estimates restore stpm3_ns
(results stpm3_ns are active now)
. predict S70b, survival ci ///
> at1(age 70 hormon 1) ///
> frame(f1, merge)
Predictions are stored in frame - f1
These predictions have been merged into frame f1
, so we can compare the predictions.
. frame f1: list in 1/21
+-----------------------------------------------------------------------------+
| tt S70a S70a_lci S70a_uci S70b S70b_lci S70b_uci |
|-----------------------------------------------------------------------------|
1. | 0 1 1 1 1 1 1 |
2. | .1 .99980531 .99901994 .99996134 .99980531 .99901994 .99996134 |
3. | .2 .9990375 .99724993 .99966333 .9990375 .99724993 .99966333 |
4. | .3 .99757461 .99484041 .99886071 .99757461 .99484041 .99886071 |
5. | .4 .99537568 .99175595 .99740816 .99537568 .99175595 .99740816 |
|-----------------------------------------------------------------------------|
6. | .5 .99243419 .98793852 .99525821 .99243419 .98793852 .99525821 |
7. | .6 .98876195 .98333349 .99242911 .98876195 .98333349 .99242911 |
8. | .7 .98438145 .97790805 .9889687 .98438145 .97790805 .9889687 |
9. | .8 .97932165 .97166223 .98492686 .97932165 .97166223 .98492686 |
10. | .9 .97361549 .96462911 .98034204 .97361549 .96462911 .98034204 |
|-----------------------------------------------------------------------------|
11. | 1 .96729824 .95686644 .97523978 .96729824 .95686644 .97523978 |
12. | 1.1 .96040644 .94844522 .969637 .96040644 .94844522 .969637 |
13. | 1.2 .95297712 .93943933 .96354746 .95297712 .93943933 .96354746 |
14. | 1.3 .94504732 .92991782 .95698631 .94504732 .92991782 .95698631 |
15. | 1.4 .93665358 .91993953 .9499734 .93665358 .91993953 .9499734 |
|-----------------------------------------------------------------------------|
16. | 1.5 .92783183 .90954967 .94253584 .92783183 .90954967 .94253584 |
17. | 1.6 .91861703 .89877754 .93470981 .91861703 .89877754 .93470981 |
18. | 1.7 .90904198 .88763238 .92654197 .90904198 .88763238 .92654197 |
19. | 1.8 .89913147 .87609834 .91808407 .89913147 .87609834 .91808407 |
20. | 1.9 .88890629 .86417228 .90937397 .88890629 .86417228 .90937397 |
|-----------------------------------------------------------------------------|
21. | 2 .87838596 .85187732 .90042978 .87838596 .85187732 .90042978 |
+-----------------------------------------------------------------------------+
The predictions are identical, but obtaining the predictions using extended functions was much easier. If the model includes interactions then I would say it is much, much easier and less prone to coding errors.
The current extended functions are
@bs() |
B-splines |
@fp() |
fractional polynomials |
@ns() |
natural cubic splines |
@poly() |
polynomials |
@rcs() |
restricted cubic splines |
See the help file for options for each of these functions. For example,
with the spline functions you specify knots positions or the degree for B-splines.
For fractional polynomials there are scale
and center
options. For all
extended functions, except fractional polynomials, it is possible to perform
winsorising before the derived variables are generated using the winsor
option.
Predictions in more complex models
If the model get complex with numerous interactions and non-linear effects and interactions with time, the predictions remain simple.
Consider the model below with main effects and an interaction between
hormon
and a natural spline function for age
. stpm3
stores the details
of the extended function and so when predictions are made the appropriate values
of the derived variables can be obtained.
In the code below, I obtain predictions for 70 year old women and without hormonal treatment.
. stpm3 i.hormon##@ns(age, df(3)), scale(lncumhazard) df(5)
Iteration 0: log likelihood = -2117.3748
Iteration 1: log likelihood = -2117.1157
Iteration 2: log likelihood = -2117.1148
Iteration 3: log likelihood = -2117.1148
Number of obs = 2,982
Wald chi2(7) = 77.72
Log likelihood = -2117.1148 Prob > chi2 = 0.0000
--------------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
---------------------+----------------------------------------------------------------
xb |
hormon |
yes | 1.295954 .5457098 2.37 0.018 .226382 2.365525
_ns_f1_age1 | -1.006332 1.17959 -0.85 0.394 -3.318285 1.305621
_ns_f1_age2 | -1.519938 .473222 -3.21 0.001 -2.447436 -.5924397
_ns_f1_age3 | -1.322119 .5553445 -2.38 0.017 -2.410574 -.2336639
|
hormon#c._ns_f1_age1 |
yes | -6.312222 4.38403 -1.44 0.150 -14.90476 2.280319
|
hormon#c._ns_f1_age2 |
yes | 3.192456 2.152641 1.48 0.138 -1.026642 7.411555
|
hormon#c._ns_f1_age3 |
yes | -2.436599 1.186322 -2.05 0.040 -4.761746 -.1114511
---------------------+----------------------------------------------------------------
time |
_ns1 | -22.10808 2.190478 -10.09 0.000 -26.40134 -17.81483
_ns2 | 6.252367 1.188247 5.26 0.000 3.923447 8.581288
_ns3 | -1.094601 .0583504 -18.76 0.000 -1.208966 -.9802364
_ns4 | -.5429931 .0377085 -14.40 0.000 -.6169004 -.4690859
_ns5 | -.3770147 .0361414 -10.43 0.000 -.4478506 -.3061788
_cons | -.2889299 .2699311 -1.07 0.284 -.8179851 .2401253
--------------------------------------------------------------------------------------
Extended functions
(1) @ns(age, df(3))
.
. predict S70h0 S70h1, survival ci ///
> at1(age 70 hormon 0) ///
> at2(age 70 hormon 1) ///
> timevalues(0 5, step(0.1)) ///
> frame(f2)
Predictions are stored in frame - f2
The predict
command would be identical if there was no interaction.
Even if there are time-dependent effects, the predict
statement does not change.
The model below adds an interaction for the time-dependent effects for hormon
and age
.
. stpm3 i.hormon##@ns(age, df(3)), ///
> tvc(i.hormon##@ns(age, df(2))) dftvc(2) ///
> scale(lncumhazard) df(5)
Iteration 0: log likelihood = -2116.6285
Iteration 1: log likelihood = -2107.5574
Iteration 2: log likelihood = -2103.0052
Iteration 3: log likelihood = -2102.7883
Iteration 4: log likelihood = -2102.7869
Iteration 5: log likelihood = -2102.7869
Number of obs = 2,982
Wald chi2(7) = 74.61
Log likelihood = -2102.7869 Prob > chi2 = 0.0000
-------------------------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
--------------------------------+----------------------------------------------------------------
xb |
hormon |
yes | 1.2372 .559703 2.21 0.027 .1402021 2.334198
_ns_f1_age1 | -.9347474 1.18498 -0.79 0.430 -3.257267 1.387772
_ns_f1_age2 | -1.538365 .4738744 -3.25 0.001 -2.467142 -.6095881
_ns_f1_age3 | -1.308583 .5588214 -2.34 0.019 -2.403853 -.2133128
|
hormon#c._ns_f1_age1 |
yes | -5.991194 4.385156 -1.37 0.172 -14.58594 2.603554
|
hormon#c._ns_f1_age2 |
yes | 3.188885 2.145601 1.49 0.137 -1.016415 7.394185
|
hormon#c._ns_f1_age3 |
yes | -2.318151 1.205093 -1.92 0.054 -4.680089 .0437866
--------------------------------+----------------------------------------------------------------
time |
_ns1 | -19.18669 7.212764 -2.66 0.008 -33.32345 -5.049932
_ns2 | 5.607982 3.176181 1.77 0.077 -.6172182 11.83318
_ns3 | -1.05599 .2202421 -4.79 0.000 -1.487656 -.624323
_ns4 | -.5553481 .1523431 -3.65 0.000 -.8539351 -.2567611
_ns5 | -.3960238 .1116303 -3.55 0.000 -.6148151 -.1772324
|
hormon#c._ns_tvc1 |
yes | 4.903816 4.856667 1.01 0.313 -4.615075 14.42271
|
hormon#c._ns_tvc2 |
yes | .0799753 1.642765 0.05 0.961 -3.139786 3.299736
|
c._ns_f2_age1#c._ns_tvc1 | -29.15335 8.875221 -3.28 0.001 -46.54847 -11.75824
|
c._ns_f2_age1#c._ns_tvc2 | .6654049 1.325155 0.50 0.616 -1.93185 3.26266
|
c._ns_f2_age2#c._ns_tvc1 | -.8954604 11.34532 -0.08 0.937 -23.13188 21.34096
|
c._ns_f2_age2#c._ns_tvc2 | .3600924 2.127002 0.17 0.866 -3.808755 4.52894
|
hormon#c._ns_f2_age1#c._ns_tvc1 |
yes | -117.1348 72.85753 -1.61 0.108 -259.9329 25.66332
|
hormon#c._ns_f2_age1#c._ns_tvc2 |
yes | -.4836296 6.116531 -0.08 0.937 -12.47181 11.50455
|
hormon#c._ns_f2_age2#c._ns_tvc1 |
yes | -5.302671 20.04585 -0.26 0.791 -44.59182 33.98648
|
hormon#c._ns_f2_age2#c._ns_tvc2 |
yes | -1.511553 5.134032 -0.29 0.768 -11.57407 8.550965
|
_cons | -.3015535 .2724136 -1.11 0.268 -.8354743 .2323674
-------------------------------------------------------------------------------------------------
Extended functions
(1) @ns(age, df(3))
(2) @ns(age, df(2))
.
. predict S70h0 S70h1, survival ci ///
> at1(age 70 hormon 0) ///
> at2(age 70 hormon 1) ///
> timevalues(0 5, step(0.1)) ///
> frame(f3)
Predictions are stored in frame - f3
The key point is that the predict
command stays simple.
Note that I use a different function for the time-dependent effect.
Finally I will fit a model with a spline function that winsorises the variables before the spline basis functions are generated. This can improve model stability in the tails in complex models.
. stpm3 i.hormon##@bs(age, df(3) degree(2) winsor(2 98)), ///
> scale(lncumhazard) df(5)
Iteration 0: log likelihood = -2117.8274
Iteration 1: log likelihood = -2117.5679
Iteration 2: log likelihood = -2117.5669
Iteration 3: log likelihood = -2117.5669
Number of obs = 2,982
Wald chi2(7) = 75.27
Log likelihood = -2117.5669 Prob > chi2 = 0.0000
--------------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
---------------------+----------------------------------------------------------------
xb |
hormon |
yes | -.2013625 .8479376 -0.24 0.812 -1.86329 1.460565
_bs_f1_age1 | -.789629 .2399149 -3.29 0.001 -1.259854 -.3194044
_bs_f1_age2 | -.3251852 .1782746 -1.82 0.068 -.674597 .0242265
_bs_f1_age3 | .2358417 .2106851 1.12 0.263 -.1770935 .648777
|
hormon#c._bs_f1_age1 |
yes | 1.71135 1.096385 1.56 0.119 -.4375242 3.860225
|
hormon#c._bs_f1_age2 |
yes | -.2149046 .8373336 -0.26 0.797 -1.856048 1.426239
|
hormon#c._bs_f1_age3 |
yes | .9610499 .9192863 1.05 0.296 -.8407181 2.762818
---------------------+----------------------------------------------------------------
time |
_ns1 | -22.10814 2.191013 -10.09 0.000 -26.40245 -17.81384
_ns2 | 6.253144 1.188507 5.26 0.000 3.923714 8.582574
_ns3 | -1.093972 .0583289 -18.76 0.000 -1.208295 -.9796495
_ns4 | -.5428561 .0376962 -14.40 0.000 -.6167393 -.4689729
_ns5 | -.3769876 .0361375 -10.43 0.000 -.4478158 -.3061593
_cons | -.9075211 .1468767 -6.18 0.000 -1.195394 -.6196481
--------------------------------------------------------------------------------------
Extended functions
(1) @bs(age, df(3) degree(2) winsor(2 98))
.
. predict S70h0 S70h1, survival ci ///
> at1(age 70 hormon 0) ///
> at2(age 70 hormon 1) ///
> timevalues(0 5, step(0.1)) ///
> frame(f4)
Predictions are stored in frame - f4
The winsor(2 98)
option replaces values less than the 2nd centile with the
value at the 2nd centile and values greater than the 98th centile with the values
at the 98th centile. There is a values
suboption where you can directly specifying
values for the cutoffs rather than centiles.