Extended Functions
Include non-linear function in a varlist
A new addition to stpm3
is the use of extended functions within a model varlist
. This allows you to specify various spline, polynomials, fractional polynomials functions directly when specifying the model. This makes fitting the model slightly easier, but the main advantage is the ease at which you can get predictions from complex models.
We first load the Rotterdam 2 breast cancer data and then use stset
to declare the survival time and event indicator.
use https://www.pclambert.net/data/rott2b, clear
. data (augmented with cause of death))
(Rotterdam breast cancer
stset os, f(osi==1) scale(12) exit(time 60)
.
data settings
Survival-time
Failure event: osi==1
Observed time interval: (0, os]on or before: time 60
Exit for analysis: time/12
Time
--------------------------------------------------------------------------total observations
2,982
0 exclusions
--------------------------------------------------------------------------
2,982 observations remaining, representingin single-record/single-failure data
753 failures total analysis time at risk and under observation
13,038.968
At risk from t = 0
Earliest observed entry t = 0exit t = 5 Last observed
If you want to include a non-linear effect for a covariate in a model, the usual approach would be to generate some new derived variables and then include these in the model. For example, by generating a quadratic and cubic term or generating spline basis functions using mkspline
or rcsgen
.
The code below shows how to include non-linear effects using restricted cubic splines with rcsgen
with 4 knots, which equates to 3 spline variables.
gen(agercs) df(3)
. rcsgen age,
Variables agercs1 to agercs3 were created
global ageknots `r(knots)'
.
scale(lncumhazard) df(5)
. stpm3 i.hormon agercs1-agercs3,
Iteration 0: Log likelihood = -2120.3306
Iteration 1: Log likelihood = -2120.0705
Iteration 2: Log likelihood = -2120.0694
Iteration 3: Log likelihood = -2120.0694
of obs = 2,982
Number chi2(4) = 69.03
Wald chi2 = 0.0000
Log likelihood = -2120.0694 Prob >
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------xb |
hormon |
yes | .3966145 .1041656 3.81 0.000 .1924536 .6007754
agercs1 | -.0430389 .0160545 -2.68 0.007 -.0745051 -.0115727
agercs2 | -.0000314 .000038 -0.82 0.410 -.0001059 .0000432
agercs3 | -1.70e-07 .0000357 -0.00 0.996 -.0000701 .0000697
-------------+----------------------------------------------------------------
time |
_ns1 | -22.10362 2.190797 -10.09 0.000 -26.39751 -17.80974
_ns2 | 6.252487 1.18838 5.26 0.000 3.923304 8.58167
_ns3 | -1.093169 .0583025 -18.75 0.000 -1.207439 -.9788977
_ns4 | -.5425182 .0376756 -14.40 0.000 -.6163609 -.4686754
_ns5 | -.376672 .0361107 -10.43 0.000 -.4474477 -.3058964_cons | .3065984 .606017 0.51 0.613 -.8811731 1.49437
------------------------------------------------------------------------------
estimates store stpm3_rcsgen .
I have stored the knot locations as these will be needed for certain predictions.
To fit an identical model, but using extended function we can use @rcs(age, df(3))
within an stpm3
varlist
.
scale(lncumhazard) df(5)
. stpm3 i.hormon @rcs(age, df(3)),
Iteration 0: Log likelihood = -2120.3306
Iteration 1: Log likelihood = -2120.0705
Iteration 2: Log likelihood = -2120.0694
Iteration 3: Log likelihood = -2120.0694
of obs = 2,982
Number chi2(4) = 69.03
Wald chi2 = 0.0000
Log likelihood = -2120.0694 Prob >
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------xb |
hormon |
yes | .3966145 .1041656 3.81 0.000 .1924536 .6007754
_rcs_f1_age1 | -.0430389 .0160545 -2.68 0.007 -.0745051 -.0115727
_rcs_f1_age2 | -.0000314 .000038 -0.82 0.410 -.0001059 .0000432
_rcs_f1_age3 | -1.70e-07 .0000357 -0.00 0.996 -.0000701 .0000697
-------------+----------------------------------------------------------------
time |
_ns1 | -22.10362 2.190797 -10.09 0.000 -26.39751 -17.80974
_ns2 | 6.252487 1.18838 5.26 0.000 3.923304 8.58167
_ns3 | -1.093169 .0583025 -18.75 0.000 -1.207439 -.9788977
_ns4 | -.5425182 .0376756 -14.40 0.000 -.6163609 -.4686754
_ns5 | -.376672 .0361107 -10.43 0.000 -.4474477 -.3058964_cons | .3065984 .606017 0.51 0.613 -.8811731 1.49437
------------------------------------------------------------------------------
Extended functions (1) @rcs(age, df(3))
If you compare the model coefficients and log-likelihoods you will see they are identical.
From here on I will use natural splines - these will give the same predicted values as when using restricted cubic splines, but have some useful additional properties.
scale(lncumhazard) df(5)
. stpm3 i.hormon @ns(age, df(3)),
Iteration 0: Log likelihood = -2120.3306
Iteration 1: Log likelihood = -2120.0705
Iteration 2: Log likelihood = -2120.0694
Iteration 3: Log likelihood = -2120.0694
of obs = 2,982
Number chi2(4) = 69.03
Wald chi2 = 0.0000
Log likelihood = -2120.0694 Prob >
------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------xb |
hormon |
yes | .3966145 .1041656 3.81 0.000 .1924536 .6007754
_ns_f1_age1 | -1.769774 1.089136 -1.62 0.104 -3.904441 .3648924
_ns_f1_age2 | -1.233782 .4602279 -2.68 0.007 -2.135812 -.3317519
_ns_f1_age3 | -1.709216 .4899396 -3.49 0.000 -2.66948 -.7489521
-------------+----------------------------------------------------------------
time |
_ns1 | -22.10362 2.190797 -10.09 0.000 -26.39751 -17.80974
_ns2 | 6.252487 1.18838 5.26 0.000 3.923304 8.58167
_ns3 | -1.093169 .0583025 -18.75 0.000 -1.207439 -.9788977
_ns4 | -.5425182 .0376756 -14.40 0.000 -.6163609 -.4686754
_ns5 | -.376672 .0361107 -10.43 0.000 -.4474477 -.3058964_cons | -.1364105 .2354592 -0.58 0.562 -.597902 .325081
------------------------------------------------------------------------------
Extended functions
(1) @ns(age, df(3))
estimates store stpm3_ns .
The coefficients for the spline variables are now different, but if you predict for a specified covariate pattern, the predictions will be identical.
Predictions with extended functions
If we generated the spline variables for age at diagnosis and then included these in the model, we would have to work out the values of the spline variables at the age at diagnosis of interest. For example, using the scalar
option of rcsgen
to predict survival for a 70 year old on hormonal treatment the code below can be used.
estimates restore stpm3_rcsgen
.
(results stpm3_rcsgen are active now)
scalar(70) gen(c) knots(${ageknots})
. rcsgen,
Scalars c1 to c3 were created
predict S70a, survival ci ///
. `=c1' agercs2 `=c2' agercs3 `=c3' hormon 1) ///
> at1(agercs1 ///
> timevar(0 5, step(0.1))
> frame(f1)in frame - f1 Predictions are stored
By storing the knots and passing these to rcsgen
combined with the scalar
option, the values of the restricted cubic spline variables for a 70 year old are obtained. These can then be passed to the predict
command.
If you use an extended function to fit an equivalent model this simplifies the predictions. For example predicting for a 70 year old woman on hormonal treatment we can use the following predict command.
estimates restore stpm3_ns
.
(results stpm3_ns are active now)
predict S70b, survival ci ///
. ///
> at1(age 70 hormon 1) merge)
> frame(f1, in frame - f1 Predictions are stored
These predictions have been merged into frame f1
, so we can compare the predictions.
list in 1/21, noobs
. frame f1:
+-----------------------------------------------------------------------------+
| tt S70a S70a_lci S70a_uci S70b S70b_lci S70b_uci |
|-----------------------------------------------------------------------------|
| 0 1 1 1 1 1 1 |
| .1 .99980531 .99901994 .99996134 .99980531 .99901994 .99996134 |
| .2 .9990375 .99724993 .99966333 .9990375 .99724993 .99966333 |
| .3 .99757461 .99484041 .99886071 .99757461 .99484041 .99886071 |
| .4 .99537568 .99175595 .99740816 .99537568 .99175595 .99740816 |
|-----------------------------------------------------------------------------|
| .5 .99243419 .98793852 .99525821 .99243419 .98793852 .99525821 |
| .6 .98876195 .98333349 .99242911 .98876195 .98333349 .99242911 |
| .7 .98438145 .97790805 .9889687 .98438145 .97790805 .9889687 |
| .8 .97932165 .97166223 .98492686 .97932165 .97166223 .98492686 |
| .9 .97361549 .96462911 .98034204 .97361549 .96462911 .98034204 |
|-----------------------------------------------------------------------------|
| 1 .96729824 .95686644 .97523978 .96729824 .95686644 .97523978 |
| 1.1 .96040644 .94844522 .969637 .96040644 .94844522 .969637 |
| 1.2 .95297712 .93943933 .96354746 .95297712 .93943933 .96354746 |
| 1.3 .94504732 .92991782 .95698631 .94504732 .92991782 .95698631 |
| 1.4 .93665358 .91993953 .9499734 .93665358 .91993953 .9499734 |
|-----------------------------------------------------------------------------|
| 1.5 .92783183 .90954967 .94253584 .92783183 .90954967 .94253584 |
| 1.6 .91861703 .89877754 .93470981 .91861703 .89877754 .93470981 |
| 1.7 .90904198 .88763238 .92654197 .90904198 .88763238 .92654197 |
| 1.8 .89913147 .87609834 .91808407 .89913147 .87609834 .91808407 |
| 1.9 .88890629 .86417228 .90937397 .88890629 .86417228 .90937397 |
|-----------------------------------------------------------------------------|
| 2 .87838596 .85187732 .90042978 .87838596 .85187732 .90042978 | +-----------------------------------------------------------------------------+
The predictions are identical, but obtaining the predictions using extended functions was much easier. If the model includes interactions then I would say it is much, much easier and less prone to coding errors.
The current extended functions are
@bs() |
B-splines |
@fn() |
general transformation |
@fp() |
fractional polynomials |
@ns() |
natural cubic splines |
@poly() |
polynomials |
@rcs() |
restricted cubic splines |
See the help file for options for each of these functions. For example, with the spline functions you specify knots positions or the degree for B-splines. For fractional polynomials there are scale
and center
options. For all extended functions, except fractional polynomials, it is possible to perform winsorising before the derived variables are generated using the winsor()
option.
Predictions in more complex models
If the model get complex with numerous interactions and non-linear effects and interactions with time, the predictions remain simple.
Consider the model below with main effects and an interaction between hormon
and a natural spline function for age
. stpm3
stores the details of the extended function and so when predictions are made the appropriate values of the derived variables can be obtained.
In the code below, I obtain predictions for 70 year old women with and without hormonal treatment.
scale(lncumhazard) df(5)
. stpm3 i.hormon##@ns(age, df(3)),
Iteration 0: Log likelihood = -2117.3742
Iteration 1: Log likelihood = -2117.1157
Iteration 2: Log likelihood = -2117.1148
Iteration 3: Log likelihood = -2117.1148
of obs = 2,982
Number chi2(7) = 77.72
Wald chi2 = 0.0000
Log likelihood = -2117.1148 Prob >
--------------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
---------------------+----------------------------------------------------------------xb |
hormon |
yes | 1.295954 .5457098 2.37 0.018 .226382 2.365525
_ns_f1_age1 | -1.006332 1.17959 -0.85 0.394 -3.318285 1.305621
_ns_f1_age2 | -1.519938 .473222 -3.21 0.001 -2.447436 -.5924397
_ns_f1_age3 | -1.322119 .5553445 -2.38 0.017 -2.410574 -.2336639
|
hormon#c._ns_f1_age1 |
yes | -6.312222 4.38403 -1.44 0.150 -14.90476 2.280319
|
hormon#c._ns_f1_age2 |
yes | 3.192456 2.152641 1.48 0.138 -1.026642 7.411555
|
hormon#c._ns_f1_age3 |
yes | -2.436599 1.186322 -2.05 0.040 -4.761746 -.1114511
---------------------+----------------------------------------------------------------
time |
_ns1 | -22.10808 2.190478 -10.09 0.000 -26.40134 -17.81483
_ns2 | 6.252367 1.188247 5.26 0.000 3.923447 8.581288
_ns3 | -1.094601 .0583504 -18.76 0.000 -1.208966 -.9802364
_ns4 | -.5429931 .0377085 -14.40 0.000 -.6169004 -.4690859
_ns5 | -.3770147 .0361414 -10.43 0.000 -.4478506 -.3061788_cons | -.2889299 .2699311 -1.07 0.284 -.8179851 .2401253
--------------------------------------------------------------------------------------
Extended functions
(1) @ns(age, df(3))
. predict S70h0 S70h1, survival ci ///
. ///
> at1(age 70 hormon 0) ///
> at2(age 70 hormon 1) ///
> timevar(0 5, step(0.1))
> frame(f2)in frame - f2 Predictions are stored
The nice thing here is that thehe predict
command would be identical if there was no interaction.
Even if there are time-dependent effects, the predict
statement does not change. The model below adds an interaction for the time-dependent effects for hormon
and age
.
///
. stpm3 i.hormon##@ns(age, df(3)), ///
> tvc(i.hormon##@ns(age, df(2))) dftvc(2) scale(lncumhazard) df(5)
>
Iteration 0: Log likelihood = -2116.6275
Iteration 1: Log likelihood = -2107.5573
Iteration 2: Log likelihood = -2103.0051
Iteration 3: Log likelihood = -2102.7883
Iteration 4: Log likelihood = -2102.7869
Iteration 5: Log likelihood = -2102.7869
of obs = 2,982
Number chi2(7) = 74.61
Wald chi2 = 0.0000
Log likelihood = -2102.7869 Prob >
-------------------------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
--------------------------------+----------------------------------------------------------------xb |
hormon |
yes | 1.2372 .559703 2.21 0.027 .1402021 2.334198
_ns_f1_age1 | -.9347474 1.18498 -0.79 0.430 -3.257267 1.387772
_ns_f1_age2 | -1.538365 .4738744 -3.25 0.001 -2.467142 -.6095881
_ns_f1_age3 | -1.308583 .5588214 -2.34 0.019 -2.403853 -.2133128
|
hormon#c._ns_f1_age1 |
yes | -5.991194 4.385156 -1.37 0.172 -14.58594 2.603554
|
hormon#c._ns_f1_age2 |
yes | 3.188885 2.145601 1.49 0.137 -1.016415 7.394185
|
hormon#c._ns_f1_age3 |
yes | -2.318151 1.205093 -1.92 0.054 -4.680089 .0437866
--------------------------------+----------------------------------------------------------------
time |
_ns1 | -19.18669 7.212764 -2.66 0.008 -33.32345 -5.049932
_ns2 | 5.607982 3.176181 1.77 0.077 -.6172182 11.83318
_ns3 | -1.05599 .2202421 -4.79 0.000 -1.487656 -.624323
_ns4 | -.5553481 .1523431 -3.65 0.000 -.8539351 -.2567611
_ns5 | -.3960238 .1116303 -3.55 0.000 -.6148151 -.1772324
|
hormon#c._ns_tvc1 |
yes | 4.903816 4.856667 1.01 0.313 -4.615075 14.42271
|
hormon#c._ns_tvc2 |
yes | .0799753 1.642765 0.05 0.961 -3.139786 3.299736
|
c._ns_f2_age1#c._ns_tvc1 | -29.15335 8.875221 -3.28 0.001 -46.54847 -11.75824
|
c._ns_f2_age1#c._ns_tvc2 | .6654049 1.325155 0.50 0.616 -1.93185 3.26266
|
c._ns_f2_age2#c._ns_tvc1 | -.8954604 11.34532 -0.08 0.937 -23.13188 21.34096
|
c._ns_f2_age2#c._ns_tvc2 | .3600924 2.127002 0.17 0.866 -3.808755 4.52894
|
hormon#c._ns_f2_age1#c._ns_tvc1 |
yes | -117.1348 72.85753 -1.61 0.108 -259.9329 25.66332
|
hormon#c._ns_f2_age1#c._ns_tvc2 |
yes | -.4836296 6.116531 -0.08 0.937 -12.47181 11.50455
|
hormon#c._ns_f2_age2#c._ns_tvc1 |
yes | -5.302671 20.04585 -0.26 0.791 -44.59182 33.98648
|
hormon#c._ns_f2_age2#c._ns_tvc2 |
yes | -1.511553 5.134032 -0.29 0.768 -11.57407 8.550965
|_cons | -.3015535 .2724136 -1.11 0.268 -.8354743 .2323674
-------------------------------------------------------------------------------------------------
Extended functions
(1) @ns(age, df(3))
(2) @ns(age, df(2))
. predict S70h0 S70h1, survival ci ///
. ///
> at1(age 70 hormon 0) ///
> at2(age 70 hormon 1) ///
> timevar(0 5, step(0.1))
> frame(f3)in frame - f3 Predictions are stored
The key point is that the predict
command stays simple. Note that I use a different function for the time-dependent effect, so two @ns()
functions at the end of the model output.
Finally I will fit a model with a spline function that winsorises the variables before the spline basis functions are generated. This can improve model stability in the tails in complex models. I use B-splines rather than natural splines for the effect of age.
///
. stpm3 i.hormon##@bs(age, df(3) degree(2) winsor(2 98)), scale(lncumhazard) df(5)
>
Iteration 0: Log likelihood = -2117.8269
Iteration 1: Log likelihood = -2117.5679
Iteration 2: Log likelihood = -2117.5669
Iteration 3: Log likelihood = -2117.5669
of obs = 2,982
Number chi2(7) = 75.27
Wald chi2 = 0.0000
Log likelihood = -2117.5669 Prob >
--------------------------------------------------------------------------------------
| Coefficient Std. err. z P>|z| [95% conf. interval]
---------------------+----------------------------------------------------------------xb |
hormon |
yes | -.2013625 .8479376 -0.24 0.812 -1.86329 1.460565
_bs_f1_age1 | -.789629 .2399149 -3.29 0.001 -1.259854 -.3194044
_bs_f1_age2 | -.3251852 .1782746 -1.82 0.068 -.674597 .0242265
_bs_f1_age3 | .2358417 .2106851 1.12 0.263 -.1770935 .648777
|
hormon#c._bs_f1_age1 |
yes | 1.71135 1.096385 1.56 0.119 -.4375242 3.860225
|
hormon#c._bs_f1_age2 |
yes | -.2149046 .8373336 -0.26 0.797 -1.856048 1.426239
|
hormon#c._bs_f1_age3 |
yes | .9610499 .9192863 1.05 0.296 -.8407181 2.762818
---------------------+----------------------------------------------------------------
time |
_ns1 | -22.10814 2.191013 -10.09 0.000 -26.40245 -17.81384
_ns2 | 6.253144 1.188507 5.26 0.000 3.923714 8.582574
_ns3 | -1.093972 .0583289 -18.76 0.000 -1.208295 -.9796495
_ns4 | -.5428561 .0376962 -14.40 0.000 -.6167393 -.4689729
_ns5 | -.3769876 .0361375 -10.43 0.000 -.4478158 -.3061593_cons | -.9075211 .1468767 -6.18 0.000 -1.195394 -.6196481
--------------------------------------------------------------------------------------
Extended functions
(1) @bs(age, df(3) degree(2) winsor(2 98))
. predict S70h0 S70h1, survival ci ///
. ///
> at1(age 70 hormon 0) ///
> at2(age 70 hormon 1) ///
> timevar(0 5, step(0.1))
> frame(f4)in frame - f4 Predictions are stored
The winsor(2 98)
option replaces values less than the 2nd centile with the value at the 2nd centile and values greater than the 98th centile with the values at the 98th centile. There is a values
suboption where you can directly specifying values for the cutoffs rather than centiles.
Again, the key point here is that the predict commands does not change.