Use of the timevar() option
Background
In this tutorial I will describe some simple use of the timevar()
option when obtaining predictions after fitting a model using stpm2
. When using Stata’s survival models, such as streg
and stcox
, predictions are made at the values of _t
, which is each record’s event or censoring time. This is the default behaviour of stpm2
.
One of the advantages of parametric survival models is that we can predict various quantities (hazard, survival functions etc etc) at any value of time and for any covariate pattern as we have an equation which is a function of time and any covariates we have modelled.
Before I show some examples I should explain that we need to be a bit cautious when making such predictions. In Stata it is only possible to have one data set in memory. When we make predictions at specific values of time using the timevar()
option we effectively want a second data set that we can use for predictions, and then use for producing graphs and tabulations.
We have found it easiest to think of two data sets side by side as shown below.
------------------ --------------
| | | |
| | | Prediction |
| Analysis Data | | Data |
| | | |
| | --------------
| |
------------------
This means that we have our analysis data and our prediction data stored in the same data set. We have to remember that there are actually two (or more) data sets and that row 1 or the analysis data does not have a relationship with row 1 of the prediction data.
I now will illustrate the use of the timevar()
option.
Example
I first load and stset
the rott2b
data.
. use https://www.pclambert.net/data/rott2b, clear
(Rotterdam breast cancer data (augmented with cause of death))
. stset rf, f(rfi==1) scale(12) exit(time 60)
failure event: rfi == 1
obs. time interval: (0, rf]
exit on or before: time 60
t for analysis: time/12
------------------------------------------------------------------------------
2,982 total observations
0 exclusions
------------------------------------------------------------------------------
2,982 observations remaining, representing
1,181 failures in single-record/single-failure data
11,130.825 total analysis time at risk and under observation
at risk from t = 0
earliest observed entry t = 0
last observed exit t = 5
I will model the effect of age using restricted cubic splines. These can be generated using the rcsgen
command. I make use of the center
option make the created spline variables all equal 0 at the specified value, in this case at age 60. I then fit an stpm2
model including the effect of hormonal therapy (hormon
), progesterone receptor (transformed using $\log(pr+1)$), and age (using the 3 created restricted cubic spline variables).
. rcsgen age, df(3) gen(agercs) center(60)
Variables agercs1 to agercs3 were created
. stpm2 hormon agercs* pr_1, scale(hazard) df(4) eform
Iteration 0: log likelihood = -3065.3989
Iteration 1: log likelihood = -3065.2479
Iteration 2: log likelihood = -3065.2478
Log likelihood = -3065.2478 Number of obs = 2,982
------------------------------------------------------------------------------
| exp(b) Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
xb |
hormon | 1.253481 .112407 2.52 0.012 1.051442 1.494342
agercs1 | .9678468 .0119668 -2.64 0.008 .9446742 .9915879
agercs2 | .999971 .0000315 -0.92 0.357 .9999092 1.000033
agercs3 | 1.000014 .0000304 0.46 0.648 .9999543 1.000074
pr_1 | .9080251 .0118154 -7.41 0.000 .8851601 .9314808
_rcs1 | 2.464395 .0655981 33.88 0.000 2.339121 2.596377
_rcs2 | 1.227705 .0289332 8.70 0.000 1.172287 1.285744
_rcs3 | 1.013422 .0117182 1.15 0.249 .9907135 1.036652
_rcs4 | 1.005574 .0050256 1.11 0.266 .995772 1.015472
_cons | .3891571 .0248262 -14.79 0.000 .3434176 .4409885
------------------------------------------------------------------------------
Note: Estimates are transformed only in the first equation.
I now create some values of time that I want to predict at. I use the range
command to give 100 values between 0 and 5 in a new variable tt
.
. range tt 0 5 100
(2,882 missing values generated)
After creating the new variable I can use it in the timevar()
option when using stpm2
’s predict
command.
. predict s0, survival timevar(tt) zeros ci
This will predict the baseline survival function at the time values in the variable tt
. The zeros
option sets all covarites equal to zero, i.e. the baseline. The ci
option asks for the upper and lower bounds of the 95% confidence interval to be calculated. The function can now be plotted.
. twoway (rarea s0_lci s0_uci tt, color(blue%25)) ///
> (line s0 tt, lcolor(red)) ///
> , legend(off) ///
> ylabel(,angle(h) format(%3.1f)) ///
> xtitle("Years from surgery") ///
> ytitle("S(t)")
Combining with the at()
option.
It is possible to make predictions at any values the covariates included in the model using the at()
option. The two lines below predict the hazard functions for women using and not using hormonal treatment at the reference age (60) and the mean value of log progesterone receptor (3.43).
. predict h0, hazard timevar(tt) at(hormon 0 pr_1 3.43) zeros per(1000) ci
. predict h1, hazard timevar(tt) at(hormon 1 pr_1 3.43) zeros per(1000) ci
I have used the timevar(tt)
option again and so predictions will be at the 100 value of tt
(actually at 99 values as the hazard is not defined at t=0). The at()
option gives the values of the covariates that we want to predict at. The zeros
option will set any remaining covariates equal to zero, i.e. the age spline variables are set to zero which is the reference age of 60. The per(1000)
option multiplies the hazard rate by 1000 as it is easier to interpret the rate per 1000 years than per person per year.
The resulting predictions are then plotted,
. twoway (rarea h0_lci h0_uci tt, color(red%25)) ///
> (rarea h1_lci h1_uci tt, color(blue%25)) ///
> (line h0 tt, lcolor(red) lwidth(thick)) ///
> (line h1 tt, lcolor(blue) lwidth(thick)) ///
> ,xtitle("Years from surgery") ///
> ytitle("Recrurrence rate (per 1000 py)") ///
> legend(order( 3 "hormon=0" 4 "hormon=1") ring(0) pos(1) cols(1))
As the model assumes proportional hazards the predicted hazard functions are perfectly proportional.
Predictions at single values of time.
It can be useful to see the variation in survival at specific values of time, for example at one and five years. The followig code predicts the survival at one year for all subjects in the dataset.
. gen t1 = 1
. predict s_time1, survival timevar(t1)
This can then be plotted in a histogram.
. hist s_time1, width(0.02) xlab(0.3(0.1)1) name(t1, replace)
(bin=7, start=.81623113, width=.02)
We can compare this to the variation at 5 years.
. gen t5 = 5
. predict s_time5, survival timevar(t5)
. hist s_time5, width(0.02) xlab(0.3(0.1)1) name(t5, replace)
(bin=22, start=.30392122, width=.02)
If we are interested in specific covariates then we can look at 1 and 5 year survival as a function of that covariate. For example, we can plot the 1 and 5 year survival as a function of age at diagnosis. As this will also depend on the values of the other covariate I will fix these at specific values (not on hormonal treatment and at the mean level of log progesterone receptor).
First the one year survival as a function of age,
. predict t1_age, surv at(hormon 0 pr_1 3.43) ci timevar(t1)
. twoway (rarea t1_age_lci t1_age_uci age, sort color(blue%25)) ///
> (line t1_age age, sort lcolor(red)) ///
> ,ytitle("1 year survival") legend(off) ///
> ylabel(,angle(h) format(%3.1f)) name(age_t1, replace)
And now the 5 year survival….
. predict t5_age, surv at(hormon 0 pr_1 3.43) ci timevar(t5)
. twoway (rarea t5_age_lci t5_age_uci age, sort color(blue%25)) ///
> (line t5_age age, sort lcolor(red)) ///
> ,ytitle("5 year survival") legend(off) ///
> ylabel(,angle(h) format(%3.1f)) name(age_t5, replace)