# Standardized relative survival

You will need to install standsurv to run the example. Details here

## Background

Here I will describe how to obtain estimates of standardized relative survival after fitting a relative survival model. The relative survival framework is used extensively in population-based cancer studies for the analysis of cancer registry data. Rather than use cause of death information, the relative survival framework uses expected mortality rates in order to estimate the mortality in excess of that expected in the general population.

In a relative survival model the underlying all cause mortality rate, $h(t|Z=z_i)$, for the $i^{th}$ individual with covariate pattern, $Z=z_i$, is partitioned into the expected mortality rate if they did not have cancer, $h^*(t|Z_1 = z_{1i})$, and their excess mortality rate due to the cancer, $\lambda(t|Z_2=z_{2i})$.

$$h(t|Z=z_i) = h^*(t|Z_1 = z_{1i}) + \lambda(t|Z_2=z_{2i})$$

with $Z$ denoting the set of all covariates. $Z_1$ and $Z_2$ denote the covariates for expected and excess mortalities respectively. In the example below $Z_1$ and $Z_2$ will be the same.

The survival analogue of excess mortality is relative survival. The relative survival of individual $i$, $R(t|Z_2=z_{2i}⁠)$, is defined as their all-cause survival, $S(t|Z=z_i)⁠$, divided by their expected survival, $S^*(t|Z1=z_{1i})$⁠. The all-cause survival is thus,

$$S(t|Z=z_i) = S^*(t|Z_1 = z_{1i})R(t|Z_2=z_{2i})$$

There are a number of different relative survival models. I will use an adaption of Royston-Parmar (flexible parametric survival) models to the relative survival framework (Nelson et al. 2007). These models are conditional models, i.e. the can predict relative survival/excess mortality conditional on specific covariate patterns. I will use these models to obtain marginal (standardized) estimates using standsurv, but first I need to fit the conditional model.

## Example (simulated colon cancer data)

I will use simulated data. The data is simulated in a way to be similar colon cancer data in England. There are just three covariates, age at diagnosis (agediag), sex (female) and deprivation group (dep). There are five derpivation groups derived from national quintiles of the income domain of the area of patients’ residence at diagnosis (in real data, but simulated here).

I first load and stset the data.

. use https://www.pclambert.net/data/colonsim, clear

. stset t, failure(dead=1,2) id(id) exit(time 5)

Survival-time data settings

ID variable: id
Observed time interval: (t[_n-1], t]
Exit on or before: time 5

--------------------------------------------------------------------------
20,000  total observations
0  exclusions
--------------------------------------------------------------------------
20,000  observations remaining, representing
20,000  subjects
10,677  failures in single-failure-per-subject data
60,922.154  total analysis time at risk and under observation
At risk from t =         0
Earliest observed entry t =         0
Last observed exit t =         5

There are 20,000 inidviduals and follow-up has been restricted to 5 years.

For simplicity in this example I will only keep the least and most deprived groups.

. keep if inlist(dep,1,5)
(12,667 observations deleted)

In a relative survival model the expected mortality rates at the event times are required. Expected mortality rates are stored in a “popmort file”. Attained age and attained calendar year can be calculated through making use of _t, and then can be merged in.

. // attained age
. gen age = min(floor(agediag + _t),99)

. // attained calendar year
. gen year = floor(yeardiag + _t)

. merge m:1 age year dep sex using https://www.pclambert.net/data/popmort_uk_2017, ///
>       keep(match master) keepusing(rate)

Result                      Number of obs
-----------------------------------------
Not matched                             0
Matched                             7,333  (_merge==3)
-----------------------------------------

Rather than create age groups, age will be modelled continuously using restricted cubic splines with 4 knots (3 restricted cubic spline terms). I also create dummy variables for deprivation group and a variable, female (so I don’t have to remember how sex is coded).

. rcsgen agediag, gen(agercs) df(3) orthog
Variables agercs1 to agercs3 were created

. global ageknots r(knots)'

. matrix Rage =r(R)

. gen female = sex == 2

. gen dep5 = dep == 5

I have store the knot locations and the “R Matrix”, so I can later derive estimated relative survival for specific ages.

I will include all twoway interactions in the model and create these below,

. gen femdep5 = female*dep5

. forvalues i = 1/3 {
2.   gen agercsi'_dep5 = agercsi'*dep5
3.   gen agercsi'_fem  = agercsi'*female
4. }

The model can now be fitted. I include the main effect and the twoway interactions. In addition time-dependent effect of deprivation, sex and age are incorporated though use of the tvc() and dftvc options. Time-dependent effects are an interaction with time.

. stpm2 dep5 femdep5 agercs*, scale(hazard) df(5) ///
>       tvc(dep5 agercs1-agercs3 female) dftvc(5) ///
>       bhazard(rate)

Iteration 0:   log likelihood = -8480.8684
Iteration 1:   log likelihood = -8285.0221
Iteration 2:   log likelihood =  -8263.517
Iteration 3:   log likelihood = -8262.9791
Iteration 4:   log likelihood = -8262.9789

Log likelihood = -8262.9789                              Number of obs = 7,333

---------------------------------------------------------------------------------
| Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
----------------+----------------------------------------------------------------
xb              |
dep5 |   .3529399   .0524812     6.73   0.000     .2500786    .4558012
femdep5 |  -.0739321   .0607853    -1.22   0.224    -.1930692    .0452049
agercs1 |   .1821014   .0408958     4.45   0.000     .1019472    .2622557
agercs2 |  -.0093421   .0394447    -0.24   0.813    -.0866522    .0679681
agercs3 |   .1969082   .0403894     4.88   0.000     .1177464    .2760699
agercs1_dep5 |  -.1052794   .0470379    -2.24   0.025    -.1974719   -.0130868
agercs1_fem |  -.0452735   .0469758    -0.96   0.335    -.1373444    .0467974
agercs2_dep5 |   .1197769   .0446492     2.68   0.007     .0322661    .2072876
agercs2_fem |   .0362455   .0447414     0.81   0.418    -.0514461    .1239371
agercs3_dep5 |   .0008329   .0470677     0.02   0.986    -.0914181    .0930839
agercs3_fem |   .1104378   .0468519     2.36   0.018     .0186097     .202266
_rcs1 |   .9550851    .031853    29.98   0.000     .8926544    1.017516
_rcs2 |    .140241   .0233181     6.01   0.000     .0945383    .1859437
_rcs3 |   .0241263    .012884     1.87   0.061     -.001126    .0493785
_rcs4 |   .0101729   .0074127     1.37   0.170    -.0043557    .0247014
_rcs5 |   .0143555   .0045209     3.18   0.001     .0054946    .0232164
_rcs_dep51 |  -.1329958   .0343032    -3.88   0.000    -.2002289   -.0657627
_rcs_dep52 |   .0330663    .025393     1.30   0.193    -.0167031    .0828358
_rcs_dep53 |  -.0349497   .0138436    -2.52   0.012    -.0620825   -.0078168
_rcs_dep54 |  -.0022025   .0079669    -0.28   0.782    -.0178174    .0134124
_rcs_dep55 |  -.0059359   .0047849    -1.24   0.215    -.0153141    .0034423
_rcs_agercs11 |   -.151491   .0235736    -6.43   0.000    -.1976944   -.1052876
_rcs_agercs12 |   .0450463   .0179144     2.51   0.012     .0099346    .0801579
_rcs_agercs13 |  -.0123745   .0099067    -1.25   0.212    -.0317912    .0070422
_rcs_agercs14 |  -.0119208    .005704    -2.09   0.037    -.0231005   -.0007412
_rcs_agercs15 |  -.0009359   .0034617    -0.27   0.787    -.0077207     .005849
_rcs_agercs21 |    .031694   .0249741     1.27   0.204    -.0172543    .0806423
_rcs_agercs22 |  -.0150531   .0192092    -0.78   0.433    -.0527025    .0225964
_rcs_agercs23 |   .0070513   .0106222     0.66   0.507    -.0137678    .0278705
_rcs_agercs24 |  -.0017843   .0059484    -0.30   0.764     -.013443    .0098744
_rcs_agercs25 |   .0034489   .0035076     0.98   0.325    -.0034259    .0103237
_rcs_agercs31 |  -.0198329   .0235528    -0.84   0.400    -.0659956    .0263298
_rcs_agercs32 |   .0079817   .0178004     0.45   0.654    -.0269064    .0428697
_rcs_agercs33 |  -.0088423   .0099241    -0.89   0.373    -.0282931    .0106085
_rcs_agercs34 |   .0070305   .0057611     1.22   0.222    -.0042612    .0183221
_rcs_agercs35 |   .0022557   .0035423     0.64   0.524    -.0046871    .0091984
_rcs_female1 |   .0000808   .0334771     0.00   0.998     -.065533    .0656947
_rcs_female2 |   .0016642   .0250578     0.07   0.947    -.0474481    .0507765
_rcs_female3 |   .0271875   .0136936     1.99   0.047     .0003486    .0540263
_rcs_female4 |  -.0023534   .0079136    -0.30   0.766    -.0178637     .013157
_rcs_female5 |  -.0005499   .0048164    -0.11   0.909      -.00999    .0088901
_cons |  -1.367196   .0317966   -43.00   0.000    -1.429516   -1.304876
---------------------------------------------------------------------------------

This is a complex model and I would not attempt to interpret individual parameters. However, the model can predict relative survival for any covariate pattern. For example, the code below predict relative survival for 50, 65 and 80 year old females in each deprivation group

range tt 0 5 101
foreach age in 50 65 80 {
rcsgen, scalar(age') knots($ageknots) rmatrix(Rage) gen(c) predict sage'_dep1, surv timevar(tt) /// at(female 1 dep5 0 femdep5 0 /// agercs1 =c1' agercs2 =c2' agercs3 =c3' /// agercs1_dep5 =c1' agercs2_dep5 =c2' agercs3_dep5 =c3' /// agercs1_fem =c1' agercs2_fem =c2' agercs3_fem =c3') predict sage'_dep5, surv timevar(tt) /// at(female 1 dep5 1 femdep5 1 /// agercs1 =c1' agercs2 =c2' agercs3 =c3' /// agercs1_dep5 =c1' agercs2_dep5 =c2' agercs3_dep5 =c3' /// agercs1_fem =c1' agercs2_fem =c2' agercs3_fem =c3') } Note the use of the scalar option in rcsgen to get the appropriate values of the spline variables for the different ages. These are then stored as scalars and used in the at() option of the predict command. Also note that I have created a variable, tt, containing the times to predict at. The predictions can then be plotted. Relative survival is lower as age increases and lower among those who live in more deprived areas. These are conditional predictions, i.e. prediction conditional on specific covariate patterns. By using standsurv we can obtain marginal predictions in order to obtain an overall summary of relative survival and perform contrasts between different population groups. ### Using standsurv standsurv enables various marginal estimates to be obtained. For a review of marginal measures in survival analysis see our recent International Journal of Epidemiology paper (Syrioupoulou et al. 2020). If an overall population summary of relative survival is required then marginal (standardized) measures can be obtained. For example marginal relative survival is simply the expectation over covariates$Z_2$, $$R_M(t) = E\left[R(t|Z_2) \right]$$ In a modelling framework an estimate can be obtained by obtaining predictions for each of the$N$individuals in the study at the observed values of their covariates and then taking an average of these predictions. $$\widehat{R}_{M}(t) = \frac{1}{N} \sum_{i=1}^{N} {\widehat{R}(t|Z_2=z_{2i})}$$ standsurv will do these calculations . standsurv, surv timevar(tt) ci /// > atvar(mrs) /// > at1(.) Note the at(.) option requests standsurv uses the observed covariate distribution to average over. Here standsurv has predicted a survival function for each of the 7,333 individuals in our study and then taken the average of these functions. The results can then be plotted. I have overlayed the non-parametric Pohar Perme estimator of marginal relative survival (using stpp) that shows the model is doing a pretty good job. I have referred to this is marginal relative survival. Under assumptions this can be interpreted as marginal net survival. Net survival is survival in the hypothetical situation where it is not possible to die from other causes. For details these interpretations and of the assumptons, see (Lambert et al. 2015, Pavlic, K. & Pohar Perme 2018). This measure is a summary for our population, but there is often interest in comparing different population subgroups. In these comparisons it is important to account for the fact that the age (and other covariate) distribution may be different between the groups being compared. This is where standsurv is useful through allowing one to “force” the same covariate distribution on both groups. Note that in the relative survival framework it is common to standardise to an external age distribution. This is shown in a separate tutorial and here I will use the covariate distribution observed in the study. The key issue is that by applying a simple comparison of marginal relative survival betwen population groups, the differences could be due to differences in the covariate distribution between the groups. For example. if one group was older an average then this could explain any observed difference in marginal relative survival. With two or more population groups it is useful to perform contrasts. Here we will compare the absolute difference in marginal relative survival between the two deprivation groups. To illustrate the methods I will compare the least and the most deprived individuals. First I will average over the combined covariate distribution of the two groups. The estimand of interest here is as follows, $$E\left[R(t|X=1,Z_2\right)] - E\left[R(t|X=0,Z_2\right)]$$ Here$X$denotes a binary exposure (deprivation group in our case) with$X=1$denoting the most deprived and$X=0\$ denoting the least deprived. Thus, the left hand term is the marginal relative survival among the exposed and the right hand term is the marginal relative survival among the unexposed.

This can be estimated using using two standardized survival functions, one where all individuals are forced to be exposed and one where there are forced to be unexposed.

$$\frac{1}{N}\sum_{i=1}^N{\widehat{R}(t|X=1,Z_2=z_{2i})} - \frac{1}{N}\sum_{i=1}^N{\widehat{R}(t|X=0,Z_2=z_{2i})}$$

The code required for standsurv to estimate this is a follows,

. standsurv, surv timevar(tt) ci                                                        ///
>            atvar(ms_dep1a ms_dep5a)                                                   ///
>            at1(dep5 1 femdep5 = female                                                ///
>                agercs1_dep5 = agercs1 agercs1_dep5 = agercs2 agercs1_dep5 = agercs3)  ///
>            at2(dep5 0 femdep5 0                                                       ///
>                agercs1_dep5 0 agercs1_dep5 0 agercs1_dep5 0)                          ///
>            contrast(difference)                                                       ///
>            contrastvar(ms_diffa)

As there are interactions in the model with the exposure we need to be careful in how these are specified in the at() options to manipulate exposures. The at1() refers to the those in the most deprived group (dep5=1). We want to force everyone to be in the deprived group and thus use dep5 1. However, there are interactions between deprivation and sex and also deprivation and age. As we are first forcing everyone to be exposed we need to force these interactions terms to be the values they would be if everyone was actually exposed, i.e. we use femdep5 = female for the deprivation / sex interactions and similar for the spline variables for age at diagnois. For at2() we are forcing everyone to be in the least deprived group (dep5=0) and so the interactions terms are all set to zero.

The resulting predictions can then be plotted.

An important point here is that we have not averaged over the covariate pattern within each deprivation group as the distribution of age and sex will be different and thus could potentially explain any differencs we see. We have averaged over same covariate distribution (of age and sex). The estimate in each group is hypothetical in that it is the estimated relative survival if each of the deprivation groups had the age/sex distribution of the group as a whole. It is done to give a “fair” comparison.

When performing standardization, it is possible to standardise to the covariate distribution of one of the groups, so at least one of the groups does not have a hypothetical covariate distribution. In the code below I use an if statement to restrict the standardisation to the covariate distribution among the deprived group.

. standsurv if dep==1, surv timevar(tt) ci                                             ///
>            atvar(ms_dep1b ms_dep5b)                                                  ///
>            at1(dep5 0 femdep5 0                                                      ///
>                agercs1_dep5 0 agercs1_dep5 0 agercs1_dep5 0)                         ///
>            at2(dep5 1 femdep5 = female                                               ///
>                agercs1_dep5 = agercs1 agercs1_dep5 = agercs2 agercs1_dep5 = agercs3)

The resulting predictions can then be plotted.

You probably can’t see much of a difference between the different graphs in this case as the age/sex distribution is very similar between the two groups.

. tab sex dep5, row

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

|         dep5
sex |         0          1 |     Total
-----------+----------------------+----------
Male |     2,122      1,703 |     3,825
|     55.48      44.52 |    100.00
-----------+----------------------+----------
Female |     1,949      1,559 |     3,508
|     55.56      44.44 |    100.00
-----------+----------------------+----------
Total |     4,071      3,262 |     7,333
|     55.52      44.48 |    100.00

. tabstat age, by(dep5)

Summary for variables: age
Group variable: dep5

dep5 |      Mean
---------+----------
0 |  74.24367
1 |  73.48467
---------+----------
Total |  73.90604
--------------------

An important point here is that the above analysis is fine for internal comparisons for this study, but could not be directly compared to other studies where the age/sex distribution could be different. See the tutorial on external standardzation in relative survival for how this can be done.

## References

Lambert, P. C.; Dickman, P. W. & Rutherford, M. J. Comparison of approaches to estimating age-standardized net survival. BMC Med Res Methodol 2015;15:64

Nelson, C. P.; Lambert, P. C.; Squire, I. B. & Jones, D. R. Flexible parametric models for relative survival, with application in coronary heart disease. Statistics in Medicine 2007;26:5486-5498

Pavlic, K. & Pohar Perme, M. Using pseudo-observations for estimation in relative survival. Biostatistics 2018;20:384-399

Syriopoulou, E.; Rutherford, M. J. & Lambert, P. C. Marginal measures and causal effects using the relative survival framework. International Journal of Epidemiology 2020;49:619–628