When data cannot cross borders
Comparing models fitted in different countries
This example is based on a presentation at the Association of Nordic Cancer Registries Conference 2024. The presentation can be found here
International collaborative research using cancer registry data enables exploration of differences in cancer incidence, mortality, and survival. However, there is increasing difficulty in moving data across borders, which means that international comparisons are becomiong more challenging. This is particularly the case when an analysis requires fitting survival models, as individual level data is usually required.
Separate analyses can be performed in each country, but a consequence of this is analyses are often restricted to being descriptive and more simplistic than if data could be combined.
For common cancer sites data is often large enough to fit separate models for each country without a great loss in precision.
The aim of this tutorial is to describe a framework where separate models are fitted in each country and the Stata model object (i.e. a .ster
file) is shared. This file contains the model parameters and other details of the model, but crucially no individual level data.
The example is based on entirely simulated (symthetic) data where the interest is to compare survival between two countries, Country A and Country B.
This is cancer registry data where the preferred method of analysis is relative survival, so relative survival models will be fitted. However, the approach of fitting separate models is general and also relevant for more standard (and more complex) survival models.
The data is stored in two files CountryA.dta
and CountryB.dta
. These data files will be analysed separately to mimic situations where data could not be shared between different countries.
Kaplan-Meier plots
The following code loads each dataset in turn and produces Kaplan-Meier plots.
use https://www.pclambert.net/data/CountryA, clear
.
stset survtime, failure(dead=1)
.
data settings
Survival-time
Failure event: dead==1
Observed time interval: (0, survtime]on or before: failure
Exit
---------------------------------------------------------------------
> -----total observations
20,997
0 exclusions
---------------------------------------------------------------------
> -----
20,997 observations remaining, representingin single-record/single-failure data
7,854 failures total analysis time at risk and under observation
59,517.533
At risk from t =
> 0
Earliest observed entry t =
> 0exit t =
Last observed
> 5
sts graph, title(Country A) name(CountryA_km, replace)
.
Failure _d: dead==1
Analysis time _t: survtime
. summ agediag
dev. Min M
Variable | Obs Mean Std.
> ax
-------------+-------------------------------------------------------
> --
agediag | 20,997 74.78556 9.973784 18.12546 101.57
> 84
. use https://www.pclambert.net/data/CountryB, clear
.
stset survtime, failure(dead=1)
.
data settings
Survival-time
Failure event: dead==1
Observed time interval: (0, survtime]on or before: failure
Exit
---------------------------------------------------------------------
> -----total observations
10,498
0 exclusions
---------------------------------------------------------------------
> -----
10,498 observations remaining, representingin single-record/single-failure data
4,379 failures total analysis time at risk and under observation
28,253.044
At risk from t =
> 0
Earliest observed entry t =
> 0exit t =
Last observed
> 5
sts graph, title(Country B) name(CountryB_km, replace)
.
Failure _d: dead==1
Analysis time _t: survtime
. summ agediag
dev. Min M
Variable | Obs Mean Std.
> ax
-------------+-------------------------------------------------------
> --
agediag | 10,498 75.53747 10.05265 18.12765 102.46
> 85
. graph combine CountryA_km CountryB_km .
It can be seen that survival appears slightly worse in Country B. The individuals in Country B are slightly older on average.
Fitting a model to Country A
We will first fit a model to data from Country A. We are assuming we have access to data from Country B, but have a colleague who will fit the model in Country A and then send the model .ster
file.
We first load and stset
the data.
use https://www.pclambert.net/data/CountryA, clear
.
stset survtime, failure(dead=1)
.
data settings
Survival-time
Failure event: dead==1
Observed time interval: (0, survtime]on or before: failure
Exit
---------------------------------------------------------------------
> -----total observations
20,997
0 exclusions
---------------------------------------------------------------------
> -----
20,997 observations remaining, representingin single-record/single-failure data
7,854 failures total analysis time at risk and under observation
59,517.533
At risk from t =
> 0
Earliest observed entry t =
> 0exit t =
Last observed > 5
As we are fitting relative survial model we need to merge in the expected rates at the event/censoring times.
// merge in rates
. gen _age = min(floor(agediag + _t),99)
.
gen _year = year(diagdate + _t*365.25)
.
merge m:1 _age _year sex using https://www.pclambert.net/data/popmo
. ///
> rtA, keep(match master)
>
of obs
Result Number
-----------------------------------------
Not matched 0_merge==3)
Matched 20,997 ( -----------------------------------------
We then fit a model that incorporates the expected mortality rates using then bhazard(rate)
option. The model will include age at diagnosis (agediag
) and sex (sex
) as covariates. Natural splines will be used to model the effect of agediag
with an interaction included with sex
. In addition, both agediag
and sex
have time-dependent effects, i.e. we are allowing for non-proportional excess hazards.
scale(lncumhazard) df(5) ///
. stpm3 i.sex##@ns(agediag,df(3)), ///
> tvc(i.sex @ns(agediag,df(3) winsor(1 99))) dftvc(3)
> bhazard(rate)
Iteration 0: Log likelihood = -18987.251
Iteration 1: Log likelihood = -18784.806
Iteration 2: Log likelihood = -18772.486
Iteration 3: Log likelihood = -18772.439
Iteration 4: Log likelihood = -18772.439
of obs
Number
> = 20,997chi2(7)
Wald
> = 408.23chi2
Log likelihood = -18772.439 Prob >
> = 0.0000
---------------------------------------------------------------------
> ---------
| Coefficient Std. err. z P>|z| [95% conf.
> interval]
-------------+-------------------------------------------------------
> ---------xb |
2.sex | -1.074461 .2982228 -3.60 0.000 -1.658967
> -.4899548
_ns_f1_age~1 | -10.73306 2.229922 -4.81 0.000 -15.10362
> -6.362488
_ns_f1_age~2 | 1.838314 1.074649 1.71 0.087 -.2679584
> 3.944587
_ns_f1_age~3 | -1.328113 .4536643 -2.93 0.003 -2.217278
> -.4389471
|
sex#|
c. |
_ns_f1_age~1 |
2 | 7.430845 3.10376 2.39 0.017 1.347588
> 13.5141
|
sex#|
c. |
_ns_f1_age~2 |
2 | -2.276144 1.5657 -1.45 0.146 -5.34486
> .7925727
|
sex#|
c. |
_ns_f1_age~3 |
2 | 2.137709 .5439365 3.93 0.000 1.071613
> 3.203805
-------------+-------------------------------------------------------
> ---------
time |
_ns1 | -33.94606 3.820788 -8.88 0.000 -41.43466
> -26.45745
_ns2 | 13.16497 1.857333 7.09 0.000 9.524664
> 16.80527
_ns3 | -.7336646 .1254783 -5.85 0.000 -.9795976
> -.4877316
_ns4 | -.4517587 .1236179 -3.65 0.000 -.6940453
> -.2094721
_ns5 | -.3260899 .1155413 -2.82 0.005 -.5525466
> -.0996332
|
sex#|
c._ns_tvc1 |
2 | 4.06715 2.178313 1.87 0.062 -.2022648
> 8.336565
|
sex#|
c._ns_tvc2 |
2 | -1.092972 1.153449 -0.95 0.343 -3.353691
> 1.167748
|
sex#|
c._ns_tvc3 |
2 | .3388006 .0639397 5.30 0.000 .2134811
> .4641201
|
c. |
_ns_f2_age~1#|
c._ns_tvc1 | -10.87524 36.43505 -0.30 0.765 -82.28662
> 60.53614
|
c. |
_ns_f2_age~1#|
c._ns_tvc2 | -2.683456 19.10172 -0.14 0.888 -40.12214
> 34.75523
|
c. |
_ns_f2_age~1#|
c._ns_tvc3 | -.0657363 .8404523 -0.08 0.938 -1.712993
> 1.58152
|
c. |
_ns_f2_age~2#|
c._ns_tvc1 | 5.078 20.44475 0.25 0.804 -34.99297
> 45.14897
|
c. |
_ns_f2_age~2#|
c._ns_tvc2 | -.6415725 10.72052 -0.06 0.952 -21.6534
> 20.37026
|
c. |
_ns_f2_age~2#|
c._ns_tvc3 | -.6496364 .3865693 -1.68 0.093 -1.407298
> .1080254
|
c. |
_ns_f2_age~3#|
c._ns_tvc1 | 2.506379 7.991825 0.31 0.754 -13.15731
> 18.17007
|
c. |
_ns_f2_age~3#|
c._ns_tvc2 | -4.63819 4.24192 -1.09 0.274 -12.9522
> 3.67582
|
c. |
_ns_f2_age~3#|
c._ns_tvc3 | -.2157422 .3409415 -0.63 0.527 -.8839752
> .4524909
|_cons | .233223 .2607551 0.89 0.371 -.2778476
> .7442937
---------------------------------------------------------------------
> ---------
Extended functions
(1) @ns(agediag, df(3)) (2) @ns(agediag, df(3) winsor(1 99))
Now the model has been fitted, we can use estimates save
to save the model estimates.
estimates save CountryA, replace
. file CountryA.ster saved
The saves the file CountryA.ster
. It is useful to understand what is stored in this file. Essentially, it is the information you see when you type ereturn list
, the output of which is shown below.
ereturn list
.
scalars:e(rank) = 25
e(N) = 20997
e(ic) = 4
e(k) = 29
e(k_eq) = 2
e(k_dv) = 0
e(converged) = 1
e(rc) = 0
e(ll) = -18772.43865176562
e(k_eq_model) = 1
e(df_m) = 7
e(chi2) = 408.2254068442145
e(p) = 4.10585674052e-84
e(dev) = 37544.87730353124
e(AIC) = 37594.87730353124
e(BIC) = 37769.09675940969
macros:e(cmdline) : "i.sex##@ns(agediag,df(3)), scale(lncumh.."
e(cmd) : "stpm3"
e(predict) : "stpm3_pred"
e(ef_Nuser) : "0"
e(ef_agediag_winsor
"46.43008277735672 93.75740895924326"
2) : e(ef_agediag_knots2) : "46.43008277735672 71.40992416203093 79..."
e(ef_agediag_opts2) : "df(3) winsor(1 99)"
e(ef_agediag_fn2) : "ns"
e(ef_agediag_knots1) : "18.12545708589547 71.40992416203093 79..."
e(ef_agediag_opts1) : "df(3)"
e(ef_agediag_fn1) : "ns"
e(ef_agediag_Nfn) : "2"
e(bhazard) : "rate"
e(ef_varlist) : "agediag"
e(model_vars) : "sex agediag"
e(scale) : "lncumhazard"
e(degree) : "3"
e(ttrans) : "lnt"
e(splinetype) : "ns"
e(knots_tvc) : "-6.536461339131925 -.4699117670703595 ..."
e(tvcvars) : "_ns_f2_agediag1 _ns_f2_agediag2 _ns_f2_.."
e(sharedtvc_knots) : "1"
e(dfbase) : "5"
e(knots) : "-6.536461339131925 -1.053905396170044 -.."
e(dsplinevars) : "_dns1 _dns2 _dns3 _dns4 _dns5 2.sex#c._.."
e(splinevars_tvc) : "_ns_tvc1 _ns_tvc2 _ns_tvc3"
e(splinelist_tvc) : "2.sex#c._ns_tvc1 2.sex#c._ns_tvc2 2.sex.."
e(splinelist) : "_ns1 _ns2 _ns3 _ns4 _ns5"
e(tvc_included) : "2.sex _ns_f2_agediag1 _ns_f2_agediag2 _.."
e(tvc_original) : "i.sex @ns(agediag,df(3) winsor(1 99))"
e(tvc) : "i.sex c.(_ns_f2_agediag1 _ns_f2_agediag.."
e(varlist_original) : "i.sex##@ns(agediag,df(3))"
e(varlist) : "i.sex##c.(_ns_f1_agediag1 _ns_f1_agedia.."
e(chi2type) : "Wald"
e(deriv_useminbound) : "off"
e(opt) : "moptimize"
e(vce) : "oim"
e(user) : "stpm3_gf2_lncumhazard_rs()"
e(ml_method) : "gf2"
e(technique) : "nr"
e(which) : "max"
e(properties) : "b V"
matrices:e(b) : 1 x 29
e(V) : 29 x 29
e(Cns) : 4 x 30
e(ilog) : 1 x 20
e(gradient) : 1 x 29
functions:e(sample)
It contains summary information such as the log-likelihood (e(ll)
), the knot locations for the baseline (e(knots)
), the details of the natural spline for agediag
including the location of the knots (e(ef_agediag_knots1)
).
The parameter estimates are stored in e(b)
and the variance matrix in e(V)
. The parameter estimates are listed below,
matrix list e(b)
.
e(b)[1,29]
xb: xb: xb: xb:
1b. 2.
sex sex _ns_f1_age~1 _ns_f1_age~2
y1 0 -1.0744608 -10.733055 1.8383143
xb: xb: xb: xb:
1b.sex# 2.sex# 1b.sex#
_ns_f1_age~3 co._ns_f1_~1 c._ns_f1_a~1 co._ns_f1_~2
y1 -1.3281127 0 7.4308452 0
xb: xb: xb: time:
2.sex# 1b.sex# 2.sex#
c._ns_f1_a~2 co._ns_f1_~3 c._ns_f1_a~3 _ns1
y1 -2.2761439 0 2.1377088 -33.946056
time: time: time: time:
_ns2 _ns3 _ns4 _ns5
y1 13.164969 -.73366459 -.45175869 -.3260899
time: time: time: time:
2.sex# 2.sex# 2.sex# c._ns_f2_a~1#
c._ns_tvc1 c._ns_tvc2 c._ns_tvc3 c._ns_tvc1
y1 4.0671502 -1.0929718 .33880061 -10.875241
time: time: time: time:
c._ns_f2_a~1# c._ns_f2_a~1# c._ns_f2_a~2# c._ns_f2_a~2#
c._ns_tvc2 c._ns_tvc3 c._ns_tvc1 c._ns_tvc2
y1 -2.6834564 -.06573629 5.078 -.64157254
time: time: time: time:
c._ns_f2_a~2# c._ns_f2_a~3# c._ns_f2_a~3# c._ns_f2_a~3#
c._ns_tvc3 c._ns_tvc1 c._ns_tvc2 c._ns_tvc3
y1 -.6496364 2.5063795 -4.6381903 -.21574216
time:
_cons
y1 .23322305
Essentially the information shown in the output of estimates list
and saved in the CountryA.ster
file contains all the information we need to make predictions from the model. We are able to predict survival and other measures for any covariate pattern for any point in time from diagnosis. Importantly, we do not need the original data to do this. It is important to stress again, that no individual level data is stored in the CountryA.ster
file.
Fitting a model to Country B
We are assuming that we have access to the data in Country B, so we can load the data and fit the model. We use the same model as before for simplicity and consistency, but we could have fitted a model with different numbers/locations of knots or more/less interactions.
use https://www.pclambert.net/data/CountryB, clear
.
stset survtime, failure(dead=1)
.
data settings
Survival-time
Failure event: dead==1
Observed time interval: (0, survtime]on or before: failure
Exit
---------------------------------------------------------------------
> -----total observations
10,498
0 exclusions
---------------------------------------------------------------------
> -----
10,498 observations remaining, representingin single-record/single-failure data
4,379 failures total analysis time at risk and under observation
28,253.044
At risk from t =
> 0
Earliest observed entry t =
> 0exit t =
Last observed
> 5
. // merge in rates
. gen _age = min(floor(agediag + _t),99)
.
gen _year = year(diagdate + _t*365.25)
.
merge m:1 _age _year sex using https://www.pclambert.net/data/popmo
. ///
> rtA, keep(match master)
>
of obs
Result Number
-----------------------------------------
Not matched 0_merge==3)
Matched 10,498 (
-----------------------------------------
. scale(lncumhazard) df(5) ///
. stpm3 i.sex##@ns(agediag,df(3)), ///
> tvc(i.sex @ns(agediag,df(3) winsor(1 99))) dftvc(3)
> bhazard(rate)
Iteration 0: Log likelihood = -9934.4206
Iteration 1: Log likelihood = -9869.628
Iteration 2: Log likelihood = -9844.5096
Iteration 3: Log likelihood = -9844.268
Iteration 4: Log likelihood = -9844.2677
of obs
Number
> = 10,498chi2(7)
Wald
> = 281.55chi2
Log likelihood = -9844.2677 Prob >
> = 0.0000
---------------------------------------------------------------------
> ---------
| Coefficient Std. err. z P>|z| [95% conf.
> interval]
-------------+-------------------------------------------------------
> ---------xb |
2.sex | -.5367956 .3572884 -1.50 0.133 -1.237068
> .1634767
_ns_f1_age~1 | -17.11178 3.507369 -4.88 0.000 -23.98609
> -10.23746
_ns_f1_age~2 | 4.56246 1.740491 2.62 0.009 1.151161
> 7.973759
_ns_f1_age~3 | -2.074449 .5580852 -3.72 0.000 -3.168276
> -.9806221
|
sex#|
c. |
_ns_f1_age~1 |
2 | 10.21369 4.610953 2.22 0.027 1.176392
> 19.251
|
sex#|
c. |
_ns_f1_age~2 |
2 | -4.779487 2.362219 -2.02 0.043 -9.40935
> -.1496237
|
sex#|
c. |
_ns_f1_age~3 |
2 | 1.155039 .6679121 1.73 0.084 -.154045
> 2.464122
-------------+-------------------------------------------------------
> ---------
time |
_ns1 | -26.61548 3.729184 -7.14 0.000 -33.92454
> -19.30641
_ns2 | 9.032865 1.781381 5.07 0.000 5.541423
> 12.52431
_ns3 | -1.155737 .1753568 -6.59 0.000 -1.49943
> -.8120444
_ns4 | -.7050778 .1712158 -4.12 0.000 -1.040655
> -.3695009
_ns5 | -.5002046 .1657581 -3.02 0.003 -.8250844
> -.1753247
|
sex#|
c._ns_tvc1 |
2 | 2.543556 1.996199 1.27 0.203 -1.368922
> 6.456034
|
sex#|
c._ns_tvc2 |
2 | -.4942632 1.061708 -0.47 0.642 -2.575173
> 1.586647
|
sex#|
c._ns_tvc3 |
2 | .3206083 .0887039 3.61 0.000 .1467518
> .4944648
|
c. |
_ns_f2_age~1#|
c._ns_tvc1 | -.3671341 29.36376 -0.01 0.990 -57.91905
> 57.18478
|
c. |
_ns_f2_age~1#|
c._ns_tvc2 | -2.516029 15.62894 -0.16 0.872 -33.14818
> 28.11612
|
c. |
_ns_f2_age~1#|
c._ns_tvc3 | 1.186299 1.126176 1.05 0.292 -1.020966
> 3.393564
|
c. |
_ns_f2_age~2#|
c._ns_tvc1 | 3.057383 16.23862 0.19 0.851 -28.76972
> 34.88449
|
c. |
_ns_f2_age~2#|
c._ns_tvc2 | -1.467367 8.623812 -0.17 0.865 -18.36973
> 15.43499
|
c. |
_ns_f2_age~2#|
c._ns_tvc3 | -1.063785 .4957864 -2.15 0.032 -2.035508
> -.0920615
|
c. |
_ns_f2_age~3#|
c._ns_tvc1 | -2.472087 7.672579 -0.32 0.747 -17.51006
> 12.56589
|
c. |
_ns_f2_age~3#|
c._ns_tvc2 | .5907294 4.114411 0.14 0.886 -7.473367
> 8.654826
|
c. |
_ns_f2_age~3#|
c._ns_tvc3 | .1797678 .4757156 0.38 0.706 -.7526177
> 1.112153
|_cons | .9081922 .317337 2.86 0.004 .2862231
> 1.530161
---------------------------------------------------------------------
> ---------
Extended functions
(1) @ns(agediag, df(3)) (2) @ns(agediag, df(3) winsor(1 99))
We save the model estimates as before,
estimates save CountryB, replace
. file CountryB.ster saved
Obtaining marginal (standardized) relative survival
Now that we have the two .ster
files. We can use them to obtain summary measures that will allow us to compare survival between Country A and Country B.
An overall summary measure is standardized marginal relative survival. As we assuming we are in Counrty B we will use this data to define the covariate (age/sex) distribution that we want to standardize to.
We load the data we have access to, i.e. Country B.
use https://www.pclambert.net/data/CountryB, clear
.
stset survtime, failure(dead=1)
.
data settings
Survival-time
Failure event: dead==1
Observed time interval: (0, survtime]on or before: failure
Exit
---------------------------------------------------------------------
> -----total observations
10,498
0 exclusions
---------------------------------------------------------------------
> -----
10,498 observations remaining, representingin single-record/single-failure data
4,379 failures total analysis time at risk and under observation
28,253.044
At risk from t =
> 0
Earliest observed entry t =
> 0exit t =
Last observed > 5
We then load the model estimates for the model fitted to this data using estimates use
.
use CountryB . estimate
We can then use standsurv
to obtain the estimated marginal relative survival function.
range tt 0 5 101
. missing values generated)
(10,397
ci frame(RS, replace)
. standsurv RS_B, surv timevar(tt) not included in the model
!!WARNING!! You are including observations estimates.
when calculating standardized this is what you intended. Ensure
We get a warning that we are averaging over observations not included in the model. This is not actually true as we fitted the model to Country B and have reloaded the data for Country B. However, when you use estimate use
the e(sample)
is set to zero, which means that Stata has not linked the data in memory to the data that was used to fit the model. See the Stata help file for why this is. We will get the same warning when we use standsurv
for a model that was not fitted to the data in memory.
After running standsurv
we have the is the estimated internally age/sex standardized relative survival for Country B.
When we standardize we want to standardize over the same covariate pattern so that comparisons between the population groups we are comparing are fair. Therefore to obtain standardized estimates for Country A, we keep the same data in memory, (i.e. the data for Country B), but the predictions are based on the model for Country A. To do this we load the model estimates for Country A from CountryA.ster
use CountryA . estimate
We can then run standsurv
as before, but we now use frame(RS, merge)
so we store the predictions in the same frame as the predictions for Country B.
merge) ci
. standsurv RS_A, surv frame(RS, not included in the model
!!WARNING!! You are including observations estimates.
when calculating standardized this is what you intended. Ensure
We can now plot the marginal relative survival.
. frame RS {twoway (rarea RS_A_lci RS_A_uci tt, color(%30)) //
.
> /line RS_A tt, pstyle(p1line)) //
> (
> / color(%30)) //
> (rarea RS_B_lci RS_B_uci tt, pstyle(p2line)
> /line RS_B tt, pstyle(p2line)), //
> (
> /xtitle("Years from diagnosis") //
>
> /ytitle("Marginal Relative Survival") //
>
> /ylabel(,format(%3.1f)) //
>
> /legend(order(2 "Country A" 4 "Country B") pos(1)) //
>
> /name(RS, replace)
> . }
Note that as we have standardized to the age/sex distribution of Country B, the relative survival function for country B is a model based estimate of the ‘observed’ marginal relative survival in Country B. The estimated relative survival function for Country A is an estimate of the marginal relative survival in Country A if it had the age/sex distribution of Country B.
Relative survival as a function of age at diagnosis
We can now do a more complicated prediction. The code below obtains the relative survival at 5 years as a function of age at diagnosis. We show the prediction for males (sex==1
), but it would be simple to extend the code to obtain predictions for females.
. frame create ageplot
. frame ageplot {range agediag 50 80 31
. of observations (_N) was 0, now 31.
Number gen sex = 1
. estimates use CountryA
. gen t5 = 5
. predict RS5_A, surv at1(.) timevar(t5) ci merge
.
. estimates use CountryB
. predict RS5_B, surv at1(.) timevar(t5) ci merge
. . }
Note the use of at1(.)
which will predict at the observed values of covariates in the active frame (which we have have calculated). In addition, we use the merge
option as we want the predictions to be returned to the current frame (ageplot
) rather than a new frame.
Having generated the predictions, we can plot the results.
. frame ageplot {twoway (rarea RS5_A_lci RS5_A_uci agediag, color(%30))
. ///
> line RS5_A agediag, pstyle(p1line))
> (///
> color(%30) pstyle(p2li
> (rarea RS5_B_lci RS5_B_uci agediag, ///
> ne)) line RS5_B agediag, pstyle(p2line)),
> (///
> xtitle("Years from diagnosis")
> ///
> ytitle("Marginal 5-year Relative Survival")
> ///
> ylabel(,format(%3.1f))
> ///
> legend(order(2 "Country A" 4 "Country B") pos(1))
> ///
> name(RS_age, replace)
> . }
All cause survival
Relative survival is awkward to interpret in that it attempts to estimate net survival, which is survival in the hypothetical situation where it is not possible to die from any cause other than the cancer under study.
However, we can transform predictions to obtain all cause survival (\(S_i(t)\)) by incorporating the expected mortality survival, \(S_i^*(t)\), and combining with relative survival, \(R_i(t)\).
\[ \widehat{S}_i(t) = S_i^*(t) \widehat{R}_i(t) \]
To do this we use the expsurv()
option of the standsurv
command.
First the predictions for Country A
estimates use CountryA
.
ci frame(AC, replace)
. standsurv S_A, surv timevar(tt) ///
>
> expsurv(datediag(diagdate) ///
>
> agediag(agediag) ///
> using(https://www.pclambert.net/data/popm
> ///
> ortA)
> pmrate(rate) ///
>
> pmyear(_year) ///
>
> pmage(_age) ///
>
> pmother(sex) ///
>
> pmmaxyear(2020)) not included in the model
!!WARNING!! You are including observations estimates.
when calculating standardized this is what you intended. Ensure
.. and then for Country B.
estimates use CountryB
.
ci frame(AC, merge)
. standsurv S_B, surv ///
>
> expsurv(datediag(diagdate) ///
>
> agediag(agediag) ///
> using(https://www.pclambert.net/data/popm
> ///
> ortB)
> pmrate(rate) ///
>
> pmyear(_year) ///
>
> pmage(_age) ///
>
> pmother(sex) ///
>
> pmmaxyear(2020)) not included in the model
!!WARNING!! You are including observations estimates.
when calculating standardized this is what you intended. Ensure
We can now plot the model based estimate of all-cause survival.
. frame AC {twoway (rarea S_A_lci S_A_uci tt, color(%30)) //
.
> /line S_A tt, pstyle(p1line)) //
> (
> / color(%30)) //
> (rarea S_B_lci S_B_uci tt, pstyle(p2line)
> /line S_B tt, pstyle(p2line)), //
> (
> /xtitle("Years from diagnosis") //
>
> /ytitle("Marginal All cause Survival") //
>
> /ylabel(,format(%3.1f)) //
>
> /legend(order(2 "Country A" 4 "Country B") pos(1)) //
>
> /name(RS, replace)
> . }
There is a clear difference in all cause survival between Country A and Country B. However, although we have standardized to the same age/sex distribution, differences could be due to differential cancer (excess) mortality rates and/or differential other cause (expected) mortality rates. This is one reason why relative survival has been the preferred metric when quantifying cancer survival. We have developed reference adjusted measures that enables the reporting of all-cause survival in a way where the only differences should be due to differential cancer mortality rates as shown in the next section.
Reference Adjustment
With reference adjustment we define a reference expected survival, \(S_i^{**}\) that is the same between any groups we are comparing. We can then obtain reference adjusted all cause survival as follows, \(\widehat{S}_i^{REF}(t)\).
\[ \widehat{S}_i^{REF}(t) = S_i^{**}(t) \widehat{R}_i(t) \]
This is for a particular individual, \(i\), but we can average of many individuals to obtain marginal (standardized) estimates using standsurv
. Here will use the expected mortality rates for Country B as the reference. This means that the reference adjusted marginal estimates of all cause survival for Country B will be an estimate of the observed all cause survival in Country B. For Country A it will be an estimate of the marginal all cause survival in Country A if it had the expected mortality rates of Country B and the age/sex distribution of Country B. Note that the appropriate expected mortality rates for country A were used when fitting the model, but the reference rates are used when making predictions from the model.
The reason for doing this is that any differences we see in the marginal all cause survival will only be due to differential relative survival between the two countries, see Lambert et al. 2020.
We reload the estimates for the model fitted to Country B and then run standsurv
estimates use CountryB
.
ci frame(RefAdj, replace)
. standsurv Sref_B, surv timevar(tt) ///
>
> expsurv(datediag(diagdate) ///
>
> agediag(agediag) ///
> using(https://www.pclambert.net/data/popm
> ///
> ortB)
> pmrate(rate) ///
>
> pmyear(_year) ///
>
> pmage(_age) ///
>
> pmother(sex) ///
>
> pmmaxyear(2020)) not included in the model
!!WARNING!! You are including observations estimates.
when calculating standardized this is what you intended. Ensure
We then reload the estimates for the model fitted to Country A and run standsurv
again.
estimates use CountryA
.
ci frame(RefAdj, merge)
. standsurv Sref_A, surv ///
>
> expsurv(datediag(diagdate) ///
>
> agediag(agediag) ///
> using(https://www.pclambert.net/data/popm
> ///
> ortB)
> pmrate(rate) ///
>
> pmyear(_year) ///
>
> pmage(_age) ///
>
> pmother(sex) ///
>
> pmmaxyear(2020)) not included in the model
!!WARNING!! You are including observations estimates.
when calculating standardized this is what you intended. Ensure
And finally plot the results.
. frame RefAdj {twoway (rarea Sref_A_lci Sref_A_uci tt, color(%30))
. ///
> line Sref_A tt, pstyle(p1line))
> (///
> color(%30)
> (rarea Sref_B_lci Sref_B_uci tt, pstyle(p2line) ///
> ) line Sref_B tt, pstyle(p2line)),
> (///
> xtitle("Years from diagnosis")
> ///
> ytitle("Reference Adjusted All cause Survival")
> ///
> ylabel(,format(%3.1f))
> ///
> legend(order(2 "Country A" 4 "Country B") pos(1))
> ///
> name(RS, replace)
> . }
As we have standardized over the age/sex distribution of Country B and used the expected mortality rates for Country B, this figure gives an estimate of the average all cause survival for Country B (it will be very similar the all cause Kaplan-Meier estimate).
However, for Country A the estimate is hypothetical in two ways. Firstly, it is standardized over the age/sex distribution for Country B. Secondly, the reference expected mortality rates are for Country B, so it gives an estimate of what the all cause survival would be in Country A if it had the expected mortality rates seen in Country B and the age/sex distribution of Country B. We should remember that the reason for making the estimates Country A hypothetical are for reasons of comparability and we want to isolate differences that are only due to differences in cancer (excess) mortality rates.
Reference Adjustment with external age standardization
When it is of interest to make comparisons with other studies it is important to standardize to the same age/sex distribution. Many population-based cancer survival studies use the International Cancer Survival Standard as an external age distribution to standardise to.
The following code calcuates the ICSS age groups and associated weights.
//Define ICSS age groups and weights
. recode agediag (min/44.999=1) (45/54.999=2) (55/64.999=3) (65/74.99
. max=5), gen(ICSSagegrp)
> 9=4) (75/
(10,498 differences between agediag and ICSSagegrp)
recode ICSSagegrp (1=0.07) (2=0.12) (3=0.23) (4=0.29) (5=0.29), gen
.
> (ICSSwt) (10,498 differences between ICSSagegrp and ICSSwt)
Next we calculate the proportion in our data (Country B) in each age group for each sex. We then calculate the ratio of the ICSS weights and the observed proportion in each age group, so that we can upweight or downweight individuals in each age group.
bysort sex: gen sextotal= _N
.
bysort ICSSagegrp sex:gen a_age = _N/sextotal
.
gen double wt_age = ICSSwt/a_age
.
. bysort sex ICSSagegrp: gen first = _n==1
.
list sex ICSSagegrp ICSSwt a_age wt_age if first, noobs ab(12)
.
+---------------------------------------------------+
| sex ICSSagegrp ICSSwt a_age wt_age |
|---------------------------------------------------|
| 1 1 .07 .00617443 11.337083 |
| 1 2 .12 .02405454 4.9886631 |
| 1 3 .23 .1039362 2.212896 |
| 1 4 .29 .32518652 .89179589 |
| 1 5 .29 .54064831 .53639305 |
|---------------------------------------------------|
| 2 1 .07 .00844347 8.2904348 |
| 2 2 .12 .0256975 4.6697143 |
| 2 3 .23 .10352423 2.2217021 |
| 2 4 .29 .30029369 .96572127 |
| 2 5 .29 .56204112 .51597649 |
+---------------------------------------------------+
sort tt .
Next we can use standsurv
’ to obtain marginal estimates. Here we calculate marginal relative survival, but we could do this for reference adjusted measures as well.
We do this first for Country B,
estimates use CountryB
.
if sex==1, surv timevar(tt) ci frame(RS_ICSS, replac
. standsurv RS_B e) ///
>
> indweights(wt_age)not included in the model
!!WARNING!! You are including observations estimates.
when calculating standardized this is what you intended. Ensure
and then for Country A.
estimates use CountryA
.
if sex==1, surv ci frame(RS_ICSS, merge) ///
. standsurv RS_A
> indweights(wt_age) not included in the model
!!WARNING!! You are including observations estimates.
when calculating standardized this is what you intended. Ensure
Finally, we can plot the results.
. frame RS_ICSS {twoway (rarea RS_A_lci RS_A_uci tt, color(%30)) //
.
> /line RS_A tt, pstyle(p1line)) //
> (
> / color(%30)) //
> (rarea RS_B_lci RS_B_uci tt, pstyle(p2line)
> /line RS_B tt, pstyle(p2line)), //
> (
> /xtitle("Years from diagnosis") //
>
> /ytitle("Marginal Relative Survival") //
>
> /ylabel(,format(%3.1f)) //
>
> /legend(order(2 "Country A" 4 "Country B") pos(1)) //
>
> /name(RS_ICSS, replace)
> . }
Standardizing to the age/sex distribution of Country A
We have been assuming that we are in, and have access to data from Country B. What if we want to standardize to the age/sex distribution of Country A? We could send the Country B.ster
file to a colleague in Country A and they could repeat the analysis above, but now the standardization would be over the age/sex distribution in Country A (i.e. the data in memory).
One other option would be to obtain a summary of the age/sex distribution in Country A and use this to create a data set to standardize over. We show two ways of standardizing once we have this information.
First we will assume that a colleague in Country A has produced a summary of the proportion in each age/sex combination, perhaps using code like the following.
use https://www.pclambert.net/data/CountryA, clear
.
gen ageint = floor(agediag)
.
collapse (percent) cellpercent=agediag, by(ageint sex)
.
list in 1/20, noobs ab(11)
.
+----------------------------+
| sex ageint cellpercent |
|----------------------------|
| 1 18 .00476259 |
| 2 18 .00476259 |
| 1 19 .01428776 |
| 2 19 .00476259 |
| 1 20 .00476259 |
|----------------------------|
| 1 21 .01428776 |
| 2 21 .00476259 |
| 1 23 .00952517 |
| 2 23 .00476259 |
| 1 25 .00952517 |
|----------------------------|
| 1 26 .00476259 |
| 1 27 .00952517 |
| 2 27 .00476259 |
| 1 28 .00476259 |
| 1 29 .01428776 |
|----------------------------|
| 1 30 .02381293 |
| 1 31 .02857551 |
| 2 31 .00476259 |
| 1 32 .00476259 |
| 2 32 .01428776 |
+----------------------------+
rename ageint agediag
.
save CountryA_age_sex_summary, replace
. file CountryA_age_sex_summary.dta saved
This given us the proportion in each age/sex comination in Country A
We can then calculate the weights.
use CountryA_age_sex_summary, clear
.
gen wt = (cellpercent/100)*_N .
These weights need to sum to the total number of observations in the data, which is why the percentage is converted to a proportion and then multiplied by _N
.
We can then use standsurv
for each of the the model estimates passing the weights using the indweights()
option. The code below shows for estimation of standardized relative survival, but this can easily be extended in a similar way to above for reference adjustment etc.
estimates use CountryA
.
range tt 0 10 101
. missing values generated)
(52
ci indweights(wt) ///
. standsurv RS_A, surv timevar(tt) replace)
> frame(RS_standA1, not included in the model
!!WARNING!! You are including observations estimates.
when calculating standardized this is what you intended.
Ensure
estimates use CountryB
.
ci indweights(wt) ///
. standsurv RS_B, surv merge)
> frame(RS_standA1, not included in the model
!!WARNING!! You are including observations estimates.
when calculating standardized this is what you intended. Ensure
This can be plotted.
. frame RS_standA1 {twoway (rarea RS_A_lci RS_A_uci tt, color(%30)) //
.
> /line RS_A tt, pstyle(p1line)) //
> (
> / color(%30)) //
> (rarea RS_B_lci RS_B_uci tt, pstyle(p2line)
> /line RS_B tt, pstyle(p2line)), //
> (
> /xtitle("Years from diagnosis") //
>
> /ytitle("Marginal Relative Survival") //
>
> /ylabel(,format(%3.1f)) //
>
> /legend(order(2 "Country A" 4 "Country B") pos(1)) //
>
> /name(RS, replace)
> . }
An alternative way to get the same result is to expand the data to match the percentage for each cell. Here we make the sample size 20000.
use CountryA_age_sex_summary, clear
.
gen freq = round(20000*cellpercent/100)
.
. expand freq
(19,853 observations created)
estimates use CountryA
.
range tt 0 10 101
. missing values generated)
(19,905
ci ///
. standsurv RS_A, surv timevar(tt) replace)
> frame(RS_standA2, not included in the model
!!WARNING!! You are including observations estimates.
when calculating standardized this is what you intended.
Ensure
estimates use CountryB
.
ci ///
. standsurv RS_B, surv merge)
> frame(RS_standA2, not included in the model
!!WARNING!! You are including observations estimates.
when calculating standardized this is what you intended. Ensure
This avoids having to use the indweights()
option.
. frame RS_standA2 {twoway (rarea RS_A_lci RS_A_uci tt, color(%30)) //
.
> /line RS_A tt, pstyle(p1line)) //
> (
> / color(%30)) //
> (rarea RS_B_lci RS_B_uci tt, pstyle(p2line)
> /line RS_B tt, pstyle(p2line)), //
> (
> /xtitle("Years from diagnosis") //
>
> /ytitle("Marginal Relative Survival") //
>
> /ylabel(,format(%3.1f)) //
>
> /legend(order(2 "Country A" 4 "Country B") pos(1)) //
>
> /name(RS, replace)
> . }
References
Lambert PC, Andersson TM-L, Rutherford MJ, Myklebust TÅ, Møller B. Reference-adjusted and standardized all-cause and crude probabilities as an alternative to net survival in population-based cancer studies. International Journal of Epidemiology 2020;49:1614–23. https://doi.org/10.1093/ije/dyaa112
Rutherford MJ, Andersson TM-L, Myklebust TÅ, Møller B, Lambert PC. Non-parametric estimation of reference adjusted, standardised probabilities of all-cause death and death due to cancer for population group comparisons. BMC Medical Research Methodology 2022;22:2. https://doi.org/10.1186/s12874-021-01465-w