When data cannot cross borders

Comparing models fitted in different countries

This example is based on a presentation at the Association of Nordic Cancer Registries Conference 2024. The presentation can be found here

International collaborative research using cancer registry data enables exploration of differences in cancer incidence, mortality, and survival. However, there is increasing difficulty in moving data across borders, which means that international comparisons are becomiong more challenging. This is particularly the case when an analysis requires fitting survival models, as individual level data is usually required.

Separate analyses can be performed in each country, but a consequence of this is analyses are often restricted to being descriptive and more simplistic than if data could be combined.

For common cancer sites data is often large enough to fit separate models for each country without a great loss in precision.

The aim of this tutorial is to describe a framework where separate models are fitted in each country and the Stata model object (i.e. a .ster file) is shared. This file contains the model parameters and other details of the model, but crucially no individual level data.

The example is based on entirely simulated (symthetic) data where the interest is to compare survival between two countries, Country A and Country B.

This is cancer registry data where the preferred method of analysis is relative survival, so relative survival models will be fitted. However, the approach of fitting separate models is general and also relevant for more standard (and more complex) survival models.

The data is stored in two files CountryA.dta and CountryB.dta. These data files will be analysed separately to mimic situations where data could not be shared between different countries.

Kaplan-Meier plots

The following code loads each dataset in turn and produces Kaplan-Meier plots.

. use https://www.pclambert.net/data/CountryA, clear

. stset survtime, failure(dead=1) 

Survival-time data settings

         Failure event: dead==1
Observed time interval: (0, survtime]
     Exit on or before: failure

---------------------------------------------------------------------
> -----
     20,997  total observations
          0  exclusions
---------------------------------------------------------------------
> -----
     20,997  observations remaining, representing
      7,854  failures in single-record/single-failure data
 59,517.533  total analysis time at risk and under observation
                                                At risk from t =     
>     0
                                     Earliest observed entry t =     
>     0
                                          Last observed exit t =     
>     5

. sts graph, title(Country A) name(CountryA_km, replace)

        Failure _d: dead==1
  Analysis time _t: survtime

. summ agediag

    Variable |        Obs        Mean    Std. dev.       Min        M
> ax
-------------+-------------------------------------------------------
> --
     agediag |     20,997    74.78556    9.973784   18.12546   101.57
> 84

. 
. use https://www.pclambert.net/data/CountryB, clear

. stset survtime, failure(dead=1) 

Survival-time data settings

         Failure event: dead==1
Observed time interval: (0, survtime]
     Exit on or before: failure

---------------------------------------------------------------------
> -----
     10,498  total observations
          0  exclusions
---------------------------------------------------------------------
> -----
     10,498  observations remaining, representing
      4,379  failures in single-record/single-failure data
 28,253.044  total analysis time at risk and under observation
                                                At risk from t =     
>     0
                                     Earliest observed entry t =     
>     0
                                          Last observed exit t =     
>     5

. sts graph, title(Country B) name(CountryB_km, replace)

        Failure _d: dead==1
  Analysis time _t: survtime

. summ agediag

    Variable |        Obs        Mean    Std. dev.       Min        M
> ax
-------------+-------------------------------------------------------
> --
     agediag |     10,498    75.53747    10.05265   18.12765   102.46
> 85

. 
. graph combine CountryA_km CountryB_km

It can be seen that survival appears slightly worse in Country B. The individuals in Country B are slightly older on average.

Fitting a model to Country A

We will first fit a model to data from Country A. We are assuming we have access to data from Country B, but have a colleague who will fit the model in Country A and then send the model .ster file.

We first load and stset the data.

. use https://www.pclambert.net/data/CountryA, clear

. stset survtime, failure(dead=1)

Survival-time data settings

         Failure event: dead==1
Observed time interval: (0, survtime]
     Exit on or before: failure

---------------------------------------------------------------------
> -----
     20,997  total observations
          0  exclusions
---------------------------------------------------------------------
> -----
     20,997  observations remaining, representing
      7,854  failures in single-record/single-failure data
 59,517.533  total analysis time at risk and under observation
                                                At risk from t =     
>     0
                                     Earliest observed entry t =     
>     0
                                          Last observed exit t =     
>     5

As we are fitting relative survial model we need to merge in the expected rates at the event/censoring times.

. // merge in rates
. gen _age  = min(floor(agediag + _t),99)                  

. gen _year = year(diagdate + _t*365.25)

. merge m:1 _age _year sex using https://www.pclambert.net/data/popmo
> rtA, ///
>           keep(match master)   

    Result                      Number of obs
    -----------------------------------------
    Not matched                             0
    Matched                            20,997  (_merge==3)
    -----------------------------------------

We then fit a model that incorporates the expected mortality rates using then bhazard(rate) option. The model will include age at diagnosis (agediag) and sex (sex) as covariates. Natural splines will be used to model the effect of agediag with an interaction included with sex. In addition, both agediag and sex have time-dependent effects, i.e. we are allowing for non-proportional excess hazards.

. stpm3 i.sex##@ns(agediag,df(3)), scale(lncumhazard) df(5) ///
>       tvc(i.sex @ns(agediag,df(3) winsor(1 99))) dftvc(3) ///
>       bhazard(rate) 

Iteration 0:  Log likelihood = -18987.251  
Iteration 1:  Log likelihood = -18784.806  
Iteration 2:  Log likelihood = -18772.486  
Iteration 3:  Log likelihood = -18772.439  
Iteration 4:  Log likelihood = -18772.439  

                                                        Number of obs
>  = 20,997
                                                        Wald chi2(7) 
>  = 408.23
Log likelihood = -18772.439                             Prob > chi2  
>  = 0.0000

---------------------------------------------------------------------
> ---------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. 
> interval]
-------------+-------------------------------------------------------
> ---------
xb           |
       2.sex |  -1.074461   .2982228    -3.60   0.000    -1.658967   
> -.4899548
_ns_f1_age~1 |  -10.73306   2.229922    -4.81   0.000    -15.10362   
> -6.362488
_ns_f1_age~2 |   1.838314   1.074649     1.71   0.087    -.2679584   
>  3.944587
_ns_f1_age~3 |  -1.328113   .4536643    -2.93   0.003    -2.217278   
> -.4389471
             |
         sex#|
          c. |
_ns_f1_age~1 |
          2  |   7.430845    3.10376     2.39   0.017     1.347588   
>   13.5141
             |
         sex#|
          c. |
_ns_f1_age~2 |
          2  |  -2.276144     1.5657    -1.45   0.146     -5.34486   
>  .7925727
             |
         sex#|
          c. |
_ns_f1_age~3 |
          2  |   2.137709   .5439365     3.93   0.000     1.071613   
>  3.203805
-------------+-------------------------------------------------------
> ---------
time         |
        _ns1 |  -33.94606   3.820788    -8.88   0.000    -41.43466   
> -26.45745
        _ns2 |   13.16497   1.857333     7.09   0.000     9.524664   
>  16.80527
        _ns3 |  -.7336646   .1254783    -5.85   0.000    -.9795976   
> -.4877316
        _ns4 |  -.4517587   .1236179    -3.65   0.000    -.6940453   
> -.2094721
        _ns5 |  -.3260899   .1155413    -2.82   0.005    -.5525466   
> -.0996332
             |
         sex#|
  c._ns_tvc1 |
          2  |    4.06715   2.178313     1.87   0.062    -.2022648   
>  8.336565
             |
         sex#|
  c._ns_tvc2 |
          2  |  -1.092972   1.153449    -0.95   0.343    -3.353691   
>  1.167748
             |
         sex#|
  c._ns_tvc3 |
          2  |   .3388006   .0639397     5.30   0.000     .2134811   
>  .4641201
             |
          c. |
_ns_f2_age~1#|
  c._ns_tvc1 |  -10.87524   36.43505    -0.30   0.765    -82.28662   
>  60.53614
             |
          c. |
_ns_f2_age~1#|
  c._ns_tvc2 |  -2.683456   19.10172    -0.14   0.888    -40.12214   
>  34.75523
             |
          c. |
_ns_f2_age~1#|
  c._ns_tvc3 |  -.0657363   .8404523    -0.08   0.938    -1.712993   
>   1.58152
             |
          c. |
_ns_f2_age~2#|
  c._ns_tvc1 |      5.078   20.44475     0.25   0.804    -34.99297   
>  45.14897
             |
          c. |
_ns_f2_age~2#|
  c._ns_tvc2 |  -.6415725   10.72052    -0.06   0.952     -21.6534   
>  20.37026
             |
          c. |
_ns_f2_age~2#|
  c._ns_tvc3 |  -.6496364   .3865693    -1.68   0.093    -1.407298   
>  .1080254
             |
          c. |
_ns_f2_age~3#|
  c._ns_tvc1 |   2.506379   7.991825     0.31   0.754    -13.15731   
>  18.17007
             |
          c. |
_ns_f2_age~3#|
  c._ns_tvc2 |   -4.63819    4.24192    -1.09   0.274     -12.9522   
>   3.67582
             |
          c. |
_ns_f2_age~3#|
  c._ns_tvc3 |  -.2157422   .3409415    -0.63   0.527    -.8839752   
>  .4524909
             |
       _cons |    .233223   .2607551     0.89   0.371    -.2778476   
>  .7442937
---------------------------------------------------------------------
> ---------
Extended functions
 (1) @ns(agediag, df(3))
 (2) @ns(agediag, df(3) winsor(1 99))

Now the model has been fitted, we can use estimates save to save the model estimates.

. estimates save CountryA, replace
file CountryA.ster saved

The saves the file CountryA.ster. It is useful to understand what is stored in this file. Essentially, it is the information you see when you type ereturn list, the output of which is shown below.

. ereturn list

scalars:
               e(rank) =  25
                  e(N) =  20997
                 e(ic) =  4
                  e(k) =  29
               e(k_eq) =  2
               e(k_dv) =  0
          e(converged) =  1
                 e(rc) =  0
                 e(ll) =  -18772.43865176562
         e(k_eq_model) =  1
               e(df_m) =  7
               e(chi2) =  408.2254068442145
                  e(p) =  4.10585674052e-84
                e(dev) =  37544.87730353124
                e(AIC) =  37594.87730353124
                e(BIC) =  37769.09675940969

macros:
            e(cmdline) : "i.sex##@ns(agediag,df(3)), scale(lncumh.."
                e(cmd) : "stpm3"
            e(predict) : "stpm3_pred"
           e(ef_Nuser) : "0"
  e(ef_agediag_winsor
    2)                 : "46.43008277735672 93.75740895924326"
  e(ef_agediag_knots2) : "46.43008277735672 71.40992416203093 79..."
   e(ef_agediag_opts2) : "df(3) winsor(1 99)"
     e(ef_agediag_fn2) : "ns"
  e(ef_agediag_knots1) : "18.12545708589547 71.40992416203093 79..."
   e(ef_agediag_opts1) : "df(3)"
     e(ef_agediag_fn1) : "ns"
     e(ef_agediag_Nfn) : "2"
            e(bhazard) : "rate"
         e(ef_varlist) : "agediag"
         e(model_vars) : "sex agediag"
              e(scale) : "lncumhazard"
             e(degree) : "3"
             e(ttrans) : "lnt"
         e(splinetype) : "ns"
          e(knots_tvc) : "-6.536461339131925 -.4699117670703595 ..."
            e(tvcvars) : "_ns_f2_agediag1 _ns_f2_agediag2 _ns_f2_.."
    e(sharedtvc_knots) : "1"
             e(dfbase) : "5"
              e(knots) : "-6.536461339131925 -1.053905396170044 -.."
        e(dsplinevars) : "_dns1 _dns2 _dns3 _dns4 _dns5 2.sex#c._.."
     e(splinevars_tvc) : "_ns_tvc1 _ns_tvc2 _ns_tvc3"
     e(splinelist_tvc) : "2.sex#c._ns_tvc1 2.sex#c._ns_tvc2 2.sex.."
         e(splinelist) : "_ns1 _ns2 _ns3 _ns4 _ns5"
       e(tvc_included) : "2.sex _ns_f2_agediag1 _ns_f2_agediag2 _.."
       e(tvc_original) : "i.sex @ns(agediag,df(3) winsor(1 99))"
                e(tvc) : "i.sex c.(_ns_f2_agediag1 _ns_f2_agediag.."
   e(varlist_original) : "i.sex##@ns(agediag,df(3))"
            e(varlist) : "i.sex##c.(_ns_f1_agediag1 _ns_f1_agedia.."
           e(chi2type) : "Wald"
  e(deriv_useminbound) : "off"
                e(opt) : "moptimize"
                e(vce) : "oim"
               e(user) : "stpm3_gf2_lncumhazard_rs()"
          e(ml_method) : "gf2"
          e(technique) : "nr"
              e(which) : "max"
         e(properties) : "b V"

matrices:
                  e(b) :  1 x 29
                  e(V) :  29 x 29
                e(Cns) :  4 x 30
               e(ilog) :  1 x 20
           e(gradient) :  1 x 29

functions:
             e(sample)

It contains summary information such as the log-likelihood (e(ll)), the knot locations for the baseline (e(knots)), the details of the natural spline for agediag including the location of the knots (e(ef_agediag_knots1)).

The parameter estimates are stored in e(b) and the variance matrix in e(V). The parameter estimates are listed below,

. matrix list e(b)

e(b)[1,29]
               xb:            xb:            xb:            xb:
               1b.             2.                              
              sex            sex   _ns_f1_age~1   _ns_f1_age~2
y1              0     -1.0744608     -10.733055      1.8383143

               xb:            xb:            xb:            xb:
                          1b.sex#         2.sex#        1b.sex#
     _ns_f1_age~3   co._ns_f1_~1   c._ns_f1_a~1   co._ns_f1_~2
y1     -1.3281127              0      7.4308452              0

               xb:            xb:            xb:          time:
            2.sex#        1b.sex#         2.sex#               
     c._ns_f1_a~2   co._ns_f1_~3   c._ns_f1_a~3           _ns1
y1     -2.2761439              0      2.1377088     -33.946056

             time:          time:          time:          time:
                                                               
             _ns2           _ns3           _ns4           _ns5
y1      13.164969     -.73366459     -.45175869      -.3260899

             time:          time:          time:          time:
            2.sex#         2.sex#         2.sex#  c._ns_f2_a~1#
       c._ns_tvc1     c._ns_tvc2     c._ns_tvc3     c._ns_tvc1
y1      4.0671502     -1.0929718      .33880061     -10.875241

             time:          time:          time:          time:
     c._ns_f2_a~1#  c._ns_f2_a~1#  c._ns_f2_a~2#  c._ns_f2_a~2#
       c._ns_tvc2     c._ns_tvc3     c._ns_tvc1     c._ns_tvc2
y1     -2.6834564     -.06573629          5.078     -.64157254

             time:          time:          time:          time:
     c._ns_f2_a~2#  c._ns_f2_a~3#  c._ns_f2_a~3#  c._ns_f2_a~3#
       c._ns_tvc3     c._ns_tvc1     c._ns_tvc2     c._ns_tvc3
y1      -.6496364      2.5063795     -4.6381903     -.21574216

             time:
                  
            _cons
y1      .23322305

Essentially the information shown in the output of estimates list and saved in the CountryA.ster file contains all the information we need to make predictions from the model. We are able to predict survival and other measures for any covariate pattern for any point in time from diagnosis. Importantly, we do not need the original data to do this. It is important to stress again, that no individual level data is stored in the CountryA.ster file.

Fitting a model to Country B

We are assuming that we have access to the data in Country B, so we can load the data and fit the model. We use the same model as before for simplicity and consistency, but we could have fitted a model with different numbers/locations of knots or more/less interactions.

. use https://www.pclambert.net/data/CountryB, clear

. stset survtime, failure(dead=1)

Survival-time data settings

         Failure event: dead==1
Observed time interval: (0, survtime]
     Exit on or before: failure

---------------------------------------------------------------------
> -----
     10,498  total observations
          0  exclusions
---------------------------------------------------------------------
> -----
     10,498  observations remaining, representing
      4,379  failures in single-record/single-failure data
 28,253.044  total analysis time at risk and under observation
                                                At risk from t =     
>     0
                                     Earliest observed entry t =     
>     0
                                          Last observed exit t =     
>     5

. 
. // merge in rates
. gen _age  = min(floor(agediag + _t),99)                  

. gen _year = year(diagdate + _t*365.25)

. merge m:1 _age _year sex using https://www.pclambert.net/data/popmo
> rtA, ///
>           keep(match master)   

    Result                      Number of obs
    -----------------------------------------
    Not matched                             0
    Matched                            10,498  (_merge==3)
    -----------------------------------------

. 
. stpm3 i.sex##@ns(agediag,df(3)), scale(lncumhazard) df(5) ///
>       tvc(i.sex @ns(agediag,df(3) winsor(1 99))) dftvc(3) ///
>       bhazard(rate) 

Iteration 0:  Log likelihood = -9934.4206  
Iteration 1:  Log likelihood =  -9869.628  
Iteration 2:  Log likelihood = -9844.5096  
Iteration 3:  Log likelihood =  -9844.268  
Iteration 4:  Log likelihood = -9844.2677  

                                                        Number of obs
>  = 10,498
                                                        Wald chi2(7) 
>  = 281.55
Log likelihood = -9844.2677                             Prob > chi2  
>  = 0.0000

---------------------------------------------------------------------
> ---------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. 
> interval]
-------------+-------------------------------------------------------
> ---------
xb           |
       2.sex |  -.5367956   .3572884    -1.50   0.133    -1.237068   
>  .1634767
_ns_f1_age~1 |  -17.11178   3.507369    -4.88   0.000    -23.98609   
> -10.23746
_ns_f1_age~2 |    4.56246   1.740491     2.62   0.009     1.151161   
>  7.973759
_ns_f1_age~3 |  -2.074449   .5580852    -3.72   0.000    -3.168276   
> -.9806221
             |
         sex#|
          c. |
_ns_f1_age~1 |
          2  |   10.21369   4.610953     2.22   0.027     1.176392   
>    19.251
             |
         sex#|
          c. |
_ns_f1_age~2 |
          2  |  -4.779487   2.362219    -2.02   0.043     -9.40935   
> -.1496237
             |
         sex#|
          c. |
_ns_f1_age~3 |
          2  |   1.155039   .6679121     1.73   0.084     -.154045   
>  2.464122
-------------+-------------------------------------------------------
> ---------
time         |
        _ns1 |  -26.61548   3.729184    -7.14   0.000    -33.92454   
> -19.30641
        _ns2 |   9.032865   1.781381     5.07   0.000     5.541423   
>  12.52431
        _ns3 |  -1.155737   .1753568    -6.59   0.000     -1.49943   
> -.8120444
        _ns4 |  -.7050778   .1712158    -4.12   0.000    -1.040655   
> -.3695009
        _ns5 |  -.5002046   .1657581    -3.02   0.003    -.8250844   
> -.1753247
             |
         sex#|
  c._ns_tvc1 |
          2  |   2.543556   1.996199     1.27   0.203    -1.368922   
>  6.456034
             |
         sex#|
  c._ns_tvc2 |
          2  |  -.4942632   1.061708    -0.47   0.642    -2.575173   
>  1.586647
             |
         sex#|
  c._ns_tvc3 |
          2  |   .3206083   .0887039     3.61   0.000     .1467518   
>  .4944648
             |
          c. |
_ns_f2_age~1#|
  c._ns_tvc1 |  -.3671341   29.36376    -0.01   0.990    -57.91905   
>  57.18478
             |
          c. |
_ns_f2_age~1#|
  c._ns_tvc2 |  -2.516029   15.62894    -0.16   0.872    -33.14818   
>  28.11612
             |
          c. |
_ns_f2_age~1#|
  c._ns_tvc3 |   1.186299   1.126176     1.05   0.292    -1.020966   
>  3.393564
             |
          c. |
_ns_f2_age~2#|
  c._ns_tvc1 |   3.057383   16.23862     0.19   0.851    -28.76972   
>  34.88449
             |
          c. |
_ns_f2_age~2#|
  c._ns_tvc2 |  -1.467367   8.623812    -0.17   0.865    -18.36973   
>  15.43499
             |
          c. |
_ns_f2_age~2#|
  c._ns_tvc3 |  -1.063785   .4957864    -2.15   0.032    -2.035508   
> -.0920615
             |
          c. |
_ns_f2_age~3#|
  c._ns_tvc1 |  -2.472087   7.672579    -0.32   0.747    -17.51006   
>  12.56589
             |
          c. |
_ns_f2_age~3#|
  c._ns_tvc2 |   .5907294   4.114411     0.14   0.886    -7.473367   
>  8.654826
             |
          c. |
_ns_f2_age~3#|
  c._ns_tvc3 |   .1797678   .4757156     0.38   0.706    -.7526177   
>  1.112153
             |
       _cons |   .9081922    .317337     2.86   0.004     .2862231   
>  1.530161
---------------------------------------------------------------------
> ---------
Extended functions
 (1) @ns(agediag, df(3))
 (2) @ns(agediag, df(3) winsor(1 99))

We save the model estimates as before,

. estimates save CountryB, replace
file CountryB.ster saved

Obtaining marginal (standardized) relative survival

Now that we have the two .ster files. We can use them to obtain summary measures that will allow us to compare survival between Country A and Country B.

An overall summary measure is standardized marginal relative survival. As we assuming we are in Counrty B we will use this data to define the covariate (age/sex) distribution that we want to standardize to.

We load the data we have access to, i.e. Country B.

. use https://www.pclambert.net/data/CountryB, clear

. stset survtime, failure(dead=1)

Survival-time data settings

         Failure event: dead==1
Observed time interval: (0, survtime]
     Exit on or before: failure

---------------------------------------------------------------------
> -----
     10,498  total observations
          0  exclusions
---------------------------------------------------------------------
> -----
     10,498  observations remaining, representing
      4,379  failures in single-record/single-failure data
 28,253.044  total analysis time at risk and under observation
                                                At risk from t =     
>     0
                                     Earliest observed entry t =     
>     0
                                          Last observed exit t =     
>     5

We then load the model estimates for the model fitted to this data using estimates use.

. estimate use CountryB

We can then use standsurv to obtain the estimated marginal relative survival function.

. range tt 0 5 101
(10,397 missing values generated)

. standsurv RS_B, surv timevar(tt) ci frame(RS, replace)
!!WARNING!! You are including observations not included in the model
when calculating standardized estimates.
Ensure this is what you intended.

We get a warning that we are averaging over observations not included in the model. This is not actually true as we fitted the model to Country B and have reloaded the data for Country B. However, when you use estimate use the e(sample) is set to zero, which means that Stata has not linked the data in memory to the data that was used to fit the model. See the Stata help file for why this is. We will get the same warning when we use standsurv for a model that was not fitted to the data in memory.

After running standsurv we have the is the estimated internally age/sex standardized relative survival for Country B.

When we standardize we want to standardize over the same covariate pattern so that comparisons between the population groups we are comparing are fair. Therefore to obtain standardized estimates for Country A, we keep the same data in memory, (i.e. the data for Country B), but the predictions are based on the model for Country A. To do this we load the model estimates for Country A from CountryA.ster

. estimate use CountryA

We can then run standsurv as before, but we now use frame(RS, merge) so we store the predictions in the same frame as the predictions for Country B.

. standsurv RS_A, surv frame(RS, merge) ci
!!WARNING!! You are including observations not included in the model
when calculating standardized estimates.
Ensure this is what you intended.

We can now plot the marginal relative survival.

. frame RS {
.   twoway (rarea RS_A_lci RS_A_uci tt, color(%30))                //
> /
>          (line RS_A tt, pstyle(p1line))                          //
> / 
>          (rarea RS_B_lci RS_B_uci tt, pstyle(p2line) color(%30)) //
> /
>          (line RS_B tt, pstyle(p2line)),                         //
> /
>          xtitle("Years from diagnosis")                          //
> /
>          ytitle("Marginal Relative Survival")                    //
> /
>          ylabel(,format(%3.1f))                                  //
> /
>          legend(order(2 "Country A" 4 "Country B") pos(1))       //
> /
>          name(RS, replace)
. }

Note that as we have standardized to the age/sex distribution of Country B, the relative survival function for country B is a model based estimate of the ‘observed’ marginal relative survival in Country B. The estimated relative survival function for Country A is an estimate of the marginal relative survival in Country A if it had the age/sex distribution of Country B.

Relative survival as a function of age at diagnosis

We can now do a more complicated prediction. The code below obtains the relative survival at 5 years as a function of age at diagnosis. We show the prediction for males (sex==1), but it would be simple to extend the code to obtain predictions for females.

. frame create ageplot

. frame ageplot {
.   range agediag 50 80 31
Number of observations (_N) was 0, now 31.
.   gen sex = 1
.   estimates use CountryA
.   gen t5 = 5
.   predict RS5_A, surv at1(.) timevar(t5) ci merge
.   
.   estimates use CountryB
.   predict RS5_B, surv at1(.) timevar(t5) ci merge
. }

Note the use of at1(.) which will predict at the observed values of covariates in the active frame (which we have have calculated). In addition, we use the merge option as we want the predictions to be returned to the current frame (ageplot) rather than a new frame.

Having generated the predictions, we can plot the results.

. frame ageplot {
.   twoway (rarea RS5_A_lci RS5_A_uci agediag, color(%30))           
>      ///
>          (line RS5_A agediag, pstyle(p1line))                      
>      ///
>          (rarea RS5_B_lci RS5_B_uci agediag, color(%30) pstyle(p2li
> ne)) ///
>          (line RS5_B agediag, pstyle(p2line)),                     
>      ///
>          xtitle("Years from diagnosis")                            
>      ///
>          ytitle("Marginal 5-year Relative Survival")               
>      ///
>          ylabel(,format(%3.1f))                                    
>      ///
>          legend(order(2 "Country A" 4 "Country B") pos(1))         
>      ///
>          name(RS_age, replace)
. }

All cause survival

Relative survival is awkward to interpret in that it attempts to estimate net survival, which is survival in the hypothetical situation where it is not possible to die from any cause other than the cancer under study.

However, we can transform predictions to obtain all cause survival ( $S_{i} (t)$ ) by incorporating the expected mortality survival, $S_{i}^{*} (t)$ , and combining with relative survival, $R_{i} (t)$ .

${\hat{S}}_{i} (t) = S_{i}^{*} (t) {\hat{R}}_{i} (t)$

To do this we use the expsurv() option of the standsurv command.

First the predictions for Country A

. estimates use CountryA

. standsurv S_A, surv timevar(tt) ci frame(AC, replace)              
>       ///
>                   expsurv(datediag(diagdate)                       
>       ///
>                           agediag(agediag)                         
>       ///
>                           using(https://www.pclambert.net/data/popm
> ortA) ///
>                           pmrate(rate)                             
>       ///
>                           pmyear(_year)                            
>       ///
>                           pmage(_age)                              
>       ///
>                           pmother(sex)                             
>       ///
>                           pmmaxyear(2020))  
!!WARNING!! You are including observations not included in the model
when calculating standardized estimates.
Ensure this is what you intended.

.. and then for Country B.

. estimates use CountryB

. standsurv S_B, surv ci frame(AC, merge)                            
>       ///
>                   expsurv(datediag(diagdate)                       
>       ///
>                           agediag(agediag)                         
>       ///
>                           using(https://www.pclambert.net/data/popm
> ortB) ///
>                           pmrate(rate)                             
>       ///
>                           pmyear(_year)                            
>       ///
>                           pmage(_age)                              
>       ///
>                           pmother(sex)                             
>       ///
>                           pmmaxyear(2020))  
!!WARNING!! You are including observations not included in the model
when calculating standardized estimates.
Ensure this is what you intended.

We can now plot the model based estimate of all-cause survival.

. frame AC {
.   twoway (rarea S_A_lci S_A_uci tt, color(%30))                  //
> /
>          (line  S_A tt, pstyle(p1line))                          //
> / 
>          (rarea S_B_lci S_B_uci tt, pstyle(p2line) color(%30))   //
> /
>          (line  S_B tt, pstyle(p2line)),                         //
> /
>          xtitle("Years from diagnosis")                          //
> /
>          ytitle("Marginal All cause Survival")                   //
> /
>          ylabel(,format(%3.1f))                                  //
> /
>          legend(order(2 "Country A" 4 "Country B") pos(1))       //
> /
>          name(RS, replace)
. }

There is a clear difference in all cause survival between Country A and Country B. However, although we have standardized to the same age/sex distribution, differences could be due to differential cancer (excess) mortality rates and/or differential other cause (expected) mortality rates. This is one reason why relative survival has been the preferred metric when quantifying cancer survival. We have developed reference adjusted measures that enables the reporting of all-cause survival in a way where the only differences should be due to differential cancer mortality rates as shown in the next section.

Reference Adjustment

With reference adjustment we define a reference expected survival, $S_{i}^{* *}$ that is the same between any groups we are comparing. We can then obtain reference adjusted all cause survival as follows, ${\hat{S}}_{i}^{R E F} (t)$ .

${\hat{S}}_{i}^{R E F} (t) = S_{i}^{* *} (t) {\hat{R}}_{i} (t)$

This is for a particular individual, $i$ , but we can average of many individuals to obtain marginal (standardized) estimates using standsurv. Here will use the expected mortality rates for Country B as the reference. This means that the reference adjusted marginal estimates of all cause survival for Country B will be an estimate of the observed all cause survival in Country B. For Country A it will be an estimate of the marginal all cause survival in Country A if it had the expected mortality rates of Country B and the age/sex distribution of Country B. Note that the appropriate expected mortality rates for country A were used when fitting the model, but the reference rates are used when making predictions from the model.

The reason for doing this is that any differences we see in the marginal all cause survival will only be due to differential relative survival between the two countries, see Lambert et al. 2020.

We reload the estimates for the model fitted to Country B and then run standsurv

. estimates use CountryB

. standsurv Sref_B, surv timevar(tt) ci frame(RefAdj, replace)       
>       ///
>                   expsurv(datediag(diagdate)                       
>       ///
>                           agediag(agediag)                         
>       ///
>                           using(https://www.pclambert.net/data/popm
> ortB) ///
>                           pmrate(rate)                             
>       ///
>                           pmyear(_year)                            
>       ///
>                           pmage(_age)                              
>       ///
>                           pmother(sex)                             
>       ///
>                           pmmaxyear(2020))  
!!WARNING!! You are including observations not included in the model
when calculating standardized estimates.
Ensure this is what you intended.

We then reload the estimates for the model fitted to Country A and run standsurv again.

. estimates use CountryA

. standsurv Sref_A, surv ci frame(RefAdj, merge)                     
>       ///
>                   expsurv(datediag(diagdate)                       
>       ///
>                           agediag(agediag)                         
>       ///
>                           using(https://www.pclambert.net/data/popm
> ortB) ///
>                           pmrate(rate)                             
>       ///
>                           pmyear(_year)                            
>       ///
>                           pmage(_age)                              
>       ///
>                           pmother(sex)                             
>       ///
>                           pmmaxyear(2020))  
!!WARNING!! You are including observations not included in the model
when calculating standardized estimates.
Ensure this is what you intended.

And finally plot the results.

. frame RefAdj {
.   twoway (rarea Sref_A_lci Sref_A_uci tt, color(%30))              
>   ///
>          (line  Sref_A tt, pstyle(p1line))                         
>   /// 
>          (rarea Sref_B_lci Sref_B_uci tt, pstyle(p2line) color(%30)
> ) ///
>          (line  Sref_B tt, pstyle(p2line)),                        
>   ///
>          xtitle("Years from diagnosis")                            
>   ///
>          ytitle("Reference Adjusted All cause Survival")           
>   ///
>          ylabel(,format(%3.1f))                                    
>   ///
>          legend(order(2 "Country A" 4 "Country B") pos(1))         
>   ///
>          name(RS, replace)
. }

As we have standardized over the age/sex distribution of Country B and used the expected mortality rates for Country B, this figure gives an estimate of the average all cause survival for Country B (it will be very similar the all cause Kaplan-Meier estimate).

However, for Country A the estimate is hypothetical in two ways. Firstly, it is standardized over the age/sex distribution for Country B. Secondly, the reference expected mortality rates are for Country B, so it gives an estimate of what the all cause survival would be in Country A if it had the expected mortality rates seen in Country B and the age/sex distribution of Country B. We should remember that the reason for making the estimates Country A hypothetical are for reasons of comparability and we want to isolate differences that are only due to differences in cancer (excess) mortality rates.

Reference Adjustment with external age standardization

When it is of interest to make comparisons with other studies it is important to standardize to the same age/sex distribution. Many population-based cancer survival studies use the International Cancer Survival Standard as an external age distribution to standardise to.

The following code calcuates the ICSS age groups and associated weights.

. //Define ICSS age groups and weights
. recode agediag (min/44.999=1) (45/54.999=2) (55/64.999=3) (65/74.99
> 9=4) (75/max=5), gen(ICSSagegrp)
(10,498 differences between agediag and ICSSagegrp)

. recode ICSSagegrp (1=0.07) (2=0.12) (3=0.23) (4=0.29) (5=0.29), gen
> (ICSSwt)
(10,498 differences between ICSSagegrp and ICSSwt)

Next we calculate the proportion in our data (Country B) in each age group for each sex. We then calculate the ratio of the ICSS weights and the observed proportion in each age group, so that we can upweight or downweight individuals in each age group.

. bysort sex: gen sextotal= _N

. bysort ICSSagegrp sex:gen a_age = _N/sextotal

. gen double wt_age = ICSSwt/a_age

. 
. bysort sex ICSSagegrp: gen first = _n==1

. list sex ICSSagegrp ICSSwt a_age wt_age if first, noobs ab(12)

  +---------------------------------------------------+
  | sex   ICSSagegrp   ICSSwt       a_age      wt_age |
  |---------------------------------------------------|
  |   1            1      .07   .00617443   11.337083 |
  |   1            2      .12   .02405454   4.9886631 |
  |   1            3      .23    .1039362    2.212896 |
  |   1            4      .29   .32518652   .89179589 |
  |   1            5      .29   .54064831   .53639305 |
  |---------------------------------------------------|
  |   2            1      .07   .00844347   8.2904348 |
  |   2            2      .12    .0256975   4.6697143 |
  |   2            3      .23   .10352423   2.2217021 |
  |   2            4      .29   .30029369   .96572127 |
  |   2            5      .29   .56204112   .51597649 |
  +---------------------------------------------------+

. sort tt

Next we can use standsurv’ to obtain marginal estimates. Here we calculate marginal relative survival, but we could do this for reference adjusted measures as well.

We do this first for Country B,

. estimates use CountryB

. standsurv RS_B if sex==1, surv timevar(tt) ci frame(RS_ICSS, replac
> e) ///
>                               indweights(wt_age)
!!WARNING!! You are including observations not included in the model
when calculating standardized estimates.
Ensure this is what you intended.

and then for Country A.

. estimates use CountryA

. standsurv RS_A if sex==1, surv ci frame(RS_ICSS, merge)   ///
>                               indweights(wt_age)  
!!WARNING!! You are including observations not included in the model
when calculating standardized estimates.
Ensure this is what you intended.

Finally, we can plot the results.

. frame RS_ICSS {
.   twoway (rarea RS_A_lci RS_A_uci tt, color(%30))                //
> /
>          (line RS_A tt, pstyle(p1line))                          //
> / 
>          (rarea RS_B_lci RS_B_uci tt, pstyle(p2line) color(%30)) //
> /
>          (line RS_B tt, pstyle(p2line)),                         //
> /
>          xtitle("Years from diagnosis")                          //
> /
>          ytitle("Marginal Relative Survival")                    //
> /
>          ylabel(,format(%3.1f))                                  //
> /
>          legend(order(2 "Country A" 4 "Country B") pos(1))       //
> /
>          name(RS_ICSS, replace)
. }

Standardizing to the age/sex distribution of Country A

We have been assuming that we are in, and have access to data from Country B. What if we want to standardize to the age/sex distribution of Country A? We could send the Country B.ster file to a colleague in Country A and they could repeat the analysis above, but now the standardization would be over the age/sex distribution in Country A (i.e. the data in memory).

One other option would be to obtain a summary of the age/sex distribution in Country A and use this to create a data set to standardize over. We show two ways of standardizing once we have this information.

First we will assume that a colleague in Country A has produced a summary of the proportion in each age/sex combination, perhaps using code like the following.

. use https://www.pclambert.net/data/CountryA, clear

. gen ageint = floor(agediag)

. collapse (percent) cellpercent=agediag, by(ageint sex)

. list in 1/20, noobs ab(11)   

  +----------------------------+
  | sex   ageint   cellpercent |
  |----------------------------|
  |   1       18     .00476259 |
  |   2       18     .00476259 |
  |   1       19     .01428776 |
  |   2       19     .00476259 |
  |   1       20     .00476259 |
  |----------------------------|
  |   1       21     .01428776 |
  |   2       21     .00476259 |
  |   1       23     .00952517 |
  |   2       23     .00476259 |
  |   1       25     .00952517 |
  |----------------------------|
  |   1       26     .00476259 |
  |   1       27     .00952517 |
  |   2       27     .00476259 |
  |   1       28     .00476259 |
  |   1       29     .01428776 |
  |----------------------------|
  |   1       30     .02381293 |
  |   1       31     .02857551 |
  |   2       31     .00476259 |
  |   1       32     .00476259 |
  |   2       32     .01428776 |
  +----------------------------+

. rename ageint agediag

. save CountryA_age_sex_summary, replace
file CountryA_age_sex_summary.dta saved

This given us the proportion in each age/sex comination in Country A

We can then calculate the weights.

. use CountryA_age_sex_summary, clear

. gen wt = (cellpercent/100)*_N

These weights need to sum to the total number of observations in the data, which is why the percentage is converted to a proportion and then multiplied by _N.

We can then use standsurv for each of the the model estimates passing the weights using the indweights() option. The code below shows for estimation of standardized relative survival, but this can easily be extended in a similar way to above for reference adjustment etc.

. estimates use CountryA

. range tt 0 10 101
(52 missing values generated)

. standsurv RS_A, surv timevar(tt) ci indweights(wt) ///
>                 frame(RS_standA1, replace)
!!WARNING!! You are including observations not included in the model
when calculating standardized estimates.
Ensure this is what you intended.

. estimates use CountryB

. standsurv RS_B, surv ci indweights(wt) ///
>                 frame(RS_standA1, merge)
!!WARNING!! You are including observations not included in the model
when calculating standardized estimates.
Ensure this is what you intended.

This can be plotted.

. frame RS_standA1 {
.   twoway (rarea RS_A_lci RS_A_uci tt, color(%30))                //
> /
>          (line RS_A tt, pstyle(p1line))                          //
> / 
>          (rarea RS_B_lci RS_B_uci tt, pstyle(p2line) color(%30)) //
> /
>          (line RS_B tt, pstyle(p2line)),                         //
> /
>          xtitle("Years from diagnosis")                          //
> /
>          ytitle("Marginal Relative Survival")                    //
> /
>          ylabel(,format(%3.1f))                                  //
> /
>          legend(order(2 "Country A" 4 "Country B") pos(1))       //
> /
>          name(RS, replace)
. }

An alternative way to get the same result is to expand the data to match the percentage for each cell. Here we make the sample size 20000.

. use CountryA_age_sex_summary, clear

. gen freq = round(20000*cellpercent/100)

. expand freq
(19,853 observations created)

. estimates use CountryA

. range tt 0 10 101
(19,905 missing values generated)

. standsurv RS_A, surv timevar(tt) ci ///
>                 frame(RS_standA2, replace)
!!WARNING!! You are including observations not included in the model
when calculating standardized estimates.
Ensure this is what you intended.

. estimates use CountryB

. standsurv RS_B, surv ci             ///
>                 frame(RS_standA2, merge)
!!WARNING!! You are including observations not included in the model
when calculating standardized estimates.
Ensure this is what you intended.

This avoids having to use the indweights() option.

. frame RS_standA2 {
.   twoway (rarea RS_A_lci RS_A_uci tt, color(%30))                //
> /
>          (line RS_A tt, pstyle(p1line))                          //
> / 
>          (rarea RS_B_lci RS_B_uci tt, pstyle(p2line) color(%30)) //
> /
>          (line RS_B tt, pstyle(p2line)),                         //
> /
>          xtitle("Years from diagnosis")                          //
> /
>          ytitle("Marginal Relative Survival")                    //
> /
>          ylabel(,format(%3.1f))                                  //
> /
>          legend(order(2 "Country A" 4 "Country B") pos(1))       //
> /
>          name(RS, replace)
. }

References

Lambert PC, Andersson TM-L, Rutherford MJ, Myklebust TÅ, Møller B. Reference-adjusted and standardized all-cause and crude probabilities as an alternative to net survival in population-based cancer studies. International Journal of Epidemiology 2020;49:1614–23. https://doi.org/10.1093/ije/dyaa112

Rutherford MJ, Andersson TM-L, Myklebust TÅ, Møller B, Lambert PC. Non-parametric estimation of reference adjusted, standardised probabilities of all-cause death and death due to cancer for population group comparisons. BMC Medical Research Methodology 2022;22:2. https://doi.org/10.1186/s12874-021-01465-w