Numerical Integration for log hazard models

Models on the log hazard scale

In stpm3 it is possible to fit models on the log hazard scale (scale(lnhazard)) as well as the log cumulative hazard scale (scale(lncumhazard)). Such models could not be fitted using stpm2.

Other Stata software that could be used fit models on the log hazard scale include strcs, stgenreg and merlin. One advantage of fitting the models in stpm3 is that the time take to fit the models is much quicker, particularly when combined with the python option (LINK).

Models on the log hazard scale require numerical integration to be fitted. The is because the transformation from log hazard to cumulative hazard is not analytically tractable. The numerical integration used by stpm3 is more accurate for many types of models than that used in other software (strcs, merlin) through using a different type of numerical quadrature.

The log-likelihood for these models is as follows,

$$d_i \log\left[h(t_i|X_i)\right]- \int_{t_{0i}}^{t_i} h(u|X_i) du$$

As it is not possible to obtain the integral analytically we use numerical integration.

Gaussian Quadrature

Gaussian quadrature is a method to perform numerical integration that converts an integral into a weighted sum over a set of predefined points known as nodes.

$$\int_{-1}^{-1} f(x) dx \approx \sum_{k=1}^K w_k f(x_k)$$

Thus we just need to evaluate the integrand, $f(x)$ at the $K$ nodes, $x_1 \ldots x_K$, combine with the $K$ weights, $w_1 \ldots w_K$ and then sum. The larger the value of $K$ the more accurate the approximation of the integral will be.

There are different ways to calculate the nodes and weights. The Wikipedia page on Guassian Quadrature gives a good overview. [LINK]

The limits of the above integral are [-1,1], but can be transformed to any interval using a change of variable rule. Thus to integrate $h(t|X_i)$ over $[t_{0i},t_i]$, we first generate nodes for a given $K$, for limits [-1,1], and than transform the nodes so we integrate over the limits $[t_{0i},t_i]$ using

$$x_k' = \frac{t_i-t_{0i}}{2}x_k + \frac{t_i+t_{0i}}{2}$$

and then obtain the approximation to the integral using,

$$\int_{t_{0i}}^{t_i} h(u|X_i) du \approx \frac{t_i-t_{0i}}{2}\sum_{k=1}^K w_k h(x_k'|X_i) $$

Note that this integral need to be evaluated for all individuals in the model and the limits change between individuals. When maximizing the log-likelihood function these integrals will be need to be calculate multiple times until the model converges. For complex/awkward functions many nodes may be required to obtain a desired level of accuracy.

Gauss Legendre Quadrature

The most common way to perform the integration is Gauss Legendre Quadrature. Below is an example of using gaussian quadrature to integrate $\int_0^2 \sqrt x$, using Gauss Legendre quadrature with 7 nodes. The integrals is evaluated at 4 upper limits (0.5, 1, 1.5 and 2).

. mata:
------------------------------------------------- mata (type end to exit) --------------------------------------------------------
: // function to integrate
: function fx(real matrix x) return(sqrt(x))

: // generate Gauss Legendre nodes and weights
: stpm3_quad_gl(7, weights=.,nodes=.)

: weights, nodes
                  1              2
    +-------------------------------+
  1 |   .1294849662   -.9491079123  |
  2 |   .2797053915   -.7415311856  |
  3 |   .3818300505   -.4058451514  |
  4 |   .4179591837    9.99201e-16  |
  5 |   .3818300505    .4058451514  |
  6 |   .2797053915    .7415311856  |
  7 |   .1294849662    .9491079123  |
    +-------------------------------+

: // integral limits
: a = 0

: b = range(0.5,2,0.5) 

: // update nodes to limits
: nodes2 = 0.5:*(b:-a):*J(rows(b),1,nodes') :+ 0.5:*(b:+a)

: nodes2
                 1             2             3             4             5             6             7
    +---------------------------------------------------------------------------------------------------+
  1 |  .0127230219   .0646172036   .1485387122           .25   .3514612878   .4353827964   .4872769781  |
  2 |  .0254460438   .1292344072   .2970774243            .5   .7029225757   .8707655928   .9745539562  |
  3 |  .0381690657   .1938516108   .4456161365           .75   1.054383864   1.306148389   1.461830934  |
  4 |  .0508920877   .2584688144   .5941548486             1   1.405845151   1.741531186   1.949107912  |
    +---------------------------------------------------------------------------------------------------+

: // numeric integral
: integral =  0.5:*(b:-a):*quadrowsum(fx(nodes2):*weights')

: // analytical integral
: true_integral = (b:^1.5):/1.5

: b, integral, true_integral, reldif(true_integral,integral)
                 1             2             3             4
    +---------------------------------------------------------+
  1 |           .5   .2357893825   .2357022604   .0000704991  |
  2 |            1   .6669130851   .6666666667   .0001478292  |
  3 |          1.5   1.225197571   1.224744871   .0002034424  |
  4 |            2    1.88631506   1.885618083   .0002414763  |
    +---------------------------------------------------------+

: end
----------------------------------------------------------------------------------------------------------------------------------

The intergral is only

Three part integration

Tanhsinh quadrature

. use "https://pclambert.net/data/colon.dta", clear
(Colon carcinoma, diagnosed 1975-94, follow-up to 1995)

. stset surv_mm,f(status=1,2) id(id) scale(12) exit(time 120.5)

Survival-time data settings

           ID variable: id
         Failure event: status==1 2
Observed time interval: (surv_mm[_n-1], surv_mm]
     Exit on or before: time 120.5
     Time for analysis: time/12

--------------------------------------------------------------------------
     15,564  total observations
          0  exclusions
--------------------------------------------------------------------------
     15,564  observations remaining, representing
     15,564  subjects
     10,459  failures in single-failure-per-subject data
 51,685.667  total analysis time at risk and under observation
                                                At risk from t =         0
                                     Earliest observed entry t =         0
                                          Last observed exit t =  10.04167

. gen male = sex==1

Professor of Biostatistics