Numerical Integration for log hazard models
Models on the log hazard scale
In stpm3
it is possible to fit models on the log hazard scale (scale(lnhazard)
)
as well as the log cumulative hazard scale (scale(lncumhazard)
).
Such models could not be fitted using stpm2
.
Other Stata software that could be used fit models on the log hazard scale
include strcs
, stgenreg
and merlin
. One advantage of fitting
the models in stpm3
is that the time take to fit the models is
much quicker, particularly when combined with the python
option (LINK).
Models on the log hazard scale require numerical integration to be fitted.
The is because the transformation from log hazard to cumulative hazard
is not analytically tractable. The numerical integration used by stpm3
is more accurate for many types of models than that used in other software (strcs
, merlin
)
through using a different type of numerical quadrature.
The log-likelihood for these models is as follows,
$$d_i \log\left[h(t_i|X_i)\right]- \int_{t_{0i}}^{t_i} h(u|X_i) du$$
As it is not possible to obtain the integral analytically we use numerical integration.
Gaussian Quadrature
Gaussian quadrature is a method to perform numerical integration that converts an integral into a weighted sum over a set of predefined points known as nodes.
$$\int_{-1}^{-1} f(x) dx \approx \sum_{k=1}^K w_k f(x_k)$$
Thus we just need to evaluate the integrand, $f(x)$ at the $K$ nodes, $x_1 \ldots x_K$, combine with the $K$ weights, $w_1 \ldots w_K$ and then sum. The larger the value of $K$ the more accurate the approximation of the integral will be.
There are different ways to calculate the nodes and weights. The Wikipedia page on Guassian Quadrature gives a good overview. [LINK]
The limits of the above integral are [-1,1], but can be transformed to any interval using a change of variable rule. Thus to integrate $h(t|X_i)$ over $[t_{0i},t_i]$, we first generate nodes for a given $K$, for limits [-1,1], and than transform the nodes so we integrate over the limits $[t_{0i},t_i]$ using
$$x_k' = \frac{t_i-t_{0i}}{2}x_k + \frac{t_i+t_{0i}}{2}$$
and then obtain the approximation to the integral using,
$$\int_{t_{0i}}^{t_i} h(u|X_i) du \approx \frac{t_i-t_{0i}}{2}\sum_{k=1}^K w_k h(x_k'|X_i) $$
Note that this integral need to be evaluated for all individuals in the model and the limits change between individuals. When maximizing the log-likelihood function these integrals will be need to be calculate multiple times until the model converges. For complex/awkward functions many nodes may be required to obtain a desired level of accuracy.
Gauss Legendre Quadrature
The most common way to perform the integration is Gauss Legendre Quadrature. Below is an example of using gaussian quadrature to integrate $\int_0^2 \sqrt x$, using Gauss Legendre quadrature with 7 nodes. The integrals is evaluated at 4 upper limits (0.5, 1, 1.5 and 2).
. mata:
------------------------------------------------- mata (type end to exit) --------------------------------------------------------
: // function to integrate
: function fx(real matrix x) return(sqrt(x))
: // generate Gauss Legendre nodes and weights
: stpm3_quad_gl(7, weights=.,nodes=.)
: weights, nodes
1 2
+-------------------------------+
1 | .1294849662 -.9491079123 |
2 | .2797053915 -.7415311856 |
3 | .3818300505 -.4058451514 |
4 | .4179591837 9.99201e-16 |
5 | .3818300505 .4058451514 |
6 | .2797053915 .7415311856 |
7 | .1294849662 .9491079123 |
+-------------------------------+
: // integral limits
: a = 0
: b = range(0.5,2,0.5)
: // update nodes to limits
: nodes2 = 0.5:*(b:-a):*J(rows(b),1,nodes') :+ 0.5:*(b:+a)
: nodes2
1 2 3 4 5 6 7
+---------------------------------------------------------------------------------------------------+
1 | .0127230219 .0646172036 .1485387122 .25 .3514612878 .4353827964 .4872769781 |
2 | .0254460438 .1292344072 .2970774243 .5 .7029225757 .8707655928 .9745539562 |
3 | .0381690657 .1938516108 .4456161365 .75 1.054383864 1.306148389 1.461830934 |
4 | .0508920877 .2584688144 .5941548486 1 1.405845151 1.741531186 1.949107912 |
+---------------------------------------------------------------------------------------------------+
: // numeric integral
: integral = 0.5:*(b:-a):*quadrowsum(fx(nodes2):*weights')
: // analytical integral
: true_integral = (b:^1.5):/1.5
: b, integral, true_integral, reldif(true_integral,integral)
1 2 3 4
+---------------------------------------------------------+
1 | .5 .2357893825 .2357022604 .0000704991 |
2 | 1 .6669130851 .6666666667 .0001478292 |
3 | 1.5 1.225197571 1.224744871 .0002034424 |
4 | 2 1.88631506 1.885618083 .0002414763 |
+---------------------------------------------------------+
: end
----------------------------------------------------------------------------------------------------------------------------------
The intergral is only
Three part integration
Tanhsinh quadrature
. use "https://pclambert.net/data/colon.dta", clear
(Colon carcinoma, diagnosed 1975-94, follow-up to 1995)
. stset surv_mm,f(status=1,2) id(id) scale(12) exit(time 120.5)
Survival-time data settings
ID variable: id
Failure event: status==1 2
Observed time interval: (surv_mm[_n-1], surv_mm]
Exit on or before: time 120.5
Time for analysis: time/12
--------------------------------------------------------------------------
15,564 total observations
0 exclusions
--------------------------------------------------------------------------
15,564 observations remaining, representing
15,564 subjects
10,459 failures in single-failure-per-subject data
51,685.667 total analysis time at risk and under observation
At risk from t = 0
Earliest observed entry t = 0
Last observed exit t = 10.04167
. gen male = sex==1