The Cox proportional hazards model is the most widely used model for survival analysis because of its simplicity. The fundamental assumption in this model is the proportionality of the hazard function. When this condition is not met, other modifications or other models must be used for analysis of survival data. We illustrate in this review several methodological approaches to deal with the violation of the proportionality assumption, using survival in colon cancer as an illustrative example.
Several methods for estimating survival probability of populations from patient samples have been proposed since the first systematic approach in 1950 . One of the oldest and most straightforward methods for analyzing survival data is to compute the Life Table. This method, proposed by Berkson and Gage  for studying cancer survival, uses an enhanced frequency distribution approach. To compute a Life Table, the range of survival times for all patients is divided into subintervals. For each interval, one computes the number and proportion of cases that entered the interval "alive." the number and proportion of cases that "died", and the number of cases that were lost or "censored" in the respective time interval. An observation is censored if the subject leaves the study or is alive when the study ends. Appropriate manipulation of these quantities allows estimation of parameters of interest related to the survival distribution.
While the Life Table method worked well for a homogeneous sample, it did not address a primary goal of cancer research, namely to determine whether or not certain continuous and/or categorical variables are related to the survival times. This need led to the application of regression methods for analyzing survival data. The standard multiple linear regression model is not well suited to survival data for several reasons; among these are (i) survival times are typically not normally distributed, and (ii) censored data is commonplace, resulting in missing values for the dependent variable (survival time). Early attempts to circumvent these problems involved applying the log transform to survival time, but this worked well only when censoring was present in a very small percentage of the observations (Everitt and Rabe-Hesketh, ). Two important developments that have greatly enhanced survival analysis methods are the derivation of a nonparametric method for constructing a survival curve from censored data by Kaplan and Meier , and the Proportional Hazards (PH) model proposed by Cox . With the rapid improvements in the graphics capabilities of personal computers over the last 20 years, the use of the Kaplan-Meier method has become so popular that survival curves are often referred to as "Kaplan-Meier curves". An example of a survival curve estimated using the Kaplan-Meier approach is shown in Figure 2. The Cox model, a rnultivariate semiparametric regression model, is now the most widely used in clinical studies to characterize disease progression on existing cases by revealing the importance of covariates. It is the most popular model for survival analysis due to its simplicity. The proportional hazards model is the most general of the regression models because it is not based on any assumptions concerning the nature or shape of the underlying-survival distribution. The model assumes that the underlying hazard rate (rather than survival time) is a function of the independent variables (covariates); no assumptions are made about the nature or shape of the hazard function.
Figure 2. Survival Function and Kaplan-Meier Estimate. Kaplan-Meier estimate (step curve) of the survival function and the population survival function (dotted curve). The survival function for the population is known because these data were simulated.
The hazard function (developed more fully in section 2) is a non-negative function (of time) which can be thought of as reflecting the change in an individual's probability of death in the immediate future, given that the individual has survived up to the current time. In humans, the probability of death is higher immediately after birth than it is when the newborn becomes more mature; consequently, the hazard function decreases with age through the maturation process. Subsequent to maturation, there is a long period where an individual's probability of death (in the immediate future) is relatively low and changing very little with the passage of time; here, the hazard function is rather flat and closer to zero. In the final stages of life, this probability increases with increasing age, resulting in an increase in the hazard function.
This pattern of change gives rise to a U-shaped (or "bathtub shaped") hazard function for all cause mortality in humans. The hazard is sometimes referred to as the "force of mortality" or the "conditional failure rate".
Mathematically, the hazard function is simply a re-expression of the survival function, in that specification of either one of these uniquely determines the other. The hazard function, however, has more visual appeal in that it directly displays the time periods over which changes to the risk of death in the immediate future are occurring. To identify these periods from the survival function (rather than the hazard function), the analyst would have to look for sharp drops and flatter sections of the survival curve. The hazard function displays the information more directly.
The Cox model relates the hazard function of an individual at time t, with a vector X = (X1, X2,..., Xp) of p covariates (explanatory or predictor variables), to a baseline hazard function h0(t) via a log-linear function:
where β = (β1, β2,..., βp) is a vector of coefficients. An important consequence of this formulation, and the reason for the name "Proportional Hazards Model", is that the hazard ratio function (HRF) for two individuals with covariates X and X* does not depend on time; h(t; X) is a constant multiple of h(t; X*). It is also important to note that the effects of the covariates on the hazard are assumed to be constant over time. Inference about the regression parameter (β) is possible without making assumptions about the form of the baseline hazard function, h0(t); the hypothesis of no association of one or more of the p independent variables with survival can be tested by the likelihood ratio test (LRT) .
Figure 1 illustrates the difference between proportional hazards and non-proportional hazards. In Figure 1a the two hazard functions are proportional, and their corresponding HRF (and log hazard ratio function, LHRF) is constant over time as shown in Figure 1b. On the other hand, if the hazard functions are not proportional, they might start at the same value and then diverge, or converge to some common value (Figure 1c), or cross and diverge again, and the corresponding LHRF will not be constant (Figure 1d). Many possibilities arise due to this non-proportionality. For example, the resulting LHRFs may be linear, non linear but monotonic (e.g. logarithmic), or non-monotonic (e.g. quadratic) (Ohno-Machado ).
Figure 1. Hazard Functions. Two hazard functions that are proportional are shown in (a); the log of their ratio is constant in time as seen in (b). Two nonproportional hazard functions are shown in (c); the log of their ratio depends on time as seen in (d).
In most applications of the Cox model in cancer research, the goal is to compare survival characteristics between two or more treatment groups. This is accomplished by letting one (or more) of the covariates serve as group indicator variable(s). For example, in comparing a treated group with a control group, we might assign X1 = 0 for control group cases and X1 = 1 for treated group cases, and compare the fitted survival curves. In this setting, it follows from the constancy of the HRF that the survival curves for different groups do not cross. In practice, however, the HRF may change over time leading to incorrect inferences (O' Quigley and Pessione  and Hess ) if the Cox model were applied. Although there are several tests available to check for PH violations (Lin et al ), they are rarely used because after rejecting the assumption it may not be known how to model the effect of the predictors (Quantin et al ).
This article focuses on a review of (a) the Cox model and interpretation of its parameters, (b) assessment of the validity of the PH assumption, (c) the use of time-varying coefficients, and (d) accommodating nonproportional hazards using covariate stratification, partitioning of the time axis, and modeling tune dependence of the regression coefficients.
1 Proportional Hazards Model
We describe the models in this paper using a synthetic colon cancer survival (SCCS) data set on 100 individuals. The explanatory variables considered are Treatment taking values 1 or 2, Race taking values W and B, and Stage taking values 1,2, and 3. The survival time in months is recorded (Time) and whether the observation was censored (Event = 0). Data for the first five individuals appear in Table 1. For example, subject 1 in this table died after 3 months of followup whereas subject 3 was still alive after 60 months of followup. The number of censored events is 51 and the median of the variable Time is 24. The estimated median survival time will be higher than the sample median because of the censored observations. The explanatory variables are summarized as Treatment: 47 received 'Treatment 1', 53 received 'Treatment 2', Race: 29 were 'Black', 71 were 'White', and Stage: 44 were 'Stage 1', 29 were 'Stage 2', and 27 were 'Stage 3'.
Table 1. First five observations from SCCS
The proportional hazards model as well as the models for nonproportional hazards attempt to describe the survival function S(t) which is the proportion of individuals in the population surviving beyond time t,
For an individual selected at random from the population, the survival function can be interpreted as the probability this individual survives beyond time t.
For real data the survival function is not known but can be estimated from a sample. If everyone in the sample were observed until death the survival function could be estimated by
Typically, not all individuals are observed until death so that some of the data are censored. One method for dealing with censored data is to estimate the survival function with the Kaplan-Meier estimator. The Kaplan-Meier estimate for the data in SCCS is plotted in Figure 2. This estimate can be described as a decreasing step function where the steps occur when a death has been observed. The smooth function is the true survival function which we know in this case because the data have been simulated. This population survival function is for a population with the following distributions: Treatment (50% Treatment 1, 50% Treatment 2), Race (80% White, 20% Black), and Stage (40% Stage 1, 30% Stage 2, 30% Stage3).
The survival function shows the relationship between survival and time; that is, between the variables Time and Event in the SCCS data set. To describe the effect of other variables, such as Treatment. Race, and Stage, on survival it is convenient to use the hazard function.
The hazard function addresses the question:
What is the probability that an individual who has survived to time t dies in the interval t to t + Δt?
As the interval Δt gets close to zero so does the probability of death and the hazard function h(t) is the rate at which the probability goes to zero:
The hazard function h(t) is not a probability so it can be larger than one and its value will depend on the units used for time (e.g., days or months). The hazard function is used to give an approximate probability, namely h(t)Δt, for the event that an individual dies in interval t to t + Δt. This approximation is very good for small Δt. The hazard function is also related to the survival function. In fact, the hazard function is an equivalent formulation for the relationship between survival time and the event of death. The proportional hazards model is a simple model to describe the survival between two groups having hazard functions h1(t) and h2(t). The proportional hazards assumption is that ratio of these hazard functions does not depend on time:
where ψ is called the proportionality constant. If ψ > 1 Individuals in group 2 have a greater risk of dying than individuals in group 1, regardless of the time t.
While ψ does not depend on time, it does depend on the explanatory variables in the model, namely Treatment, Race, and Stage. A convenient and readily interpretable linear expression for the logarithm of ψ (which can be any real number, since ψ > 0) is obtained by the proper use of indicator variables according to the following scheme:
xStg2 = 0 and xStg3 = 0 if Stage is 1
xStg2 = 1 and xStg3 = 0 if Stage is 2
xStg2 = 0 and xStg3 = 1 if Stage is 3
With this coding of the explanatory variables, and appropriate substitution into equation 1, it can be shown that
log(ψ) = βTrt2xTrt2 + βRaceWxRaceW + βStg2xStg2 + βStg3xStg3 (2)
where the β's have been subscripted to reflect their role in the hypothesized causal mechanism, and the subscript on x indicates when the variable takes the value "1" rather than "0". For example. xTrt2, xRaceW, and xStg2 are all 1 and xStg3 is 0 when the individual receives treatment 2, is white and has stage 2 cancer. An individual who received treatment 1, was black, and had stage 1 cancer would have each of these explanatory variables equal to zero. If βTrt2 > 0, individuals receiving treatment 2 have a larger log hazard function than those receiving treatment 1. Expressed in terms of the hazard function, exp(βTrt2) > 1 indicates that the hazard function for treatment 2 is proportionally greater than that of treatment 1 and the proportionality is constant for all time t. The generic expression of the hazards proportionality constant is
Fitting the proportional hazards model in (2) gives the estimates in Table 2. This model indicates that the effect of receiving treatment 2 rather than 1 is to decrease the hazard rate. That is, if we compare two groups one receiving treatment 1 and the other treatment 2, but the groups have the same values for the other explanatory variables, then the hazard rate for the group receiving treatment 2 is 0.46 times the hazard rate for those receiving treatment 1. The effect of being white is to decrease the hazard rate by the factor 0.4. Similar interpretation can be given for the effects of Stage2 and Stage3.
Table 2. Parameter estimates for the proportional hazards model using SCCS data
Since the validity of inferences based on the Cox model depends on the proportional hazards assumption, it is desirable to have diagnostic methods for checking this assumption. Many tools are available for checking the PH assumption. These include:
(i) Plots of log(-log S(t))
Suppose the hazards for 2 groups are proportional, say h1(t) = ψh2(t). It can be shown that this results in the relationship log(-log S1(t)) = log ψ + log(-log S2(t)); the log(-log S(t)) curves for the two groups differ by a constant. Plotting the estimated log(-log S(t)) curves for the two groups, evaluated at the covariate mean values, provides a visual check of the PH assumption. A clear departure from parallelism of these two curves would be consistent with violation of the PH assumption.
(ii) Interactions with Time
Interaction of a covariate x with time can be modeled by including in the model a product term x × (time), resulting in a log hazard function of the form
log h(t) = ⋯ + β1x + β2xt + ⋯ = ⋯ + (β1 + β2t)x + ⋯
The coefficient β2 reflects x's dependence on time; if this effect is statistically significant in the fitted model we have evidence for non-proportionality.
(iii) Schoenfeld Partial Residuals
For each covariate, a Schoenfeld residual can be calculated for each case that was not censored. A plot of these residuals against time should be "approximately flat" if the PH assumption holds. We illustrate a check on the proportionality assumption for the SCSS dataset using a Schoenfeld residual plot with a smooth curve fit to these residuals (Grambsch and Therneau ).
Figure 3 is the Schoenfeld residual plot for βTrt2. The dashed lines are confidence bands drawn at plus and minus two standard errors. The dotted line shows the true functional dependence of βTrt2 on time t. Note that βTrt2(t) is not constant but decreases toward zero as time increases. That is, the treatment effect dies off over time. The estimated curve indicates that Trt2(t) changes sign for large time t. The true parameter βRaceW(t) is constant in time but the estimate RaceW(t) indicates this parameter could depend on t. See Figure 4.
Figure 3. Schoenfeld Residual Plot for βTrt2. Schoenfeld residual plot for βTrt2. The black solid line is Trt2, a smooth function fit to the residuals, and the dotted line shows the actual dependence of βTrt2 on t (which is known in this case because the data have been simulated).
Figure 4. Schoenfeld residual plot for βRaceW. Schoenfeld residual plot for βRaceW. The black solid line is RaceW, a smooth function fit to the residuals; βRaceW is constant in time which is indicated by the horizontal dotted line (which is known in this case because the data have been simulated).
2 Dealing with Nonproportional Hazards
In the SCCS example from the previous section we encountered a model with a nonpreportional hazard function where one of the coefficients varied with time t. In this section we address some ways to cope with time varying coefficients β(t) in the Cox model. Before describing these methods we list the following considerations and caveats.
• Detecting the true time dependence of the parameter can be difficult. The Schoenfeld plot for Treatment in Figure 3 shows that the parameter is not constant but does not indicate the proper functional form. Therneau and Grambsch  give examples where the data cannot distinguish two very different functions of time.
• While there are tests to check for nonproportionality (see for example ), these need not be effective against some alternatives. In particular, tests based on the slope of β(t) need not be sensitive to quadratic or other nonlinear functions of time.
• Even when a test for nonproportionality is statistically significant, this does not mean the nonproportionality is of practical importance or significance. This potential difficulty arises in any significance test, especially if the sample is large.
• Time varying coefficients can be due to other inadequacies in the model. For example, one or more covariates are not included, the functional form of the covariates is incorrect, or another survival function can more effectively model the data.
We consider three methods for dealing with time-varying coefficients: stratification of variables with time varying coefficients, partitioning of the time axis, and modeling the dependence of β(t) on time. We discuss each of these using the SCCS example. Other examples of these approaches can be found on pages 145–147 in .
2.1 Covariate Stratification
Nonproportionality in Treatment can be addressed by fitting a separate baseline hazard function for each level of Treatment. The results for the SCCS data are given in Table 3. In this model, the effects of covariates are the same on each stratum so there is only one set of coefficients that apply to all strata. A disadvantage of this approach is that we cannot test the effect of treatment 1 versus treatment 2. For a variable that can be controlled this would make this approach unacceptable. It could, however, be used if a covariate such as race had a time varying coefficient.
Table 3. Parameter estimates for the proportional hazards model after stratification on Treatment
2.2 Partitioning of the Time Axis
Proportional hazards may not hold over the entire time axis but may hold approximately over shorter time periods. For the SCCS data we partition the time axis into two intervals: t < 30 and t ≥ 30. The estimated coefficients for these two time periods appear in Tables 4 and 5. The parameters in the two time periods have been estimated separately so that each covariate can relate differently to survival during the two time periods.
Table 4. Parameter estimates for the proportional hazards model on the first time interval
Table 5. Parameter estimates for the proportional hazards model on the second time interval
An advantage of partitioning the time axis is that we can model the effect of the Treatment variable. The relationship between this model and the proportional hazards model is illustrated in Figure 5 where the horizontal line indicates the coefficient βTrt2(t) is constant in time, the proportional hazards assumption, and the step function breaking at time t = 30 indicates βTrt2(t) is constant in time but only within each interval of the partition. The step function, like the curve showing the actual dependence on time, is increasing.
Figure 5. Estimates for the dependence of βTrt2 on time. The dependence of βTrt2 in the proportional hazards model (solid horizontal line), in the time-partitioned model (dashed line segments), and from a nonparametric estimate based on the Schoenfeld residuals (solid curve). The dotted curve is the actual dependence of βTrt2 on time (which is known in this case because the data have been simulated).
2.3 Modeling Time Dependence of β(t)
There are two basic approaches to modeling the dependence of β(t) on t. One is to specify a known functional form for this relationship. For example,
βTrt2(t) = β0 exp(-ρt) (3)
so that the treatment effect diminishes over time. Another is to allow the data to select the functional form of the time dependence. This can be done by using splines. The basic idea is to replace each of the dashed line segments in Figure 5 with a curve such that the curves are connected at time t = 30. The time axis in Figure 5 is partitioned into just two intervals but by increasing the number of intervals a close approximation can be obtained.
We do not use this approach for the SCCS data. One reason is that software for this approach is not readily available. The SAS procedure phreg can be used when the known functional form is linear; however, this will not work for the function in equation (3). A more important reason is that allowing the data to choose the functional form increases the chance of over-fitting the data, and one must be cautious about interpreting the p-values and confidence intervals for the parameters.
2.4 Other Methods
The methods given above for dealing with nonproportional hazards all involve modification of the Cox model. Another approach is to simply use a different model. Therneau and Grambsch  suggest accelerated failure time or additive hazard models. In the Cox model, the covariates act multiplicatively on the baseline hazard function (1). The additive hazard model:
allows the covariates to modify the baseline hazard in an additive way; the effect of the covariate may also vary with time. The additive model measures the additional risk due to the effect of a covariate in absolute terms, whereas the Cox model measures it in relative terms.
Artificial Neural Networks (ANNs) are another class of models that have been used to model cancer data. These are sometimes referred to as "black box" models because of the complex, nonlinear relationships used in the fitting procedure. In the Cox model, the form of the relationship between the covariates and survival is specified; only the parameters of the model are estimated. In contrast to this, with ANNs one tries to estimate both the functional form of the relationships between variables and the parameters that describe those relationships. This limits the ability of these models to identify causal relationships, because it is usually difficult to relate the model parameters to the biological context that generated the data (Ahmed ).
Tree Structured Survival Analysis (TSSA) provides another alternative for the study of risk factors (Segal ). It is an extension of the Classification And Regression Tree (CART) algorithm developed by Breiman et al . TSSA is an exploratory, nonparametric method that requires no assumptions about the relationship between survival and the potential risk factors (covariates). The primary output of a TSSA algorithm is a "tree structure", whose branches and nodes identify the successive splits of the cases into risk groups with similar survival profiles. TSSA evaluates the relationship between risk factors and survival through recursive partitioning of patients according to their risk factors, with a comparison of survival patterns among the subgroups of patients resulting from each partition. The method not only identifies a set of significant risk factors, but also provides a simple procedure to identify subgroups of participants with an estimate of their associated risk.
3 Application of Modified Cox Models in Colon Cancer Research
Moreau et al  proposed a crude survival model that accounts for the nonproportionality of hazards by modeling the hazard ratio as a step function of the follow-up time (divided into predefined intervals). The application of the piecewise model requires determining both the number and the boundaries of these intervals. Clinicians have conveniently suggested six intervals with the following upper boundaries being 1 month, 6 months, 12 months, 24 months, 60 months, and 120 months (Bolard et al ), and employed the 2L program of BMDP Statistical Software analysis (Dixon et al ).
Esteve et al  fit the baseline hazard function and its dependence on time by partitioning the time axis. However, the coefficients of the covariates were not allowed to depend on time. In contrast, Moreau et al  allow the coefficient on age to depend on time and finds the effect of age continues to decline between 24 and 120 months. Moreau et al  also allow cancer stage to depend on time. These considerable differences between models that allow covariates to depend on time and those that do not demonstrate the ability of models with time varying coefficients to provide new insights about these changes.
Esteve et al  investigate the time-dependent logarithm of the hazard ratio for each covariate by modeling the data using B-splines (de Boor  and Stone and Koo ). Using this B-spline model, they find a significantly higher risk of post surgical mortality than the Cox model during the first six months, and by 12 months the reverse happens. Moreover, the Cox model overestimates the impact of age on colon cancer-specific mortality after the first 6 months of follow-up, whereas the impact of age is near zero during this time period when the B-spline model is used. Clinically, the risk of cancer related death for elderly patients is higher only during the initial few months corresponding to post-surgical period. The estimated effect of later periods of diagnosis shows that there may be some benefits of new treatments in reducing post-surgical mortality, but they do not appear to affect the long-term survival (Giorgi et al ).
Software routines for fitting Cox models are available in all of the popular statistical software packages. Here we provide only a brief summary for some of those that are more widely used. It is worth noting that many of these programs have active and well-organized user groups whose members are continually writing add-on functions and macros, usually available at no cost.
GLIM , the Generalised Linear Interactive Modeling package, is a flexible, interactive statistical analysis program developed by the GLIM working party of the Royal Statistical Society. It has a suite of macros that perform survival analysis. These include COXMODEL (fits Cox proportional hazards models), PHAZ (plots log hazard against log survival time), LIFETABS (produces a life table analysis from censored survival times), and STEPS (plots the estimated survival curve using the Kaplan-Meier approach). Other related macros are INIB, LOGRANK2, NORM, LNORM, WEIBMIX, WEIBULL, and RESPLOTS.
R  is a language and environment for statistical computing and graphics. Unlike the other packages mentioned here, R can be downloaded at no cost from the R website. It is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies). Much code written for S runs unaltered under R. Analysis is accomplished via calls to functions; the arguments to these functions include datasets and model specifications, and the function returns appropriate statistics and graphical output. An R add-on package called "survival" (again, available at no cost) contains over 100 files consisting of functions and datasets for survival analysis.
SAS  software has six procedures that perform survival analysis computations. LIFETEST produces life tables and graphs of survival curves. LIFEREG estimates regression models with censored data, but does not allow for time-dependent covariates. PHREG fits Cox models, handling both discrete-time and continuous-time data, as well as time-dependent covariates. Three other procedures (LOGISTIC, PROBIT, and GENMOD), while not designed specifically for survival analysis, are effective for estimating survival models in certain settings.
Marketed by Insightful Corporation, S-PLUS  contains a complete array of survival analysis tools, including frailty models, smoothing splines, penalized survival models, parametric survival regression, Kaplan-Meier curves, and Cox proportional hazards models.
SPSS  is a popular menu-driven general statistical analysis package. It contains menu-driven routines for constructing Life Tables, plotting Kaplan-Meier survival curves, and fitting Cox models with time-varying covariates.
Though less widely used than those we have mentioned above, other programs that have Cox model fitting capabilities include BMDP , NCSS , Minitab , Systat , Stata , and Statistica .
The Cox model, also known as the proportional hazards model, often provides a very good approximation to the survival function and its dependence on covariates. For those situations where the proportional hazards model is inadequate we have described three simple modifications that address nonproportional hazards. One solution is to stratify covariates. A disadvantage of this approach is that it does allow modeling the effect of the covariates used in the stratification. Another solution is to partition the time axis into intervals and require proportional hazards only within intervals. A disadvantage here is determining the number intervals that are necessary. The final approach is to model the time dependence of the coefficients. Interpretation should be tempered by the fact that these models tend to over-fit the data.
Journal American Statistical Association 1958, 53:457-481. Publisher Full Text
Stat Med 1995, 14:1707-1723. PubMed Abstract
Biometrika 1995, 80:557-572. Publisher Full Text
Quantin C, Abrahamowicz M, Moreau T, Bartlett G, Mackenzie T, Tazi M, Lalonde L, Faivre J: Variation over time of the effects of prognostic factors in a population-based study of colon cancer: comparison of statistical models.
Biometrika 1994, 81:515-26. Publisher Full Text
Molec Cancer 2005, 4:29. BioMed Central Full Text
Biometrics 1988, 44:35-47. Publisher Full Text
Stat Med 1990, 9:529-538. PubMed Abstract
Statistical analysis software and data mining software – Insightful [http://www.insightful.com/products/splus/default.asp] webcite