Case-cohort design in practice – experiences from the MORGAM Project
 Sangita Kulathinal^{1},
 Juha Karvanen^{2},
 Olli Saarela^{2},
 Kari Kuulasmaa^{2} and
 the MORGAM Project^{3}
DOI: 10.1186/1742-5573-4-15
© Kulathinal et al. 2007
Received: 13 November 2006
Accepted: 04 December 2007
Published: 04 December 2007
Abstract
When carefully planned and analysed, the case-cohort design is a powerful choice for follow-up studies with multiple event types of interest. While the literature is rich with analysis methods for case-cohort data, little has been written about designing a case-cohort study. Our experiences in designing, coordinating and analysing the MORGAM case-cohort study are potentially useful for other studies with similar characteristics. The motivation for using the case-cohort design in the MORGAM genetic study is discussed and issues relevant to its planning and analysis are studied. We propose solutions for appending the earlier case-cohort selection after an extension of the follow-up period and for achieving maximum overlap between earlier designs and the case-cohort design. Approaches for statistical analysis are studied in a simulation example based on the MORGAM data.
1 Introduction
The MORGAM (MONICA, Risk, Genetics, Archiving, and Monograph) Project is an ongoing multinational collaborative study with the overall aim of studying a limited number of well-defined phenotypes and several hundred genetic factors by pooling data from cohorts defined in MONICA (Multinational MONItoring of trends and determinants in CArdiovascular disease) and other similar cross-sectional risk factor surveys [1, 2]. In brief, MORGAM cohorts are the respondents of random survey samples from geographically defined populations for whom several baseline measurements were made. The MORGAM cohorts are followed up prospectively for all-cause mortality and non-fatal coronary heart disease (CHD) and stroke events.
Compared to designs where case-matched controls are selected, a distinct advantage of the case-cohort design is that the selected subcohort can be used for analysing several endpoints of interest. Furthermore, as the subcohort forms a random sample of the original cohort, it can be used to assess the genetic distribution of the population. If the subcohort is selected efficiently, the statistical power for detecting gene-disease associations is not substantially reduced compared to the alternative where the full cohort is genotyped. The theoretical foundation for the design was formulated in 1986 by Prentice [3], although an epidemiological study design similar to the case-cohort design had already been suggested in 1975 by Kupper et al. [4] and in 1982 by Miettinen [5]. During the past twenty years, several authors have considered the case-cohort design from various viewpoints including sampling of the subcohort, weighting methods for the analysis, variance estimation, and comparison with the case-control and the nested case-control designs.
Nowadays the case-cohort design is one of the standard designs for prospective follow-up studies, and the analysis methods can be implemented in commonly used statistical software packages such as R [6] and SAS [7].
The sampling of the subcohort itself has gained relatively little attention in the literature. It is important to note that the follow-up and covariate data collected for the complete cohort can be utilised in choosing the subcohort to improve the efficiency of the design. The sampling probabilities may be defined within strata formed using matching variables or at the individual level. The stratified case-cohort design is studied by Borgan et al. [8], Kulich and Lin [9] and Samuelsen et al. [10]. Kim and De Gruttola [11] compare various strategies for cohort sampling and propose an efficient subcohort sampling procedure where the sampling probabilities are proportional to predictive probabilities calculated from a logistic regression model that explains the probability of being a case by matching variables. Using this approach, the distribution of important background variables will be similar for the cases and the subcohort. A modification of this approach is applied in the MORGAM Project. Cai and Zeng [12] and Kim et al. [13] consider the calculation of sample size and power in case-cohort studies.
Much of the literature on time-to-event analysis of case-cohort data has concentrated on the relative risk model and modifications to Cox's partial likelihood [14]. Adjustments to the partial likelihood are required because the cases are overrepresented in the case-cohort set and therefore unadjusted risk sets in the partial likelihood would not represent the original study cohort. The original pseudolikelihood estimator proposed by Prentice [3] uses a weighting where risk sets at event times consist of subcohort members at risk while the cases outside the subcohort enter the risk sets only at their event time. A slight modification by Self and Prentice [15] did not include the non-subcohort cases in the risk sets at all. Kalbfleisch and Lawless [16] suggested including all cases in the risk sets with weight one and weighting the remaining subcohort members with the inverse subcohort sampling probability. Barlow [17] proposed a time-dependent weighting where the weights for the subcohort members are defined as the ratio of the number of cohort members at risk to the number of subcohort members at risk. Barlow et al. [18] approximated this quantity by the inverse of the subcohort sampling fraction. Kulich and Lin [9] propose a class of weighted estimators with general time-varying weights. Samuelsen et al. [10] present an analysis approach for general cohort sampling designs, including the case-cohort design, where the weighting is based on post-stratification on case status and other factors.
The different weighting schemes are compared by Barlow et al. [18], Petersen et al. [19] and most recently by Onland-Moret et al. [20]. The results suggest that when the size of the subcohort is sufficiently large (for instance, over 15% of the full cohort according to [20]) all weighting schemes give similar estimates that differ only slightly from the full cohort estimates. When the size of the subcohort is small compared to the original cohort, the authors report that the Prentice estimator may have better small sample properties than the other approaches.
Variance estimation under the case-cohort design is an important topic because, for example, the standard variance estimators for relative hazard parameters in the Cox regression model are not valid for the case-cohort situation. The lack of variance estimators suitable for the Cox regression analysis of case-cohort data in standard statistical software may have initially limited the application of the design [18]. Self and Prentice [15] give conditions for the consistency and asymptotic normality of the pseudolikelihood estimator. Wacholder [21], Lin and Ying [22] and Barlow [17] have proposed variance estimators for Cox regression analysis under the case-cohort design. Barlow [17] showed that the robust variance estimator of Lin and Wei [23] is equivalent to a jackknife variance estimator, which can be directly applied to a case-cohort situation. Program codes for the computation of estimators are provided by Barlow et al. [18], Therneau and Li [24] and Langholz and Jiao [25]. Robust variance estimation is implemented in the recent versions of R and SAS software and can be applied in the analysis of case-cohort data when appropriate weighting is used.
In addition to the pseudolikelihood-based time-to-event analysis, some authors have recently considered a full likelihood approach where the cohort sampling design is handled as a missing data problem. In this approach the likelihood expression is constructed for the complete cohort instead of the case-cohort set. Parameter estimation can then be carried out using the expectation-maximisation (EM) algorithm [26] or Bayesian data augmentation [27]. The full likelihood estimation is computationally more demanding due to the large amount of missing covariate data generated by the design but has systematically better performance, although the gain in efficiency is minor in the case of a rare disease [26]. Further gain in efficiency can be achieved through modelling of possible dependencies between the covariate collected under the case-cohort design and the covariates collected for the complete cohort [27]. An alternative likelihood-based approach, which uses only the case-cohort set but maximises a likelihood that is conditioned on inclusion in the case-cohort set, was recently proposed by Saarela and Kulathinal [28]. The likelihood-based approaches potentially allow the use of more general survival models.
The case-cohort design and the nested case-control design have been compared in various settings. Wacholder [21] compared the practical aspects of the designs. Langholz and Thomas [29, 30] reported results from simulation studies where under some settings the case-cohort design was found to be inferior to the nested case-control design. Zhao and Lipsitz [31] discuss twelve two-stage designs including the case-control and the case-cohort designs as special cases. Chen and Lo [32] establish a link between the case-cohort and the case-control sampling in terms of the estimation of regression parameters in Cox's model. Chen [33] studied the case-cohort, the nested case-control and the case-control designs through a unified approach.
In the above-mentioned references the use of the case-cohort design has been demonstrated, e.g., with data evaluating the efficacy of mammography screening in reducing breast cancer mortality [17], data from an occupational exposure study of nickel refinery workers [18], data from an AIDS clinical trial [11], data on premature death of adult adoptees [19] and data on body mass index and cardiovascular disease [20]. In recent epidemiological studies, the case-cohort design has been applied, for instance, in a study of the risk of myocardial infarction following radiation therapy for breast cancer [34], a study of alcohol intake and cardiovascular disease [35], a study of the relation between cancer and medication exposures [36], and a study of occupational exposures and breast cancer among women textile workers [37].
The review of the literature on the case-cohort design reveals that the practical aspects of the study design have gained relatively little attention. For instance, in a recent methodological comparison [20], the authors explicitly state that they do not provide suggestions as to how to design a case-cohort study. In epidemiological study reports, the study design is usually described only briefly. This paper describes the case-cohort design of the MORGAM Project in detail and discusses analysis approaches for case-cohort data, with the intention of providing guidelines that are helpful in designing studies with similar characteristics.
In the present article, we describe the procedure used in selecting the subcohort in the MORGAM case-cohort design and approaches to statistical analysis of the case-cohort data. In section 2, the MORGAM case-cohort selection is described in detail. Section 3 deals with completing the case-cohort set after an extension of the follow-up period. In section 4, the MORGAM case-cohort design is compared with a local study design and a selection procedure for the MORGAM subcohort is proposed to ensure maximal overlap between the two designs. Section 5 deals with the assessment of the selection and data management issues. Statistical analysis approaches are described in section 6. Subcohort selection procedures and some analysis methods are compared with a simulation study in section 7. We conclude with a Discussion. Various aspects of the case-cohort design are illustrated using a single MORGAM cohort.
2 Selection of cases and subcohort in MORGAM
As mentioned in the Introduction, the subcohort selection procedures are not described in detail in the literature. In this section, we give a detailed account of the subcohort selection procedure developed in MORGAM. The selection of cases and subcohort for each cohort is done centrally at the MORGAM Data Centre, Helsinki, after the baseline and follow-up data have been received from the participating centre and their quality has been approved. These data are used in identifying the cases and for selection of the subcohort using the common criteria and selection procedures for each cohort.
2.1 Eligible cohort for genetic substudy
Availability of DNA and consent for the use of DNA are basic requirements for a genetic study. In MORGAM, availability of baseline data on the most important classic risk factors of cardiovascular diseases, namely smoking status, blood pressure and cholesterol, is an additional requirement. An individual is considered eligible for the genetic substudy if there is consent for the use of DNA to study both CHD and stroke and the information on smoking, blood pressure, cholesterol and DNA is available [38]. A cohort consisting of the eligible individuals is referred to as an eligible cohort for the genetic study. For some cohorts, it is feasible to assess the availability of DNA for the selected case-cohort set only. In such a case, availability of DNA is not a requirement for eligibility, but the reasons for missing DNA for some individuals are assessed carefully in order to ensure that their absence will not unduly bias the results of the study.
2.2 Definitions of cases
As mentioned in the Introduction, several definitions of a case are of interest. They are defined using events such as different types of CHD, stroke, venous thromboembolic disease and death during the follow-up, as well as history of cardiovascular disease or stroke observed at the baseline examination. Based on the data from the baseline examination and follow-up of CHD, stroke, venous thromboembolic events and all-cause mortality, an individual experiencing any of these is defined as a case and is selected for genotyping. For details, we refer the reader to the MORGAM Manual [38].
2.3 Subcohort sampling
Stratification
The smallest geographic unit that can be identified in the MORGAM data is called a reporting unit (RU). It is often reasonable to combine RUs for data analysis, in particular if they represent small adjacent populations where the baseline surveys were carried out at the same time. Such combinations of RUs are called Reporting Unit Aggregates (RUA), and the individuals examined in the same survey (specified by calendar period) in a RUA constitute a MORGAM cohort. The number of cohorts and their baseline years vary RUA by RUA. Within each RUA, the MORGAM case-cohort design is stratified according to cohort and sex. The procedure of selection of a sample from each stratum is described below.
Size of the subcohort
The size of the subcohort can be defined as a fraction of the whole cohort or proportionally to the number of cases. Because the proportion of cases varies cohort by cohort in the MORGAM Project, the size of the subcohort was made proportional to the number of cases. It is known from the theory of the case-control design that the asymptotic relative efficiency for a study involving k controls per case is k/(k + 1), which takes values of 0.5, 0.67, 0.75, 0.8 and 0.83 for k from 1 to 5 [39]. Because of this and the limitation of the genotyping budget, the subcohort size within each stratum is defined using the main study endpoints as twice the maximum of the number of first acute CHD (fatal or non-fatal) and first stroke (fatal or non-fatal) events during the follow-up. Relatively strict definitions of disease endpoints are used in defining the subcohort size compared to the wider definition used in the definition of cases (see section 2.2). This gives more freedom for defining the endpoint of interest at the analysis stage while not needlessly expanding the subcohort size. The subcohort thus selected should also be large enough for studying other endpoints because the numbers of cases arising from other endpoints are generally smaller than the numbers of CHD and stroke cases. However, if the total subcohort size for a RUA (that is, all cohorts within the RUA) happens to be less than 100, then the subcohort size in each stratum is adjusted so that the total size for the RUA is 100. This allows estimation of the genotypic distribution for each RUA.
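The sizing rule can be illustrated with a short sketch. The function names `relative_efficiency` and `subcohort_size` are hypothetical, and the adjustment that raises small RUA totals to 100 is left out because the text does not say how the extra individuals are distributed across strata. The case counts are those reported for men in the example cohort with follow-up to the end of 2003.

```python
def relative_efficiency(k):
    """Asymptotic relative efficiency k/(k + 1) for k controls per case."""
    return k / (k + 1)


def subcohort_size(n_chd, n_stroke, multiplier=2):
    """Stratum subcohort size: multiplier times the larger of the two case counts."""
    return multiplier * max(n_chd, n_stroke)


# Efficiencies for k = 1..5, rounded to two decimals
efficiencies = [round(relative_efficiency(k), 2) for k in range(1, 6)]

# Men in the example cohort, follow-up to the end of 2003
size_men_2003 = subcohort_size(n_chd=96, n_stroke=58)
```

With a multiplier of two, the subcohort corresponds roughly to two controls per case for the more frequent of the two main endpoints.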
Sampling within the stratum
The number of major endpoint events during the follow-up increases strongly with the age of the individual at baseline. Therefore, if all members of the cohort had an equal probability of being selected in the subcohort, the power of the design would suffer from the fact that the average age at baseline of the cases would be much higher than the average age of the subcohort. The power can be increased to a level comparable to the power of an age-matched case-control design by selecting the individuals of the subsample using an age distribution similar to the distribution of the baseline age of the cases [11].
Each individual is selected to the subcohort with the pre-assigned probability, and n such individuals are drawn from the stratum without replacement using the Hanurav-Vijayan algorithm [40, 41]. The selection procedure is implemented in SAS using the procedure proc surveyselect with method=pps [7]. If a fixed sample size is not a strict requirement, a more straightforward sampling procedure would be to select individual i to the subcohort independently of the other individuals with probability p(S_i = 1). In this case the sample size is random with expectation equal to n above.
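The independent-selection variant mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the logistic form of f, the coefficient values, the simulated ages and all function names are hypothetical, and the probabilities are simply scaled to sum to n, which is one plausible reading of "proportional to f(b_i)".

```python
import math
import random


def f(age, alpha, beta):
    """Logistic function of baseline age (assumed form, illustrative coefficients)."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * age)))


def selection_probabilities(ages, n, alpha, beta):
    """Probabilities proportional to f(age), scaled so that they sum to n."""
    values = [f(b, alpha, beta) for b in ages]
    total = sum(values)
    probs = [n * v / total for v in values]
    if max(probs) > 1.0:
        raise ValueError("scaled probability exceeds 1; n is too large for this stratum")
    return probs


def independent_selection(probs, rng):
    """Select each individual independently, so the sample size is random."""
    return [1 if rng.random() < p else 0 for p in probs]


rng = random.Random(2007)
ages = [rng.randint(25, 64) for _ in range(1000)]  # hypothetical stratum
probs = selection_probabilities(ages, n=200, alpha=-8.0, beta=0.08)
selected = independent_selection(probs, rng)
```

Because the probabilities sum to n, the expected subcohort size is n, while the realised size varies from draw to draw. A fixed-size draw would require a probability-proportional-to-size algorithm such as the one named in the text.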
Subcohort selections for the example cohort. First selection: follow-up to the end of the year 2003; second selection: follow-up to the end of the year 2004. The case definitions used here for CHD and stroke are specific to determining the subcohort size; the definitions used for defining the case-cohort set are usually wider.

| | Men, 2003 | Men, 2004 | Women, 2003 | Women, 2004 |
|---|---|---|---|---|
| CHD cases | 96 | 104 | 24 | 32 |
| Stroke cases | 58 | 70 | 26 | 32 |
| Subcohort size | 192 | 208 | 52 | 64 |
3 Sampling after extension of the follow-up period
Because MORGAM is an ongoing project, the MORGAM participating centres are encouraged to extend the follow-up period continually. An extension of the follow-up period results in an increase in the number of observed CHD and stroke cases and hence an increase in the desired size of the subcohort. In this section, we describe the procedure used to augment the already selected subcohort to the desired size.
As mentioned in section 2, for the subcohort sampling in MORGAM, an individual with age b_i at baseline is selected for the sample with probability proportional to f(b_i), where the function of age is obtained from the total death rate of the cohort using a logistic model. With an increase in the number of deaths, the function changes, though the change is usually very small.
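The conditional probability used when enlarging the subcohort can be reconstructed from the law of total probability. The following is a sketch of that reasoning, assuming that individuals selected earlier are retained; here p(S_{0i} = 1) denotes the probability of selection in the earlier round and p(S_i = 1) the ultimate selection probability based on the extended follow-up.

```latex
\begin{align*}
p(S_i = 1) &= p(S_{0i} = 1) + \bigl(1 - p(S_{0i} = 1)\bigr)\, p(S_i = 1 \mid S_{0i} = 0), \\
p(S_i = 1 \mid S_{0i} = 0) &= \frac{p(S_i = 1) - p(S_{0i} = 1)}{1 - p(S_{0i} = 1)}. \tag{1}
\end{align*}
```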
Note that the above probability is always less than or equal to 1 since p(S_i = 1) is always less than or equal to 1.
Let us assume that a sample of size n_0 has been selected with the ultimate selection probability for individual i as p(S_{0i} = 1). The following algorithm can be given for the enlargement of the sample:
1. Obtain p(S_i = 1), i = 1, 2, ..., N, using the extended follow-up data. Determine the new subcohort size n.
2. Calculate p(S_i = 1 | S_{0i} = 0) using (1).
3. Select n - n_0 individuals out of the N - n_0 individuals who were not selected in the first phase with probability proportional to p(S_i = 1 | S_{0i} = 0). The sampling is made using the Hanurav-Vijayan algorithm that is implemented in SAS as procedure proc surveyselect with method=pps.
Because the sample size is fixed, only the proportions of the sampling probabilities can be fixed. Therefore, the sampling probabilities p(S_i = 1 | S_{0i} = 0) are only approximations of the actual sampling probabilities obtained from proc surveyselect. However, our experience is that the differences between the actual sampling probabilities and the desired sampling probabilities are negligible and p(S_i = 1) can be used as the ultimate sampling probability.
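The three steps of the enlargement can be sketched as below. Two assumptions should be kept in mind: the conditional probability is computed from the law of total probability (with negative values set to zero), and plain systematic sampling is swapped in as a simple fixed-size, probability-proportional-to-size stand-in for the Hanurav-Vijayan algorithm, which is not reimplemented here. All function names are hypothetical.

```python
import random


def conditional_probability(p_new, p_old):
    """p(S_i = 1 | S_0i = 0), assuming earlier selections are retained."""
    return max(0.0, (p_new - p_old) / (1.0 - p_old))


def systematic_pps(weights, m, rng):
    """Draw exactly m distinct indices with inclusion probability proportional to weight.

    Systematic sampling with a random start; a stand-in for Hanurav-Vijayan.
    """
    total = sum(weights)
    incl = [m * w / total for w in weights]
    if max(incl) > 1.0:
        raise ValueError("an inclusion probability exceeds 1")
    start = rng.random()
    chosen, cum = [], 0.0
    for idx, p in enumerate(incl):
        # force the final cumulative sum to m to guard against rounding error
        cum = float(m) if idx == len(incl) - 1 else cum + p
        if len(chosen) < m and start + len(chosen) < cum:
            chosen.append(idx)
    return chosen


def enlarge_subcohort(p_new, p_old, selected_old, n, rng):
    """Add n - n_0 individuals to an existing subcohort (steps 1 to 3)."""
    candidates = [i for i, s in enumerate(selected_old) if s == 0]
    weights = [conditional_probability(p_new[i], p_old[i]) for i in candidates]
    n_extra = n - sum(selected_old)
    picked = systematic_pps(weights, n_extra, rng)
    selected = list(selected_old)
    for j in picked:
        selected[candidates[j]] = 1
    return selected
```

As in the text, fixing the sample size means that only the proportions of the conditional probabilities are respected, so the realised inclusion probabilities approximate the targets.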
4 MORGAM case-cohort design and locally designed studies
Some MORGAM centres have designed case-control or case-cohort studies for local use [42], and it may be beneficial for these centres that the MORGAM case-cohort set is selected in such a way that there is maximum overlap with the local case-control or case-cohort set. On the other hand, the case-cohort selection strategy should be similar in all MORGAM cohorts because it is important to treat the participating centres equally. In practice, the MORGAM subcohort is defined as a random sample of the cohort where the selection probabilities are defined as in sections 2.3 and 3. Thus, the goal is to select the MORGAM case-cohort set in such a way that these competing objectives can be fulfilled.
Let the subcohort selected locally be described here using a binary variable S_{0i} that takes value 1 with the local sampling probability p(S_{0i} = 1). For example, in a MORGAM cohort where a local study was conducted, individuals younger than 35 years at baseline were assigned sampling probability p(S_{0i} = 1) = 0; otherwise p(S_{0i} = 1) varied from 0.060 to 0.200 depending on the sex and cohort but not on the age.
Let p(S_i = 1), i = 1, 2, ..., N, be the MORGAM selection probabilities. To ensure the maximum overlap, the subcohort is selected conditional on the local selection status S_{0i}:

If individual i was selected to the local subcohort, select i to the MORGAM subcohort with probability p(S_i = 1 | S_{0i} = 1). (2)

If individual i was not selected to the local subcohort, select i to the MORGAM subcohort with probability p(S_i = 1 | S_{0i} = 0). (3)
All subjects with conditional probability equal to one are selected; let n_1 be the number of such individuals. Following the algorithm described in section 3 and using equations (2) and (3) in Step 2 of the algorithm, n - n_1 individuals are selected.
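One way to obtain conditional probabilities with these properties is sketched below. It is derived from the two stated requirements (the marginal MORGAM probability p(S_i = 1) is preserved and the overlap with the local selection is as large as possible) and should be read as an assumption rather than as a transcription of equations (2) and (3). The function names are hypothetical.

```python
def overlap_conditional_probabilities(p, p0):
    """Return (p(S=1 | S0=1), p(S=1 | S0=0)) giving maximal overlap.

    p  : MORGAM selection probability p(S_i = 1)
    p0 : local selection probability p(S_0i = 1)
    None marks a conditional probability that is never needed
    (conditioning on an event of probability zero).
    """
    given_selected = min(1.0, p / p0) if p0 > 0 else None
    given_not_selected = max(0.0, (p - p0) / (1.0 - p0)) if p0 < 1 else None
    return given_selected, given_not_selected


def marginal(p0, given_selected, given_not_selected):
    """Recombine the two conditional probabilities into p(S_i = 1)."""
    part1 = p0 * given_selected if given_selected is not None else 0.0
    part0 = (1.0 - p0) * given_not_selected if given_not_selected is not None else 0.0
    return part1 + part0
```

For instance, with p = 0.3 and p0 = 0.1 every locally selected individual is taken (conditional probability one), and the remaining probability mass is spread over those not selected locally.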
5 Selection diagnostics and data management
MORGAM being a multicentre study, ensuring uniformity with respect to the selection procedures and data management is a challenge. Hence, we describe briefly the assessment of the selection and data management procedures used for the case-cohort data.
The case-cohort study is designed on a cohort-to-cohort basis when the data for the cohort are complete and assessed for quality. The subcohort selection procedure described above is followed for each cohort. While computing the selection probabilities using the logistic regression model, the convergence and the values of the parameter estimates are checked and compared with the other cohorts. Typically, in the logistic regression model, the estimated coefficient for the age at baseline varies from 0.07 to 0.10 and the intercept term varies from -7.0 to -10.0 in the population cohorts. Major deviations from these values lead to further investigation. As mentioned earlier, a larger subcohort needs to be selected if the follow-up data are updated with a longer follow-up. The data are again checked for their quality, and it is possible that some of the data items have been altered compared to the earlier data.
The data are generated at different times during the selection process, so a well-defined data management protocol, listing the data items to be stored for future use, is required. Information on the selection is saved for each cohort, including individuals not in the case-cohort set. The data on the date and the phase of the selection, eligibility for the genetic case-cohort study, selection probability, subcohort selection status and case status (one variable for each case type) are stored. We refer the reader to the MORGAM manual [38] for the definitions and structure of these data items for the transfer into the database. Note that these data are essential for all the subsequent analyses. The selection is summarised in a table comprising the size of the eligible cohort, the size of the subcohort, the number of different cases, the number of cases in the subcohort and the total number of subjects selected for genotyping.
6 Statistical analysis of case-cohort data
At the design stage of the case-cohort study, several endpoints are generally used for defining cases for which the covariate data are collected, and hence for defining the case-cohort set. This is illustrated in Figure 1. For the purpose of statistical analysis, a case-cohort set specific to a single endpoint of interest is specified. In the following we introduce some notation to describe the analysis case-cohort set. Let S_i, E_i and O_i be binary variables taking value one if the individual i is selected into the subcohort, is a case according to the definition used in the analysis and is a member of the case-cohort set for the present definition of a case, respectively. Let C = {1, 2, ..., N}, S = {i ∈ C : S_i = 1}, E = {i ∈ C : E_i = 1} and O = {i ∈ C : O_i = 1} = S ∪ E denote the complete cohort, subcohort, cases and case-cohort set, respectively. Special analysis methods are needed for sets S and O because of the unequal sampling probabilities p(S_i = 1) and because the cases E are overrepresented in the case-cohort set O.
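Because subcohort members are drawn with unequal probabilities, an unweighted subcohort summary would over-represent individuals with high selection probability. A minimal sketch of inverse probability weighting in the spirit of the Horvitz-Thompson method follows; the ratio form of the estimator, the function name and the toy data are assumptions made for illustration only.

```python
def weighted_mean(values, probs):
    """Ratio-type inverse-probability-weighted mean over subcohort members."""
    weights = [1.0 / p for p in probs]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Toy subcohort: carrier status of a hypothetical allele and p(S_i = 1)
carrier = [1, 0, 0, 1, 0]
p_select = [0.5, 0.1, 0.1, 0.5, 0.25]
freq = weighted_mean(carrier, p_select)
```

In this toy example the carriers were sampled with high probability, so the weighted frequency is lower than the raw subcohort proportion of 2/5.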
6.1 Estimation of summary statistics
6.2 Analysis of time-to-event data
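Weighting schemes for Cox's partial likelihood. Y_i(t) denotes the at-risk indicator for subject i at time t. Weighting method ISSP refers to the inverse subcohort sampling probability weighting as defined in (5).

| | Prentice | Barlow (1994) | Barlow (1999) | Kalbfleisch & Lawless | ISSP |
|---|---|---|---|---|---|
| non-case in subcohort | 1 | Σ_{j∈C} Y_j(t) / Σ_{j∈S} Y_j(t) | 1 / (subcohort sampling fraction) | 1/p(S_i = 1) | 1/p(S_i = 1) |
| case in subcohort before event | 1 | Σ_{j∈C} Y_j(t) / Σ_{j∈S} Y_j(t) | 1 / (subcohort sampling fraction) | 1 | 1/p(S_i = 1) |
| case in subcohort at event | 1 | 1 | 1 | 1 | 1 |
| case outside subcohort before event | 0 | 0 | 0 | 1 | 0 |
| case outside subcohort at event | 1 | 1 | 1 | 1 | 1 |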
The weighting proposed by Barlow [17] aims to retain the original interpretation of a partial likelihood as a conditional probability. Here the subcohort members are weighted by the inverse of the sampling fraction at the event time, and the sum of the weights in (4) then estimates the size of the risk set in the full cohort. Therefore this weighting scheme can also be used for estimation of absolute risks. Because of the difficulty in implementing time-dependent weighting, Barlow et al. [18] used the overall sampling fraction to estimate the weights.
Both (5) and (6) approximately retain the probabilistic interpretation of Cox's partial likelihood, that is, given the event and the risk set at time T_i, they approximate the probability of the event occurring to individual i.
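The weighting schemes discussed in this section can be summarised as a small lookup function. This is a sketch based on the verbal descriptions in the text, not on the original formulas: the scheme labels, argument names and the handling of individuals outside the case-cohort set are all hypothetical choices.

```python
def risk_set_weight(scheme, in_subcohort, is_case, at_event,
                    p_select=None, n_cohort_at_risk=None,
                    n_subcohort_at_risk=None, sampling_fraction=None):
    """Weight of one individual in a risk set under a given scheme."""
    if not in_subcohort and not is_case:
        return 0.0  # not in the case-cohort set at all
    if is_case and at_event:
        return 1.0  # every scheme gives a case weight one at its event time
    if not in_subcohort:
        # case outside the subcohort, before its own event
        return 1.0 if scheme == "kalbfleisch-lawless" else 0.0
    # subcohort member (case or non-case) before any event of its own
    if scheme == "prentice":
        return 1.0
    if scheme == "barlow1994":
        return n_cohort_at_risk / n_subcohort_at_risk
    if scheme == "barlow1999":
        return 1.0 / sampling_fraction
    if scheme == "kalbfleisch-lawless":
        return 1.0 if is_case else 1.0 / p_select
    if scheme == "issp":
        return 1.0 / p_select
    raise ValueError("unknown weighting scheme: " + scheme)
```

For example, a non-case subcohort member with selection probability 0.25 receives weight 4 under the inverse subcohort sampling probability scheme and weight 1 under the Prentice scheme.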
These weighting approaches also resemble the Horvitz-Thompson method described in the previous section. The analysis approach and notation above have referred to a single endpoint of interest; for pseudolikelihood-based analysis of multiple endpoints under the competing risks setting we refer to Sørensen and Andersen [45].
where Δ_i denotes the change in the estimate when individual i is removed from the data. This quantity is often referred to as the dfbeta residual, and these residuals can also be used to check for influential observations in the model fit. The variance estimator corresponds to a sum of squared changes in the parameter estimate vector when observations are removed from the analysis one at a time. Barlow [17] shows that this estimator is equivalent to the one proposed by Lin and Wei [23]. Robust variance estimation is available in SAS versions 8.1 and onwards and in the R survival package and can be applied in the analysis of case-cohort data when appropriate weighting is used.
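The "sum of squared changes" can be written as a sum of outer products of the dfbeta vectors. The sketch below assumes that the dfbeta values have already been obtained from a fitted model (the numbers in the usage example are made up) and that records belonging to the same subject are summed before squaring; the function name is hypothetical.

```python
def robust_variance(dfbeta_rows, ids):
    """Sum over subjects of the outer product of the subject's dfbeta vector.

    dfbeta_rows : one vector per data record (change in the estimate
                  when that record is removed)
    ids         : subject identifier for each record
    """
    per_subject = {}
    for row, subject in zip(dfbeta_rows, ids):
        acc = per_subject.setdefault(subject, [0.0] * len(row))
        for k, value in enumerate(row):
            acc[k] += value
    dim = len(next(iter(per_subject.values())))
    var = [[0.0] * dim for _ in range(dim)]
    for delta in per_subject.values():
        for a in range(dim):
            for b in range(dim):
                var[a][b] += delta[a] * delta[b]
    return var


# Made-up dfbeta values: subject 1 contributes two records, subject 2 one
variance = robust_variance([[0.1, 0.0], [0.2, -0.1], [-0.1, 0.1]], ids=[1, 1, 2])
```

Summing records within a subject matters in the case-cohort setting when one individual is represented by more than one record in the weighted data set.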
Pseudolikelihood-based parameter estimation can be carried out using the SAS procedure PHREG or the R function coxph. Examples of SAS and R code for weighting alternative (5) are presented in Appendix 1. In both codes, a weighted data set is formed first. Cases outside the subcohort are included in the risk set only from a very short time before the event and with weight one. Non-cases in the subcohort are weighted with the inverse of the subcohort sampling probability. Cases in the subcohort require two records: one censored observation for the time before the event with inverse sampling probability weight and one uncensored observation from a very short time before the event with weight one. In SAS, robust standard errors can be computed by defining an ID variable identifying the subjects in the data set and specifying COVSANDWICH(AGGREGATE) in the PHREG procedure. In R, robust variance estimates are obtained by including cluster(ID) in the model formula of the coxph function. Here again ID is a variable identifying the subjects. It should be noted that in SAS versions 9.0 onwards it is possible to specify a weight variable directly, while in earlier SAS versions the logarithm of the weight variable has to be defined as an offset term.
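The construction of the weighted data set described above can be sketched in a few lines. This is not the Appendix code: the record layout (id, start, stop, event, weight), the function name and the length of the "very short time" are assumptions chosen for illustration.

```python
EPS = 1e-4  # "very short time" before the event; the value is an assumption


def build_records(subject_id, entry, exit_time, is_case, in_subcohort, p_select):
    """Return (id, start, stop, event, weight) records for weighting (5)."""
    if is_case and not in_subcohort:
        # enters the risk set just before the event, with weight one
        return [(subject_id, exit_time - EPS, exit_time, 1, 1.0)]
    if not is_case and in_subcohort:
        # censored record weighted by the inverse sampling probability
        return [(subject_id, entry, exit_time, 0, 1.0 / p_select)]
    if is_case and in_subcohort:
        # two records: weighted censored part, then the event with weight one
        return [
            (subject_id, entry, exit_time - EPS, 0, 1.0 / p_select),
            (subject_id, exit_time - EPS, exit_time, 1, 1.0),
        ]
    return []  # non-case outside the subcohort: not in the case-cohort set
```

The resulting records carry the subject identifier, so that the two records of a subcohort case can be aggregated when robust standard errors are computed.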
7 Simulation study
Figure 2 demonstrated that the subcohort selection procedure described in section 2.3 results in similar age distributions for the subcohort and CHD cases in our example cohort. The purpose of the simulation study here is to compare this selection procedure to one where the subcohort would be selected as a simple random sample from the study cohort without adjusting for age. Also, we can compare the efficiency of the case-cohort design to a situation where all data would be collected for the complete cohort. To create a realistic simulation example, we used the endpoint and covariate data for men of our example cohort and simulated only the partially observed genetic covariate and subcohort selection. This way the results tell directly how the alternative methods would have compared to the selected method in the real situation. Following the notation introduced in the previous section, we define E_i = 1 to mean a CHD event during the follow-up for individual i, while E_i = 0 means right censoring. T_i denotes the age at the event time or right censoring. In covariates x_i we included daily smoking, mean blood pressure, non-HDL cholesterol and body mass index, all of which are observed for the full cohort. Given these data, we simulated a binary covariate g_i for our example cohort with fixed effect γ and population frequency π and applied different subcohort selection and estimation procedures to the resulting simulated datasets. Details of the simulation model are described in Appendix 2. The same covariates x_i as in the simulation model were also used in the case-cohort analysis.
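Summary statistics from 5000 replications: sample mean of point estimates (mean), standard deviation of point estimates (sd), sample mean of standard error estimates (mean est. sd), and power of the test for H_0 : γ = 0 at the 5% significance level. Cohort refers to a situation where the covariate in question is collected for every cohort member. CC (SRS) refers to subcohort selection with simple random sampling and CC (MORGAM) to the subcohort selection procedure described in section 2.3. Weighting method K. & L. refers to Kalbfleisch and Lawless (1988) and ISSP to the inverse subcohort sampling probability weighting as defined in (5).

| π | γ | sampling design | weighting method | mean | sd | mean est. sd | power |
|---|---|---|---|---|---|---|---|
| 0.2 | 0.0 | cohort | - | 0.021 | 0.256 | 0.247 | 0.052 |
| 0.2 | 0.0 | CC (SRS) | Prentice | 0.001 | 0.404 | 0.362 | 0.076 |
| 0.2 | 0.0 | CC (SRS) | K. & L. | 0.001 | 0.424 | 0.383 | 0.075 |
| 0.2 | 0.0 | CC (SRS) | ISSP | 0.001 | 0.447 | 0.402 | 0.070 |
| 0.2 | 0.0 | CC (MORGAM) | Prentice | 0.010 | 0.321 | 0.310 | 0.053 |
| 0.2 | 0.0 | CC (MORGAM) | K. & L. | 0.009 | 0.332 | 0.325 | 0.050 |
| 0.2 | 0.0 | CC (MORGAM) | ISSP | 0.009 | 0.337 | 0.331 | 0.049 |
| 0.2 | 0.3 | cohort | - | 0.296 | 0.238 | 0.231 | 0.283 |
| 0.2 | 0.3 | CC (SRS) | Prentice | 0.325 | 0.391 | 0.347 | 0.188 |
| 0.2 | 0.3 | CC (SRS) | K. & L. | 0.342 | 0.411 | 0.367 | 0.189 |
| 0.2 | 0.3 | CC (SRS) | ISSP | 0.355 | 0.438 | 0.390 | 0.174 |
| 0.2 | 0.3 | CC (MORGAM) | Prentice | 0.313 | 0.312 | 0.297 | 0.207 |
| 0.2 | 0.3 | CC (MORGAM) | K. & L. | 0.324 | 0.323 | 0.312 | 0.198 |
| 0.2 | 0.3 | CC (MORGAM) | ISSP | 0.329 | 0.330 | 0.320 | 0.191 |
| 0.2 | 0.6 | cohort | - | 0.613 | 0.226 | 0.218 | 0.787 |
| 0.2 | 0.6 | CC (SRS) | Prentice | 0.670 | 0.393 | 0.337 | 0.510 |
| 0.2 | 0.6 | CC (SRS) | K. & L. | 0.705 | 0.405 | 0.354 | 0.514 |
| 0.2 | 0.6 | CC (SRS) | ISSP | 0.735 | 0.449 | 0.384 | 0.484 |
| 0.2 | 0.6 | CC (MORGAM) | Prentice | 0.634 | 0.309 | 0.287 | 0.592 |
| 0.2 | 0.6 | CC (MORGAM) | K. & L. | 0.660 | 0.322 | 0.303 | 0.582 |
| 0.2 | 0.6 | CC (MORGAM) | ISSP | 0.667 | 0.331 | 0.313 | 0.565 |
| 0.4 | 0.0 | cohort | - | 0.001 | 0.204 | 0.200 | 0.054 |
| 0.4 | 0.0 | CC (SRS) | Prentice | 0.006 | 0.333 | 0.296 | 0.075 |
| 0.4 | 0.0 | CC (SRS) | K. & L. | 0.006 | 0.349 | 0.313 | 0.076 |
| 0.4 | 0.0 | CC (SRS) | ISSP | 0.007 | 0.368 | 0.329 | 0.071 |
| 0.4 | 0.0 | CC (MORGAM) | Prentice | 0.003 | 0.265 | 0.251 | 0.059 |
| 0.4 | 0.0 | CC (MORGAM) | K. & L. | 0.003 | 0.274 | 0.263 | 0.058 |
| 0.4 | 0.0 | CC (MORGAM) | ISSP | 0.004 | 0.278 | 0.268 | 0.054 |
| 0.4 | 0.3 | cohort | - | 0.314 | 0.201 | 0.196 | 0.372 |
| 0.4 | 0.3 | CC (SRS) | Prentice | 0.330 | 0.327 | 0.294 | 0.225 |
| 0.4 | 0.3 | CC (SRS) | K. & L. | 0.347 | 0.344 | 0.311 | 0.223 |
| 0.4 | 0.3 | CC (SRS) | ISSP | 0.357 | 0.365 | 0.328 | 0.212 |
| 0.4 | 0.3 | CC (MORGAM) | Prentice | 0.317 | 0.258 | 0.249 | 0.253 |
| 0.4 | 0.3 | CC (MORGAM) | K. & L. | 0.329 | 0.266 | 0.261 | 0.248 |
| 0.4 | 0.3 | CC (MORGAM) | ISSP | 0.332 | 0.272 | 0.267 | 0.238 |
| 0.4 | 0.6 | cohort | - | 0.629 | 0.200 | 0.196 | 0.890 |
| 0.4 | 0.6 | CC (SRS) | Prentice | 0.663 | 0.331 | 0.294 | 0.606 |
| 0.4 | 0.6 | CC (SRS) | K. & L. | 0.700 | 0.347 | 0.311 | 0.604 |
| 0.4 | 0.6 | CC (SRS) | ISSP | 0.721 | 0.374 | 0.332 | 0.586 |
| 0.4 | 0.6 | CC (MORGAM) | Prentice | 0.644 | 0.261 | 0.250 | 0.730 |
| 0.4 | 0.6 | CC (MORGAM) | K. & L. | 0.668 | 0.270 | 0.262 | 0.719 |
| 0.4 | 0.6 | CC (MORGAM) | ISSP | 0.673 | 0.277 | 0.269 | 0.710 |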
The simulation results suggest that the estimators of the regression coefficients have some bias away from zero that seems to increase with the true value of the parameter. We also repeated the simulations with negative values of γ (results not shown) and the bias was always away from zero. The robust variance estimator used seems to have a slight negative bias, which appears to be more severe under simple random sampling. These biases are expected to be small-sample properties of the estimators that disappear with larger sample sizes [20]. In MORGAM the aim is to pool case-cohort sets from several cohorts for analysis, and therefore in practice the sample sizes will be larger than in the current example. All the estimation methods considered gave reasonable results. The estimators based on (5) and (6) had near-identical performance, while the Prentice estimator seemed to work better than these in terms of the observed bias in the point estimates. This observation matches the results reported recently by Onland-Moret et al. [20]. The Prentice estimator also gave slightly lower variance than the estimators that utilised the subcohort sampling probabilities, but on the other hand the negative bias in the robust variance estimates seemed to be slightly larger for the Prentice weighting used with the MORGAM subcohort selection.
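As a rough illustration of how the reported standard deviations translate into efficiency, the following R sketch computes the ratio of empirical variances for one scenario (μ = 0.2, γ = 0.3, Prentice weighting), using only the sd values reported for that scenario. The choice of scenario and the names `sd_est`, `rel_eff` and `mc_se` are ours. The last line gives the Monte Carlo standard error of an empirical rejection rate under a true rate of 0.05 with 5000 replications, which helps in judging how far the reported test sizes are from the nominal level.

```r
# Empirical standard deviations of the point estimates
# (mu = 0.2, gamma = 0.3, Prentice weighting; values from the results table).
sd_est <- c(cohort = 0.238, cc_srs = 0.391, cc_morgam = 0.312)

# Relative efficiency: variance of the full-cohort estimator divided by
# the variance of each estimator.
rel_eff <- (unname(sd_est["cohort"]) / sd_est)^2
round(rel_eff, 2)

# Monte Carlo standard error of an empirical rejection rate when the
# true rate is 0.05 and 5000 replications are used.
mc_se <- sqrt(0.05 * 0.95 / 5000)
mc_se
```

In this scenario the MORGAM selection retains about 58% of the full-cohort efficiency and simple random sampling about 37%. These ratios are based on simulation summaries and carry their own Monte Carlo error.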
8 Discussion
The cost-effectiveness and the availability of software for the analysis make the case-cohort design appealing to epidemiologists. The literature on the case-cohort design is rich in articles on the analysis of case-cohort data, but little is written about designing case-cohort studies. In this paper we have considered the implementation of the case-cohort design in the MORGAM Project and proposed procedures to accommodate an earlier case-cohort selection. The case-cohort design has several advantages in the MORGAM Project. Firstly, it allows the study of multiple endpoints using the same subcohort and gives more freedom for defining the endpoint of interest at the analysis stage. Secondly, because the subcohort is a random sample of the original cohort and is selected without reference to any specific case definition, the subcohort can be used to estimate population parameters. Our subcohort selection procedure is general and can be applied in other situations.
In epidemiological studies age-matched controls are commonly used to improve the efficiency of the design. In MORGAM this is achieved by matching the age distribution rather than matching individuals, as is demonstrated in Figure 2. Another feature of the MORGAM case-cohort design is its natural extension to incorporate updated follow-up data after the follow-up period has been extended. A similar extension can be employed to achieve maximum overlap between the locally planned designs and the MORGAM case-cohort design. The sampling of the subcohort in MORGAM cohorts depends on the observed number of events and the mortality rate estimated from the data. Alternatively, this information could be acquired from external sources, for example, population mortality data and population event registers.
The genetic substudy of MORGAM is analysed as a prospective study even though the selection of the case-cohort set and the genotyping of individuals are done retrospectively. Because the genotype information can be assumed static over time, this study avoids many potential complications that would arise if the collected covariate information depended on the time of taking the specimen. For example, no further matching is needed to ensure that the measured covariate values are compatible, and the application of the case-cohort design is straightforward.
When carefully planned and analysed, case-cohort designs are powerful choices for follow-up studies with multiple event types. Our experiences in designing, coordinating and analysing the MORGAM case-cohort study are potentially useful for other studies with similar characteristics. To summarise, for efficient selection of the subcohort, we recommend use of the follow-up and covariate data collected for the entire cohort. The proposed subcohort selection procedure has a natural extension to augmenting the subcohort after identification of new cases, for example due to an extension of the follow-up. For obtaining summary statistics based on the subcohort, Horvitz-Thompson style weighting has to be used as described in section 6.1. Otherwise the design gives the freedom to choose a suitable analysis method at the data analysis stage, and we recommend that different methods be tried and the results compared. Likelihood-based approaches will open up new avenues for the analysis of case-cohort data. Case-cohort designs and the analysis of case-cohort data remain interesting research problems.
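The weighting of subcohort summary statistics mentioned above can be sketched in R as follows. This is a minimal illustration under our own assumptions, not the formula of section 6.1: the helper `ht_mean` and the toy data are hypothetical. By default the weighted total is divided by the sum of the weights (a ratio form); passing the known cohort size `N` gives the classical Horvitz-Thompson mean.

```r
# Inverse-probability-weighted mean of y over subcohort members.
# y:    values observed in the subcohort (e.g. a 0/1 genotype indicator)
# prob: subcohort selection probability of each selected individual
# N:    optional known cohort size; if NULL, the sum of weights is used
ht_mean <- function(y, prob, N = NULL) {
  w <- 1 / prob
  denom <- if (is.null(N)) sum(w) else N
  sum(w * y) / denom
}

# Toy data: carriers were sampled with probability 0.5, others with 0.25.
y <- c(1, 0, 0, 1)
prob <- c(0.5, 0.25, 0.25, 0.5)

mean(y)           # unweighted mean overstates the carrier frequency
ht_mean(y, prob)  # weighted estimate: (2 + 2) / (2 + 4 + 4 + 2)
```

With equal selection probabilities the weighted and unweighted means coincide, which is why the weighting matters only when the subcohort is selected with unequal probabilities.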
Appendix 1: program codes for case-cohort analysis of proportional hazards regression model
These codes implement pseudolikelihood analysis with inverse subcohort sampling probability weighting as defined in (5). The Prentice weighting can be applied by giving a vector of unit probabilities or by not defining the weight or offset variable in the model definition.
R
library(survival)  # provides Surv(), coxph() and cluster()

ccregression <- function(dataset, covariates, selected, idvar,
                         censvar, agestart, agestop, prob, subcoh) {
  # Arguments for the function are:
  # dataset: R data frame; if this is created by reading a CSV file,
  #   missing data in that file have to be coded with empty fields and
  #   indicator variables have to be coded as 1s and 0s.
  # covariates: A vector of covariate names.
  # selected: Logical expression indicating the observations to be
  #   selected into the analysis.
  # The following variables are entered as R expressions:
  # idvar: Identification variable for the individuals.
  # censvar: Case status (1=case, 0=not case); can be a logical expression.
  # agestart: Variable for age at the start of the follow-up.
  # agestop: Variable for age at the end of the follow-up.
  # prob: Variable for subcohort selection probability.
  # subcoh: Variable for inclusion in the subcohort.

  # Select the observations first and evaluate the remaining expressions
  # in the selected rows, so that every vector has n elements.
  dataset <- dataset[eval(selected, dataset), ]
  n <- nrow(dataset)
  epsilon <- 0.00001
  idvar <- eval(idvar, dataset)
  censvar <- as.numeric(eval(censvar, dataset))
  agestart <- as.numeric(eval(agestart, dataset))
  agestop <- as.numeric(eval(agestop, dataset))
  prob <- as.numeric(eval(prob, dataset))
  subcoh <- as.numeric(eval(subcoh, dataset))
  z <- matrix(NA, n, length(covariates))
  for (i in seq_along(covariates))
    z[, i] <- as.numeric(dataset[[covariates[i]]])
  colnames(z) <- covariates
  start <- NULL
  stop <- NULL
  cens <- NULL
  weight <- NULL
  keys <- NULL
  for (i in seq_len(n)) {
    # Case outside subcohort: enters only at the event time, weight 1
    if (censvar[i] && !subcoh[i]) {
      start <- c(start, agestop[i] - epsilon)
      stop <- c(stop, agestop[i])
      cens <- c(cens, 1)
      weight <- c(weight, 1)
      keys <- c(keys, idvar[i])
    }
    # Non-case in subcohort: whole follow-up, weight 1/prob
    else if (!censvar[i] && subcoh[i]) {
      start <- c(start, agestart[i])
      stop <- c(stop, agestop[i])
      cens <- c(cens, 0)
      weight <- c(weight, 1 / prob[i])
      keys <- c(keys, idvar[i])
    }
    # Case in subcohort: weighted follow-up until just before the event,
    # followed by the event itself with weight 1
    else if (censvar[i] && subcoh[i]) {
      start <- c(start, agestart[i])
      stop <- c(stop, agestop[i] - epsilon)
      cens <- c(cens, 0)
      weight <- c(weight, 1 / prob[i])
      keys <- c(keys, idvar[i])
      start <- c(start, agestop[i] - epsilon)
      stop <- c(stop, agestop[i])
      cens <- c(cens, 1)
      weight <- c(weight, 1)
      keys <- c(keys, idvar[i])
    }
  }
  y <- Surv(start, stop, cens)
  # drop = FALSE keeps z_ a matrix when there is only one covariate
  z_ <- z[match(keys, idvar), , drop = FALSE]
  return(coxph(y ~ z_ + cluster(as.factor(keys)), weights = weight))
}
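To make the data layout used by the function above easier to follow, here is a small self-contained R sketch of the same row-splitting idea on four hypothetical individuals. The helper name `expand_case_cohort`, its vectorised form and the toy values are our own assumptions for illustration; it builds only the (start, stop] rows and weights and does not fit a model.

```r
# Build counting-process rows for a case-cohort set.
# Subcohort members contribute an at-risk interval with weight 1/prob
# (ending just before the event age for cases); every case contributes
# a short interval ending at the event age with weight 1.
expand_case_cohort <- function(id, case, subcoh, agestart, agestop, prob,
                               epsilon = 1e-5) {
  s <- which(subcoh == 1)
  at_risk <- data.frame(id = id[s],
                        start = agestart[s],
                        stop = agestop[s] - epsilon * case[s],
                        event = rep(0, length(s)),
                        weight = 1 / prob[s])
  k <- which(case == 1)
  events <- data.frame(id = id[k],
                       start = agestop[k] - epsilon,
                       stop = agestop[k],
                       event = rep(1, length(k)),
                       weight = rep(1, length(k)))
  out <- rbind(at_risk, events)
  out[order(out$id, out$start), ]
}

# Toy data: 1 = case in subcohort, 2 = non-case in subcohort,
# 3 = case outside subcohort, 4 = non-case outside subcohort.
cc_rows <- expand_case_cohort(id = 1:4,
                              case = c(1, 0, 1, 0),
                              subcoh = c(1, 1, 0, 0),
                              agestart = c(50, 45, 55, 40),
                              agestop = c(60, 70, 65, 75),
                              prob = rep(0.25, 4))
cc_rows
```

Individual 1 yields two rows, individuals 2 and 3 one row each, and individual 4 none, because a non-case outside the subcohort is not part of the case-cohort set.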
SAS
%MACRO ccregression(dataset, covariates, selected, idvar,
                    case, agestart, agestop, prob, subcoh);
/*
Arguments for the macro are:
  dataset: A SAS data file.
  covariates: List of covariate names.
  selected: Logical expression indicating the observations to be
    selected into the analysis.
  idvar: Identification variable for the individuals.
  case: Case status (1=case, 0=not case); can be a logical expression.
  agestart: Variable for age at the start of the follow-up.
  agestop: Variable for age at the end of the follow-up.
  prob: Variable for subcohort selection probability.
  subcoh: Variable for inclusion in the subcohort.
*/
DATA casecoh; SET &dataset.(WHERE=(&selected.));
RUN;
%LET epsilon=0.00001;
DATA weighted; SET casecoh;
  IF &case. THEN DO;
    /* cases within the subcohort: follow-up until just before the event */
    IF (&subcoh. NE 0) THEN DO;
      start = &agestart.;
      survtime = &agestop. - &epsilon.;
      cens = 0;
      w = 1/&prob.;
      wt = log(w);
      OUTPUT;
    END;
    /* all cases: the event itself with unit weight */
    survtime = &agestop.;
    start = survtime - &epsilon.;
    cens = 1;
    w = 1;
    wt = log(w);
    OUTPUT;
  END;
  /* non-cases within the subcohort */
  ELSE IF (&subcoh. NE 0) AND NOT(&case.) THEN DO;
    survtime = &agestop.;
    start = &agestart.;
    w = 1/&prob.;
    wt = log(w);
    cens = 0;
    OUTPUT;
  END;
RUN;
PROC SORT DATA=weighted;
  BY &idvar. start survtime;
RUN;
/* SAS 8.1 and later: weights enter through the offset log(w) */
PROC PHREG DATA=weighted COVSANDWICH(AGGREGATE);
  MODEL (start,survtime)*cens(0) = &covariates. / RL OFFSET=wt;
  ID &idvar.;
RUN;
/* SAS 9.0 and later: weights enter through the WEIGHT statement */
PROC PHREG DATA=weighted COVSANDWICH(AGGREGATE);
  MODEL (start,survtime)*cens(0) = &covariates. / RL;
  WEIGHT w;
  ID &idvar.;
RUN;
%MEND;
Appendix 2: simulation details
where
λ_i(t) = ακ(αt)^{κ − 1} exp(β'x_i + γg_i).
Fixing the regression coefficient γ and allele frequency μ, parameters β, κ and λ and covariates g_i were simulated using Markov chain Monte Carlo sampling. 5000 datasets of covariates g_i were produced with six different combinations of parameter values γ and μ as shown in Table 3, and subcohort selection and parameter estimation were carried out for each of these.
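For reference, the Weibull-type hazard given above integrates in closed form. The following derivation is our own addition and uses only the hazard as stated; it measures time from t = 0 and ignores any conditioning on the age at the start of the follow-up, which the simulation model may require.

```latex
\Lambda_i(t) = \int_0^t \lambda_i(s)\,ds
             = (\alpha t)^{\kappa} \exp(\beta' x_i + \gamma g_i),
\qquad
S_i(t) = \exp\{-\Lambda_i(t)\}
       = \exp\{-(\alpha t)^{\kappa} \exp(\beta' x_i + \gamma g_i)\}.
```

Here \Lambda_i(t) is the cumulative hazard and S_i(t) the corresponding survival function; the result follows because the derivative of (\alpha t)^{\kappa} with respect to t is \alpha\kappa(\alpha t)^{\kappa - 1}.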
Declarations
Acknowledgements
The research was supported by the European Commission by the MORGAM Project grant under the Biomed 2 programme and by the GenomEUtwin Project grant under the programme 'Quality of Life and Management of the Living Resources' of 5th Framework Programme and by the Academy of Finland via its grant number 53646. The first author was a visiting researcher at the Department of Mathematics and Statistics, University of Helsinki and her research was supported by the Academy of Finland via its grant number 114786.
The participants of the MORGAM Project contributed to and made decisions on the study design. MORGAM Participating Centres and Principal Investigators: Australia, Newcastle (Patrick McElduff, Faculty of Health, University of Newcastle; Former Principal Investigator: Annette Dobson, University of Queensland, Brisbane); Denmark, Glostrup (Torben Jørgensen, Københavns AMT, Centre of Preventive Medicine, Glostrup); Finland, FINRISK (Veikko Salomaa, National Public Health Institute, Helsinki); Finland, ATBC (Jarmo Virtamo, National Public Health Institute, Helsinki); France, PRIME/Strasbourg (Dominique Arveiler, Department of Epidemiology and Public Health, Faculty of Medicine, Strasbourg); France, PRIME/Toulouse (Jean Ferrières, Department of Epidemiology, Faculty of Medicine, Toulouse-Purpan, Toulouse); France, PRIME/Lille (Philippe Amouyel, Department of Epidemiology and Public Health, Pasteur Institute of Lille); Germany, Augsburg (Angela Döring, GSF-National Research Centre for Environment and Health, Institute of Epidemiology, Neuherberg); Italy, Brianza (Giancarlo Cesana and Marco Ferrario, University of Milan – Bicocca, Monza); Italy, Friuli (Diego Vanuzzo and Lorenza Pilotto, Centre for Cardiovascular Prevention, ASS4 "Medio Friuli", Udine); Italy, Rome (Simona Giampaoli and Luigi Palmieri, National Institute of Health, Rome); Lithuania, Kaunas (Abdonas Tamosiunas, Former Principal Investigator: Stanislava Domarkiene, Kaunas University of Medicine, Institute of Cardiology, Kaunas); Poland, Krakow (Andrzej Pajak, Unit of Clinical Epidemiology and Population Studies, School of Public Health, Jagiellonian University, Krakow); Poland, Warsaw (Grazyna Broda, Department of Cardiovascular Epidemiology and Prevention, National Institute of Cardiology, Warsaw); Russian Federation, Novosibirsk (Yuri Nikitin, Institute of Internal Medicine, Novosibirsk); Sweden, Northern Sweden (Birgitta Stegmayr, Umeå University Hospital, Department of Medicine, Umeå); United Kingdom, PRIME/Belfast (Alun Evans, The Queen's University of Belfast, Belfast, Northern Ireland); United Kingdom, Scotland (Hugh Tunstall-Pedoe, University of Dundee, Dundee, Scotland); United Kingdom, Caerphilly (John Yarnell, The Queen's University of Belfast, Belfast, Northern Ireland). National Coordinators: France, PRIME/France: Pierre Ducimetière, National Institute of Health and Medical Research (U258), Paris; Italy: Marco Ferrario, University of Insubria, Varese, Italy. Coordinating Centre: Alun Evans, Department of Epidemiology and Public Health, The Queen's University of Belfast, Belfast, Northern Ireland. MORGAM Data Centre: Kari Kuulasmaa, Sangita Kulathinal, Juha Karvanen, Olli Saarela, Department of Health Promotion and Chronic Disease Prevention, National Public Health Institute (KTL), Helsinki, Finland. Participating Laboratories: Markus Perola, Department of Molecular Medicine, National Public Health Institute (KTL), Helsinki, Finland; François Cambien, Laurence Tiret, INSERM U525, Paris, France; Denis Shields, Royal College of Surgeons in Ireland, Dublin, Ireland.
MORGAM Management Group: Alun Evans, Department of Epidemiology and Public Health, The Queen's University of Belfast, Northern Ireland; Stefan Blankenberg, Department of Medicine II, Johannes Gutenberg-University, Mainz, Germany; François Cambien, INSERM U525, Paris, France; Marco Ferrario, University of Insubria, Varese, Italy; Kari Kuulasmaa, MORGAM Data Centre, National Public Health Institute, Finland; Leena Peltonen, Department of Molecular Medicine, National Public Health Institute, Finland; Markus Perola, Department of Molecular Medicine, National Public Health Institute, Finland; Veikko Salomaa, Department of Health Promotion and Chronic Disease Prevention, National Public Health Institute, Finland; Denis Shields, Clinical Pharmacology, Royal College of Surgeons in Ireland, Dublin, Ireland; Birgitta Stegmayr, University of Umeå, Sweden; Hugh Tunstall-Pedoe, Cardiovascular Epidemiology Unit, University of Dundee, Scotland; Kjell Asplund, The National Board of Health and Welfare, Stockholm, Sweden.
References
1. Tunstall-Pedoe H (Ed): MONICA Monograph and Multimedia Sourcebook. Geneva: WHO; 2003.
2. Evans A, Salomaa V, Kulathinal S, Asplund K, Cambien F, Ferrario M, Perola M, Peltonen L, Shields D, Tunstall-Pedoe H, Kuulasmaa K for The MORGAM Project: MORGAM (an international pooling of cardiovascular cohorts). International Journal of Epidemiology 2005, 34:21–27.
3. Prentice RL: A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika 1986, 73:1–11.
4. Kupper LL, McMichael AJ, Spirtas R: A Hybrid Epidemiologic Study Design Useful in Estimating Relative Risk. Journal of the American Statistical Association 1975, 70:524–528.
5. Miettinen O: Estimability and estimation in case-referent studies. American Journal of Epidemiology 1976, 103:226–235.
6. R Development Core Team: R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2006. [http://www.R-project.org]
7. SAS Institute Inc: SAS/STAT User's Guide, Version 8. 2000.
8. Borgan Ø, Langholz B, Samuelsen SO, Goldstein L, Pogoda J: Exposure stratified case-cohort designs. Lifetime Data Analysis 2000, 6:39–58.
9. Kulich M, Lin DY: Improving the Efficiency of Relative-Risk Estimation in Case-Cohort Studies. Journal of the American Statistical Association 2004, 99(467):832–844.
10. Samuelsen SO, Nestad H, Skrondahl A: Stratified case-cohort analysis of general cohort sampling designs. Scandinavian Journal of Statistics 2007, 34:103–119.
11. Kim S, De Gruttola V: Strategies for Cohort Sampling Under the Cox Proportional Hazards Model, Application to an AIDS Clinical Trial. Lifetime Data Analysis 1999, 5:149–172.
12. Cai J, Zeng D: Sample size/power calculation for case-cohort studies. Biometrics 2004, 60:1015–1024.
13. Kim MY, Xue X, Du Y: Approaches for calculating power for case-cohort studies. Biometrics 2006, 62:929–933.
14. Cox DR: Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Series B 1972, 74:187–200.
15. Self SG, Prentice RL: Asymptotic Distribution theory and efficiency results for case-cohort studies. The Annals of Statistics 1988, 15:54–81.
16. Kalbfleisch JD, Lawless JF: Likelihood analysis of multi-state models for disease incidence and mortality. Statistics in Medicine 1988, 7:149–160.
17. Barlow WE: Robust variance estimation for the case-cohort design. Biometrics 1994, 50:1064–1072.
18. Barlow WE, Ichikawa L, Rosner D, Izumi S: Analysis of case-cohort designs. Journal of Clinical Epidemiology 1999, 12:1165–1172.
19. Petersen L, Sørensen TIA, Andersen PK: Comparison of case-cohort estimators based on data on premature death of adult adoptees. Statistics in Medicine 2003, 22:3795–3803.
20. Onland-Moret NC, van der A DL, van der Schouw YT, Buschers W, Elias SG, van Gils CH, Koerselman J, Roest M, Grobbee DE, Peeters PHM: Analysis of case-cohort data: a comparison of different methods. Journal of Clinical Epidemiology 2007, 60:350–355.
21. Wacholder S: Practical considerations in choosing between the case-cohort and nested case-control designs. Epidemiology 1991, 2:155–158.
22. Lin DY, Ying Z: Cox regression with incomplete covariate measurements. Journal of the American Statistical Association 1993, 88:1341–1349.
23. Lin DY, Wei LJ: The robust inference for the Cox proportional hazards model. Journal of the American Statistical Association 1989, 84:1074–1078.
24. Therneau TM, Li H: Computing the Cox model for case cohort designs. Lifetime Data Analysis 1999, 5:99–112.
25. Langholz B, Jiao J: Computational methods for case-cohort studies. Computational Statistics & Data Analysis 2007, 51:3737–3748.
26. Scheike TH, Martinussen T: Maximum likelihood estimation for Cox's regression model under case-cohort sampling. Scandinavian Journal of Statistics 2004, 31:283–293.
27. Kulathinal S, Arjas E: Bayesian inference from case-cohort data with multiple endpoints. Scandinavian Journal of Statistics 2006, 33:25–36.
28. Saarela O, Kulathinal S: Conditional likelihood inference in a case-cohort design: An application to haplotype analysis. The International Journal of Biostatistics 2007, 3. [http://www.bepress.com/ijb/vol3/iss1/1]
29. Langholz B, Thomas DC: Nested case-control and case-cohort methods of sampling from a cohort: a critical comparison. American Journal of Epidemiology 1990, 131:169–176.
30. Langholz B, Duncan DC: Efficiency of cohort sampling designs: some surprising results. Biometrics 1991, 47:1563–1571.
31. Zhao LP, Lipsitz S: Designs and analysis of two-stage studies. Statistics in Medicine 1992, 11:769–782.
32. Chen K, Lo S: Case-cohort and case-control analysis with Cox's model. Biometrika 1999, 85:755–754.
33. Chen K: Generalised case-cohort sampling. Journal of the Royal Statistical Society, Series B 2001, 63:823–838.
34. Paszata LF, Vallisb KA, Benka VMA, Groomec PA, Mackillopc WJ, Wielgosz A: A population-based case-cohort study of the risk of myocardial infarction following radiation therapy for breast cancer. Radiotherapy and Oncology 2007, 82(3):294–300.
35. Friesema IHM, Zwietering PJ, Veenstra MY, Knottnerus JA, Garretsen HFL, Lemmens PHHM: Alcohol intake and cardiovascular disease and mortality: the role of pre-existing disease. Journal of Epidemiology and Community Health 2007, 61:441–446.
36. Bernatsky S, Boivin JF, Joseph L, Gordon C, Urowitz M, Gladman D, Ginzler E, Fortin P, Bae SC, Barr S, Isenberg D, Rahman A, Petri M, Alarcón G, Aranow C, Dooley MA, Rajan R, Sénécal JL, Zummer M, Manzi S, Edworthy S, Ramsey-Goldman R, Clarke A: The relationship between cancer and medication exposures in systemic lupus erythematosus: a case-cohort study. Annals of the Rheumatic Diseases 2007. doi:101136/ard.2006.069039
37. Ray RM, Gao DL, Li W, Wernli KJ, Astrakianakis G, Seixas NS, Camp JE, Fitzgibbons ED, Feng Z, Thomas DB, Checkoway H: Occupational exposures and breast cancer among women textile workers in Shanghai. Epidemiology 2007, 8(3):383–392.
38. MORGAM Project: MORGAM Manual. [http://www.ktl.fi/publications/morgam/manual/contents.htm]
39. Ury HK: Efficiency of case-control studies with multiple controls per case: continuous and dichotomous data. Biometrics 1975, 31:643–649.
40. Hanurav TV: Optimum utilization of auxiliary information: π_{ps} sampling of two units from a stratum. Journal of the Royal Statistical Society, Series B 1967, 29:374–391.
41. Vijayan K: An exact π_{ps} sampling scheme: generalization of a method of Hanurav. Journal of the Royal Statistical Society, Series B 1968, 30:556–566.
42. Kulathinal S, Niemelä M, Kuulasmaa K, contributors from Participating Centres for the MORGAM Project: Description of MORGAM cohorts. [http://www.ktl.fi/publications/morgam/cohorts/index.html]
43. Horvitz DG, Thompson DJ: A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association 1952, 47:663–685.
44. Särndal CE, Swensson B, Wretman J: Model assisted survey sampling. New York: Springer-Verlag; 1992.
45. Sørensen P, Anderson PK: Competing risks analysis of the case-cohort design. Biometrika 2000, 87:49–59.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.