Causal analysis of case-control data
Stephen C Newman
DOI: 10.1186/1742-5573-3-2
© Newman. 2006
Received: 20 July 2005
Accepted: 27 January 2006
Published: 27 January 2006
Abstract
In a series of papers, Robins and colleagues describe inverse probability of treatment weighted (IPTW) estimation in marginal structural models (MSMs), a method of causal analysis of longitudinal data based on counterfactual principles. This family of statistical techniques is similar in concept to weighting of survey data, except that the weights are estimated using study data rather than defined so as to reflect sampling design and post-stratification to an external population. Several decades ago Miettinen described an elementary method of causal analysis of case-control data based on indirect standardization. In this paper we extend the Miettinen approach using ideas closely related to IPTW estimation in MSMs. The technique is illustrated using data from a case-control study of oral contraceptives and myocardial infarction.
Introduction
In a series of papers, Robins and colleagues describe inverse probability of treatment weighted (IPTW) estimation in marginal structural models (MSMs) [1–7], a method of causal analysis of longitudinal data based on counterfactual principles. This family of statistical techniques is similar in concept to weighting of survey data, except that weights are estimated using study data rather than defined so as to reflect sampling design and post-stratification to an external population. Several decades ago Miettinen [8] described an elementary method of causal analysis of case-control data based on indirect standardization. In this paper we extend the Miettinen approach using ideas closely related to IPTW estimation in MSMs. For simplicity we ignore random error until the illustrative example.
Population-based incidence case-control study
Consider a population-based case-control study having an incidence design, that is, one in which only incident cases are eligible for recruitment. Let E be a dichotomous variable (0: absent, 1: present) representing the exposure of interest, and let F be a polychotomous variable (i = 0, 1, ..., I), which we later treat as a confounder. At any time point we may think of the population as being composed of exposed and unexposed (sub)populations. Suppose that recruitment of cases and controls takes place over a period of T years. We assume that during the period of recruitment the exposed and unexposed populations are stationary (i.e., independent of time) with respect to population size and incidence rate (of disease) in each of the strata of F [9]. Provided that T is not too large, say no more than two or three years, this assumption is likely to be approximately satisfied in practice.
Let $N_{1i}$ be the number of people in the ith stratum of the exposed population who are free of disease (at any time during the period of recruitment), and let $N_{0i}$ be the corresponding number in the ith stratum of the unexposed population. Let $N_1 = \sum_i N_{1i}$ and $N_0 = \sum_i N_{0i}$. Therefore at any time during the period of recruitment, there are $N_1$ exposed and $N_0$ unexposed people in the population "at risk" of disease, hence eligible to be controls. Since the population is stationary, we may assume that controls are selected at the end of the period of recruitment. This avoids the inconvenience of having a control selected early in the study become a case later on. In practice, controls are usually sampled throughout the period of recruitment, with one or more controls enrolled as each case enters the study. The case triggering this activity and the associated controls can be thought of as a matched set, where the matching variable is "time." This method of subject recruitment is a type of risk set sampling and, in theory, should be followed by a conditional statistical analysis [10]. Generally, matching on time is ignored in the analysis of case-control data, which in practical terms is not that different from making the stationary population assumption.
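Let $R_{1i}$ and $R_{0i}$ be the incidence rates (of disease) in the ith stratum of the exposed and unexposed populations, respectively. The crude incidence rates are

$$R_1 = \frac{\sum_i R_{1i} N_{1i}}{N_1} \quad \text{and} \quad R_0 = \frac{\sum_i R_{0i} N_{0i}}{N_0}. \tag{1}$$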
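The impact of exposure can be measured using the standardized morbidity ratio, which has different forms depending on the choice of standard population [11]. Taking the standard population to be, in turn, the exposed, unexposed, and total (exposed plus unexposed) populations, the corresponding standardized morbidity ratios are

$$\mathrm{SMR}_E = \frac{\sum_i R_{1i} N_{1i}}{\sum_i R_{0i} N_{1i}}, \qquad \mathrm{SMR}_U = \frac{\sum_i R_{1i} N_{0i}}{\sum_i R_{0i} N_{0i}} \quad \text{and} \quad \mathrm{SMR}_T = \frac{\sum_i R_{1i} (N_{1i} + N_{0i})}{\sum_i R_{0i} (N_{1i} + N_{0i})}. \tag{2}$$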
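As a numerical illustration of the crude rates and the three standardized morbidity ratios, the following Python sketch uses a hypothetical two-stratum population. The stratum sizes and rates are invented for illustration and do not come from the paper; each ratio applies the exposed and unexposed stratum rates to the same standard population.

```python
# Hypothetical stratum-level inputs (invented for illustration).
# N1[i], N0[i]: disease-free people in stratum i of the exposed / unexposed population.
# R1[i], R0[i]: incidence rates (per person-year) in stratum i.
N1 = [1000.0, 3000.0]
N0 = [4000.0, 2000.0]
R1 = [0.004, 0.010]
R0 = [0.002, 0.004]


def weighted_rate(rates, sizes):
    """Average of the stratum rates, weighted by the given stratum sizes."""
    return sum(r * n for r, n in zip(rates, sizes)) / sum(sizes)


def smr(standard):
    """Standardized morbidity ratio with `standard` as the standard population."""
    return weighted_rate(R1, standard) / weighted_rate(R0, standard)


crude_r1 = weighted_rate(R1, N1)  # crude rate in the exposed
crude_r0 = weighted_rate(R0, N0)  # crude rate in the unexposed

NT = [n1 + n0 for n1, n0 in zip(N1, N0)]  # total population by stratum
smr_e, smr_u, smr_t = smr(N1), smr(N0), smr(NT)

print(crude_r1 / crude_r0, smr_e, smr_u, smr_t)
```

With these inputs the crude rate ratio differs from all three standardized ratios, because the distribution of F differs between the exposed and unexposed populations.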
We now view the population as an open (dynamic) cohort that is followed over the period of recruitment, with onset of disease as the endpoint of interest [12]. Entry into the cohort occurs, for example, as a result of birth and in-migration, and censoring takes place when, for instance, there is out-migration and death from a cause other than the disease of interest.
Simple random sampling
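Table 1. Number of cases and controls in the ith stratum of F under simple random sampling

| E | Case | Control |
|---|------|---------|
| 1 | $a_{1i} = \gamma R_{1i} N_{1i} T$ | $b_{1i} = \lambda N_{1i}$ |
| 0 | $a_{0i} = \gamma R_{0i} N_{0i} T$ | $b_{0i} = \lambda N_{0i}$ |

Here $\gamma$ and $\lambda$ denote the sampling probabilities for cases and controls, respectively.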
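It follows from Table 1 that

$$\mathrm{SMR}_E = \frac{\sum_i a_{1i}}{\sum_i a_{0i} b_{1i} / b_{0i}} \tag{3}$$

and

$$\mathrm{SMR}_U = \frac{\sum_i a_{1i} b_{0i} / b_{1i}}{\sum_i a_{0i}}, \qquad \mathrm{SMR}_T = \frac{\sum_i a_{1i} (b_{1i} + b_{0i}) / b_{1i}}{\sum_i a_{0i} (b_{1i} + b_{0i}) / b_{0i}},$$

which shows that $\mathrm{SMR}_E$, $\mathrm{SMR}_U$ and $\mathrm{SMR}_T$ can be estimated from incidence case-control data [13–15]. Note that nowhere have we made the rare disease assumption. In the case-control context we denote $\mathrm{SMR}_E$ by sOR.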
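The cancellation of the sampling fractions can be checked numerically. The Python sketch below builds the expected case and control counts from a hypothetical population, using the relations a = γRNT and b = λN from the table of counts under simple random sampling. It then compares the case-control expression Σa_1i / Σ(a_0i b_1i / b_0i) with the population value of SMR_E. All numerical inputs are invented for illustration.

```python
# Hypothetical population with two strata of F (values invented for illustration).
N1 = [1000.0, 3000.0]   # exposed, disease-free
N0 = [4000.0, 2000.0]   # unexposed, disease-free
R1 = [0.004, 0.010]     # incidence rates among the exposed
R0 = [0.002, 0.004]     # incidence rates among the unexposed
T = 2.0                 # years of recruitment


def expected_counts(gamma, lam):
    """Expected cases (a1, a0) and controls (b1, b0) per stratum."""
    a1 = [gamma * r * n * T for r, n in zip(R1, N1)]
    a0 = [gamma * r * n * T for r, n in zip(R0, N0)]
    b1 = [lam * n for n in N1]
    b0 = [lam * n for n in N0]
    return a1, a0, b1, b0


def smr_e_from_counts(a1, a0, b1, b0):
    """Case-control expression for SMR_E: sum(a1) / sum(a0 * b1 / b0)."""
    return sum(a1) / sum(x * y / z for x, y, z in zip(a0, b1, b0))


# Population value of SMR_E, computed directly from rates and stratum sizes.
smr_e_population = (sum(r * n for r, n in zip(R1, N1))
                    / sum(r * n for r, n in zip(R0, N1)))

# Two different, arbitrary choices of sampling fractions.
estimate_a = smr_e_from_counts(*expected_counts(gamma=0.9, lam=0.01))
estimate_b = smr_e_from_counts(*expected_counts(gamma=0.5, lam=0.05))

print(smr_e_population, estimate_a, estimate_b)
```

Both choices of γ and λ return the same value as the population quantity, and no rare disease approximation is used anywhere in the calculation.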
We are interested in measuring the causal effect of exposure on the exposed cohort using counterfactual methods [16–21]. To accomplish this we imagine the group of individuals in the exposed cohort prior to exposure and consider two scenarios: in the first, exposure subsequently occurs (as it does in reality); in the second, exposure does not occur. The second scenario is counterfactual because it rests on the hypothetical condition that exposure does not take place, when in fact it does. By contrasting outcomes arising out of the two scenarios we are able to define parameters having a causal interpretation. This is because we are (in theory) comparing two groups of individuals that are identical except for exposure status. The crude incidence rate corresponding to the first scenario is $R_1$. Denote the crude incidence rate for the second scenario by $R_1^*$. Even though the second scenario is counterfactual, it is possible, provided certain assumptions are satisfied, to estimate $R_1^*$, as discussed below.
In practice, the unexposed cohort, not the exposed cohort under the counterfactual condition, is used for comparative purposes. To the extent that the two associated incidence rates, $R_0$ and $R_1^*$, differ, we say that there is confounding. More precisely, the counterfactual definition of confounding states that confounding is present if and only if $R_0 \neq R_1^*$ [16–21].
We now make two fundamental assumptions: (1) E does not "affect" F (in particular, F is not on a causal pathway between E and the disease), and (2) there is no confounding (according to the counterfactual definition) in the strata of F. Using arguments analogous to those in [21] and [22], we have

$$R_1^* = \frac{\sum_i R_{0i} N_{1i}}{N_1}. \tag{4}$$
Since there is no confounding in the strata of F, when confounding is present, that is, $R_0 \neq R_1^*$, we attribute it to F and say that F is a confounder. It follows from (1), (2) and (4) that

$$\mathrm{SMR}_E = \frac{R_1}{R_1^*}, \tag{5}$$

which shows that under the above two assumptions, $\mathrm{SMR}_E$ has a causal interpretation.
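The counterfactual definition of confounding can be illustrated with a short Python sketch. It computes R_0 and the counterfactual rate R_1* (taken here as the unexposed stratum rates averaged over the exposed stratum sizes, which is what the two assumptions above imply). It does this for two hypothetical populations, one in which F is distributed differently among exposed and unexposed and one in which the distributions coincide. All numbers are invented for illustration.

```python
def crude(rates, sizes):
    """Crude rate: stratum rates averaged with the stratum sizes as weights."""
    return sum(r * n for r, n in zip(rates, sizes)) / sum(sizes)


def compare(N1, N0, R1, R0):
    """Return (R_0, R_1*, crude ratio R_1/R_0, causal ratio R_1/R_1*)."""
    r1 = crude(R1, N1)
    r0 = crude(R0, N0)
    r1_star = crude(R0, N1)  # unexposed rates applied to the exposed population
    return r0, r1_star, r1 / r0, r1 / r1_star


# Hypothetical stratum-specific rates, shared by both scenarios.
R1 = [0.004, 0.010]
R0 = [0.002, 0.004]

# Scenario A: F is distributed differently in the exposed (1:3) and unexposed (2:1).
r0_a, r1_star_a, crude_a, causal_a = compare([1000.0, 3000.0], [4000.0, 2000.0], R1, R0)

# Scenario B: F has the same distribution (1:3) in both populations.
r0_b, r1_star_b, crude_b, causal_b = compare([1000.0, 3000.0], [2000.0, 6000.0], R1, R0)

print(r0_a, r1_star_a, crude_a, causal_a)
print(r0_b, r1_star_b, crude_b, causal_b)
```

In scenario A, R_0 and R_1* differ, so the crude ratio is confounded by F. In scenario B they agree, and the crude ratio coincides with R_1/R_1*.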
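Table 2. Weighted number of cases and controls in the ith stratum of F under simple random sampling

| E | Case | Control |
|---|------|---------|
| 1 | $a_{1i}$ | $b_{1i}$ |
| 0 | $a_{0i} b_{1i} / b_{0i}$ | $b_{1i}$ |

Entries are the Table 1 counts multiplied by the (empirical) weights: 1 for each exposed subject and $b_{1i}/b_{0i}$ for each unexposed subject in the ith stratum.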
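Let

$$\mathrm{OR}_i = \frac{a_{1i} b_{0i}}{a_{0i} b_{1i}}$$

and $n_i = a_{1i} + a_{0i} + b_{1i} + b_{0i}$. It is readily demonstrated that sOR as given by (3) and the Mantel-Haenszel odds ratio estimate $\mathrm{OR}_{MH}$ [23] can be expressed as weighted averages of the $\mathrm{OR}_i$:

$$\mathrm{sOR} = \frac{\sum_i (a_{0i} b_{1i} / b_{0i})\, \mathrm{OR}_i}{\sum_i a_{0i} b_{1i} / b_{0i}} \quad \text{and} \quad \mathrm{OR}_{MH} = \frac{\sum_i (a_{0i} b_{1i} / n_i)\, \mathrm{OR}_i}{\sum_i a_{0i} b_{1i} / n_i}. \tag{6}$$

These expressions differ only to the extent that the relative magnitudes of the $b_{0i}$ and $n_i$ vary across strata. For case-control studies in which unexposed controls constitute the majority of subjects, sOR and $\mathrm{OR}_{MH}$ will be close in value.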
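The weighted-average representation can be verified on a small example. In the Python sketch below, the stratum counts are hypothetical. sOR is computed directly as Σa_1i / Σ(a_0i b_1i / b_0i) and the Mantel-Haenszel estimate as Σ(a_1i b_0i / n_i) / Σ(a_0i b_1i / n_i), and each is then recomputed as a weighted average of the stratum-specific odds ratios.

```python
# Hypothetical stratum counts (a1, a0, b1, b0):
# exposed cases, unexposed cases, exposed controls, unexposed controls.
strata = [(10, 20, 30, 100), (15, 10, 20, 40)]


def s_or(strata):
    """sOR computed directly: sum(a1) / sum(a0 * b1 / b0)."""
    return (sum(a1 for a1, _, _, _ in strata)
            / sum(a0 * b1 / b0 for _, a0, b1, b0 in strata))


def or_mh(strata):
    """Mantel-Haenszel odds ratio computed directly."""
    num = sum(a1 * b0 / (a1 + a0 + b1 + b0) for a1, a0, b1, b0 in strata)
    den = sum(a0 * b1 / (a1 + a0 + b1 + b0) for a1, a0, b1, b0 in strata)
    return num / den


def weighted_average(strata, weight):
    """Weighted average of the stratum-specific odds ratios."""
    ors = [a1 * b0 / (a0 * b1) for a1, a0, b1, b0 in strata]
    w = [weight(*s) for s in strata]
    return sum(wi * o for wi, o in zip(w, ors)) / sum(w)


s_or_avg = weighted_average(strata, lambda a1, a0, b1, b0: a0 * b1 / b0)
or_mh_avg = weighted_average(
    strata, lambda a1, a0, b1, b0: a0 * b1 / (a1 + a0 + b1 + b0))

print(s_or(strata), s_or_avg)
print(or_mh(strata), or_mh_avg)
```

The two sets of weights differ only in the divisor (b_0i versus n_i), which is why the two summaries are close when unexposed controls make up most of each stratum.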
It was pointed out by Greenland [15] that $\mathrm{OR}_{MH}$ does not have an epidemiologic interpretation when there is effect modification. This is because the stratum-specific weights in (6) do not reflect a recognizable target population. With sOR the target population is clearly specified (namely, the exposed population), and so sOR has a causal interpretation even in the presence of effect modification. This is advantageous in a number of settings. Consider the familiar situation in which, after stratification by one or more confounders, the stratum-specific odds ratio estimates do not exhibit a meaningful pattern, or the differences in these estimates can be distinguished on statistical grounds but are of no practical importance. When this occurs it is desirable to have recourse to a summary odds ratio estimate, even though effect modification may be present.
Stratified random sampling
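Table 3. Number of cases and controls in the ijth stratum of F × G under stratified random sampling

| E | Case | Control |
|---|------|---------|
| 1 | $a_{1ij} = \gamma_j R_{1ij} N_{1ij} T$ | $b_{1ij} = \lambda_j N_{1ij}$ |
| 0 | $a_{0ij} = \gamma_j R_{0ij} N_{0ij} T$ | $b_{0ij} = \lambda_j N_{0ij}$ |

Here G is the stratifying variable, with levels indexed by j, and $\gamma_j$ and $\lambda_j$ are the sampling probabilities for cases and controls in the jth stratum.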
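Table 4. Weighted number of cases and controls in the ijth stratum of F × G under stratified random sampling

| E | Case | Control |
|---|------|---------|
| 1 | $a_{1ij} / \gamma_j$ | $b_{1ij} / \gamma_j$ |
| 0 | $a_{0ij} b_{1ij} / (\gamma_j b_{0ij})$ | $b_{1ij} / \gamma_j$ |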
Under stratified random sampling, we assign each exposed subject in the ijth stratum the (empirical) weight $1/\gamma_j$, and each unexposed subject the weight $b_{1ij} / (\gamma_j b_{0ij})$. As before, in the case-control context we denote $\mathrm{SMR}_E$ by sOR.
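To see why the factor 1/γ_j is needed, the following Python sketch generates expected counts for a hypothetical population in which the sampling probabilities differ between two strata of G. It compares the weighted sOR, Σ(a_1ij / γ_j) / Σ(a_0ij b_1ij / (γ_j b_0ij)), and the version that ignores the sampling weights with the population SMR_E. All inputs are invented for illustration.

```python
T = 2.0  # years of recruitment (hypothetical)

# One tuple per stratum of F x G: (N1, N0, R1, R0, gamma_j, lambda_j).
CELLS = [
    (1000.0, 4000.0, 0.004, 0.002, 1.0, 0.02),   # G stratum j = 0
    (3000.0, 2000.0, 0.010, 0.004, 1.0, 0.02),   # G stratum j = 0
    (2000.0, 2000.0, 0.006, 0.003, 0.5, 0.05),   # G stratum j = 1
    (1500.0, 2500.0, 0.008, 0.002, 0.5, 0.05),   # G stratum j = 1
]


def observed(cell):
    """Expected counts (a1, a0, b1, b0) and gamma_j for one stratum."""
    n1, n0, r1, r0, gamma, lam = cell
    return gamma * r1 * n1 * T, gamma * r0 * n0 * T, lam * n1, lam * n0, gamma


def s_or(cells, use_sampling_weights):
    """sOR with weight 1/gamma_j (exposed) and b1/(gamma_j*b0) (unexposed)."""
    num = den = 0.0
    for cell in cells:
        a1, a0, b1, b0, gamma = observed(cell)
        w = 1.0 / gamma if use_sampling_weights else 1.0
        num += w * a1
        den += w * a0 * b1 / b0
    return num / den


smr_e = (sum(r1 * n1 for n1, _, r1, _, _, _ in CELLS)
         / sum(r0 * n1 for n1, _, _, r0, _, _ in CELLS))
weighted = s_or(CELLS, use_sampling_weights=True)
unweighted = s_or(CELLS, use_sampling_weights=False)

print(smr_e, weighted, unweighted)
```

The weighted version recovers SMR_E, while dropping 1/γ_j gives a different value here. The control sampling probability λ_j cancels within each stratum because only the ratio b_1ij / b_0ij enters the weight.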
MSM-IPTW approach
When there are multiple confounders, the data can be stratified according to their cross-classification and the above method used. However, this may lead to cells with small or zero entries, resulting in instability of estimates. A statistically more efficient alternative is to adopt the MSM-IPTW approach and obtain the weights (for controls) from a logistic regression analysis of control data, where E is the dependent variable and the confounders (of the E-disease association) are the independent variables. We refer to these weights as regression weights.
Under simple random sampling, the weight for each exposed subject is set equal to 1, and the weight for each unexposed subject is taken to be the fitted odds for that individual. For stratified random sampling, the logistic regression analysis of control data must include the stratifying variable. In the jth stratum, the weight for each exposed subject is set equal to the reciprocal of the sampling probability, and the weight for each unexposed subject is taken to be the fitted odds for that individual multiplied by the reciprocal of the sampling probability.
Once the regression weights have been calculated, the odds ratio for the exposure-disease association is estimated from a weighted logistic regression analysis using generalized estimating equations (GEE) [24], where E is the sole independent variable. As remarked by Hernán et al. [6], it has been shown by Robins [1, 2] that for longitudinal data where there are no unmeasured confounders and where a certain positivity assumption is met, the weighted GEE approach produces an asymptotically unbiased estimate of the causal parameter. Depending on the software used for the GEE analysis, it may be necessary to scale the weights such that their sum across all cases equals the actual number of cases, and likewise for controls.
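The weighting and rescaling steps can be sketched in Python without a regression package. With E as the sole binary covariate, the point estimate from a weighted logistic regression equals the cross-product ratio of the weighted 2 × 2 table, so the sketch computes that ratio directly; it does not reproduce the robust GEE standard errors. The grouped data and the fitted odds are hypothetical placeholders for the output of the control-only exposure model.

```python
# Hypothetical grouped data: (is_case, exposed, fitted_odds, n_subjects).
# fitted_odds stands in for the fitted odds of exposure from the control-only
# logistic regression, evaluated at the subjects' confounder values.
groups = [
    (1, 1, 0.30, 10), (1, 0, 0.30, 20),
    (1, 1, 0.50, 15), (1, 0, 0.50, 10),
    (0, 1, 0.30, 30), (0, 0, 0.30, 100),
    (0, 1, 0.50, 20), (0, 0, 0.50, 40),
]


def weighted_totals(groups, case_scale=1.0, control_scale=1.0):
    """Weighted 2 x 2 table: weight 1 if exposed, fitted odds if unexposed."""
    totals = {(1, 1): 0.0, (1, 0): 0.0, (0, 1): 0.0, (0, 0): 0.0}
    for is_case, exposed, fitted_odds, n in groups:
        weight = 1.0 if exposed else fitted_odds
        scale = case_scale if is_case else control_scale
        totals[(is_case, exposed)] += scale * weight * n
    return totals


def odds_ratio(totals):
    """Cross-product ratio of the weighted table."""
    return (totals[(1, 1)] * totals[(0, 0)]) / (totals[(1, 0)] * totals[(0, 1)])


raw = weighted_totals(groups)

# Rescale so that weights sum to the actual numbers of cases and controls.
n_cases = sum(n for c, _, _, n in groups if c == 1)
n_controls = sum(n for c, _, _, n in groups if c == 0)
case_scale = n_cases / (raw[(1, 1)] + raw[(1, 0)])
control_scale = n_controls / (raw[(0, 1)] + raw[(0, 0)])
scaled = weighted_totals(groups, case_scale, control_scale)

print(odds_ratio(raw), odds_ratio(scaled))
```

Rescaling cases and controls by separate constants leaves the odds ratio unchanged, since each constant cancels within the case odds and the control odds. The rescaling therefore matters only for how the software treats the weights when computing standard errors.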
Example
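Table 5. Case-control study of oral contraceptives and myocardial infarction [25]. Columns are grouped by AGE; OR denotes the odds ratio estimate for the column group.

| CIG | OC | 25–34 Case | 25–34 Control | 35–44 Case | 35–44 Control | 45+ Case | 45+ Control | Total Case | Total Control |
|-----|----|-----------|---------------|-----------|---------------|----------|-------------|------------|---------------|
| none | 1 | 0 | 38 | 1 | 12 | 3 | 2 | 4 | 52 |
| none | 0 | 1 | 281 | 13 | 318 | 20 | 155 | 34 | 754 |
| none | OR | 2.44 | | 2.03 | | 11.63 | | 1.71 | |
| 1–24 | 1 | 2 | 35 | 1 | 15 | 0 | 1 | 3 | 51 |
| 1–24 | 0 | 5 | 221 | 32 | 249 | 42 | 96 | 79 | 566 |
| 1–24 | OR | 2.53 | | 0.52 | | 0.76 | | 0.42 | |
| 25+ | 1 | 11 | 22 | 8 | 8 | 3 | 2 | 22 | 32 |
| 25+ | 0 | 8 | 112 | 53 | 125 | 31 | 50 | 92 | 287 |
| 25+ | OR | 7.00 | | 2.36 | | 2.42 | | 2.14 | |
| Total | 1 | 13 | 95 | 10 | 35 | 6 | 5 | 29 | 135 |
| Total | 0 | 14 | 614 | 98 | 692 | 93 | 301 | 205 | 1607 |
| Total | OR | 6.00 | | 2.02 | | 3.88 | | 1.68 | |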
We first performed a standard logistic regression analysis, with MI as the dependent variable and OC, AGE and CIG as the independent variables. As pointed out by Greenland and Maldonado [26], there are problems identifying the target population when using standard logistic regression analysis. Models were fit using EGRET [27]; statistical significance of individual terms was determined using the likelihood ratio test, and the goodness-of-fit statistic $G^2$ was based on the deviance. On purely statistical grounds the best-fitting model had main effects for OC, AGE and CIG, along with the interaction term AGE × CIG ($G^2$ = 12.0, df = 8, p = .15). The odds ratio estimate for the OC-MI association was 2.82 (95% confidence interval [CI]: 1.70, 4.68). Of note, the Mantel-Haenszel odds ratio estimate, $\mathrm{OR}_{MH}$ = 2.82 (95% CI: 1.70, 4.69), was virtually identical to the logistic regression estimate. The $\mathrm{OR}_{MH}$ confidence interval was based on the variance estimate described by Robins, Breslow and Greenland [28, 29]. The model with main effects for OC, AGE and CIG, along with the interaction term OC × CIG also fit the data quite well ($G^2$ = 17.4, df = 10, p = .068). Given that oral contraceptive use is the exposure of interest, it is reasonable – on substantive grounds – to consider this as the "final" model. If so, because of the OC × CIG interaction, the model no longer provides a summary estimate of the odds ratio for the OC-MI association.
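The Mantel-Haenszel figures can be recomputed from the nine CIG × AGE strata of the oral contraceptive table. The Python sketch below uses the standard form of the Robins-Breslow-Greenland variance for the log odds ratio, written out from the general formula rather than taken from the paper. The function and variable names are illustrative.

```python
import math

# (exposed cases, unexposed cases, exposed controls, unexposed controls)
# for the nine CIG x AGE strata of the oral contraceptive table.
STRATA = [
    (0, 1, 38, 281), (1, 13, 12, 318), (3, 20, 2, 155),   # CIG none
    (2, 5, 35, 221), (1, 32, 15, 249), (0, 42, 1, 96),    # CIG 1-24
    (11, 8, 22, 112), (8, 53, 8, 125), (3, 31, 2, 50),    # CIG 25+
]


def mantel_haenszel(strata, z=1.96):
    """Return (OR_MH, lower, upper) using the Robins-Breslow-Greenland variance."""
    sum_r = sum_s = sum_pr = sum_ps_qr = sum_qs = 0.0
    for a1, a0, b1, b0 in strata:
        n = a1 + a0 + b1 + b0
        r = a1 * b0 / n        # contribution to the numerator of OR_MH
        s = a0 * b1 / n        # contribution to the denominator of OR_MH
        p = (a1 + b0) / n
        q = (a0 + b1) / n
        sum_r += r
        sum_s += s
        sum_pr += p * r
        sum_ps_qr += p * s + q * r
        sum_qs += q * s
    or_mh = sum_r / sum_s
    var_log = (sum_pr / (2 * sum_r ** 2)
               + sum_ps_qr / (2 * sum_r * sum_s)
               + sum_qs / (2 * sum_s ** 2))
    half_width = z * math.sqrt(var_log)
    return or_mh, or_mh * math.exp(-half_width), or_mh * math.exp(half_width)


or_mh, lower, upper = mantel_haenszel(STRATA)
print(round(or_mh, 2), round(lower, 2), round(upper, 2))
```

Under these assumptions the result should agree, up to rounding, with the reported estimate of 2.82 (95% CI: 1.70, 4.69). Strata with a zero cell contribute nothing to the numerator but still enter the denominator and the variance.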
Next, we conducted an analysis using the MSM-IPTW approach. To obtain regression weights, a standard logistic regression analysis of control data was performed, with OC as the dependent variable, and with AGE and CIG as the independent variables. The best-fitting model had only a main effect for AGE ($G^2$ = 5.06, df = 6, p = .54). We then conducted a weighted logistic regression analysis using generalized estimating equations, with MI as the dependent variable and OC as the sole independent variable. Following Hernán et al. [4] and Sato and Matsuyama [11], calculations were performed using the SAS procedure PROC GENMOD [30]. The odds ratio estimate for the OC-MI association was 3.34 (95% CI: 2.15, 5.21). Interestingly, when empirical weights were used instead of regression weights, the odds ratio estimate (which equals sOR) was 2.83 (95% CI: 1.82, 4.41). This is very close to the odds ratio and confidence interval estimates based on the standard logistic regression and Mantel-Haenszel analyses.
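The two weighted point estimates can be approximated from the tabulated counts with a short Python sketch. It rests on one assumption that should be read as such: AGE is treated as a three-level categorical covariate in the control-only exposure model, in which case the fitted odds of OC use equal the observed odds among controls in each age group. Confidence intervals are not reproduced, since they depend on the robust GEE variance.

```python
# (age group, exposed cases, unexposed cases, exposed controls, unexposed controls)
# for the nine CIG x AGE strata of the oral contraceptive table.
STRATA = [
    ("25-34", 0, 1, 38, 281), ("35-44", 1, 13, 12, 318), ("45+", 3, 20, 2, 155),
    ("25-34", 2, 5, 35, 221), ("35-44", 1, 32, 15, 249), ("45+", 0, 42, 1, 96),
    ("25-34", 11, 8, 22, 112), ("35-44", 8, 53, 8, 125), ("45+", 3, 31, 2, 50),
]


def weighted_or(strata, exposure_odds):
    """Weighted odds ratio: weight 1 if exposed, exposure_odds(stratum) if not."""
    a1 = sum(s[1] for s in strata)
    a0 = sum(s[2] * exposure_odds(s) for s in strata)
    b1 = sum(s[3] for s in strata)
    b0 = sum(s[4] * exposure_odds(s) for s in strata)
    return (a1 * b0) / (a0 * b1)


def empirical_odds(s):
    """Empirical weight: observed exposure odds among controls in the stratum."""
    return s[3] / s[4]


# Exposure odds among controls by age group (assumed to match an AGE-only model).
age_b1, age_b0 = {}, {}
for age, _, _, b1, b0 in STRATA:
    age_b1[age] = age_b1.get(age, 0) + b1
    age_b0[age] = age_b0.get(age, 0) + b0


def age_only_odds(s):
    return age_b1[s[0]] / age_b0[s[0]]


s_or = weighted_or(STRATA, empirical_odds)
age_weighted_or = weighted_or(STRATA, age_only_odds)
print(round(s_or, 3), round(age_weighted_or, 3))
```

With empirical weights the sketch gives approximately 2.83, matching the reported sOR. With the age-only weights it gives a value near 3.35, close to the reported 3.34; the small difference in the last digit may reflect rounding or details of the fitted model.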
Discussion
The counterfactual definition of confounding represents an important conceptual advance over earlier formulations of confounding. Working within the counterfactual framework, Robins and colleagues developed inverse probability of treatment weighted estimation in marginal structural models for the analysis of longitudinal data [1–7]. Although primarily aimed at the problem of time-dependent confounding, this method is valid when confounders are independent of time.
Extending the work of Miettinen [8], in this paper we present a method of causal analysis of case-control data that is closely related to IPTW estimation in MSMs. We consider only case-control studies conducted in a stationary population. Provided the time period during which the study is conducted is not too long, it may be reasonable to regard the population as at least approximately stationary. Whether strictly valid or not, the stationary population assumption appears to be made routinely – usually implicitly – when case-control studies are conducted. An alternative is to match controls to cases on time of recruitment using risk set sampling [10] and perform a conditional data analysis. Under the rare disease assumption, approximate parameter estimates can then be obtained using the MSM-IPTW approach [7].
Declarations
Acknowledgements
The author thanks Dr. James Robins for helpful discussions.
References
1. Robins JM: Marginal structural models. 1997 Proceedings of the Section on Bayesian Statistical Science. Alexandria, VA, American Statistical Association 1998, 1–10.
2. Robins JM: Marginal structural models versus structural nested models as tools for causal inference. Statistical Models in Epidemiology: the Environment and Clinical Trials (Edited by: Halloran ME, Berry D). New York, Springer-Verlag 1999, 95–134.
3. Robins JM: Association, causation, and marginal structural models. Synthese 1999, 121:151–179.
4. Hernán MA, Brumback B, Robins JM: Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology 2000, 11:561–570.
5. Robins JM, Hernán MA, Brumback B: Marginal structural models and causal inference in epidemiology. Epidemiology 2000, 11:550–560.
6. Hernán MA, Brumback B, Robins JM: Estimating the causal effect of zidovudine on CD4 count with a marginal structural model for repeated measures. Statistics in Medicine 2002, 21:1689–1709.
7. Robins JM: Comment on "Covariance adjustment in randomized experiments and observational studies" by Rosenbaum PR. Statistical Science 2002, 17:286–327.
8. Miettinen OS, Cook EF: Confounding: essence and detection. American Journal of Epidemiology 1981, 114:593–603.
9. Keyfitz N: Introduction to mathematical demography. With revisions. Reading, MA, Addison-Wesley 1977.
10. Langholz B, Goldstein L: Risk set sampling in epidemiologic cohort studies. Statistical Science 1996, 11:35–53.
11. Sato T, Matsuyama Y: Marginal structural models as a tool for standardization. Epidemiology 2003, 14:680–686.
12. Rothman KJ, Greenland S: Modern epidemiology. Second Edition. Philadelphia, Lippincott-Raven 1998, 94.
13. Miettinen OS: Components of the crude risk ratio. American Journal of Epidemiology 1972, 96:168–172.
14. Miettinen OS: Estimability and estimation in case-referent studies. American Journal of Epidemiology 1976, 103:226–235.
15. Greenland S: Interpretation and estimation of summary ratios under heterogeneity. Statistics in Medicine 1982, 1:217–227.
16. Greenland S, Robins JM: Identifiability, exchangeability, and epidemiologic confounding. International Journal of Epidemiology 1986, 15:412–418.
17. Robins JM, Morgenstern H: The foundations of confounding in epidemiology. Computers and Mathematics with Applications 1987, 14:869–916.
18. Greenland S, Robins JM, Pearl J: Confounding and collapsibility in causal inference. Statistical Science 1999, 14:29–46.
19. Greenland S, Morgenstern H: Confounding in health research. Annual Review of Public Health 2001, 22:189–212.
20. Maldonado G, Greenland S: Estimating causal effects (with commentary). International Journal of Epidemiology 2002, 31:422–438.
21. Newman S: Commonalities in the classical, collapsibility and counterfactual concepts of confounding. Journal of Clinical Epidemiology 2004, 57:325–329.
22. Wickramaratne P, Holford TR: Confounding in epidemiologic studies: the adequacy of the control group as a measure of confounding. Biometrics 1987, 43:751–765.
23. Mantel N, Haenszel W: Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute 1959, 22:719–748.
24. Hanley JA, Negassa A, Edwardes MDdeB, Forrester JE: Statistical analysis of correlated data using generalized estimating equations: an orientation. American Journal of Epidemiology 2003, 157:364–375.
25. Shapiro S, Slone D, Rosenberg L, Kaufman DW, Stolley PD, Miettinen OS: Oral contraceptive use in relation to myocardial infarction. Lancet 1979, 7:743–746.
26. Greenland S, Maldonado G: The interpretation of multiplicative-model parameters as standardized parameters. Statistics in Medicine 1994, 13:989–999.
27. Cytel Software Corporation: EGRET® for Windows. User Manual. Cambridge, MA: Cytel Software Corporation 1999.
28. Robins J, Breslow N, Greenland S: Estimators of the Mantel-Haenszel variance consistent in both sparse data and large-strata limiting models. Biometrics 1986, 42:311–323.
29. Silcocks P: An easy approach to the Robins-Breslow-Greenland variance estimator. Epidemiologic Perspectives & Innovations 2005, 2:9.
30. SAS Institute Inc: SAS/STAT® User's Guide: Version 8. Cary, NC: SAS Institute Inc 1999.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.