# An easy approach to the Robins-Breslow-Greenland variance estimator

- Paul Silcocks
^{1_15}Email author

**2**:9

**DOI: **10.1186/1742-5573-2-9

© Silcocks. 2005

**Received: **26 April 2005

**Accepted: **26 September 2005

**Published: **26 September 2005

## Abstract

The Mantel-Haenszel estimate for the odds ratio (and its logarithm) in stratified case control studies lacked a generally acceptable variance estimate for many years. The Robins-Breslow-Greenland estimate has met this need, but standard textbooks still do not provide an explanation of how it is derived. This article provides an accessible derivation which demonstrates the link between the Robins-Breslow-Greenland estimate and the familiar Woolf estimate for the variance of the log odds ratio, and which could easily be included in Masters level courses in epidemiology. The relationships to the unconditional and conditional maximum likelihood estimates are also reviewed.

## Introduction

The Mantel-Haenszel (MH) estimate for the summary odds ratio across several 2 × 2 tables, *ψ*
_{
MH
}, was proposed in 1959 [1]. Over twenty years later the lack of a robust estimate for its variance was still being noted [2], yet only a few years afterwards, Robins, Breslow and Greenland introduced their now generally-accepted variance estimator [3] for the Mantel-Haenszel log-odds ratio (denoted by the RBG estimate). This replaced estimation of confidence limits based on the unsatisfactory test-based procedure of Miettinen or the computationally intensive Cornfield type limits which had hitherto been used.

While a useful review of Mantel-Haenszel methods has been published, including some aspects of the historical development towards the RBG estimator [4] the formal derivations by Robins, Breslow & Greenland [3] and Phillips & Holland [5] are not, in the view of this author, easily comprehended. The former omits steps in the argument, while the latter appeals to descending factorial powers. Possibly it is no surprise that even modern textbooks [6, 7] merely state the RBG formula without deriving it.

While other variance estimators exist, some are ad hoc, such as the application of the cohort study formula to case-control data suggested by Clayton and Hills [8], only apply to the large few strata case [9] or are closely related to the RBG estimator [10]. One rather different exception is Sato's formula [11] but this procedure gives confidence limits directly in the odds ratio scale.

It is the intention of this article to present an informal derivation of the RBG estimator as an extension of the familiar variance formula of Woolf [12], and which could readily be included in standard textbooks of epidemiology or biostatistics. I will describe this from the perspective of a case-control study.

## Analysis

### How does the Mantel-Haenszel estimate arise?

Consider a stratified case-control study for which the i^{th} of k independent tables is:

Neglecting constants, the unconditional likelihood for the i^{th} table is:

where in the i^{th} table *θ*
_{
i
} = probability of exposure if a case and *ø*
_{
i
} = probability of exposure if a control.

The maximum likelihood estimate (MLE) for *ø*
_{
i
} is given by *b*
_{
i
}/(*b*
_{
i
} + *d*
_{
i
}) and if we re-parameterise *θ*
_{
i
} as *ψ*
*ø*
_{
i
}
*/* [*ψ*
*ø*
_{
i
} + (1 - *ø*
_{
i
})], where *ψ* is the odds ratio (assumed common to all tables), the contribution to the overall log likelihood made by terms involving *ψ* is:

∑ *a*
_{
i
} ln {*ψ*
*ø*
_{
i
}/[*ψ*
*ø*
_{
i
} + (1 - *ø*
_{
i
})]} + *c*
_{
i
} ln {1/[*ψ*
*ø*
_{
i
} + (1 - *ø*
_{
i
})]}.

Differentiating with respect to *ψ* and equating to zero, and rearranging (noting that *a*
_{
i
} + *c*
_{
i
} = *n*
_{1i
})

we obtain:

∑ *a*
_{
i
} ln {*ψ*
*ø*
_{
i
}/[*ψ*
*ø*
_{
i
} + (1 - *ø*
_{
i
})]} + *c*
_{
i
} ln {1/[*ψ*
*ø*
_{
i
} + (1 - *ø*
_{
i
})]}.

i.e., ∑ {*a*
_{
i
} - *n*
_{1i
}
*ψ*
*ø*
_{
i
}/[*ψ*
*ø*
_{
i
} + (1 - *ø*
_{
i
})]} = 0

i.e., ∑ {[*ψ*
*a*
_{
i
}
*ø*
_{
i
} + *a*
_{
i
} - *a*
_{
i
}
*ø*
_{
i
} - *n*
_{1i
}
*ψ*
*ø*
_{
i
}]/[*ψ*
*ø*
_{
i
} + (1 - *ø*
_{
i
})]} = 0.

This must be solved numerically to obtain the MLE for *ψ*, but if the denominators do not vary too much across the tables we merely have to solve:

∑ [*ψ*
*a*
_{
i
}
*ø*
_{
i
} + *a*
_{
i
} - *a*
_{
i
}
*ø*
_{
i
} - *n*
_{1i
}
*ψ*
*ø*
_{
i
}] = 0

i.e., ∑ [*ψ*(*a*
_{
i
} - *n*
_{1i
})*ø*
_{
i
} + *a*
_{
i
} (1 - *ø*
_{
i
})] = 0

or, ∑ *a*
_{
i
} (1 - *ø*
_{
i
}) = ∑ *ψ*(*n*
_{1i
} - *a*
_{
i
})*ø*
_{
i
}

giving, ∑ *a*
_{
i
} (1 - *ø*
_{
i
}) = *ψ* ∑ (*n*
_{1i
} - *a*
_{
i
})*ø*
_{
i
}

and since, *ψ*
_{
i
} = *b*
_{
i
}/(*b*
_{
i
} + *d*
_{
i
}) = *b*
_{
i
}/*n*
_{0i
}

This can be used as a first approximation to find the MLE (if there is only one table then *ψ*
*is* the unconditional MLE = ad/bc). Now in stratified case-control studies with a constant ratio, *r*, of controls to cases, the total number of subjects in each stratum is given by *n*
_{
i
} = *n*
_{0i
} (1 + *r*), so *n*
_{0i
} = *n*
_{
i
}/(1 + *r*). A constant *r* will be achieved by design if there is caliper matching; otherwise – as with a post-stratified analysis – this will be only approximately true. The term (1 + *r*) can then be cancelled and we are left with:

The MH estimator is therefore a first approximation to the unconditional MLE in the large strata case with a constant control:case ratio across strata. However the MH estimator actually *coincides* with the *conditional* MLE for the matched pairs design, as outlined, for example on page 164 of Breslow & Day [2].

*ø*

_{ i }and constancy of the control:case ratio is not high, as shown by the data in Table 1. In a sense this would be expected because for the most sparse (e.g., pair-matched) data the control:case ratio will be constant, and while the

*ø*

_{ i }then have maximum variance – being only 0 and 1, the MH estimate coincides with the conditional MLE. Conversely, for large strata the control:case ratio will vary, but the variance of the

*ø*

_{ i }will be less and the MH estimate will then approximate the unconditional MLE.

Simulated case-control data with true odds ratio = 5

Case | Control |
| Controls:cases |
---|---|---|---|

36 | 97 | 0.58 | 4 |

6 | 71 | ||

42 | 168 | ||

Ca | Co | ||

41 | 79 | 0.94 | 2 |

1 | 5 | ||

42 | 84 | ||

Ca | Co | ||

2 | 1 | 0.02 | 2 |

26 | 55 | ||

28 | 56 | ||

Ca | Co | ||

19 | 25 | 0.30 | 3 |

9 | 59 | ||

28 | 84 | ||

Ca | Co | ||

20 | 41 | 0.37 | 4 |

8 | 71 | ||

28 | 112 | ||

Ca | Co | ||

30 | 21 | 0.62 | 1 |

4 | 13 | ||

34 | 34 | ||

Ca | Co | ||

22 | 26 | 0.46 | 2 |

6 | 30 | ||

28 | 56 |

### Deriving the variance of the Mantel-Haenszel estimate

Consider again the i^{th} 2 × 2 table, giving the frequencies in each cell:

For odds ratio
, estimated for a single table by the cross-product ratio *a*
_{
i
}
*d*
_{
i
}/*b*
_{
i
}
*c*
_{
i
}, application of the delta method gives Woolf's logit-based formula [8]:

with *n*
_{
i
} = *a*
_{
i
} + *b*
_{
i
} + *c*
_{
i
} + *d*
_{
i
} and,

The delta method is a widely used procedure in statistics when an approximation is needed for the variance of a function of a variable whose variance is known. In this instance the variable with known variance is a proportion *p*, and the function is the logit. The basic delta method formula is: var(*y*) ≈ (*dy*/*dx*)^{2} var(*x*) from which, if *y* = *logit*(*p* = ln[*p*/(1 - *p*)],

var(*y*) ≈ (1/*p* + 1/(1 - *p*))^{2}
*p*(1 - *p*)/*n*

= (1/*p* + 1/(1 - *p*))1/*n*

= (1/*a* + 1/*b*)

if *p* = *a*/*n* and *n* = *a* + *b*.

Here we have two independent proportions (the proportion of cases and controls exposed) and Woolf's formula is obtained by estimating the variances of the separate logits and adding them.

For k such 2 × 2 tables, each representing a separate stratum, the Mantel-Haenszel pooled estimate of the *common* odds ratio *ψ* is given by:

Hence *ψ*
_{
MH
} is a weighted average of the stratum-specific odds ratios. The weights approximate the inverse of the variance of each
_{
i
} if the true value of *ψ* = 1. Note that the assumption here of a common odds ratio is not required for the Mantel-Haenszel *test*.

To derive the variance, in addition to the approximation involved in application of the delta rule, an assumption is also made that each stratum-specific odds ratio is close enough to the Mantel-Haenszel pooled estimate to permit terms like *a*
_{
i
}
*d*
_{
i
}/*b*
_{
i
}
*c*
_{
i
} to be replaced by *ψ*
_{
MH
}.

We then proceed by obtaining an approximation which avoids zeros in the formula for var[ln(
)]. The motivation for this can be seen by comparing the weights for *ψ*
_{
MH
} – which are unaffected by zeros except for deleting such strata – whereas if Woolf's variances were used, the result would be indeterminate if cells with zeros were present.

Taking the weights as constant,

Assuming a common odds ratio *ψ*, estimated by *ψ*
_{
MH
}, this can be written as:

Leading to a formula suggested by Hauck [9]:

As mentioned above, a problem with this formula is that it fails if cell entries are zero. However we can proceed further by re-writing the formula as:

On substituting 1/*ψ*
_{
MH
} for (*b*
_{
i
}
*c*
_{
i
}/*a*
_{
i
}
*d*
_{
i
}):

Now if the rows of the 2 × 2 table are interchanged, the variance stays the same. But a similar argument to that above leads to:

(Note that the new odds ratio
formed by exchanging rows is just 1/*ψ*
_{
MH
}.) "The" variance, V, of ln(*ψ*
_{
MH
}) is therefore taken to be the mean of the two estimates [13] as follows:

Let R = ∑ (*a*
_{
i
}
*d*
_{
i
}/*n*
_{
i
}) and S = ∑ (*b*
_{
i
}
*c*
_{
i
}/*n*
_{
i
}). On substituting into the two variance formulae:

Next, divide the top and bottom by S^{2} and move the
term outside the brackets to obtain:

If we now put

P_{i} = (*a*
_{
i
} + *d*
_{
i
})/*n*
_{
i
} and Q_{i} = (*b*
_{
i
} + *c*
_{
i
})/*n*
_{
i
} with R_{i} = *a*
_{
i
}
*d*
_{
i
}/*n*
_{
i
} and S_{i} = *b*
_{
i
}
*c*
_{
i
}/*n*
_{
i
}

then

which on multiplying out the brackets, rearranging and noting that R/S = *ψ*
_{
MH
}, gives:

This is the RBG formula!

When there is only one stratum, this reduces to (1/*a* + 1/*b* + 1/*c* + 1/*d*) which is the familiar logit based formula of Woolf and which approaches 0 as the sample size increases, assuming a finite true odds ratio. Clearly as the RBG variance estimate is a finite sum of such estimators the RBG estimate will also approach 0, for large strata.

The RBG estimator was derived above on the assumption that the stratum-specific odds ratio estimates could be liberally replaced by the common value, in turn estimated by *ψ*
_{
MH
}; both assumptions are reasonable with large samples per stratum. However, the success of the RBG formula derives from its being applicable also to the sparse data case.

To see this, consider a matched-pairs case control study. The capital letters denote the frequency of case-control pairs.

In such a study each stratum has only two observations. The table can be decomposed into four types of "unmatched" table according to the exposure category of the case and the control, the frequency of each type being given by the frequency of the corresponding case-control pairs:

Only the B such tables with *a*
_{
i
} = *d*
_{
i
} = 1 and the C such tables with *b*
_{
i
} = *c*
_{
i
} = 1 contribute to the estimate of the odds ratio. Note that these are disjoint sets of tables.

Under these circumstances: *ψ*
_{
MH
} = B/C which coincides with the conditional MLE and:

a) the middle term of the RBG formula vanishes because if *b*
_{
i
}
*c*
_{
i
} = 1 then (*a*
_{
i
} + *d*
_{
i
}) = 0, and if *a*
_{
i
}
*d*
_{
i
}
*=* 1 then (*b*
_{
i
}
*+ c*
_{
i
}) = 0

b) R = ∑ *a*
_{
i
}
*d*
_{
i
}/*n*
_{
i
}
*=* B/2 & S = ∑ *b*
_{
i
}
*c*
_{
i
}/*n*
_{
i
} = C/2

c) There are B terms in which *a*
_{
i
}
*d*
_{
i
} (*a*
_{
i
} + *d*
_{
i
}) = 2 C terms in which *b*
_{
i
}
*c*
_{
i
} (*b*
_{
i
} + *c*
_{
i
}) = 2

giving:

V = B/B^{2} + C/C^{2} = 1/B + 1/C

This is not only the familiar logit based formula for the variance of the log odds ratio for matched pairs, but is also the variance of the conditional maximum likelihood estimate. This is asymptotically consistent from the general properties of a MLE (and it's easy to see that as the number of tables increases, V → 0).

In other words, the RBG formula, though derived here without assuming validity in the sparse case, does in fact possess this property.

Table 1 shows how closely the conditional maximum likelihood estimate, unconditional maximum likelihood estimate, and MH estimate agree, despite varying *ø*
_{
i
} and control:case ratio.

## Conclusion

The Mantel-Haenszel estimate of the odds ratio approximates the maximum likelihood estimate for large, few strata and coincides with the conditional maximum likelihood estimate for the sparse data (matched pairs) case.

The RBG formula is the estimator of choice for the variance of the Mantel-Haenszel log-odds-ratio because it applies both in the large few strata case and in the many sparse strata case (as in matched pairs analysis), when the RBG variance estimate actually coincides with the conditional maximum likelihood variance estimate.

Moreover the RBG formula reduces to familiar standard forms for a single stratum and for matched pairs.

Formal derivation of the RBG formula is tricky but an informal, accessible derivation is possible as outlined above, which uses nothing more advanced than the delta method for approximating a variance.

## Declarations

## Authors’ Affiliations

## References

- Mantel N, Haenszel W:
**Statistical aspects of the analysis of data from retrospective studies of disease.***JNCI*1959,**22:**719–748.PubMedGoogle Scholar - Breslow NE, Day NE:
*Statistical methods in cancer research, Volume 1 – the analysis of case-control studies*Lyons: International Agency for Research on Cancer 1980.Google Scholar - Robins J, Breslow N, Greenland S:
**Estimators of the Mantel-Haenszel variance consistent in both sparse data and large-strata limiting models.***Biometrics*1986,**42:**311–323.View ArticlePubMedGoogle Scholar - Kuritz SJ, Landis JR, Koch GG:
**A general overview of Mantel-Haenszel methods: applications and recent developments.***Ann Rev Public Health*1988,**9:**123–160.View ArticleGoogle Scholar - Phillips A, Holland PW:
**Estimators of the variance of the Mantel-Haenszel log-odds-ratio estimate.***Biometrics*1987,**43:**425–431.View ArticleGoogle Scholar - Armitage P, Berry G, Matthews JNS:
*Statistical methods in medical research**4 Edition*Oxford: Blackwell Science 2002.Google Scholar - Fleiss JL, Levin B, Paik MC:
*Statistical Methods for Rates & Proportions*Chichester: John Wiley 2003.View ArticleGoogle Scholar - Clayton D, Hills M:
*Statistical methods in epidemiology*Oxford: Oxford University Press 1995.Google Scholar - Hauck WW:
**The large-sample variance of the Mantel-Haenszel estimator of a common odds ratio.***Biometrics*1979,**35:**817–819.View ArticleGoogle Scholar - Flanders WD:
**A new variance estimator for the Mantel-Haenszel odds ratio.***Biometrics*1985,**41:**637–642.View ArticleGoogle Scholar - Sato T:
**Confidence limits for the Common Odds Ratio Based on the Asymptotic Distribution of the Mantel-Haenszel Estimator.***Biometrics*1990,**46:**71–80.View ArticleGoogle Scholar - Woolf B:
**On estimating the relationship between blood group and disease.***Human Genet*1955,**19:**251–253.View ArticleGoogle Scholar - Ury HK:
**Hauck's approximate large-sample variance of the Mantel-Haenszel estimator [letter].***Biometrics*1982,**38:**1094–1095.Google Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.