Quality Assessment of Data on Smoking Behaviour in
the WHO MONICA Project

## Appendix 4. Estimation of the proportion of ex-cigarette smokers in the population

(See Section 6.1)

This appendix describes the mathematical basis for estimating the proportion of
ex-smokers in the population. In the MONICA data the ex-smokers are identified through two
data-items, CIGS and EVERCIG, and the item EVERCIG is defined only in the sub-category of
non-cigarette smokers of CIGS. If we had sufficient data on EVERCIG for all non-smokers,
the proportion of ex-smokers could be estimated in the standard way as a proportion of the
whole sample. The estimation is, however, complicated by the fact that usually there are
subjects whom we know to be non-smokers (i.e. CIGS=2), but for whom the data for EVERCIG
are insufficient.

Although ex-smokers are used here as an example, the description is valid for any
proportions which are defined hierarchically from several data items.

We will use the following notation:

- Population parameters:
  - p = proportion of non-smokers in the population
  - q = proportion of ex-smokers among the non-smokers in the population
  - r = pq = proportion of ex-smokers in the population

- Sample parameters and variables:
  - n = number of subjects with CIGS = 1, 2 or 3 (i.e. having sufficient data for CIGS)
  - X = number of subjects with CIGS = 2 (i.e. non-smokers)
  - k = proportion of non-smokers with sufficient data for EVERCIG (i.e. kX is the number of
    subjects with CIGS = 2 and EVERCIG = 1, 2 or 3)
  - Y = number of subjects with CIGS = 2 and EVERCIG = 1 (i.e. ex-smokers)

Assuming that

- the sample is small compared to the size of the population, and therefore we incur a
negligible error if we assume the sampling was done with replacement, and
- survey non-response and insufficient data for the smoking questions are independent of
the subject's smoking status,

we can use the model: X ~ Bin(n,p) and Y|X ~ Bin(kX,q).

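Then

$$
\hat{p} = \frac{X}{n}, \qquad \hat{q} = \frac{Y}{kX}, \qquad \hat{r} = \hat{p}\,\hat{q} = \frac{Y}{kn}
$$

are unbiased estimates of p, q and r respectively.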
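As an illustration, the point estimates can be computed directly from the four sample counts. The sketch below is a minimal example, not part of the MONICA software: the function name, the argument `m` (standing for kX, the number of non-smokers with sufficient EVERCIG data) and the counts in the usage line are all hypothetical.

```python
def estimate_proportions(n, x, m, y):
    """Point estimates of p, q and r from sample counts.

    n: subjects with CIGS = 1, 2 or 3
    x: subjects with CIGS = 2 (non-smokers), i.e. X
    m: subjects with CIGS = 2 and EVERCIG = 1, 2 or 3, i.e. kX (hypothetical name)
    y: subjects with CIGS = 2 and EVERCIG = 1 (ex-smokers), i.e. Y
    """
    if not (0 <= y <= m <= x <= n) or m == 0:
        raise ValueError("counts must satisfy 0 <= y <= m <= x <= n with m > 0")
    k = m / x            # proportion of non-smokers with sufficient EVERCIG data
    p_hat = x / n        # estimated proportion of non-smokers
    q_hat = y / m        # estimated proportion of ex-smokers among non-smokers
    r_hat = p_hat * q_hat  # algebraically equal to y / (k * n)
    return {"k": k, "p_hat": p_hat, "q_hat": q_hat, "r_hat": r_hat}


# Made-up counts, for illustration only.
est = estimate_proportions(n=1000, x=600, m=540, y=180)
print(est)
```

Note that the estimate of r divides Y by kn rather than by n, which scales the observed ex-smokers up to compensate for the non-smokers whose EVERCIG data are insufficient.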

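To calculate $\mathrm{Var}(\hat{r})$, we see that:

$$
E(Y \mid X) = kXq, \qquad \mathrm{Var}(Y \mid X) = kXq(1-q),
$$

$$
\mathrm{Var}(Y) = E[\mathrm{Var}(Y \mid X)] + \mathrm{Var}[E(Y \mid X)] = knpq(1-q) + k^{2}q^{2}np(1-p).
$$

Hence

$$
\mathrm{Var}(\hat{r}) = \frac{\mathrm{Var}(Y)}{k^{2}n^{2}} = \frac{pq(1-q)}{kn} + \frac{pq^{2}(1-p)}{n} = \frac{r}{n}\left(\frac{1-q}{k} + q - r\right).
$$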

If k=1 (i.e. there are no insufficient data for EVERCIG),

$$
\mathrm{Var}(\hat{r}) = \frac{r(1-r)}{n},
$$

as we would expect.
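The special case k=1 can be checked numerically. The sketch below evaluates the variance implied by the model X ~ Bin(n,p), Y|X ~ Bin(kX,q), namely (r/n)((1-q)/k + q - r), and compares it with the ordinary binomial variance r(1-r)/n. The function name and the parameter values are hypothetical and chosen only for illustration.

```python
def var_r_hat(n, p, q, k):
    """Variance of r_hat = Y / (k * n) under X ~ Bin(n, p), Y | X ~ Bin(k * X, q).

    Uses Var(r_hat) = (r / n) * ((1 - q) / k + q - r) with r = p * q.
    """
    r = p * q
    return (r / n) * ((1 - q) / k + q - r)


# Made-up parameter values, for illustration only.
n, p, q = 1000, 0.6, 0.4
r = p * q

full_data = var_r_hat(n, p, q, k=1.0)     # no insufficient EVERCIG data
binomial = r * (1 - r) / n                # ordinary binomial variance of a proportion
partial_data = var_r_hat(n, p, q, k=0.8)  # 20% of non-smokers lack EVERCIG data

print(full_data, binomial, partial_data)
```

With k=1 the two expressions coincide up to floating-point rounding, while k<1 inflates the variance, reflecting the information lost through the insufficient EVERCIG data.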

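A simple estimate of the standard error of $\hat{r}$ is obtained from the variance formula by
replacing r and q by their estimates:

$$
\widehat{\mathrm{SE}}(\hat{r}) = \sqrt{\frac{\hat{r}}{n}\left(\frac{1-\hat{q}}{k} + \hat{q} - \hat{r}\right)}
$$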

If the proportion of missing data for EVERCIG is small (k>0.9 say), the estimate of
the standard error will be reasonably good even if we omit k:

$$
\widehat{\mathrm{SE}}(\hat{r}) \approx \sqrt{\frac{\hat{r}(1-\hat{r})}{n}}
$$
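To give a feel for how much is lost by omitting k, the following sketch computes both versions of the standard error from the same counts. It assumes the variance expression (r/n)((1-q)/k + q - r) implied by the model, with r and q replaced by their estimates. The function name, the `ignore_k` flag and the counts are hypothetical.

```python
import math


def se_r_hat(n, x, m, y, ignore_k=False):
    """Estimated standard error of r_hat from sample counts.

    n: subjects with CIGS = 1, 2 or 3
    x: subjects with CIGS = 2 (X)
    m: subjects with CIGS = 2 and EVERCIG = 1, 2 or 3 (kX; hypothetical name)
    y: subjects with CIGS = 2 and EVERCIG = 1 (Y)
    ignore_k: if True, use the simplified form sqrt(r_hat * (1 - r_hat) / n)
    """
    k = m / x
    q_hat = y / m
    r_hat = y / (k * n)
    if ignore_k:
        return math.sqrt(r_hat * (1 - r_hat) / n)
    return math.sqrt(r_hat / n * ((1 - q_hat) / k + q_hat - r_hat))


# Made-up counts with k = 0.9, for illustration only.
se_full = se_r_hat(1000, 600, 540, 180)
se_simple = se_r_hat(1000, 600, 540, 180, ignore_k=True)
print(se_full, se_simple, se_simple / se_full)
```

The simplified form is always the smaller of the two when k<1, because it ignores the extra uncertainty caused by the insufficient EVERCIG data. In this made-up example the difference is a few per cent of the standard error.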