Quality Assessment of Data on Smoking Behaviour in the WHO MONICA Project
(See Section 6.1)
This appendix describes the mathematical basis for estimating the proportion of ex-smokers in the population. In the MONICA data, ex-smokers are identified through two data items, CIGS and EVERCIG; EVERCIG is defined only for subjects whom CIGS classifies as non-smokers of cigarettes. If we had sufficient data on EVERCIG for all non-smokers, the proportion of ex-smokers could be estimated in the standard way as a proportion of the whole sample. The estimation is, however, complicated by the fact that there are usually subjects whom we know to be non-smokers (i.e. CIGS=2), but for whom the data for EVERCIG are insufficient.
Although ex-smokers are used here as an example, the description is valid for any proportion that is defined hierarchically from several data items.
We will use the following notation:
| p = | proportion of non-smokers in the population |
| q = | proportion of ex-smokers among the non-smokers in the population |
| r = pq = | proportion of ex-smokers in the population |
| n = | number of subjects with CIGS = 1, 2 or 3 (i.e. having sufficient data for CIGS) |
| X = | number of subjects with CIGS = 2 (i.e. non-smokers) |
| k = | proportion of non-smokers with sufficient data for EVERCIG (i.e. kX is the number of subjects with CIGS = 2 and EVERCIG = 1, 2 or 3) |
| Y = | number of subjects with CIGS = 2 and EVERCIG = 1 (i.e. ex-smokers) |
Assuming that
we can use the model: X ~ Bin(n,p) and Y|X ~ Bin(kX,q).
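Then

$$\hat{p} = \frac{X}{n}, \qquad \hat{q} = \frac{Y}{kX}, \qquad \hat{r} = \hat{p}\,\hat{q} = \frac{Y}{kn}$$

are unbiased estimates of p, q and r respectively.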
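As a minimal sketch, the point estimates can be computed directly from the four counts. The formulas assumed here are p̂ = X/n, q̂ = Y/(kX) and r̂ = p̂q̂ = Y/(kn), which follow from the model above. The function name `estimate_proportions` and the example counts are hypothetical and are not taken from the MONICA data.

```python
def estimate_proportions(n, x, kx, y):
    """Point estimates of p, q and r from hierarchical counts.

    n  : subjects with sufficient data for CIGS (CIGS = 1, 2 or 3)
    x  : subjects with CIGS = 2 (non-smokers)
    kx : non-smokers with sufficient data for EVERCIG (k * X)
    y  : non-smokers with EVERCIG = 1 (ex-smokers)
    """
    if not (0 < kx <= x <= n and 0 <= y <= kx):
        raise ValueError("counts must satisfy 0 < kX <= X <= n and 0 <= Y <= kX")
    k = kx / x              # proportion of non-smokers with usable EVERCIG
    p_hat = x / n           # proportion of non-smokers
    q_hat = y / kx          # proportion of ex-smokers among non-smokers
    r_hat = p_hat * q_hat   # algebraically equal to y / (k * n)
    return p_hat, q_hat, r_hat, k


# Illustrative counts only (assumed for this example).
p_hat, q_hat, r_hat, k = estimate_proportions(n=1000, x=600, kx=540, y=180)
print(p_hat, q_hat, r_hat, k)
```

Note that r̂ is not Y/n: dividing by kn rather than n compensates for the non-smokers whose EVERCIG data are insufficient.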
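To calculate the variance of $\hat{r}$, we see that:

$$\operatorname{Var}(Y) = \operatorname{E}[\operatorname{Var}(Y \mid X)] + \operatorname{Var}(\operatorname{E}[Y \mid X]) = knpq(1-q) + k^2 q^2 np(1-p)$$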
Hence
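$$\operatorname{Var}(\hat{r}) = \frac{\operatorname{Var}(Y)}{k^2 n^2} = \frac{r}{n}\left[\frac{1-q}{k} + q - r\right]$$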
If k=1 (i.e. there are no insufficient data for EVERCIG),
$$\operatorname{Var}(\hat{r}) = \frac{r(1-r)}{n}$$
as we would expect.
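The variance of r̂ under this model can be checked numerically. The sketch below compares the closed form Var(r̂) = (r/n)[(1-q)/k + q - r], which follows from the two-stage binomial model, with the empirical variance of simulated estimates. The function names, parameter values and seed are assumptions made for illustration, and kX is rounded to the nearest integer so that the second binomial draw is well defined.

```python
import random


def binomial(rng, trials, prob):
    """Draw one Bin(trials, prob) variate by summing Bernoulli trials."""
    return sum(1 for _ in range(trials) if rng.random() < prob)


def theoretical_variance(n, p, q, k):
    """Closed-form variance of r_hat = Y / (k * n) under the model."""
    r = p * q
    return r / n * ((1 - q) / k + q - r)


def simulated_variance(n, p, q, k, reps=4000, seed=1):
    """Sample variance of r_hat over repeated simulated samples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        x = binomial(rng, n, p)      # X ~ Bin(n, p)
        m = round(k * x)             # assumption: kX rounded to an integer
        y = binomial(rng, m, q)      # Y | X ~ Bin(kX, q)
        estimates.append(y / (k * n))
    mean = sum(estimates) / reps
    return sum((e - mean) ** 2 for e in estimates) / (reps - 1)


theory = theoretical_variance(n=200, p=0.6, q=0.3, k=0.9)
simulated = simulated_variance(n=200, p=0.6, q=0.3, k=0.9)
print(f"theory={theory:.6f} simulated={simulated:.6f}")
```

The two values should agree up to Monte Carlo error and the small distortion introduced by rounding kX.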
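A simple estimate for the standard error of $\hat{r}$ is obtained from the variance formula by replacing r and q by their estimates:

$$\widehat{\operatorname{SE}}(\hat{r}) = \sqrt{\frac{\hat{r}}{n}\left[\frac{1-\hat{q}}{k} + \hat{q} - \hat{r}\right]}$$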
If the proportion of missing data for EVERCIG is small (k>0.9 say), the estimate of the standard error will be reasonably good even if we omit k:
$$\widehat{\operatorname{SE}}(\hat{r}) \approx \sqrt{\frac{\hat{r}(1-\hat{r})}{n}}$$
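To see how much omitting k matters, the two standard error estimates can be compared on the same counts. This is a sketch under the assumption that the full estimate is the square root of (r̂/n)[(1-q̂)/k + q̂ - r̂] and the simplified one is the square root of r̂(1-r̂)/n. The function names and counts are hypothetical.

```python
import math


def se_full(n, x, kx, y):
    """Standard error of r_hat, accounting for missing EVERCIG data."""
    k = kx / x
    q_hat = y / kx
    r_hat = y / (k * n)
    return math.sqrt(r_hat / n * ((1 - q_hat) / k + q_hat - r_hat))


def se_simple(n, x, kx, y):
    """Simplified standard error of r_hat, with k omitted (treated as 1)."""
    k = kx / x
    r_hat = y / (k * n)   # the point estimate still uses the observed k
    return math.sqrt(r_hat * (1 - r_hat) / n)


# Illustrative counts only (assumed): k = 540 / 600 = 0.9.
full = se_full(n=1000, x=600, kx=540, y=180)
simple = se_simple(n=1000, x=600, kx=540, y=180)
print(f"full={full:.5f} simple={simple:.5f} ratio={simple / full:.3f}")
```

Because (1-q̂)/k is at least 1-q̂ when k is at most 1, the simplified form never exceeds the full one; the two coincide when k = 1.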