Quality Assessment of Data on Smoking Behaviour in the WHO MONICA Project
(See Section 6.1)
This appendix describes the mathematical basis for estimating the proportion of ex-smokers in the population. In the MONICA data the ex-smokers are identified through two data-items, CIGS and EVERCIG, and the item EVERCIG is defined only in the sub-category of non-cigarette smokers of CIGS. If we had sufficient data on EVERCIG for all non-smokers, the proportion of ex-smokers could be estimated in the standard way as a proportion of the whole sample. The estimation is, however, complicated by the fact that usually there are subjects whom we know to be non-smokers (i.e. CIGS=2), but for whom the data for EVERCIG are insufficient.
Although ex-smokers are used here as an example, the description is valid for any proportions which are defined hierarchically from several data items.
We will use the following notation:
p = proportion of non-smokers in the population
q = proportion of ex-smokers among the non-smokers in the population
r = pq = proportion of ex-smokers in the population
n = number of subjects with CIGS = 1, 2 or 3 (i.e. having sufficient data for CIGS)
X = number of subjects with CIGS = 2 (i.e. non-smokers)
k = proportion of non-smokers with sufficient data for EVERCIG (i.e. kX is the number of subjects with CIGS = 2 and EVERCIG = 1, 2 or 3)
Y = number of subjects with CIGS = 2 and EVERCIG = 1 (i.e. ex-smokers)
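With this notation, we can use the model: X ~ Bin(n, p) and Y | X ~ Bin(kX, q). Then

$$\hat{p} = \frac{X}{n}, \qquad \hat{q} = \frac{Y}{kX}, \qquad \hat{r} = \hat{p}\,\hat{q} = \frac{Y}{kn}$$

are unbiased estimates of p, q and r respectively.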
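The estimates can be computed directly from the counts. The following is a minimal sketch, assuming the estimators X/n, Y/(kX) and Y/(kn) implied by the model above; the function name `estimate_proportions` and the counts in the usage line are hypothetical and chosen only for illustration, not taken from MONICA data.

```python
def estimate_proportions(n, x, y, k):
    """Return (p_hat, q_hat, r_hat) from the observed counts.

    n: number of subjects with sufficient data for CIGS
    x: number of non-smokers (CIGS = 2)
    y: number of ex-smokers (CIGS = 2 and EVERCIG = 1)
    k: proportion of non-smokers with sufficient data for EVERCIG
    """
    if n <= 0 or x <= 0:
        raise ValueError("n and x must be positive")
    if not 0 < k <= 1:
        raise ValueError("k must lie in (0, 1]")

    p_hat = x / n            # proportion of non-smokers
    q_hat = y / (k * x)      # proportion of ex-smokers among non-smokers
    r_hat = p_hat * q_hat    # algebraically equal to y / (k * n)
    return p_hat, q_hat, r_hat


# Hypothetical counts, for illustration only.
p_hat, q_hat, r_hat = estimate_proportions(n=1000, x=600, y=171, k=0.95)
```

Here kX = 570 of the 600 non-smokers have usable EVERCIG data, so the ex-smoker proportion among non-smokers is estimated from those 570 subjects only.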
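To calculate $\operatorname{Var}(\hat{r})$, we see that:

$$\operatorname{Var}(Y) = \operatorname{E}[\operatorname{Var}(Y \mid X)] + \operatorname{Var}(\operatorname{E}[Y \mid X]) = knpq(1-q) + k^2 q^2 np(1-p).$$

Since $\hat{r} = Y/(kn)$ and $r = pq$, it follows that

$$\operatorname{Var}(\hat{r}) = \frac{\operatorname{Var}(Y)}{k^2 n^2} = \frac{pq(1-q)}{kn} + \frac{q^2 p(1-p)}{n} = \frac{r}{n}\left[\frac{1-q}{k} + q - r\right].$$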
If k = 1 (i.e. there are no insufficient data for EVERCIG),

$$\operatorname{Var}(\hat{r}) = \frac{r(1-r)}{n},$$

as we would expect.
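The effect of k on the variance can be checked numerically. The sketch below assumes the variance that the law of total variance gives for the model X ~ Bin(n, p), Y | X ~ Bin(kX, q), namely (r/n)[(1-q)/k + q - r] with r = pq; the function names and parameter values are hypothetical.

```python
def variance_r_hat(n, p, q, k):
    """Variance of r_hat = Y / (k * n) under X ~ Bin(n, p), Y | X ~ Bin(k * X, q)."""
    r = p * q
    return (r / n) * ((1 - q) / k + q - r)


def binomial_variance(n, r):
    """Variance of an ordinary sample proportion based on n subjects."""
    return r * (1 - r) / n


# Illustrative parameter values (assumptions, not MONICA results).
n, p, q = 1000, 0.6, 0.3

complete = variance_r_hat(n, p, q, k=1.0)   # no missing EVERCIG data
partial = variance_r_hat(n, p, q, k=0.9)    # 10% of non-smokers lack EVERCIG data
reference = binomial_variance(n, p * q)     # standard binomial variance
```

With k = 1 the first function agrees with the standard binomial variance, and any k < 1 gives a larger value, since fewer subjects contribute to the estimate of q.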
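A simple estimate of the standard error of $\hat{r}$ is obtained from the variance formula by replacing r and q with their estimates:

$$\widehat{\operatorname{SE}}(\hat{r}) = \sqrt{\frac{\hat{r}}{n}\left[\frac{1-\hat{q}}{k} + \hat{q} - \hat{r}\right]}.$$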
If the proportion of missing data for EVERCIG is small (say k > 0.9), the estimate of the standard error will be reasonably good even if we omit k:

$$\widehat{\operatorname{SE}}(\hat{r}) \approx \sqrt{\frac{\hat{r}(1-\hat{r})}{n}}.$$
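To see how little the simplification costs when k is close to 1, both versions of the standard error can be computed side by side. This is a sketch under the same assumed model; the function `standard_error_r_hat`, its `omit_k` flag and the counts are hypothetical.

```python
import math


def standard_error_r_hat(n, x, y, k, omit_k=False):
    """Estimated standard error of r_hat = y / (k * n).

    With omit_k=True the simplified binomial form sqrt(r_hat * (1 - r_hat) / n)
    is used; otherwise the form that accounts for missing EVERCIG data.
    """
    r_hat = y / (k * n)
    q_hat = y / (k * x)
    if omit_k:
        variance = r_hat * (1 - r_hat) / n
    else:
        variance = (r_hat / n) * ((1 - q_hat) / k + q_hat - r_hat)
    return math.sqrt(variance)


# Hypothetical counts, for illustration only.
se_full = standard_error_r_hat(n=1000, x=600, y=171, k=0.95)
se_simple = standard_error_r_hat(n=1000, x=600, y=171, k=0.95, omit_k=True)
relative_difference = (se_full - se_simple) / se_full
```

For these counts the simplified standard error is roughly 2% smaller than the full one. The simplified form is never larger, because dropping k ignores the extra uncertainty in the estimate of q.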