Quality Assessment of Data on Smoking Behaviour in the WHO MONICA Project

Appendix 4. Estimation of the proportion of ex-cigarette smokers in the population

(See Section 6.1)

This appendix describes the mathematical basis for estimating the proportion of ex-smokers in the population. In the MONICA data the ex-smokers are identified through two data items, CIGS and EVERCIG, and the item EVERCIG is defined only for the sub-category of non-cigarette smokers of CIGS. If we had sufficient data on EVERCIG for all non-smokers, the proportion of ex-smokers could be estimated in the standard way as a proportion of the whole sample. The estimation is, however, complicated by the fact that there are usually subjects whom we know to be non-smokers (i.e. CIGS = 2), but for whom the data for EVERCIG are insufficient.

Although ex-smokers are used here as an example, the description is valid for any proportion that is defined hierarchically from several data items.

We will use the following notation:

Population parameters:
p = proportion of non-smokers in the population
q = proportion of ex-smokers among the non-smokers in the population
r = pq = proportion of ex-smokers in the population

Sample parameters and variables:
n = number of subjects with CIGS = 1, 2 or 3 (i.e. having sufficient data for CIGS)
X = number of subjects with CIGS = 2 (i.e. non-smokers)
k = proportion of non-smokers with sufficient data for EVERCIG (i.e. kX is the number of subjects with CIGS = 2 and EVERCIG = 1, 2 or 3)
Y = number of subjects with CIGS = 2 and EVERCIG = 1 (i.e. ex-smokers)

Assuming that

we can use the model: X ~ Bin(n,p) and Y|X ~ Bin(kX,q).

Then

^p = X/n, ^q = Y/(kX), ^r = ^p ^q = Y/(kn)

are unbiased estimates of p, q and r respectively.
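As a minimal illustration, the three estimates can be computed directly from the sample counts. The sketch below is in Python; the function name estimate_proportions, the argument m (the number of non-smokers with sufficient data for EVERCIG, so that k = m/X) and the counts in the usage lines are hypothetical and are not MONICA data.

```python
def estimate_proportions(n, x, m, y):
    """Return (k, p_hat, q_hat, r_hat) from sample counts.

    n: subjects with sufficient data for CIGS
    x: non-smokers (CIGS = 2)
    m: non-smokers with sufficient data for EVERCIG (hypothetical name; m = kX)
    y: ex-smokers (CIGS = 2 and EVERCIG = 1)
    """
    if not (0 <= y <= m <= x <= n) or m == 0:
        raise ValueError("counts must satisfy 0 <= y <= m <= x <= n with m > 0")
    k = m / x              # proportion of non-smokers with sufficient EVERCIG data
    p_hat = x / n          # ^p = X/n
    q_hat = y / m          # ^q = Y/(kX)
    r_hat = p_hat * q_hat  # ^r = ^p ^q, which equals Y/(kn)
    return k, p_hat, q_hat, r_hat


# Made-up counts, for illustration only.
k, p_hat, q_hat, r_hat = estimate_proportions(n=1000, x=600, m=540, y=216)
print(k, p_hat, q_hat, r_hat)
```

With these made-up counts, k is 0.9 and the estimates are approximately ^p = 0.6, ^q = 0.4 and ^r = 0.24.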

To calculate Var(^r), we see that:

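By the law of total variance, with E(Y|X) = kXq and Var(Y|X) = kXq(1-q),

Var(Y) = E[Var(Y|X)] + Var[E(Y|X)]
       = kq(1-q)E(X) + k²q²Var(X)
       = knpq(1-q) + k²q²np(1-p)
       = knr(1-q+kq-kr).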

Hence

Var(^r) = Var[Y/(kn)] = Var(Y)/(kn)² = r(1-q+kq-kr)/(kn).

If k=1 (i.e. there are no insufficient data for EVERCIG),

Var(^r) = r(1-r)/n,

as we would expect.
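The variance formula can be checked numerically. The following Python sketch evaluates the formula, confirms that it reduces to r(1-r)/n when k = 1, and compares it with a small Monte Carlo simulation of the model X ~ Bin(n,p), Y|X ~ Bin(kX,q). The function names, the parameter values and the random seed are illustrative assumptions. Because kX must be an integer in a simulation, it is rounded here, so the agreement is only approximate.

```python
import random
import statistics


def var_r_hat(p, q, k, n):
    """Theoretical Var(^r) = r(1 - q + kq - kr)/(kn), with r = pq."""
    r = p * q
    return r * (1 - q + k * q - k * r) / (k * n)


def simulate_r_hat(n, p, q, k, rng):
    """Draw one value of ^r = Y/(kn) under the two-stage binomial model."""
    x = sum(rng.random() < p for _ in range(n))  # X ~ Bin(n, p)
    m = round(k * x)                             # kX, rounded (assumption)
    y = sum(rng.random() < q for _ in range(m))  # Y | X ~ Bin(kX, q)
    return y / (k * n)


# Hypothetical parameter values, chosen only for illustration.
n, p, q, k = 200, 0.6, 0.4, 0.8
r = p * q

# With k = 1 the formula reduces to the ordinary binomial variance.
print(var_r_hat(p, q, 1.0, n), r * (1 - r) / n)

rng = random.Random(12345)  # fixed seed so that the run is reproducible
draws = [simulate_r_hat(n, p, q, k, rng) for _ in range(5000)]
empirical = statistics.pvariance(draws)
theory = var_r_hat(p, q, k, n)
print(f"mean of ^r: {statistics.mean(draws):.4f} (r = {r:.4f})")
print(f"empirical variance: {empirical:.6f}, theoretical: {theory:.6f}")
```

The empirical variance should be close to the theoretical value, and the mean of the simulated ^r close to r, up to simulation error.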

A simple estimate of the standard error of ^r is obtained from the variance formula by replacing r and q with their estimates:

^se(^r) = Square root[^r(1-^q+k^q-k^r)/(kn)].

If the proportion of missing data for EVERCIG is small (k > 0.9, say), the estimate of the standard error will be reasonably good even if we omit k, i.e. set k = 1 in the formula:

^se(^r) = Square root[^r(1-^r)/n].
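To see how much is lost by omitting k, the two standard error estimates can be compared for several values of k. The Python sketch below does this; the function names and the values of ^r, ^q and n are hypothetical and chosen only for illustration. The simplified form is the k = 1 case, Square root[^r(1-^r)/n].

```python
import math


def se_full(r_hat, q_hat, k, n):
    """Standard error estimate that accounts for missing EVERCIG data."""
    return math.sqrt(r_hat * (1 - q_hat + k * q_hat - k * r_hat) / (k * n))


def se_simplified(r_hat, n):
    """Standard error estimate with k omitted (the k = 1 form)."""
    return math.sqrt(r_hat * (1 - r_hat) / n)


# Hypothetical estimates, for illustration only.
r_hat, q_hat, n = 0.24, 0.4, 1000
for k in (1.0, 0.95, 0.9, 0.7):
    full = se_full(r_hat, q_hat, k, n)
    simple = se_simplified(r_hat, n)
    print(f"k = {k:.2f}: full = {full:.5f}, simplified = {simple:.5f}, "
          f"ratio = {simple / full:.3f}")
```

Since k is at most 1, the simplified form never exceeds the full one, so omitting k understates the standard error, and the understatement grows as k decreases.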