The ideas here are based on Chapter 9 of Särndal, Swensson and Wretman, which deals with two-phase sampling and the regression estimation of population totals. Here we are interested in multiphase sampling and in settings where many statistics may be computed with the same set of calibrated weights, so the priorities and notation are slightly different.
As in the book, we consider the setting where the sampling probabilities are known (or well-estimated) in advance and the raking adjustments are small. When the sampling probabilities $\pi$ are grossly incorrect (e.g., because of missing data or non-response), so that the raking adjustments are not small, we would want different computations, such as those of Chang & Kott (Biometrika, 2008).
In multiphase sampling we have a sequence of $K$ nested subsamples, with sampling probabilities that can depend on any data observed at previous phases. We write $\pi_{i,1},\ \pi_{i,2|1},\ \pi_{i,3|2},\ \dots,\ \pi_{i,k|k-1},\ \dots,\ \pi_{i,K|K-1}$ for the sampling probabilities of observation $i$ at each phase, and similarly for pairwise probabilities. In general, we use a ${}_{,k|k-1}$ subscript for quantities related to the $k$th phase of sampling and a ${}_{,k}$ subscript for quantities related to the cumulative effect of sampling from phase 1 through phase $k$. All sums are over the population unless otherwise specified; restriction to samples is achieved with the sampling indicators $R_{i,k}$ (and $R_i\equiv R_{i,K}$). We also write $$\pi^\dagger_{ij,k}=\pi^*_{ij,K}/\pi^*_{ij,k}=\prod_{\ell=k+1}^K \pi_{ij,\ell|\ell-1}$$ for the forward cumulative pairwise probabilities.
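To make the bookkeeping concrete, here is a minimal numerical sketch (Python, with made-up per-phase probabilities) of how the cumulative probabilities $\pi^*_{i,k}$ and the forward cumulative probabilities are built up from the per-phase conditional probabilities; the pairwise versions are built the same way from one matrix per phase.

```python
import numpy as np

# Hypothetical per-phase conditional probabilities pi_{i,k|k-1} for a toy
# population of 4 units and K = 3 phases (rows = units, columns = phases).
phase_probs = np.array([
    [0.5, 0.8, 0.6],
    [0.5, 0.7, 0.9],
    [0.4, 0.8, 0.5],
    [0.6, 0.6, 0.7],
])

# Cumulative probabilities pi*_{i,k} = pi_{i,1} pi_{i,2|1} ... pi_{i,k|k-1}
pi_star = np.cumprod(phase_probs, axis=1)

# Forward cumulative probabilities pi*_{i,K} / pi*_{i,k}
pi_dagger = pi_star[:, [-1]] / pi_star
```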
The product $\pi^*_i \equiv \pi^*_{i,K} = \pi_{i,1}\pi_{i,2|1}\pi_{i,3|2}\cdots\pi_{i,K|K-1}$ is not in general the marginal sampling probability for observation $i$, because each $\pi_{k|k-1}$ can depend on data up to phase $k-1$, so $\pi^*_i=\pi^*_{i,K}$ is a random variable. To get the true marginal probabilities we would need to integrate out all the dependence on data measured at intermediate phases. However, it is still true that $E[R_{i,k}/\pi^*_{i,k}]=1$, which is the key fact we need to estimate totals $$\hat T_X = \sum_i \frac{R_i}{\pi_i^*}X_i$$
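Continuing the toy arrays above, with hypothetical values of $x$ and of the final-phase indicators, the estimated total is just the weighted sum over the phase-$K$ sample:

```python
# Hypothetical data: x is the variable of interest (only needed where
# R_K == 1) and R_K indicates membership in the final-phase sample.
x = np.array([2.0, 5.0, 3.0, 4.0])
R_K = np.array([1, 0, 0, 1])

# hat T_X = sum_i R_{i,K} x_i / pi*_{i,K}
T_hat = np.sum(R_K * x / pi_star[:, -1])
```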
It’s also still true (per Särndal et al.) that the variance of a total can be estimated by something very like the Horvitz-Thompson formula $$\widehat{\mathrm{var}}[\hat T_X]= \sum_{i,j} \check{X}_i\check{X}_j\check{\Delta}_{ij}$$ where $\check X_i = X_i/\pi^*_i$, $\Delta_{ij} = \pi^*_{ij} - \pi^*_i\pi^*_j$, and $\check\Delta_{ij} = \Delta_{ij}/\pi^*_{ij} = 1 - \pi^*_i\pi^*_j/\pi^*_{ij}$.
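A direct implementation of this estimator is a double sum over sampled pairs; the sketch below assumes a matrix `pi_pair` of cumulative pairwise probabilities $\pi^*_{ij}$ (with diagonal $\pi^*_i$), which is not part of the toy example above.

```python
import numpy as np

def ht_total_variance(x, R, pi1, pi_pair):
    """Horvitz-Thompson-type variance estimate for a total.

    x       : variable of interest
    R       : final-phase sampling indicators R_i
    pi1     : cumulative probabilities pi*_i
    pi_pair : cumulative pairwise probabilities pi*_ij, with diagonal pi*_i
    """
    idx = np.flatnonzero(R)                      # restrict sums to the sample
    x_check = x[idx] / pi1[idx]                  # check-x_i = x_i / pi*_i
    sub = np.ix_(idx, idx)
    delta_check = 1.0 - np.outer(pi1[idx], pi1[idx]) / pi_pair[sub]
    return x_check @ delta_check @ x_check       # sum_ij check-x_i check-x_j check-Delta_ij
```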
In the absence of raking we could speed up computation using a recursive relationship that constructs $\check\Delta$ from the weighted covariances $\check\Delta_1$, $\check\Delta_{2|1}$, and so on at each phase, subsetted down to the subsample remaining at phase $K$. This also has the advantage of simplicity. However, we can’t do this precomputation with raking, and it has the disadvantage of not giving components of variance at each phase, which we like having. We do use the recursive combination for constructing $\check\Delta_{k|k-1}$ in multistage designs within a single phase.
Instead, we need to use a summation over phases. We consider an estimated total as a telescoping sum over phases $$\hat T_X - T_X = (\hat T_{X,1} - T_X) + (\hat T_{X,2} - \hat T_{X,1}) + \cdots + (\hat T_X - \hat T_{X,K-1})$$ where each term is the error incurred by one phase of sampling. That is
$$\hat T_{X,1}-T_X=\sum_i \frac{R_{i,1}}{\pi^*_{i,1}}x_i- \sum_i x_i$$ $$\hat T_{X,2}-\hat T_{X,1}=\sum_i \frac{R_{i,2}}{\pi^*_{i,2}}x_i- \sum_i \frac{R_{i,1}}{\pi^*_{i,1}}x_i$$ and in general $$\hat T_{X,k}-\hat T_{X,k-1}=\sum_i \frac{R_{i,k}}{\pi^*_{i,k}}x_i- \sum_i \frac{R_{i,k-1}}{\pi^*_{i,k-1}}x_i$$ These terms are all uncorrelated, because conditional on the sampling at previous phases each has mean zero. They aren’t independent, because the data available for sampling at phase $k$ depend on all previous phases, but they do form a martingale difference sequence. I will write $\check x_{i,k}$ for $x_i/\pi^*_{i,k}$, the weighted observation at phase $k$, so that we have $$\hat T_{X,k}-\hat T_{X,k-1}=\sum_i R_{i,k}\check{x}_{i,k}-\sum_i R_{i,k-1}\check{x}_{i,k-1}$$
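For intuition only, here is a sketch of the telescoping decomposition on the toy example, pretending $x$ were known for the whole population so that every $\hat T_{X,k}$ can actually be computed; the cumulative indicators `R` are hypothetical.

```python
# Hypothetical cumulative sampling indicators R_{i,k}: column k-1 marks
# membership in the phase-k subsample (nested, so each row only drops from 1 to 0).
R = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
])

# Phase-k estimates hat T_{X,k} = sum_i R_{i,k} x_i / pi*_{i,k}
T_by_phase = (R * x[:, None] / pi_star).sum(axis=0)

# Telescoping terms: first entry is hat T_{X,1} - T_X, then the successive
# differences hat T_{X,k} - hat T_{X,k-1}; together they sum to hat T_X - T_X.
phase_errors = np.diff(np.concatenate(([x.sum()], T_by_phase)))
```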
The variance of this sum is thus the sum of variances $$\mathrm{var}[\hat T_X] = \mathrm{var}[\hat T_{X,1}-T_X] + \mathrm{var}[\hat T_{X,2}-\hat T_{X,1}] + \cdots + \mathrm{var}[\hat T_X-\hat T_{X,K-1}]$$ and each variance is (conditional on the sampling so far) of the usual Horvitz-Thompson form $$\mathrm{var}[\hat T_{X,k}-\hat T_{X,k-1}] = \sum_{i,j} R_{i,k-1}R_{j,k-1}\,\mathrm{cov}\left[R_{i,k|k-1},R_{j,k|k-1}\right]\check{x}_{i,k}\check{x}_{j,k}$$ which could be estimated at phase $k$ by $$\widehat{\mathrm{var}}_k\left[\hat T_{X,k}-\hat T_{X,k-1}\right]=\sum_{i,j}\frac{R_{i,k}R_{j,k}}{\pi_{ij,k|k-1}}\mathrm{cov}\left[R_{i,k|k-1},R_{j,k|k-1}\right]\check{x}_{i,k}\check{x}_{j,k}=\sum_{i,j}R_{i,k}R_{j,k}\check{\Delta}_{ij,k|k-1}\check{x}_{i,k}\check{x}_{j,k}$$ That’s still not enough, because we don’t necessarily have $x$ until phase $K$, so we need to weight down to phase $K$: $$\widehat{\mathrm{var}}_K\left[\hat T_{X,k}-\hat T_{X,k-1}\right]=\sum_{i,j}R_{i,k}R_{j,k}\frac{R_{i,K}R_{j,K}}{\pi^*_{ij,K}/\pi^*_{ij,k}}\check{\Delta}_{ij,k|k-1}\check{x}_{i,k}\check{x}_{j,k}=\sum_{i,j}\frac{R_{i,K}R_{j,K}}{\pi^\dagger_{ij,k}}\check{\Delta}_{ij,k|k-1}\check{x}_{i,k}\check{x}_{j,k}$$
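A sketch of the phase-$k$ contribution, assuming the per-phase matrices $\check\Delta_{ij,k|k-1}$ and the cumulative pairwise probability matrices $\pi^*_{ij,k}$ are available (all argument names are hypothetical):

```python
import numpy as np

def phase_variance_contribution(x, R_K, pi_star, pi_pair_cum, delta_check, k):
    """Estimated phase-k contribution to var[hat T_X], weighted down to phase K.

    x            : variable of interest (observed where R_K == 1)
    R_K          : final-phase indicators R_{i,K}
    pi_star      : cumulative probabilities pi*_{i,k}, one column per phase
    pi_pair_cum  : list of matrices of cumulative pairwise probabilities pi*_{ij,k}
    delta_check  : list of matrices check-Delta_{ij,k|k-1}, one per phase
    k            : phase number (1-based)
    """
    idx = np.flatnonzero(R_K)                   # pairs still present at phase K
    sub = np.ix_(idx, idx)
    x_check_k = x[idx] / pi_star[idx, k - 1]    # check-x_{i,k}
    # pi-dagger_{ij,k} = pi*_{ij,K} / pi*_{ij,k}
    pi_dagger_pair = pi_pair_cum[-1][sub] / pi_pair_cum[k - 1][sub]
    weighted_delta = delta_check[k - 1][sub] / pi_dagger_pair
    return x_check_k @ weighted_delta @ x_check_k
```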
Raking of phase $k$ to phase $k-1$ involves estimating raking adjustments $g_{i,k|k-1}$ that satisfy calibration constraints on variables $A_i$ available at phase $k-1$: $$\sum_i R_{i,k}g_{i,k|k-1}\frac{1}{\pi_{i,k}^*}A_i=\sum_i R_{i,k-1}\frac{1}{\pi^*_{i,k-1}}A_i$$ Let $\Pi_A$ and $\Pi_{\bar A}$ be the projections onto and orthogonal to the space spanned by $A$ given weights $1/\pi^*_{k-1}$, and let $\hat x_i = \Pi_A x_i$ and $e_i = \Pi_{\bar A} x_i$. The calibrated total estimate is $$\hat T_X=\sum_i\frac{R_ig_{i,k|k-1}}{\pi^*_i}x_i=\sum_iR_ig_{i,k|k-1}\check{x}_i=\sum_iR_ig_{i,k|k-1}\check{e}_{i,k}+\sum_i R_{i,k}g_{i,k|k-1}\check{\hat x}_{i,k}$$ Applying the calibration constraints to the second term, we have $$\sum_iR_ig_{i,k|k-1}\check{e}_{i,k}+\sum_i R_{i,k}g_{i,k|k-1}\check{\hat x}_{i,k}=\sum_iR_ig_{i,k|k-1}\check{e}_{i,k}+\sum_i R_{i,k-1}\check{\hat x}_{i,k-1}$$ and since $\check{\hat x}_{i,k-1}$ is not random (to first order) conditional on phase $k-1$, the estimated variance contribution for phase $k$ comes from just the first term and is $$\widehat{\textrm{var}}_{k|k-1}\left[\hat T_X\right]=\sum_{i,j}\frac{R_{i,K}R_{j,K}}{\pi^\dagger_{ij,k}}g_{i,k|k-1}g_{j,k|k-1}\check{e}_{i,k}\check{e}_{j,k}\check{\Delta}_{ij,k|k-1}$$ with the variances of the other phases being unaffected.
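As a sketch of the remaining ingredient, the residuals $e_i$ come from a weighted projection onto the calibration variables. The version below uses linear (regression) calibration for the projection, which is an assumption for illustration; with raking the $g$-weights come from iterative proportional fitting instead, but the residuals enter the phase-$k$ variance contribution in the same way, with $g_{i,k|k-1}e_i$ taking the place of $x_i$ in the previous sketch.

```python
import numpy as np

def calibration_residuals(x, A, w):
    """Residuals e_i = x_i - hat x_i from the weighted projection of x onto
    the columns of A, with weights w = 1/pi*_{i,k-1} (the phase k-1 weights).

    This is the linear-calibration version of Pi_{A-bar} x; it is a stand-in
    for the raking case, where g differs but e is used in the same way.
    """
    Aw = A * w[:, None]                           # W A, with W = diag(w)
    beta = np.linalg.solve(A.T @ Aw, Aw.T @ x)    # weighted least-squares coefficients
    return x - A @ beta                           # e = x - hat x
```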