Title: Analysis of Complex Survey Samples
Description: Summary statistics, two-sample tests, rank tests, generalised linear models, cumulative link models, Cox models, loglinear models, and general maximum pseudolikelihood estimation for multistage stratified, cluster-sampled, unequally weighted survey samples. Variances by Taylor series linearisation or replicate weights. Post-stratification, calibration, and raking. Two-phase and multiphase subsampling designs. Graphics. PPS sampling without replacement. Small-area estimation. Dual-frame designs.
Authors: Thomas Lumley, Peter Gao, Ben Schneider
Maintainer: Thomas Lumley <[email protected]>
License: GPL-2 | GPL-3
Version: 4.5
Built: 2024-11-17 02:55:30 UTC
Source: https://github.com/r-forge/r-survey
A method for the anova function, for use on svyglm and svycoxph objects. With a single model argument it produces a sequential anova table; with two arguments it compares the two models.
## S3 method for class 'svyglm'
anova(object, object2 = NULL, test = c("F", "Chisq"),
    method = c("LRT", "Wald"), tolerance = 1e-05, ..., force = FALSE)
## S3 method for class 'svycoxph'
anova(object, object2 = NULL, test = c("F", "Chisq"),
    method = c("LRT", "Wald"), tolerance = 1e-5, ..., force = FALSE)
## S3 method for class 'svyglm'
AIC(object, ..., k = 2, null_has_intercept = TRUE)
## S3 method for class 'svyglm'
BIC(object, ..., maximal)
## S3 method for class 'svyglm'
extractAIC(fit, scale, k = 2, ..., null_has_intercept = TRUE)
## S3 method for class 'svrepglm'
extractAIC(fit, scale, k = 2, ..., null_has_intercept = TRUE)
object, fit: A svyglm or svycoxph object.
object2: Optionally, a second model of the same type to compare with the first.
test: Use (linear combination of) F or chi-squared distributions for p-values. F is usually preferable.
method: Use weighted deviance difference (LRT) or Wald tests to compare models.
tolerance: For models that are not symbolically nested, the tolerance for deciding that a term is common to the models.
...: For AIC and BIC, optionally further models to compare.
scale: not used
null_has_intercept: Does the null model for AIC have an intercept or not? Must be FALSE if any of the models has no intercept.
force: Force the tests to be done by explicit projection even if the models are symbolically nested (eg, for debugging).
maximal: A svyglm object that the models being compared with BIC are nested in.
k: Multiplier for effective df in AIC. Usually 2. There is no choice of k that will give BIC.
The reference distribution for the LRT depends on the misspecification effects for the parameters being tested (Rao and Scott, 1984). If the models are symbolically nested, so that the relevant parameters can be identified just by manipulating the model formulas, anova is equivalent to regTermTest. If the models are nested but not symbolically nested, more computation using the design matrices is needed to determine the projection matrix on to the parameters being tested. In the examples below, model0 is symbolically nested in model1 and model2, because model0 can be obtained just by deleting terms from their formulas. On the other hand, model1 is nested in model2 but not symbolically nested: knowing that the models are nested requires knowing what design matrix columns are produced by stype and as.numeric(stype). Other typical examples of models that are nested but not symbolically nested are linear and spline models for a continuous covariate, or models with categorical versions of a variable at different resolutions (eg, smoking yes/no or smoking never/former/current).
A saddlepoint approximation is used for the LRT with numerator df greater than 1.
AIC is defined using the Rao-Scott approximation to the weighted loglikelihood (Lumley and Scott, 2015). It replaces the usual penalty term p, which is the null expectation of the log likelihood ratio, by the trace of the generalised design effect matrix, which is the expectation under complex sampling. For computational reasons everything is scaled so the weights sum to the sample size.
BIC is a BIC for the (approximate) multivariate Gaussian models on regression coefficients from the maximal model implied by each submodel (ie, the models that say some coefficients in the maximal model are zero) (Lumley and Scott, 2015). It corresponds to comparing the models with a Wald test and replacing the sample size in the penalty by an effective sample size. For computational reasons, the models must not only be nested, the names of the coefficients must match.
extractAIC for a model with a Gaussian link uses the actual AIC based on maximum likelihood estimation of the variance parameter as well as the regression parameters.
Object of class seqanova.svyglm if one model is given, otherwise of class regTermTest or regTermTestLRT.
At the moment, AIC works only for models including an intercept.
Rao, JNK, Scott, AJ (1984) "On Chi-squared Tests For Multiway Contingency Tables with Proportions Estimated From Survey Data" Annals of Statistics 12:46-60.
Lumley, T., & Scott, A. (2014). "Tests for Regression Models Fitted to Survey Data". Australian and New Zealand Journal of Statistics, 56 (1), 1-14.
Lumley T, Scott AJ (2015) "AIC and BIC for modelling with complex survey data" J Surv Stat Methodol 3 (1): 1-18.
data(api)
dclus2 <- svydesign(id=~dnum+snum, weights=~pw, data=apiclus2)
model0 <- svyglm(I(sch.wide=="Yes")~ell+meals+mobility, design=dclus2,
    family=quasibinomial())
model1 <- svyglm(I(sch.wide=="Yes")~ell+meals+mobility+as.numeric(stype),
    design=dclus2, family=quasibinomial())
model2 <- svyglm(I(sch.wide=="Yes")~ell+meals+mobility+stype, design=dclus2,
    family=quasibinomial())
anova(model2)
anova(model0, model2)
anova(model1, model2)
anova(model1, model2, method="Wald")
AIC(model0, model1, model2)
BIC(model0, model2, maximal=model2)
## AIC for linear model is different because it considers the variance
## parameter
model0 <- svyglm(api00~ell+meals+mobility, design=dclus2)
model1 <- svyglm(api00~ell+meals+mobility+as.numeric(stype), design=dclus2)
model2 <- svyglm(api00~ell+meals+mobility+stype, design=dclus2)
modelnull <- svyglm(api00~1, design=dclus2)
AIC(model0, model1, model2)
AIC(model0, model1, model2, modelnull, null_has_intercept=FALSE)
## from ?twophase
data(nwtco)
dcchs <- twophase(id=list(~seqno,~seqno), strata=list(NULL,~rel),
    subset=~I(in.subcohort | rel), data=nwtco)
a <- svycoxph(Surv(edrel,rel)~factor(stage)+factor(histol)+I(age/12),
    design=dcchs)
b <- update(a, .~.-I(age/12)+poly(age,3))
## not symbolically nested models
anova(a, b)
The Academic Performance Index is computed for all California schools based on standardised testing of students. The data sets contain information for all schools with at least 100 students and for various probability samples of the data.
data(api)
The full population data in apipop are a data frame with 6194 observations on the following 37 variables.
Unique identifier
Elementary/Middle/High School
School name (15 characters)
School name (40 characters)
School number
District name
District number
County name
County number
reason for missing data
percentage of students tested
API in 2000
API in 1999
target for change in API
Change in API
Met school-wide growth target?
Met Comparable Improvement target
Met both targets
Eligible for awards program
Percentage of students eligible for subsidized meals
‘English Language Learners’ (percent)
Year-round school
percentage of students for whom this is the first year at the school
average class size years K-3
average class size years 4-6
Number of core academic courses
percent where parental education level is known
percent parents not high-school graduates
percent parents who are high-school graduates
percent parents with some college
percent parents with college degree
percent parents with postgraduate education
average parental education level
percent fully qualified teachers
percent teachers with emergency qualifications
number of students enrolled
number of students tested.
The other data sets contain additional variables pw for sampling weights and fpc to compute finite population corrections to variance.
apipop is the entire population, apisrs is a simple random sample, apiclus1 is a cluster sample of school districts, apistrat is a sample stratified by stype, and apiclus2 is a two-stage cluster sample of schools within districts. The sampling weights in apiclus1 are incorrect (the weight should be 757/15) but are as obtained from UCLA.
Data were obtained from the survey sampling help pages of UCLA Academic Technology Services; these pages are no longer online.
The API program was discontinued at the end of 2018, and the archive page at the California Department of Education is now gone. The Wikipedia article has links to past material at the Internet Archive: https://en.wikipedia.org/wiki/Academic_Performance_Index_(California_public_schools)
library(survey)
data(api)
mean(apipop$api00)
sum(apipop$enroll, na.rm=TRUE)
# stratified sample
dstrat <- svydesign(id=~1, strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
summary(dstrat)
svymean(~api00, dstrat)
svytotal(~enroll, dstrat, na.rm=TRUE)
# one-stage cluster sample
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
summary(dclus1)
svymean(~api00, dclus1)
svytotal(~enroll, dclus1, na.rm=TRUE)
# two-stage cluster sample
dclus2 <- svydesign(id=~dnum+snum, fpc=~fpc1+fpc2, data=apiclus2)
summary(dclus2)
svymean(~api00, dclus2)
svytotal(~enroll, dclus2, na.rm=TRUE)
# two-stage `with replacement'
dclus2wr <- svydesign(id=~dnum+snum, weights=~pw, data=apiclus2)
summary(dclus2wr)
svymean(~api00, dclus2wr)
svytotal(~enroll, dclus2wr, na.rm=TRUE)
# convert to replicate weights
rclus1 <- as.svrepdesign(dclus1)
summary(rclus1)
svymean(~api00, rclus1)
svytotal(~enroll, rclus1, na.rm=TRUE)
# post-stratify on school type
pop.types <- xtabs(~stype, data=apipop)
rclus1p <- postStratify(rclus1, ~stype, pop.types)
dclus1p <- postStratify(dclus1, ~stype, pop.types)
summary(dclus1p)
summary(rclus1p)
svymean(~api00, dclus1p)
svytotal(~enroll, dclus1p, na.rm=TRUE)
svymean(~api00, rclus1p)
svytotal(~enroll, rclus1p, na.rm=TRUE)
This function creates an object to store the number of clusters sampled within each stratum (at each stage of multistage sampling) and the number of clusters available in the population. It is called by svydesign, not directly by the user.
as.fpc(df, strata, ids, pps=FALSE)
df: A data frame or matrix with population size information
strata: A data frame giving strata at each stage
ids: A data frame giving cluster ids at each stage
pps: if TRUE, the design uses PPS sampling, and the population size information may vary within a stratum
The population size information may be specified as the number of clusters in the population or as the proportion of clusters sampled.
An object of class survey_fpc
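As an illustration of the two forms of population size information, the following sketch (reusing the fpc example data documented at the end of this section; the object names withN and withfrac are illustrative) declares the same stratified design first with stratum population counts and then with sampling fractions:

library(survey)
data(fpc)
## fpc given as the number of units in the population stratum
withN <- svydesign(weights=~weight, ids=~psuid, strata=~stratid,
    fpc=~Nh, variables=~x, data=fpc, nest=TRUE)
## fpc given as the proportion of units sampled
withfrac <- svydesign(weights=~weight, ids=~psuid, strata=~stratid,
    fpc=~I(nh/Nh), variables=~x, data=fpc, nest=TRUE)
svymean(~x, withN)
svymean(~x, withfrac)  # identical estimate and standard error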
Creates a replicate-weights survey design object from a traditional strata/cluster survey design object. JK1 and JKn are jackknife methods, BRR is Balanced Repeated Replicates, Fay is Fay's modification of BRR, bootstrap is Canty and Davison's bootstrap, subbootstrap is Rao and Wu's bootstrap, and mrbbootstrap is Preston's multistage rescaled bootstrap. With a svyimputationList object, the same replicate weights will be used for each imputation if the sampling weights are all the same and separate.replicates=FALSE.
as.svrepdesign(design, ...)
## Default S3 method:
as.svrepdesign(design, type=c("auto", "JK1", "JKn", "BRR", "bootstrap",
        "subbootstrap", "mrbbootstrap", "Fay"),
    fay.rho=0, fpc=NULL, fpctype=NULL, ..., compress=TRUE,
    mse=getOption("survey.replicates.mse"))
## S3 method for class 'svyimputationList'
as.svrepdesign(design, type=c("auto", "JK1", "JKn", "BRR", "bootstrap",
        "subbootstrap", "mrbbootstrap", "Fay"),
    fay.rho=0, fpc=NULL, fpctype=NULL, separate.replicates=FALSE, ...,
    compress=TRUE, mse=getOption("survey.replicates.mse"))
design: Object of class survey.design or svyimputationList
type: Type of replicate weights. "auto" uses JKn for stratified designs and JK1 for unstratified designs
fay.rho: Tuning parameter for Fay's variance method
fpc, fpctype, ...: Passed to jk1weights, jknweights, brrweights, bootweights, subbootweights, or mrbweights
separate.replicates: Compute replicate weights separately for each design (useful for the bootstrap types, which are not deterministic)
compress: Use a compressed representation of the replicate weights matrix.
mse: if TRUE, compute variances from sums of squares around the point estimate rather than around the mean of the replicates
Object of class svyrep.design.
Canty AJ, Davison AC. (1999) Resampling-based variance estimation for labour force surveys. The Statistician 48:379-391
Judkins, D. (1990), "Fay's Method for Variance Estimation," Journal of Official Statistics, 6, 223-239.
Preston J. (2009) Rescaled bootstrap for stratified multistage sampling. Survey Methodology 35(2) 227-234
Rao JNK, Wu CFJ. Bootstrap inference for sample surveys. Proc Section on Survey Research Methodology. 1993 (866–871)
brrweights, svydesign, svrepdesign, bootweights, subbootweights, mrbweights
data(scd)
scddes <- svydesign(data=scd, prob=~1, id=~ambulance, strata=~ESA,
    nest=TRUE, fpc=rep(5,6))
scdnofpc <- svydesign(data=scd, prob=~1, id=~ambulance, strata=~ESA,
    nest=TRUE)
# convert to BRR replicate weights
scd2brr <- as.svrepdesign(scdnofpc, type="BRR")
scd2fay <- as.svrepdesign(scdnofpc, type="Fay", fay.rho=0.3)
# convert to JKn weights
scd2jkn <- as.svrepdesign(scdnofpc, type="JKn")
# convert to JKn weights with finite population correction
scd2jknf <- as.svrepdesign(scddes, type="JKn")
## with user-supplied hadamard matrix
scd2brr1 <- as.svrepdesign(scdnofpc, type="BRR", hadamard.matrix=paley(11))
svyratio(~alive, ~arrests, design=scd2brr)
svyratio(~alive, ~arrests, design=scd2brr1)
svyratio(~alive, ~arrests, design=scd2fay)
svyratio(~alive, ~arrests, design=scd2jkn)
svyratio(~alive, ~arrests, design=scd2jknf)
data(api)
## one-stage cluster sample
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
## convert to JK1 jackknife
rclus1 <- as.svrepdesign(dclus1)
## convert to bootstrap
bclus1 <- as.svrepdesign(dclus1, type="bootstrap", replicates=100)
svymean(~api00, dclus1)
svytotal(~enroll, dclus1)
svymean(~api00, rclus1)
svytotal(~enroll, rclus1)
svymean(~api00, bclus1)
svytotal(~enroll, bclus1)
dclus2 <- svydesign(id=~dnum+snum, fpc=~fpc1+fpc2, data=apiclus2)
mrbclus2 <- as.svrepdesign(dclus2, type="mrb", replicates=100)
svytotal(~api00+stype, dclus2)
svytotal(~api00+stype, mrbclus2)
The structure of survey design objects changed in version 2.9, to allow standard errors based on multistage sampling. as.svydesign2 converts an object to the new structure and .svycheck warns if an object does not have the new structure.
You can set options(survey.want.obsolete=TRUE) to suppress the warnings produced by .svycheck, and options(survey.ultimate.cluster=TRUE) to always compute variances based on just the first stage of sampling.
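A minimal sketch of setting these options, with the values described above:

## suppress the warnings produced by .svycheck
options(survey.want.obsolete = TRUE)
## always compute variances from just the first stage of sampling
options(survey.ultimate.cluster = TRUE)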
as.svydesign2(object)
.svycheck(object)
object: Object produced by svydesign
Object of class survey.design2
Draws a barplot or dotplot based on results from a survey analysis. The default barplot method already works for results from svytable.
## S3 method for class 'svystat'
barplot(height, ...)
## S3 method for class 'svrepstat'
barplot(height, ...)
## S3 method for class 'svyby'
barplot(height, beside=TRUE, ...)
## S3 method for class 'svystat'
dotchart(x, ..., pch=19)
## S3 method for class 'svrepstat'
dotchart(x, ..., pch=19)
## S3 method for class 'svyby'
dotchart(x, ..., pch=19)
height, x: Analysis result
beside: Grouped, rather than stacked, bars
...: Arguments to barplot or dotchart
pch: Overrides the default pch in dotchart
data(api)
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
a <- svymean(~stype, dclus1)
barplot(a)
barplot(a, names.arg=c("Elementary","High","Middle"), col="purple",
    main="Proportions of school level")
b <- svyby(~enroll+api.stu, ~stype, dclus1, svymean)
barplot(b, beside=TRUE, legend=TRUE)
dotchart(b)
Bootstrap weights for infinite populations ('with replacement' sampling) are created by sampling with replacement from the PSUs in each stratum. subbootweights() samples n-1 PSUs from the n available (Rao and Wu); bootweights samples n (Canty and Davison).
For multistage designs or those with large sampling fractions, mrbweights implements Preston's multistage rescaled bootstrap. The multistage rescaled bootstrap is still useful for single-stage designs with small sampling fractions, where it reduces to a half-sample replicate method.
bootweights(strata, psu, replicates = 50, fpc = NULL,
    fpctype = c("population", "fraction", "correction"), compress = TRUE)
subbootweights(strata, psu, replicates = 50, compress = TRUE)
mrbweights(clusters, stratas, fpcs, replicates = 50,
    multicore = getOption("survey.multicore"))
strata: Identifier for sampling strata (top level only)
stratas: data frame of strata for all stages of sampling
psu: Identifier for primary sampling units
clusters: data frame of identifiers for sampling units at each stage
replicates: Number of bootstrap replicates
fpc: Finite population correction (top level only)
fpctype: Is fpc given as a population size, a sampling fraction, or a finite population correction factor?
fpcs: Finite population correction information for each stage of sampling
compress: Should the replicate weights be compressed?
multicore: Use the multicore package to generate the replicates in parallel
A set of replicate weights
With multicore=TRUE the resampling procedure does not use the current random seed, so the results cannot be exactly reproduced even by using set.seed().
These bootstraps are strictly appropriate only when the first stage of sampling is a simple or stratified random sample of PSUs with or without replacement, and not (eg) for PPS sampling. The functions will not enforce simple random sampling, so they can be used (approximately) for data that have had non-response corrections and other weight adjustments. It is preferable to apply these adjustments after creating the bootstrap replicate weights, but that may not be possible with public-use data.
Canty AJ, Davison AC. (1999) Resampling-based variance estimation for labour force surveys. The Statistician 48:379-391
Judkins, D. (1990), "Fay's Method for Variance Estimation," Journal of Official Statistics, 6, 223-239.
Preston J. (2009) Rescaled bootstrap for stratified multistage sampling. Survey Methodology 35(2) 227-234
Rao JNK, Wu CFJ. Bootstrap inference for sample surveys. Proc Section on Survey Research Methodology. 1993 (866–871)
https://bschneidr.github.io/svrep/ for other sorts of replicate weights
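Although these functions are usually reached through as.svrepdesign, here is a short sketch of the two with-replacement bootstraps, reusing the api cluster sample documented earlier (the object names are illustrative):

library(survey)
data(api)
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
## Canty-Davison bootstrap: samples n PSUs per stratum
bclus1 <- as.svrepdesign(dclus1, type="bootstrap", replicates=100)
## Rao-Wu bootstrap: samples n-1 PSUs per stratum
sbclus1 <- as.svrepdesign(dclus1, type="subbootstrap", replicates=100)
svymean(~api00, bclus1)
svymean(~api00, sbclus1)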
Compute replicate weights from a survey design. These functions are usually called from as.svrepdesign rather than directly by the user.
brrweights(strata, psu, match = NULL, small = c("fail","split","merge"),
    large = c("split", "merge", "fail"), fay.rho = 0, only.weights = FALSE,
    compress = TRUE, hadamard.matrix = NULL)
jk1weights(psu, fpc = NULL,
    fpctype = c("population","fraction","correction"), compress = TRUE)
jknweights(strata, psu, fpc = NULL,
    fpctype = c("population","fraction","correction"),
    compress = TRUE, lonely.psu = getOption("survey.lonely.psu"))
strata: Stratum identifiers
psu: PSU (cluster) identifier
match: Optional variable to use in matching.
small: How to handle strata with only one PSU
large: How to handle strata with more than two PSUs
fpc: Optional population (stratum) size or finite population correction
fpctype: How fpc is specified: population size, sampling fraction, or correction factor
fay.rho: Parameter for Fay's extended BRR method
only.weights: If TRUE return only the matrix of replicate weights
compress: If TRUE, use a compressed representation of the replicate weights
hadamard.matrix: Optional user-supplied Hadamard matrix for brrweights
lonely.psu: Handling of non-certainty single-PSU strata
JK1 and JKn are jackknife schemes for unstratified and stratified designs respectively. The finite population correction may be specified as a single number, a vector with one entry per stratum, or a vector with one entry per observation (constant within strata). When fpc is a vector with one entry per stratum it may not have names that differ from the stratum identifiers (it may have no names, in which case it must be in the same order as unique(strata)). To specify population stratum sizes use fpctype="population", to specify sampling fractions use fpctype="fraction", and to specify the correction directly use fpctype="correction".
The only reason not to use compress=TRUE is that it is new and there is a greater possibility of bugs. It reduces the number of rows of the replicate weights matrix from the number of observations to the number of PSUs.
In BRR variance estimation each stratum is split in two to give half-samples. Balanced replicate weights are needed, where observations in two different strata end up in the same half-stratum as often as in different half-strata. BRR, strictly speaking, is defined only when each stratum has exactly two PSUs. A stratum with one PSU can be merged with another such stratum, or can be split to appear in both half-samples with half weight. The latter approach is appropriate for a PSU that was deterministically sampled.
A stratum with more than two PSUs can be split into multiple smaller strata each with two PSUs or the PSUs can be merged to give two superclusters within the stratum.
When merging small strata or grouping PSUs in large strata, the match variable is used to sort PSUs before merging, to give approximate matching on this variable.
If you want more control than this you should probably construct your own weights using the Hadamard matrices produced by hadamard.
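For instance, a sketch of supplying your own matrix, assuming (as the hadamard help page referenced here describes) that hadamard(n) returns a Hadamard matrix large enough for n strata; the examples below do the same thing with paley(11):

library(survey)
data(scd)
scdnofpc <- svydesign(data=scd, prob=~1, id=~ambulance, strata=~ESA,
    nest=TRUE)
h <- hadamard(6)  # assumed large enough for the 6 strata here
scd2brr2 <- as.svrepdesign(scdnofpc, type="BRR", hadamard.matrix=h)
svymean(~alive, scd2brr2)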
For brrweights with only.weights=FALSE, a list with elements:
weights: two-column matrix indicating the weight for each half-stratum in one particular set of split samples
wstrata: New stratum variable incorporating merged or split strata
strata: Original strata for distinct PSUs
psu: Distinct PSUs
npairs: Dimension of Hadamard matrix used in BRR construction
sampler: function returning replicate weights
compress: Indicates whether the replicate weights returned by sampler are compressed
For jk1weights and jknweights, a data frame of replicate weights and the scale and rscale arguments to svrVar.
Levy and Lemeshow "Sampling of Populations". Wiley.
Shao and Tu "The Jackknife and Bootstrap". Springer.
hadamard, as.svrepdesign, svrVar, surveyoptions
data(scd)
scdnofpc <- svydesign(data=scd, prob=~1, id=~ambulance, strata=~ESA,
    nest=TRUE)
## convert to BRR replicate weights
scd2brr <- as.svrepdesign(scdnofpc, type="BRR")
svymean(~alive, scd2brr)
svyratio(~alive, ~arrests, scd2brr)
## with user-supplied hadamard matrix
scd2brr1 <- as.svrepdesign(scdnofpc, type="BRR", hadamard.matrix=paley(11))
svymean(~alive, scd2brr1)
svyratio(~alive, ~arrests, scd2brr1)
Calibration, generalized raking, or GREG estimators generalise post-stratification and raking by calibrating a sample to the marginal totals of variables in a linear regression model. This function reweights the survey design and adds additional information that is used by svyrecvar to reduce the estimated standard errors.
calibrate(design, ...)
## S3 method for class 'survey.design2'
calibrate(design, formula, population, aggregate.stage=NULL, stage=0,
    variance=NULL, bounds=c(-Inf,Inf), calfun=c("linear","raking","logit"),
    maxit=50, epsilon=1e-7, verbose=FALSE, force=FALSE, trim=NULL,
    bounds.const=FALSE, sparse=FALSE, ...)
## S3 method for class 'svyrep.design'
calibrate(design, formula, population, compress=NA, aggregate.index=NULL,
    variance=NULL, bounds=c(-Inf,Inf), calfun=c("linear","raking","logit"),
    maxit=50, epsilon=1e-7, verbose=FALSE, force=FALSE, trim=NULL,
    bounds.const=FALSE, sparse=FALSE, ...)
## S3 method for class 'twophase'
calibrate(design, phase=2, formula, population,
    calfun=c("linear","raking","logit","rrz"), ...)
grake(mm, ww, calfun, eta=rep(0, NCOL(mm)), bounds, population, epsilon,
    verbose, maxit, variance=NULL)
cal_names(formula, design, ...)
design: Survey design object
formula: Model formula for calibration model, or list of formulas for each margin
population: Vectors of population column totals for the model matrix in the calibration model, or list of such vectors for each cluster, or list of tables for each margin. Required except for two-phase designs
compress: Compress the resulting replicate weights if TRUE, or if NA and the original weights were compressed
stage: See Details below
variance: Coefficients for variance in calibration model (heteroskedasticity parameters) (see Details below)
aggregate.stage: An integer. If not NULL, the calibration weight adjustments are forced to be constant within sampling units at this stage
aggregate.index: A vector or one-sided formula. If not NULL, the calibration weight adjustments are forced to be constant within levels of this variable (for replicate-weight designs)
bounds: Bounds for the calibration weights, optional except for calfun="logit"
bounds.const: Should be TRUE if bounds are specified as constant values rather than as multiples of the original weights (see Details below)
trim: Weights outside this range will be trimmed to these bounds.
...: Options for other methods
calfun: Calibration function: see below
maxit: Number of iterations
epsilon: Tolerance in matching population total. Either a single number or a vector of the same length as population
verbose: Print lots of uninteresting information
force: Return an answer even if the specified accuracy was not achieved
phase: Phase of a two-phase design to calibrate (phase-1 calibration is not available for designs constructed with method="approx"; see Details)
mm: Model matrix
ww: Vector of weights
eta: Starting values for iteration
sparse: Use sparse matrices for faster computation
The formula argument specifies a model matrix, and the population argument is the population column sums of this matrix. The function cal_names shows what the column names of this model matrix will be.
For the important special case where the calibration totals are (possibly overlapping) marginal tables of factor variables, as in classical raking, the formula and population arguments may be lists in the same format as the input to rake.
If the population argument has a names attribute it will be checked against the names produced by model.matrix(formula) and reordered if necessary. This protects against situations where the (locale-dependent) ordering of factor levels is not what you expected.
Numerical instabilities may result if the sampling weights in the design object are wrong by multiple orders of magnitude. The code now attempts to rescale the weights first, but it is better for the user to ensure that the scale is reasonable.
The calibrate function implements linear, bounded linear, raking, bounded raking, and logit calibration functions. All except unbounded linear calibration use the Newton-Raphson algorithm described by Deville et al (1993). This algorithm is exposed for other uses in the grake function. Unbounded linear calibration uses an algorithm that is less sensitive to collinearity. The calibration function may be specified as a string naming one of the three built-in functions or as an object of class calfun, allowing user-defined functions. See make.calfun for details.
The bounds argument can be specified as global upper and lower bounds, e.g. bounds=c(0.5, 2), or as a list with lower and upper vectors, e.g. bounds=list(lower=lower, upper=upper). This allows for individual boundary constraints for each unit. The lower and upper vectors must be the same length as the input data. The bounds can be specified as multiplicative values or constant values. If constant, bounds.const must be set to TRUE.
Calibration with bounds, or on highly collinear data, may fail. If force=TRUE the approximately calibrated design object will still be returned (useful for examining why it failed). A failure in calibrating a set of replicate weights when the sampling weights were successfully calibrated will give only a warning, not an error.
When calibration to the desired set of bounds is not possible, another option is to trim weights. To do this set bounds to a looser set of bounds for which calibration is achievable and set trim to the tighter bounds. Weights outside the bounds will be trimmed to the bounds, and the excess weight distributed over other observations in proportion to their sampling weight (and so this may put some other observations slightly over the trimming bounds). The projection matrix used in computing standard errors is based on the feasible bounds specified by the bounds argument. See also trimWeights, which trims the final weights in a design object rather than the calibration adjustments.
For two-phase designs calfun="rrz" estimates the sampling probabilities using logistic regression as described by Robins et al (1994). estWeights will do the same thing.
Calibration may result in observations within the last-stage sampling units having unequal weight even though they necessarily are sampled together. Specifying aggregate.stage ensures that the calibration weight adjustments are constant within sampling units at the specified stage; if the original sampling weights were equal the final weights will also be equal. The algorithm is as described by Vanderhoeft (2001, section III.D). Specifying aggregate.index does the same thing for replicate weight designs; a warning will be given if the original weights are not constant within levels of aggregate.index.
In a model with two-stage sampling, population totals may be available for the PSUs actually sampled, but not for the whole population. In this situation, calibrating within each PSU reduces the second-stage contribution to variance. This generalizes to multistage sampling. The stage argument specifies which stage of sampling the totals refer to. Stage 0 is full population totals, stage 1 is totals for PSUs, and so on. The default, stage=NULL, is interpreted as stage 0 when a single population vector is supplied and stage 1 when a list is supplied. Calibrating to PSU totals will fail (with a message about an exactly singular matrix) for PSUs that have fewer observations than the number of calibration variables.
The variance in the calibration model may depend on covariates. If variance=NULL the calibration model has constant variance. If variance is not NULL it specifies a linear combination of the columns of the model matrix and the calibration variance is proportional to that linear combination. Alternatively, variance can be specified as a vector of values the same length as the input data, specifying a heteroskedasticity parameter for each unit.
The design matrix specified by formula (after any aggregation) must be of full rank, with one exception. If the population total for a column is zero and all the observations are zero the column will be ignored. This allows the use of factors where the population happens to have no observations at some level.
In a two-phase design, population may be omitted when phase=2, to specify calibration to the phase-one sample. If the two-phase design object was constructed using the more memory-efficient method="approx" argument to twophase, calibration of the first phase of sampling to the population is not supported.
In a two-phase design, formula may be a glm, lm, or coxph model fitted to the phase-one data. Calibration will be done using the influence functions of this model as the calibration variables.
A survey design object.
Breslow NE, Lumley T, Ballantyne CM, Chambless LE, Kulich M. Using the whole cohort in the analysis of case-cohort data. Am J Epidemiol. 2009;169(11):1398-1405. doi:10.1093/aje/kwp055
Deville J-C, Sarndal C-E, Sautory O (1993) Generalized Raking Procedures in Survey Sampling. JASA 88:1013-1020
Kalton G, Flores-Cervantes I (2003) "Weighting methods" J Official Stat 19(2) 81-97
Lumley T, Shaw PA, Dai JY (2011) "Connections between survey calibration estimators and semiparametric models for incomplete data" International Statistical Review. 79:200-220. (with discussion 79:221-232)
Sarndal C-E, Swensson B, Wretman J. "Model Assisted Survey Sampling". Springer. 1991.
Rao JNK, Yung W, Hidiroglou MA (2002) Estimating equations for the analysis of survey data using poststratification information. Sankhya 64 Series A Part 2, 364-378.
Robins JM, Rotnitzky A, Zhao LP. (1994) Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89, 846-866.
Vanderhoeft C (2001) Generalized Calibration at Statistics Belgium. Statistics Belgium Working Paper No 3.
postStratify, rake for other ways to use auxiliary information.
twophase and vignette("epi") for an example of calibration in two-phase designs.
survey/tests/kalton.R for examples replicating those in Kalton & Flores-Cervantes (2003).
make.calfun for user-defined calibration distances.
trimWeights to trim final weights rather than calibration adjustments.
data(api)
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
cal_names(~stype, dclus1)
pop.totals <- c(`(Intercept)`=6194, stypeH=755, stypeM=1018)
## For a single factor variable this is equivalent to
## postStratify
(dclus1g <- calibrate(dclus1, ~stype, pop.totals))
svymean(~api00, dclus1g)
svytotal(~enroll, dclus1g)
svytotal(~stype, dclus1g)
## Make weights constant within school district
(dclus1agg <- calibrate(dclus1, ~stype, pop.totals, aggregate=1))
svymean(~api00, dclus1agg)
svytotal(~enroll, dclus1agg)
svytotal(~stype, dclus1agg)
## Now add sch.wide
cal_names(~stype+sch.wide, dclus1)
(dclus1g2 <- calibrate(dclus1, ~stype+sch.wide, c(pop.totals, sch.wideYes=5122)))
svymean(~api00, dclus1g2)
svytotal(~enroll, dclus1g2)
svytotal(~stype, dclus1g2)
## Finally, calibrate on 1999 API and school type
cal_names(~stype+api99, dclus1)
(dclus1g3 <- calibrate(dclus1, ~stype+api99, c(pop.totals, api99=3914069)))
svymean(~api00, dclus1g3)
svytotal(~enroll, dclus1g3)
svytotal(~stype, dclus1g3)
## Same syntax with replicate weights
rclus1 <- as.svrepdesign(dclus1)
(rclus1g3 <- calibrate(rclus1, ~stype+api99, c(pop.totals, api99=3914069)))
svymean(~api00, rclus1g3)
svytotal(~enroll, rclus1g3)
svytotal(~stype, rclus1g3)
(rclus1agg3 <- calibrate(rclus1, ~stype+api99, c(pop.totals, api99=3914069),
    aggregate.index=~dnum))
svymean(~api00, rclus1agg3)
svytotal(~enroll, rclus1agg3)
svytotal(~stype, rclus1agg3)
## Bounded weights
range(weights(dclus1g3)/weights(dclus1))
dclus1g3b <- calibrate(dclus1, ~stype+api99, c(pop.totals, api99=3914069),
    bounds=c(0.6,1.6))
range(weights(dclus1g3b)/weights(dclus1))
svymean(~api00, dclus1g3b)
svytotal(~enroll, dclus1g3b)
svytotal(~stype, dclus1g3b)
## Individual boundary constraints as constant values
# the first weight will be bounded at 40, the rest free to move
bnds <- list(
    lower = rep(-Inf, nrow(apiclus1)),
    upper = c(40, rep(Inf, nrow(apiclus1)-1)))
head(weights(dclus1g3))
dclus1g3b1 <- calibrate(dclus1, ~stype+api99, c(pop.totals, api99=3914069),
    bounds=bnds, bounds.const=TRUE)
head(weights(dclus1g3b1))
svytotal(~api.stu, dclus1g3b1)
## trimming
dclus1tr <- calibrate(dclus1, ~stype+api99, c(pop.totals, api99=3914069),
    bounds=c(0.5,2), trim=c(2/3,3/2))
svymean(~api00+api99+enroll, dclus1tr)
svytotal(~stype, dclus1tr)
range(weights(dclus1tr)/weights(dclus1))
rclus1tr <- calibrate(rclus1, ~stype+api99, c(pop.totals, api99=3914069),
    bounds=c(0.5,2), trim=c(2/3,3/2))
svymean(~api00+api99+enroll, rclus1tr)
svytotal(~stype, rclus1tr)
## Input in the same format as rake() for classical raking
pop.table <- xtabs(~stype+sch.wide, apipop)
pop.table2 <- xtabs(~stype+comp.imp, apipop)
dclus1r <- rake(dclus1, list(~stype+sch.wide, ~stype+comp.imp),
    list(pop.table, pop.table2))
gclus1r <- calibrate(dclus1, formula=list(~stype+sch.wide, ~stype+comp.imp),
    population=list(pop.table, pop.table2), calfun="raking")
svymean(~api00+stype, dclus1r)
svymean(~api00+stype, gclus1r)
## generalised raking
dclus1g3c <- calibrate(dclus1, ~stype+api99, c(pop.totals, api99=3914069),
    calfun="raking")
range(weights(dclus1g3c)/weights(dclus1))
(dclus1g3d <- calibrate(dclus1, ~stype+api99, c(pop.totals, api99=3914069),
    calfun=cal.logit, bounds=c(0.5,2.5)))
range(weights(dclus1g3d)/weights(dclus1))
## Ratio estimators are calibration estimators
dstrat <- svydesign(id=~1, strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
svytotal(~api.stu, dstrat)
common <- svyratio(~api.stu, ~enroll, dstrat, separate=FALSE)
predict(common, total=3811472)
pop <- 3811472
## equivalent to (common) ratio estimator
dstratg1 <- calibrate(dstrat, ~enroll-1, pop, variance=1)
svytotal(~api.stu, dstratg1)
# Alternatively, specifying the heteroskedasticity parameters directly
dstratgh <- calibrate(dstrat, ~enroll-1, pop, variance=apistrat$enroll)
svytotal(~api.stu, dstratgh)
Many replicate weight matrices have redundant rows, such as when weights are the same for all observations in a PSU. This function produces a compressed form. Methods for as.matrix and as.vector extract and expand the weights.
compressWeights(rw, ...)
## S3 method for class 'svyrep.design'
compressWeights(rw, ...)
## S3 method for class 'repweights_compressed'
as.matrix(x, ...)
## S3 method for class 'repweights_compressed'
as.vector(x, ...)
rw: A set of replicate weights or a svyrep.design object
x: A compressed set of replicate weights
...: For future expansion
An object of class repweights_compressed, or a svyrep.design object with a repweights element of class repweights_compressed.
data(api)
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
rclus1c <- as.svrepdesign(dclus1, compress=TRUE)
rclus1 <- as.svrepdesign(dclus1, compress=FALSE)
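The example stops short of comparing the two representations; a quick sketch of doing so, using base R's object.size and the as.matrix method documented above:

object.size(rclus1)   # uncompressed: one row of weights per observation
object.size(rclus1c)  # compressed: redundant rows stored only once
## expanding the compressed weights recovers the full matrix
dim(as.matrix(rclus1c$repweights))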
Computes confidence intervals for regression parameters in svyglm objects. The default is a Wald-type confidence interval, adding and subtracting a multiple of the standard error. The method="likelihood" interval is based on inverting the Rao-Scott likelihood ratio test: it is an interval where the working model deviance is lower than the threshold for the Rao-Scott test at the specified level.
## S3 method for class 'svyglm'
confint(object, parm, level = 0.95, method = c("Wald", "likelihood"),
    ddf = NULL, ...)
object: A svyglm object
parm: numeric or character vector indicating which parameters to construct intervals for.
level: desired coverage
method: See Description above
ddf: Denominator degrees of freedom for the "likelihood" method
...: for future expansion
A matrix of confidence intervals
J. N. K. Rao and Alistair J. Scott (1984) On Chi-squared Tests For Multiway Contingency Tables with Proportions Estimated From Survey Data. Annals of Statistics 12:46-60
data(api)
dclus2 <- svydesign(id=~dnum+snum, fpc=~fpc1+fpc2, data=apiclus2)
m <- svyglm(I(comp.imp=="Yes")~stype*emer+ell, design=dclus2,
    family=quasibinomial)
confint(m)
confint(m, method="like", ddf=NULL, parm=c("ell","emer"))
A tiny dataset from the VPLX manual.
data(crowd)
A data frame with 6 observations on the following 5 variables.
rooms: Number of rooms in the house
person: Number of people in the household
weight: Sampling weight
cluster: Cluster number
stratum: Stratum number
Manual for VPLX, Census Bureau.
data(crowd)
## Example 1-1
i1.1 <- as.svrepdesign(svydesign(id=~cluster, weight=~weight, data=crowd))
i1.1 <- update(i1.1, room.ratio=rooms/person, overcrowded=factor(person>rooms))
svymean(~rooms+person+room.ratio, i1.1)
svytotal(~rooms+person+room.ratio, i1.1)
svymean(~rooms+person+room.ratio, subset(i1.1, overcrowded==TRUE))
svytotal(~rooms+person+room.ratio, subset(i1.1, overcrowded==TRUE))
## Example 1-2
i1.2 <- as.svrepdesign(svydesign(id=~cluster, weight=~weight, strata=~stratum,
    data=crowd))
svymean(~rooms+person, i1.2)
svytotal(~rooms+person, i1.2)
dimnames returns variable names and row names for the data variables in a design object, and dim returns dimensions. For multiple imputation designs there is a third dimension giving the number of imputations. For database-backed designs the second dimension includes variables defined by update. The first dimension excludes observations with zero weight.
## S3 method for class 'survey.design'
dim(x)
## S3 method for class 'svyimputationList'
dim(x)
## S3 method for class 'survey.design'
dimnames(x)
## S3 method for class 'DBIsvydesign'
dimnames(x)
## S3 method for class 'svyimputationList'
dimnames(x)
x: Design object
A vector of numbers for dim; a list of vectors of strings for dimnames.
update.DBIsvydesign, with.svyimputationList
data(api)
dclus1 <- svydesign(ids=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
dim(dclus1)
dimnames(dclus1)
colnames(dclus1)
A sample of voting data from US states or counties (depending on data availability), sampled with probability proportional to number of votes. The sample was drawn using Tille's splitting method, implemented in the "sampling" package.
data(election)
election is a data frame with 4600 observations on the following 8 variables.
County: A factor specifying the state or county
TotPrecincts: Number of precincts in the state or county
PrecinctsReporting: Number of precincts supplying data
Bush: Votes for George W. Bush
Kerry: Votes for John Kerry
Nader: Votes for Ralph Nader
votes: Total votes for those three candidates
p: Sampling probability, proportional to votes
election_pps is a sample of 40 counties or states taken with probability proportional to the number of votes. It includes the additional column wt with the sampling weights. election_insample indicates which rows of election were sampled. election_jointprob are the pairwise sampling probabilities and election_jointHR are approximate pairwise sampling probabilities using the Hartley-Rao approximation.
data(election)
## high positive correlation between totals
plot(Bush~Kerry, data=election, log="xy")
## high negative correlation between proportions
plot(I(Bush/votes)~I(Kerry/votes), data=election)
## Variances without replacement
## Horvitz-Thompson type
dpps_br <- svydesign(id=~1, fpc=~p, data=election_pps, pps="brewer")
dpps_ov <- svydesign(id=~1, fpc=~p, data=election_pps, pps="overton")
dpps_hr <- svydesign(id=~1, fpc=~p, data=election_pps,
    pps=HR(sum(election$p^2)/40))
dpps_hr1 <- svydesign(id=~1, fpc=~p, data=election_pps, pps=HR())
dpps_ht <- svydesign(id=~1, fpc=~p, data=election_pps,
    pps=ppsmat(election_jointprob))
## Yates-Grundy type
dpps_yg <- svydesign(id=~1, fpc=~p, data=election_pps,
    pps=ppsmat(election_jointprob), variance="YG")
dpps_hryg <- svydesign(id=~1, fpc=~p, data=election_pps,
    pps=HR(sum(election$p^2)/40), variance="YG")
## The with-replacement approximation
dppswr <- svydesign(id=~1, probs=~p, data=election_pps)
svytotal(~Bush+Kerry+Nader, dpps_ht)
svytotal(~Bush+Kerry+Nader, dpps_yg)
svytotal(~Bush+Kerry+Nader, dpps_hr)
svytotal(~Bush+Kerry+Nader, dpps_hryg)
svytotal(~Bush+Kerry+Nader, dpps_hr1)
svytotal(~Bush+Kerry+Nader, dpps_br)
svytotal(~Bush+Kerry+Nader, dpps_ov)
svytotal(~Bush+Kerry+Nader, dppswr)
Creates or adjusts a two-phase survey design object using a logistic regression model for second-phase sampling probability. This function should be particularly useful in reweighting to account for missing data.
estWeights(data, formula, ...)
## S3 method for class 'twophase'
estWeights(data, formula=NULL, working.model=NULL, ...)
## S3 method for class 'data.frame'
estWeights(data, formula=NULL, working.model=NULL,
    subset=NULL, strata=NULL, ...)
data: twophase design object or data frame
formula: Predictors for estimating weights
working.model: Model fitted to complete (ie phase 1) data
subset: Subset of data frame with complete data (ie phase 1). If NULL, the observations with no missing data are used.
strata: Stratification (if any) of phase 2 sampling
...: for future expansion
If data
is a data frame, estWeights
first creates a
two-phase design object. The strata
argument is used only to
compute finite population corrections; the same variables must be
included in formula
to compute stratified sampling probabilities.
With a two-phase design object, estWeights
estimates the sampling
probabilities using logistic regression as described by Robins et al
(1994) and adds information to the object to enable correct sandwich
standard errors to be computed.
An alternative to specifying formula
is to specify
working.model
. The estimating functions from this model will be
used as predictors of the sampling probabilities, which will increase
efficiency to the extent that the working model and the model of
interest estimate the same parameters (Kulich & Lin 2004).
The effect on a two-phase design object is very similar to
calibrate
, and is identical when formula
specifies a saturated model.
A two-phase survey design object.
Breslow NE, Lumley T, Ballantyne CM, Chambless LE, Kulich M. (2009) Using the Whole Cohort in the Analysis of Case-Cohort Data. Am J Epidemiol. 2009 Jun 1;169(11):1398-405.
Robins JM, Rotnitzky A, Zhao LP. (1994) Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89, 846-866.
Kulich M, Lin DY (2004). Improving the Efficiency of Relative-Risk Estimation in Case-Cohort Studies. Journal of the American Statistical Association, Vol. 99, pp.832-844
Lumley T, Shaw PA, Dai JY (2011) "Connections between survey calibration estimators and semiparametric models for incomplete data" International Statistical Review. 79:200-220. (with discussion 79:221-232)
postStratify, calibrate, twophase
data(airquality)

## ignoring missingness, using model-based standard error
summary(lm(log(Ozone)~Temp+Wind, data=airquality))

## Without covariates to predict missingness we get
## same point estimates, but different (sandwich) standard errors
daq <- estWeights(airquality, formula=~1, subset=~I(!is.na(Ozone)))
summary(svyglm(log(Ozone)~Temp+Wind, design=daq))

## Reweighting based on weather, month
d2aq <- estWeights(airquality, formula=~Temp+Wind+Month,
                   subset=~I(!is.na(Ozone)))
summary(svyglm(log(Ozone)~Temp+Wind, design=d2aq))
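The example above supplies formula; the following is only a minimal sketch of the working.model alternative described in the Details, with a hypothetical working model (a quasi-Gaussian regression for temperature fitted to the complete phase-1 records), not an example from the package:

## Sketch: a hypothetical working model fitted to the complete (phase 1)
## data; its estimating functions are used to predict sampling probability
wm <- glm(Temp ~ Wind + Month, data = airquality)
daq_wm <- estWeights(airquality, working.model = wm,
                     subset = ~I(!is.na(Ozone)))
summary(svyglm(log(Ozone) ~ Temp + Wind, design = daq_wm))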
The fpc
data frame has 8 rows and 6 columns. It is artificial
data to illustrate survey sampling estimators.
data(fpc)
This data frame contains the following columns (names as used in the examples below):
stratid
Stratum ids
psuid
Sampling unit ids
weight
Sampling weights
nh
number sampled per stratum
Nh
population size per stratum
x
data
https://www.stata-press.com/data/r7/fpc.dta
data(fpc)
fpc

withoutfpc <- svydesign(weights=~weight, ids=~psuid, strata=~stratid,
                        variables=~x, data=fpc, nest=TRUE)
withoutfpc
svymean(~x, withoutfpc)

withfpc <- svydesign(weights=~weight, ids=~psuid, strata=~stratid,
                     fpc=~Nh, variables=~x, data=fpc, nest=TRUE)
withfpc
svymean(~x, withfpc)

## Other equivalent forms
withfpc <- svydesign(prob=~I(1/weight), ids=~psuid, strata=~stratid,
                     fpc=~Nh, variables=~x, data=fpc, nest=TRUE)
svymean(~x, withfpc)

withfpc <- svydesign(weights=~weight, ids=~psuid, strata=~stratid,
                     fpc=~I(nh/Nh), variables=~x, data=fpc, nest=TRUE)
svymean(~x, withfpc)

withfpc <- svydesign(weights=~weight, ids=~interaction(stratid, psuid),
                     strata=~stratid, fpc=~I(nh/Nh), variables=~x, data=fpc)
svymean(~x, withfpc)

withfpc <- svydesign(ids=~psuid, strata=~stratid, fpc=~Nh,
                     variables=~x, data=fpc, nest=TRUE)
svymean(~x, withfpc)

withfpc <- svydesign(ids=~psuid, strata=~stratid, fpc=~I(nh/Nh),
                     variables=~x, data=fpc, nest=TRUE)
svymean(~x, withfpc)
Reformat the output of survey computations to a table.
## S3 method for class 'svystat'
ftable(x, rownames, ...)
## S3 method for class 'svrepstat'
ftable(x, rownames, ...)
## S3 method for class 'svyby'
ftable(x, ...)
x |
Output of functions such as |
rownames |
List of vectors of strings giving dimension names for the resulting table (see examples) |
... |
Arguments for future expansion |
An object of class "ftable"
data(api)
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)

a <- svymean(~interaction(stype, comp.imp), design=dclus1)
b <- ftable(a, rownames=list(stype=c("E","H","M"), comp.imp=c("No","Yes")))
b

a <- svymean(~interaction(stype, comp.imp), design=dclus1, deff=TRUE)
b <- ftable(a, rownames=list(stype=c("E","H","M"), comp.imp=c("No","Yes")))
round(100*b, 1)

rclus1 <- as.svrepdesign(dclus1)
a <- svytotal(~interaction(stype, comp.imp), design=rclus1)
b <- ftable(a, rownames=list(stype=c("E","H","M"), comp.imp=c("No","Yes")))
b
round(b)

a <- svyby(~api99+api00, ~stype+sch.wide, rclus1, svymean, keep.var=TRUE)
ftable(a)
print(ftable(a), digits=2)

b <- svyby(~api99+api00, ~stype+sch.wide, rclus1, svymean, keep.var=TRUE, deff=TRUE)
print(ftable(b), digits=2)

d <- svyby(~api99+api00, ~stype+sch.wide, rclus1, svymean, keep.var=TRUE,
           vartype=c("se","cvpct"))
round(ftable(d), 1)
Returns a Hadamard matrix of dimension larger than the argument.
hadamard(n)
n |
lower bound for size |
For most n the matrix comes from paley. One size is taken from Plackett and Burman (1946), and the 28x28 matrix comes from Sloane's library of Hadamard matrices. Matrices of every dimension that is a multiple of 4 are thought to exist, but this function does not know about all of them, so it will sometimes return a matrix larger than necessary; the excess is at most 4 for smaller n and at most 5% for larger n.
A Hadamard matrix
Strictly speaking, a Hadamard matrix has entries +1 and -1 rather than 1 and 0, so 2*hadamard(n)-1 is a Hadamard matrix.
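As a quick sketch, the defining property (H %*% t(H) equals n times the identity for an n x n Hadamard matrix H) can be checked directly:

H <- 2*hadamard(7) - 1                      # 8 x 8 matrix with entries +1/-1
all(H %*% t(H) == nrow(H)*diag(nrow(H)))    # TRUE for a Hadamard matrix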
Sloane NJA. A Library of Hadamard Matrices http://neilsloane.com/hadamard/
Plackett RL, Burman JP. (1946) The Design of Optimum Multifactorial Experiments Biometrika, Vol. 33, No. 4 pp. 305-325
Cameron PJ (2005) Hadamard Matrices http://designtheory.org/library/encyc/topics/had.pdf. In: The Encyclopedia of Design Theory http://designtheory.org/library/encyc/
par(mfrow=c(2,2))
## Sylvester-type
image(hadamard(63), main=quote("Sylvester: "*64==2^6))
## Paley-type
image(hadamard(59), main=quote("Paley: "*60==59+1))
## from NJ Sloane's library
image(hadamard(27), main=quote("Stored: "*28))
## For n=90 we get 96 rather than the minimum possible size, 92.
image(hadamard(90), main=quote("Constructed: "*96==2^3%*%(11+1)))

par(mfrow=c(1,1))
plot(2:150, sapply(2:150, function(i) ncol(hadamard(i))), type="S",
     ylab="Matrix size", xlab="n", xlim=c(1,150), ylim=c(1,150))
abline(0, 1, lty=3)
lines(2:150, 2:150-(2:150 %% 4)+4, col="purple", type="S", lty=2)
legend(c(x=10, y=140), legend=c("Actual size","Minimum possible size"),
       col=c("black","purple"), bty="n", lty=c(1,2))
The hospital
data frame has 15 rows and 5 columns.
data(hospital)
This data frame contains the following columns (names as used in the example below):
hospno
Hospital id
oblevel
level of obstetric care
weighta
Weights, as given by the original reference
tothosp
total hospitalisations
births
births
weightats
Weights, as given in the source
Previously at http://www.ats.ucla.edu/stat/books/sop/hospsamp.dta
Levy and Lemeshow. "Sampling of Populations" (3rd edition). Wiley.
data(hospital)
hospdes <- svydesign(strata=~oblevel, id=~hospno, weights=~weighta,
                     fpc=~tothosp, data=hospital)
hosprep <- as.svrepdesign(hospdes)

svytotal(~births, design=hospdes)
svytotal(~births, design=hosprep)
The Horvitz-Thompson estimator and the Hartley-Rao approximation require information in addition to the sampling probabilities for sampled individuals. These functions allow this information to be supplied.
HR(psum=NULL, strata=NULL)
ppsmat(jointprob, tolerance=1e-04)
ppscov(probcov, weighted=FALSE)
psum |
The sum of squared sampling probabilities for the population, divided by the sample size, as a single number or as a vector for stratified sampling |
strata |
Stratum labels, of the same length as |
jointprob |
Matrix of pairwise sampling probabilities for the sampled individuals |
tolerance |
Tolerance for deciding that the covariance of sampling indicators is zero |
probcov |
Covariance of the sampling indicators (often written 'Delta'), or weighted covariance if |
weighted |
If |
An object of class HR, ppsmat, ppsdelta, or ppsdcheck, suitable for supplying as the pps argument to svydesign.
election for examples of PPS designs
HR(0.1)
Creates a calibration metric for use in calibrate. The function F is the link function described in section 2 of Deville et al. To create a new calibration metric, specify F - 1 (the Fm1 argument) and its derivative (dF). The package provides cal.linear, cal.raking, and cal.logit, which are standard, and cal.sinh from the CALMAR2 macro, for which F is the derivative of the inverse hyperbolic sine.
make.calfun(Fm1, dF, name)
Fm1 |
Function |
dF |
Derivative of |
name |
Character string to use as name |
An object of class "calfun"
Deville J-C, Sarndal C-E, Sautory O (1993) Generalized Raking Procedures in Survey Sampling. JASA 88:1013-1020
Deville J-C, Sarndal C-E (1992) Calibration Estimators in Survey Sampling. JASA 87: 376-382
str(cal.linear)
cal.linear$Fm1
cal.linear$dF

hellinger <- make.calfun(Fm1=function(u, bounds) ((1-u/2)^-2)-1,
                         dF=function(u, bounds) (1-u/2)^-3,
                         name="hellinger distance")
hellinger

data(api)
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
svymean(~api00, calibrate(dclus1, ~api99, pop=c(6194, 3914069), calfun=hellinger))
svymean(~api00, calibrate(dclus1, ~api99, pop=c(6194, 3914069), calfun=cal.linear))
svymean(~api00, calibrate(dclus1, ~api99, pop=c(6194, 3914069), calfun=cal.raking))
Reweights the design (using calibrate
) so that the adjustment variables are uncorrelated
with the variables in the model, and then performs predictions by
calling predict
. When the adjustment model is saturated this is
equivalent to direct standardization on the adjustment variables.
The svycoxph
and svykmlist
methods return survival curves.
marginpred(model, adjustfor, predictat, ...)
## S3 method for class 'svycoxph'
marginpred(model, adjustfor, predictat, se=FALSE, ...)
## S3 method for class 'svykmlist'
marginpred(model, adjustfor, predictat, se=FALSE, ...)
## S3 method for class 'svyglm'
marginpred(model, adjustfor, predictat, ...)
model |
A regression model object of a class that has a |
adjustfor |
Model formula specifying adjustment variables, which must be in the design object of the model |
predictat |
A data frame giving values of the variables in |
se |
Estimate standard errors for the survival curve (uses a lot of memory if the sample size is large) |
... |
Extra arguments, passed to the |
svypredmeans
for the method of Graubard and Korn implemented in SUDAAN.
## generate data with apparent group effect from confounding
set.seed(42)
df <- data.frame(x=rnorm(100))
df$time <- rexp(100)*exp(df$x-1)
df$status <- 1
df$group <- (df$x+rnorm(100)) > 0
des <- svydesign(id=~1, data=df)
newdf <- data.frame(group=c(FALSE,TRUE), x=c(0,0))

## Cox model
m0 <- svycoxph(Surv(time,status)~group, design=des)
m1 <- svycoxph(Surv(time,status)~group+x, design=des)
## conditional predictions, unadjusted and adjusted
cpred0 <- predict(m0, type="curve", newdata=newdf, se=TRUE)
cpred1 <- predict(m1, type="curve", newdata=newdf, se=TRUE)
## adjusted marginal prediction
mpred <- marginpred(m0, adjustfor=~x, predictat=newdf, se=TRUE)

plot(cpred0)
lines(cpred1[[1]], col="red")
lines(cpred1[[2]], col="red")
lines(mpred[[1]], col="blue")
lines(mpred[[2]], col="blue")

## Kaplan--Meier
s2 <- svykm(Surv(time, status>0)~group, design=des)
p2 <- marginpred(s2, adjustfor=~x, predictat=newdf, se=TRUE)
plot(s2)
lines(p2[[1]], col="green")
lines(p2[[2]], col="green")

## logistic regression
logisticm <- svyglm(group~time, family=quasibinomial, design=des)
newdf$time <- c(0.1, 0.8)
logisticpred <- marginpred(logisticm, adjustfor=~x, predictat=newdf)
The MU284 population comes from Sarndal et al; the complete data were previously available from StatLib. These data are a two-stage sample from the population, analyzed on page 143 of the book.
data(mu284)
A data frame with 15 observations on the following 5 variables.
id1
identifier for PSU
n1
number of PSUs in population
id2
identifier for second-stage unit
y1
variable to be analysed
n2
number of second-stage units in this PSU
Carl Erik Sarndal, Bengt Swensson, Jan Wretman. (1991) "Model Assisted Survey Sampling" Springer.
(downloaded from StatLib, which is no longer active)
data(mu284)
(dmu284 <- svydesign(id=~id1+id2, fpc=~n1+n2, data=mu284))
(ytotal <- svytotal(~y1, dmu284))
vcov(ytotal)
Given a list of samples from K different sampling frames and information about which observations are in which frame, constructs an object representing the whole multi-frame sample. If a unit is in the overlap of multiple frames in the population, it is effectively split into multiple separate units, and so its weight is split if it is sampled. To optimise the split of frame weights, see reweight.
multiframe(designs, overlaps, estimator = c("constant", "expected"), theta = NULL)
designs |
List of survey design objects |
overlaps |
list of matrices. Each matrix has K columns indicating whether the
observation is in frames 1-K. For the 'constant'-type estimator, this is
binary. For the |
estimator |
|
theta |
Scale factors adding to 1 for splitting the overlap between frames |
It is not necessary that the frame samples contain exactly the same variables or that they are in the same order, although only variables present in all the samples can be used. It is important that factor variables existing across more than one frame sample have the same factor levels in all the samples.
All these estimators assume sampling is independent between frames, and that any observation sampled more than once is present in the dataset each time it is sampled.
Object of class multiframe
Bankier, M. D. (1986) Estimators Based on Several Stratified Samples With Applications to Multiple Frame Surveys. Journal of the American Statistical Association, Vol. 81, 1074 - 1079.
Hartley, H. O. (1962) Multiple Frames Surveys. Proceedings of the American Statistical Association, Social Statistics Sections, 203 - 206. Hartley, H. O. (1974) Multiple frame methodology and selected applications. Sankhya C, Vol. 36, 99 - 118.
Kalton, G. and Anderson, D. W. (1986) Sampling Rare Populations. Journal of the Royal Statistical Society, Ser. A, Vol. 149, 65 - 82.
For a simple introduction: Metcalf P and Scott AJ (2009) Using multiple frames in health surveys. Stat Med 28:1512-1523
For general reference: Lohr SL, Rao JNK. Inference from dual frame surveys. Journal of the American Statistical Association 2000; 94:271-280.
Lohr SL, Rao JNK. Estimation in multiple frame surveys. Journal of the American Statistical Association 2006; 101:1019-1030.
data(phoneframes)
A_in_frames <- cbind(1, DatA$Domain=="ab")
B_in_frames <- cbind(DatB$Domain=="ba", 1)

Bdes_pps <- svydesign(id=~1, fpc=~ProbB, data=DatB, pps=ppsmat(PiklB))
Ades_pps <- svydesign(id=~1, fpc=~ProbA, data=DatA, pps=ppsmat(PiklA))

## optimal constant (Hartley) weighting
mf_pps <- multiframe(list(Ades_pps, Bdes_pps), list(A_in_frames, B_in_frames), theta=0.74)
svytotal(~Lei, mf_pps)
svymean(~Lei, mf_pps)
svyby(~Lei, ~Size, svymean, design=mf_pps)
svytable(~Size+I(Lei>20), mf_pps, round=TRUE)
These objects represent designs with arbitrarily many nested phases of sampling, allowing estimation and calibration/raking at each phase.
multiphase(ids, subset, strata, probs, data, fpc = NULL, check.variable.phase=TRUE)
ids |
List of as many model formulas as phases describing ids for each phase. Each formula may indicate multistage sampling |
subset |
list of model formulas for each phase except the first, specifying a logical vector of which observations from the previous phase are included |
strata |
List of as many model formulas as phases describing strata for each phase. Each formula may
indicate multistage sampling, or |
probs |
List of as many model formulas or |
data |
data frame of data |
fpc |
Finite population correction for the first phase, if needed |
check.variable.phase |
Work out which phase each variable is
observed in by looking at missing value patterns. You may want
|
Variance calculation uses a decomposition with sampling contributions at each stage, which are returned as the phases attribute of a variance-covariance matrix. The computations broadly follow the description for two-phase sampling in chapter 9 of Sarndal et al (1991); there is more detail in the vignette.
Object of class multiphase
There are currently methods for svytotal, svymean, svyglm, and svyvar.
Sarndal, Swensson, and Wretman (1991) "Model Assisted Survey Sampling" (Chapter 9)
twophase
for older implementations of two-phase sampling
vignette("multiphase")
for computational details
data(nwtco)
dcchs <- twophase(id=list(~seqno, ~seqno), strata=list(NULL, ~rel),
                  subset=~I(in.subcohort | rel), data=nwtco)
mcchs <- multiphase(id=list(~seqno, ~seqno), strata=list(NULL, ~rel),
                    subset=list(~I(in.subcohort | rel)),
                    probs=list(~1, NULL), data=nwtco)
dcchs
mcchs

svymean(~edrel, dcchs)
svymean(~edrel, mcchs)

summary(svyglm(edrel~rel+histol+stage, design=dcchs))
summary(svyglm(edrel~rel+histol+stage, design=mcchs))

m <- calibrate(mcchs, ~factor(stage)+rel, phase=2, calfun="raking")
vcov(svytotal(~factor(stage), m))
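A minimal sketch of inspecting the per-phase variance decomposition described in the Details (assuming the mcchs object from the example above; the attribute name is taken from the Details):

v <- vcov(svymean(~edrel, mcchs))
attr(v, "phases")   # contribution of each sampling phase to the variance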
These data are in a paper by JNK Rao and colleagues, on score tests
for complex survey data. External information (not further specified) suggests
the functional form for the Age
variable.
data("myco")
data("myco")
A data frame with 516 observations on the following 6 variables.
Age
Age in years at the midpoint of six age strata
Scar
Presence of a BCG vaccination scar
n
Sampled number of cases (and thus controls) in the age stratum
Ncontrol
Number of non-cases in the population
wt
Sampling weight
leprosy
case status 0/1
The data are a simulated stratified case-control study drawn from a population study conducted in a region of Malawi (Clayton and Hills, 1993, Table 18.1). The goal was to examine whether BCG vaccination against tuberculosis protects against leprosy (the causative agents are both species of Mycobacterium). Rao et al have a typographical error: the number of non-cases in the population in the 25-30 age stratum is given as 4981, but 5981 matches both the computational output and the data as given by Clayton and Hills.
Rao JNK, Scott AJ, Skinner C (1998). Quasi-score tests with survey data. Statistica Sinica 8(4), 1059-1070.
Clayton, D., & Hills, M. (1993). Statistical Models in Epidemiology. OUP
data(myco)
dmyco <- svydesign(id=~1, strata=~interaction(Age, leprosy),
                   weights=~wt, data=myco)

m_full <- svyglm(leprosy~I((Age+7.5)^-2)+Scar, family=quasibinomial, design=dmyco)
m_age <- svyglm(leprosy~I((Age+7.5)^-2), family=quasibinomial, design=dmyco)
anova(m_full, m_age)

## unweighted model does not match m_full
glm(leprosy~I((Age+7.5)^-2)+Scar, family=binomial, data=myco)
Estimates quantiles and confidence intervals for them. This function was completely re-written for version 4.1 of the survey package, and has a wider range of ways to define the quantile. See the vignette for a list of them.
svyquantile(x, design, quantiles, ...)

## S3 method for class 'survey.design'
svyquantile(x, design, quantiles, alpha=0.05,
            interval.type=c("mean","beta","xlogit","asin","score"),
            na.rm=FALSE, ci=TRUE, se=ci,
            qrule=c("math","school","shahvaish","hf1","hf2","hf3",
                    "hf4","hf5","hf6","hf7","hf8","hf9"),
            df=NULL, ...)

## S3 method for class 'svyrep.design'
svyquantile(x, design, quantiles, alpha=0.05,
            interval.type=c("mean","beta","xlogit","asin","quantile"),
            na.rm=FALSE, ci=TRUE, se=ci,
            qrule=c("math","school","shahvaish","hf1","hf2","hf3",
                    "hf4","hf5","hf6","hf7","hf8","hf9"),
            df=NULL, return.replicates=FALSE, ...)
x |
A one-sided formula describing variables to be used |
design |
Design object |
quantiles |
Numeric vector specifying which quantiles are requested |
alpha |
Specified confidence interval coverage |
interval.type |
See Details below |
na.rm |
Remove missing values? |
ci , se
|
Return an estimated confidence interval and standard error? |
qrule |
Rule for defining the quantiles: either a character string specifying one of the built-in rules, or a function |
df |
Degrees of freedom for confidence interval estimation: |
return.replicates |
Return replicate estimates of the quantile (only for |
... |
For future expansion |
The p
th quantile is defined as the value where the estimated cumulative
distribution function is equal to p
. As with quantiles in
unweighted data, this definition only pins down the quantile to an
interval between two observations, and a rule is needed to interpolate.
The default is the mathematical definition, the lower end of the
quantile interval; qrule="school"
uses the midpoint of the
quantile interval; "hf1"
to "hf9"
are weighted analogues of
type=1
to 9
in quantile
. See the vignette
"Quantile rules" for details and for how to write your own.
By default, confidence intervals are estimated using Woodruff's (1952) method,
which involves computing the quantile, estimating a confidence interval
for the proportion of observations below the quantile, and then
transforming that interval using the estimated CDF. In that context,
the interval.type
argument specifies how the confidence interval
for the proportion is computed, matching svyciprop
. In
contrast to oldsvyquantile
, NaN
is returned if a confidence
interval endpoint on the probability scale falls outside [0,1]
.
There are two exceptions. For svydesign
objects,
interval.type="score"
asks for the Francisco & Fuller confidence
interval based on inverting a score test. According to Dorfman &
Valliant, this interval has inferior performance to the "beta"
and "logit"
intervals; it is provided for compatibility.
For replicate-weight designs, interval.type="quantile"
asks for an
interval based directly on the replicates of the quantile. This interval
is not valid for jackknife-type replicates, though it should perform well for
bootstrap-type replicates, BRR, and SDR.
The df
argument specifies degrees of freedom for a t-distribution
approximation to distributions of means. The default is the design degrees of
freedom. Specify df=Inf
to use a Normal distribution (eg, for compatibility).
When the standard error is requested, it is estimated by dividing the
confidence interval length by the number of standard errors in a t
confidence interval with the specified alpha
. For example, with
alpha=0.05
and df=Inf
the standard error is estimated as the confidence
interval length divided by 2*1.96
.
An object of class "newsvyquantile"
, except that with a
replicate-weights design and interval.type="quantile"
and
return.replicates=TRUE
it's an object of class "svrepstat"
Dorfman A, Valliant R (1993) Quantile variance estimators in complex surveys. Proceedings of the ASA Survey Research Methods Section. 1993: 866-871
Francisco CA, Fuller WA (1986) Estimation of the distribution function with a complex survey. Technical Report, Iowa State University.
Hyndman, R. J. and Fan, Y. (1996) Sample quantiles in statistical packages, The American Statistician 50, 361-365.
Shah BV, Vaish AK (2006) Confidence Intervals for Quantile Estimation from Complex Survey Data. Proceedings of the Section on Survey Research Methods.
Woodruff RS (1952) Confidence intervals for medians and other position measures. JASA 57, 622-627.
vignette("qrule", package = "survey")
oldsvyquantile
quantile
data(api)
## population
quantile(apipop$api00, c(.25,.5,.75))

## one-stage cluster sample
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
rclus1 <- as.svrepdesign(dclus1)
bclus1 <- as.svrepdesign(dclus1, type="boot")

svyquantile(~api00, dclus1, c(.25,.5,.75))
svyquantile(~api00, dclus1, c(.25,.5,.75), interval.type="beta")

svyquantile(~api00, rclus1, c(.25,.5,.75))
svyquantile(~api00, rclus1, c(.25,.5,.75), interval.type="quantile")
svyquantile(~api00, bclus1, c(.25,.5,.75), interval.type="quantile")

svyquantile(~api00+ell, dclus1, c(.25,.5,.75), qrule="math")
svyquantile(~api00+ell, dclus1, c(.25,.5,.75), qrule="school")
svyquantile(~api00+ell, dclus1, c(.25,.5,.75), qrule="hf8")
Data extracted from NHANES 2009-2010 on high cholesterol.
data(nhanes)
A data frame with 8591 observations on the following 7 variables.
SDMVPSU
Primary sampling units
SDMVSTRA
Sampling strata
WTMEC2YR
Sampling weights
HI_CHOL
Numeric vector: 1 for total cholesterol over 240 mg/dl, 0 for under 240 mg/dl
race
1=Hispanic, 2=non-Hispanic white, 3=non-Hispanic black, 4=other
agecat
Age groups: (0,19], (19,39], (39,59], (59,Inf]
RIAGENDR
Gender: 1=male, 2=female
Previously at https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=laboratory&CycleBeginYear=2009
data(nhanes)
design <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR,
                    nest=TRUE, data=nhanes)
design
Functions to simplify the construction of non-response weights by combining strata with small numbers or large weights.
nonresponse(sample.weights, sample.counts, population)
sparseCells(object, count=0, totalweight=Inf, nrweight=1.5)
neighbours(index, object)
joinCells(object, a, ...)
## S3 method for class 'nonresponse'
weights(object, ...)
sample.weights |
table of sampling weight by stratifying variables |
sample.counts |
table of sample counts by stratifying variables |
population |
table of population size by stratifying variables |
object |
object of class |
count |
Cells with fewer sampled units than this are "sparse" |
nrweight |
Cells with higher non-response weight than this are "sparse" |
totalweight |
Cells with average sampling weight times non-response weight higher than this are "sparse" |
index |
Number of a cell whose neighbours are to be found |
a , ...
|
Cells to join |
When a stratified survey is conducted with imperfect response it is desirable to rescale the sampling weights to reflect the nonresponse. If some strata have small sample size, high non-response, or already had high sampling weights it may be desirable to get less variable non-response weights by averaging non-response across strata. Suitable strata to collapse may be similar on the stratifying variables and/or on the level of non-response.
nonresponse()
combines stratified tables of population size,
sample size, and sample weight into an object. sparseCells
identifies cells that may need combining. neighbours
describes the
cells adjacent to a specified cell, and joinCells
collapses
the specified cells. When the collapsing is complete, use
weights()
to extract the nonresponse weights.
nonresponse and joinCells return objects of class "nonresponse"; neighbours and sparseCells return objects of class "nonresponseSubset".
data(api)
## pretend the sampling was stratified on three variables
poptable <- xtabs(~sch.wide+comp.imp+stype, data=apipop)
sample.count <- xtabs(~sch.wide+comp.imp+stype, data=apiclus1)
sample.weight <- xtabs(pw~sch.wide+comp.imp+stype, data=apiclus1)

## create a nonresponse object
nr <- nonresponse(sample.weight, sample.count, poptable)
## sparse cells
sparseCells(nr)
## Look at neighbours
neighbours(3, nr)
neighbours(11, nr)

## Collapse some contiguous cells
nr1 <- joinCells(nr, 3, 5, 7)
## sparse cells now
sparseCells(nr1)
nr2 <- joinCells(nr1, 3, 11, 8)
nr2
## one relatively sparse cell
sparseCells(nr2)
## but nothing suitable to join it to
neighbours(3, nr2)
## extract the weights
weights(nr2)
Compute quantiles for data from complex surveys. oldsvyquantile
is the version of the function from before version 4.1 of the package,
available for backwards compatibility. See svyquantile
for the current version.
## S3 method for class 'survey.design'
oldsvyquantile(x, design, quantiles, alpha=0.05, ci=FALSE,
               method="linear", f=1,
               interval.type=c("Wald","score","betaWald"),
               na.rm=FALSE, se=ci, ties=c("discrete","rounded"),
               df=NULL, ...)
## S3 method for class 'svyrep.design'
oldsvyquantile(x, design, quantiles, method="linear",
               interval.type=c("probability","quantile"), f=1,
               return.replicates=FALSE, ties=c("discrete","rounded"),
               na.rm=FALSE, alpha=0.05, df=NULL, ...)
x |
A formula, vector or matrix |
design |
|
quantiles |
Quantiles to estimate |
method |
see |
f |
see |
ci |
Compute a confidence interval? (relatively slow; needed for |
se |
Compute standard errors from the confidence interval length? |
alpha |
Level for confidence interval |
interval.type |
See Details below |
ties |
See Details below |
df |
Degrees of freedom for a t-distribution. |
return.replicates |
Return the replicate means? |
na.rm |
Remove |
... |
arguments for future expansion |
The definition of the CDF and thus of the quantiles is ambiguous in
the presence of ties. With ties="discrete"
the data are
treated as genuinely discrete, so the CDF has vertical steps at tied
observations. With ties="rounded"
all the weights for tied
observations are summed and the CDF interpolates linearly between
distinct observed values, and so is a continuous function. Combining
interval.type="betaWald"
and ties="discrete"
is (close
to) the proposal of Shah and Vaish (2006) used in some versions of SUDAAN.
Interval estimation for quantiles is complicated, because the
influence function is not continuous. Linearisation cannot be used
directly, and computing the variance of replicates is valid only for
some designs (eg BRR, but not jackknife). The interval.type
option controls how the intervals are computed.
For survey.design
objects the default is
interval.type="Wald"
. A 95% Wald confidence interval is
constructed for the proportion below the estimated quantile. The
inverse of the estimated CDF is used to map this to a confidence
interval for the quantile. This is the method of Woodruff
(1952). For "betaWald"
the same procedure is used, but the
confidence interval for the proportion is computed using the exact
binomial cdf with an effective sample size proposed by Korn &
Graubard (1998).
If interval.type="score"
we use a method described by Binder
(1991) and due originally to Francisco and Fuller (1986), which
corresponds to inverting a robust score test. At the upper and lower
limits of the confidence interval, a test of the null hypothesis that
the cumulative distribution function is equal to the target quantile
just rejects. This was the default before version 2.9. It is much
slower than "Wald"
, and Dorfman & Valliant (1993) suggest it is
not any more accurate.
Standard errors are computed from these confidence intervals by
dividing the confidence interval length by 2*qnorm(alpha/2)
.
For replicate-weight designs, ordinary replication-based standard errors
are valid for BRR and Fay's method, and for some bootstrap-based
designs, but not for jackknife-based designs.
interval.type="quantile"
gives these replication-based
standard errors. The default, interval.type="probability"
computes confidence on the probability scale and then transforms
back to quantiles, the equivalent of interval.type="Wald"
for
survey.design
objects (with alpha=0.05
).
There is a confint
method for svyquantile
objects; it
simply extracts the pre-computed confidence interval.
returns a list whose first component is the quantiles and second
component is the confidence intervals. For replicate weight designs,
returns an object of class svyrepstat
.
Thomas Lumley
Binder DA (1991) Use of estimating functions for interval estimation from complex surveys. Proceedings of the ASA Survey Research Methods Section 1991: 34-42
Dorfman A, Valliant R (1993) Quantile variance estimators in complex surveys. Proceedings of the ASA Survey Research Methods Section. 1993: 866-871
Korn EL, Graubard BI. (1998) Confidence Intervals For Proportions With Small Expected Number of Positive Counts Estimated From Survey Data. Survey Methodology 23:193-201.
Francisco CA, Fuller WA (1986) Estimation of the distribution function with a complex survey. Technical Report, Iowa State University.
Shao J, Tu D (1995) The Jackknife and Bootstrap. Springer.
Shah BV, Vaish AK (2006) Confidence Intervals for Quantile Estimation from Complex Survey Data. Proceedings of the Section on Survey Research Methods.
Woodruff RS (1952) Confidence intervals for medians and other position measures. JASA 57, 622-627.
svykm
for quantiles of survival curves
svyciprop
for confidence intervals on proportions.
data(api)
## population
quantile(apipop$api00, c(.25,.5,.75))

## one-stage cluster sample
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
oldsvyquantile(~api00, dclus1, c(.25,.5,.75), ci=TRUE)
oldsvyquantile(~api00, dclus1, c(.25,.5,.75), ci=TRUE, interval.type="betaWald")
oldsvyquantile(~api00, dclus1, c(.25,.5,.75), ci=TRUE, df=NULL)

dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
(qapi <- oldsvyquantile(~api00, dclus1, c(.25,.5,.75), ci=TRUE, interval.type="score"))
SE(qapi)

# stratified sample
dstrat <- svydesign(id=~1, strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
oldsvyquantile(~api00, dstrat, c(.25,.5,.75), ci=TRUE)

# stratified sample, replicate weights
# interval="probability" is necessary for jackknife weights
rstrat <- as.svrepdesign(dstrat)
oldsvyquantile(~api00, rstrat, c(.25,.5,.75), interval.type="probability")

# BRR method
data(scd)
repweights <- 2*cbind(c(1,0,1,0,1,0), c(1,0,0,1,0,1),
                      c(0,1,1,0,0,1), c(0,1,0,1,1,0))
scdrep <- svrepdesign(data=scd, type="BRR", repweights=repweights)
oldsvyquantile(~arrests+alive, design=scdrep, quantile=0.5, interval.type="quantile")
oldsvyquantile(~arrests+alive, design=scdrep, quantile=0.5, interval.type="quantile", df=NULL)
A database-backed survey design object contains a connection to a
database. This connection will be broken if the object is saved and
reloaded, and the connection should ideally be closed with close
before quitting R (although it doesn't matter for SQLite
connections). The connection can be reopened with open
.
## S3 method for class 'DBIsvydesign'
open(con, ...)
## S3 method for class 'DBIsvydesign'
close(con, ...)
con |
Object of class |
... |
Other options, to be passed to |
The same survey design object with the connection opened or closed.
DBI package
## Not run:
library(RSQLite)
dbclus1 <- svydesign(id=~dnum, weights=~pw, fpc=~fpc,
                     data="apiclus1", dbtype="SQLite",
                     dbname=system.file("api.db", package="survey"))
dbclus1
close(dbclus1)
dbclus1
try(svymean(~api00, dbclus1))

dbclus1 <- open(dbclus1)
open(dbclus1)
svymean(~api00, dbclus1)
## End(Not run)
Computes a Hadamard matrix of dimension 2^k (p+1), where p is a prime and p+1 is a multiple of 4, using the Paley construction. Used by hadamard.
paley(n, nmax=2*n, prime=NULL, check=!is.null(prime))
is.hadamard(H, style=c("0/1","+-"), full.orthogonal.balance=TRUE)
n |
Minimum size for matrix |
nmax |
Maximum size for matrix. Ignored if |
prime |
Optional. A prime at least as large as
|
check |
Check that the resulting matrix is of Hadamard type |
H |
Matrix |
style |
|
full.orthogonal.balance |
Require full orthogonal balance? |
The Paley construction gives a Hadamard matrix of order p+1 if p is prime and p+1 is a multiple of 4. This is then expanded to order 2^k (p+1) using the Sylvester construction. paley knows primes up to 7919. The user can specify a prime with the prime argument, in which case a matrix of order 2^k (p+1) is constructed.
If check=TRUE
the code uses is.hadamard
to check that
the resulting matrix really is of Hadamard type, in the same way as in
the example below. As this test takes time, it is
preferable to just be sure that
prime
really is prime.
A Hadamard matrix including a row of 1s gives BRR designs where the average of the replicates for a linear statistic is exactly the full sample estimate. This property is called full orthogonal balance.
For paley
, a matrix of zeros and ones, or NULL
if no matrix smaller than
nmax
can be found.
For is.hadamard
, TRUE
if H
is a Hadamard matrix.
Cameron PJ (2005) Hadamard Matrices. In: The Encyclopedia of Design Theory
M<-paley(11) is.hadamard(M) ## internals of is.hadamard(M) H<-2*M-1 ## HH^T is diagonal for any Hadamard matrix H%*%t(H)
M<-paley(11) is.hadamard(M) ## internals of is.hadamard(M) H<-2*M-1 ## HH^T is diagonal for any Hadamard matrix H%*%t(H)
The distribution of a quadratic form in p standard Normal variables is a linear combination of p chi-squared distributions with 1df. When there is uncertainty about the variance, a reasonable model for the distribution is a linear combination of F distributions with the same denominator.
pchisqsum(x, df, a, lower.tail=TRUE,
          method=c("satterthwaite","integration","saddlepoint"))
pFsum(x, df, a, ddf=Inf, lower.tail=TRUE,
      method=c("saddlepoint","integration","satterthwaite"), ...)
x |
Observed values |
df |
Vector of degrees of freedom |
a |
Vector of coefficients |
ddf |
Denominator degrees of freedom |
lower.tail |
lower or upper tail? |
method |
See Details below |
... |
arguments to |
The "satterthwaite"
method uses Satterthwaite's
approximation, and this is also used as a fallback for the other
methods. The accuracy is usually good, but is more variable depending
on a
than the other methods and is anticonservative in the
right tail (eg for upper tail probabilities less than 10^-5
).
The Satterthwaite approximation requires all a>0
.
"integration"
requires the CompQuadForm
package. For
pchisqsum
it uses Farebrother's algorithm if all
a>0
. For pFsum
or when some a<0
it inverts the
characteristic function using the algorithm of Davies (1980).
These algorithms are highly accurate for the lower tail probability, but they obtain the upper tail probability by subtraction from 1 and so fail completely when the upper tail probability is comparable to machine epsilon or smaller.
If the CompQuadForm
package is not present, a warning is given
and the saddlepoint approximation is used.
"saddlepoint"
uses Kuonen's saddlepoint approximation. This
is moderately accurate even very far out in the upper tail or with some
a=0
and does not require any additional packages. The relative error
in the right tail is uniformly bounded for all x
and decreases as p
increases. This method is implemented in pure R and so is slower than
the "integration"
method.
The distribution in pFsum
is standardised so that a likelihood
ratio test can use the same x
value as in pchisqsum
.
That is, the linear combination of chi-squareds is multiplied by
ddf
and then divided by an independent chi-squared with
ddf
degrees of freedom.
Vector of cumulative probabilities
Chen, T., & Lumley T. (2019). Numerical evaluation of methods approximating the distribution of a large quadratic form in normal variables. Computational Statistics and Data Analysis, 139, 75-81.
Davies RB (1973). "Numerical inversion of a characteristic function" Biometrika 60:415-7
Davies RB (1980) "Algorithm AS 155: The Distribution of a Linear Combination of chi-squared Random Variables" Applied Statistics,Vol. 29, No. 3 (1980), pp. 323-333
P. Duchesne, P. Lafaye de Micheaux (2010) "Computing the distribution of quadratic forms: Further comparisons between the Liu-Tang-Zhang approximation and exact methods", Computational Statistics and Data Analysis, Volume 54, (2010), 858-862
Farebrother R.W. (1984) "Algorithm AS 204: The distribution of a Positive Linear Combination of chi-squared random variables". Applied Statistics Vol. 33, No. 3 (1984), p. 332-339
Kuonen D (1999) Saddlepoint Approximations for Distributions of Quadratic Forms in Normal Variables. Biometrika, Vol. 86, No. 4 (Dec., 1999), pp. 929-935
x <- 2.7*rnorm(1001)^2+rnorm(1001)^2+0.3*rnorm(1001)^2 x.thin<-sort(x)[1+(0:50)*20] p.invert<-pchisqsum(x.thin,df=c(1,1,1),a=c(2.7,1,.3),method="int" ,lower=FALSE) p.satt<-pchisqsum(x.thin,df=c(1,1,1),a=c(2.7,1,.3),method="satt",lower=FALSE) p.sadd<-pchisqsum(x.thin,df=c(1,1,1),a=c(2.7,1,.3),method="sad",lower=FALSE) plot(p.invert, p.satt,type="l",log="xy") abline(0,1,lty=2,col="purple") plot(p.invert, p.sadd,type="l",log="xy") abline(0,1,lty=2,col="purple") pchisqsum(20, df=c(1,1,1),a=c(2.7,1,.3), lower.tail=FALSE,method="sad") pFsum(20, df=c(1,1,1),a=c(2.7,1,.3), ddf=49,lower.tail=FALSE,method="sad") pFsum(20, df=c(1,1,1),a=c(2.7,1,.3), ddf=1000,lower.tail=FALSE,method="sad")
x <- 2.7*rnorm(1001)^2+rnorm(1001)^2+0.3*rnorm(1001)^2 x.thin<-sort(x)[1+(0:50)*20] p.invert<-pchisqsum(x.thin,df=c(1,1,1),a=c(2.7,1,.3),method="int" ,lower=FALSE) p.satt<-pchisqsum(x.thin,df=c(1,1,1),a=c(2.7,1,.3),method="satt",lower=FALSE) p.sadd<-pchisqsum(x.thin,df=c(1,1,1),a=c(2.7,1,.3),method="sad",lower=FALSE) plot(p.invert, p.satt,type="l",log="xy") abline(0,1,lty=2,col="purple") plot(p.invert, p.sadd,type="l",log="xy") abline(0,1,lty=2,col="purple") pchisqsum(20, df=c(1,1,1),a=c(2.7,1,.3), lower.tail=FALSE,method="sad") pFsum(20, df=c(1,1,1),a=c(2.7,1,.3), ddf=49,lower.tail=FALSE,method="sad") pFsum(20, df=c(1,1,1),a=c(2.7,1,.3), ddf=1000,lower.tail=FALSE,method="sad")
This dataset contains some variables regarding household expenses for a
sample of 105 households selected from a list of landline phones (frame
A) and a sample of 135 from a list of mobile phones (frame B) in a
particular city in a specific month. These data are taken from the
Frames2
package under the GPL-2 or GPL-3 licence.
data(phoneframes)
Domain
A factor indicating the domain each household belongs to. In sample A, possible values are "a" if the household belongs to domain a or "ab" if it belongs to the overlap domain; in sample B, the values are "b" or "ba"
Feed
Feeding expenses (in euros) at the household
Clo
Clothing expenses (in euros) at the household
Lei
Leisure expenses (in euros) at the household
Inc
Household income (in euros). Values for this variable are only available for households included in frame A. For households included in domain b, the value of this variable is missing
Tax
Household municipal taxes (in euros) paid. Values for this variable are only available for households included in frame A. For households included in domain b, the value of this variable is missing
M2
Square meters of the house. Values for this variable are only available for households included in frame B. For households included in domain a, the value of this variable is missing
Size
Household size. Values for this variable are only available for households included in frame B. For households included in domain a, the value of this variable is missing
ProbA
First order inclusion probability in frame A. This probability is 0 for households included in domain b.
ProbB
First order inclusion probability in frame B. This probability is 0 for households included in domain a.
Stratum
A numeric value indicating the stratum each household belongs to.
The frame A sample, of size 105, was drawn from the population of households with a landline phone using stratified random sampling, with the population divided into 6 strata. A proportion of these households also have a mobile phone. The frame B sample, of size 135, was drawn from the population of households with a mobile phone using simple random sampling without replacement; a proportion of these households also have a landline phone. The stratum population sizes and the frame totals for the auxiliary variables are given in the Frames2 package documentation.
PiklA
and PiklB
are matrices of pairwise sampling
probabilities for the two frames.
Original package: https://CRAN.R-project.org/package=Frames2
data(phoneframes)
A_in_frames <- cbind(1, DatA$Domain=="ab")
B_in_frames <- cbind(DatB$Domain=="ba", 1)

Bdes_pps <- svydesign(id=~1, fpc=~ProbB, data=DatB, pps=ppsmat(PiklB))
Ades_pps <- svydesign(id=~1, fpc=~ProbA, data=DatA, pps=ppsmat(PiklA))

## optimal constant (Hartley) weighting
mf_pps <- multiframe(list(Ades_pps, Bdes_pps), list(A_in_frames, B_in_frames), theta=0.74)
svytotal(~Lei, mf_pps)

Awts <- cbind(1/DatA$ProbA, ifelse(DatA$ProbB==0, 0, 1/DatA$ProbB))
Bwts <- cbind(ifelse(DatB$ProbA==0, 0, 1/DatB$ProbA), 1/DatB$ProbB)
## dividing by the expected number of selections (BKA or HH estimator)
mf_pps2 <- multiframe(list(Ades_pps, Bdes_pps), list(Awts, Bwts), estimator="expected")
svymean(~Lei, mf_pps2)

## Metcalf and Scott approximation
DatB$Stratum <- 10
DatB$Frame <- 2
DatA$Frame <- 1
Dat_both <- rbind(DatA, DatB)
frame_weights <- c(0.742, 1-0.742)
Dat_both$fweights <- with(Dat_both,
    ifelse(Frame==1,
           ifelse(Domain=="ab", frame_weights[1]*1/ProbA, 1/ProbA),
           ifelse(Domain=="ba", frame_weights[2]*1/ProbB, 1/ProbB)))
MSdesign <- svydesign(id=~1, strata=~Stratum, weights=~fweights, data=Dat_both)
svymean(~Lei, MSdesign)
Specify a design where units are sampled independently from the population, with known probabilities. This design is often used theoretically, but is rarely used in practice because the sample size is variable. This function calls ppscov to specify a sparse sampling covariance matrix.
poisson_sampling(p)
p |
Vector of sampling probabilities |
Object of class ppsdcheck
data(api)
apipop$prob<-with(apipop, 200*api00/sum(api00))
insample<-as.logical(rbinom(nrow(apipop), 1, apipop$prob))
apipois<-apipop[insample,]
despois<-svydesign(id=~1, prob=~prob, pps=poisson_sampling(apipois$prob), data=apipois)
svytotal(~api00, despois)
## SE formula
sqrt(sum((apipois$api00*weights(despois))^2*(1-apipois$prob)))
Post-stratification adjusts the sampling and replicate weights so that
the joint distribution of a set of post-stratifying variables matches
the known population joint distribution. Use rake when the full joint distribution is not available.
postStratify(design, strata, population, partial = FALSE, ...)
## S3 method for class 'svyrep.design'
postStratify(design, strata, population, partial = FALSE, compress=NULL, ...)
## S3 method for class 'survey.design'
postStratify(design, strata, population, partial = FALSE, ...)
design |
A survey design with replicate weights |
strata |
A formula or data frame of post-stratifying variables, which must not contain missing values. |
population |
A table, xtabs or data frame with population frequencies |
partial |
if TRUE, ignore population strata not present in the sample |
compress |
Attempt to compress the replicate weight matrix? When NULL, compression is attempted if the original weight matrix was compressed |
... |
arguments for future expansion |
The population
totals can be specified as a table with the
strata variables in the margins, or as a data frame where one column
lists frequencies and the other columns list the unique combinations
of strata variables (the format produced by as.data.frame
acting on a table
object). A table must have named dimnames
to indicate the variable names.
Compressing the replicate weights will take time and may even increase memory use if there is actually little redundancy in the weight matrix (in particular if the post-stratification variables have many values and cut across PSUs).
If a svydesign object is to be converted to a replication design, the post-stratification should be performed after conversion.
The variance estimate for replication designs follows the same
procedure as Valliant (1993) described for estimating totals. Rao et
al (2002) describe this procedure for estimating functions (and also
the GREG or g-calibration procedure; see calibrate).
A new survey design object.
If the sampling weights are already post-stratified there will be no
change in point estimates after postStratify
but the standard
error estimates will decrease to correctly reflect the post-stratification.
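This is easy to check: post-stratifying a second time to the same population table changes nothing. A minimal sketch with the api data used throughout this manual:

data(api)
dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
pop.types <- data.frame(stype=c("E","H","M"), Freq=c(4421,755,1018))
dclus1p<-postStratify(dclus1, ~stype, pop.types)   # adjusts weights and SE
dclus1pp<-postStratify(dclus1p, ~stype, pop.types) # weights already match
svymean(~api00, dclus1p)
svymean(~api00, dclus1pp)  # same point estimate and standard error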
Valliant R (1993) Post-stratification and conditional variance estimation. JASA 88: 89-96
Rao JNK, Yung W, Hidiroglou MA (2002) Estimating equations for the analysis of survey data using poststratification information. Sankhya 64 Series A Part 2, 364-378.
rake and calibrate for other things to do with auxiliary information; compressWeights for information on compressing weights.
data(api)
dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
rclus1<-as.svrepdesign(dclus1)
svymean(~api00, rclus1)
svytotal(~enroll, rclus1)
# post-stratify on school type
pop.types <- data.frame(stype=c("E","H","M"), Freq=c(4421,755,1018))
#or: pop.types <- xtabs(~stype, data=apipop)
#or: pop.types <- table(stype=apipop$stype)
rclus1p<-postStratify(rclus1, ~stype, pop.types)
summary(rclus1p)
svymean(~api00, rclus1p)
svytotal(~enroll, rclus1p)
## and for svydesign objects
dclus1p<-postStratify(dclus1, ~stype, pop.types)
summary(dclus1p)
svymean(~api00, dclus1p)
svytotal(~enroll, dclus1p)
Compute the Nagelkerke and Cox-Snell pseudo-R-squared statistics, primarily for logistic regression. A generic function with methods for glm and svyglm. The method for svyglm objects uses the design-based estimators described by Lumley (2017).
psrsq(object, method = c("Cox-Snell", "Nagelkerke"), ...)
object |
A regression model (glm or svyglm object) |
method |
Which statistic to compute |
... |
For future expansion |
Numeric value
Lumley T (2017) "Pseudo-R2 statistics under complex sampling" Australian and New Zealand Journal of Statistics DOI: 10.1111/anzs.12187 (preprint: https://arxiv.org/abs/1701.07745)
data(api)
dclus2<-svydesign(id=~dnum+snum, weights=~pw, data=apiclus2)
model1<-svyglm(I(sch.wide=="Yes")~ell+meals+mobility+as.numeric(stype),
  design=dclus2, family=quasibinomial())
psrsq(model1, method="Nagelkerke")
Raking uses iterative post-stratification to match marginal distributions of a survey sample to known population margins.
rake(design, sample.margins, population.margins,
     control = list(maxit = 10, epsilon = 1, verbose=FALSE),
     compress=NULL)
design |
A survey object |
sample.margins |
list of formulas or data frames describing sample margins, which must not contain missing values |
population.margins |
list of tables or data frames describing corresponding population margins |
control |
control options for the iteration: maxit (maximum number of iterations), epsilon (convergence tolerance), and verbose (print progress) |
compress |
If design is a replicate-weight design, whether to compress the raked replicate weights; the default NULL preserves the compression status of design |
The sample.margins should be in a format suitable for postStratify.
Raking (aka iterative proportional fitting) is known to converge for any table without zeros, and for any table with zeros for which there is a joint distribution with the given margins and the same pattern of zeros. The ‘margins’ need not be one-dimensional.
The algorithm works by repeated calls to postStratify
(iterative proportional fitting), which is efficient for large
multiway tables. For small tables calibrate
will be
faster, and also allows raking to population totals for continuous
variables, and raking with bounded weights.
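For intuition, the iterative proportional fitting step can be sketched in a few lines of base R; the starting table and target margins below are made up for illustration and this sketch is not part of the package:

tab <- matrix(c(10, 20, 30, 40), nrow = 2)  # toy two-way table
row_target <- c(40, 60)                     # target row margin
col_target <- c(50, 50)                     # target column margin
for (i in 1:20) {
  tab <- tab * (row_target / rowSums(tab))        # rescale rows
  tab <- t(t(tab) * (col_target / colSums(tab)))  # rescale columns
}
rowSums(tab); colSums(tab)  # both margins now (essentially) match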
A raked survey design.
calibrate for other ways to use auxiliary information.
data(api)
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
rclus1 <- as.svrepdesign(dclus1)
svymean(~api00, rclus1)
svytotal(~enroll, rclus1)
## population marginal totals for each stratum
pop.types <- data.frame(stype=c("E","H","M"), Freq=c(4421,755,1018))
pop.schwide <- data.frame(sch.wide=c("No","Yes"), Freq=c(1072,5122))
rclus1r <- rake(rclus1, list(~stype,~sch.wide), list(pop.types, pop.schwide))
svymean(~api00, rclus1r)
svytotal(~enroll, rclus1r)
## marginal totals correspond to population
xtabs(~stype, apipop)
svytable(~stype, rclus1r, round=TRUE)
xtabs(~sch.wide, apipop)
svytable(~sch.wide, rclus1r, round=TRUE)
## joint totals don't correspond
xtabs(~stype+sch.wide, apipop)
svytable(~stype+sch.wide, rclus1r, round=TRUE)
## Do it for a design without replicate weights
dclus1r<-rake(dclus1, list(~stype,~sch.wide), list(pop.types, pop.schwide))
svymean(~api00, dclus1r)
svytotal(~enroll, dclus1r)
## compare to raking with calibrate()
dclus1gr<-calibrate(dclus1, ~stype+sch.wide, pop=c(6194,755,1018,5122),
  calfun="raking")
svymean(~stype+api00, dclus1r)
svymean(~stype+api00, dclus1gr)
## compare to joint post-stratification
## (only possible if joint population table is known)
pop.table <- xtabs(~stype+sch.wide, apipop)
rclus1ps <- postStratify(rclus1, ~stype+sch.wide, pop.table)
svytable(~stype+sch.wide, rclus1ps, round=TRUE)
svymean(~api00, rclus1ps)
svytotal(~enroll, rclus1ps)
## Example of raking with partial joint distributions
pop.imp<-data.frame(comp.imp=c("No","Yes"), Freq=c(1712,4482))
dclus1r2<-rake(dclus1, list(~stype+sch.wide, ~comp.imp),
  list(pop.table, pop.imp))
svymean(~api00, dclus1r2)
## compare to calibrate() syntax with tables
dclus1r2<-calibrate(dclus1, formula=list(~stype+sch.wide, ~comp.imp),
  population=list(pop.table, pop.imp), calfun="raking")
svymean(~api00, dclus1r2)
Provides a Wald test, and working Wald and working likelihood ratio (Rao-Scott) tests, of the hypothesis that all coefficients associated with a particular regression term are zero (or have some other specified values). Particularly useful as a substitute for anova when not fitting by maximum likelihood.
regTermTest(model, test.terms, null=NULL, df=NULL,
  method=c("Wald","WorkingWald","LRT"), lrt.approximation="saddlepoint")
model |
A fitted model object, such as a glm, svyglm, or svycoxph fit |
test.terms |
Character string or one-sided formula giving name of term or terms to test |
null |
Null hypothesis values for parameters. Default is zeros |
df |
Denominator degrees of freedom for an F test. If NULL, they are taken from the model or design; use Inf for a chi-squared reference distribution |
method |
"Wald" for the Wald test; "WorkingWald" or "LRT" for the working Wald and working likelihood ratio tests described in the Details |
lrt.approximation |
method for approximating the distribution of the LRT and working Wald statistics; see pchisqsum |
The Wald test uses a chi-squared or F distribution. The two
working-model tests come from the (misspecified) working model where the
observations are independent and the weights are frequency weights. For
categorical data, this is just the model fitted to the estimated
population crosstabulation. The Rao-Scott LRT statistic is the likelihood
ratio statistic in this model. The working Wald test statistic is the Wald statistic
in this model. The working-model tests do not have a chi-squared
sampling distribution: we use a linear combination of chi-squared or F
distributions as in pchisqsum
. I believe the working Wald
test is what SUDAAN refers to as a
"Satterthwaite adjusted Wald test".
To match other software you will typically need to use lrt.approximation="satterthwaite".
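For example, reusing the model from the Examples section below:

data(api)
dclus2<-svydesign(id=~dnum+snum, weights=~pw, data=apiclus2)
model2<-svyglm(I(sch.wide=="Yes")~ell+meals+mobility,
  design=dclus2, family=quasibinomial())
## Satterthwaite approximation, for comparability with other software
regTermTest(model2, ~ell+meals, method="LRT",
  lrt.approximation="satterthwaite")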
An object of class regTermTest or regTermTestLRT.
The "LRT"
method will not work if the model had starting values supplied for the regression coefficients. Instead, fit the two models separately and use anova(model1, model2, force=TRUE)
Rao, JNK, Scott, AJ (1984) "On Chi-squared Tests For Multiway Contingency Tables with Proportions Estimated From Survey Data" Annals of Statistics 12:46-60.
Lumley T, Scott A (2012) "Partial likelihood ratio tests for the Cox model under complex sampling" Statistics in Medicine 17 JUL 2012. DOI: 10.1002/sim.5492
Lumley T, Scott A (2014) "Tests for Regression Models Fitted to Survey Data" Australian and New Zealand Journal of Statistics 56:1-14 DOI: 10.1111/anzs.12065
anova, vcov, contrasts, pchisqsum.
data(esoph)
model1 <- glm(cbind(ncases, ncontrols) ~ agegp + tobgp*alcgp,
  data = esoph, family = binomial())
anova(model1)
regTermTest(model1, "tobgp")
regTermTest(model1, "tobgp:alcgp")
regTermTest(model1, ~alcgp+tobgp:alcgp)
data(api)
dclus2<-svydesign(id=~dnum+snum, weights=~pw, data=apiclus2)
model2<-svyglm(I(sch.wide=="Yes")~ell+meals+mobility,
  design=dclus2, family=quasibinomial())
regTermTest(model2, ~ell)
regTermTest(model2, ~ell, df=NULL)
regTermTest(model2, ~ell, method="LRT", df=Inf)
regTermTest(model2, ~ell+meals, method="LRT", df=NULL)
regTermTest(model2, ~ell+meals, method="WorkingWald", df=NULL)
Evaluates a set of expressions for different frame weights in a dual-frame/multi-frame design, so that an optimal or compromise-optimal set of frame weights can be chosen.
reweight(design, ...)
## S3 method for class 'dualframe'
reweight(design, targets=NULL, totals=NULL,
  estimator=c("constant","expected"),
  theta=NULL, theta_grid=seq(0,1,by=0.05), ...)
## S3 method for class 'dualframe_with_rewt'
plot(x, y, type="b", ...)
design |
dual-frame or multiframe design object |
targets , totals
|
A list of quoted expressions estimating the variance of a survey estimator (targets), or a list of model formulas specifying population totals (totals) |
estimator |
As in multiframe |
theta |
As in multiframe |
theta_grid |
Grid of theta values to optimise over when theta is not supplied |
x |
object produced by |
y |
ignored |
type , ...
|
in the plot method, graphical arguments passed to the underlying plotting function |
Traditionally, this optimisation has been done for totals, which is a good default and is more mathematically tractable. However, when the point of multiple-frame sampling is to improve precision for a rare subpopulation, or when you are doing regression modelling, you may want to optimise for something else.
An object of class "dualframe_with_rewt".
The coef method returns the optimal theta for each target.
The rewt element includes the variances of each target over the grid of theta values, in its variances component.
data(phoneframes)
A_in_frames<-cbind(1, DatA$Domain=="ab")
B_in_frames<-cbind(DatB$Domain=="ba", 1)
Bdes_pps<-svydesign(id=~1, fpc=~ProbB, data=DatB, pps=ppsmat(PiklB))
Ades_pps<-svydesign(id=~1, fpc=~ProbA, data=DatA, pps=ppsmat(PiklA))
## Not very good weighting
mf_pps<-multiframe(list(Ades_pps,Bdes_pps), list(A_in_frames,B_in_frames), theta=0.5)
svytotal(~Lei+Feed+Tax+Clo, mf_pps, na.rm=TRUE)
## try to optimise
mf_opt<-reweight(mf_pps, totals=list(~Lei, ~Feed, ~Tax, ~Clo))
coef(mf_opt)
plot(mf_opt)
## a good compromise is 0.80 for everything except Tax
## and it's still pretty good there
## (Tax will be biased because it's missing for landline-only)
mf_pps_opt<-reweight(mf_opt, theta=0.80)
svytotal(~Lei+Feed+Tax+Clo, mf_pps_opt, na.rm=TRUE)
## Targets other than totals
mf_reg<-reweight(mf_pps,
  targets=list(quote(vcov(svyglm(Lei~Feed+Clo, design=.DESIGN))[1,1]),
               quote(vcov(svytotal(~Lei, .DESIGN)))))
plot(mf_reg, type="l")
legend("topright", bty="n", lty=1:2, col=1:2,
  legend=c("regression","total"))
## Zooming in on optimality for a particular variable (for compatibility)
mf_opt1<-reweight(mf_pps, totals=list(~Feed), theta_grid=seq(0.7,0.9,length=100))
coef(mf_opt1) # Frames2::Hartley gives 0.802776
This data set presents the outcome of three experiments conducted at the University of Chicago in 1986 to study interbreeding between populations of mountain dusky salamanders (McCullagh and Nelder, 1989, Section 14.5). The analysis here is from Lumley (1998, section 5.3)
data(salamander)
A data frame with the following columns:
Whether the salamanders mated (1) or did not mate (0).
Cross between female and male type. A factor with four levels: R/R, R/W, W/R, and W/W. The type of the female salamander is listed first and the male second; Rough Butt is represented by R and White Side by W. For example, Cross=W/R indicates a White Side female crossed with a Rough Butt male.
Identification number of the male salamander. A factor.
Identification number of the female salamander. A factor.
McCullagh P. and Nelder, J. A. (1989) Generalized Linear Models. Chapman and Hall/CRC.
Lumley T (1998) PhD thesis, University of Washington.
data(salamander)
salamander$mixed<-with(salamander, Cross=="W/R" | Cross=="R/W")
salamander$RWvsWR<-with(salamander,
  ifelse(mixed, ((Cross=="R/W")-(Cross=="W/R"))/2, 0))
xsalamander<-xdesign(id=list(~Male, ~Female), data=salamander,
  overlap="unbiased")
## Adjacency matrix
## Blocks 1 and 2 are actually the same salamanders, but
## it's traditional to pretend they are independent.
image(xsalamander$adjacency)
## R doesn't allow family=binomial(identity)
success <- svyglm(Mate~mixed+RWvsWR, design=xsalamander,
  family=quasi(link="identity", variance="mu(1-mu)"))
summary(success)
These data are from Section 12.2 of Levy and Lemeshow. They describe a (possibly apocryphal) study of survival in out-of-hospital cardiac arrest. Two out of five ambulance stations were sampled from each of three emergency service areas.
data(scd)
This data frame contains the following columns:
Emergency Service Area (strata)
Ambulance station (PSU)
estimated number of cardiac arrests
number reaching hospital alive
Levy and Lemeshow. "Sampling of Populations" (3rd edition). Wiley.
data(scd)
## survey design objects
scddes<-svydesign(data=scd, prob=~1, id=~ambulance, strata=~ESA,
  nest=TRUE, fpc=rep(5,6))
scdnofpc<-svydesign(data=scd, prob=~1, id=~ambulance, strata=~ESA,
  nest=TRUE)
# convert to BRR replicate weights
scd2brr <- as.svrepdesign(scdnofpc, type="BRR")
# or to Rao-Wu bootstrap
scd2boot <- as.svrepdesign(scdnofpc, type="subboot")
# use BRR replicate weights from Levy and Lemeshow
repweights<-2*cbind(c(1,0,1,0,1,0), c(1,0,0,1,0,1),
                    c(0,1,1,0,0,1), c(0,1,0,1,1,0))
scdrep<-svrepdesign(data=scd, type="BRR", repweights=repweights)
# ratio estimates
svyratio(~alive, ~arrests, design=scddes)
svyratio(~alive, ~arrests, design=scdnofpc)
svyratio(~alive, ~arrests, design=scd2brr)
svyratio(~alive, ~arrests, design=scd2boot)
svyratio(~alive, ~arrests, design=scdrep)
# or a logistic regression
summary(svyglm(cbind(alive,arrests-alive)~1, family=quasibinomial, design=scdnofpc))
summary(svyglm(cbind(alive,arrests-alive)~1, family=quasibinomial, design=scdrep))
# Because no sampling weights are given, can't compute design effects
# without replacement: use deff="replace"
svymean(~alive+arrests, scddes, deff=TRUE)
svymean(~alive+arrests, scddes, deff="replace")
Extracts standard errors from an object. The default method is for objects with a vcov method.
SE(object, ...)
## Default S3 method:
SE(object, ...)
## S3 method for class 'svrepstat'
SE(object, ...)
object |
An object |
... |
Arguments for future expansion |
Vector of standard errors.
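A brief illustration with the api data; the default method applies to any object with a vcov method, such as a glm fit:

data(api)
dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
SE(svymean(~api00, dclus1))          # SE of a survey statistic
SE(glm(api00~api99, data=apistrat))  # default method, via vcov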
Generates small area estimates by smoothing direct estimates using an area level model.
svysmoothArea(formula, domain, design = NULL, adj.mat = NULL,
  X.domain = NULL, direct.est = NULL, domain.size = NULL,
  transform = c("identity", "logit", "log"),
  pc.u = 1, pc.alpha = 0.01, pc.u.phi = 0.5, pc.alpha.phi = 2/3,
  level = 0.95, n.sample = 250, var.tol = 1e-10,
  return.samples = FALSE, ...)
formula |
An object of class 'formula' describing the model to be fitted. If direct.est is specified, the right hand side of the formula is not necessary. |
domain |
One-sided formula specifying factors containing domain labels |
design |
An object of class "svydesign" containing the data for the model |
adj.mat |
Adjacency matrix with rownames matching the domain labels. If set to NULL, the IID spatial effect will be used. |
X.domain |
Data frame of areal covariates. One of the column names needs to match the name of the domain variable, in order to be linked to the data input. Currently only supporting time-invariant covariates. |
direct.est |
Data frame of direct estimates, with first column containing the domain variable, second column containing direct estimate, and third column containing the variance of direct estimate. |
domain.size |
Data frame of domain sizes. One of the column names needs to match the name of the domain variable, in order to be linked to the data input, and there must be a column named 'size' containing domain sizes. |
transform |
Optional transformation applied to the direct estimates before fitting area level model. The default option is no transformation, but logit and log are implemented. |
pc.u |
Hyperparameter U for the PC prior on precisions. See the INLA documentation for more details on the parameterization. |
pc.alpha |
Hyperparameter alpha for the PC prior on precisions. |
pc.u.phi |
Hyperparameter U for the PC prior on the mixture probability phi in BYM2 model. |
pc.alpha.phi |
Hyperparameter alpha for the PC prior on the mixture probability phi in BYM2 model. |
level |
The specified level for the posterior credible intervals |
n.sample |
Number of draws from posterior used to compute summaries |
var.tol |
Tolerance parameter; if variance of an area's direct estimator is below this value, that direct estimator is dropped from model |
return.samples |
If TRUE, return matrix of posterior samples of area level quantities |
... |
for future methods |
The basic area level model is a Bayesian version of the Fay-Herriot model (Fay & Herriot,1979). It treats direct estimates of small area quantities as response data and explicitly models differences between areas using covariate information and random effects. The Fay-Herriot model can be viewed as a two-stage model: in the first stage, a sampling model represents the sampling variability of a direct estimator and in the second stage, a linking model describes the between area differences in small area quantities. More detail is given in section 4 of Mercer et al (2015).
A svysae object.
Fay, Robert E., and Roger A. Herriot. (1979). Estimates of Income for Small Places: An Application of James-Stein Procedures to Census Data. Journal of the American Statistical Association 74 (366a): 269-77.
Mercer LD, Wakefield J, Pantazis A, Lutambi AM, Masanja H, Clark S. Space-Time Smoothing of Complex Survey Data: Small Area Estimation for Child Mortality. Ann Appl Stat. 2015 Dec;9(4):1889-1905. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4959836/
The survey-sae vignette.
## artificial data from SUMMER package
## Uses too many cores for a CRAN example
## Not run:
hasSUMMER<-tryCatch({
  data("DemoData2", package="SUMMER")
  data("DemoMap2", package="SUMMER")
}, error=function(e) FALSE)
if (!isFALSE(hasSUMMER)){
  library(survey)
  des0 <- svydesign(ids = ~clustid+id, strata = ~strata,
                    weights = ~weights, data = DemoData2, nest = TRUE)
  Xmat <- aggregate(age~region, data = DemoData2, FUN = mean)
  cts.cov.res <- svysmoothArea(tobacco.use ~ age, domain = ~region,
                               design = des0, adj.mat = DemoMap2$Amat,
                               X.domain = Xmat,
                               pc.u = 1, pc.alpha = 0.01,
                               pc.u.phi = 0.5, pc.alpha.phi = 2/3)
  print(cts.cov.res)
  plot(cts.cov.res)
}
## End(Not run)
Generates small area estimates by smoothing direct estimates using a basic
unit level model. This model assumes sampling is ignorable (no selection
bias). It is a Bayesian linear (family="gaussian") or generalised linear (family="binomial") mixed model for the unit-level data with individual-level covariates and area-level random effects.
svysmoothUnit(formula, domain, design,
  family = c("gaussian", "binomial"),
  X.pop = NULL, adj.mat = NULL, domain.size = NULL,
  pc.u = 1, pc.alpha = 0.01, pc.u.phi = 0.5, pc.alpha.phi = 2/3,
  level = 0.95, n.sample = 250, return.samples = FALSE,
  X.pop.weights = NULL, ...)
formula |
An object of class 'formula' describing the model to be fitted. |
domain |
One-sided formula specifying factors containing domain labels |
design |
An object of class "survey.design" containing the data for the model |
family |
of the response variable; currently 'gaussian' (the default) and 'binomial' (with a logit link function) are supported |
X.pop |
Data frame of population unit-level covariates. One of the column names needs to match the domain variable specified, in order to be linked to the data input. Currently only supporting time-invariant covariates. |
adj.mat |
Adjacency matrix with rownames matching the domain labels. If set to NULL, the IID spatial effect will be used. |
domain.size |
Data frame of domain sizes. One of the column names needs to match the name of the domain variable, in order to be linked to the data input, and there must be a column named 'size' containing domain sizes. |
pc.u |
Hyperparameter U for the PC prior on precisions. See the INLA documentation for more details on the parameterization. |
pc.alpha |
Hyperparameter alpha for the PC prior on precisions. |
pc.u.phi |
Hyperparameter U for the PC prior on the mixture probability phi in BYM2 model. |
pc.alpha.phi |
Hyperparameter alpha for the PC prior on the mixture probability phi in BYM2 model. |
level |
The specified level for the posterior credible intervals |
n.sample |
Number of draws from posterior used to compute summaries |
return.samples |
If TRUE, return matrix of posterior samples of area level quantities |
X.pop.weights |
Optional vector of weights to use when aggregating unit level predictions |
... |
for future expansion |
A svysae object.
Battese, G. E., Harter, R. M., & Fuller, W. A. (1988). An Error-Components Model for Prediction of County Crop Areas Using Survey and Satellite Data. Journal of the American Statistical Association, 83(401), 28-36.
The survey-sae vignette.
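A sketch parallel to the svysmoothArea example above, assuming the SUMMER example data and the INLA package are available; using the sampled units as a stand-in for the population covariate frame is illustrative only:

## Not run:
data("DemoData2", package="SUMMER")
data("DemoMap2", package="SUMMER")
des0 <- svydesign(ids = ~clustid+id, strata = ~strata,
                  weights = ~weights, data = DemoData2, nest = TRUE)
## illustrative population covariate frame (here just the sample itself)
X.pop <- DemoData2[, c("region", "age")]
unit.res <- svysmoothUnit(tobacco.use ~ age, domain = ~region,
                          design = des0, family = "binomial",
                          X.pop = X.pop, adj.mat = DemoMap2$Amat)
print(unit.res)
## End(Not run)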
This function takes a stratified sample without replacement from a data set.
stratsample(strata, counts)
strata |
Vector of stratum identifiers; will be coerced to character |
counts |
named vector of stratum sample sizes, with names corresponding to the values of |
Vector of indices into strata giving the sample.
The "sampling" package has many more sampling algorithms.
data(api)
s<-stratsample(apipop$stype, c("E"=5, "H"=4, "M"=2))
table(apipop$stype[s])
Restrict a survey design to a subpopulation, keeping the original design information about the number of clusters and strata. If the design has no post-stratification or calibration data, the subset will use proportionately less memory.
## S3 method for class 'survey.design'
subset(x, subset, ...)
## S3 method for class 'svyrep.design'
subset(x, subset, ...)
x |
A survey design object |
subset |
An expression specifying the subpopulation |
... |
Arguments not used by this method |
A new survey design object
data(fpc)
dfpc<-svydesign(id=~psuid, strat=~stratid, weight=~weight, data=fpc, nest=TRUE)
dsub<-subset(dfpc, x>4)
summary(dsub)
svymean(~x, design=dsub)
## These should give the same domain estimates and standard errors
svyby(~x, ~I(x>4), design=dfpc, svymean)
summary(svyglm(x~I(x>4)+0, design=dfpc))
data(api)
dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
rclus1<-as.svrepdesign(dclus1)
svymean(~enroll, subset(dclus1, sch.wide=="Yes" & comp.imp=="Yes"))
svymean(~enroll, subset(rclus1, sch.wide=="Yes" & comp.imp=="Yes"))
This help page documents the options that control the behaviour of the survey package.
All the options for the survey package have names beginning with "survey". Four of them control standard error estimation.
options("survey.replicates.mse")
controls the default in
svrepdesign
and as.svrepdesign
for computing
variances. When options("survey.replicates.mse")
is
TRUE
, the default is to create replicate weight designs that
compute variances centered at the point estimate, rather than at the
mean of the replicates. The option can be overridden by specifying
the mse
argument explicitly in svrepdesign
and
as.svrepdesign
. The default is FALSE
.
When options("survey.ultimate.cluster")
is TRUE
,
standard error estimation is based on independence of PSUs at the
first stage of sampling, without using any information about
subsequent stages. When FALSE
, finite population corrections
and variances are estimated recursively. See svyrecvar
for more information. This option makes no difference unless
first-stage finite population corrections are specified, in which
case setting the option to TRUE
gives the wrong answer for a
multistage study. The only reason to use TRUE
is for
compatibility with other software that gives the wrong answer.
Handling of strata with a single PSU that are not certainty PSUs is
controlled by options("survey.lonely.psu")
. The default
setting is "fail"
, which gives an error. Use "remove"
to ignore that PSU for variance computation, "adjust"
to
center the stratum at the population mean rather than the stratum
mean, and "average"
to replace the variance contribution of
the stratum by the average variance contribution across strata. As
of version 3.4-2 as.svrepdesign
also uses this option.
The variance formulas for domain estimation give well-defined,
positive results when a stratum contains only one PSU with
observations in the domain, but are not unbiased. If
options("survey.adjust.domain.lonely")
is TRUE
and
options("survey.lonely.psu")
is "average"
or
"adjust"
the same adjustment for lonely PSUs will be used
within a domain. Note that this adjustment is not available for
replicate-weight designs, nor (currently) for raked,
post-stratified, or calibrated designs.
Another option is options("survey.want.obsolete"). This controls the warnings about using the deprecated pre-2.9.0 survey design objects.
The behaviour of replicate-weight designs for self-representing
strata is controlled by options("survey.drop.replicates")
.
When TRUE
, various optimizations are used that take advantage
of the fact that these strata do not contribute to the variance.
The only reason ever to use FALSE
is if there is a bug in
the code for these optimizations.
options("survey.multicore") controls the use of multiple processors with the multicore package. This option should not affect the values
computed by any of the survey functions. If TRUE
, all
functions that are able to use multiple processors will do so by
default. Using multiple processors may speed up calculations, but
need not, especially if the computer is short on memory. The best
strategy is probably to experiment with explicitly requesting
multicore=TRUE
in functions that support it, to see if there
is an increase in speed before setting the global option.
survey.use_rcpp controls whether the new C++ code for standard errors is used (versus the old R code). The default setting is TRUE, and the only reason to use FALSE is for comparisons.
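A sketch of typical settings for the options described above (these change global behaviour for the rest of the session):

options(survey.lonely.psu = "adjust")     # centre lonely PSUs at the population mean
options(survey.ultimate.cluster = FALSE)  # recursive multistage variance estimation
options(survey.replicates.mse = TRUE)     # centre replicate variances at the point estimate
getOption("survey.use_rcpp")              # check which standard-error code path is in use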
Compute means, variances, ratios and totals for data from complex surveys.
## S3 method for class 'survey.design'
svymean(x, design, na.rm=FALSE, deff=FALSE, influence=FALSE, ...)
## S3 method for class 'survey.design2'
svymean(x, design, na.rm=FALSE, deff=FALSE, influence=FALSE, ...)
## S3 method for class 'twophase'
svymean(x, design, na.rm=FALSE, deff=FALSE, ...)
## S3 method for class 'svyrep.design'
svymean(x, design, na.rm=FALSE, rho=NULL, return.replicates=FALSE, deff=FALSE, ...)
## S3 method for class 'survey.design'
svyvar(x, design, na.rm=FALSE, ...)
## S3 method for class 'svyrep.design'
svyvar(x, design, na.rm=FALSE, rho=NULL, return.replicates=FALSE, ...,
  estimate.only=FALSE)
## S3 method for class 'survey.design'
svytotal(x, design, na.rm=FALSE, deff=FALSE, influence=FALSE, ...)
## S3 method for class 'survey.design2'
svytotal(x, design, na.rm=FALSE, deff=FALSE, influence=FALSE, ...)
## S3 method for class 'twophase'
svytotal(x, design, na.rm=FALSE, deff=FALSE, ...)
## S3 method for class 'svyrep.design'
svytotal(x, design, na.rm=FALSE, rho=NULL, return.replicates=FALSE, deff=FALSE, ...)
## S3 method for class 'svystat'
coef(object, ...)
## S3 method for class 'svrepstat'
coef(object, ...)
## S3 method for class 'svystat'
vcov(object, ...)
## S3 method for class 'svrepstat'
vcov(object, ...)
## S3 method for class 'svystat'
confint(object, parm, level = 0.95, df = Inf, ...)
## S3 method for class 'svrepstat'
confint(object, parm, level = 0.95, df = Inf, ...)
cv(object, ...)
deff(object, quietly=FALSE, ...)
make.formula(names)
x |
A formula, vector or matrix |
design |
A survey design object |
na.rm |
Should cases with missing values be dropped? |
influence |
Should a matrix of influence functions be returned (primarily to support svyby)? |
rho |
parameter for Fay's variance estimator in a BRR design |
return.replicates |
Return the replicate means/totals? |
deff |
Return the design effect (see below) |
object |
The result of one of the other survey summary functions |
quietly |
Don't warn when there is no design effect computed |
estimate.only |
Don't compute standard errors (useful when svyvar is used to estimate a design effect) |
parm |
a specification of which parameters are to be given confidence intervals, either a vector of numbers or a vector of names. If missing, all parameters are considered. |
level |
the confidence level required. |
df |
degrees of freedom for t-distribution in confidence
interval; use degf(design) for the design-based degrees of freedom |
... |
additional arguments to methods, not currently used |
names |
vector of character strings |
These functions perform weighted estimation, with each observation being weighted by the inverse of its sampling probability. Except for the table functions, these also give precision estimates that incorporate the effects of stratification and clustering.
Factor variables are converted to sets of indicator variables for each category in computing means and totals. Combining this with the interaction function allows crosstabulations. See ftable.svystat for formatting the output.
With na.rm=TRUE
, all cases with missing data are removed. With
na.rm=FALSE
cases with missing data are not removed and so will
produce missing results. When using replicate weights and
na.rm=FALSE
it may be useful to set
options(na.action="na.pass")
, otherwise all replicates with any
missing results will be discarded.
The svytotal and svreptotal functions estimate a population total. Use predict on svyratio and svyglm to get ratio or regression estimates of totals.
svyvar
estimates the population variance. The object returned
includes the full matrix of estimated population variances and
covariances, but by default only the diagonal elements are printed. To
display the whole matrix use as.matrix(v)
or print(v,
covariance=TRUE)
.
The design effect compares the variance of a mean or total to the
variance from a study of the same size using simple random sampling
without replacement. Note that the design effect will be incorrect if
the weights have been rescaled so that they are not reciprocals of
sampling probabilities. To obtain an estimate of the design effect
comparing to simple random sampling with replacement, which does not
have this requirement, use deff="replace"
. This with-replacement
design effect is the square of Kish's "deft".
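For example, Kish's deft can be recovered as the square root of the with-replacement design effect:

data(api)
dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
d<-deff(svymean(~api00, dclus1, deff="replace"))
sqrt(d)  # Kish's deft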
The design effect for a subset of a design conditions on the size of the subset. That is, it compares the variance of the estimate to the variance of an estimate based on a simple random sample of the same size as the subset, taken from the subpopulation. So, for example, under stratified random sampling the design effect in a subset consisting of a single stratum will be 1.0.
The cv
function computes the coefficient of variation of a
statistic such as ratio, mean or total. The default method is for any
object with methods for SE and coef.
make.formula makes a formula from a vector of names. This is useful because formulas are the best way to specify variables to the survey functions.
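For example:

make.formula(c("api00","api99"))
## ~api00 + api99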
Objects of class "svystat"
or "svrepstat"
,
which are vectors with a "var"
attribute giving the variance
and a "statistic"
attribute giving the name of the
statistic, and optionally a "deff"
attribute with design effects
These objects have methods for vcov, SE, coef, confint, and svycontrast.
When influence=TRUE is used, a svystat object has an attribute "influence" with influence functions for each observation.
When return.replicates=TRUE, the svrepstat object is a list whose second component is a matrix of replicate values.
svystat objects have Math and Ops methods that remove the variance attribute.
Thomas Lumley
svydesign
, as.svrepdesign
,
svrepdesign
for constructing design objects.
degf
to extract degrees of freedom from a design.
svyquantile
for quantiles
ftable.svystat
for more attractive tables
svyciprop
for more accurate confidence intervals for
proportions near 0 or 1.
svyttest
for comparing two means.
svycontrast
for linear and nonlinear functions of estimates.
data(api)
## one-stage cluster sample
dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
svymean(~api00, dclus1, deff=TRUE)
svymean(~factor(stype), dclus1)
svymean(~interaction(stype, comp.imp), dclus1)
svyquantile(~api00, dclus1, c(.25,.5,.75))
svytotal(~enroll, dclus1, deff=TRUE)
svyratio(~api.stu, ~enroll, dclus1)
v<-svyvar(~api00+api99, dclus1)
v
print(v, cov=TRUE)
as.matrix(v)
# replicate weights - jackknife (this is slower)
dstrat<-svydesign(id=~1, strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
jkstrat<-as.svrepdesign(dstrat)
svymean(~api00, jkstrat)
svymean(~factor(stype), jkstrat)
svyvar(~api00+api99, jkstrat)
svyquantile(~api00, jkstrat, c(.25,.5,.75))
svytotal(~enroll, jkstrat)
svyratio(~api.stu, ~enroll, jkstrat)
# coefficients of variation
cv(svytotal(~enroll, dstrat))
cv(svyratio(~api.stu, ~enroll, jkstrat))
# extracting information from the results
coef(svytotal(~enroll, dstrat))
vcov(svymean(~api00+api99, jkstrat))
SE(svymean(~enroll, dstrat))
confint(svymean(~api00+api00, dclus1))
confint(svymean(~api00+api00, dclus1), df=degf(dclus1))
# Design effect
svymean(~api00, dstrat, deff=TRUE)
svymean(~api00, dstrat, deff="replace")
svymean(~api00, jkstrat, deff=TRUE)
svymean(~api00, jkstrat, deff="replace")
(a<-svytotal(~enroll, dclus1, deff=TRUE))
deff(a)
## weights that are *already* calibrated to population size
sum(weights(dclus1))
nrow(apipop)
cdclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc,
  calibrate.formula=~1)
SE(svymean(~enroll, dclus1))
## not equal to SE(mean)
SE(svytotal(~enroll, dclus1))/nrow(apipop)
## equal to SE(mean)
SE(svytotal(~enroll, cdclus1))/nrow(apipop)
Some recent large-scale surveys specify replication weights rather than the sampling design (partly for privacy reasons). This function specifies the data structure for such a survey.
svrepdesign(variables, repweights, weights, data, degf=NULL, ...)
## Default S3 method:
svrepdesign(variables = NULL, repweights = NULL, weights = NULL,
  data = NULL, degf = NULL,
  type = c("BRR", "Fay", "JK1", "JKn", "bootstrap",
           "ACS", "successive-difference", "JK2", "other"),
  combined.weights = TRUE, rho = NULL, bootstrap.average = NULL,
  scale = NULL, rscales = NULL, fpc = NULL,
  fpctype = c("fraction", "correction"),
  mse = getOption("survey.replicates.mse"), ...)
## S3 method for class 'imputationList'
svrepdesign(variables = NULL, repweights, weights, data, degf = NULL,
  mse = getOption("survey.replicates.mse"), ...)
## S3 method for class 'character'
svrepdesign(variables = NULL, repweights = NULL, weights = NULL,
  data = NULL, degf = NULL,
  type = c("BRR", "Fay", "JK1", "JKn", "bootstrap",
           "ACS", "successive-difference", "JK2", "other"),
  combined.weights = TRUE, rho = NULL, bootstrap.average = NULL,
  scale = NULL, rscales = NULL, fpc = NULL,
  fpctype = c("fraction", "correction"),
  mse = getOption("survey.replicates.mse"),
  dbtype = "SQLite", dbname, ...)
## S3 method for class 'svyrep.design'
image(x, ..., col = grey(seq(.5, 1, length = 30)), type. = c("rep", "total"))
variables |
formula or data frame specifying variables to include in the design (default is all) |
repweights |
formula or data frame specifying replication weights, or character string specifying a regular expression that matches the names of the replication weight variables |
weights |
sampling weights |
data |
data frame to look up variables in formulas, or character string giving name of database table |
degf |
Design degrees of freedom; with the default NULL they are computed from the numerical rank of the replicate weights |
type |
Type of replication weights |
combined.weights |
TRUE if the replication weights already incorporate the sampling weights |
rho |
Shrinkage factor for weights in Fay's method |
bootstrap.average |
For type="bootstrap", if the bootstrap weights have been averaged, the number of iterations averaged over |
scale , rscales
|
Scaling constant for variance, see Details below |
fpc , fpctype
|
Finite population correction information |
mse |
If TRUE, compute variances from sums of squares around the point estimate rather than around the mean of the replicates |
dbname |
name of database; the name by which the driver identifies the specific database (eg a file name for SQLite) |
dbtype |
Database driver: see Details |
x |
survey design with replicate weights |
... |
Other arguments to image |
col |
Colors |
type. |
"rep" to plot the replicate weights themselves, "total" to plot the implied overall weights |
In the BRR method, the dataset is split into halves, and the
difference between halves is used to estimate the variance. In Fay's
method, rather than removing observations from half the sample they
are given weight rho
in one half-sample and 2-rho
in the
other. The ideal BRR analysis is restricted to a design where each stratum has two PSUs; however, it has been used in a much wider class of surveys. The scale and rscales arguments will be ignored (with a warning) if they are specified.
The JK1 and JKn types are both jackknife estimators deleting one cluster at a time. JKn is designed for stratified and JK1 for unstratified designs.
The successive-difference weights in the American Community Survey
automatically use scale = 4/ncol(repweights)
and rscales=rep(1,
ncol(repweights))
. This can be specified as type="ACS"
or
type="successive-difference"
. The scale
and rscales
arguments will be ignored (with a warning) if they are specified. The
American Community Survey recommends mse-style standard error
estimates; if you do not specify mse
explicitly mse=TRUE
will be set with a message, overriding
getOption("survey.replicates.mse")
. If you explicitly specify
mse=FALSE
there will be a warning but your choice will be respected.
JK2 weights (type="JK2"
), as in the California Health Interview
Survey, automatically use scale=1
, rscales=rep(1, ncol(repweights))
.
The scale
and rscales
arguments will be ignored (with a warning) if they are specified.
Averaged bootstrap weights ("mean bootstrap") are used for some surveys from Statistics Canada. Yee et al (1999) describe their construction and use for one such survey.
The variance is computed as the sum of squared deviations of the
replicates from their mean. This may be rescaled: scale
is an
overall multiplier and rscales
is a vector of
replicate-specific multipliers for the squared deviations. That is,
rscales should have one entry for each column of repweights.
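In scalar form, for a vector of replicate estimates the computation is the following (a minimal sketch; svrVar, documented later in this manual, is the package's actual implementation):

## v = scale * sum( rscales * (theta_r - theta_bar)^2 )
replicate_variance <- function(thetas, scale, rscales)
  scale * sum(rscales * (thetas - mean(thetas))^2)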
If the replication weights incorporate the sampling weights (combined.weights=TRUE), or for type="other", the scale and rscales arguments must be specified; otherwise they can be guessed from the weights.
A finite population correction may be specified for type="other"
,
type="JK1"
and type="JKn"
. fpc
must be a vector
with one entry for each replicate. To specify sampling fractions use
fpctype="fraction"
and to specify the correction directly use
fpctype="correction".
The design degrees of freedom are returned by degf
. By
default they are computed from the numerical rank of the
repweights. This is slow for very large data sets and you can specify
a value instead.
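For example (a sketch; bigdata and the replicate-weight matrix big_rep_weights are hypothetical):

## supplying degf avoids computing the rank of a huge replicate-weight matrix
des <- svrepdesign(weights = ~pw, repweights = big_rep_weights,
                   type = "bootstrap", data = bigdata, degf = 50)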
repweights
may be a character string giving a regular expression
for the replicate weight variables. For example, in the
California Health Interview Survey public-use data, the sampling weights are
"rakedw0"
and the replicate weights are "rakedw1"
to
"rakedw80"
. The regular expression "rakedw[1-9]"
matches the replicate weight variables (and not the sampling weight
variable).
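The regular expression can be checked directly against the variable names:

grep("rakedw[1-9]", c("rakedw0","rakedw1","rakedw80"), value=TRUE)
## [1] "rakedw1"  "rakedw80"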
data
may be a character string giving the name of a table or view
in a relational database that can be accessed through the DBI
interface. For DBI interfaces dbtype
should be the name of the database
driver and dbname
should be the name by which the driver identifies
the specific database (eg file name for SQLite).
The appropriate database interface package must already be loaded (eg
RSQLite
for SQLite). The survey design
object will contain the replicate weights, but actual variables will
be loaded from the database only as needed. Use
close
to close the database connection and
open
to reopen the connection, eg, after
loading a saved object.
The database interface does not attempt to modify the underlying database and so can be used with read-only permissions on the database.
To generate your own replicate weights either use
as.svrepdesign
on a survey.design
object, or see
brrweights
, bootweights
,
jk1weights
and jknweights
The model.frame
method extracts the observed data.
Object of class svyrep.design
, with methods for print
,
summary
, weights
, image
.
To use replication-weight analyses on a survey specified by
sampling design, use as.svrepdesign
to convert it.
Levy and Lemeshow. "Sampling of Populations". Wiley.
Shao and Tu. "The Jackknife and Bootstrap." Springer.
Yee et al (1999). Bootstrap Variance Estimation for the National Population Health Survey. Proceedings of the ASA Survey Research Methodology Section. https://web.archive.org/web/20151110170959/http://www.amstat.org/sections/SRMS/Proceedings/papers/1999_136.pdf
as.svrepdesign, svydesign, brrweights, bootweights.
data(scd)
# use BRR replicate weights from Levy and Lemeshow
repweights<-2*cbind(c(1,0,1,0,1,0), c(1,0,0,1,0,1),
                    c(0,1,1,0,0,1), c(0,1,0,1,1,0))
scdrep<-svrepdesign(data=scd, type="BRR", repweights=repweights,
  combined.weights=FALSE)
svyratio(~alive, ~arrests, scdrep)
## Not run:
## Needs RSQLite
library(RSQLite)
db_rclus1<-svrepdesign(weights=~pw, repweights="wt[1-9]+", type="JK1",
  scale=(1-15/757)*14/15,
  data="apiclus1rep", dbtype="SQLite",
  dbname=system.file("api.db", package="survey"), combined=FALSE)
svymean(~api00+api99, db_rclus1)
summary(db_rclus1)
## closing and re-opening a connection
close(db_rclus1)
db_rclus1
try(svymean(~api00+api99, db_rclus1))
db_rclus1<-open(db_rclus1)
svymean(~api00+api99, db_rclus1)
## End(Not run)
Compute an appropriately scaled empirical variance estimate from
replicates. The mse
argument specifies whether the sums of
squares should be centered at the point estimate (mse=TRUE
) or
the mean of the replicates. It is usually taken from the mse
component of the design object.
svrVar(thetas, scale, rscales, na.action=getOption("na.action"), mse=getOption("survey.replicates.mse"),coef)
thetas |
matrix whose rows are replicates (or a vector of replicates) |
scale |
Overall scaling factor |
rscales |
Scaling factor for each squared deviation |
na.action |
How to handle replicates where the statistic could not be estimated |
mse |
if |
coef |
The point estimate, required only if |
covariance matrix.
svrepdesign
, as.svrepdesign
,
brrweights
,
jk1weights
, jknweights
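As a sketch of what the estimation functions do internally (reusing the scd BRR design from the svrepdesign example above; the scale and rscales values follow the BRR variance formula):
data(scd)
repweights <- 2*cbind(c(1,0,1,0,1,0), c(1,0,0,1,0,1),
                      c(0,1,1,0,0,1), c(0,1,0,1,1,0))
scdrep <- svrepdesign(data=scd, type="BRR", repweights=repweights,
                      combined.weights=FALSE)
## one replicate estimate of the ratio alive/arrests per half-sample
wts <- weights(scdrep, type="analysis")
thetas <- apply(wts, 2, function(w) sum(w*scd$alive)/sum(w*scd$arrests))
## BRR: scale = 1/(number of half-samples), all rscales equal to 1; the
## square root should match the SE from svyratio(~alive, ~arrests, scdrep)
svrVar(thetas, scale=1/ncol(wts), rscales=rep(1, ncol(wts)), mse=FALSE)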
Computes the sandwich variance estimator for a generalised linear model fitted to data from a complex sample survey. Designed to be used internally by svyglm
.
svy.varcoef(glm.object, design)
glm.object |
A |
design |
A |
A variance matrix
Thomas Lumley
Compute survey statistics on subsets of a survey defined by factors.
svyby(formula, by ,design,...) ## Default S3 method: svyby(formula, by, design, FUN, ..., deff=FALSE,keep.var = TRUE, keep.names = TRUE,verbose=FALSE, vartype=c("se","ci","ci","cv","cvpct","var"), drop.empty.groups=TRUE, covmat=FALSE, return.replicates=FALSE, na.rm.by=FALSE, na.rm.all=FALSE, stringsAsFactors=TRUE, multicore=getOption("survey.multicore")) ## S3 method for class 'survey.design2' svyby(formula, by, design, FUN, ..., deff=FALSE,keep.var = TRUE, keep.names = TRUE,verbose=FALSE, vartype=c("se","ci","ci","cv","cvpct","var"), drop.empty.groups=TRUE, covmat=FALSE, influence=covmat, na.rm.by=FALSE, na.rm.all=FALSE, stringsAsFactors=TRUE, multicore=getOption("survey.multicore")) ## S3 method for class 'svyby' SE(object,...) ## S3 method for class 'svyby' deff(object,...) ## S3 method for class 'svyby' coef(object,...) ## S3 method for class 'svyby' confint(object, parm, level = 0.95,df =Inf,...) unwtd.count(x, design, ...) svybys(formula, bys, design, FUN, ...)
formula , x
|
A formula specifying the variables to pass to
|
by |
A formula specifying factors that define subsets, or a list of factors. |
design |
A |
FUN |
A function taking a formula and survey design object as its first two arguments. |
... |
Other arguments to |
deff |
Request a design effect from |
keep.var |
If |
keep.names |
Define row names based on the subsets |
verbose |
If |
vartype |
Report variability as one or more of standard error, confidence interval, coefficient of variation, percent coefficient of variation, or variance |
drop.empty.groups |
If |
na.rm.by |
If true, omit groups defined by NA values of the by variables. |
na.rm.all |
If true, check for groups with no non-missing
observations for variables defined by |
covmat |
If |
return.replicates |
Only for replicate-weight designs. If
|
influence |
Return the influence functions of the result |
multicore |
Use |
stringsAsFactors |
Convert any string variables in |
parm |
a specification of which parameters are to be given confidence intervals, either a vector of numbers or a vector of names. If missing, all parameters are considered. |
level |
the confidence level required. |
df |
degrees of freedom for t-distribution in confidence
interval, use |
object |
An object of class |
bys |
one-sided formula with each term specifying a separate grouping (rather than the terms being combined into a joint grouping) |
The variance type "ci" asks for confidence intervals, which are produced
by confint
. In some cases additional options to FUN
will
be needed to produce confidence intervals, for example,
svyquantile
needs ci=TRUE
or keep.var=FALSE
.
unwtd.count
is designed to be passed to svyby
to report
the number of non-missing observations in each subset. Observations
with exactly zero weight will also be counted as missing, since that's
how subsets are implemented for some designs.
Parallel processing with multicore=TRUE
is useful only for
fairly large problems and on computers with sufficient memory. The
multicore
package is incompatible with some GUIs, although the
Mac Aqua GUI appears to be safe.
The variant svybys
creates a separate table for each term in
bys
rather than creating a joint table.
An object of class "svyby"
: a data frame showing the factors and the results of FUN
.
For unwtd.count
, the unweighted number of non-missing observations in the data matrix specified by x
for the design.
The function works by making a lot of calls of the form
FUN(formula, subset(design, by==i))
, where formula
is
re-evaluated in each subset, so it is unwise to use data-dependent
terms in formula
. In particular, svyby(~factor(a), ~b,
design=d, svymean)
, will create factor variables whose levels are
only those values of a
present in each subset. If a
is a character variable then svyby(~a, ~b,
design=d, svymean)
creates factor variables implicitly and so has
the same problem. Either use
update.survey.design
to add variables to the design
object instead or specify the levels explicitly in the call to
factor
. The stringsAsFactors=TRUE option converts all character variables to factors, which can be slow; set it to FALSE if you have predefined the factors where necessary.
Asking for a design effect (deff=TRUE
) from a function
that does not produce one will cause an error or incorrect formatting
of the output. The same will occur with keep.var=TRUE
if the
function does not compute a standard error.
svytable
and ftable.svystat
for
contingency tables, ftable.svyby
for pretty-printing of svyby
data(api) dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc) svyby(~api99, ~stype, dclus1, svymean) svyby(~api99, ~stype, dclus1, svyquantile, quantiles=0.5,ci=TRUE,vartype="ci") ## without ci=TRUE svyquantile does not compute standard errors svyby(~api99, ~stype, dclus1, svyquantile, quantiles=0.5, keep.var=FALSE) svyby(~api99, list(school.type=apiclus1$stype), dclus1, svymean) svyby(~api99+api00, ~stype, dclus1, svymean, deff=TRUE,vartype="ci") svyby(~api99+api00, ~stype+sch.wide, dclus1, svymean, keep.var=FALSE) ## report raw number of observations svyby(~api99+api00, ~stype+sch.wide, dclus1, unwtd.count, keep.var=FALSE) rclus1<-as.svrepdesign(dclus1) svyby(~api99, ~stype, rclus1, svymean) svyby(~api99, ~stype, rclus1, svyquantile, quantiles=0.5) svyby(~api99, list(school.type=apiclus1$stype), rclus1, svymean, vartype="cv") svyby(~enroll,~stype, rclus1,svytotal, deff=TRUE) svyby(~api99+api00, ~stype+sch.wide, rclus1, svymean, keep.var=FALSE) ##report raw number of observations svyby(~api99+api00, ~stype+sch.wide, rclus1, unwtd.count, keep.var=FALSE) ## comparing subgroups using covmat=TRUE mns<-svyby(~api99, ~stype, rclus1, svymean,covmat=TRUE) vcov(mns) svycontrast(mns, c(E = 1, M = -1)) str(svyby(~api99, ~stype, rclus1, svymean,return.replicates=TRUE)) tots<-svyby(~enroll, ~stype, dclus1, svytotal,covmat=TRUE) vcov(tots) svycontrast(tots, quote(E/H)) ## comparing subgroups uses the delta method unless replicates are present meanlogs<-svyby(~log(enroll),~stype,svymean, design=rclus1,covmat=TRUE) svycontrast(meanlogs, quote(exp(E-H))) meanlogs<-svyby(~log(enroll),~stype,svymean, design=rclus1,covmat=TRUE,return.replicates=TRUE) svycontrast(meanlogs, quote(exp(E-H))) ## extractor functions (a<-svyby(~enroll, ~stype, rclus1, svytotal, deff=TRUE, verbose=TRUE, vartype=c("se","cv","cvpct","var"))) deff(a) SE(a) cv(a) coef(a) confint(a, df=degf(rclus1)) ## ratio estimates svyby(~api.stu, by=~stype, denominator=~enroll, design=dclus1, svyratio) ratios<-svyby(~api.stu, by=~stype, denominator=~enroll, design=dclus1, svyratio,covmat=TRUE) vcov(ratios) ## empty groups svyby(~api00,~comp.imp+sch.wide,design=dclus1,svymean) svyby(~api00,~comp.imp+sch.wide,design=dclus1,svymean,drop.empty.groups=FALSE) ## Multiple tables svybys(~api00,~comp.imp+sch.wide,design=dclus1,svymean)
Estimates the population cumulative distribution function for specified
variables. In contrast to svyquantile
, this does not do
any interpolation: the result is a right-continuous step function.
svycdf(formula, design, na.rm = TRUE,...) ## S3 method for class 'svycdf' print(x,...) ## S3 method for class 'svycdf' plot(x,xlab=NULL,...)
formula |
one-sided formula giving variables from the design object |
design |
survey design object |
na.rm |
remove missing data (case-wise deletion)? |
... |
other arguments to |
x |
object of class |
xlab |
a vector of x-axis labels or |
An object of class svycdf
, which is a list of step functions (of
class stepfun
)
svyquantile
, svyhist
, plot.stepfun
data(api) dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat, fpc = ~fpc) cdf.est<-svycdf(~enroll+api00+api99, dstrat) cdf.est ## function cdf.est[[1]] ## evaluate the function cdf.est[[1]](800) cdf.est[[2]](800) ## compare to population and sample CDFs. opar<-par(mfrow=c(2,1)) cdf.pop<-ecdf(apipop$enroll) cdf.samp<-ecdf(apistrat$enroll) plot(cdf.pop,main="Population vs sample", xlab="Enrollment") lines(cdf.samp,col.points="red") plot(cdf.pop, main="Population vs estimate", xlab="Enrollment") lines(cdf.est[[1]],col.points="red") par(opar)
Computes confidence intervals for proportions using methods that may be
more accurate near 0 and 1 than simply using confint(svymean())
.
svyciprop(formula, design, method = c("logit", "likelihood", "asin", "beta","mean","xlogit","wilson"), level = 0.95, df=degf(design),...)
formula |
Model formula specifying a single binary variable |
design |
survey design object |
method |
See Details below. Partial matching is done on the argument. |
level |
Confidence level for interval |
df |
denominator degrees of freedom, for all methods except
|
... |
For |
The "logit"
method fits a logistic regression model and computes a
Wald-type interval on the log-odds scale, which is then transformed to
the probability scale.
The "likelihood"
method uses the (Rao-Scott) scaled chi-squared distribution
for the loglikelihood from a binomial distribution.
The "asin"
method uses the variance-stabilising transformation
for the binomial distribution, the arcsine square root, and then
back-transforms the interval to the probability scale
The "beta"
method uses the incomplete beta function as in
binom.test
, with an effective sample size based on the
estimated variance of the proportion. (Korn and Graubard, 1998)
The "xlogit"
method uses a logit transformation of the mean and
then back-transforms to the probability scale. This appears to be the
method used by SUDAAN and SPSS COMPLEX SAMPLES.
The "wilson"
method is the Wilson score interval, which inverts
the coverage probability statement using the true probability rather
than the estimated probability, which results in a quadratic equation
for the estimated probability. This interval is contained in [0,1].
The "mean"
method is a Wald-type interval on the probability
scale, the same as confint(svymean())
All methods undercover for probabilities close enough to zero or one,
but "beta", "likelihood", "logit", and "xlogit" are noticeably
better than the others. None of the methods will work when the
observed proportion is exactly 0 or 1.
The confint
method extracts the confidence interval; the
vcov
and SE
methods just report the variance or standard
error of the mean.
The point estimate of the proportion, with the confidence interval as an attribute
Rao, JNK, Scott, AJ (1984) "On Chi-squared Tests For Multiway Contingency Tables with Proportions Estimated From Survey Data" Annals of Statistics 12:46-60. https://doi.org/10.1214/AOS/1176346391
Korn EL, Graubard BI. (1998) Confidence Intervals For Proportions With Small Expected Number of Positive Counts Estimated From Survey Data. Survey Methodology 23:193-201. https://www150.statcan.gc.ca/n1/pub/12-001-x/1998002/article/4356-eng.pdf
Dean, N., and Pagano, M. (2015) Evaluating Confidence Interval Methods for Binomial Proportions in Clustered Surveys. Journal of Survey Statistics and Methodology, 3 (4), 484-503. https://doi.org/10.1093/jssam/smv024
data(api) dclus1<-svydesign(id=~dnum, fpc=~fpc, data=apiclus1) svyciprop(~I(ell==0), dclus1, method="li") svyciprop(~I(ell==0), dclus1, method="lo") svyciprop(~I(ell==0), dclus1, method="as") svyciprop(~I(ell==0), dclus1, method="be") svyciprop(~I(ell==0), dclus1, method="me") svyciprop(~I(ell==0), dclus1, method="xl") svyciprop(~I(ell==0), dclus1, method="wi") ## reproduces Stata svy: mean svyciprop(~I(ell==0), dclus1, method="me", df=degf(dclus1)) ## reproduces Stata svy: prop svyciprop(~I(ell==0), dclus1, method="lo", df=degf(dclus1)) rclus1<-as.svrepdesign(dclus1) svyciprop(~I(emer==0), rclus1, method="li") svyciprop(~I(emer==0), rclus1, method="lo") svyciprop(~I(emer==0), rclus1, method="as") svyciprop(~I(emer==0), rclus1, method="be") svyciprop(~I(emer==0), rclus1, method="me") svyciprop(~I(emer==0), rclus1, method="wi")
Computes linear or nonlinear contrasts of estimates produced by survey
functions (or any object with coef
and vcov
methods).
svycontrast(stat, contrasts, add=FALSE, ...)
stat |
object of class |
contrasts |
A vector or list of vectors of coefficients, or a call or list of calls |
add |
keep all the coefficients of the input in the output? |
... |
For future expansion |
If contrasts
is a list, the element names are used as
names for the returned statistics.
If an element of contrasts
is shorter than coef(stat)
and has names, the
names are used to match up the vectors and the remaining elements of
contrasts
are assumed to be zero. If the names are not legal
variable names (eg 0.1
) they must be quoted (eg "0.1"
).
If contrasts
is a "call"
or list of "call"s
, and
stat
is a svrepstat
object including replicates, the
replicates are transformed and used to compute the variance. If
stat
is a svystat
object or a svrepstat
object
without replicates, the delta-method is used to compute variances, and
the calls must use only functions that deriv
knows how
to differentiate. If the names are not legal variable names they must
be quoted with backticks (eg `0.1`
).
If stat
is a svyvar
object, the estimates are elements of a matrix and the names are the row and column names pasted together with a colon.
Object of class svrepstat
or svystat
or svyvar
regTermTest
, svyglm
data(api) dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc) a <- svytotal(~api00+enroll+api99, dclus1) svycontrast(a, list(avg=c(0.5,0,0.5), diff=c(1,0,-1))) ## if contrast vectors have names, zeroes may be omitted svycontrast(a, list(avg=c(api00=0.5,api99=0.5), diff=c(api00=1,api99=-1))) ## nonlinear contrasts svycontrast(a, quote(api00/api99)) svyratio(~api00, ~api99, dclus1) ## Example: standardised skewness coefficient moments<-svymean(~I(api00^3)+I(api00^2)+I(api00), dclus1) svycontrast(moments, quote((`I(api00^3)`-3*`I(api00^2)`*`I(api00)`+ 3*`I(api00)`*`I(api00)`^2-`I(api00)`^3)/ (`I(api00^2)`-`I(api00)`^2)^1.5)) ## Example: geometric means ## using delta method meanlogs <- svymean(~log(api00)+log(api99), dclus1) svycontrast(meanlogs, list(api00=quote(exp(`log(api00)`)), api99=quote(exp(`log(api99)`)))) ## using delta method rclus1<-as.svrepdesign(dclus1) meanlogs <- svymean(~log(api00)+log(api99), rclus1) svycontrast(meanlogs, list(api00=quote(exp(`log(api00)`)), api99=quote(exp(`log(api99)`)))) ## why is add=TRUE useful? (totals<-svyby(~enroll,~stype,design=dclus1,svytotal,covmat=TRUE)) totals1<-svycontrast(totals, list(total=c(1,1,1)), add=TRUE) svycontrast(totals1, list(quote(E/total), quote(H/total), quote(M/total))) totals2<-svycontrast(totals, list(total=quote(E+H+M)), add=TRUE) all.equal(as.matrix(totals1),as.matrix(totals2)) ## more complicated svyby means <- svyby(~api00+api99, ~stype+sch.wide, design=dclus1, svymean,covmat=TRUE) svycontrast(means, quote(`E.No:api00`-`E.No:api99`)) svycontrast(means, quote(`E.No:api00`/`E.No:api99`)) ## transforming replicates meanlogs_r <- svymean(~log(api00)+log(api99), rclus1, return.replicates=TRUE) svycontrast(meanlogs_r, list(api00=quote(exp(`log(api00)`)), api99=quote(exp(`log(api99)`)))) ## converting covariances to correlations vmat <-svyvar(~api00+ell,dclus1) print(vmat,cov=TRUE) cov2cor(as.matrix(vmat))[1,2] svycontrast(vmat, quote(`api00:ell`/sqrt(`api00:api00`*`ell:ell`)))
Draws conditioned scatterplots ('Trellis' plots) of survey data using hexagonal binning or transparency.
svycoplot(formula, design, style = c("hexbin", "transparent"), basecol = "black", alpha = c(0, 0.8),hexscale=c("relative","absolute"), ...)
formula |
A graph formula suitable for |
design |
A survey design object |
style |
Hexagonal binning or transparent color? |
basecol |
The fully opaque 'base' color for creating transparent
colors. This may also be a function; see |
alpha |
Minimum and maximum opacity |
hexscale |
Scale hexagons separately for each panel (relative) or across all panels (absolute) |
... |
Other arguments passed to |
An object of class trellis
As with all 'Trellis' graphs, this function creates an object but does
not draw the graph. When used inside a function or non-interactively
you need to print()
the result to create the graph.
data(api) dclus2<-svydesign(id=~dnum+snum, weights=~pw, data=apiclus2, fpc=~fpc1+fpc2) svycoplot(api00~api99|sch.wide*comp.imp, design=dclus2, style="hexbin") svycoplot(api00~api99|sch.wide*comp.imp, design=dclus2, style="hexbin", hexscale="absolute") svycoplot(api00~api99|sch.wide, design=dclus2, style="trans") svycoplot(api00~meals|stype,design=dclus2, style="transparent", basecol=function(d) c("darkred","purple","forestgreen")[as.numeric(d$stype)], alpha=c(0,1))
Fit a proportional hazards model to data from a complex survey design.
svycoxph(formula, design,subset=NULL, rescale=TRUE, ...) ## S3 method for class 'svycoxph' predict(object, newdata, se=FALSE, type=c("lp", "risk", "terms","curve"),...) ## S3 method for class 'svycoxph' AIC(object, ..., k = 2)
formula |
Model formula. Any |
design |
|
subset |
Expression to select a subpopulation |
rescale |
Rescale weights to improve numerical stability |
object |
A |
newdata |
New data for prediction |
se |
Compute standard errors? This takes a lot of memory for
|
type |
"curve" does predicted survival curves. The other values
are passed to |
... |
For |
k |
The penalty per parameter that would be used under independent sampling: AIC has |
The main difference between the svycoxph
function and the robust=TRUE
option to coxph
in the
survival package is that this function accounts for the reduction in
variance from stratified sampling and the increase in variance from
having only a small number of clusters.
Note that strata
terms in the model formula describe subsets that
have a separate baseline hazard function and need not have anything to
do with the stratification of the sampling.
The AIC
method uses the same approach as AIC.svyglm
,
though the relevance of the criterion this optimises is a bit less clear
than for generalised linear models.
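For example (a sketch assuming the dpbc design constructed in the examples below):
## smaller values indicate the preferred model under this criterion
AIC(svycoxph(Surv(time,status>0)~log(bili), design=dpbc),
    svycoxph(Surv(time,status>0)~log(bili)+protime+albumin, design=dpbc))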
The standard errors for predicted survival curves are available only by linearization, not
by replicate weights (at the moment). Use
withReplicates
to get standard errors with replicate
weights. Predicted survival curves are not available for stratified
Cox models.
The standard errors use the delta-method approach of Williams (1995)
for the Nelson-Aalen estimator, modified to handle the Cox model
following Tsiatis (1981). The standard errors agree closely with
survfit.coxph
for independent sampling when the model fits
well, but are larger when the model fits poorly. I believe the
standard errors are equivalent to those of Lin (2000), but I don't
know of any implementation that would allow a check.
An object of class svycoxph
for svycoxph
, an object of
class svykm
or svykmlist
for predict(,type="curve")
.
The standard error calculation for survival curves uses memory proportional to the sample size times the square of the number of events.
Thomas Lumley
Binder DA. (1992) Fitting Cox's proportional hazards models from survey data. Biometrika 79: 139-147
Lin D-Y (2000) On fitting Cox's proportional hazards model to survey data. Biometrika 87: 37-47
Tsiatis AA (1981) A Large Sample Study of Cox's Regression Model. Annals of Statistics 9(1) 93-108
Williams RL (1995) "Product-Limit Survival Functions with Correlated Survival Times" Lifetime Data Analysis 1: 171–186
svykm
for estimation of Kaplan-Meier survival curves and
for methods that operate on survival curves.
regTermTest
for Wald and (Rao-Scott) likelihood ratio tests for one or more parameters.
## Somewhat unrealistic example of nonresponse bias. data(pbc, package="survival") pbc$randomized<-with(pbc, !is.na(trt) & trt>0) biasmodel<-glm(randomized~age*edema,data=pbc,family=binomial) pbc$randprob<-fitted(biasmodel) if (is.null(pbc$albumin)) pbc$albumin<-pbc$alb ##pre2.9.0 dpbc<-svydesign(id=~1, prob=~randprob, strata=~edema, data=subset(pbc,randomized)) rpbc<-as.svrepdesign(dpbc) (model<-svycoxph(Surv(time,status>0)~log(bili)+protime+albumin,design=dpbc)) svycoxph(Surv(time,status>0)~log(bili)+protime+albumin,design=rpbc) s<-predict(model,se=TRUE, type="curve", newdata=data.frame(bili=c(3,9), protime=c(10,10), albumin=c(3.5,3.5))) plot(s[[1]],ci=TRUE,col="sienna") lines(s[[2]], ci=TRUE,col="royalblue") quantile(s[[1]], ci=TRUE) confint(s[[2]], parm=365*(1:5))
Computes the sum of products needed for the variance of survey sample
estimators. svyCprod
is used for survey design objects from
before version 2.9, onestage
is called by svyrecvar
for post-2.9 design objects.
svyCprod(x, strata, psu, fpc, nPSU,certainty=NULL, postStrata=NULL, lonely.psu=getOption("survey.lonely.psu")) onestage(x, strata, clusters, nPSU, fpc, lonely.psu=getOption("survey.lonely.psu"),stage=0,cal)
x |
A vector or matrix |
strata |
A vector of stratum indicators (may be |
psu |
A vector of cluster indicators (may be |
clusters |
A vector of cluster indicators |
fpc |
A data frame ( |
nPSU |
Table ( |
certainty |
logical vector with stratum names as names. If
|
postStrata |
Post-stratification variables |
lonely.psu |
One of |
stage |
Used internally to track the depth of recursion |
cal |
Used to pass calibration information at stages below the population |
The observations for each cluster are summed, then centered within each stratum, and the outer product of the resulting row vector is taken for each cluster. These products are summed within strata, multiplied by a degrees-of-freedom correction and by a finite population correction (if supplied), and then added across strata.
If there are fewer clusters (PSUs) in a stratum than in the original
design, extra rows of zeroes are added to x to allow the correct
subpopulation variance to be computed.
See postStratify
for information about
post-stratification adjustments.
The variance formula gives 0/0 if a stratum contains only one sampling
unit. If the certainty
argument specifies that this is a PSU
sampled with probability 1 (a "certainty" PSU) then it does not
contribute to the variance (this is correct only when there is no
subsampling within the PSU – otherwise it should be defined as a
pseudo-stratum). If certainty
is FALSE
for
this stratum or is not supplied the result depends on lonely.psu
.
The options are "fail"
to give an error, "remove"
or
"certainty"
to give a variance contribution of 0 for the stratum,
"adjust"
to center the stratum at the grand mean rather than the
stratum mean, and "average"
to assign strata with one PSU the
average variance contribution from strata with more than one PSU. The
choice is controlled by setting options(survey.lonely.psu)
. If
this is not done the factory default is "fail"
. Using
"adjust"
is conservative, and it would often be better to combine
strata in some intelligent way. The properties of "average"
have
not been investigated thoroughly, but it may be useful when the lonely
PSUs are due to a few strata having PSUs missing completely at random.
The "remove"
and "certainty"
options give the same result,
but "certainty"
is intended for situations where there is only
one PSU in the population stratum, which is sampled with certainty (also
called ‘self-representing’ PSUs or strata). With "certainty"
no
warning is generated for strata with only one PSU. Ordinarily,
svydesign
will detect certainty PSUs, making this option
unnecessary.
For strata with a single PSU in a subset (domain) the variance formula
gives a value that is well-defined and positive, but not typically
correct. If options("survey.adjust.domain.lonely")
is TRUE
and options("survey.lonely.psu")
is "adjust"
or
"average"
, and no post-stratification or G-calibration has been
done, strata with a single PSU in a subset will be treated like those
with a single PSU in the sample. I am not aware of any theoretical
study of this procedure, but it should at least be conservative.
A covariance matrix
Thomas Lumley
Binder, David A. (1983). On the variances of asymptotically normal estimators from complex surveys. International Statistical Review, 51, 279- 292.
svydesign
, svyrecvar
, surveyoptions
, postStratify
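A minimal sketch (not from the package documentation) of how the lonely.psu options affect results; the single-PSU stratum is constructed artificially for illustration:
data(api)
## keep a single E school so stratum E has only one PSU
lone <- rbind(subset(apistrat, stype=="E")[1,], subset(apistrat, stype!="E"))
dlone <- svydesign(id=~1, strata=~stype, weights=~pw, data=lone)
old <- options(survey.lonely.psu="fail")
try(svymean(~api00, dlone))          # error: stratum E has a single PSU
options(survey.lonely.psu="adjust")  # center stratum E at the grand mean
svymean(~api00, dlone)
options(old)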
Compute Cronbach's alpha coefficient of reliability from survey data. The formula is equation (2) of Cronbach (1951) only with design-based estimates of the variances.
svycralpha(formula, design, na.rm = FALSE)
formula |
One-sided formula giving the variables that make up the total score |
design |
survey design object |
na.rm |
|
A number
Cronbach LJ (1951). "Coefficient alpha and the internal structure of tests". Psychometrika. 16 (3): 297-334. doi:10.1007/bf02310555.
data(api) dstrat<-svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat, fpc = ~fpc) svycralpha(~ell+mobility+avg.ed+emer+meals, dstrat)
Specify a complex survey design.
svydesign(ids, probs=NULL, strata = NULL, variables = NULL, fpc=NULL, data = NULL, nest = FALSE, check.strata = !nest, weights=NULL,pps=FALSE,...) ## Default S3 method: svydesign(ids, probs=NULL, strata = NULL, variables = NULL, fpc=NULL,data = NULL, nest = FALSE, check.strata = !nest, weights=NULL, pps=FALSE,calibrate.formula=NULL,variance=c("HT","YG"),...) ## S3 method for class 'imputationList' svydesign(ids, probs = NULL, strata = NULL, variables = NULL, fpc = NULL, data, nest = FALSE, check.strata = !nest, weights = NULL, pps=FALSE, calibrate.formula=NULL,...) ## S3 method for class 'character' svydesign(ids, probs = NULL, strata = NULL, variables = NULL, fpc = NULL, data, nest = FALSE, check.strata = !nest, weights = NULL, pps=FALSE, calibrate.formula=NULL,dbtype = "SQLite", dbname, ...)
ids |
Formula or data frame specifying cluster ids from largest
level to smallest level, |
probs |
Formula or data frame specifying cluster sampling probabilities |
strata |
Formula or vector specifying strata, use |
variables |
Formula or data frame specifying the variables
measured in the survey. If |
fpc |
Finite population correction: see Details below |
weights |
Formula or vector specifying sampling weights as an
alternative to |
data |
Data frame to look up variables in the formula
arguments, or database table name, or |
nest |
If |
check.strata |
If TRUE, check that clusters are nested in strata. |
pps |
|
calibrate.formula |
model formula specifying how the weights are *already* calibrated (raked, post-stratified). |
dbtype |
name of database driver to pass to |
dbname |
name of database (eg file name for SQLite) |
variance |
For |
... |
for future expansion |
The svydesign
object combines a data frame and all the survey
design information needed to analyse it. These objects are used by
the survey modelling and summary functions. The
id
argument is always required, the strata
,
fpc
, weights
and probs
arguments are
optional. If these variables are specified they must not have any
missing values.
By default, svydesign
assumes that all PSUs, even those in
different strata, have a unique value of the id
variable. This allows some data errors to be detected. If your PSUs
reuse the same identifiers across strata then set nest=TRUE
.
The finite population correction (fpc) is used to reduce the variance when a substantial fraction of the total population of interest has been sampled. It may not be appropriate if the target of inference is the process generating the data rather than the statistics of a particular finite population.
The finite population correction can be specified either as the total population size in each stratum or as the fraction of the total population that has been sampled. In either case the relevant population size is the sampling units. That is, sampling 100 units from a population stratum of size 500 can be specified as 500 or as 100/500=0.2. The exception is for PPS sampling without replacement, where the sampling probability (which will be different for each PSU) must be used.
If population sizes are specified but not sampling probabilities or weights, the sampling probabilities will be computed from the population sizes assuming simple random sampling within strata.
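As a small sketch with the apiclus1 data (15 of the 757 school districts were sampled, so the two specifications below are equivalent):
data(api)
## fpc as the population count of PSUs
d_count <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
## fpc as the sampled fraction: 15 clusters out of fpc (=757)
apiclus1$fpc_frac <- 15/apiclus1$fpc
d_frac <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc_frac)
all.equal(SE(svymean(~api00, d_count)), SE(svymean(~api00, d_frac)))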
For multistage sampling the id
argument should specify a
formula with the cluster identifiers at each stage. If subsequent
stages are stratified strata
should also be specified as a
formula with stratum identifiers at each stage. The population size
for each level of sampling should also be specified in fpc
.
If fpc
is not specified then sampling is assumed to be with
replacement at the top level and only the first stage of cluster is
used in computing variances. If fpc
is specified but for fewer
stages than id
, sampling is assumed to be complete for
subsequent stages. The variance calculations for
multistage sampling assume simple or stratified random sampling
within clusters at each stage except possibly the last.
For PPS sampling without replacement it is necessary to specify the
probabilities for each stage of sampling using the fpc
arguments, and an overall weight
argument should not be
given. At the moment, multistage or stratified PPS sampling without
replacement is supported only with pps="brewer"
, or by
giving the full joint probability matrix using
ppsmat
. [Cluster sampling is supported by all
methods, but not subsampling within clusters].
The dim
, "["
, "[<-"
and na.action methods for
survey.design
objects operate on the dataframe specified by
variables
and ensure that the design information is properly
updated to correspond to the new data frame. With the "[<-"
method the new value can be a survey.design
object instead of a
data frame, but only the data frame is used. See also
subset.survey.design
for a simple way to select
subpopulations.
The model.frame
method extracts the observed data.
If the strata with only one PSU are not self-representing (or they are,
but svydesign
cannot tell based on fpc
) then the handling
of these strata for variance computation is determined by
options("survey.lonely.psu")
. See svyCprod
for
details.
data
may be a character string giving the name of a table or view
in a relational database that can be accessed through the DBI
interfaces. For DBI interfaces dbtype
should be the name of the database
driver and dbname
should be the name by which the driver identifies
the specific database (eg file name for SQLite).
The appropriate database interface package must already be loaded (eg
RSQLite
for SQLite). The survey design
object will contain only the design meta-data, and actual variables will
be loaded from the database as needed. Use
close
to close the database connection and
open
to reopen the connection, eg, after
loading a saved object.
The database interface does not attempt to modify the underlying database and so can be used with read-only permissions on the database.
If data
is an imputationList
object (from the "mitools"
package), svydesign
will return a svyimputationList
object
containing a set of designs. Use with.svyimputationList
to
do analyses on these designs and MIcombine
to combine the results.
An object of class survey.design
.
Thomas Lumley
as.svrepdesign
for converting to replicate weight designs,
subset.survey.design
for domain estimates,
update.survey.design
to add variables.
mitools
package for using multiple imputations
svyrecvar
for details of variance estimation
election
for examples of PPS sampling without replacement.
data(api) # stratified sample dstrat<-svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc) # one-stage cluster sample dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc) # two-stage cluster sample: weights computed from population sizes. dclus2<-svydesign(id=~dnum+snum, fpc=~fpc1+fpc2, data=apiclus2) ## multistage sampling has no effect when fpc is not given, so ## these are equivalent. dclus2wr<-svydesign(id=~dnum+snum, weights=weights(dclus2), data=apiclus2) dclus2wr2<-svydesign(id=~dnum, weights=weights(dclus2), data=apiclus2) ## syntax for stratified cluster sample ##(though the data weren't really sampled this way) svydesign(id=~dnum, strata=~stype, weights=~pw, data=apistrat, nest=TRUE) ## PPS sampling without replacement data(election) dpps<- svydesign(id=~1, fpc=~p, data=election_pps, pps="brewer") ##database example: requires RSQLite ## Not run: library(RSQLite) dbclus1<-svydesign(id=~dnum, weights=~pw, fpc=~fpc, data="apiclus1",dbtype="SQLite", dbname=system.file("api.db",package="survey")) ## End(Not run) ## pre-calibrated weights cdclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc, calibrate.formula=~1)
This function fits a factor analysis model or SEM, by maximum weighted likelihood.
svyfactanal(formula, design, factors, n = c("none", "sample", "degf","effective", "min.effective"), ...)
formula |
Model formula specifying the variables to use |
design |
Survey design object |
factors |
Number of factors to estimate |
n |
Sample size to be used for testing: see below |
... |
Other arguments to pass to |
The population covariance matrix is estimated by svyvar
and passed to factanal.
Although fitting these models requires only the estimated covariance
matrix, inference requires a sample size. With n="sample"
, the sample size is taken to be
the number of observations; with n="degf"
, the survey degrees of
freedom as returned by degf
. Using "sample"
corresponds to standardizing weights to have mean 1, and is known to
result in anti-conservative tests.
The other two methods estimate an effective sample size for each
variable as the sample size where the standard error of a variance of a
Normal distribution would match the design-based standard error
estimated by svyvar
. With n="min.effective"
the
minimum sample size across the variables is used; with
n="effective"
the harmonic mean is used. For svyfactanal
the test of model adequacy is optional, and the default choice,
n="none"
, does not do the test.
An object of class factanal
.
The lavaan.survey
package fits structural equation models to complex samples using similar techniques.
data(api) dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc) svyfactanal(~api99+api00+hsg+meals+ell+emer, design=dclus1, factors=2) svyfactanal(~api99+api00+hsg+meals+ell+emer, design=dclus1, factors=2, n="effective") ## Population data for comparison factanal(~api99+api00+hsg+meals+ell+emer, data=apipop, factors=2)
Fit a generalised linear model to data from a complex survey design, with inverse-probability weighting and design-based standard errors.
## S3 method for class 'survey.design' svyglm(formula, design, subset=NULL, family=stats::gaussian(),start=NULL, rescale=TRUE, ..., deff=FALSE, influence=FALSE) ## S3 method for class 'svyrep.design' svyglm(formula, design, subset=NULL, family=stats::gaussian(),start=NULL, rescale=NULL, ..., rho=NULL, return.replicates=FALSE, na.action,multicore=getOption("survey.multicore")) ## S3 method for class 'svyglm' summary(object, correlation = FALSE, df.resid=NULL, ...) ## S3 method for class 'svyglm' predict(object,newdata=NULL,total=NULL, type=c("link","response","terms"), se.fit=(type != "terms"),vcov=FALSE,...) ## S3 method for class 'svrepglm' predict(object,newdata=NULL,total=NULL, type=c("link","response","terms"), se.fit=(type != "terms"),vcov=FALSE, return.replicates=!is.null(object$replicates),...)
formula |
Model formula |
design |
Survey design from |
subset |
Expression to select a subpopulation |
family |
|
start |
Starting values for the coefficients (needed for some uncommon link/family combinations) |
rescale |
Rescaling of weights, to improve numerical stability. The default
rescales weights to sum to the sample size. Use |
... |
Other arguments passed to |
rho |
For replicate BRR designs, to specify the parameter for
Fay's variance method, giving weights of |
return.replicates |
Return the replicates as the |
deff |
Estimate the design effects |
influence |
Return influence functions |
object |
A |
correlation |
Include the correlation matrix of parameters? |
na.action |
Handling of NAs |
multicore |
Use the |
df.resid |
Optional denominator degrees of freedom for Wald tests |
newdata |
new data frame for prediction |
total |
population size when predicting population total |
type |
linear predictor ( |
se.fit |
if |
vcov |
if |
For binomial and Poisson families use family=quasibinomial()
and family=quasipoisson()
to avoid a warning about non-integer
numbers of successes. The ‘quasi’ versions of the family objects give
the same point estimates and standard errors and do not give the
warning.
If df.resid
is not specified the df for the null model is
computed by degf
and the residual df computed by
subtraction. This is recommended by Korn and Graubard and is correct
for PSU-level covariates but is potentially very conservative for
individual-level covariates. To get tests based on a Normal distribution
use df.resid=Inf
, and to use number of PSUs-number of strata,
specify df.resid=degf(design)
.
Parallel processing with multicore=TRUE
is helpful only for
fairly large data sets and on computers with sufficient memory. It may
be incompatible with GUIs, although the Mac Aqua GUI appears to be safe.
predict
gives fitted values and sampling variability for specific new
values of covariates. When newdata
are the population mean it
gives the regression estimator of the mean, and when newdata
are
the population totals and total
is specified it gives the
regression estimator of the population total. Regression estimators of
mean and total can also be obtained with calibrate
.
When the model is not of full rank, so that some coefficients are
NA
, point predictions will be made by setting those coefficients
to zero. Standard error and variance estimates will be NA
.
svyglm
returns an object of class svyglm
. The
predict
method returns an object of class svystat
if
se.fit
is TRUE
, otherwise just a numeric vector
svyglm
always returns 'model-robust' standard errors; the
Horvitz-Thompson-type standard errors used everywhere in the survey
package are a generalisation of the model-robust 'sandwich' estimators.
In particular, a quasi-Poisson svyglm
will return correct
standard errors for relative risk regression models.
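For example (a sketch, not from the package examples), a relative risk regression for a binary outcome:
data(api)
dstrat <- svydesign(id=~1, strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
rr <- svyglm(as.numeric(sch.wide=="Yes") ~ ell + mobility, design=dstrat,
             family=quasipoisson(link="log"))
exp(coef(rr))  # risk ratios; summary(rr) gives design-based standard errors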
This function does not return the same standard error estimates for the regression estimator of population mean and total as some textbooks, or SAS. However, it does give the same standard error estimator as estimating the mean or total with calibrated weights.
In particular, under simple random sampling with or without replacement
there is a simple rescaling of the mean squared residual to estimate the
mean squared error of the regression estimator. The standard error
estimate produced by predict.svyglm
has very similar
(asymptotically identical) expected
value to the textbook estimate, and has the advantage of being
applicable when the supplied newdata
are not the population mean
of the predictors. The difference is small when the sample size is large, but can be
appreciable for small samples.
You can obtain the other standard error estimator by calling
predict.svyglm
with the covariates set to their estimated (rather
than true) population mean values.
Thomas Lumley
Lumley T, Scott A (2017) "Fitting Regression Models to Survey Data" Statistical Science 32: 265-278
glm
, which is used to do most of the work.
regTermTest
, for multiparameter tests
calibrate
, for an alternative way to specify regression
estimators of population totals or means
svyttest
for one-sample and two-sample t-tests.
data(api) dstrat<-svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc) dclus2<-svydesign(id=~dnum+snum, weights=~pw, data=apiclus2) rstrat<-as.svrepdesign(dstrat) rclus2<-as.svrepdesign(dclus2) summary(svyglm(api00~ell+meals+mobility, design=dstrat)) summary(svyglm(api00~ell+meals+mobility, design=dclus2)) summary(svyglm(api00~ell+meals+mobility, design=rstrat)) summary(svyglm(api00~ell+meals+mobility, design=rclus2)) ## use quasibinomial, quasipoisson to avoid warning messages summary(svyglm(sch.wide~ell+meals+mobility, design=dstrat, family=quasibinomial())) ## Compare regression and ratio estimation of totals api.ratio <- svyratio(~api.stu,~enroll, design=dstrat) pop<-data.frame(enroll=sum(apipop$enroll, na.rm=TRUE)) npop <- nrow(apipop) predict(api.ratio, pop$enroll) ## regression estimator is less efficient api.reg <- svyglm(api.stu~enroll, design=dstrat) predict(api.reg, newdata=pop, total=npop) ## same as calibration estimator svytotal(~api.stu, calibrate(dstrat, ~enroll, pop=c(npop, pop$enroll))) ## svyglm can also reproduce the ratio estimator api.reg2 <- svyglm(api.stu~enroll-1, design=dstrat, family=quasi(link="identity",var="mu")) predict(api.reg2, newdata=pop, total=npop) ## higher efficiency by modelling variance better api.reg3 <- svyglm(api.stu~enroll-1, design=dstrat, family=quasi(link="identity",var="mu^3")) predict(api.reg3, newdata=pop, total=npop) ## true value sum(apipop$api.stu)
A Rao-Scott-type version of the chi-squared test for goodness of fit to prespecified proportions. The test statistic is the chi-squared statistic applied to the estimated population table, and the reference distribution is a Satterthwaite approximation: the test statistic divided by the estimated scale is compared to a chi-squared distribution with the estimated df.
svygofchisq(formula, p, design, ...)
formula |
Formula specifying a single factor variable |
p |
Vector of probabilities for the categories of the factor, in the correct order (will be rescaled to sum to 1) |
design |
Survey design object |
... |
Other arguments to pass to |
An object of class htest
chisq.test
, svychisq
, pchisqsum
data(api) dclus2<-svydesign(id=~dnum+snum, fpc=~fpc1+fpc2, data=apiclus2) true_p <- table(apipop$stype) svygofchisq(~stype,dclus2,p=true_p) svygofchisq(~stype,dclus2,p=c(1/3,1/3,1/3))
Histograms and boxplots weighted by the sampling weights.
svyhist(formula, design, breaks = "Sturges", include.lowest = TRUE, right = TRUE, xlab = NULL, main = NULL, probability = TRUE, freq = !probability, ...) svyboxplot(formula, design, all.outliers=FALSE,...)
formula |
One-sided formula for |
design |
A survey design object |
xlab |
x-axis label |
main |
Main title |
probability , freq
|
Y-axis is probability density or frequency |
all.outliers |
Show all outliers in the boxplot, not just extremes |
breaks , include.lowest , right
|
As for |
... |
The histogram breakpoints are computed as if the sample were a simple random sample of the same size.
The grouping variable in svyboxplot, if present, must be a factor. The boxplot whiskers go to the maximum and minimum observations or to 1.5 interquartile ranges beyond the end of the box, whichever is closer. The maximum and minimum are plotted as outliers if they are beyond the ends of the whiskers, but other outlying points are not plotted unless all.outliers=TRUE. svyboxplot requires a two-sided formula; use variable~1 for a single boxplot.
As for hist, except that when probability=FALSE the return value includes a component count_scale giving a scale factor between density and counts, assuming equal bin widths (see the sketch after the examples below).
data(api)
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
    data = apistrat, fpc = ~fpc)
opar<-par(mfrow=c(1,3))
svyhist(~enroll, dstrat, main="Survey weighted", col="purple",
    ylim=c(0,1.3e-3))
hist(apistrat$enroll, main="Sample unweighted", col="purple", prob=TRUE,
    ylim=c(0,1.3e-3))
hist(apipop$enroll, main="Population", col="purple", prob=TRUE,
    ylim=c(0,1.3e-3))
par(mfrow=c(1,1))
svyboxplot(enroll~stype, dstrat, all.outliers=TRUE)
svyboxplot(enroll~1, dstrat)
par(opar)
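The count_scale component described above is not exercised in these examples; a minimal sketch, reusing the dstrat design just defined:

h <- svyhist(~enroll, dstrat, probability=FALSE)
## scale factor between density and counts (equal bin widths assumed)
h$count_scale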
Estimates regressions with endogenous covariates using two-stage least squares. The function uses ivreg from the AER package for the main computations, and follows the syntax of that function.
svyivreg(formula, design, ...)
formula |
formula specification(s) of the regression relationship and the instruments. See Details for details |
design |
A survey design object |
... |
For future expansion |
Regressors and instruments for svyivreg are specified in a formula with two parts on the right-hand side, e.g., y ~ x1 + x2 | z1 + z2 + z3, where x1 and x2 are the regressors and z1, z2, and z3 are the instruments. Note that exogenous regressors have to be included as instruments for themselves. For example, if there is one exogenous regressor ex and one endogenous regressor en with instrument in, the appropriate formula would be y ~ ex + en | ex + in. Equivalently, this can be specified as y ~ ex + en | . - en + in, i.e., by providing an update formula with a . in the second part of the formula.
An object of class svyivreg
https://notstatschat.rbind.io/2019/07/16/adding-new-functions-to-the-survey-package/
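The source page gives no worked example for svyivreg; the following is a minimal sketch on simulated data (all variable names are invented for illustration) and assumes the AER package is installed.

## simulated data with one exogenous and one endogenous regressor (illustrative)
set.seed(1)
n <- 500
z <- rnorm(n)                # instrument
u <- rnorm(n)                # unobserved confounder
en <- z + u + rnorm(n)       # endogenous regressor
ex <- rnorm(n)               # exogenous regressor
y <- en + ex + u + rnorm(n)
df <- data.frame(y, en, ex, z, w=1)
des <- svydesign(id=~1, weights=~w, data=df)
## the exogenous regressor appears on both sides of '|'
fit <- svyivreg(y ~ ex + en | ex + z, design=des)
coef(fit)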
Computes the unweighted kappa measure of agreement between two raters and the standard error. The measurements must both be factor variables in the survey design object.
svykappa(formula, design, ...)
formula |
one-sided formula giving two measurements |
design |
survey design object |
... |
passed to |
Object of class svystat
data(api)
dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
svykappa(~comp.imp+sch.wide, dclus1)

dclus1<-update(dclus1, stypecopy=stype)
svykappa(~stype+stypecopy, dclus1)

(kappas<-svyby(~comp.imp+sch.wide, ~stype, design=dclus1, svykappa,
    covmat=TRUE))
svycontrast(kappas, quote(E/H))
Estimates the survival function using a weighted Kaplan-Meier estimator.
svykm(formula, design, se=FALSE, ...)
## S3 method for class 'svykm'
plot(x, xlab="time", ylab="Proportion surviving",
  ylim=c(0,1), ci=NULL, lty=1, ...)
## S3 method for class 'svykm'
lines(x, xlab="time", type="s", ci=FALSE, lty=1, ...)
## S3 method for class 'svykmlist'
plot(x, pars=NULL, ci=FALSE, ...)
## S3 method for class 'svykm'
quantile(x, probs=c(0.75,0.5,0.25), ci=FALSE, level=0.95, ...)
## S3 method for class 'svykm'
confint(object, parm, level=0.95, ...)
formula |
Two-sided formula. The response variable should be a right-censored
|
design |
survey design object |
se |
Compute standard errors? This is slow for moderate to large data sets |
... |
in |
x |
a |
xlab , ylab , ylim , type
|
as for |
lty |
Line type, see |
ci |
Plot (or return, for |
pars |
A list of vectors of graphical parameters for the
separate curves in a |
object |
A |
parm |
vector of times to report confidence intervals |
level |
confidence level |
probs |
survival probabilities for computing survival quantiles
(note that these are the complement of the usual
|
When standard errors are computed, the survival curve is actually the Aalen (hazard-based) estimator rather than the Kaplan-Meier estimator.
The standard error computations use memory proportional to the sample size times the square of the number of events. This can be a lot.
In the case of equal-probability cluster sampling without replacement the computations are essentially the same as those of Williams (1995), and the same linearization strategy is used for other designs.
Confidence intervals are computed on the log(survival) scale, following the default in the survival package, which was based on simulations by Link (1984).
Confidence intervals for quantiles use Woodruff's method: the interval is the intersection of the horizontal line at the specified quantile with the pointwise confidence band around the survival curve.
For svykm, an object of class svykm for a single curve or svykmlist for multiple curves.
Link, C. L. (1984). Confidence intervals for the survival function using Cox's proportional hazards model with covariates. Biometrics 40, 601-610.
Williams RL (1995) "Product-Limit Survival Functions with Correlated Survival Times" Lifetime Data Analysis 1: 171–186
Woodruff RS (1952) Confidence intervals for medians and other position measures. JASA 57, 622-627.
predict.svycoxph for survival curves from a Cox model
data(pbc, package="survival")
pbc$randomized <- with(pbc, !is.na(trt) & trt>0)
biasmodel<-glm(randomized~age*edema, data=pbc)
pbc$randprob<-fitted(biasmodel)
dpbc<-svydesign(id=~1, prob=~randprob, strata=~edema,
    data=subset(pbc,randomized))

s1<-svykm(Surv(time,status>0)~1, design=dpbc)
s2<-svykm(Surv(time,status>0)~I(bili>6), design=dpbc)
plot(s1)
plot(s2)
plot(s2, lwd=2, pars=list(lty=c(1,2), col=c("purple","forestgreen")))
quantile(s1, probs=c(0.9,0.75,0.5,0.25,0.1))

s3<-svykm(Surv(time,status>0)~I(bili>6), design=dpbc, se=TRUE)
plot(s3[[2]], col="purple")
confint(s3[[2]], parm=365*(1:5))
quantile(s3[[1]], ci=TRUE)
Fit and compare hierarchical loglinear models for complex survey data.
svyloglin(formula, design, ...)
## S3 method for class 'svyloglin'
update(object, formula, ...)
## S3 method for class 'svyloglin'
anova(object, object1, ..., integrate=FALSE)
## S3 method for class 'anova.svyloglin'
print(x, pval=c("F","saddlepoint","lincom","chisq"), ...)
## S3 method for class 'svyloglin'
coef(object, ..., intercept=FALSE)
formula |
Model formula |
design |
survey design object |
object , object1
|
loglinear model from |
pval |
p-value approximation: see Details |
integrate |
Compute the exact asymptotic p-value (slow)? |
... |
not used |
intercept |
Report the intercept? |
x |
anova object |
The loglinear model is fitted to a multiway table with probabilities estimated by svymean and with the sample size equal to the observed sample size, treating the resulting table as if it came from iid multinomial sampling, as described by Rao and Scott. The variance-covariance matrix does not include the intercept term, and so by default neither does the coef method. A Newton-Raphson algorithm is used, rather than iterative proportional fitting, so starting values are not needed.
The anova method computes the quantities that would be the score (Pearson) and likelihood ratio chi-squared statistics if the data were an iid sample. It computes four p-values for each of these, based on the exact asymptotic distribution (see pchisqsum), a saddlepoint approximation to this distribution, a scaled chi-squared distribution, and a scaled F-distribution. When testing the two-way interaction model against the main-effects model in a two-way table, the score statistic and p-values match the Rao-Scott tests computed by svychisq.
The anova method can only compare two models if they are for exactly the same multiway table (same variables and same order). The update method will help with this. It is also much faster to use update than svyloglin for a large data set: its time complexity depends only on the size of the model, not on the size of the data set.
It is not possible to fit a model using a variable created inline, eg I(x<10), since the multiway table is based on all variables used in the formula (a workaround is sketched after the examples below).
Object of class "svyloglin"
Rao, JNK, Scott, AJ (1984) "On Chi-squared Tests For Multiway Contingency Tables with Proportions Estimated From Survey Data" Annals of Statistics 12:46-60.
data(api)
dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
a<-svyloglin(~stype+comp.imp, dclus1)
b<-update(a, ~.^2)
an<-anova(a,b)
an
print(an, pval="saddlepoint")

## Wald test
regTermTest(b, ~stype:comp.imp)

## linear-by-linear association
d<-update(a, ~.+as.numeric(stype):as.numeric(comp.imp))
an1<-anova(a,d)
an1
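As noted above, inline variables such as I(x<10) cannot be used directly; a minimal sketch of the workaround is to create the variable in the design object first with update() (the variable name lowmeals is invented for illustration):

data(api)
dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
## construct the derived factor in the design, then use it by name
dclus1<-update(dclus1, lowmeals=factor(meals<50))
svyloglin(~stype+lowmeals, dclus1)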
Computes a weighted version of the logrank test for comparing two or more
survival distributions. The generalization to complex samples is based
on the characterization of the logrank test as the score test in a Cox model.
Under simple random sampling with replacement, this function with rho=0 and gamma=0 is almost identical to the robust score test in the survival package. The rho=0 and gamma=0 version was proposed by Rader (2014).
svylogrank(formula, design, rho=0, gamma=0,
  method=c("small","large","score"), ...)
formula |
Model formula with a single predictor. The predictor must be a factor if it has more than two levels. |
design |
A survey design object |
rho , gamma
|
Coefficients for the Harrington/Fleming G-rho-gamma
tests. The default is the logrank test, |
method |
|
... |
for future expansion. |
A vector containing the z-statistic for comparing each level of the variable to the lowest, the chi-squared statistic for the logrank test, and the p-value.
Rader, Kevin Andrew. 2014. Methods for Analyzing Survival and Binary Data in Complex Surveys. Doctoral dissertation, Harvard University. http://nrs.harvard.edu/urn-3:HUL.InstRepos:12274283
library("survival") data(nwtco) ## stratified on case status dcchs<-twophase(id=list(~seqno,~seqno), strata=list(NULL,~rel), subset=~I(in.subcohort | rel), data=nwtco, method="simple") svylogrank(Surv(edrel,rel)~factor(stage),design=dcchs) data(pbc, package="survival") pbc$randomized <- with(pbc, !is.na(trt) & trt>0) biasmodel<-glm(randomized~age*edema,data=pbc) pbc$randprob<-fitted(biasmodel) dpbc<-svydesign(id=~1, prob=~randprob, strata=~edema, data=subset(pbc,randomized)) svylogrank(Surv(time,status==2)~trt,design=dpbc) svylogrank(Surv(time,status==2)~trt,design=dpbc,rho=1) rpbc<-as.svrepdesign(dpbc) svylogrank(Surv(time,status==2)~trt,design=rpbc)
library("survival") data(nwtco) ## stratified on case status dcchs<-twophase(id=list(~seqno,~seqno), strata=list(NULL,~rel), subset=~I(in.subcohort | rel), data=nwtco, method="simple") svylogrank(Surv(edrel,rel)~factor(stage),design=dcchs) data(pbc, package="survival") pbc$randomized <- with(pbc, !is.na(trt) & trt>0) biasmodel<-glm(randomized~age*edema,data=pbc) pbc$randprob<-fitted(biasmodel) dpbc<-svydesign(id=~1, prob=~randprob, strata=~edema, data=subset(pbc,randomized)) svylogrank(Surv(time,status==2)~trt,design=dpbc) svylogrank(Surv(time,status==2)~trt,design=dpbc,rho=1) rpbc<-as.svrepdesign(dpbc) svylogrank(Surv(time,status==2)~trt,design=rpbc)
Maximises a user-specified likelihood parametrised by multiple linear predictors to data from a complex sample survey and computes the sandwich variance estimator of the coefficients. Note that this function maximises an estimated population likelihood; it is not the sample MLE.
svymle(loglike, gradient = NULL, design, formulas, start = NULL,
  control = list(), na.action="na.fail", method=NULL,
  lower=NULL, upper=NULL, influence=FALSE, ...)
## S3 method for class 'svymle'
summary(object, stderr=c("robust", "model"), ...)
loglike |
vectorised loglikelihood function |
gradient |
Derivative of |
design |
a |
formulas |
A list of formulas specifying the variable and linear predictors: see Details below |
start |
Starting values for parameters |
control |
control options for the optimiser: see the help page for the optimiser you are using. |
lower , upper
|
Parameter bounds for |
influence |
Return the influence functions (primarily for svyby) |
na.action |
Handling of |
method |
|
... |
Arguments to |
object |
|
stderr |
Choice of standard error estimator. The default is a standard sandwich estimator. See Details below. |
Optimization is done by nlm by default or if method=="nlm". Otherwise optim is used and method specifies the method and control specifies control parameters.
The design object contains all the data and design information from the survey, so all the formulas refer to variables in this object. The formulas argument needs to specify the response variable and a linear predictor for each freely varying argument of loglike.

Consider for example the dnorm function, with arguments x, mean, sd and log, and suppose we want to estimate the mean of y as a linear function of a variable z, and to estimate a constant standard deviation. The log argument must be fixed at TRUE to get the loglikelihood. A formulas argument would be list(~y, mean=~z, sd=~1). Note that the data variable y must be the first argument to dnorm and the first formula, and that all the other formulas are labelled. It is also permitted to have the data variable as the left-hand side of one of the formulas: eg, list(mean=y~z, sd=~1).
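Both ways of specifying the response give the same fit; a minimal sketch with the api data used in the Examples below:

data(api)
dstrat<-svydesign(id=~1, strata=~stype, weight=~pw, fpc=~fpc, data=apistrat)
## response as the first, unlabelled formula
m_a <- svymle(loglike=dnorm, design=dstrat,
    formulas=list(~api00, mean=~api99+ell, sd=~1),
    start=list(c(80,1,0), c(20)), log=TRUE)
## equivalently, response as the left-hand side of a labelled formula
m_b <- svymle(loglike=dnorm, design=dstrat,
    formulas=list(mean=api00~api99+ell, sd=~1),
    start=list(c(80,1,0), c(20)), log=TRUE)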
The two optimisers from the minqa package do not require any derivatives to be specified, but they do assume that the function is smooth enough for a quadratic approximation, ie, that two derivatives exist.
The usual variance estimator for MLEs in a survey sample is a ‘sandwich’ variance that requires the score vector and the information matrix. It requires only sampling assumptions to be valid (though some model assumptions are required for it to be useful). This is the stderr="robust" option, which is available only when the gradient argument was specified.
If the model is correctly specified and the sampling is at random conditional on variables in the model, then standard errors based on just the information matrix will be approximately valid. In particular, for independent sampling where weights and strata depend on variables in the model, stderr="model" should work fairly well.
An object of class svymle
Thomas Lumley
data(api)
dstrat<-svydesign(id=~1, strata=~stype, weight=~pw, fpc=~fpc, data=apistrat)

## fit with glm
m0 <- svyglm(api00~api99+ell, family="gaussian", design=dstrat)

## fit as mle (without gradient)
m1 <- svymle(loglike=dnorm, gradient=NULL, design=dstrat,
    formulas=list(mean=api00~api99+ell, sd=~1),
    start=list(c(80,1,0), c(20)), log=TRUE)

## with gradient
gr<- function(x, mean, sd, log){
    dm<-2*(x - mean)/(2*sd^2)
    ds<-(x-mean)^2*(2*(2 * sd))/(2*sd^2)^2 - sqrt(2*pi)/(sd*sqrt(2*pi))
    cbind(dm, ds)
}
m2 <- svymle(loglike=dnorm, gradient=gr, design=dstrat,
    formulas=list(mean=api00~api99+ell, sd=~1),
    start=list(c(80,1,0), c(20)), log=TRUE, method="BFGS")

summary(m0)
summary(m1, stderr="model")
summary(m2)

## Using offsets
m3 <- svymle(loglike=dnorm, gradient=gr, design=dstrat,
    formulas=list(mean=api00~api99+offset(ell)+ell, sd=~1),
    start=list(c(80,1,0), c(20)), log=TRUE, method="BFGS")

## demonstrating multiple linear predictors
m3 <- svymle(loglike=dnorm, gradient=gr, design=dstrat,
    formulas=list(mean=api00~api99+offset(ell)+ell, sd=~stype),
    start=list(c(80,1,0), c(20,0,0)), log=TRUE, method="BFGS")

## More complicated censored lognormal data example
## showing that the response variable can be multivariate
data(pbc, package="survival")
pbc$randomized <- with(pbc, !is.na(trt) & trt>0)
biasmodel<-glm(randomized~age*edema, data=pbc)
pbc$randprob<-fitted(biasmodel)
dpbc<-svydesign(id=~1, prob=~randprob, strata=~edema,
    data=subset(pbc,randomized))

## censored logNormal likelihood
lcens<-function(x, mean, sd){
    ifelse(x[,2]==1,
        dnorm(log(x[,1]), mean, sd, log=TRUE),
        pnorm(log(x[,1]), mean, sd, log=TRUE, lower.tail=FALSE)
    )
}
gcens<- function(x, mean, sd){
    dz<- -dnorm(log(x[,1]), mean, sd)/pnorm(log(x[,1]), mean, sd,
        lower.tail=FALSE)
    dm<-ifelse(x[,2]==1,
        2*(log(x[,1]) - mean)/(2*sd^2),
        dz*-1/sd)
    ds<-ifelse(x[,2]==1,
        (log(x[,1])-mean)^2*(2*(2 * sd))/(2*sd^2)^2 - sqrt(2*pi)/(sd*sqrt(2*pi)),
        dz*-(log(x[,1])-mean)/(sd*sd))
    cbind(dm, ds)
}

m<-svymle(loglike=lcens, gradient=gcens, design=dpbc, method="newuoa",
    formulas=list(mean=I(cbind(time,status>0))~bili+protime+albumin, sd=~1),
    start=list(c(10,0,0,0), c(1)))
summary(m)

## the same model, but now specifying the lower bound of zero on the
## log standard deviation
mbox<-svymle(loglike=lcens, gradient=gcens, design=dpbc, method="bobyqa",
    formulas=list(mean=I(cbind(time,status>0))~bili+protime+albumin, sd=~1),
    lower=list(c(-Inf,-Inf,-Inf,-Inf), 0), upper=Inf,
    start=list(c(10,0,0,0), c(1)))

## The censored lognormal model is now available in svysurvreg()
summary(svysurvreg(Surv(time,status>0)~bili+protime+albumin,
    design=dpbc, dist="lognormal"))

## compare svymle scale value after log transformation
svycontrast(m, quote(log(`sd.(Intercept)`)))
Fits a nonlinear model by probability-weighted least squares. Uses nls to do the fitting, but estimates design-based standard errors with either linearisation or replicate weights. See nls for documentation of model specification and fitting.
svynls(formula, design, start, weights=NULL, ...)
formula |
Nonlinear model specified as a formula; see |
design |
Survey design object |
start |
starting values, passed to |
weights |
Non-sampling weights, eg precision weights to give more efficient estimation in the presence of heteroscedasticity. |
... |
Other arguments to |
Object of class svynls. The fitted nls object is included as the fit element.
svymle for maximum likelihood with linear predictors on one or more parameters
set.seed(2020-4-3)
x<-rep(seq(0,50,1), 10)
y<-((runif(1,10,20)*x)/(runif(1,0,10)+x))+rnorm(510,0,1)
pop_model<-nls(y~a*x/(b+x), start=c(a=15,b=5))

df<-data.frame(x=x, y=y)
df$p<-ifelse((y-fitted(pop_model))*(x-mean(x))>0, .4, .1)
df$strata<-ifelse(df$p==.4, "a", "b")

in_sample<-stratsample(df$strata, round(table(df$strata)*c(0.4,0.1)))
sdf<-df[in_sample,]
des<-svydesign(id=~1, strata=~strata, prob=~p, data=sdf)

pop_model
(biased_sample<-nls(y~a*x/(b+x), data=sdf, start=c(a=15,b=5)))
(corrected <- svynls(y~a*x/(b+x), design=des, start=c(a=15,b=5)))
Fits cumulative link models: proportional odds, probit, complementary log-log, and cauchit.
svyolr(formula, design, ...)
## S3 method for class 'survey.design2'
svyolr(formula, design, start, subset=NULL, ...,
  na.action = na.omit,
  method = c("logistic", "probit", "cloglog", "cauchit"))
## S3 method for class 'svyrep.design'
svyolr(formula, design, subset=NULL, ..., return.replicates=FALSE,
  multicore=getOption("survey.multicore"))
## S3 method for class 'svyolr'
predict(object, newdata, type = c("class", "probs"), ...)
formula |
Formula: the response must be a factor with at least three levels |
design |
survey design object |
subset |
subset of the design to use; |
... |
dots |
start |
Optional starting values for optimization |
na.action |
handling of missing values |
multicore |
Use |
method |
Link function |
return.replicates |
return the individual replicate-weight estimates |
object |
object of class |
newdata |
new data for predictions |
type |
return vector of most likely class or matrix of probabilities |
An object of class svyolr
The code is based closely on polr() from the MASS package of Venables and Ripley.
data(api)
dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
dclus1<-update(dclus1, mealcat=cut(meals, c(0,25,50,75,100)))

m<-svyolr(mealcat~avg.ed+mobility+stype, design=dclus1)
m

## Use regTermTest for testing multiple parameters
regTermTest(m, ~avg.ed+stype, method="LRT")

## predictions
summary(predict(m, newdata=apiclus2))
summary(predict(m, newdata=apiclus2, type="probs"))
Because observations in survey samples may represent very different numbers of units in the population, ordinary plots can be misleading. The svyplot function produces scatterplots adjusted in various ways for sampling weights.
svyplot(formula, design, ...)
## Default S3 method:
svyplot(formula, design,
  style = c("bubble", "hex", "grayhex", "subsample", "transparent"),
  sample.size = 500, subset = NULL, legend = 1, inches = 0.05,
  amount=NULL, basecol="black", alpha=c(0, 0.8), xbins=30, ...)
formula |
A model formula |
design |
A survey object (svydesign or svrepdesign) |
style |
See Details below |
sample.size |
For |
subset |
expression using variables in the design object |
legend |
For |
inches |
Scale for bubble plots |
amount |
list with |
basecol |
base color for transparent plots, or a function to compute the color (see below), or color for bubble plots |
alpha |
minimum and maximum opacity for transparent plots |
xbins |
Number of (x-axis) bins for hexagonal binning |
... |
Passed to |
Bubble plots are scatterplots with circles whose area is proportional to the sampling weight. The two "hex" styles produce hexagonal binning scatterplots, and require the hexbin package from Bioconductor. The "transparent" style plots points with opacity proportional to sampling weight.
The subsample method uses the sampling weights to create a sample from approximately the population distribution and passes this to plot.
Bubble plots are suited to small surveys, hexagonal binning and transparency to large surveys where plotting all the points would result in too much overlap.
basecol can be a function taking one data frame argument, which will be passed the data frame of variables from the survey object. This could be memory-intensive for large data sets.
None
Korn EL, Graubard BI (1998) "Scatterplots with Survey Data" The American Statistician 52: 58-69
Lumley T, Scott A (2017) "Fitting Regression Models to Survey Data" Statistical Science 32: 265-278
symbols for other options (such as colour) for bubble plots. svytable for plots of discrete data.
data(api)
dstrat<-svydesign(id=~1, strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
svyplot(api00~api99, design=dstrat, style="bubble")
svyplot(api00~api99, design=dstrat, style="transparent", pch=19)

## these two require the hexbin package
svyplot(api00~api99, design=dstrat, style="hex",
    xlab="1999 API", ylab="2000 API")
svyplot(api00~api99, design=dstrat, style="grayhex", legend=0)

dclus2<-svydesign(id=~dnum+snum, weights=~pw, data=apiclus2, fpc=~fpc1+fpc2)
svyplot(api00~api99, design=dclus2, style="subsample")
svyplot(api00~api99, design=dclus2, style="subsample",
    amount=list(x=25, y=25))

svyplot(api00~api99, design=dstrat,
    basecol=function(df){c("goldenrod","tomato","sienna")[as.numeric(df$stype)]},
    style="transparent", pch=19, alpha=c(0,1))
legend("topleft", col=c("goldenrod","tomato","sienna"), pch=19,
    legend=c("E","H","M"))

## For discrete data, estimate a population table and plot the table.
plot(svytable(~sch.wide+comp.imp+stype, design=dstrat))
fourfoldplot(svytable(~sch.wide+comp.imp+stype, design=dstrat, round=TRUE))

## To draw on a hexbin plot you need grid graphics, eg,
library(grid)
h<-svyplot(api00~api99, design=dstrat, style="hex",
    xlab="1999 API", ylab="2000 API")
s<-svysmooth(api00~api99, design=dstrat)
grid.polyline(s$api99$x, s$api99$y, vp=h$plot.vp@hexVp.on,
    default.units="native", gp=gpar(col="red", lwd=2))
Computes principal components using the sampling weights.
svyprcomp(formula, design, center = TRUE, scale. = FALSE, tol = NULL,
  scores = FALSE, ...)
## S3 method for class 'svyprcomp'
biplot(x, cols=c("black","darkred"), xlabs=NULL,
  weight=c("transparent","scaled","none"),
  max.alpha=0.5, max.cex=0.5, xlim=NULL, ylim=NULL, pc.biplot=FALSE,
  expand=1, xlab=NULL, ylab=NULL, arrow.len=0.1, ...)
formula |
model formula describing variables to be used |
design |
survey design object. |
center |
Center data before analysis? |
scale. |
Scale to unit variance before analysis? |
tol |
Tolerance for omitting components from the results; a proportion of the standard deviation of the first component. The default is to keep all components. |
scores |
Return scores on each component? These are needed for |
x |
A |
cols |
Base colors for observations and variables respectively |
xlabs |
Formula, or character vector, giving labels for each observation |
weight |
How to display the sampling weights: |
max.alpha |
Opacity for the largest sampling weight, or for all points if |
max.cex |
Character size (as a multiple of |
xlim , ylim , xlab , ylab
|
Graphical parameters |
expand , arrow.len
|
See |
pc.biplot |
See |
... |
Other arguments to |
svyprcomp returns an object of class svyprcomp, similar to class prcomp but including design information.
data(api)
dclus2<-svydesign(id=~dnum+snum, fpc=~fpc1+fpc2, data=apiclus2)

pc <- svyprcomp(~api99+api00+ell+hsg+meals+emer, design=dclus2,
    scale=TRUE, scores=TRUE)
pc
biplot(pc, xlabs=~dnum, weight="none")
biplot(pc, xlabs=~dnum, max.alpha=1)
biplot(pc, weight="scaled", max.cex=1.5, xlabs=~dnum)
Predictive marginal means for a generalised linear model, using the method of Korn and Graubard (1999) and matching the results of SUDAAN. The predictive marginal mean for one level of a factor is the probability-weighted average of the fitted values for the model on new data where all the observations are set to that level of the factor but have whatever values of adjustment variables they really have.
svypredmeans(adjustmodel, groupfactor, predictat=NULL)
adjustmodel |
A generalised linear model fit by |
groupfactor |
A one-sided formula specifying the factor for which predictive means are wanted. Can use, eg, |
predictat |
A vector of the values of |
An object of class svystat with the predictive marginal means and their covariance matrix.
It is possible to supply an adjustment model with only an intercept, but the results are then the same as svymean. It makes no sense to have a variable in the adjustment model that is part of the grouping factor; doing so will give an error message or NA.
Graubard B, Korn E (1999) "Predictive Margins with Survey Data" Biometrics 55:652-659
Bieler, Brown, Williams, & Brogan (2010) "Estimating Model-Adjusted Risks, Risk Differences, and Risk Ratios From Complex Survey Data" Am J Epi DOI: 10.1093/aje/kwp440
Worked example using National Health Interview Survey data: https://gist.github.com/tslumley/2e74cd0ac12a671d2724
data(nhanes)
nhanes_design <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR,
    nest=TRUE, data=nhanes)
agesexmodel<-svyglm(HI_CHOL~agecat+RIAGENDR, design=nhanes_design,
    family=quasibinomial)

## high cholesterol by race/ethnicity, adjusted for demographic differences
means<-svypredmeans(agesexmodel, ~factor(race))
means

## relative risks compared to non-Hispanic white
svycontrast(means, quote(`1`/`2`))
svycontrast(means, quote(`3`/`2`))

data(api)
dstrat<-svydesign(id=~1, strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
demog_model <- svyglm(api00~mobility+ell+hsg+meals, design=dstrat)
svypredmeans(demog_model, ~enroll, predictat=c(100,300,1000,3000))
Quantile-quantile plots either against a specified distribution function or comparing two variables from the same or different designs.
svyqqplot(formula, design, designx = NULL, na.rm = TRUE, qrule = "hf8",
  xlab = NULL, ylab = NULL, ...)
svyqqmath(x, design, null=qnorm, na.rm=TRUE,
  xlab="Expected", ylab="Observed", ...)
x , formula
|
A one-sided formula for |
design |
Survey design object to look up variables |
designx |
Survey design object to look up the RHS variable in |
null |
Quantile function to compare the data quantiles to |
na.rm |
Remove missing values |
qrule |
How to define quantiles for |
xlab , ylab
|
Passed to |
... |
Graphical options to be passed to |
None
data(api)
dstrat<-svydesign(id=~1, strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
svyqqmath(~api99, design=dstrat)
svyqqplot(api00~api99, design=dstrat)

dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
opar<-par(mfrow=c(1,2))
## sample distributions very different
qqplot(apiclus1$enroll, apistrat$enroll); abline(0,1)
## estimated population distributions much more similar
svyqqplot(enroll~enroll, design=dstrat, designx=dclus1,
    qrule=survey:::qrule_hf8); abline(0,1)
par(opar)
Design-based versions of k-sample rank tests. The built-in tests are all for location hypotheses, but the user could specify others.
svyranktest(formula, design,
  test = c("wilcoxon", "vanderWaerden", "median", "KruskalWallis"), ...)
formula |
Model formula |
design |
A survey design object |
test |
Which rank test to use: Wilcoxon, van der Waerden's normal-scores
test, Mood's test for the median, or a function |
... |
for future expansion |
These tests are for the null hypothesis that the population or superpopulation distributions of the response variable are the same between groups, against population or superpopulation alternatives. The 'ranks' are defined as quantiles of the pooled distribution of the variable, so they do not just go from 1 to N; the null hypothesis does not depend on the weights, but the ranks do.

The tests reduce to the usual Normal approximations to the standard rank tests under iid sampling. Unlike the traditional rank tests, they are not exact in small samples.
Object of class htest. Note that with more than two groups the statistic element of the return value holds the numerator degrees of freedom and the parameter element holds the test statistic.
Lumley, T., & Scott, A. J. (2013). Two-sample rank tests under complex sampling. BIOMETRIKA, 100 (4), 831-842.
data(api)
dclus1<-svydesign(id=~dnum, weights=~pw, fpc=~fpc, data=apiclus1)
svyranktest(ell~comp.imp, dclus1)
svyranktest(ell~comp.imp, dclus1, test="median")
svyranktest(ell~stype, dclus1)
svyranktest(ell~stype, dclus1, test="median")
str(svyranktest(ell~stype, dclus1))

## upper quartile
svyranktest(ell~comp.imp, dclus1, test=function(r,N) as.numeric(r>0.75*N))

quantiletest<-function(p){
    rval<-function(r,N) as.numeric(r>(N*p))
    attr(rval,"name")<-paste(p,"quantile")
    rval
}
svyranktest(ell~comp.imp, dclus1, test=quantiletest(0.5))
svyranktest(ell~comp.imp, dclus1, test=quantiletest(0.75))

## replicate weights
rclus1<-as.svrepdesign(dclus1)
svyranktest(ell~stype, rclus1)
Ratio estimation and estimates of totals based on ratios for complex survey samples. Estimating domain (subpopulation) means can be done more easily with svymean.
## S3 method for class 'survey.design2'
svyratio(numerator=formula, denominator, design, separate=FALSE,
  na.rm=FALSE, formula, covmat=FALSE, deff=FALSE, influence=FALSE, ...)
## S3 method for class 'svyrep.design'
svyratio(numerator=formula, denominator, design, na.rm=FALSE, formula,
  covmat=FALSE, return.replicates=FALSE, deff=FALSE, ...)
## S3 method for class 'twophase'
svyratio(numerator=formula, denominator, design, separate=FALSE,
  na.rm=FALSE, formula, ...)
## S3 method for class 'svyratio'
predict(object, total, se=TRUE, ...)
## S3 method for class 'svyratio_separate'
predict(object, total, se=TRUE, ...)
## S3 method for class 'svyratio'
SE(object, ..., drop=TRUE)
## S3 method for class 'svyratio'
coef(object, ..., drop=TRUE)
## S3 method for class 'svyratio'
confint(object, parm, level = 0.95, df = Inf, ...)
numerator , formula
|
formula, expression, or data frame giving numerator variable(s) |
denominator |
formula, expression, or data frame giving denominator variable(s) |
design |
survey design object |
object |
result of |
total |
vector of population totals for the denominator variables in
|
se |
Return standard errors? |
separate |
Estimate ratio separately for strata |
na.rm |
Remove missing values? |
covmat |
Compute the full variance-covariance matrix of the ratios |
deff |
Compute design effects |
return.replicates |
Return replicate estimates of ratios |
influence |
Return influence functions |
drop |
Return a vector rather than a matrix |
parm |
a specification of which parameters are to be given confidence intervals, either a vector of numbers or a vector of names. If missing, all parameters are considered. |
level |
the confidence level required. |
df |
degrees of freedom for t-distribution in confidence
interval, use |
... |
Other unused arguments for other methods |
The separate ratio estimate of a total is the sum of ratio estimates in each stratum. If the stratum totals supplied in the total argument and the strata in the design object both have names, these names will be matched. If they do not have names, it is important that the stratum totals are supplied in the correct order, the same order as shown in the output of summary(design).
When design is a two-phase design, stratification will be on the second phase.
svyratio returns an object of class svyratio. The predict method returns a matrix of population totals and optionally a matrix of standard errors.
Thomas Lumley
Levy and Lemeshow. "Sampling of Populations" (3rd edition). Wiley
svymean for estimating proportions and domain means. calibrate for estimators related to the separate ratio estimator.
data(scd)

## survey design objects
scddes<-svydesign(data=scd, prob=~1, id=~ambulance, strata=~ESA,
    nest=TRUE, fpc=rep(5,6))
scdnofpc<-svydesign(data=scd, prob=~1, id=~ambulance, strata=~ESA,
    nest=TRUE)
# convert to BRR replicate weights
scd2brr <- as.svrepdesign(scdnofpc, type="BRR")
# use BRR replicate weights from Levy and Lemeshow
repweights<-2*cbind(c(1,0,1,0,1,0), c(1,0,0,1,0,1), c(0,1,1,0,0,1),
    c(0,1,0,1,1,0))
scdrep<-svrepdesign(data=scd, type="BRR", repweights=repweights)

# ratio estimates
svyratio(~alive, ~arrests, design=scddes)
svyratio(~alive, ~arrests, design=scdnofpc)
svyratio(~alive, ~arrests, design=scd2brr)
svyratio(~alive, ~arrests, design=scdrep)

data(api)
dstrat<-svydesign(id=~1, strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)

## domain means are ratio estimates, but available directly
svyratio(~I(api.stu*(comp.imp=="Yes")), ~as.numeric(comp.imp=="Yes"), dstrat)
svymean(~api.stu, subset(dstrat, comp.imp=="Yes"))

## separate and combined ratio estimates of total
(sep<-svyratio(~api.stu, ~enroll, dstrat, separate=TRUE))
(com<-svyratio(~api.stu, ~enroll, dstrat))

stratum.totals<-list(E=1877350, H=1013824, M=920298)
predict(sep, total=stratum.totals)
predict(com, total=sum(unlist(stratum.totals)))

SE(com)
coef(com)
coef(com, drop=FALSE)
confint(com)
Compute the variance of a total under multistage sampling, using a recursive descent algorithm.
svyrecvar(x, clusters, stratas, fpcs, postStrata = NULL,
  lonely.psu = getOption("survey.lonely.psu"),
  one.stage = getOption("survey.ultimate.cluster"))
x |
Matrix of data or estimating functions |
clusters |
Data frame or matrix with cluster ids for each stage |
stratas |
Strata for each stage |
fpcs |
Information on population and sample size for each stage,
created by |
postStrata |
post-stratification information as created by
|
lonely.psu |
How to handle strata with a single PSU |
one.stage |
If |
The main use of this function is to compute the variance of the sum of a set of estimating functions under multistage sampling. The sampling is assumed to be simple or stratified random sampling within clusters at each stage except perhaps the last stage. The variance of a statistic is computed from the variance of estimating functions as described by Binder (1983).
Use one.stage=FALSE for compatibility with other software that does not perform multi-stage calculations, and set options(survey.ultimate.cluster=TRUE) to make this the default.
The idea of a recursive algorithm is due to Bellhouse (1985). Texts such as Cochran (1977) and Sarndal et al (1991) describe the decomposition of the variance into a single-stage between-cluster estimator and a within-cluster estimator, and this is applied recursively.
If one.stage is a positive integer, it specifies the number of stages of sampling to use in the recursive estimator.
If pps="brewer"
, standard errors are estimated using Brewer's
approximation for PPS without replacement, option 2 of those described
by Berger (2004). The fpc
argument must then be specified in
terms of sampling fractions, not population sizes (or omitted, but
then the pps
argument would have no effect and the
with-replacement standard errors would be correct).
A covariance matrix
A simple set of finite population corrections will only be exactly correct when each successive stage uses simple or stratified random sampling without replacement. A correction under general unequal probability sampling (eg PPS) would require joint inclusion probabilities (or, at least, sampling probabilities for units not included in the sample), information not generally available.
The quality of Brewer's approximation is excellent in Berger's simulations, but the accuracy may vary depending on the sampling algorithm used.
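A minimal sketch of a design that uses Brewer's approximation, assuming the election_pps data shipped with the package, whose column p holds the sampling probabilities:

data(election)
## fpc is given as sampling fractions, not population sizes
dpps_br <- svydesign(id=~1, fpc=~p, data=election_pps, pps="brewer")
svytotal(~Bush+Kerry+Nader, dpps_br)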
Bellhouse DR (1985) Computing Methods for Variance Estimation in Complex Surveys. Journal of Official Statistics, Vol. 1, No. 3.
Berger, Y.G. (2004), A Simple Variance Estimator for Unequal Probability Sampling Without Replacement. Journal of Applied Statistics, 31, 305-315.
Binder, David A. (1983). On the variances of asymptotically normal estimators from complex surveys. International Statistical Review, 51, 279-292.
Brewer KRW (2002) Combined Survey Sampling Inference (Weighing Basu's Elephants) [Chapter 9]
Cochran, W. (1977) Sampling Techniques. 3rd edition. Wiley.
Sarndal C-E, Swensson B, Wretman J (1991) Model Assisted Survey Sampling. Springer.
svrVar for replicate weight designs. svyCprod for a description of how variances are estimated at each stage.
data(mu284)
dmu284<-svydesign(id=~id1+id2, fpc=~n1+n2, data=mu284)
svytotal(~y1, dmu284)

data(api)
# two-stage cluster sample
dclus2<-svydesign(id=~dnum+snum, fpc=~fpc1+fpc2, data=apiclus2)
summary(dclus2)
svymean(~api00, dclus2)
svytotal(~enroll, dclus2, na.rm=TRUE)

# bootstrap for multistage sample
mrbclus2<-as.svrepdesign(dclus2, type="mrb", replicates=100)
svytotal(~enroll, mrbclus2, na.rm=TRUE)

# two-stage `with replacement'
dclus2wr<-svydesign(id=~dnum+snum, weights=~pw, data=apiclus2)
summary(dclus2wr)
svymean(~api00, dclus2wr)
svytotal(~enroll, dclus2wr, na.rm=TRUE)
Performs two versions of the efficient score test. These are the same for a single parameter. In the working score test, different parameters are weighted according to the inverse of the estimated population Fisher information. In the pseudoscore test, parameters are weighted according to the inverse of their estimated covariance matrix.
svyscoretest(model, drop.terms=NULL, add.terms=NULL,
  method=c("working","pseudoscore","individual"), ddf=NULL,
  lrt.approximation = "satterthwaite", ...)
## S3 method for class 'svyglm'
svyscoretest(model, drop.terms=NULL, add.terms=NULL,
  method=c("working","pseudoscore","individual"), ddf=NULL,
  lrt.approximation = "satterthwaite", fullrank=TRUE, ...)
model |
A model of a class having a |
drop.terms |
Model formula giving terms to remove from |
add.terms |
Model formula giving terms to add to |
method |
The type of score test to use. For a single parameter they are
equivalent. To report tests for each column separately use |
ddf |
denominator degrees of freedom for an F or linear combination
of F distributions. Use |
lrt.approximation |
For the working score, the method for computing/approximating the null
distribution: see |
fullrank |
If |
... |
for future expansion |
The working score test will be asymptotically equivalent to the Rao-Scott likelihood ratio test computed by regTermTest and anova.svyglm. The paper by Rao, Scott and Skinner calls this a "naive" score test. The null distribution is a linear combination of chi-squared (or F) variables.
The pseudoscore test will be asymptotically equivalent to the Wald test computed by regTermTest; it has a chi-squared (or F) null distribution.
If ddf is negative or zero, which can happen with large numbers of predictors and small numbers of PSUs, it will be changed to 1 with a warning.
For "pseudoscore" and "working" score methods, a named vector with the test statistic, degrees of freedom, and p-value. For "individual" an object of class "svystat"
Rao, J.N.K., Scott, A.J., & Skinner, C.J. (1998). Quasi-score tests with survey data. Statistica Sinica, 8(4), 1059-1070.
data(myco)
dmyco<-svydesign(id=~1, strata=~interaction(Age,leprosy), weights=~wt,
    data=myco)
m_full<-svyglm(leprosy~I((Age+7.5)^-2)+Scar, family=quasibinomial,
    design=dmyco)

svyscoretest(m_full, ~Scar)
svyscoretest(m_full, add.terms= ~I((Age+7.5)^-2):Scar)
svyscoretest(m_full, add.terms= ~factor(Age), method="pseudo")
svyscoretest(m_full, add.terms= ~factor(Age), method="individual",
    fullrank=FALSE)
svyscoretest(m_full, add.terms= ~factor(Age), method="individual")
Scatterplot smoothing and density estimation for probability-weighted data.
svysmooth(formula, design, ...)
## Default S3 method:
svysmooth(formula, design, method = c("locpoly", "quantreg"),
  bandwidth = NULL, quantile, df = 4, ...)
## S3 method for class 'svysmooth'
plot(x, which=NULL, type="l", xlabs=NULL, ylab=NULL, ...)
## S3 method for class 'svysmooth'
lines(x, which=NULL, ...)
make.panel.svysmooth(design, bandwidth=NULL)
formula |
One-sided formula for density estimation, two-sided for smoothing |
design |
Survey design object |
method |
local polynomial smoothing for the mean or regression splines for quantiles |
bandwidth |
Smoothing bandwidth for "locpoly" or |
quantile |
quantile to be estimated for "quantreg" |
df |
Degrees of freedom for "quantreg" |
which |
Which plots to show (default is all) |
type |
as for plot |
xlabs |
Optional vector of x-axis labels |
ylab |
Optional y-axis label |
... |
More arguments |
x |
Object of class svysmooth |
svysmooth
does one-dimensional smoothing. If formula
has
multiple predictor variables a separate one-dimensional smooth is
performed for each one.
For method="locpoly"
the extra arguments are passed to
locpoly
from the KernSmooth package, for
method="quantreg"
they are passed to rq
from the
quantreg package. The automatic choice of bandwidth for
method="locpoly"
uses the default settings for dpik
and
dpill
in the KernSmooth package.
make.panel.svysmooth()
makes a function that plots points and
draws a weighted smooth curve through them, a weighted replacement for
panel.smooth
that can be passed to functions such as
termplot
or plot.lm
. The resulting function has a span
argument that will set the bandwidth; if this is not specified the automatic choice will be used.
An object of class svysmooth
, a list of lists, each with x
and y
components.
svyhist
for histograms
data(api) dstrat<-svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc) smth<-svysmooth(api00~api99+ell,dstrat) dens<-svysmooth(~api99, dstrat,bandwidth=30) dens1<-svysmooth(~api99, dstrat) qsmth<-svysmooth(api00~ell,dstrat, quantile=0.75, df=3,method="quantreg") plot(smth) plot(smth, which="ell",lty=2,ylim=c(500,900)) lines(qsmth, col="red") svyhist(~api99,design=dstrat) lines(dens,col="purple",lwd=3) lines(dens1, col="forestgreen",lwd=2) m<-svyglm(api00~sin(api99/100)+stype, design=dstrat) termplot(m, data=model.frame(dstrat), partial.resid=TRUE, se=TRUE, smooth=make.panel.svysmooth(dstrat))
In health surveys it is often of interest to standardize domains to have the same distribution of, eg, age as in a target population. The operation is similar to post-stratification, except that the totals for the domains are fixed at the current estimates, not at known population values. This function matches the estimates produced by the (US) National Center for Health Statistics.
svystandardize(design, by, over, population, excluding.missing = NULL)
design |
survey design object |
by |
A one-sided formula specifying the variables whose distribution will be standardised |
over |
A one-sided formula specifying the domains within which the
standardisation will occur, or ~1 to treat the whole sample as a single domain |
population |
Desired population totals or proportions for the levels of combinations of variables in by |
excluding.missing |
Optionally, a one-sided formula specifying variables whose missing values should be dropped before calculating the domain totals. |
A new survey design object of the same type as the input.
The standard error estimates do not exactly match the NCHS estimates.
National Center for Health Statistics https://www.cdc.gov/nchs/tutorials/NHANES/NHANESAnalyses/agestandardization/age_standardization_intro.htm
## matches http://www.cdc.gov/nchs/data/databriefs/db92_fig1.png data(nhanes) popage <- c( 55901 , 77670 , 72816 , 45364 ) design<-svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, data=nhanes, nest=TRUE) stdes<-svystandardize(design, by=~agecat, over=~race+RIAGENDR, population=popage, excluding.missing=~HI_CHOL) svyby(~HI_CHOL, ~race+RIAGENDR, svymean, design=subset(stdes, agecat!="(0,19]")) data(nhanes) nhanes_design <- svydesign(ids = ~ SDMVPSU, strata = ~ SDMVSTRA, weights = ~ WTMEC2YR, nest = TRUE, data = nhanes) ## These are the same nhanes_adj <- svystandardize(update(nhanes_design, all_adults = "1"), by = ~ agecat, over = ~ all_adults, population = c(55901, 77670, 72816, 45364), excluding.missing = ~ HI_CHOL) svymean(~I(HI_CHOL == 1), nhanes_adj, na.rm = TRUE) nhanes_adj <- svystandardize(nhanes_design, by = ~ agecat, over = ~ 1, population = c(55901, 77670, 72816, 45364), excluding.missing = ~ HI_CHOL) svymean(~I(HI_CHOL == 1), nhanes_adj, na.rm = TRUE)
This function calls survreg
from the 'survival' package to fit accelerated failure (accelerated life) models to complex survey data, and then computes correct standard errors by linearisation. It has the same arguments as survreg
, except that the second argument is design
rather than data
.
## S3 method for class 'survey.design' svysurvreg(formula, design, weights=NULL, subset=NULL, ...)
formula |
Model formula |
design |
Survey design object, including two-phase designs |
weights |
Additional weights to multiply by the sampling weights. No, I don't know why you'd want to do that. |
subset |
subset to use in fitting (if needed) |
... |
Other arguments of survreg |
Object of class svysurvreg
, with the same structure as a survreg
object but with NA
for the loglikelihood.
The residuals
method is identical to that for survreg
objects except the weighted
option defaults to TRUE.
data(pbc, package="survival") pbc$randomized <- with(pbc, !is.na(trt) & trt>0) biasmodel<-glm(randomized~age*edema,data=pbc) pbc$randprob<-fitted(biasmodel) dpbc<-svydesign(id=~1, prob=~randprob, strata=~edema, data=subset(pbc,randomized)) model <- svysurvreg(Surv(time, status>0)~bili+protime+albumin, design=dpbc, dist="weibull") summary(model)
data(pbc, package="survival") pbc$randomized <- with(pbc, !is.na(trt) & trt>0) biasmodel<-glm(randomized~age*edema,data=pbc) pbc$randprob<-fitted(biasmodel) dpbc<-svydesign(id=~1, prob=~randprob, strata=~edema, data=subset(pbc,randomized)) model <- svysurvreg(Surv(time, status>0)~bili+protime+albumin, design=dpbc, dist="weibull") summary(model)
Contingency tables and chisquared tests of association for survey data.
## S3 method for class 'survey.design' svytable(formula, design, Ntotal = NULL, round = FALSE,...) ## S3 method for class 'svyrep.design' svytable(formula, design, Ntotal = sum(weights(design, "sampling")), round = FALSE,...) ## S3 method for class 'survey.design' svychisq(formula, design, statistic = c("F", "Chisq","Wald","adjWald","lincom", "saddlepoint","wls-score"),na.rm=TRUE,...) ## S3 method for class 'svyrep.design' svychisq(formula, design, statistic = c("F", "Chisq","Wald","adjWald","lincom", "saddlepoint","wls-score"),na.rm=TRUE,...) ## S3 method for class 'svytable' summary(object, statistic = c("F","Chisq","Wald","adjWald","lincom","saddlepoint"),...) degf(design, ...) ## S3 method for class 'survey.design2' degf(design, ...) ## S3 method for class 'svyrep.design' degf(design, tol=1e-5,...)
formula |
Model formula specifying margins for the table (using + only) |
design |
survey object |
statistic |
See Details below |
Ntotal |
A population total or set of population stratum totals to normalise to. |
round |
Should the table entries be rounded to the nearest integer? |
na.rm |
Remove missing values |
object |
Output from svytable |
... |
For |
tol |
Tolerance for computing the rank of the matrix of replicate weights |
The svytable
function computes a weighted crosstabulation. This
is especially useful for producing graphics. It is sometimes easier
to use svytotal
or svymean
, which also
produce standard errors, design effects, etc.
The frequencies in the table can be normalised to some convenient total
such as 100 or 1.0 by specifying the Ntotal
argument. If the
formula has a left-hand side the mean or sum of this variable rather
than the frequency is tabulated.
The Ntotal
argument can be either a single number or a data
frame whose first column gives the (first-stage) sampling strata and
second column the population size in each stratum. In this second case
the svytable
command performs ‘post-stratification’: tabulating
and scaling to the population within strata and then adding up the
strata.
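As a minimal sketch of these options, using the api data that also appears in the Examples below (the stratum counts 4421, 755, and 1018 are the apipop school counts used in the calibration examples elsewhere in this manual, and the column names in the Ntotal data frame are illustrative):

data(api)
dstrat <- svydesign(id=~1, strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
## scale the frequencies to percentages
svytable(~sch.wide+stype, dstrat, Ntotal=100)
## left-hand side: tabulate the estimated total of enroll rather than frequency
svytable(enroll~stype, dstrat)
## post-stratify: first column gives the strata, second the population sizes
svytable(~sch.wide+stype, dstrat,
    Ntotal=data.frame(stype=c("E","H","M"), Freq=c(4421,755,1018)))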
As with other xtabs
objects, the output of svytable
can be
processed by ftable
for more attractive display. The
summary
method for svytable
objects calls svychisq
for a test of independence.
svychisq
computes first and second-order Rao-Scott corrections to
the Pearson chisquared test, and two Wald-type tests.
The default (statistic="F"
) is the Rao-Scott second-order
correction. The p-values are computed with a Satterthwaite
approximation to the distribution and with denominator degrees of
freedom as recommended by Thomas and Rao (1987). The alternative
statistic="Chisq"
adjusts the Pearson chisquared statistic by a
design effect estimate and then compares it to the chisquared
distribution it would have under simple random sampling.
The statistic="Wald"
test is that proposed by Koch et al (1975)
and used by the SUDAAN software package. It is a Wald test based on the
differences between the observed cells counts and those expected under
independence. The adjustment given by statistic="adjWald"
reduces
the statistic when the number of PSUs is small compared to the number of
degrees of freedom of the test. Thomas and Rao (1987) compare these
tests and find the adjustment beneficial.
statistic="lincom"
replaces the numerator of the Rao-Scott F with
the exact asymptotic distribution, which is a linear combination of
chi-squared variables (see pchisqsum
), and
statistic="saddlepoint"
uses a saddlepoint approximation to this
distribution. The CompQuadForm
package is needed for
statistic="lincom"
but not for
statistic="saddlepoint"
. The saddlepoint approximation is
especially useful when the p-value is very small (as in large-scale
multiple testing problems).
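A brief sketch of these two options, reusing the dclus1 design constructed in the Examples below (statistic="lincom" additionally requires the CompQuadForm package to be installed):

data(api)
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
svychisq(~sch.wide+stype, dclus1, statistic="saddlepoint")
## svychisq(~sch.wide+stype, dclus1, statistic="lincom")  # needs CompQuadForm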
statistic="wls-score"
is an experimental implementation of the
weighted least squares score test of Lipsitz et al (2015). It is not
identical to that paper: for example, I think the denominator degrees
of freedom need to be reduced by JK for a JxK table, not by (J-1)(K-1). It
is also very close to the "adjWald" test.
For designs using replicate weights the code is essentially the same as
for designs with sampling structure, since the necessary variance
computations are done by the appropriate methods of
svytotal
and svymean
. The exception is that
the degrees of freedom is computed as one less than the rank of the
matrix of replicate weights (by degf
).
At the moment, svychisq
works only for 2-dimensional tables.
The table commands return an xtabs
object, svychisq
returns an htest
object.
Rao and Scott (1984) leave open one computational issue. In
computing ‘generalised design effects’ for these tests, should the
variance under simple random sampling be estimated using the observed
proportions or the predicted proportions under the null
hypothesis? svychisq
uses the observed proportions, following
simulations by Sribney (1998), and the choices made in Stata.
Davies RB (1973). "Numerical inversion of a characteristic function" Biometrika 60:415-7
P. Duchesne, P. Lafaye de Micheaux (2010) "Computing the distribution of quadratic forms: Further comparisons between the Liu-Tang-Zhang approximation and exact methods", Computational Statistics and Data Analysis, Volume 54, 858-862
Koch, GG, Freeman, DH, Freeman, JL (1975) "Strategies in the multivariate analysis of data from complex surveys" International Statistical Review 43: 59-78
Stuart R. Lipsitz, Garrett M. Fitzmaurice, Debajyoti Sinha, Nathanael Hevelone, Edward Giovannucci, and Jim C. Hu (2015) "Testing for independence in JxK contingency tables with complex sample survey data" Biometrics 71(3): 832-840
Rao, JNK, Scott, AJ (1984) "On Chi-squared Tests For Multiway Contingency Tables with Proportions Estimated From Survey Data" Annals of Statistics 12:46-60.
Sribney WM (1998) "Two-way contingency tables for survey or clustered data" Stata Technical Bulletin 45:33-49.
Thomas, DR, Rao, JNK (1987) "Small-sample comparison of level and power for simple goodness-of-fit statistics under cluster sampling" JASA 82:630-636
svytotal
and svymean
report totals
and proportions by category for factor variables.
See svyby
and ftable.svystat
to construct
more complex tables of summary statistics.
See svyloglin
for loglinear models.
See regTermTest
for Rao-Scott tests in regression models.
See https://notstatschat.rbind.io/2019/06/08/design-degrees-of-freedom-brief-note/ for an explanation of the design degrees of freedom with replicate weights.
data(api) xtabs(~sch.wide+stype, data=apipop) dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc) summary(dclus1) (tbl <- svytable(~sch.wide+stype, dclus1)) plot(tbl) fourfoldplot(svytable(~sch.wide+comp.imp+stype,design=dclus1,round=TRUE), conf.level=0) svychisq(~sch.wide+stype, dclus1) summary(tbl, statistic="Chisq") svychisq(~sch.wide+stype, dclus1, statistic="adjWald") rclus1 <- as.svrepdesign(dclus1) summary(svytable(~sch.wide+stype, rclus1)) svychisq(~sch.wide+stype, rclus1, statistic="adjWald")
One-sample or two-sample t-test. This function is a wrapper for
svymean
in the one-sample case and for
svyglm
in the two-sample case. Degrees of freedom are
degf(design)-1
for the one-sample test and degf(design)-2
for the two-sample case.
svyttest(formula, design, ...)
formula |
Formula: outcome~group for the two-sample test, outcome~0 or outcome~1 for the one-sample test |
design |
survey design object |
... |
for methods |
Object of class htest
data(api) dclus2<-svydesign(id=~dnum+snum, fpc=~fpc1+fpc2, data=apiclus2) tt<-svyttest(enroll~comp.imp, dclus2) tt confint(tt, level=0.9) svyttest(enroll~I(stype=="E"),dclus2) svyttest(I(api00-api99)~0, dclus2)
Trims very high or very low sampling weights to reduce the influence of outlying observations. In a replicate-weight design object, the replicate weights are also trimmed. The total amount trimmed is divided among the observations that were not trimmed, so that the total weight remains the same.
trimWeights(design, upper = Inf, lower = -Inf, ...) ## S3 method for class 'survey.design2' trimWeights(design, upper = Inf, lower = -Inf, strict=FALSE,...) ## S3 method for class 'svyrep.design' trimWeights(design, upper = Inf, lower = -Inf, strict=FALSE, compress=FALSE,...)
design |
A survey design object |
upper |
Upper bound for weights |
lower |
Lower bound for weights |
strict |
The reapportionment of the ‘trimmings’ from the weights can push
other weights over the limits. If strict=TRUE the trimming is repeated iteratively to prevent this |
compress |
Compress the replicate weights after trimming. |
... |
Other arguments for future expansion |
A new survey design object with trimmed weights.
calibrate
has a trim
option for trimming the
calibration adjustments.
data(api) dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc) pop.totals<-c(`(Intercept)`=6194, stypeH=755, stypeM=1018, api99=3914069) dclus1g<-calibrate(dclus1, ~stype+api99, pop.totals) summary(weights(dclus1g)) dclus1t<-trimWeights(dclus1g,lower=20, upper=45) summary(weights(dclus1t)) dclus1tt<-trimWeights(dclus1g, lower=20, upper=45,strict=TRUE) summary(weights(dclus1tt)) svymean(~api99+api00+stype, dclus1g) svymean(~api99+api00+stype, dclus1t) svymean(~api99+api00+stype, dclus1tt)
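Continuing the example above, a quick sketch of the two properties described here: trimming redistributes rather than discards weight, and without strict=TRUE the redistribution can push weights slightly past the bounds:

all.equal(sum(weights(dclus1g)), sum(weights(dclus1t)))  ## total weight unchanged
range(weights(dclus1t))   ## may slightly exceed [20, 45] after redistribution
range(weights(dclus1tt))  ## strict=TRUE enforces the bounds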
In a two-phase design a sample is taken from a population and a subsample taken from the sample, typically stratified by variables not known for the whole population. The second phase can use any design supported for single-phase sampling. The first phase must currently be one-stage element or cluster sampling.
twophase(id, strata = NULL, probs = NULL, weights = NULL, fpc = NULL, subset, data, method=c("full","approx","simple"), pps=NULL) twophasevar(x,design) twophase2var(x,design)
id |
list of two formulas for sampling unit identifiers |
strata |
list of two formulas (or NULLs) for stratum identifiers |
probs |
list of two formulas (or NULLs) for sampling probabilities |
weights |
Only for |
fpc |
list of two formulas (or NULLs) for finite population corrections |
subset |
formula specifying which observations are selected in phase 2 |
data |
Data frame with all data for phases 1 and 2 |
method |
|
pps |
With |
x |
probability-weighted estimating functions |
design |
two-phase design |
The population for the second phase is the first-phase sample. If the
second phase sample uses stratified (multistage cluster) sampling
without replacement and all the stratum and sampling unit identifier
variables are available for the whole first-phase sample it is
possible to estimate the sampling probabilities/weights and the
finite population correction. These would then be specified as
NULL
.
Two-phase case-control and case-cohort studies in biostatistics will typically have simple random sampling with replacement as the first stage. Variances given here may differ slightly from those in the biostatistics literature where a model-based estimator of the first-stage variance would typically be used.
Variance computations are based on the conditioning argument in
Section 9.3 of Sarndal et al. Method "full"
corresponds exactly
to the formulas in that reference. Method "simple"
or
"approx"
(the two are the same) uses less time and memory but
is exact only for some special cases. The most important special case
is the two-phase epidemiologic design where phase 1 is simple random
sampling from an infinite population and phase 2 is stratified random
sampling. See the tests
directory for a worked example. The
only disadvantage of method="simple" in these cases is that
standardization of margins (marginpred
) is not available.
For method="full"
, genuine sampling probabilities must be available for
each stage of sampling, within each phase. For multistage sampling
this requires specifying either fpc
or probs
as a
formula with a term for each stage of sampling. If no fpc
or
probs
are specified at phase 1 it is treated as simple random
sampling from an infinite population, and population totals will not
be correctly estimated, but means, quantiles, and regression models
will be correct.
The pps
argument allows for PPS sampling at phase two (or
eventually at phase one), and also for Poisson sampling at phase two
as a model for non-response.
twophase
returns an object of class twophase2
(for
method="full"
) or twophase
. The structure of
twophase2
objects may change as unnecessary components are removed.
twophase2var
and twophasevar
return a variance matrix with an attribute
containing the separate phase 1 and phase 2 contributions to the variance.
Sarndal CE, Swensson B, Wretman J (1992) "Model Assisted Survey Sampling" Springer.
Breslow NE and Chatterjee N, Design and analysis of two-phase studies with binary outcome applied to Wilms tumour prognosis. "Applied Statistics" 48:457-68, 1999
Breslow N, Lumley T, Ballantyne CM, Chambless LE, Kulich M. (2009) Improved Horvitz-Thompson estimation of model parameters from two-phase stratified samples: applications in epidemiology. Statistics in Biosciences. doi 10.1007/s12561-009-9001-6
Lin, DY and Ying, Z (1993). Cox regression with incomplete covariate measurements. "Journal of the American Statistical Association" 88: 1341-1349.
svydesign
, svyrecvar
for multistage sampling
calibrate
for calibration (GREG) estimators.
estWeights
for two-phase designs for missing data.
The "epi" and "phase1" vignettes for examples and technical details.
## two-phase simple random sampling. data(pbc, package="survival") pbc$randomized<-with(pbc, !is.na(trt) & trt>0) pbc$id<-1:nrow(pbc) d2pbc<-twophase(id=list(~id,~id), data=pbc, subset=~randomized) svymean(~bili, d2pbc) ## two-stage sampling as two-phase data(mu284) ii<-with(mu284, c(1:15, rep(1:5,n2[1:5]-3))) mu284.1<-mu284[ii,] mu284.1$id<-1:nrow(mu284.1) mu284.1$sub<-rep(c(TRUE,FALSE),c(15,34-15)) dmu284<-svydesign(id=~id1+id2,fpc=~n1+n2, data=mu284) ## first phase cluster sample, second phase stratified within cluster d2mu284<-twophase(id=list(~id1,~id),strata=list(NULL,~id1), fpc=list(~n1,NULL),data=mu284.1,subset=~sub) svytotal(~y1, dmu284) svytotal(~y1, d2mu284) svymean(~y1, dmu284) svymean(~y1, d2mu284) ## case-cohort design: this example requires R 2.2.0 or later library("survival") data(nwtco) ## stratified on case status dcchs<-twophase(id=list(~seqno,~seqno), strata=list(NULL,~rel), subset=~I(in.subcohort | rel), data=nwtco) svycoxph(Surv(edrel,rel)~factor(stage)+factor(histol)+I(age/12), design=dcchs) ## Using survival::cch subcoh <- nwtco$in.subcohort selccoh <- with(nwtco, rel==1|subcoh==1) ccoh.data <- nwtco[selccoh,] ccoh.data$subcohort <- subcoh[selccoh] cch(Surv(edrel, rel) ~ factor(stage) + factor(histol) + I(age/12), data =ccoh.data, subcoh = ~subcohort, id=~seqno, cohort.size=4028, method="LinYing") ## two-phase case-control ## Similar to Breslow & Chatterjee, Applied Statistics (1999) but with ## a slightly different version of the data set nwtco$incc2<-as.logical(with(nwtco, ifelse(rel | instit==2,1,rbinom(nrow(nwtco),1,.1)))) dccs2<-twophase(id=list(~seqno,~seqno),strata=list(NULL,~interaction(rel,instit)), data=nwtco, subset=~incc2) dccs8<-twophase(id=list(~seqno,~seqno),strata=list(NULL,~interaction(rel,stage,instit)), data=nwtco, subset=~incc2) summary(glm(rel~factor(stage)*factor(histol),data=nwtco,family=binomial())) summary(svyglm(rel~factor(stage)*factor(histol),design=dccs2,family=quasibinomial())) summary(svyglm(rel~factor(stage)*factor(histol),design=dccs8,family=quasibinomial())) ## Stratification on stage is really post-stratification, so we should use calibrate() gccs8<-calibrate(dccs2, phase=2, formula=~interaction(rel,stage,instit)) summary(svyglm(rel~factor(stage)*factor(histol),design=gccs8,family=quasibinomial())) ## For this saturated model calibration is equivalent to estimating weights. pccs8<-calibrate(dccs2, phase=2,formula=~interaction(rel,stage,instit), calfun="rrz") summary(svyglm(rel~factor(stage)*factor(histol),design=pccs8,family=quasibinomial())) ## Since sampling is SRS at phase 1 and stratified RS at phase 2, we ## can use method="simple" to save memory. dccs8_simple<-twophase(id=list(~seqno,~seqno),strata=list(NULL,~interaction(rel,stage,instit)), data=nwtco, subset=~incc2,method="simple") summary(svyglm(rel~factor(stage)*factor(histol),design=dccs8_simple,family=quasibinomial()))
Update the data variables in a survey design, either with a formula for a new set of variables or with an expression for variables to be added.
## S3 method for class 'survey.design' update(object, ...) ## S3 method for class 'twophase' update(object, ...) ## S3 method for class 'svyrep.design' update(object, ...) ## S3 method for class 'DBIsvydesign' update(object, ...)
object |
a survey design object |
... |
Arguments of the form name=expression defining new variables; each expression is evaluated in the design's data |
Database-backed objects may not have write access to the database and so
update
does not attempt to modify the database. The expressions
are stored and are evaluated when the data is loaded.
If a set of new variables will be used extensively it may be more efficient to modify the database, either with SQL queries from the R interface or separately. One useful intermediate approach is to create a table with the new variables and a view that joins this table to the table of existing variables.
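A hypothetical sketch of the table-plus-view approach using the DBI package with a SQLite backend; the table and column names (survey_data, newvars, row_id, apidiff) are illustrative only and not part of the survey package:

library(DBI)
con <- dbConnect(RSQLite::SQLite(), "survey.db")
## materialise the expensive new variables once
dbExecute(con, "CREATE TABLE newvars AS
    SELECT row_id, api00 - api99 AS apidiff FROM survey_data")
## a view joining new and existing variables, usable as the design's data table
dbExecute(con, "CREATE VIEW survey_full AS
    SELECT * FROM survey_data JOIN newvars USING (row_id)")
dbDisconnect(con)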
There is now a base-R function transform
for adding new
variables to a data frame, so I have added transform
as a synonym for
update
for survey objects.
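So the update() call in the example below could equally be written with transform:

data(api)
dstrat <- svydesign(id=~1, strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
dstrat <- transform(dstrat, apidiff = api00 - api99)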
A survey design object
svydesign
, svrepdesign
, twophase
data(api) dstrat<-svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc) dstrat<-update(dstrat, apidiff=api00-api99) svymean(~api99+api00+apidiff, dstrat)
Extract weights from a survey design object.
## S3 method for class 'survey.design' weights(object, ...) ## S3 method for class 'svyrep.design' weights(object, type=c("replication","sampling","analysis"), ...) ## S3 method for class 'survey_fpc' weights(object,final=TRUE,...)
object |
Survey design object |
type |
Type of weights: "replication", "sampling", or "analysis" |
final |
If |
... |
Other arguments ignored |
vector or matrix of weights
svydesign
, svrepdesign
,
as.fpc
data(scd) scddes<-svydesign(data=scd, prob=~1, id=~ambulance, strata=~ESA, nest=TRUE, fpc=rep(5,6)) repweights<-2*cbind(c(1,0,1,0,1,0), c(1,0,0,1,0,1), c(0,1,1,0,0,1), c(0,1,0,1,1,0)) scdrep<-svrepdesign(data=scd, type="BRR", repweights=repweights) weights(scdrep) weights(scdrep, type="sampling") weights(scdrep, type="analysis") weights(scddes)
Performs a survey analysis on each of the designs in a
svyimputationList
objects and returns a list of results suitable
for MIcombine
. The analysis may be specified as an expression or
as a function.
## S3 method for class 'svyimputationList' with(data, expr, fun, ...,multicore=getOption("survey.multicore")) ## S3 method for class 'svyimputationList' subset(x, subset,...)
data , x
|
A svyimputationList object |
expr |
An expression giving a survey analysis |
fun |
A function taking a survey design object as its argument |
... |
for future expansion |
multicore |
Use the multicore package to compute the results in parallel |
subset |
A logical expression specifying the subset |
A list of the results from applying the analysis to each design object.
MIcombine
, in the mitools
package
library(mitools) data.dir<-system.file("dta",package="mitools") files.men<-list.files(data.dir,pattern="m.\\.dta$",full=TRUE) men<-imputationList(lapply(files.men, foreign::read.dta, warn.missing.labels=FALSE)) files.women<-list.files(data.dir,pattern="f.\\.dta$",full=TRUE) women<-imputationList(lapply(files.women, foreign::read.dta, warn.missing.labels=FALSE)) men<-update(men, sex=1) women<-update(women,sex=0) all<-rbind(men,women) designs<-svydesign(id=~id, strata=~sex, data=all) designs results<-with(designs, svymean(~drkfre)) MIcombine(results) summary(MIcombine(results)) repdesigns<-as.svrepdesign(designs, type="boot", replicates=50) MIcombine(with(repdesigns, svymean(~drkfre)))
In each set of replicate weights there will be some clusters that have essentially zero weight. These are used as the test set, with the other clusters used as the training set. Jackknife weights ("JK1","JKn") are very similar to cross-validation at the cluster level; bootstrap weights are similar to bootstrapping for cross-validation.
withCrossval(design, formula, trainfun, testfun, loss = c("MSE", "entropy", "AbsError"), intercept, tuning, nearly_zero=1e-4,...)
design |
A survey design object (currently only replicate-weight designs are supported) |
formula |
Model formula where the left-hand side specifies the outcome variable and the right-hand side specifies the variables that will be used for prediction |
trainfun |
Function taking a predictor matrix X, a response vector y, a weight vector w, and a tuning parameter, and returning a fitted-model object |
testfun |
Function taking a predictor matrix X, the object returned by trainfun, and a tuning parameter, and returning test-set predictions |
loss |
Loss function for assessing prediction |
intercept |
Should the predictor matrix have an intercept added? |
tuning |
vector of tuning parameters, such as the regularisation
parameter in information criteria or the number of predictors. |
nearly_zero |
test-set threshold on the scale of replicate weight divided by sampling weight. |
... |
future expansion |
A numeric vector giving the estimated cross-validated loss for each value of tuning.
Iparragirre, A., Lumley, T., Barrio, I., & Arostegui, I. (2023). Variable selection with LASSO regression for complex survey data. Stat, 12(1), e578.
data(api) rclus1<-as.svrepdesign(svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)) withCrossval(rclus1, api00~api99+ell+stype, trainfun=function(X,y,w,tuning) lm.wfit(X,y,w), testfun=function(X, trainfit,tuning) X%*%coef(trainfit), intercept=TRUE,loss="MSE",tuning=1) ## More realistic example using lasso ## tuning parameter is number of variables in model ## ## library(glmnet) ## ftrain=function(X,y,w,tuning) { ## m<-glmnet(X,y,weights=w) ## lambda<-m$lambda[min(which(m$df>=tuning))] ## list(m,lambda) ## } ## ftest=function(X, trainfit, tuning){ ## predict(trainfit[[1]], newx=X, s=trainfit[[2]]) ## } ## ## withCrossval(rclus1, api00~api99+ell+stype+mobility+enroll, ## trainfun=ftrain, ## testfun=ftest, ## intercept=FALSE,loss="MSE", ## tuning=0:3) ## ## [1] 11445.2379 9649.1150 800.0742 787.4171 ## ## Models with two or three predictors are about equally good
Repeats an analysis for each of a set of 'plausible values' in a survey data set, returning a list suitable for mitools::MIcombine
. The default method works for both standard and replicate-weight designs but not for two-phase designs.
## S3 method for class 'survey.design' withPV(mapping, data, action, rewrite=TRUE, ...)
mapping |
A formula or list of formulas describing each variable in the analysis that has plausible values. The left-hand side of the formula is the name to use in the analysis; the right-hand side gives the names in the dataset. |
data |
A survey design object, as created by svydesign |
action |
With |
rewrite |
Rewrite |
... |
For methods |
A list of the results returned by each evaluation of action
, with the call as an attribute.
if(require(mitools)){ data(pisamaths, package="mitools") des<-svydesign(id=~SCHOOLID+STIDSTD, strata=~STRATUM, nest=TRUE, weights=~W_FSCHWT+condwt, data=pisamaths) oo<-options(survey.lonely.psu="remove") results<-withPV(list(maths~PV1MATH+PV2MATH+PV3MATH+PV4MATH+PV5MATH), data=des, action=quote(svyglm(maths~ST04Q01*(PCGIRLS+SMRATIO)+MATHEFF+OPENPS, design=des)), rewrite=TRUE) summary(MIcombine(results)) options(oo) }
Given a function or expression computing a statistic based on sampling
weights, withReplicates
evaluates the statistic and produces a
replicate-based estimate of variance. vcov.svrep.design
produces
the variance estimate from a set of replicates and the design object.
withReplicates(design, theta,..., return.replicates=FALSE) ## S3 method for class 'svyrep.design' withReplicates(design, theta, rho = NULL, ..., scale.weights=FALSE, return.replicates=FALSE) ## S3 method for class 'svrepvar' withReplicates(design, theta, ..., return.replicates=FALSE) ## S3 method for class 'svrepstat' withReplicates(design, theta, ..., return.replicates=FALSE) ## S3 method for class 'svyimputationList' withReplicates(design, theta, ..., return.replicates=FALSE) ## S3 method for class 'svyrep.design' vcov(object, replicates, centre,...)
design |
A survey design with replicate weights (eg from svrepdesign or as.svrepdesign) |
theta |
A function or expression: see Details below |
rho |
If |
... |
Other arguments to theta |
scale.weights |
Divide the probability weights by their sum (can help with overflow problems) |
return.replicates |
Return the replicate estimates as well as the variance? |
object |
The replicate-weights design object used to create the replicates |
replicates |
A set of replicates |
centre |
The centering value for variance calculation. If
|
The method for svyrep.design
objects evaluates a function or
expression using the sampling weights and then each set of replicate
weights. The method for svrepvar
objects evaluates the function
or expression on an estimated population covariance matrix and its
replicates, to simplify multivariate statistics such as structural
equation models.
For the svyrep.design
method, if theta
is a function its first argument will be a vector of
weights and the second argument will be a data frame containing the
variables from the design object. If it is an expression, the sampling weights will be available as the
variable .weights
. Variables in the design object will also
be in scope. It is possible to use global variables in the
expression, but unwise, as they may be masked by local variables
inside withReplicates
.
For the svrepvar
method a function will get the covariance
matrix as its first argument, and an expression will be evaluated with
.replicate
set to the variance matrix.
For the svrepstat
method a function will get the point estimate, and an expression will be evaluated with
.replicate
set to each replicate. The method can only be used
when the svrepstat
object includes replicates.
The svyimputationList
method runs withReplicates
on each imputed design (which must be replicate-weight designs).
If return.replicates=FALSE
, the weighted statistic, with the
variance matrix as the "var"
attribute. If
return.replicates=TRUE
, a list with elements theta
for
the usual return value and replicates
for the replicates.
svrepdesign
, as.svrepdesign
, svrVar
data(scd) repweights<-2*cbind(c(1,0,1,0,1,0), c(1,0,0,1,0,1), c(0,1,1,0,0,1), c(0,1,0,1,1,0)) scdrep<-svrepdesign(data=scd, type="BRR", repweights=repweights) a<-svyratio(~alive, ~arrests, design=scdrep) print(a$ratio) print(a$var) withReplicates(scdrep, quote(sum(.weights*alive)/sum(.weights*arrests))) withReplicates(scdrep, function(w,data) sum(w*data$alive)/sum(w*data$arrests)) data(api) dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc) rclus1<-as.svrepdesign(dclus1) varmat<-svyvar(~api00+api99+ell+meals+hsg+mobility,rclus1,return.replicates=TRUE) withReplicates(varmat, quote( factanal(covmat=.replicate, factors=2)$unique) ) data(nhanes) nhanesdesign <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=TRUE,data=nhanes) logistic <- svyglm(HI_CHOL~race+agecat+RIAGENDR, design=as.svrepdesign(nhanesdesign), family=quasibinomial, return.replicates=TRUE) fitted<-predict(logistic, return.replicates=TRUE, type="response") sensitivity<-function(pred,actual) mean(pred>0.1 & actual)/mean(actual) withReplicates(fitted, sensitivity, actual=logistic$y) ## Not run: library(quantreg) data(api) ## one-stage cluster sample dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc) ## convert to bootstrap bclus1<-as.svrepdesign(dclus1,type="bootstrap", replicates=100) ## median regression withReplicates(bclus1, quote(coef(rq(api00~api99, tau=0.5, weights=.weights)))) ## End(Not run) ## pearson correlation dstrat <- svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc) bstrat<- as.svrepdesign(dstrat,type="subbootstrap") v <- svyvar(~api00+api99, bstrat, return.replicates=TRUE) vcor<-cov2cor(as.matrix(v))[2,1] vreps<-v$replicates correps<-apply(vreps,1, function(v) v[2]/sqrt(v[1]*v[4])) vcov(bstrat,correps, centre=vcor)
Defines a design object with multiple dimensions of correlation:
observations that share any of the id
variables are correlated,
or you can supply an adjacency matrix or Matrix to specify which are
correlated. Supports crossed designs (eg multiple raters of multiple
objects) and non-nested observational correlation (eg observations
sharing primary school or secondary school). Has methods for
svymean
, svytotal
, svyglm
(so far).
xdesign(id = NULL, strata = NULL, weights = NULL, data, fpc = NULL, adjacency = NULL, overlap = c("unbiased", "positive"), allow.non.binary = FALSE)
id |
list of formulas specifying cluster identifiers for each clustering dimension (or NULL if adjacency is supplied) |
strata |
Not implemented |
weights |
model formula specifying (sampling) weights |
data |
data frame containing all the variables |
fpc |
Not implemented |
adjacency |
Adjacency matrix or Matrix indicating which pairs of observations are correlated |
overlap |
See details below |
allow.non.binary |
If |
Subsetting for these objects actually drops observations; it is not equivalent to just setting weights to zero as for survey designs. So, for example, a subset of a balanced design will not be a balanced design.
The overlap
option controls double-counting of some variance
terms. Suppose there are two clustering dimensions, ~a
and
~b
. If we compute variance matrices clustered on a
and
clustered on b
and add them, observations that share both
a
and b
will be counted twice, giving a positively
biased estimator. We can subtract off a variance matrix clustered
on combinations of a
and b
to give an unbiased
variance estimator. However, the unbiased estimator is not
guaranteed to be positive definite. In the references, Miglioretti
and Heagerty use the overlap="positive"
estimator and Cameron
et al use the overlap="unbiased"
estimator.
An object of class xdesign
Miglioretti D, Heagerty PJ (2007) Marginal modeling of nonnested multilevel data using standard software. Am J Epidemiol 165(4):453-63
Cameron, A. C., Gelbach, J. B., & Miller, D. L. (2011). Robust Inference With Multiway Clustering. Journal of Business & Economic Statistics, 29(2), 238-249.
https://notstatschat.rbind.io/2021/09/18/crossed-clustering-and-parallel-invention/
## With one clustering dimension, is close to the with-replacement ## survey estimator, but not identical unless clusters are equal size data(api) dclus1r<-svydesign(id=~dnum, weights=~pw, data=apiclus1) xclus1<-xdesign(id=list(~dnum), weights=~pw, data=apiclus1) xclus1 svymean(~enroll,dclus1r) svymean(~enroll,xclus1) data(salamander) xsalamander<-xdesign(id=list(~Male, ~Female), data=salamander, overlap="unbiased") xsalamander degf(xsalamander)
Design information from the Youth Risk Behaviors Survey (YRBS), together with the single variable ‘Never/Rarely wore bike helmet’. Used as an analysis example by CDC.
data("yrbs")
data("yrbs")
A data frame with 15624 observations on the following 4 variables.
weight
sampling weights
stratum
sampling strata
psu
primary sampling units
qn8
1=Yes, 2=No
https://ftp.cdc.gov/pub/Data/YRBS/2015smy/ for files
Centers for Disease Control and Prevention (2016) Software for Analysis of YRBS Data. [CRAN doesn't believe the URL is valid]
data(yrbs) yrbs_design <- svydesign(id=~psu, weight=~weight, strata=~stratum, data=yrbs) yrbs_design <- update(yrbs_design, qn8yes=2-qn8) ci <- svyciprop(~qn8yes, yrbs_design, na.rm=TRUE, method="xlogit") ci ## to print more digits: matches SUDAAN and SPSS exactly, per table 3 of reference coef(ci) SE(ci) attr(ci,"ci")