Design-Based Survey Analysis

A persistent problem for researchers who rely on secondary analysis is building a statistical model from data that generalize only to a fixed population (Lumley, 2010). Drawing inferences about similar populations differs from estimating from a sample to a fixed population in one key respect: the latter requires several preliminary steps to ensure that the design-based sampling is replicated. Bell, Onwuegbuzie, Ferron, Jiao, Hibbard, and Kromrey (2012) reported that many investigators relying on large survey data on adolescent health were unclear about how they remained faithful to the survey designs. Reporting on international survey data suffers from the same issues: sampling weights are either left out of the analysis or not discussed thoroughly in the methodology sections of reports. The reasons for these incomplete discussions have not been established, but a plausible heuristic is that design-based anti-bias mechanisms are omitted because of the mystery behind how these mechanisms work, the nature of survey design itself, and uncertainty about which statistical packages can help.

"Case Damascus Barlow Knife" by Michael E. Cumpston CC-BY-SA 3.0

Two of the most common ways of remaining faithful to a survey design are to use, and report on, survey clusters and sampling weights. The former approach, with survey clusters or primary sampling units (PSUs), can be slightly overwhelming to the novice secondary data researcher, especially when importing data from several .sav files into R. Gathering separate data files into one data frame can take many lines of code if the analyst is not using a specialty package like ‘intsvy’ to choose the appropriate variables and concatenate them into a workable data frame (Caro & Biecek, 2017). Sometimes data sets do not provide PSUs, and it can be especially difficult to program the appropriate calculations behind the scenes so that a well-formed estimate of the population parameters can be achieved. A function with a survey ‘weights’ argument allows the statistical programmer to place the correct variable in the appropriate slot. However, finding an R package that has both a weight slot and the ability to perform bootstrap or jackknife procedures can be challenging.
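To make the "behind the scenes" calculation less mysterious, here is a minimal base R sketch of a weighted mean with a delete-one-PSU jackknife standard error. The data frame and its column names (psu, weight, score) are invented for illustration, and the variance formula shown is one common form of the jackknife, not necessarily the one a given survey's technical documentation prescribes.

```r
# Hypothetical toy sample: 6 PSUs (clusters) with 4 respondents each.
set.seed(1)
toy <- data.frame(
  psu    = rep(1:6, each = 4),
  weight = rep(c(10, 20, 15, 25, 30, 12), each = 4),  # assumed sampling weights
  score  = rnorm(24, mean = 500, sd = 100)
)

# Weighted mean: each respondent counts in proportion to the number of
# population members he or she represents.
weighted_mean <- function(d) sum(d$weight * d$score) / sum(d$weight)

estimate <- weighted_mean(toy)

# Delete-one-PSU jackknife: drop each PSU in turn and re-estimate.
# Rescaling the remaining weights is unnecessary here because a common
# factor cancels in the ratio that defines the weighted mean.
psus <- unique(toy$psu)
replicates <- vapply(
  psus,
  function(p) weighted_mean(toy[toy$psu != p, ]),
  numeric(1)
)

# Jackknife variance: (G - 1) / G times the sum of squared deviations of
# the replicate estimates from the full-sample estimate.
G <- length(psus)
jk_se <- sqrt((G - 1) / G * sum((replicates - estimate)^2))

c(estimate = estimate, jk_se = jk_se)
```

Packages with replicate-weight support automate exactly this kind of loop, usually with the replicate weights supplied by the survey organization.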

Analyzing secondary data in R can look deceptively simple when it comes to replicating the survey design, especially when running a linear model with the base lm() function. Assigning a sampling-weight variable to the ‘weights’ argument does not work the way one would intuitively expect and can get the programmer into trouble: lm() treats the weights as precision weights in a weighted least-squares fit, so its standard errors do not reflect the sampling design. A few R packages, such as ‘EdSurvey’ (Bailey et al., 2018) and ‘intsvy’ (Caro & Biecek, 2017), accept a weights argument intended for survey data.
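The pitfall can be seen in a small simulation. The sketch below assumes the ‘survey’ package is installed and uses made-up data with hypothetical column names (psu, w, x, y). It fits the same model with lm() and with svyglm(): the coefficients agree, but the standard errors do not, because only the design-based fit accounts for the clustering and the sampling weights.

```r
library(survey)  # Lumley's package, listed in the references

# Simulated, hypothetical clustered sample: 30 PSUs of 10 respondents each.
set.seed(42)
n_psu <- 30
dat <- data.frame(psu = rep(seq_len(n_psu), each = 10))
psu_effect <- rnorm(n_psu, sd = 20)            # shift shared within a cluster
dat$x <- rnorm(nrow(dat))
dat$y <- 500 + 15 * dat$x + psu_effect[dat$psu] + rnorm(nrow(dat), sd = 50)
dat$w <- runif(nrow(dat), min = 5, max = 50)   # assumed sampling weights

# Naive approach: lm() reads 'weights' as precision weights.
fit_lm <- lm(y ~ x, data = dat, weights = w)

# Design-based approach: declare clusters and weights first, then fit.
des <- svydesign(ids = ~psu, weights = ~w, data = dat)
fit_svy <- svyglm(y ~ x, design = des)

# Same coefficients, different standard errors.
cbind(
  coef   = coef(fit_lm),
  se_lm  = sqrt(diag(vcov(fit_lm))),
  se_svy = sqrt(diag(vcov(fit_svy)))
)
```

With real data, the cluster and weight variables would come from the survey's documentation, not from a simulation like this one.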

There are two solutions, depending on how fine-grained you want the analysis to be. One is to use the ‘BIFIEsurvey’ package (BIFIE, 2018). It is based on the work of Breit and Schreiner (2016), has slots for weighting variables, and can handle the replicate variables in the TIMSS and PIRLS surveys. Outputs include standard errors, degrees of freedom, p-values, and Wald statistics, derived from jackknife replicates or bootstrapping procedures. The output is intuitive and can be exported directly to LaTeX after inspecting the result with the str() command in R and using the ‘xtable’ package (Dahl, 2016). The other solution is to use the ‘survey’ package by Lumley (2018). This package requires more information to produce a design object that can then be carried forward for further analysis, but for analysts who can identify the cluster id variables in their dataset, it gives greater control over the analysis. Both packages produce statistical output based on estimation towards a fixed population, along with standard errors (a sine qua non of estimating in this circumstance). An example of the practical usage of the ‘survey’ package can be found in Murray (2015). Novices will probably want to start with the ‘BIFIEsurvey’ package and advance to the ‘survey’ package once they have mastered the full set of complex options for estimating towards the fixed population.
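As a rough sketch of the second route, the ‘survey’ workflow looks like the following: build a design object from the cluster id and weight variables, estimate a population mean, and then convert the design to jackknife replicate weights. The data are simulated and the column names (idschool, totwgt, score) are placeholders; a real data set would also carry strata and other design information that this sketch leaves out.

```r
library(survey)

# Simulated, hypothetical sample: 40 schools (PSUs) with 8 students each.
set.seed(7)
n_psu <- 40
svy <- data.frame(
  idschool = rep(seq_len(n_psu), each = 8),                  # placeholder cluster id
  totwgt   = rep(runif(n_psu, min = 20, max = 80), each = 8),  # placeholder weight
  score    = rnorm(n_psu * 8, mean = 500, sd = 90)
)

# Step 1: the design object that is carried forward into every analysis.
des <- svydesign(ids = ~idschool, weights = ~totwgt, data = svy)

# Step 2: population mean with a linearization standard error.
est_linear <- svymean(~score, des)

# Step 3: the same estimate with delete-one-PSU jackknife replicates.
rep_des <- as.svrepdesign(des, type = "JK1")
est_jack <- svymean(~score, rep_des)

print(est_linear)
print(est_jack)
```

The point estimate is the same under both approaches; only the way the standard error is computed changes.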

Both packages address, to good effect, the problems reported by Bell, Onwuegbuzie, Ferron, Jiao, Hibbard, and Kromrey (2012) in their review of the literature. They offer several options for estimating from a sample towards a fixed population, with one-step, two-step, and possibly more complex treatments of a design sample. Both depend on the statistical programmer first creating a basic data frame that captures the variables of interest for parameter estimation. Both also support statistical procedures that combine mathematical modeling with estimation towards the fixed population.

References

Bailey, P., C'deBaca, R., Emad, A., Huo, H., Lee, M., Liao, Y.,
Nguyen, T., Xie, Q., Yu, J. and Zhang, T. (2018). EdSurvey: Analysis of NCES
Education Survey and Assessment Data. R package version 2.0.3.
https://CRAN.R-project.org/package=EdSurvey

Bell, B., Onwuegbuzie, A., Ferron, J., Jiao, Q., Hibbard, S. and Kromrey, J. (2012). Use of design effects and sample weights in complex health survey data: A review of published articles using data from 3 commonly used adolescent health surveys. American Journal of Public Health, 102(7), 1399-1405. http://doi.org/10.2105/AJPH.2011.300398

BIFIE (2018). BIFIEsurvey: Tools for survey statistics in educational assessment. R
package version 2.191-12. https://CRAN.R-project.org/package=BIFIEsurvey

Breit, S. and Schreiner, C. (2016). Large-Scale Assessment mit R: Methodische Grundlagen der österreichischen Bildungsstandardüberprüfung. Vienna, Austria: Facultas.

Caro, D. H. and Biecek, P. (2017). intsvy: An R package for analyzing international
large-scale assessment data. Journal of Statistical Software, 81(7), 1-44.
http://doi.org/10.18637/jss.v081.i07

Dahl, D. (2016). xtable: Export Tables to LaTeX or HTML. R package version
1.8-2. https://CRAN.R-project.org/package=xtable

Lumley, T. (2010). Complex surveys: A guide to analysis using R. Hoboken, NJ: John Wiley.

Lumley, T. (2018). survey: Analysis of complex survey samples. R package version
3.34. https://cran.r-project.org/web/packages/survey/

Murray, J. (2015). Intro to the survey R package (36-303). Retrieved from https://www.andrew.cmu.edu/user/jsmurray/teaching/303/files/lab.html

Researching Education
