Design-Based Survey Analysis

A persistent problem for researchers conducting secondary analysis is generating a statistical model from data that generalizes only to a fixed population (Lumley, 2010). A key difference between drawing statistical inferences about similar populations and estimating from a sample toward a fixed population is that the latter requires several preliminary steps to guarantee that the design-based sampling is replicated in the analysis. Bell, Onwuegbuzie, Ferron, Jiao, Hibbard, and Kromrey (2012) have reported on the lack of clarity in remaining faithful to survey designs
by many investigators relying on large survey data covering adolescent health. Reporting on international survey data suffers from the same issues: sampling weights are either not included in the analysis or not discussed thoroughly in the methodology sections of reports. While the reasons for these incomplete discussions have not been definitively established, a reasonable heuristic is that design-based bias corrections are omitted because of the opacity of how these mechanisms work, the complexity of survey design itself, and unfamiliarity with the statistical packages that can help.
Two of the most common ways of remaining faithful to a survey design are to use and report survey clusters and sampling weights. The former approach, using survey clusters or primary sampling units (PSUs), can be overwhelming to the novice secondary-data researcher, especially when importing data from several .sav files into R. Gathering separate data files into one data frame can take many lines of code if the analyst is not using a specialty package such as 'intsvy' to choose the appropriate variables and concatenate them into a workable data frame (Caro & Biecek, 2017). Some data sets do not provide PSUs, and it can be especially difficult to program the appropriate calculations behind the scenes so that a well-formed estimate of the population parameters is achieved. A function with a survey 'weights' argument allows the statistical programmer to place the correct variable in the appropriate slot. However, finding an R package with a weight slot and the ability to perform bootstrapping or jackknife procedures can be challenging.
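As a minimal sketch of the import step (assuming a folder of .sav files with identical columns; the folder path is hypothetical), the 'haven' package can read each file and stack the results into one data frame:

```r
library(haven)  # read_sav() for SPSS .sav files

# Hypothetical folder of country-level .sav files sharing the same columns
files <- list.files("data/timss", pattern = "\\.sav$", full.names = TRUE)

# Read each file, then bind the pieces into a single data frame
dat <- do.call(rbind, lapply(files, read_sav))
```

This only works when the files share identical column layouts; otherwise a package such as 'intsvy' that selects and harmonizes variables is the safer route.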
Using the R programming language for secondary data analysis can make the problem of replicating the survey design look deceptively simple, especially when running a linear model with the base lm() function. Assigning a sampling-weight variable to the 'weights' argument will not work the way one might expect and can get the programmer into trouble: lm() interprets weights as precision weights, not sampling weights, so the standard errors will not reflect the design. A few R packages ('EdSurvey' and 'intsvy') take a weights argument intended for survey weights.
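A sketch of the contrast, assuming a hypothetical data frame df with an outcome score, a predictor ses, a sampling weight pweight, and a cluster id psu:

```r
library(survey)  # Lumley's design-based analysis package

# Naive approach: lm() treats 'weights' as precision (inverse-variance)
# weights. The point estimates match weighted least squares, but the
# standard errors ignore the sampling design.
fit_naive <- lm(score ~ ses, data = df, weights = pweight)

# Design-based approach: declare the design first, then fit against it
des <- svydesign(ids = ~psu, weights = ~pweight, data = df)
fit_svy <- svyglm(score ~ ses, design = des)
summary(fit_svy)  # standard errors account for clustering and weights
```

The two fits can return similar coefficients while disagreeing substantially on standard errors, which is exactly the reporting problem Bell et al. (2012) describe.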
There are two solutions, depending on how fine-grained an analysis is wanted. One is the 'BIFIEsurvey' package (BIFIE, 2018). Based on the work of Breit and Schreiner (2016), it has slots for weighting variables and can handle the replicate variables in the TIMSS and PIRLS surveys. Outputs include standard errors, degrees of freedom, p-values, and Wald statistics derived from jackknife-replicate or bootstrapping procedures. The output is intuitive and can be exported directly to LaTeX by inspecting the result with str() and using the 'xtable' package (Dahl, 2016). The other solution is the 'survey' package by Lumley (2018). It requires more information to produce a design object that can then be carried forward for further analysis, but for analysts who can identify the cluster-id variables in their data set, it gives greater control over the analysis. Both packages give statistical output based on estimation toward a fixed population, along with standard errors (a sine qua non of estimation in this circumstance). A practical example of the 'survey' package is given by Murray (2015). Novices will probably want to start with 'BIFIEsurvey' and advance to 'survey' once they have mastered the full set of options for estimating toward the fixed population.
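A minimal sketch of the 'BIFIEsurvey' route, assuming a TIMSS-style data frame (TOTWGT, JKZONE, and JKREP follow TIMSS naming conventions; score and ses are hypothetical analysis variables):

```r
library(BIFIEsurvey)

# Build a jackknife-replicated object from a TIMSS-style data frame 'df':
# TOTWGT is the total student weight, JKZONE/JKREP the jackknife zone
# and replicate indicators
bdat <- BIFIE.data.jack(data = df, jktype = "JK_TIMSS",
                        wgt = "TOTWGT", jkzone = "JKZONE", jkrep = "JKREP")

# Design-based linear regression; "one" requests the intercept term
mod <- BIFIE.linreg(bdat, dep = "score", pre = c("one", "ses"))
summary(mod)  # coefficients with jackknife standard errors
```

The jktype argument is what spares the analyst from specifying the replication scheme by hand for the supported assessments.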
Both packages address the problems reported by Bell, Onwuegbuzie, Ferron, Jiao, Hibbard, and Kromrey (2012) in their review of the literature, and to good effect. They offer several options for estimating statistical output from a sample toward a fixed population with one- or two-step (and possibly more) treatments of a design sample. Both are contingent upon the statistical programmer creating a basic data frame that captures the variables of interest for parameter estimation, and both allow statistical procedures that combine mathematical modeling with estimation toward the fixed population.


References

Bailey, P., C'deBaca, R., Emad, A., Huo, H., Lee, M., Liao, Y., Nguyen, T., Xie, Q., Yu, J., and Zhang, T. (2018). EdSurvey: Analysis of NCES education survey and assessment data. R package version 2.0.3.
Bell, B., Onwuegbuzie, A., Ferron, J., Jiao, Q., Hibbard, S., and Kromrey, J. (2012). Use of design effects and sample weights in complex health survey data: A review of published articles using data from 3 commonly used adolescent health surveys. American Journal of Public Health, 102(7), 1399-1405.
BIFIE (2018). BIFIEsurvey: Tools for survey statistics in educational assessment. R package version 2.191-12.
Breit, S. and Schreiner, C. (2016). Large-scale Assessment mit R: Methodische Grundlagen der österreichischen Bildungsstandardüberprüfung [Large-scale assessment with R: Methodological foundations of the Austrian educational standards assessment]. Vienna, Austria: Facultas.
Caro, D. H. and Biecek, P. (2017). intsvy: An R package for analyzing international large-scale assessment data. Journal of Statistical Software, 81(7), 1-44.
Dahl, D. (2016). xtable: Export tables to LaTeX or HTML. R package.
Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. Hoboken, NJ: John Wiley.
Lumley, T. (2018). survey: Analysis of complex survey samples. R package.
Murray, J. (2015). Intro to the survey R package (36-303).
