Persisting through reading technical CRAN documentation

 In teaching myself the R programming language, I have mostly mastered the art of reading through the CRAN documentation of R libraries as they are published. I have gone through everything from mediocre to very well documented reference pages. Here I share one example of a very good, well-documented function, svyby in the 'survey' library by Dr. Thomas Lumley, that for some reason I could not initially process and make work with my data. No finger pointing here. My brain simply could not readily wrap itself around the idea that the function takes another function as one of its arguments.


fig1: the svyby function in the 'survey' library by Thomas Lumley
filled in with variables for my study

Readers familiar with base R will be reminded of the aggregate function, which svyby mirrors: both take data and both take a function towards the end of the argument list. The difference is that svyby works from a survey design object and applies a survey-aware function to each subset of the survey, as defined by a factor. Slot 1 is for the variable to be summarised, taken from the full data; slot 2 is for the factor that we're slicing the data by; slot 3 is for the survey design; slot 4 is for the survey-designed function; and slot 5 allows for removal of NAs (see fig 1). In retrospect, it is an ingenious little function: sleek, smart, and available.
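To make the slots concrete, here is a minimal sketch of a svyby call. It does not use my study data: it assumes the `api` example data that ships with the 'survey' package, and the design (a stratified sample built with svydesign) is borrowed from that package's own examples purely for illustration.

```r
library(survey)

# Example data bundled with the 'survey' package (an assumption for
# illustration; these are not the variables from my study).
data(api)

# Slot 3 needs a survey design object. This one describes a sample
# stratified by school type, with sampling weights in 'pw'.
dstrat <- svydesign(ids = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)

# Slot 1: the variable to summarise.
# Slot 2: the factor that slices the survey into subsets.
# Slot 3: the survey design.
# Slot 4: a survey-aware function (here svymean).
# na.rm is passed along to that function.
by_type <- svyby(~api00, ~stype, dstrat, svymean, na.rm = TRUE)
print(by_type)

# The base R cousin: same idea, but unweighted and with no design object.
plain <- aggregate(api00 ~ stype, data = apistrat, FUN = mean)
print(plain)
```

The result of svyby has one row per level of the factor, with the estimate and its standard error side by side, which is the part aggregate cannot give you on its own.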

However, I was unable to break through the library to find the function that I needed. First, I tried other functions and added to those snippets to see if I was doing something right, with no luck. There were weeks of this before I finally discovered svyby.

Then, when I discovered this code, the tildes threw me off. In many cases, an object or part of an object in R is passed to a function argument as a quoted string. The tilde is admittedly foreign to my eye as an R programmer. I do not say that as a criticism, as the writer of this package is part of the R core team. Maybe there's something in the 'survey' library that is truer than what we've all been doing in other countries? I'm not sure. It's just a different experience.
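For anyone else puzzled by the tilde, a small base R sketch may help. The data frame and column names here are made up for illustration. The point is only that `~stype` creates a formula object that names a variable without evaluating it, whereas a quoted string is just text.

```r
# A one-sided formula is an object in its own right, not a string.
f <- ~stype
class(f)      # "formula"
all.vars(f)   # "stype"

# Hypothetical toy data, invented for this example.
df <- data.frame(stype = c("E", "H", "E"),
                 score = c(10, 20, 30))

# The variable named in the formula is only looked up once a function
# evaluates the formula against some data.
model.frame(f, data = df)

# Base R's aggregate accepts a two-sided formula in the same spirit:
# summarise 'score' within each level of 'stype'.
means <- aggregate(score ~ stype, data = df, FUN = mean)
print(means)
```

Seen this way, the tildes in svyby are less exotic: they tell the function which variables to look for inside the survey design, rather than handing over the values directly.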

However, once I realized that svyby was the function I needed to break descriptive statistics down into subsets of the survey I was analyzing, the term "formula" in the CRAN documentation for slots 1, 2, and 4 threw me off. In some cases, useRs will need a formula to specify subsets in a very particular form, as their survey sampling will be complex. This was not the case for me, as I had a very simply sampled survey from which to draw. Therefore, I did not need a formula. I could have substituted a different word in my head altogether for slots 1 and 2 (viz. vector). Slot 4 was easy enough to figure out once I realized what was needed for slots 1 and 2 in the arguments for svyby.

None of this came easily. It took weeks of problem solving, persistent reading, perspective changing, drilling through technical CRAN documentation, and trying variations on code to get results that worked. I can say that in the time it took to understand the svyby function, I learned more about myself, about using functions as arguments, and about my data than I thought possible. I am grateful to Dr. Thomas Lumley and the 'survey' package.


