
The "Cram" Method of validation for Machine learning

Our department at the University of Houston has a paper that has been accepted for presentation at the 2026 conference of the American Educational Research Association (AERA) in Los Angeles. Without getting too deep into the purpose of that paper, the purpose of this post is to showcase the novel methodology behind it: we used machine learning (ML) algorithms to make our case, and as a result we needed to validate the ML outcomes. Traditionally this is done by holding out a portion of the data, typically an 80/20 split, and indeed one of our reviewers asked us to apply a holdout in a particular way. As a data scientist, however, I am always championing the newest methodologies available in the R coding sphere. While scanning recent releases over the break, I came across the work of Jia, Imai, and Li (2024/2025). Their method turns the traditional 80/20 holdout and cross-validation techniques on their heads, rejecting them outright in favor of something new. In their approach, the entire dataset is subdivided into batches: an initial rule is learned, the rule is updated over several iterations as batches are added, and the process culminates in a final rule, with evaluation carried out along the way. For our paper, this is an important use case beyond the ones Jia, Imai, and Li (2024/2025) describe explicitly. They motivate the methodology for randomized clinical trials, where the number of cases is small, and I want to extend it to other studies with small samples, such as ours.
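To make the batch-wise idea concrete, here is a minimal sketch in R. It assumes a data frame `my_data` with a binary 0/1 `outcome` column, and it uses a plain logistic regression as a stand-in learner; both are placeholders of mine, and this is an illustration of the batching logic only, not the authors' exact estimator (their papers develop an efficient on-policy evaluation of the sequence of rules).

```r
# Minimal sketch of the cram-style batching loop (illustrative only):
# split the full dataset into B batches, update the rule cumulatively,
# and score each update on the data that has not yet entered training.

set.seed(2026)
B <- 10                                        # number of batches
n <- nrow(my_data)
batch_id <- sample(rep(1:B, length.out = n))   # random batch assignment

loss_steps <- numeric(B - 1)

for (t in 1:(B - 1)) {
  train_t <- my_data[batch_id <= t, ]          # batches 1..t train rule_t
  eval_t  <- my_data[batch_id >  t, ]          # remaining batches score it

  rule_t <- glm(outcome ~ ., data = train_t, family = binomial)
  p_hat  <- predict(rule_t, newdata = eval_t, type = "response")

  # log-loss of the current rule on the not-yet-used batches
  y <- eval_t$outcome
  loss_steps[t] <- -mean(y * log(p_hat) + (1 - y) * log(1 - p_hat))
}

# Aggregating across steps evaluates the learning path; the final rule
# is the one fit on all B batches, so no data is permanently held out.
mean(loss_steps)
final_rule <- glm(outcome ~ ., data = my_data, family = binomial)
```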
While our data may look larger than those of a clinical trial, the numbers are complex and vary by cohort, so there is tethering involved across cohorts, and the predictors were reduced to 54 with LASSO regression. Still, some might consider the dataset to sit on the verge of medium to big data. Whatever the perspective, the cram method helps prevent losing the complexity of cases needed to train the final rule. The only drawback of using the cram method was hyperparameter tuning: running cross-validated sampling for tuning purposes would have amounted to double dipping, so we decided to leave all tuning options at their default levels.
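As a hedged sketch of that reduction step, the code below uses the glmnet package with its default lambda path and no cross-validated tuning; `my_data`, the `outcome` column, and the particular point chosen on the path are placeholders for illustration, not the exact settings from our study.

```r
# LASSO variable reduction with glmnet, keeping package defaults and
# avoiding cross-validated tuning of lambda (illustrative sketch).

library(glmnet)

X <- model.matrix(outcome ~ . - 1, data = my_data)   # predictors as a matrix
y <- my_data$outcome

lasso_fit <- glmnet(X, y, family = "binomial", alpha = 1)  # default lambda path

# Inspect how many coefficients survive at each lambda on the default path,
# then pick a point on the path without resorting to cross-validation.
print(lasso_fit$df)

chosen_lambda <- lasso_fit$lambda[25]                 # illustrative choice only
coefs <- coef(lasso_fit, s = chosen_lambda)
keep  <- setdiff(rownames(coefs)[as.vector(coefs) != 0], "(Intercept)")
keep                                                  # surviving predictors
```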
This left us with a clean slate to use the cram method without feeling as though we were overusing any cross-validated procedure. The results of the cram method gave clean, interpretable findings that can be applied to unseen data going forward.
In the final output one can see an expected loss estimate of 0.10, which is respectable for the XGBoost algorithm relative to the other models we stacked against it. In the end, XGBoost did not win out as the most predictive model on error, AUC, and the other metrics, but that was due to a host of factors, including the data itself. The cram method was very helpful in getting a handle on our data and will be used going forward in our presentation at AERA.
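For reference, here is a hedged sketch of how a candidate learner such as XGBoost can be fit at default settings and scored on AUC. The objects `X_train`, `y_train`, `X_eval`, and `y_eval` are placeholders for whatever training and evaluation frames apply at a given step; `nrounds` has to be supplied because the package has no default for it, while everything else is left at package defaults.

```r
# Fit XGBoost with default tuning parameters and compare on AUC
# (illustrative sketch; data objects are placeholders).

library(xgboost)
library(pROC)

xgb_fit <- xgboost(
  data      = as.matrix(X_train),
  label     = y_train,
  nrounds   = 100,                  # required argument; all else at defaults
  objective = "binary:logistic",
  verbose   = 0
)

pred <- predict(xgb_fit, as.matrix(X_eval))

# Area under the ROC curve, used alongside expected loss to compare models
auc_value <- auc(roc(y_eval, pred))
auc_value
```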

References

Jia, Z., Imai, K. & Li, M. (2025). Cramming Contextual Bandits for On-policy Statistical Evaluation. arXiv, 2403.07031. https://arxiv.org/abs/2403.07031

Jia, Z., Imai, K., & Li, M.L. (2024). "The Cram Method for Efficient Simultaneous Learning and Evaluation." Harvard Business School Working Paper. https://www.hbs.edu/faculty/Pages/item.aspx?num=66892
