
The "Cram" Method of validation for Machine learning

Our department at the University of Houston has a paper that has been accepted for presentation at the 2026 conference of the American Educational Research Association (AERA) in Los Angeles. Without getting too deep into the purpose of that paper, the purpose of this post is to showcase the novel methodology behind it: we used machine learning (ML) algorithms to make our case, and as a result we needed to validate the ML outcomes. Traditionally this is done by holding out a portion of the data, typically an 80/20 split, and indeed one of our reviewers asked us to apply a holdout in a particular way. As a data scientist, however, I am always championing the newest methodologies available in the R coding sphere. While scanning recent releases over the break, I came across the work of Jia, Imai, and Li (2024/2025). Their method turns the traditional 80/20 holdout and cross-validation techniques on their heads, rejecting them outright in favor of something new. In their approach, the entire dataset is subdivided into batches: an initial rule is learned, the rule is updated over several iterations as batches are added, and the process culminates in a final rule, with evaluation carried out along the way. For our paper, this is an important use case beyond the ones Jia, Imai, and Li (2024/2025) describe explicitly. They motivate the methodology for randomized clinical trials, where the number of cases is small, and I want to extend it to other studies with small samples, such as ours.
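To make the batch-wise idea concrete, here is a minimal sketch in R. It assumes a data frame `my_data` with a binary 0/1 `outcome` column, and it uses a plain logistic regression as a stand-in learner; both are placeholders of mine, and this is an illustration of the batching logic only, not the authors' exact estimator (their papers develop an efficient on-policy evaluation of the sequence of rules).

```r
# Minimal sketch of the cram-style batching loop (illustrative only):
# split the full dataset into B batches, update the rule cumulatively,
# and score each update on the data that has not yet entered training.

set.seed(2026)
B <- 10                                        # number of batches
n <- nrow(my_data)
batch_id <- sample(rep(1:B, length.out = n))   # random batch assignment

loss_steps <- numeric(B - 1)

for (t in 1:(B - 1)) {
  train_t <- my_data[batch_id <= t, ]          # batches 1..t train rule_t
  eval_t  <- my_data[batch_id >  t, ]          # remaining batches score it

  rule_t <- glm(outcome ~ ., data = train_t, family = binomial)
  p_hat  <- predict(rule_t, newdata = eval_t, type = "response")

  # log-loss of the current rule on the not-yet-used batches
  y <- eval_t$outcome
  loss_steps[t] <- -mean(y * log(p_hat) + (1 - y) * log(1 - p_hat))
}

# Aggregating across steps evaluates the learning path; the final rule
# is the one fit on all B batches, so no data is permanently held out.
mean(loss_steps)
final_rule <- glm(outcome ~ ., data = my_data, family = binomial)
```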
While our data may look larger than those of a clinical trial, the numbers are complex and vary by cohort, so there is tethering involved across cohorts, and the predictors were reduced to 54 with LASSO regression. Still, some might consider the dataset to sit on the verge of medium to big data. Whatever the perspective, the cram method helps prevent losing the complexity of cases needed to train the final rule. The only drawback of using the cram method was hyperparameter tuning: running cross-validated sampling for tuning purposes would have amounted to double dipping, so we decided to leave all tuning options at their default levels.
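As a hedged sketch of that reduction step, the code below uses the glmnet package with its default lambda path and no cross-validated tuning; `my_data`, the `outcome` column, and the particular point chosen on the path are placeholders for illustration, not the exact settings from our study.

```r
# LASSO variable reduction with glmnet, keeping package defaults and
# avoiding cross-validated tuning of lambda (illustrative sketch).

library(glmnet)

X <- model.matrix(outcome ~ . - 1, data = my_data)   # predictors as a matrix
y <- my_data$outcome

lasso_fit <- glmnet(X, y, family = "binomial", alpha = 1)  # default lambda path

# Inspect how many coefficients survive at each lambda on the default path,
# then pick a point on the path without resorting to cross-validation.
print(lasso_fit$df)

chosen_lambda <- lasso_fit$lambda[25]                 # illustrative choice only
coefs <- coef(lasso_fit, s = chosen_lambda)
keep  <- setdiff(rownames(coefs)[as.vector(coefs) != 0], "(Intercept)")
keep                                                  # surviving predictors
```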
This left us with a clean slate to use the cram method without feeling as though we were overusing any cross-validated procedure. The results of the cram method gave clean, interpretable findings that can be applied to unseen data going forward.
In the final output one can see an expected loss estimate of 0.10, which is respectable for the XGBoost algorithm relative to the other models we stacked against it. In the end, XGBoost did not win out as the most predictive model on error, AUC, and the other metrics, but that was due to a host of factors, including the data itself. The cram method was very helpful in getting a handle on our data and will be used going forward in our presentation at AERA.
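For reference, here is a hedged sketch of how a candidate learner such as XGBoost can be fit at default settings and scored on AUC. The objects `X_train`, `y_train`, `X_eval`, and `y_eval` are placeholders for whatever training and evaluation frames apply at a given step; `nrounds` has to be supplied because the package has no default for it, while everything else is left at package defaults.

```r
# Fit XGBoost with default tuning parameters and compare on AUC
# (illustrative sketch; data objects are placeholders).

library(xgboost)
library(pROC)

xgb_fit <- xgboost(
  data      = as.matrix(X_train),
  label     = y_train,
  nrounds   = 100,                  # required argument; all else at defaults
  objective = "binary:logistic",
  verbose   = 0
)

pred <- predict(xgb_fit, as.matrix(X_eval))

# Area under the ROC curve, used alongside expected loss to compare models
auc_value <- auc(roc(y_eval, pred))
auc_value
```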

References

Jia, Z., Imai, K. & Li, M. (2025). Cramming Contextual Bandits for On-policy Statistical Evaluation. arXiv, 2403.07031. https://arxiv.org/abs/2403.07031

Jia, Z., Imai, K., & Li, M.L. (2024). "The Cram Method for Efficient Simultaneous Learning and Evaluation." Harvard Business School Working Paper. https://www.hbs.edu/faculty/Pages/item.aspx?num=66892
