
Biterm topic modeling in R

As large language models (LLMs) have become all the rage, small-scale modeling remains a useful tool for researchers with strictly defined research questions, and the biterm topic model is one such tool. In this blog post I discuss the procedure for biterm topic modeling (BTM) in the R programming language. One indication for the procedure is short text with a large "n" to be parsed, such as Twitter posts and related social media content. To be sure, such text is becoming harder to harvest online, but secondary data sources can still yield insightful information, and BTM has uses beyond Twitter wherever short text appears, such as open-ended questions in surveys.

Yan et al. (2013) have suggested that BTM, with its Gibbs sampling procedure, handles short text better than latent Dirichlet allocation (LDA) models. They reach this conclusion on two grounds: word co-occurrence patterns (biterms) can be modeled directly to aid topic learning, and the sparse word occurrence problem is addressed by aggregating patterns over the whole corpus. So long as these two points can be defended, BTM can be seen as useful beyond the LDA model. What follows is a four-step process for working through the BTM procedure in R. Note: the spacyr package is used in this workflow and should be installed ahead of time. Step 1 involves ingesting the data.
In this case the tidyverse and readxl libraries are called. Then the data is read into the R environment, with the specification that NA cells be dropped; the BTM library will otherwise throw errors on empty cells later in step 3.
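A minimal sketch of step 1, assuming a hypothetical Excel file with a text column named "text":

```r
# Step 1: ingest the data (file and column names are hypothetical)
library(tidyverse)
library(readxl)

dat <- read_excel("responses.xlsx") %>%
  drop_na(text)   # BTM throws errors on empty cells in step 3
```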

Step 2 involves parsing the data with the spacyr library. spacyr must be initialized and, if properly installed, will call a conda environment through the backend. Once the console messages indicate that the conda environment has been contacted, it is safe to proceed with parsing the text. Note: when parsing is finished, spacyr must be closed, or finalized.
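A sketch of the initialization; the language model name is an assumption (it is spacyr's default):

```r
# Step 2 (initialization): start the spaCy backend via spacyr
library(spacyr)
spacy_initialize(model = "en_core_web_sm")  # default English model; an assumption
# console messages confirm which conda/Python environment was attached
```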
Once initialization is complete, the command spacy_parse() is run with the text column of the data frame as its first argument and further arguments specified. In this case, entity was kept as FALSE, but tag and lemma were kept as TRUE. Then dplyr verbs were used to filter parts of speech, keeping nouns and adjectives only, and to select the document id and lemma columns. The resulting object feeds into the next line, which is the actual BTM modeling call. The number of topics was subjectively kept at k = 25, with everything else in place as seen. The argument detailed = TRUE was cut off in the screenshot of the code, but should be included.
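Putting step 2 together, a hedged sketch; the object names d and m follow the post, while the seed value is an assumption:

```r
# Step 2 (parsing and modeling)
library(BTM)

parsed <- spacy_parse(dat$text, entity = FALSE, tag = TRUE, lemma = TRUE)

d <- parsed %>%
  filter(pos %in% c("NOUN", "ADJ")) %>%  # keep nouns and adjectives only
  select(doc_id, lemma)

set.seed(321)  # seed value is an assumption, for reproducibility
m <- BTM(d, k = 25, detailed = TRUE)

spacy_finalize()  # close the spacyr backend
```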

Step 3 is the display of terms and their probabilities, which is done with simple code.
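A sketch using the terms() method from the BTM package; the value of top_n is an assumption:

```r
# Step 3: top terms and their probabilities for each topic
topicterms <- terms(m, top_n = 10)  # top_n value is an assumption
topicterms
```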


The terms come back as a list, and with simple subscripting the useR can specify which topic to view. The list can be further broken down with the unlist() function.
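For example:

```r
topicterms[[1]]     # terms and probabilities for topic 1 only
unlist(topicterms)  # flatten the whole list into a named vector
```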

Step 4 follows the CRAN documentation for the BTM package itself (Wijffels, 2023).


The final step calls for integrating object "d" from step 2 with object "m" from the BTM model. The code produces the final output seen in the figure below, which showcases topic 1 with the topic relevance metric at lambda = 1.
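A sketch of this integration, based on the LDAvis example in the BTM documentation; d and m are the objects from step 2, and the LDAvis package is assumed to be installed:

```r
# Step 4: interactive topic visualization via LDAvis
library(LDAvis)

docsize <- table(d$doc_id)           # tokens retained per document
scores  <- predict(m, d)             # document-topic scores
scores  <- scores[names(docsize), ]  # align rows with docsize

json <- createJSON(
  phi  = t(m$phi),                   # topic-term distributions
  theta = scores,                    # document-topic distributions
  doc.length = as.integer(docsize),
  vocab = m$vocabulary$token,
  term.frequency = m$vocabulary$freq)
serVis(json)  # at lambda = 1, terms are ranked purely by P(term | topic)
```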



References

Wijffels, J. (2023). BTM: Biterm topic models for short text. https://cran.r-project.org/web/packages/BTM/index.html

Yan, X., Guo, J., Lan, Y., & Cheng, X. (2013). A biterm topic model for short texts. Proceedings of the 22nd International Conference on World Wide Web, 1445-1456. https://doi.org/10.1145/2488388.2488514
