
Getting past the two-column PDF to extract text into RQDA: Literature reviews await

One of the great promises of RQDA is treating it as computer-assisted literature review software. Doing so requires balancing the right amount of coding with text that can serve as warrants and backing in arguments. In theory it is a great idea: using computer-assisted qualitative data analysis software (CAQDAS) for literature reviews. In practice, how do you get article PDFs into R and RQDA in a human-readable format? Many empirical articles are laid out in two columns, and standard text extraction reads across both columns at once, scrambling the text. Extracting PDF text under these circumstances can be daunting with R packages such as 'pdftools', with or without the assistance of the 'tesseract' package.

If you are working on a Windows-based computer, three R packages and a Java installation will do the trick. First, gather the literature articles you would like to mark up in RQDA, put them into a single folder, and away you go.


Fig 1. Preparing files for reading into the R global environment.

Before looking at Figure 1, it is necessary to install and load the required R packages with the 'install.packages()' and 'library()' commands. The packages for this quick project are 'magrittr', 'rJava', and 'tabulizer'. Each package is installed and loaded separately by putting its name inside the parentheses of the two commands above (unless you want to write or borrow a function that automates the process).
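As a quick sketch, the install-and-load step looks like this (package names come from the text above; if 'tabulizer' is not available on CRAN for your version of R, it may need to be installed from its GitHub repository instead):

```r
# Install once (requires an internet connection), then load in each session
install.packages(c("magrittr", "rJava", "tabulizer"))

library(magrittr)   # forward-pipe operator %>%
library(rJava)      # low-level bridge between R and Java
library(tabulizer)  # PDF text extraction via the Tabula Java library
```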

Once you have installed the packages, download the Java SE Development Kit (I downloaded version 8, which ships with new terms and conditions; be sure to read them carefully). This version runs with 'rJava', but remember where you install Java on your computer, because you will need that path later. Once you have noted your Java path and installed the requisite packages, it is time to run the code in Figure 1.

Figure 1, line 1 comments that the next two lines will automate importing the PDFs. Line 2 sets the working directory to the path of the folder containing all of the PDFs. Line 5 tells us that we must point the package 'rJava' to the location of Java so the two can work in tandem. Line 6 points 'rJava' to the place where Java lives on the computer. Line 8 explicitly calls the 'magrittr' package (a piping package; it keeps the code easy to read and funnels the processing into a single object, 'b'). Then, lines 10-13 are run together, producing the list of PDF contents that are read into R. Line 15 takes the file names produced on line 3, pastes them together with a comma separator, and line 16 associates the object 'b' with that list of names.
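Since Figure 1 is an image, a reconstruction of its code may help. The sketch below follows the line-by-line description above, but the folder path, Java path, and object names are assumptions, not a copy of the figure:

```r
# Lines 1-3: automate importing the PDFs
setwd("C:/Users/me/literature")               # assumed path to the PDF folder
pdf_names <- list.files(pattern = "\\.pdf$")  # file names, reused further down

# Lines 5-6: point 'rJava' to the local Java installation
Sys.setenv(JAVA_HOME = "C:/Program Files/Java/jdk1.8.0_241")  # your own path here

# Line 8: piping package; keeps the extraction pipeline readable
library(magrittr)
library(rJava)
library(tabulizer)

# Lines 10-13: read every PDF; extract_text() walks each column top to bottom,
# so two-column articles come out as a single column of text
b <- pdf_names %>%
  lapply(extract_text)

# Lines 15-16: associate each extracted text with its file name
# (the figure builds the names with paste(); assigning them directly also works)
names(b) <- pdf_names
```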

When 'b' is further processed with the RQDA function write.FileList(), the user can import the object into RQDA as a list of files with associated names. With this simple code, the problem of using RQDA as computer-assisted literature review software is solved with three R packages: 'rJava', 'tabulizer', and 'magrittr'. These packages work with Java to get the job done: they take two-column PDF contents and produce single-column output. PDFs that are already single column come through as one column as well.
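To close the loop, here is a minimal sketch of the import step, assuming RQDA is installed and 'b' is the named list built earlier (the project file name is hypothetical):

```r
library(RQDA)

openProject("literature_review.rqda")  # hypothetical RQDA project file
write.FileList(b)                      # imports each named text element as an RQDA file
closeProject()
```

Once imported, each article appears in the RQDA Files tab under its PDF file name, ready for coding.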


References

Thomas J. Leeper (2018). tabulizer: Bindings for Tabula PDF Table Extractor Library. R package version 0.2.2.

Simon Urbanek (2020). rJava: Low-Level R to Java Interface. R package version 0.9-12. https://CRAN.R-project.org/package=rJava

Stefan Milton Bache and Hadley Wickham (2014). magrittr: A Forward-Pipe Operator for R. R package version 1.5. https://CRAN.R-project.org/package=magrittr
