
The 'jstor_ocr' function in the 'r7283' package for concatenating OCR and metadata from JSTOR's Data for Research

Digital Text Investigations

The digital humanities continue to change the ways in which we draw conclusions about social phenomena. This shift rests on the understanding that, for the first time in history, researchers can potentially work at the scale of a social phenomenon's entire textual record. This continuing evolution provides new ways to examine data. A key idea within it is the ability to pull together unstructured data and their accompanying metadata, as a rejoinder to older forms of content analysis and its related approaches.

The JSTOR Data for Research (DfR) arrangement presents a unique opportunity to work with unstructured data. Subscribers can request large, carefully delineated corpora for academic investigation. At the time of writing there are two options for data requests. The first allows the subscriber to create search terms, scale down the results, and, without a signed contract, download n-grams (roughly one- to three-word combinations are available). The second allows the subscriber to save a larger search and request optical character recognition (OCR) files along with the n-grams. The latter arrangement is somewhat more formal, but both open new opportunities to work with big data in drawing conclusions about social, literary, or otherwise discursive phenomena.

Existing Functions

The R programming language has a peer-reviewed, CRAN-documented package dedicated to viewing JSTOR DfR data, aptly called jstor (Klebel, 2018). Together with the tidyverse and knitr (Xie, 2019; Wickham et al., 2020), it enables the user to view zipped JSTOR DfR data and combine the information into data frames. The jstor package's dependencies on other packages allow for powerful views of the data; for discovery, this makes sense. In some cases, however, after signing a contract with JSTOR for OCR and advanced n-grams, unzipping the files and working through them directly is just as straightforward. The job at hand is simple: paste metadata into the appropriate files, save the files, and coerce the individual files into a data frame with metadata as variables, as sketched below. In short, after unzipping a manageable n (viz. n = 726), the heavier tooling might not be needed (and to a certain extent, even medium-sized corpora would do well enough to be left alone).
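As a minimal sketch of that straightforward job, and not the workflow of any published package, the following assumes an unzipped DfR request whose metadata and ocr subfolders pair files by a shared document id in the file name; the XPath fields used here (article-title, year) are likewise assumptions about the JATS-style metadata.

library(xml2)

dfr_to_df <- function(dfr_dir) {
  # Pair metadata (.xml) and ocr (.txt) files by their shared base name
  meta_dir <- file.path(dfr_dir, "metadata")
  ocr_dir  <- file.path(dfr_dir, "ocr")
  ids <- intersect(
    tools::file_path_sans_ext(list.files(meta_dir)),
    tools::file_path_sans_ext(list.files(ocr_dir))
  )

  rows <- lapply(ids, function(id) {
    meta <- read_xml(file.path(meta_dir, paste0(id, ".xml")))
    txt  <- readLines(file.path(ocr_dir, paste0(id, ".txt")),
                      warn = FALSE, encoding = "UTF-8")
    data.frame(
      id    = id,
      title = xml_text(xml_find_first(meta, ".//article-title")),
      year  = xml_text(xml_find_first(meta, ".//year")),
      text  = paste(txt, collapse = " "),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

Calling dfr_to_df("unzipped_dfr") would return one row per document, with the metadata fields as variables alongside the full OCR text.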

Forward Movement

After viewing several deprecations of jstor's original code following its own peer review, I decided that a serviceable alternative protocol would be to combine the OCR and metadata into .xml files as the coding goal. Peeking at zipped files, although good practice for quality control, did not merit the lavish data frame visualizations of the original jstor vignettes. From the standpoint of statistical hypothesis testing with manageable n-sizes, it was more relevant to simply unzip and work with the goods, as it were. Then, in the spirit of keeping the metadata multi-purposive, further downstream processing could yield data frames and variables as needed for text analytics, content analysis, and quantitative linguistics. This means that the unit of analysis, in content analytic parlance, is the individual .xml file, structured roughly as in the sketch below. Data preservation techniques were used so that as many variables as possible were carried into the final .xml files.
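To make that target concrete, here is a small sketch, built with xml2 under the assumption of a text wrapper node and hypothetical file names, of how one combined metadata-plus-OCR .xml file might be assembled; the r7283 documentation remains the authority on the actual output schema.

library(xml2)

# Hypothetical paths for a single document in an unzipped DfR request
meta <- read_xml("metadata/journal-article-sample.xml")
ocr  <- readLines("ocr/journal-article-sample.txt", warn = FALSE)

# Append the OCR as a single text child of the metadata document;
# xml2 handles the character escaping on write
node <- xml_add_child(meta, "text")
xml_set_text(node, paste(ocr, collapse = " "))

write_xml(meta, "combined/journal-article-sample.xml")

Each output file thus carries both a document's metadata and its full text, which is what allows the file itself to serve as the unit of analysis.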

Fig. 1 Documentation for jstor_ocr in the r7283 package

Procedural Automations for jstor_ocr and its Dependencies

The xml2 package was considered a vital dependency to retain within the jstor_ocr code from the r7283 package (Martinez, 2018), mostly because of xml2's memory management features (Wickham, Hester and Ooms, 2018). The resulting jstor_ocr code represents an intermediate coder's work (note well: functional programming did not improve processing speeds at medium n levels, several tests notwithstanding). Assuming an unzipped file location, the code takes the unaltered main folder structure, which opens onto n-grams, metadata, and OCR subfolders, and reads the subfolder contents in as lists. It then parses the text files, deletes all tags in the .txt files (tag quality within the raw files proved precarious), surrounds the text with beginning and ending text tags, cleans hyphenated words, substitutes out unreadable UTF-8 characters, and finally adopts the resulting nodes as XML children of the designated XML metadata files. It performs a quality control check, assuring that the names of the .txt and .xml files match exactly, a latent indication that the respective file contents correspond; it does so by comparing vector indexes, one within a loop against one found through lexical scoping inside the function. It then exports the resulting .xml files to a predesignated folder supplied in the function's second argument; a sketch of this workflow appears below. These files are then released to further text cleansing and staging downstream, in preparation for final statistical modeling with content analytic or quantitative linguistic methods.
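The following is a minimal sketch of that workflow, not the r7283 source: it assumes the unzipped DfR layout described above (paired file names across the metadata and ocr subfolders) and approximates the quality control check with a single upfront comparison of the two name vectors.

library(xml2)

jstor_ocr_sketch <- function(dfr_dir, out_dir) {
  xml_files <- list.files(file.path(dfr_dir, "metadata"),
                          pattern = "\\.xml$", full.names = TRUE)
  txt_files <- list.files(file.path(dfr_dir, "ocr"),
                          pattern = "\\.txt$", full.names = TRUE)

  # Quality control: the .xml and .txt names must line up pairwise
  stopifnot(identical(tools::file_path_sans_ext(basename(xml_files)),
                      tools::file_path_sans_ext(basename(txt_files))))

  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

  for (i in seq_along(xml_files)) {
    txt <- paste(readLines(txt_files[i], warn = FALSE), collapse = " ")
    txt <- gsub("<[^>]*>", " ", txt)                      # delete stray tags in the .txt
    txt <- gsub("-\\s+", "", txt)                         # rejoin words hyphenated at line breaks
    txt <- iconv(txt, from = "", to = "UTF-8", sub = " ") # substitute out unreadable bytes

    meta <- read_xml(xml_files[i])
    node <- xml_add_child(meta, "text")                   # wrap the OCR in a text node
    xml_set_text(node, txt)

    write_xml(meta, file.path(out_dir, basename(xml_files[i])))
  }
  invisible(out_dir)
}

A call such as jstor_ocr_sketch("unzipped_dfr", "xml_out") would write one combined .xml file per document into the folder named by the second argument, ready for the downstream cleansing and staging described above.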

Conclusion

The constant evolution of archival sharing arrangements makes possible the rapid implementation of standardized code, either through peer review or through archived binary packages, the latter of which may invite problem-in-practice documentation and usage. JSTOR's recent, forward-thinking arrangement of providing scaled, raw, digital humanities data beckons code-writing methodologists to construct pragmatic code for small- to medium-n (and arguably large-n) text analytic processing. A major assumption of the code presented here is that quality control can be attained with older, albeit basic, functions, eliminating some newer published functions entirely. The application of the code is mostly restricted to JSTOR's DfR archival work, featuring the function's ability to pull together unstructured data and their accompanying metadata into a seamless set of files.

References

Klebel, T. (2018). jstor: Import and analyse data from scientific texts. Journal of Open Source Software, 3(28), 883-884. Retrieved from https://doi.org/10.21105/joss.00883

Martinez, M. (2018). r7283: A miscellaneous toolkit. Retrieved from http://github.com/cownr10r/r7283

Wickham, H. et al. (2020). Tidyverse. Retrieved from http://tidyverse.org

Wickham, H., Hester, J. and Ooms, J. (2018). xml2: Parse XML. Retrieved from https://CRAN.R-project.org/package=xml2

Xie, Y. (2019). knitr: A general-purpose package for dynamic report generation in R. Retrieved from https://cran.r-project.org/web/packages/knitr/index.html

