
Text mining, unruly text, XML, TEI, and R: Go with conventional architecture, or make your own?

Many educational researchers will inevitably work with text as data. Reflective practice (almost universally required by teacher preparation programs) requires conveying meaning through words and retaining a corpus of reflections throughout a semester, or even a year. Finding patterns in teaching strategies will require text parsing. Student writing assessment naturally lends itself to text analytics, so educational researchers can gain data on student learning by reading student responses to writing prompts. Further still, professional educational researchers stand to gain much by taking in large amounts of text, searching for patterns, and reporting on their findings. The more skilled educational researchers are at working with text, the more opportunities open up to them. Working with text requires effort and copious patience, mostly because text is, relative to numbers, messy.

The Curse of Messiness

Messiness is essentially the observation that educational researchers come by text in many different forms, and once it is imported into the R environment (or most programming environments), it comes in as a messy indexed character vector. Submitting messy character vectors as part of an evidence/audit trail can erode perceptions of educational researchers' trustworthiness. On a different note, whether creating a retrievable artifact for peer review, banking the data for archival purposes, or cutting down wrangling time in any programming language, educational researchers must find ways to speed up their workflows, and there are means of solving this problem. Figure 1 shows messiness in the raw text that I entered into the RStudio console for a previous research paper. The research objectives were clearly defined: a 2,000-page PDF document needed to be (1) split, (2) converted from two columns into one, and (3) cleaned for statistical modeling to answer the research questions.

Fig 1. "Unruly, Raw Text"

A solution to reading in large amounts of text while avoiding messiness is to invoke unobtrusive structure around the text while issuing commands to remove unwanted characters. Note that the text in figure 1 has raised bullet points and "\n" new-line designations, and is populated with dashes. Regular expression pattern matching (regex) is a way to clean the textual data while creating structure. Figure 2 gives a taste of regex work. Perl users will easily relate to the humorous notion of "Perl Golf" on line 1: going through several rounds of near misses (sometimes dozens) in laying down code to get an exact outcome with the shortest amount of code possible. In Figure 2, lines 3 and 4 of the code initiate text cleansing using base functions in R, while line 6 starts searching for places in the character vectors to split the document.
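The cleaning and splitting steps described above might look something like the following minimal sketch. It is not the code from figure 2: the sample lines, the `clean_text` helper, and the "SECTION n" header pattern are all hypothetical stand-ins, and only base R functions are used.

```r
# Hypothetical raw lines standing in for text like that in figure 1.
raw <- c(
  "SECTION 1",
  "\u2022 Teachers reflect - often\non practice",
  "SECTION 2",
  "\u2022 Students write -- essays\nweekly"
)

# Cleaning: strip bullets, new-line markers, and dashes with base R.
clean_text <- function(x) {
  x <- gsub("\u2022", "", x, fixed = TRUE)  # remove bullet characters
  x <- gsub("\n", " ", x, fixed = TRUE)      # turn embedded new lines into spaces
  x <- gsub("-+", " ", x)                    # replace runs of dashes with a space
  x <- gsub("[[:space:]]+", " ", x)          # collapse repeated whitespace
  trimws(x)
}
cleaned <- clean_text(raw)

# Splitting: find lines that look like section headers (an assumed pattern)
# and group each header with the lines that follow it.
is_header <- grepl("^SECTION [0-9]+$", cleaned)
sections <- split(cleaned, cumsum(is_header))
print(sections)
```

Note that replacing every dash, as this sketch does, would also break hyphenated words, so a real pipeline would likely need a narrower pattern arrived at through the usual rounds of near misses.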

Fig 2. "Regular expression pattern matching"

Recalling that educational researchers are not necessarily computer programmers first and foremost, the problem remains that publication in top journals requires data artifacts that reflect clear ontologies alongside the reproducible audit, which reveals the researcher's epistemological processes. Here is where XML can considerably add integrity to the workflow with text.

XML is an architectonic data format. It is analogous to both a spreadsheet and a tree. Like a spreadsheet, it has data containers that act like cells. Because spreadsheets are rectangular, they mainly require a latitude/longitude (row and column) system to retrieve cell contents. Unlike a spreadsheet, XML uses pathways, or branches, to retrieve contents.
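To make the contrast concrete, here is a small sketch that retrieves the same piece of text first by row and column, then by path. It assumes the `xml2` package, which this post does not otherwise name, and the `corpus`, `essay`, `title`, and `body` tags are invented for illustration.

```r
library(xml2)

# Rectangular data: a cell is addressed by its row and column.
sheet <- data.frame(
  title = c("Essay A", "Essay B"),
  body  = c("First text.", "Second text."),
  stringsAsFactors = FALSE
)
from_sheet <- sheet[2, "body"]

# Tree data: content is addressed by a path through nested tags.
doc <- read_xml(
  "<corpus><essay><title>Essay A</title><body>First text.</body></essay><essay><title>Essay B</title><body>Second text.</body></essay></corpus>"
)
from_tree <- xml_text(xml_find_first(doc, "/corpus/essay[2]/body"))

print(from_sheet)
print(from_tree)
```

Both lookups return the second essay's text; the difference is that the XML path names each branch it walks along, while the spreadsheet lookup names only a position in the grid.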

XML adds integrity to educational research with text as the XML structure allows for strict handling of character vectors (words, sentences, paragraphs, etc.), and it imposes well-founded architecture upon its data contents. What does this mean? Think of XML as a kind of spreadsheet in a tree-like set of relations (See figure 3). Each of the XML tags (think HTML) contains data in relation to other data. Unlike numerical data, word-based writing has hundreds of years of cultural conventions governing its appearance and structure, so concepts can refuse to look like a rectangular dataset. XML allows educational researchers to respect cultural conventions while exploring non-rectangular word-based meanings in context.

Fig 3. "The TEI standards as architectonic"

An example of non-rectangular data that might be of crucial interest to educational researchers is how students write up their thoughts in the conventional five-paragraph essay. Cultural conventions demand structuring human thought on paper, and this gets played out in how well students are able to use Western writing conventions to express themselves. However, stylistic conventions allow a student writer to use echoing tropes (for example) at the end of the document, pulling from earlier in the writing, which defies data rectangularity. It introduces a kind of recursion into the ontological field of document research. We cannot simply command the computer to pull data from columns "A" and "B" to perform correlation analysis in these situations. We first need to acknowledge any non-rectangularity and work with locations in the structure of the document. Then the researcher can coerce the text into constituent shapes for further processing.

The Text Encoding Initiative (TEI) standards are widely used to reflect culturally sanctioned human expression, and they are a set of rules that tell XML how to form a document. Think of the TEI standards as a theatre script blocking exercise that helps the actor know when to move, what to say, how to say it, and so forth. The only catch is that the exercise is taking place within a 12" x 12" floor tile. The actor must make some kind of mental mapping of things, so blocking instructions are paramount, should be richly annotated, and must be followed. In this example, the TEI standards help the actor to see where things go on the set, and even what they might look like, well in advance of ever setting foot on stage. The more detailed the blocking notes, the more instruction the actor has to inform the performance on stage. XML works with tags, guiding the actor to read the script with as few mistakes as possible, while conveying the weighty meaning of the text. The curse of messiness is avoided altogether with extensive tagging.

Figure 3 shows the structure of a simple portion of an XML file marked up with TEI architecture. We might read the document as follows, with the instructions spoken under the breath and words in quotation marks spoken in full-voice recitation, starting from top to bottom and reading from left to right:
The general document information is closed; move towards the text, starting from the body of the writing, at the part where the first division is, where a header will be found. "Title".

As the actor recites the contents of figure 3, it becomes apparent that a lot of instruction must be processed to get at the first recited statement, which is "Title." This kind of specificity is great news for scholars of written text, as TEI instructions in an XML file preserve the culturally accepted writing structures that give the text its meaning. There are hundreds of standardized tags that can specify the look, feel, and classification of poetry, newspapers, essays, and other formats of written human expression. These tags help the computer read the instructions like the actor working through the blocking exercise.
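The actor's recitation can be sketched in R as a walk down the document path. The snippet below is a hand-written minimal TEI skeleton, not the file shown in the figure, and it again assumes the `xml2` package. TEI files declare a namespace, so the sketch strips it first to keep the path readable.

```r
library(xml2)

# A minimal, hand-written TEI skeleton (illustrative only).
tei <- read_xml('<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample document</title></titleStmt>
      <publicationStmt><p>Unpublished</p></publicationStmt>
      <sourceDesc><p>Born digital</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div>
        <head>Title</head>
        <p>First paragraph.</p>
      </div>
    </body>
  </text>
</TEI>')

# Drop the default namespace so the path can be written without prefixes.
xml_ns_strip(tei)

# Skip the header, enter the text, then the body, the first division, and its head.
recited <- xml_text(xml_find_first(tei, "/TEI/text/body/div[1]/head"))
print(recited)
```

Every step of the path corresponds to one of the instructions spoken under the breath; only the final node holds the words recited aloud.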

Educational researchers can take advantage of the tags to indicate text boundaries, turns at talk, and even student assignments, especially if the research question tugs at humanistic data. We can take what is practical from XML tagging, use TEI conventions, and go through the data branches to find the text to answer our specific research questions. Using the TEI conventions makes intuitive sense, but the conventions pose their own problems.

Educational researchers using qualitative research paradigms might find TEI (for all its standardization) too much. For novice programmers trying to slam down code in pursuit of answering a research question, the same great vocabulary of tags that makes TEI so specific can get in the way of data analysis. Some researchers will spend months (if not years) paying homage at the altar of TEI, regarding it as a sacred grammar that should always be invoked in every circumstance. But when practicality and time combine forces, and ontological integrity is still essential, a major problem resolves itself when educational researchers blaspheme against TEI and write their own tags.

Recall that XML can surround text with integrity through its specific, well-grounded architecture, and that it overcomes the problems of Perl golf, encoding issues, and character readability. Even so, diving head first into TEI can be daunting. Even though TEI is widely recognized and is supported by people, conferences, email lists, manuals, and so on, it is an exacting grammar. This grammar can be off-putting to a pragmatic researcher. The end of research is, indeed, to bring the writing to publication. Dancing through the beautiful verdure of TEI and enjoying its rich landscape can slow down the production pipeline considerably. Therefore, a best practice in text-based educational research is to create your own stylesheet, or just use very basic architecture to structure your texts or corpora.

An emerging workflow for educational researchers

The current example takes a double-columned PDF file to construct a pipeline. A first step might consist of exporting the PDF to a word processor with web viewing capabilities, viewing the word document as a web page, and cutting and pasting the text into an XML editor. Of course, there are probably quicker ways to do this, considering that commercial word processing software has its own XML stylesheets (which can be very verbose), but as an emerging workflow, it is easier to go from webpage format to an XML editor.

Fig 4. "Dropping TEI and gaining directness"

Once the web view contents are pasted into the XML editing software, you can begin the work of creating architecture around the text. Figure 4 shows a direct path to text parts. The sample shows the structure invoked to tag an academic article in XML. Compared to figure 3, the path, or branches, in figure 4 are shorter (see the orange arrow pointing to the directness of the path). We go from the root tag (the base of the "tree" in our earlier example) directly to text. Figure 3 shows arrows pointing to the elaborate TEI structure (it actually cuts off the previous meta-tags that capture information "about" the document), which translates into more code when going down the document path.
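A short sketch of that directness, assuming the `xml2` package: the `article` and `section` tags below are invented for illustration and are not necessarily the tags used in the figure.

```r
library(xml2)

# Hypothetical home-grown tags: a root element with the text directly beneath it.
doc <- read_xml(
  "<article><section>Introduction text.</section><section>Methods text.</section></article>"
)

# The path runs from the root straight to the text parts.
section_text <- xml_text(xml_find_all(doc, "/article/section"))
print(section_text)
```

The path here has two steps, where a TEI path to comparable content passes through the text, body, and division elements before reaching anything readable.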

Perhaps the most important part of working with XML is the educational researcher's ability to create classification variables. Think of these as factors in statistical research. Figure 5 adds a little bit to the path of the text, but by adding the "type" attribute, we can make the XML document create a factor out of the article.

Fig 5. "Adding variables"

When it comes time to extract text and perform analysis on it, there are now numerous possibilities to work with variables and statistical modeling.
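As one possible illustration, the sketch below pulls a "type" attribute out of each article and turns it into an R factor alongside the text. It assumes the `xml2` package; the tag names, attribute values, and sample sentences are all hypothetical.

```r
library(xml2)

# Hypothetical corpus: each article carries a "type" attribute.
corpus <- read_xml('<corpus>
  <article type="empirical"><text>We surveyed teachers.</text></article>
  <article type="theoretical"><text>We propose a framework.</text></article>
  <article type="empirical"><text>We observed classrooms.</text></article>
</corpus>')

articles <- xml_find_all(corpus, "/corpus/article")

# One row per article: the attribute becomes a factor, the text stays character.
df <- data.frame(
  type = factor(xml_attr(articles, "type")),
  text = xml_text(xml_find_first(articles, "./text")),
  stringsAsFactors = FALSE
)

print(table(df$type))
```

Once the attribute is a factor, it can be used like any other grouping variable, for example to compare word counts or term frequencies across article types.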


This discussion has highlighted a workflow problem preceded by the principle of maintaining ontological integrity of data for word-based educational researchers. When working on character-based hypothesis testing or exploratory text mining, it is critical to decide whether to (1) read raw, unruly text into the R environment and rely on regexes to provide architecture, (2) use XML structures to maintain textual integrity (inevitably, regex work will be needed even when working with XML structures), (3) rely on TEI tagging conventions, or (4) create a home-grown set of XML tags. The answer to the question is, to some degree, a matter of personal comfort with programming languages and data structures. However, being an educational researcher presents a unique condition, as peer review requires artifacts that reflect transparent data ontologies. Simply zipping up .txt files arguably does not rise to the occasion. XML creates the data artifact, gives explicit tags, and allows educational researchers to create variables. Because of these principles, creating an XML pipeline is a sound choice for word-based educational researchers. Creating one's own home-made XML tags goes one step further toward addressing the issue of directness.
