Text Analysis Tools and Infrastructure in 2024 and Beyond

In 1983, Dr. Monte Shelley was working as assistant director of the David O. McKay Institute for Educational Research within the College of Education at Brigham Young University (BYU). At the time, Dr. Shelley also served on the Curriculum Committee for the Church of Jesus Christ of Latter-day Saints (Church). In this position, he noticed that writers for the Church had difficulty finding supporting scriptural references when creating lessons for Sunday school manuals. He envisioned a computer program that would help find scriptural references quickly and easily. During this same period, the IBM PC was becoming a viable and affordable microcomputer. Dr. Shelley recognized that advances in computer hardware, operating systems, data storage, and programming languages could make this endeavor possible. He hired James Rosenvall as a software engineer, and they began work on the research project that would eventually become the WordCruncher Text Indexing and Retrieval software application.

Throughout the mid-1980s, Dr. Shelley and Mr. Rosenvall worked closely with Church leadership to develop a product for use by the Church membership. In April 1988, The Computerized Scriptures was released. It included a software program entitled LDS View and thirty 5.25-inch floppy disks that contained the Holy Bible, Book of Mormon, and other standard works of the Church. The program maintained a full-text index of its data, which made search and retrieval very fast. Soon after the release of LDS View, a company named Electronic Text Corporation (ETC) was granted a license to market the software commercially under the name WordCruncher. ETC developed several texts for sale, including The Constitution Papers and the Complete Works of William Shakespeare. However, the company was unable to find a viable market for electronic texts at the time, and it dissolved in 1990.

After the demise of ETC, development of the WordCruncher software continued at BYU. In the fall of 1989, Jason Dzubak joined the team as a student developer charged with porting the software to Microsoft Windows. In 1993, Mr. Dzubak was hired full time at BYU as a software engineer. He currently serves as the Director of the Research Technology Group, primarily in charge of WordCruncher development, in the Office of Digital Humanities at BYU.

Throughout the 1990s and early 2000s, WordCruncher continued to be a valuable asset to both the Church and BYU. Multilingual features were added to WordCruncher during this period as the Unicode standard was adopted. The Church released updated versions of its Computerized Scriptures in April 2000 and again in October 2002. These releases contained the Holy Scriptures in more than 30 languages, including Hebrew and Greek.

This was also a time when WordCruncher served as the engine behind several individual projects. For example, in 1997, the Institute for the Study and Preservation of Ancient Religious Texts (ISPART), which later became the Neal A. Maxwell Institute for Religious Scholarship (Maxwell Institute), used the WordCruncher software to develop The Dead Sea Scrolls Electronic Reference Library in partnership with Brill publishing and Dr. Emanuel Tov from Hebrew University. This product included high-resolution photographs of all scrolls and fragments that had been documented at the time. The software was updated to allow the inclusion and searching of morphological, lemma, definition, and other word metadata, and this metadata was also included with the Dead Sea Scrolls product. In 2006, a revised edition was released, and in 2013, a version that included the Biblical scrolls was released on the WordCruncher Bookstore.

By the early 2010s, the WordCruncher team recognized that the advances added to the software through various projects over the years had made WordCruncher a significant tool for text processing, study, and analysis. However, the team also realized that because the software had been used almost exclusively for proprietary Church projects and one-off, individually titled products, WordCruncher had almost no name recognition in either industry or education. Because of this, a shift was made to generalize the software and make it available to more people.

Today, work is ongoing to make the WordCruncher software more accessible to students and faculty. The WordCruncher team, working with corpus linguists such as Dr. Mark Davies (BYU), is incorporating up-to-date statistics and analysis tools (collocation, n-gram phrase comparisons, dispersion, etc.) into the software. A WordCruncher Bookstore has been created where many public domain and paid texts and corpora can be downloaded and used by students and scholars. The software interface is getting a facelift to bring it up to modern standards, and the functionality is being simplified to make it easier for users to compile and process their own data. While WordCruncher will probably remain primarily a desktop application, work is being done to extend its reach to other operating systems and devices and to help make WordCruncher a destination tool for text and language analysis.

WordCruncher is an excellent option for students, researchers, or anyone interested in a text analysis tool with beginning to advanced capabilities. It is especially adept at analyzing texts marked up with metadata, such as the part-of-speech-tagged TED Corpus and the morphologically coded Dead Sea Scrolls Electronic Reference Library, but it is just as capable of helping a beginning researcher analyze simple text files.

WordCruncher’s tools fall generally into three categories: search, study, and analysis. You can use WordCruncher to...

Study

Read two translations of a text synchronously

View multiple translations of a text in English, Portuguese, German, Japanese, Czech, and a host of other languages side by side (as available). The WordCruncher Bookstore offers translations of texts like The Scriptures, the Quran, and the TED Corpus.

View metadata

Hover over a word in the Hebrew Bible to see that אֱלֹהִ֑ים (God; gods) is masculine, plural, a noun, and absolute.

Record (and search) notes

Take notes, search them, and highlight the text.

Analyze

Identify key words and phrases (n-grams) in Hamlet

Calculate key words and phrases up to 9 words long. In the Riverside Shakespeare collection, identify frequently used words and phrases in Hamlet: my lord (179 hits), king (172), no more (24), and good my lord (15).
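
The idea of counting words and phrases of several lengths can be sketched in a few lines of Python. This is only an illustration of n-gram counting in general, not WordCruncher's implementation; the function names, the simple tokenizer, and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(tokens, max_n=9):
    """Count every contiguous phrase from 1 to max_n tokens long."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Invented sample text, not a quotation from Hamlet
sample = "Good my lord, no more. My lord, good my lord, no more of that."
counts = ngram_counts(tokenize(sample), max_n=3)

print(counts["my lord"])       # 3
print(counts["good my lord"])  # 2
print(counts["no more"])       # 2
```

Because phrases are counted across punctuation here, a fuller version would probably reset the window at sentence or line boundaries.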

Find uncommon, unusual, or misspelled words and phrases

Identify low-frequency words and phrases with the Phrase Compare report. Across the entire TED Corpus, each of these words and phrases is used only once: mobile phone photographs, related with, milky waves, floriculture, and perflourooctanoic acid. Shakespeare uses each of these words only once across all of his works: abiliments (Antony and Cleopatra), hugger-mugger (Hamlet), irregulous (Cymbeline).
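
Words and phrases that occur exactly once can be found with a simple frequency filter. The sketch below is a generic illustration, not the Phrase Compare report itself; the function name `hapaxes` and the sample sentence are hypothetical.

```python
from collections import Counter

def hapaxes(tokens, n=1):
    """Return the n-word phrases that occur exactly once, in order of first appearance."""
    phrases = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(phrases)
    return [p for p in phrases if counts[p] == 1]

# Invented sample text for illustration
tokens = "the milky waves met the shore and the shore met the milky foam".split()

print(hapaxes(tokens))       # ['waves', 'and', 'foam']
print(hapaxes(tokens, n=2))  # two-word phrases that occur once
```

On a real corpus, a list like this mixes rare vocabulary with misspellings, so the results still need to be read by a person.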

Compare key words and phrases between Shakespeare and a word frequency list

How does Shakespeare’s language differ from contemporary American English? Compare a word frequency list (such as the one from COCA) with all of Shakespeare. His language has a higher normalized usage of words and phrases like thou, sir, no more, I pray you, and I will not than COCA does, and the following words and phrases occur in Shakespeare’s works but not at all in COCA: ’tis, doth, beseech, and prithee.
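
A comparison of this kind depends on normalizing raw counts by corpus size, commonly as occurrences per million words. The following is a minimal sketch under that assumption; the counts and corpus sizes are invented for illustration and are not taken from Shakespeare or COCA, and the function names are hypothetical.

```python
from collections import Counter

def per_million(counts, total_tokens):
    """Convert raw counts to occurrences per million tokens."""
    return {w: c * 1_000_000 / total_tokens for w, c in counts.items()}

def compare(target_counts, target_size, reference_counts, reference_size):
    """Return (items found only in the target, ratio for items overused in the target)."""
    target = per_million(target_counts, target_size)
    reference = per_million(reference_counts, reference_size)
    only_in_target = sorted(w for w in target if w not in reference)
    overused = {
        w: target[w] / reference[w]
        for w in target
        if w in reference and target[w] > reference[w]
    }
    return only_in_target, overused

# Hypothetical counts and corpus sizes, for illustration only
older = Counter({"thou": 500, "sir": 250, "prithee": 20, "the": 2500})
modern = Counter({"thou": 10, "sir": 100, "the": 50000})

only, overused = compare(older, 100_000, modern, 1_000_000)
print(only)      # ['prithee']
print(overused)  # {'thou': 500.0, 'sir': 25.0}
```

Without normalization, the much larger reference corpus would dominate every raw count, which is why the comparison is made on rates rather than hits.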

View diachronic change in word usage

The usage of among has declined over time, as seen in General Conference Addresses 1839 to Present. Its use was most frequent between 1839 and the 1930s (2280 hits in the 1880s and 1587 in the 1910s), but usage began to decline in the 1900s. By the 2010s, usage was reduced to less than a third of what it had once been (adjusting for differences in size), with only 438 hits for among.

Find word patterns in concordance lines

How is pay used in the TED Corpus? People will often pay a bribe, and they’ll ask what you pay and how much would you pay? As a verb, pay has the strongest collocational relationship with the noun attention: most often people will pay attention, but we can also pay too much attention, pay close attention, and pay special attention. When pay is used (much less frequently) as a noun, people frequently talk about equal pay.
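
Concordance lines and collocates can be illustrated with a small keyword-in-context (KWIC) sketch. It shows the general technique only; how WordCruncher builds its concordances or scores collocations is not described here, and the function names, window sizes, and sample sentence are assumptions.

```python
from collections import Counter

def concordance(tokens, node, width=3):
    """Return keyword-in-context lines with `width` tokens of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

def right_collocates(tokens, node, span=3):
    """Count the words that appear within `span` tokens to the right of the node word."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[i + 1:i + 1 + span])
    return counts

# Invented sample text, not a quotation from the TED Corpus
tokens = "people pay attention when you pay close attention and they pay a bribe".split()

for line in concordance(tokens, "pay"):
    print(line)
print(right_collocates(tokens, "pay").most_common(1))  # [('attention', 2)]
```

Raw co-occurrence counts favor frequent words, so a collocation report would normally go a step further and rank candidates with an association measure such as mutual information or log-likelihood.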

More

For a full guide to WordCruncher, visit the Guide page.

As software supported by BYU, WordCruncher is in a unique position to meet the needs of students, scholars, and independent researchers within the digital humanities and beyond. The future asks us as text analysis developers to provide more than just software updates for our tools: we must address questions of accessibility, sustainability, and reinvention in meeting the needs of the broader community.

In line with BYU’s desire to become the “acknowledged language capital of the world” (Kimball, 1975), WordCruncher currently offers select texts in multiple languages, with display menus and windows in English, French, German, Portuguese, and Spanish. For any tool, expanding the number of multi-language texts and language display options is a natural next step toward becoming more accessible to the broader research community, but this is especially the case for text analysis. The potential for collaboration is another area for expansion. In the past, WordCruncher has collaborated primarily with individuals within BYU and the Church, but within the context of the greater digital humanities community this reach is limited.

The future of text analysis must involve developing software with the users in mind, leading to greater accessibility and functionality. In terms of sustainability practices, WordCruncher can perhaps inform a discussion of the development of digital humanities tools. Over the nearly 40 years that WordCruncher has existed, it has undergone significant software improvements and additions to the bookstore as requested by users, including the display of part-of-speech tagging, reports to generate collocations and n-grams, flexibility in word calculation and markup, and the additions of The Dead Sea Scrolls Electronic Reference Library and the Scriptures in more than 30 languages. Despite these improvements, WordCruncher has remained a Windows-only application, a hindrance to many who may wish to do text analysis but do not have access to a Windows computer. Current efforts include the development of a redesigned, cross-platform WordCruncher.

A user-focused approach to digital humanities tools looks beyond the creators’ vision for the program and creates a reciprocal relationship between creator and user. While not every suggestion can realistically be integrated into the software or the bookstore, user-informed decision-making is a pattern of sustainable, accessible programs. To follow this pattern, user experience (UX) and user interface (UI) work can provide invaluable suggestions for user-informed software. As WordCruncher works toward a redesign, a specialized UX/UI team has worked to map the software, interview users, and simplify the interface in order to make user-focused suggestions that circle back to the essential reciprocal relationship between software creators and users. The user-focused redesign will involve user-targeted documentation and user testing to ultimately create an accessible cross-platform tool.

With the goal to produce tools that will benefit all, we ask the panel where to start:

  • How can we and other text analysis tools collaborate to best serve the community?
  • What texts are of value and should be made available in our programs?
  • What can AI offer text analysis tools?
  • What are best practices in expanding texts and language capabilities?

Alsulami, A. (2022). Narratives of Saudi women: A corpus-assisted analysis of the discursive construction of personal identities on Twitter in the context of Saudi Vision 2030 [Conference poster]. BAAL 2022, Queen’s University Belfast, Northern Ireland.

Davies, M. (2008–). The Corpus of Contemporary American English (COCA). https://www.english-corpora.org/coca/

Hilton, J., Vincent, J., & Harper, R. (2022). “Last at the cross”: Teachings about Christ’s crucifixion in the Woman’s Exponent, the Relief Society Magazine, and the Young Woman’s Journal. BYU Studies Quarterly, 61(3), 31–58.

Johnson, A. (2023). Peanuts corpus [Unpublished master’s thesis]. Brigham Young University.

Kimball, S. W. (1975). The Second Century of Brigham Young University. BYU Speeches. https://speeches.byu.edu/talks/spencer-w-kimball/second-century-brigham-young-university/