Phrase Compare (N-Grams)
The Phrase Compare Report compares recurring phrases in one or more texts based on n-grams. An n-gram is a sequence of n items (words or symbols) from a given sequence of text or speech. For example, "I want to talk to you" is a 6-gram because it contains six words.
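As a rough illustration of the definition above, the following Python sketch lists the n-grams in a sentence. It assumes words are separated by whitespace; the function name `ngrams` is hypothetical, and this is not how WordCruncher itself tokenizes text.

```python
def ngrams(text, n):
    """Return every run of n consecutive words in text, in order."""
    words = text.split()  # assumption: words are separated by whitespace
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "I want to talk to you"
print(ngrams(sentence, 2))
# ['I want', 'want to', 'to talk', 'talk to', 'to you']
print(ngrams(sentence, 6))
# ['I want to talk to you']
```

A six-word sentence contains exactly one 6-gram (the whole sentence) and five 2-grams.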
This report can be accessed by either:
- Selecting Analyze > Book Reports > Phrase Compare Report (N-grams) in the navigation bar.
- Clicking on the Book Reports (Analyze) dropdown menu in the top right corner of your open book's window, and then selecting Phrase Compare Report (N-grams).
Compare and Calculate
- Under Book 1, select a book name. Add bounds to specify just a part of the book, if desired.
- Under Book 2, select the other book name and modify the bounds, if desired.
- Select the size of n-gram by changing the “Phrase length” in the options section.
- Press the Compare button.
- Under Book 1, select a book name and word list.
- Click on the down arrow next to Bounds and select either Table of Contents or Section Bounds (Table of Contents is more likely).
- Check the box next to the section(s) you want for Book 1.
- Repeat for Book 2, selecting the same book name and word list, but select a different section for the bounds.
- Under Book 2, select the Name dropdown menu and select the option Open a Phrase Frequency List.
- Select a tab-separated CSV or TXT file that is formatted in the way described below.
Ignore case – This is added to an exported phrase frequency list if you selected Ignore case before pressing the Compare button.
Ignore diacritics – This is added to an exported phrase frequency list if you selected Ignore diacritics before pressing the Compare button.
Book Name – This is added to an exported phrase frequency list to indicate which book it came from.
Total Frequency of Relative Corpus – This is a comparison option that is essential if you only have a sample of the word frequency list. For example, you might only have the words that occur at least 20 times. In this case, find the total word frequency of the relative corpus and add that number here. Otherwise, the comparison will assume the total frequency is based on the 1-grams you have in your frequency list.
- Under Book 1, select a book name. Add bounds to specify just a part of the book, if desired.
- Select the size of n-gram by changing the “Phrase length” in the options section.
- Leave Book 2 empty. It should say “None” in the book name box.
- Press the Compare button.
- Repeated (all n-grams that occur at least twice).
- Not repeated (all n-grams that occur only once).
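The repeated / not repeated split can be sketched in Python as a simple filter on n-gram counts. The function name `split_by_repetition` and the sample text are hypothetical, and the sketch only illustrates the two definitions above rather than WordCruncher's own implementation.

```python
from collections import Counter


def split_by_repetition(counts):
    """Split n-gram counts into repeated (2 or more) and not repeated (exactly 1)."""
    repeated = {phrase: c for phrase, c in counts.items() if c >= 2}
    not_repeated = {phrase: c for phrase, c in counts.items() if c == 1}
    return repeated, not_repeated


# Count the 2-grams in a short sample text (whitespace tokenization is an assumption).
words = "to be or not to be".split()
bigram_counts = Counter(" ".join(words[i:i + 2]) for i in range(len(words) - 1))

repeated, not_repeated = split_by_repetition(bigram_counts)
print(repeated)              # {'to be': 2}
print(sorted(not_repeated))  # ['be or', 'not to', 'or not']
```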
N-gram length
You can calculate n-grams from 1 to 9 words long. To change the length, adjust the Phrase length in the Options section.
All n-grams from 1 to the length you select will be calculated. For example, by selecting a length of 5, you can also toggle between n-grams of 5, 4, 3, 2, and 1.
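To illustrate why selecting one length also yields every shorter length, here is a minimal Python sketch that counts n-grams for each length from 1 up to a chosen maximum. The function name `count_ngrams_up_to` is hypothetical and whitespace tokenization is an assumption.

```python
from collections import Counter


def count_ngrams_up_to(text, max_len):
    """Return {n: Counter of n-grams} for every n from 1 to max_len."""
    words = text.split()  # assumption: words are separated by whitespace
    counts = {}
    for n in range(1, max_len + 1):
        counts[n] = Counter(
            " ".join(words[i:i + n]) for i in range(len(words) - n + 1)
        )
    return counts


counts = count_ngrams_up_to("and it came to pass and it came", 3)
print(sorted(counts))            # [1, 2, 3]
print(counts[1]["and"])          # 2
print(counts[3]["and it came"])  # 2
```

Each length gets its own table of counts, which mirrors toggling between phrase lengths in the report.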
Ignore case and diacritics
There are options just below the book selection menus to ignore case and/or diacritics.
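The effect of these two options can be approximated in Python as a normalization step applied to each word before counting. This is only a sketch: the function name `normalize` is hypothetical, and WordCruncher's own case and diacritic handling may differ.

```python
import unicodedata


def normalize(word, ignore_case=False, ignore_diacritics=False):
    """Optionally strip diacritics and/or fold case so variants count as one word."""
    if ignore_diacritics:
        # Decompose characters, then drop the combining marks (accents).
        decomposed = unicodedata.normalize("NFD", word)
        word = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if ignore_case:
        word = word.casefold()
    return word


print(normalize("Caf\u00e9", ignore_case=True, ignore_diacritics=True))  # cafe
print(normalize("Caf\u00e9", ignore_diacritics=True))                    # Cafe
```

With both options on, "Café", "café", and "cafe" would all be counted as the same 1-gram.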
To compare n-grams, you’ll need to open 2 books first by using the File > Open Book menu. Then:
You can compare sections of the same book by using Table of Contents or Section Bounds.
You might want to compare your text to a word frequency list, such as the Corpus of Contemporary American English. Corpora like COCA usually do not provide the full text to you, but you can get a word/phrase frequency list. Using the Phrase Compare report, you can compare your text to an external word frequency list. To do this:
Formatting your word/phrase frequency list
A sample word/phrase frequency list is available on GitHub. You can format your word/phrase frequency list based on the sample. Otherwise, the instructions for formatting are below.
There are 3 required columns for your frequency list:
Len   Freq   Phrase
1     240    and
2     200    and it
3     160    and it came
…     …      …
For the Len column, add the length of your n-gram. If you are only working with individual words, this will just be 1.
For the Freq column, add the frequency of the word/phrase.
For the Phrase column, add the word/phrase. If it's 2 or more words long, use spaces between each word.
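If you are building a frequency list from your own data, a short Python script can write the three required columns as tab-separated text. This is a sketch under assumptions: the function name `write_frequency_list` and the sample rows are hypothetical, and it writes only the required columns, not the optional flags.

```python
import csv
import io


def write_frequency_list(rows, out):
    """Write (phrase, frequency) pairs as tab-separated Len, Freq, Phrase rows."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["Len", "Freq", "Phrase"])
    for phrase, freq in rows:
        # Len is the number of space-separated words in the phrase.
        writer.writerow([len(phrase.split()), freq, phrase])


buffer = io.StringIO()
write_frequency_list([("and", 240), ("and it", 200), ("and it came", 160)], buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption.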
Optional flags
Except for the last flag, these flags are added to the column header when you export a phrase frequency list. When formatting your own external frequency list, the first three are not required and will not affect the functionality of the Phrase Compare report. After the Phrase column header, you can add the following column headers:
Notes
Make sure to add an empty column/tab for each optional flag even if you don't use all of them, because WordCruncher looks for Ignore case in the 4th column, Ignore diacritics in the 5th column, and so forth. For example, if you are adding the Ignore case and Total Frequency of Relative Corpus flags, you'd want to add 3 tabs between these two flags.
Also make sure there is an empty line at the end of your frequency list file if you are using a TXT file.
To calculate the n-grams just for one book:
If you’ve calculated n-grams for just one book, you can toggle the view to show:
Export
- Run the Phrase Compare report on a book or books.
- Click on the Save results dropdown menu at the bottom.
- Select Export All (Phrase Compare).
- Provide a file name and click Save.
- Run the Phrase Compare report on a book or books.
- Click on the Save results dropdown menu at the bottom.
- Select Export Phrase Frequency List and choose Book 1 or Book 2.
- Provide a file name and click Save.
You can export any of the table views:
You can get the n-gram frequency list of any n-grams calculated. Note that if you calculate the n-grams with a length of 5, you’ll also get n-grams with a length of 1, 2, 3, and 4.
To export the list:
Calculate Type-to-Token Ratio
Type-to-token ratio data is calculated when the phrase compare report is generated, and it can be exported to a CSV or TXT file.
You can learn how to access this data here.
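As general background, a type-to-token ratio divides the number of distinct items (types) by the total number of items (tokens). The Python sketch below shows the basic calculation for single words; the function name `type_token_ratio` is hypothetical, and the exact way the report defines types and tokens is an assumption here.

```python
def type_token_ratio(words):
    """Return distinct words (types) divided by total words (tokens)."""
    if not words:
        return 0.0  # avoid dividing by zero on empty input
    return len(set(words)) / len(words)


tokens = "to be or not to be".split()
# 4 distinct words (to, be, or, not) out of 6 total words.
print(type_token_ratio(tokens))
```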
Statistics
The main statistic used for comparing phrases is the Bayesian Information Factor (BIC). Significant phrases are calculated based on the books’ relative frequencies rather than their raw frequencies. If an n-gram has a BIC higher than 2, the n-gram is considered to be statistically more prevalent in one book than the other. If an n-gram has a BIC lower than 2, the n-gram is considered statistically equal in the two books. More information about the statistics in this report is available in this PDF.
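The exact formula the report uses is not reproduced on this page. For orientation only, the Python sketch below uses one formulation that is common in corpus linguistics: a log-likelihood statistic computed from each book's frequency and total size, minus the natural log of the combined size. Treat the formula, the function names `log_likelihood` and `bic`, and the sample numbers as assumptions rather than WordCruncher's implementation.

```python
import math


def log_likelihood(freq1, total1, freq2, total2):
    """Log-likelihood of an n-gram's frequency in two books of given total sizes."""
    combined = freq1 + freq2
    # Expected counts if the n-gram were equally prevalent in both books.
    expected1 = total1 * combined / (total1 + total2)
    expected2 = total2 * combined / (total1 + total2)
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


def bic(freq1, total1, freq2, total2):
    """Assumed approximation: log-likelihood minus ln(combined corpus size)."""
    return log_likelihood(freq1, total1, freq2, total2) - math.log(total1 + total2)


# Same relative frequency in both books: BIC falls below 2.
print(bic(10, 1000, 10, 1000) < 2)     # True
# Much more frequent in Book 1 than in Book 2: BIC rises above 2.
print(bic(200, 10000, 10, 10000) > 2)  # True
```

Because the totals enter the calculation, the result depends on relative rather than raw frequencies, which is also why a correct total matters when Book 2 is a partial external frequency list.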