Phrase Compare (N-Grams)

The Phrase Compare Report compares recurring phrases in one or more texts based on n-grams. An n-gram is a sequence of n items (words or symbols) from a given sequence of text or speech. For example, "I want to talk to you" is a 6-gram because it contains six words.
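To make the definition concrete, here is a minimal Python sketch that splits a phrase on whitespace and lists its n-grams. The function name `ngrams` is hypothetical, and the report may tokenize text differently (for example, around punctuation).

```python
def ngrams(text, n):
    """Return every run of n consecutive whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

phrase = "I want to talk to you"
print(ngrams(phrase, 6))  # the whole phrase is its only 6-gram
print(ngrams(phrase, 2))  # the five overlapping 2-grams
```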

This report can be accessed by either:

  • Selecting Analyze > Book Reports > Phrase Compare Report (N-grams) in the navigation bar.
  • Clicking on the Book Reports (Analyze) dropdown menu in the top right corner of your open book's window, and then selecting Phrase Compare Report (N-grams).

Compare and Calculate

    N-gram length

    You can calculate n-grams from 1 to 9 words long. To change the length, set the Phrase length in the Options section.

    All n-grams from 1 to the length you select will be calculated. For example, if you select a length of 5, you can toggle between n-grams of length 5, 4, 3, 2, and 1.
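    As a rough illustration of calculating every length up to the selected one, the sketch below builds one frequency table per length. The function name `count_ngrams` and the sample text are hypothetical; this is not the report's own implementation.

```python
from collections import Counter

def count_ngrams(words, max_len):
    """Count n-grams for every length from 1 up to max_len."""
    counts = {n: Counter() for n in range(1, max_len + 1)}
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            counts[n][" ".join(words[i:i + n])] += 1
    return counts

words = "and it came to pass and it came".split()
counts = count_ngrams(words, 5)
print(sorted(counts))         # lengths available to toggle between
print(counts[3].most_common(1))
```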


    Ignore case and diacritics

    There are options just below the book selection menus to ignore case and/or diacritics.

    Comparing N-grams when ignoring case and not ignoring case
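
    One plausible way to picture these two options is as a normalization step applied to each phrase before counting. The sketch below is an assumption about the general technique, not a description of how the report works internally; the function name `normalize` is hypothetical.

```python
import unicodedata

def normalize(phrase, ignore_case=False, ignore_diacritics=False):
    """Optionally strip diacritics and fold case before counting."""
    if ignore_diacritics:
        # Decompose accented letters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFD", phrase)
        phrase = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if ignore_case:
        phrase = phrase.casefold()
    return phrase

print(normalize("Caf\u00e9", ignore_case=True, ignore_diacritics=True))
```

    With both options on, "Café" and "cafe" would be counted as the same 1-gram; with both off, they stay distinct.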

    To compare n-grams, you’ll need to open 2 books first by using the File > Open Book menu. Then:

    1. Under Book 1, select a book name. Add bounds to specify just a part of the book, if desired.
    2. Under Book 2, select the other book name and modify the bounds, if desired.
    3. Select the size of n-gram by changing the “Phrase length” in the options section.
    4. Press the Compare button.

    You can also compare two sections of the same book by using Table of Contents or Section Bounds.

    1. Under Book 1, select a book name and word list.
    2. Click on the down arrow next to Bounds and select either Table of Contents or Section Bounds (Table of Contents is more likely).
    3. Check the box next to the section(s) you want for Book 1.
    4. Repeat for Book 2, selecting the same book name and word list, but select a different section for the bounds.

    You might want to compare your text to a word frequency list, such as one from the Corpus of Contemporary American English (COCA). Corpora like COCA usually do not provide their full text, but you can get a word/phrase frequency list. Using the Phrase Compare report, you can compare your text to such an external frequency list. To do this:

    1. Under Book 2, select the Name dropdown menu and select the option Open a Phrase Frequency List.
    2. Select a tab-separated CSV or TXT file that is formatted in the way described below.

    Formatting your word/phrase frequency list

    A sample word/phrase frequency list is available on GitHub. You can format your word/phrase frequency list based on the sample. Otherwise, the instructions for formatting are below.

    There are 3 required columns for your frequency list:

    Len    Freq    Phrase
    1      240     and
    2      200     and it
    3      160     and it came

    For the Len column, add the length of your n-gram. If you are only working with individual words, this will just be 1.

    For the Freq column, add the frequency of the word/phrase.

    For the Phrase column, add the word/phrase. If it's 2 or more words long, use spaces between each word.
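
    If you are building the list with a script, the column rules above could be sketched as follows. This writes the three required columns as tab-separated text. The function name `write_frequency_list` is hypothetical, and the line ending and the check on the Len column are assumptions; compare the result against the sample on GitHub before relying on it.

```python
import csv
import io

def write_frequency_list(rows, handle):
    """Write (Len, Freq, Phrase) rows as tab-separated text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Len", "Freq", "Phrase"])
    for length, freq, phrase in rows:
        # Len should match the number of space-separated words in Phrase.
        if length != len(phrase.split()):
            raise ValueError(f"Len {length} does not match phrase {phrase!r}")
        writer.writerow([length, freq, phrase])

rows = [(1, 240, "and"), (2, 200, "and it"), (3, 160, "and it came")]
buffer = io.StringIO()
write_frequency_list(rows, buffer)
print(buffer.getvalue())
```

    To save to disk instead, pass a file opened with `open(path, "w", newline="", encoding="utf-8")` in place of the in-memory buffer.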

    Optional flags

    Except for the last one, these flags are added as column headers when a phrase frequency list is exported. When you format your own external frequency list, the first three are not required and do not affect how the Phrase Compare report works. After the Phrase column header, you can add the following column headers:

    1. Ignore case – This is added to an exported phrase frequency list if you selected Ignore case before pressing the Compare button.
    2. Ignore diacritics – This is added to an exported phrase frequency list if you selected Ignore diacritics before pressing the Compare button.
    3. Book Name – This is added to an exported frequency phrase list to indicate which book it came from.
    4. Total Frequency of Relative Corpus – This is a comparison option that is essential if you only have a sample of the word frequency list. For example, you might only have all the words that occur at least 20 times. In this case, find the total word frequency of the relative corpus and add that number here. Otherwise, the comparison will assume the total frequency is based on the 1-grams you have in your frequency list.
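
    To see why the last flag matters, the sketch below contrasts the fallback total (the sum of the 1-gram frequencies present in the list) with a stated corpus total. The function names and the figure of 1,000,000 are hypothetical, chosen only for illustration.

```python
def corpus_total(rows, stated_total=None):
    """Use the stated total if given, else sum the 1-gram frequencies."""
    if stated_total is not None:
        return stated_total
    return sum(freq for length, freq, _phrase in rows if length == 1)

def relative_frequency(freq, total):
    """Share of the corpus accounted for by one word or phrase."""
    return freq / total

# A truncated list: only a few of the corpus's words are present.
rows = [(1, 240, "and"), (1, 60, "it"), (2, 200, "and it")]
fallback_total = corpus_total(rows)
stated = corpus_total(rows, stated_total=1_000_000)
print(relative_frequency(240, fallback_total))
print(relative_frequency(240, stated))
```

    With a truncated list, the fallback total is too small, so every relative frequency is inflated. Supplying the real corpus total avoids that distortion.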

    To calculate the n-grams just for one book:

    1. Under Book 1, select a book name. Add bounds to specify just a part of the book, if desired.
    2. Select the size of n-gram by changing the “Phrase length” in the options section.
    3. Leave Book 2 empty. It should say “None” in the book name box.
    4. Press the Compare button.

    If you’ve calculated n-grams for just one book, you can toggle the view to show:

    • Repeated (all n-grams that occur at least twice).
    • Not repeated (all n-grams that occur only once).
    Example comparison of repeated and nonrepeated phrases
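
    The two views correspond to a simple split on frequency. A minimal sketch, with a hypothetical function name and sample text:

```python
from collections import Counter

def split_by_repetition(counts):
    """Separate n-grams that occur at least twice from those that occur once."""
    repeated = {p: c for p, c in counts.items() if c >= 2}
    not_repeated = {p: c for p, c in counts.items() if c == 1}
    return repeated, not_repeated

words = "and it came to pass and it came".split()
bigrams = Counter(" ".join(words[i:i + 2]) for i in range(len(words) - 1))
repeated, not_repeated = split_by_repetition(bigrams)
print(repeated)
print(sorted(not_repeated))
```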

Export

    You can export any of the table views:

    1. Run the Phrase Compare report on a book or books.
    2. Click on the Save results dropdown menu at the bottom.
    3. Select Export All (Phrase Compare).
    4. Provide a file name and click Save.

    You can get the n-gram frequency list of any n-grams calculated. Note that if you calculate the n-grams with a length of 5, you’ll also get n-grams with a length of 1, 2, 3, and 4.

    To export the list:

    1. Run the Phrase Compare report on a book or books.
    2. Click on the Save results dropdown menu at the bottom.
    3. Select Export Phrase Frequency List and choose Book 1 or Book 2.
    4. Provide a file name and click Save.

Calculate Type-to-Token Ratio

Type-to-token ratio data is calculated when the Phrase Compare report is generated, and it can be exported to a CSV or TXT file.

You can learn how to access this data here.
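
For reference, a type-to-token ratio is conventionally the number of distinct items (types) divided by the total number of items (tokens). The sketch below applies that standard definition to n-grams. The function name is hypothetical, and the report's exported figures may be organized differently.

```python
def type_token_ratio(words, n=1):
    """Distinct n-grams divided by total n-grams (0.0 if there are none)."""
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    return len(set(grams)) / len(grams)

words = "and it came to pass and it came".split()
print(type_token_ratio(words, 1))  # 5 distinct words out of 8
print(type_token_ratio(words, 2))  # 5 distinct 2-grams out of 7
```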

Statistics

The main statistic used for comparing phrases is the Bayesian Information Factor (BIC). Significance is calculated from an n-gram's relative frequency in each book rather than its raw frequency. If an n-gram has a BIC higher than 2, it is considered statistically more prevalent in one book than the other. If an n-gram has a BIC lower than 2, its prevalence in the two books is considered statistically equal. More information about the statistics in this report is available in this PDF.
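
The threshold can be illustrated with a rough sketch. This page does not state the formula, so the following is an assumption: it uses an approximation common in corpus linguistics, where BIC is derived from the log-likelihood statistic as LL - ln(N), with N the combined size of both texts. The function names are hypothetical; consult the PDF for the method the report actually uses.

```python
import math

def log_likelihood(freq1, freq2, total1, total2):
    """Log-likelihood of one n-gram's frequencies in two texts."""
    combined = freq1 + freq2
    expected1 = total1 * combined / (total1 + total2)
    expected2 = total2 * combined / (total1 + total2)
    ll = 0.0
    if freq1 > 0:
        ll += freq1 * math.log(freq1 / expected1)
    if freq2 > 0:
        ll += freq2 * math.log(freq2 / expected2)
    return 2 * ll

def bic(freq1, freq2, total1, total2):
    """Assumed approximation: LL minus the log of the combined size."""
    return log_likelihood(freq1, freq2, total1, total2) - math.log(total1 + total2)

# Same relative frequency in both books: BIC falls below 2.
print(bic(10, 10, 1000, 1000))
# Ten times more frequent in Book 1: BIC rises well above 2.
print(bic(100, 10, 1000, 1000))
```

Because the expected counts scale with each book's size, the result depends on relative rather than raw frequency, which matches the description above.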