WordCruncher Monthly

Filtering the WordWheel with Regular Expressions

Overview

The WordWheel has become even more useful with the addition of regular expression filtering. Regular expressions (RegEx for short), for those who are unfamiliar with the term, use a system of special characters to describe a search pattern. If you want to search for any digit from 0 to 9, you can simply type \d. Without regular expressions, the alternative would be to add ten filters (one for each digit). Below are a few examples of what regular expressions let you find.
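To make the \d shorthand concrete, here is a minimal sketch in Python's re module. It is only an illustration: WordCruncher's own regular expression engine may behave differently, and the word list is hypothetical.

```python
import re

# Hypothetical sample words, not taken from a real WordWheel.
words = ["that", "10", "going", "10,000", "3D", "example"]

# \d stands for "any digit". re.ASCII limits it to 0-9; without the flag,
# Python's \d also matches digits from other scripts.
shorthand = re.compile(r"\d", re.ASCII)

# The same search written out as ten separate alternatives.
spelled_out = re.compile(r"0|1|2|3|4|5|6|7|8|9")

with_digit = [w for w in words if shorthand.search(w)]
same = with_digit == [w for w in words if spelled_out.search(w)]

print(with_digit)  # ['10', '10,000', '3D']
print(same)        # True
```

Both patterns pick out the same words, which is why one \d filter can stand in for ten single-digit filters.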

RegEx

Adding a filter

Filter for words that only contain numbers

Regular Expression: ^\d+$

Explanation: Look for one or more digits (\d+), each from 0-9, that span the whole word from its beginning (^) to its end ($).

Example WordWheel:

Frequency Word
2,861 10
2,020 20
1,503 30
1,451 100
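
The digits-only filter can be tried outside WordCruncher with a short Python sketch. The list name wordwheel and the non-numeric rows are assumptions made for illustration; the numeric rows echo the example table above.

```python
import re

# Hypothetical (word, frequency) pairs standing in for a WordWheel.
wordwheel = [
    ("10", 2861),
    ("that", 149295),
    ("20", 2020),
    ("10,000", 374),
    ("30", 1503),
    ("100", 1451),
]

# ^\d+$ : one or more digits from the start of the word to its end.
# re.ASCII keeps \d limited to 0-9.
digits_only = re.compile(r"^\d+$", re.ASCII)

numeric_words = [(w, f) for w, f in wordwheel if digits_only.search(w)]

print("Frequency Word")
for word, freq in numeric_words:
    print(f"{freq:,} {word}")
```

Note that "10,000" is filtered out: the comma is not a digit, so the pattern cannot span the whole word.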

Filter for words that don't contain numbers

Regular Expression: ^[^\d]+$

Explanation: Look for one or more characters that aren’t digits, spanning the word from its beginning (^) to its end ($). This cleans up your WordWheel substantially if you’re only interested in words and not numbers. For example, 77,866 out of 80,547 words in the TED Talk Corpus (English) don’t contain any numbers.

Example WordWheel:

Frequency Word
149,295 that
16,864 going
12,409 did
4,299 thought
3,781 example
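
As a rough check of the no-digits filter, the following Python sketch applies the same pattern to a small hypothetical word list and works out the share quoted for the TED Talk Corpus. Python's re module is used here as a stand-in; WordCruncher's engine may differ in details.

```python
import re

# ^[^\d]+$ : every character in the word is a non-digit.
# In most regex flavors this can also be written ^\D+$.
no_digits = re.compile(r"^[^\d]+$", re.ASCII)

# Hypothetical sample; "3D" and "10,000" are included to show what gets dropped.
words = ["that", "going", "10", "did", "10,000", "thought", "example", "3D"]
kept = [w for w in words if no_digits.search(w)]
print(kept)  # ['that', 'going', 'did', 'thought', 'example']

# Share of digit-free words, using the counts quoted in the article.
share = 77_866 / 80_547 * 100
print(f"{share:.1f}%")  # 96.7%
```

In other words, the filter removes only about 3% of the distinct words in that corpus, but those are the entries that clutter a word-focused list.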

Filter for words that contain common punctuation

Regular Expression: [\.,!\?\(\)] or \.|,|!|\?|\(|\)

Explanation: Either of these regular expressions will find words that contain periods, commas, exclamation points, question marks, or parentheses. This is useful because it often surfaces words that weren’t separated properly, perhaps due to a typo or missing space in the text. You could add other punctuation as well, but keep in mind that a backslash (\) needs to be added in front of characters that are considered special in regular expressions. These include ., *, +, ?, (, ), ^, $, [, ], {, }, |, and the backslash itself.

Example WordWheel:

Frequency Word
731 U.S.
512 Dr.
374 10,000
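
The sketch below, written with Python's re module, checks that the character-class form and the alternation form flag the same words, and shows one way to avoid escaping by hand. The word list is hypothetical, and re.escape is a Python helper, not a WordCruncher feature.

```python
import re

# The two patterns from the article.
char_class = re.compile(r"[\.,!\?\(\)]")
alternation = re.compile(r"\.|,|!|\?|\(|\)")

# Hypothetical sample; "end.The" imitates a missing space after a period.
words = ["U.S.", "Dr.", "10,000", "example", "end.The", "(laughter)", "thought"]

flagged = [w for w in words if char_class.search(w)]
flagged_alt = [w for w in words if alternation.search(w)]
print(flagged)                 # ['U.S.', 'Dr.', '10,000', 'end.The', '(laughter)']
print(flagged == flagged_alt)  # True

# Let Python add the backslashes: re.escape escapes the special characters
# in a literal string so it can be dropped into a pattern safely.
built = re.compile("[" + re.escape(".,!?()") + "]")
flagged_built = [w for w in words if built.search(w)]
print(flagged == flagged_built)  # True
```

The character class is usually the easier form to extend, since adding another punctuation mark means adding one character inside the brackets.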

Filter for words that have characters outside of the ASCII range

Regular Expression: ^[^\u0000-\u00ff]+$

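Explanation: The first 256 characters in Unicode (U+0000 through U+00FF) cover ASCII, which is only the first 128 characters, plus the Latin-1 Supplement of accented letters and symbols. Together they include all the characters on an English keyboard. This regular expression looks for words made up entirely of characters outside this range, which are likely words in non-Latin scripts, symbols, and emojis. To find words that contain at least one such character, drop the anchors and use [^\u0000-\u00ff]; to restrict the range to ASCII proper, use \u007f as the upper bound.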

Example WordWheel:

Frequency Word
3,761
1,666
1 ...
1 عسل
1 प्रेम
1 你好
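
The following Python sketch compares three variants of this filter: the anchored pattern as written, an unanchored version, and one limited to true ASCII. The word list is hypothetical (the last three non-English entries come from the example above), and Python's re module may not match WordCruncher's engine in every detail.

```python
import re

# "caf\u00e9" is "café" with a precomposed é (U+00E9), which lies inside
# the first 256 code points but outside ASCII.
words = ["hello", "caf\u00e9", "عسل", "प्रेम", "你好", "Tokyo東京"]

# As written in the article: EVERY character is above U+00FF.
whole_word = re.compile(r"^[^\u0000-\u00ff]+$")

# Unanchored: AT LEAST ONE character is above U+00FF.
contains_any = re.compile(r"[^\u0000-\u00ff]")

# Strict ASCII boundary: at least one character above U+007F.
non_ascii = re.compile(r"[^\u0000-\u007f]")

entirely_outside = [w for w in words if whole_word.search(w)]
partly_outside = [w for w in words if contains_any.search(w)]
beyond_ascii = [w for w in words if non_ascii.search(w)]

print(entirely_outside)  # the Arabic, Hindi, and Chinese words only
print(partly_outside)    # the same three plus 'Tokyo東京'
print(beyond_ascii)      # everything except 'hello'
```

The mixed-script word is caught only by the unanchored pattern, and the accented Latin word only by the strict ASCII boundary, so the choice of anchors and upper bound depends on what you want to surface.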

Learning Regular Expressions

These are just a few examples of what's possible with regular expressions. They can identify many categories of words, ranging from typos to Chinese characters. If you'd like to learn more about regular expressions, we recommend starting at RegexOne, an interactive set of lessons that teaches the basics.

See Other Articles from October 2021