Syllabification with Regex


Created: May 2022. Last updated: February 3, 2024.

In this tutorial, we will see how regular expressions (Regex) can help us syllabify words in any language. We will use simple examples from English, but you can adapt the code to accommodate any phonotactic pattern of interest. Our words will be in orthographic form to keep things simple. Don’t worry: the method is exactly the same for inputs that are phonetically transcribed (as they should be!). Regex is at the core of most key functions in the Fonology package. This tutorial is also part of the PaPE 2023 workshop on text analytics and phonology.

I assume you already use R and that you may be familiar with regular expressions (if you’re not, see here and see RStudio’s cheat sheet here). We will be syllabifying the three words in the vector below. Obviously, this is just a tiny sample to show you how to get started.

Code
library(tidyverse)

words = c("international", "clandestine", "crestfallen")

Starting point: CV syllables

The easiest way to start syllabifying words is to assume a CV template. This is a simplistic assumption for English, but it makes phonological sense (and it’s easier to code!) [1]. Here’s how we can think about this in terms of regular expressions: we want to replace a given vowel V with V-. To accomplish that, we need to use capturing groups. It will be clear below how useful capturing groups can be in many tasks, especially when we work with syllabification.

To replace a given string (i.e., V) with another string (i.e., V-), we will use str_replace_all(), which comes from the stringr package (loaded when you load tidyverse). Notice that the replacement must match the pattern we’re replacing, such that if we’re looking for a, the replacement must be a-. In a nutshell, we’d like to create a “variable” that repeats the input in the output. This can’t be done with simple replacement, of course. In the code blocks below, the variable CV will hold our syllabified outputs.

Code
library(tidyverse)
CV = str_replace_all(string = words,
                     pattern = "([aeiou])",
                     replacement = "\\1-")

CV
#> [1] "i-nte-rna-ti-o-na-l" "cla-nde-sti-ne-"     "cre-stfa-lle-n"

The pattern in the code above, ([aeiou]), is a capturing group because it’s in parentheses. Inside the group, we have [aeiou]. Square brackets simply mean “any of the characters inside should be matched”. As a result, we’re looking for any (orthographic) vowel. We will replace this vowel (whichever vowel we find) with the same vowel + a hyphen, hence \\1-. Number 1 here simply refers back to our group; since there’s only one group, we use \\1. The backslash is doubled because R string literals themselves treat a single backslash as an escape character.
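
To make the role of the back-reference more concrete, here is a small sketch on a single toy word ("banana" is just an illustration, not one of our three words). It contrasts a fixed replacement with a back-reference, and it shows that with two groups the numbers follow the order of the opening parentheses. The variable names are made up for this example.

```r
library(stringr)

# No capturing group: the replacement is a fixed string,
# so every vowel is overwritten by the same literal text.
fixed = str_replace_all("banana", pattern = "[aeiou]", replacement = "V-")
fixed
#> [1] "bV-nV-nV-"

# One capturing group: \\1 copies whichever vowel was matched.
copied = str_replace_all("banana", pattern = "([aeiou])", replacement = "\\1-")
copied
#> [1] "ba-na-na-"

# Two capturing groups: \\1 is the consonant, \\2 is the vowel.
# Writing \\2\\1 swaps them within each match.
swapped = str_replace_all("banana", pattern = "([bn])([aeiou])", replacement = "\\2\\1")
swapped
#> [1] "abanan"
```

The last call is not useful for syllabification; it is only there to show that groups can be reordered freely in the replacement, which is what we rely on later when we move hyphens around.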

Fixing phonotactic patterns

Now that we have syllables in our CV variable, it’s time to make the necessary (language-specific) adjustments. We can start by fixing the endings of our syllabified entries. First, we need to remove word-final hyphens, as in cla-nde-sti-ne-. Second, we need to fix -C# sequences, where a single consonant is stranded after the last hyphen, as in i-nte-rna-ti-o-na-l.

Code
# Remove hyphen at the right edge of the word:
CV = CV %>% 
  str_remove_all(pattern = "-$")

# Replace -C# with C:
CV = CV %>% 
  str_replace_all(pattern = "-([bcdfghjklmnpqrstvxz]$)",
                  replacement = "\\1")

CV
#> [1] "i-nte-rna-ti-o-nal" "cla-nde-sti-ne"     "cre-stfa-llen"
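
As a possible variation (an assumption on my part, not the approach used in the rest of this tutorial), the two edge fixes can be folded into a single pattern by making the final consonant optional. The sketch below also swaps the explicit consonant list for the negated class [^aeiou], because the list above omits w and y. The word "shadow" is a hypothetical extra example, and the variable names are invented for this block.

```r
library(stringr)

# Output of the CV step for our three words
cv_raw = c("i-nte-rna-ti-o-na-l", "cla-nde-sti-ne-", "cre-stfa-lle-n")

# A hyphen, then at most one non-vowel, then the end of the string ($).
# The group may be empty, in which case \\1 inserts nothing.
cv_edges = str_replace_all(cv_raw,
                           pattern = "-([^aeiou]?)$",
                           replacement = "\\1")
cv_edges
#> [1] "i-nte-rna-ti-o-nal" "cla-nde-sti-ne"     "cre-stfa-llen"

# A word ending in w, which the explicit consonant list would leave untouched
w_final = str_replace_all("sha-do-w",
                          pattern = "-([^aeiou]?)$",
                          replacement = "\\1")
w_final
#> [1] "sha-dow"
```

The trade-off is that [^aeiou] matches anything that is not one of the five vowel letters, including digits or punctuation, so it is only safe if the input has already been cleaned.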

Next, we need to fix our illicit onsets: -nt, -rn, -nd, -stf. Let’s ignore ll in crestfallen since we’re dealing with orthography anyway. Notice that all four problematic clusters can be split into two parts to become licit onsets or codas in English: n-t, r-n, n-d, st-f. In other words, by moving the hyphen, we fix the issue. The key here is to treat st as a single group (yet another advantage of using capturing groups).

Below, we split the clusters into two groups: (1) (n|r|st) and (2) (t|n|d|f) [2]. Notice that any combination of these two groups will result in an illicit onset cluster in English (e.g., -stn, -nr, -rf, etc.). We basically want to go from -(1)(2) to (1)-(2). That’s exactly what we do here (using double backslashes before each group number).

Code
# Fix onsets:
CV = CV %>% 
  str_replace_all(pattern = "-(n|r|st)(t|n|d|f)",
                  replacement = "\\1-\\2")

CV
#> [1] "in-ter-na-ti-o-nal" "clan-de-sti-ne"     "crest-fa-llen"
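
When a pattern has more than one group, it can help to inspect what each group actually captured before doing any replacement. The sketch below uses str_match() from stringr for that, starting from the output of the previous step. It also checks that the single-letter alternation (t|n|d|f) behaves like the character class ([tndf]) on our three words. The variable names are made up for this example.

```r
library(stringr)

# Output of the edge-fixing step above
cv_edges = c("i-nte-rna-ti-o-nal", "cla-nde-sti-ne", "cre-stfa-llen")

# str_match() returns a character matrix with one row per input string:
# column 1 is the full match, columns 2 and 3 are groups 1 and 2.
# Only the FIRST match in each string is reported.
groups = str_match(cv_edges, pattern = "-(n|r|st)(t|n|d|f)")

groups[1, ]  # "-nt"  "n"  "t"
groups[2, ]  # "-nd"  "n"  "d"
groups[3, ]  # "-stf" "st" "f"

# Alternation over single letters vs. a character class:
with_pipes = str_replace_all(cv_edges, "-(n|r|st)(t|n|d|f)", "\\1-\\2")
with_class = str_replace_all(cv_edges, "-(n|r|st)([tndf])", "\\1-\\2")
identical(with_pipes, with_class)
#> [1] TRUE
```

Group 1 has to stay an alternation because st is two characters long, which a character class cannot express.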

Finally, we may want to improve the syllabification of clandestine. More specifically, we want the following orthographic syllabification: clan-des-ti-ne (which is exactly how you’d syllabify this word in a language like Portuguese). To do that, we want every s in a cluster that follows an open syllable to become the coda of said syllable. That’s what the line of code below does. Here, we’re assuming that the clusters st, sp, sn, sm, sl should be broken into coda-onset sequences (if they follow an open syllable). Naturally, this rule can/should be improved. The point here is to show how we can easily accomplish these substitutions using capturing groups.

Code
# Fix "clandestine":
CV = CV %>% 
  str_replace_all(pattern = "([aeiou])-s([tpnml])", 
                  replacement = "\\1s-\\2")

CV
#> [1] "in-ter-na-ti-o-nal" "clan-des-ti-ne"     "crest-fa-llen"

Clearly, working out all the necessary substitutions for an entire language is not an easy task. But the overall structure of our code will follow the same rationale we see above. With regular expressions and capturing groups, manipulating strings is surprisingly straightforward.
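
One way to keep this structure manageable as the rule set grows is to wrap the steps in a function. The sketch below simply chains the five substitutions from this tutorial in the same order. The name syllabify_cv() is hypothetical: it is not a function from stringr or from the Fonology package.

```r
library(stringr)

# Hypothetical helper that applies the tutorial's rules in order.
# Order matters: each step assumes the output of the previous one.
syllabify_cv = function(x) {
  x = str_replace_all(x, "([aeiou])", "\\1-")                 # V -> V-
  x = str_remove_all(x, "-$")                                 # drop word-final hyphen
  x = str_replace_all(x, "-([bcdfghjklmnpqrstvxz]$)", "\\1")  # -C# -> C#
  x = str_replace_all(x, "-(n|r|st)(t|n|d|f)", "\\1-\\2")     # move hyphen into the cluster
  x = str_replace_all(x, "([aeiou])-s([tpnml])", "\\1s-\\2")  # s becomes a coda
  x
}

syllabified = syllabify_cv(c("international", "clandestine", "crestfallen"))
syllabified
#> [1] "in-ter-na-ti-o-nal" "clan-des-ti-ne"     "crest-fa-llen"
```

Because the function is vectorized through stringr, it can be applied to a whole column of words at once. New language-specific rules would be added as further lines, placed after the steps whose output they rely on.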


Copyright © 2024 Guilherme Duarte Garcia

Footnotes

  1. For Hawaiian or any other (C)V language, we’d be done.

  2. The pipe, |, means “or” in Regex.