Regular expressions and the Fonology package

Their use for phonemic transcription

R
regex
Fonology
CRSH
SSHRC
Author

Matéo Lévesque

Published

April 5, 2026

The role of lexical and post-lexical statistics in second language acquisition

How can we map phonological patterns in a language using written data?

We see text everywhere, so it’s reasonable to assume that gathering data has never been easier. The problem is the well-known mismatch between letters (graphemes) and sounds (phonemes): we cannot fully understand phonological systems by simply examining how letters are distributed in a given corpus. Thus, to map phonological patterns, we first need to convert graphemes into phonemes.

This series is dedicated to a SSHRC-funded project (grant no. 141280) examining how lexical statistics can be used to generate a baseline for comparison with experimental data. Part of this project involves developing grapheme-phoneme conversion tools, because it is very difficult to examine phonological patterns in written data without access to phonetic transcription. The Fonology package is directly connected to this project, covering Portuguese, French, Italian, and Spanish. Matéo Lévesque worked on grapheme-phoneme conversion scripts for French. These posts are part of the project’s knowledge mobilization efforts.

Guilherme D. Garcia

Introduction

As we saw in the previous article, the goal of transcription in Fonology is to take textual data in French and encode it phonologically. In this article, we will take a closer look at how to use regex to perform phonemic transcription in French.

Transcription

Transcription is divided into several important steps that follow a specific structure. These steps are: cleaning, handling exceptions, applying transcription rules, and final cleaning.

Cleaning

Before transcribing words, the data must be cleaned. To do this, all characters are converted to lowercase and all punctuation marks are removed. Once the cleaning is done, transcription can begin.
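The cleaning step can be sketched in a few lines. The example below uses Python's `re` module purely for illustration; it is not the Fonology code, and the function name `clean` is hypothetical. It also assumes that punctuation is replaced by a space rather than deleted outright, so that forms such as “l'ami” do not fuse into a single string.

```python
import re

def clean(text: str) -> str:
    """Lowercase the text and strip punctuation, keeping accented letters."""
    text = text.lower()
    # In Python 3, \w matches Unicode letters such as "é" and "â".
    # Anything that is neither a word character nor whitespace becomes a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse the runs of whitespace left behind by the removed punctuation.
    return re.sub(r"\s+", " ", text).strip()

print(clean("Bonjour, Monsieur !"))  # bonjour monsieur
```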

Exceptions

In French, spelling is relatively regular,[1] but many words are still spelled irregularly. These words, such as “monsieur”, “hier”, or “yeux”, must be transcribed first; otherwise, the transcription rules would rewrite the very spelling that allows us to identify them.
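One simple way to picture this step is a lookup table that is consulted before any rule runs. The Python sketch below is an illustration only: the table `EXCEPTIONS` and the function `transcribe_word` are hypothetical names, and the transcriptions are common dictionary pronunciations rather than entries taken from Fonology.

```python
# Hypothetical exception table. The three words come from the text above;
# the transcriptions are assumed, not copied from the package.
EXCEPTIONS = {
    "monsieur": "məsjø",
    "hier": "jɛʁ",
    "yeux": "jø",
}

def transcribe_word(word, rules):
    """Return the stored form for an irregular word; otherwise apply `rules`."""
    if word in EXCEPTIONS:
        # The regular rules never see this word, so they cannot distort it.
        return EXCEPTIONS[word]
    return rules(word)

# `rules` is any function from a spelling to a transcription;
# the identity function stands in for it here.
print(transcribe_word("yeux", lambda w: w))   # jø
print(transcribe_word("table", lambda w: w))  # table
```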

Applying the rules

It would be impossible to present all the rules used in the module. The following rules are therefore examples that illustrate the basic concepts needed to understand the process.

There are letters (and groups of letters) that are fairly simple to transcribe, such as:

  • “â” –> /ɑ/
  • “gn” –> /ɲ/
  • “oy” –> /waj/
  • etc.
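Rules of this kind map directly onto regex substitutions. The following Python sketch is an illustration, not the Fonology implementation; `SIMPLE_RULES` and `apply_rules` are hypothetical names, and the input is assumed to be cleaned already.

```python
import re

# Each rule is a (pattern, replacement) pair, applied in list order.
SIMPLE_RULES = [
    (r"â", "ɑ"),
    (r"gn", "ɲ"),
    (r"oy", "waj"),
]

def apply_rules(word, rules):
    """Apply every substitution rule to the word, one after the other."""
    for pattern, replacement in rules:
        word = re.sub(pattern, replacement, word)
    return word

print(apply_rules("pâte", SIMPLE_RULES))   # pɑte
print(apply_rules("royal", SIMPLE_RULES))  # rwajal
```

The outputs are only partially transcribed, since the remaining letters would be handled by other rules.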

These graphemes are regular and therefore very easy to transcribe.[2] Other rules require more caution, because the order in which rules are applied generally matters. For example, consider two rules:

  • A : “u” –> /y/
  • B : “ou” –> /u/

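If rule B is applied before rule A, every “ou” first becomes /u/, and rule A then turns that /u/ into /y/, since the program does not distinguish between the character “u” and the phoneme /u/.[3] Applying rule A first avoids this, provided rule A is restricted to a “u” that is not preceded by “o”; an unrestricted rule A would turn “ou” into “oy” before rule B could apply.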
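The effect of rule order can be checked with a short Python sketch. It is an illustration under one assumption that goes beyond the rules as listed: rule A is written with a negative lookbehind, `(?<!o)u`, so that it skips the “u” of “ou”. The names `RULE_A`, `RULE_B`, and `apply_in_order` are hypothetical.

```python
import re

RULE_A = (r"(?<!o)u", "y")  # "u" -> /y/, skipping the "u" inside "ou" (assumed restriction)
RULE_B = (r"ou", "u")       # "ou" -> /u/

def apply_in_order(word, rules):
    """Apply the substitution rules in exactly the order given."""
    for pattern, replacement in rules:
        word = re.sub(pattern, replacement, word)
    return word

# A before B: the final "u" becomes /y/, then "ou" becomes /u/.
a_then_b = apply_in_order("voulu", [RULE_A, RULE_B])
# B before A: the /u/ produced by rule B is rewritten as /y/ by rule A.
b_then_a = apply_in_order("voulu", [RULE_B, RULE_A])

print(a_then_b)  # vuly
print(b_then_a)  # vyly
```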

Temporary replacements

In some cases, even changing the order of the rules does not fix the errors. To address this, we used temporary replacements. This method makes it possible to specify whether a letter has already been transcribed or not. For example, consider the following rules:

  • A : “ées”, “és”, “ée” and “é”[4] –> /e/
  • B : “e” –> /ə/

As listed, the rules are in a problematic order: rule B rewrites the /e/ produced by rule A as /ə/, so rule A loses its effect. Reversing the order does not help either: rule B first turns the “e” of “ées” and “ée” into /ə/, so these endings come out with an extra /ə/ (/eə/) instead of /e/. It is in cases like this that temporary replacements can be used.

Keeping the order described above, we can modify rule A as follows:

  • A : “ées”, “és”, “ée” and “é” –> “E”
  • B : “e” –> /ə/

As a result, rule B no longer targets the output of rule A: that output is now an uppercase letter, which cannot occur in the input because cleaning converted everything to lowercase.
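The difference between the two versions of rule A can be sketched in Python as follows. This is an illustration only, with hypothetical function names, and it assumes lowercase input. The alternation lists the longest spelling first so that “ées” is matched as a whole.

```python
import re

def direct(word):
    """Rule A writes /e/ directly, so rule B rewrites it as /ə/."""
    word = re.sub(r"ées|és|ée|é", "e", word)  # rule A
    return re.sub(r"e", "ə", word)            # rule B

def with_placeholder(word):
    """Rule A writes the temporary character "E", which rule B ignores."""
    word = re.sub(r"ées|és|ée|é", "E", word)  # rule A, temporary output
    return re.sub(r"e", "ə", word)            # rule B only matches lowercase "e"

print(direct("mesurée"))            # məsurə
print(with_placeholder("mesurée"))  # məsurE
```

In the second output, the “E” still has to be converted into /e/ at the end of the process.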

Final cleaning and last transcriptions

If temporary replacements are used, the output at this point is usable but not yet a phonemic transcription: the temporary characters still have to be converted into the correct phonemes. This is also the stage at which doubled consonant letters are reduced to a single consonant. For example: “tt” –> /t/.
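A minimal Python sketch of this final stage is given below. It is an illustration, not the package's code: the table `PLACEHOLDERS`, the function `finalize`, and the set of consonant letters in the pattern are all assumptions. The backreference `\1` matches a second copy of whichever consonant was just captured.

```python
import re

# Temporary character -> phoneme (hypothetical table with a single entry).
PLACEHOLDERS = {"E": "e"}

def finalize(word):
    """Resolve temporary characters, then reduce doubled consonant letters."""
    for temp, phoneme in PLACEHOLDERS.items():
        word = word.replace(temp, phoneme)
    # ([...])\1 matches the same consonant letter twice in a row.
    return re.sub(r"([bcdfgklmnprst])\1", r"\1", word)

print(finalize("arrivE"))  # arive
```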

Conclusion

So, in order:

  1. Clean the data.
  2. Handle exceptions.
  3. Apply the rules that have priority (to avoid rewriting our transcriptions).
  4. Apply the remaining rules whose order matters less.
  5. Convert temporary replacements into the appropriate phonemes.
  6. Finalize the remaining transcriptions.

This summarizes how French phonemic transcription works in Fonology.
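The six steps can be put together in one short Python sketch. It is a toy illustration with a handful of rules, not the Fonology implementation: every name below is hypothetical, the exception transcriptions are assumed, and rule “u” –> /y/ is written with a lookbehind so that it skips the “u” of “ou”.

```python
import re

EXCEPTIONS = {"monsieur": "məsjø", "hier": "jɛʁ", "yeux": "jø"}  # assumed entries
PRIORITY_RULES = [
    (r"ées|és|ée|é", "E"),  # temporary character, resolved in step 5
    (r"(?<!o)u", "y"),      # must run before "ou" -> /u/
    (r"ou", "u"),
]
OTHER_RULES = [(r"â", "ɑ"), (r"gn", "ɲ"), (r"oy", "waj"), (r"e", "ə")]
PLACEHOLDERS = {"E": "e"}

def transcribe(text):
    # 1. Clean: lowercase, replace punctuation with spaces.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    output = []
    for word in text.split():
        # 2. Exceptions are looked up before any rule runs.
        if word in EXCEPTIONS:
            output.append(EXCEPTIONS[word])
            continue
        # 3 and 4. Priority rules first, then the remaining rules.
        for pattern, replacement in PRIORITY_RULES + OTHER_RULES:
            word = re.sub(pattern, replacement, word)
        # 5. Convert temporary characters into phonemes.
        for temp, phoneme in PLACEHOLDERS.items():
            word = word.replace(temp, phoneme)
        # 6. Final transcriptions: reduce doubled consonant letters.
        word = re.sub(r"([bcdfgklmnprst])\1", r"\1", word)
        output.append(word)
    return " ".join(output)

print(transcribe("Monsieur, voulu !"))  # məsjø vuly
print(transcribe("arrivée"))            # arive
```

With so few rules, many letters are left untouched (the “r” of “arrivée”, for instance), which is expected in a sketch of this size.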

Although solutions have been proposed for the issues presented above, transcription can sometimes still be incorrect. In the next article, we will examine other difficult problems when using regular expressions. We will also look at some limitations of this method.


Copyright © Guilherme Duarte Garcia

Footnotes

  1. French often requires a longer analysis window than languages such as Portuguese or Spanish, where a single character is often replaced by an unambiguous phonetic symbol.

  2. Indeed, if we had a language whose spelling perfectly represented its sounds, regular expressions would allow for perfect transcription (100% accuracy).

  3. There is therefore a parallel between the application of replacement rules (regex) and phonological rules.

  4. Here, I deliberately exclude words ending in “er”, “ai”, “ez”, and others for reasons of simplicity and efficiency.