Regular Expressions in Fonology
The Limitations and Challenges of Regex
How can we map phonological patterns in a language using written data?
We see text everywhere, so it’s reasonable to assume that gathering data has never been easier. The problem is the well-known mismatch between letters (graphemes) and sounds (phonemes): we cannot fully understand phonological systems by simply examining how letters are distributed in a given corpus. Thus, to map phonological patterns, we first need to convert graphemes into phonemes.
This series is dedicated to a SSHRC-funded project (grant no. 141280) examining how lexical statistics can be explored to generate a baseline for comparison with experimental data. Part of this project involves the development of grapheme-phoneme conversion tools — it is very difficult to examine phonological patterns in written data without access to phonetic transcription. The Fonology package is directly connected to this project, covering Portuguese, French, Italian, and Spanish. Matéo Levesque worked on grapheme-phoneme conversion scripts for French. These posts are part of the project’s knowledge mobilization efforts.
Guilherme D. Garcia
Introduction
In the previous article, we presented a method for performing phonemic transcription of French. However, there are still several limitations imposed by regex. In this article, we will examine the limitations related to the French language as well as those inherent to regular expressions themselves. We will also provide an overview of other challenges and explore a few possible solutions.
Limitations
During the development of the French transcription module, we encountered several limitations.
Limitations of French
As mentioned in the previous article, French orthography contains many exceptional cases. The only practical solution to this limitation is to maintain a list of words that are pre-transcribed before applying the general transcription rules. The problem with such a list is that it can never be completely exhaustive. To mitigate this issue and allow users to build the most accurate exception list possible, the Fonology package provides the add_lex_...() functions, which allow you to add your own custom transcriptions to the exception list.1
French orthography also contains many cases of homophony and homography. Homophony is not particularly problematic, since we can simply specify that “ou” and “où” should receive the same transcription. Homographs, however, are much more difficult to handle. We cannot specify whether “fils” should be transcribed as /fis/ or /fil/ because regex cannot analyze the linguistic context surrounding a word. This naturally leads us to the limitations of regular expressions themselves.
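The exception list and the homophone case can be sketched together as a lookup that runs before the general rules. This is a minimal Python illustration, not Fonology's R implementation: the names `EXCEPTIONS`, `RULES`, `add_exception`, and `transcribe` are hypothetical, and the two rules are toy placeholders rather than a real rule set for French.

```python
import re

# Hypothetical exception list: hand-transcribed words, checked first.
EXCEPTIONS = {
    "ou": "u",
    "où": "u",            # homophones simply share one transcription
    "monsieur": "məsjø",  # irregular spelling, listed by hand
}

# Toy general rules, applied in order (placeholders, not a real rule set).
RULES = [
    (re.compile(r"ou"), "u"),
    (re.compile(r"ch"), "ʃ"),
]


def add_exception(word, transcription):
    """Add a custom entry; it lasts only for the current session."""
    EXCEPTIONS[word.lower()] = transcription


def transcribe(word):
    """Return the listed transcription if there is one, else apply the rules."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for pattern, replacement in RULES:
        word = pattern.sub(replacement, word)
    return word
```

Because the lookup happens first, an entry added with `add_exception("femme", "fam")` takes priority over whatever the general rules would have produced for that word.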
Limitations of Regex
The biggest limitation of regular expressions is that they were never designed for phonemic transcription. It is therefore difficult to achieve highly accurate transcriptions when the target language has an inconsistent orthography.2 Furthermore, since regular expressions have no access to phonological structures such as syllables, feet, and other prosodic units, it is difficult to account for phenomena such as the laxing of close vowels in the first syllable of certain words in Quebec French.3 This brings us to another challenge: dialectal variation.
If we wanted to support regional varieties such as Quebec French, Belgian French, Swiss French, and others, we would need to repeat all the steps described in Article 2 for each variety. While some of the standard French rules could be reused, many additional rules would have to be added and others removed (i.e., conditional rules would have to be introduced). One possible workaround would be to simply treat each regional variety as its own language, but this remains a tedious and complex process.
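One way to picture the "shared rules plus conditional rules" idea is a table of variety-specific rules layered on top of a common set. The sketch below is only an illustration in Python under stated assumptions: `SHARED_RULES`, `VARIETY_RULES`, and `transcribe` are hypothetical names, the shared rule is a toy placeholder, and the Quebec entry uses the affrication of /t, d/ before high front vowels purely as a convenient example of a variety-specific rule.

```python
import re

# Toy rules shared by every variety (placeholder, not a real rule set).
SHARED_RULES = [(r"ou", "u")]

# Hypothetical variety-specific rules, applied after the shared ones.
VARIETY_RULES = {
    "standard": [],
    # Example only: t/d become ts/dz before i or y.
    "quebec": [(r"t(?=[iy])", "ts"), (r"d(?=[iy])", "dz")],
}


def transcribe(word, variety="standard"):
    """Apply the shared rules, then the rules for the chosen variety."""
    out = word.lower()
    for pattern, replacement in SHARED_RULES + VARIETY_RULES[variety]:
        out = re.sub(pattern, replacement, out)
    return out
```

With this layout, `transcribe("midi")` and `transcribe("midi", "quebec")` differ only through the extra rules, but every new variety still needs its own hand-written entry, which is the tedious part described above.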
Possible Solutions
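The limitations above can be summarized as four cases: (1) exceptional spellings, (2) homographs, (3) the lack of access to prosodic structure, and (4) dialectal variation. For cases (1) and (4), the proposed solutions are already described in their respective sections. For cases (2) and (3), here are a few possible approaches.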
To address case (2), we could force the output of every possible transcription for a given word.4 This would indicate that the word is a homograph, although it still would not account for contextual information.
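A minimal sketch of this idea in Python, assuming a hand-built table of homographs: `HOMOGRAPHS` and `transcribe_all` are hypothetical names, and `fallback` stands in for whatever general rule-based transcriber is used for ordinary words.

```python
# Hypothetical table listing every transcription of each homograph.
HOMOGRAPHS = {
    "fils": ["fis", "fil"],  # "son(s)" vs. "threads"
    "est": ["ɛ", "ɛst"],     # "is" vs. "east"
}


def transcribe_all(word, fallback):
    """Return a list with every candidate transcription for `word`.

    `fallback` is any function mapping a word to a single transcription;
    it is used for words that are not listed as homographs.
    """
    key = word.lower()
    if key in HOMOGRAPHS:
        return list(HOMOGRAPHS[key])
    return [fallback(key)]
```

A result with more than one element flags the word as a homograph, so the choice between candidates can be left to the user or to a later, context-aware step.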
For case (3), we could leverage the syllabification system already included in the package to handle transcriptions that depend on syllable boundaries. This would solve most of the remaining issues for French. However, for languages that also require access to higher-level phonological structures such as feet, this approach would still be limiting.
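To illustrate why syllable boundaries help, the Python sketch below applies a rule only inside closed syllables. It rests on several assumptions: the input is already syllabified with "." as the boundary marker (a hypothetical format, not necessarily the one produced by Fonology), `lax_closed_syllables` is a hypothetical name, and the rule itself (lax every i, y, u in any consonant-final syllable) is a deliberate simplification rather than an accurate description of Quebec French.

```python
import re

# Close vowels and their lax counterparts.
HIGH_TO_LAX = {"i": "ɪ", "y": "ʏ", "u": "ʊ"}

# Symbols treated as vowels when checking whether a syllable is open.
VOWELS = "aeiouyøœɔɛəɑ"


def lax_closed_syllables(syllabified):
    """Lax i/y/u in every syllable that ends in a consonant."""
    result = []
    for syllable in syllabified.split("."):
        is_closed = bool(syllable) and syllable[-1] not in VOWELS
        if is_closed:
            syllable = re.sub(r"[iyu]", lambda m: HIGH_TO_LAX[m.group()], syllable)
        result.append(syllable)
    return ".".join(result)
```

Once boundaries are explicit in the string, the condition "closed syllable" becomes a simple check on each chunk, something a rule applied to an unsyllabified string cannot express directly.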
Conclusion
Throughout these three articles, we have presented a method for performing phonemic transcription using French as an example. The same methodology can be applied to many other languages as well. Feel free to install the Fonology package and experiment with the four languages that are already supported.
Copyright © Guilherme Duarte Garcia
Footnotes
1. Note that these functions are session-dependent and will not by themselves save the new lexicon permanently across R sessions. This is by design. The user should therefore also use the export_lex() function to make sure new entries are saved to a file, which can then be posted to the Fonology repository so that new updates include the entries in question.
2. These issues are greatly reduced in languages with more regular spelling systems, such as Spanish.
3. This is only a problem if the user wishes to go beyond phonemic transcription and include variable phenomena as well.
4. The same approach could also be applied to case (4) to handle dialectal variation.