Regular expressions and the Fonology package
An introduction to regular expressions for French phonemic transcription
How can we map phonological patterns in a language using written data?
We see text everywhere, so it’s reasonable to assume that gathering data has never been easier. The problem is the well-known mismatch between letters (graphemes) and sounds (phonemes): we cannot fully understand phonological systems by simply examining how letters are distributed in a given corpus. Thus, to map phonological patterns, we first need to convert graphemes into phonemes.
This series is dedicated to an SSHRC-funded project (grant no. 141280) examining how lexical statistics can be explored to generate a baseline for comparison with experimental data. Part of this project involves the development of grapheme-phoneme conversion tools, since it is very difficult to examine phonological patterns in written data without access to phonetic transcription. The Fonology package is directly connected to this project, covering Portuguese, French, Italian, and Spanish. Matéo Levesque worked on grapheme-phoneme conversion scripts for French. These posts are part of the project’s knowledge mobilization efforts.
Guilherme D. Garcia
What are regular expressions?
Regular expressions (or regex) are patterns used to validate the format of a character string or to find a specific sequence of letters, numbers or symbols in a text. Familiar uses include verifying that an email address follows the expected format or checking the strength of a password. They are also particularly useful for cleaning data and for deleting or replacing certain words in a text. For Fonology, we used regular expressions to help with grapheme-to-phoneme (G2P) transcription: we established a set of orthographic generalizations that we could target with regex, then replaced each match with the appropriate phonemes.
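To make these two everyday uses concrete, here is a minimal Python sketch of format validation and data cleaning. The email pattern is deliberately simplified and the function names are hypothetical, chosen only for illustration; neither comes from Fonology.

```python
import re

# Deliberately simplified email pattern (an assumption for illustration only;
# real-world address validation is considerably more involved).
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def looks_like_email(text: str) -> bool:
    """Return True if the whole string matches the simplified pattern."""
    return EMAIL.fullmatch(text) is not None


def collapse_spaces(text: str) -> str:
    """Data cleaning: replace each run of whitespace with a single space."""
    return re.sub(r"\s+", " ", text).strip()
```

The first function checks a format; the second performs a search-and-replace. Grapheme-to-phoneme rules rely on the same kind of replacement.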
The project
The goal of this project is to simplify the process of compiling databases that allow us to propose hypotheses concerning native and non-native phonological grammars. The problem is that many databases online are not coded for phonological variables. The Fonology library helps with this problem by allowing us to transform textual data into phonologically coded databases. Here, we focus specifically on French, but Fonology also covers Portuguese, Spanish and Italian.
Why use regular expressions for this type of project?
As mentioned previously, regular expressions are an excellent tool for replacing precise character strings in a text. Based on that, we can imagine applying this kind of replacement to phonological transcription. We can target characters that consistently represent the same sounds and change them to their corresponding IPA symbols. For example, characters such as “é”, “oi” or “rr” can be transcribed as /e/, /wa/ and /ʁ/ in French.
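The one-to-one replacements just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rule table contains just the three examples from the paragraph above, and the names `ONE_TO_ONE` and `replace_graphemes` are hypothetical, not part of Fonology.

```python
import re

# Hypothetical rule table holding only the three examples from the text.
ONE_TO_ONE = {"é": "e", "oi": "wa", "rr": "ʁ"}

# Try longer graphemes first so that multi-letter sequences such as "oi"
# are matched as a unit rather than letter by letter.
PATTERN = re.compile(
    "|".join(re.escape(g) for g in sorted(ONE_TO_ONE, key=len, reverse=True))
)


def replace_graphemes(word: str) -> str:
    """Replace every listed grapheme with its IPA symbol; leave the rest as-is."""
    return PATTERN.sub(lambda m: ONE_TO_ONE[m.group(0)], word)
```

With only these three rules, the output is a partial transcription: for instance, `replace_graphemes("terre")` rewrites "rr" but leaves the remaining letters untouched. A usable script would need many more rules.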
However, in certain cases the mapping isn’t one-to-one: a targeted string may correspond to different IPA symbols. Take “ch”, which can correspond to /ʃ/ or /k/, and sometimes even to /tʃ/. This is where regular expressions are particularly powerful.
With regex, we can specify that all instances of “ch” followed by “r” or “l” (as in “chrome”, “chronologie” or “chlore”) must be replaced by /k/, while all other occurrences of “ch” need to be replaced by /ʃ/. After further generalizations, we achieve a transcription script with satisfactory precision.
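The “ch” rule described above can be written as two ordered substitutions. The following Python sketch is an assumption about how such a rule might look, not the actual Fonology code, and `transcribe_ch` is a hypothetical name.

```python
import re


def transcribe_ch(word: str) -> str:
    """Apply the two "ch" rules from the text, in order."""
    # Rule 1: "ch" immediately followed by "r" or "l" becomes "k".
    # The lookahead (?=[rl]) checks the next letter without consuming it.
    word = re.sub(r"ch(?=[rl])", "k", word)
    # Rule 2: every remaining "ch" becomes "ʃ".
    return re.sub(r"ch", "ʃ", word)
```

The order matters: the specific context must be handled before the general rule, otherwise every “ch” would already have been rewritten as /ʃ/. Only “ch” is transcribed here; the other letters are left in their orthographic form, and words where “ch” is /k/ or /tʃ/ in other contexts would need additional rules.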
A rule-based method lets us build a transcription tool faster than machine-learning approaches, which require a large amount of transcribed data to learn efficiently. Such an approach would be unproductive in our project, since our goal is to compile phonological data without relying on prior training data. This is why we chose regular expressions.
What’s next?
This is only a brief introduction to the concept of regular expressions. In the next two articles, we will see:
- How regular expressions were used in Fonology
- The difficulties and limitations of regex in grapheme-to-phoneme transcription
Copyright © Guilherme Duarte Garcia