Fonology package

Phonological Analysis in R

The Fonology package (Garcia, 2023) provides functions that are relevant to phonology research and/or teaching. If you have any suggestions or feedback, please visit the GitHub page of the project. To install the package, you will need the function install_github() from the devtools package (see below). Here’s a slide presentation with a demo of the package (English, français, português).

Updates

A new function to count syllables has been added: countSyl(). Functions to analyze yy;mm ages have been added to the package. Broad phonemic transcription for Spanish (beta) is now more precise. You can now use a custom phonemic inventory with the functions that deal with distinctive features and natural classes.


How to install

library(devtools)
install_github("guilhermegarcia/fonology")

Main functions and data

  • getFeat() and getPhon() to work with distinctive features
  • ipa() phonemically transcribes words (real or not) in Portuguese or Spanish
  • syllable() to extract syllabic constituents
  • sonDisp() calculates the sonority dispersion of a given demisyllable or the average dispersion for a set of words—see also meanSonDisp() for the average dispersion of a given word
  • wug_pt() generates hypothetical words in Portuguese
  • biGram_pt() calculates bigram probabilities for a given word
  • plotVowels() generates vowel trapezoids
  • plotSon() plots the sonority profile of a given word
  • ipa2tipa() translates IPA sequences into tipa sequences
  • monthsAge() and meanAge() to work with ages in the yy;mm format
  • psl contains the Portuguese Stress Lexicon
  • pt_lex contains a simplified version of psl
  • stopwords_pt and stopwords_sp contain stopwords in Portuguese and Spanish

Distinctive features

The function getFeat() requires a set of phonemes ph and a language lg. It outputs the minimal matrix of distinctive features for ph given the phonemic inventory of lg. Five languages are supported: English, French, Italian, Portuguese, and Spanish. You can also use a custom phonemic inventory. See examples below.

The function getPhon() requires a feature matrix ft (a simple vector in R) and a language lg. It outputs the set of phonemes represented by ft given the phonemic inventory of lg. The languages supported are the same as those supported by getFeat(), and you can again provide your own phonemic inventory.

library(Fonology)

getFeat(ph = c("i", "u"), lg = "English")
#> [1] "+hi"    "+tense"
getFeat(ph = c("i", "u"), lg = "French")
#> [1] "Not a natural class in this language."
getFeat(ph = c("i", "y", "u"), lg = "French")
#> [1] "+syl" "+hi"
getFeat(ph = c("p", "b"), lg = "Portuguese")
#> [1] "-son"  "-cont" "+lab"
getFeat(ph = c("k", "g"), lg = "Italian")
#> [1] "+cons" "+back"

getPhon(ft = c("+syl", "+hi"), lg = "French")
#> [1] "u" "i" "y"
getPhon(ft = c("-DR", "-cont", "-son"), lg = "English")
#> [1] "t" "d" "b" "k" "g" "p"
getPhon(ft = c("-son", "+vce"), lg = "Spanish")
#> [1] "z" "d" "b" "ʝ" "g" "v"

getFeat(ph = c("p", "f", "w"), 
        lg = c("a", "i", "u", "y", "p", 
               "t", "k", "s", "w", "f"))
#> [1] "-syl" "+lab"

getPhon(ft = c("-son", "+cont"), 
        lg = c("a", "i", "u", "s", "z", 
               "f", "v", "p", "t", "m"))
#> [1] "s" "z" "f" "v"
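
To make the idea of a natural class more concrete, here is a minimal sketch in base R of the kind of lookup that a function like getPhon() performs. The feature table toy_features, its feature values, and the function get_phon_sketch() are hypothetical and exist only for illustration. They are not the package's inventory or its implementation.

```r
# Hypothetical toy feature table (not the package's feature values):
# one row per phoneme, one column per binary feature ("+" or "-")
toy_features <- data.frame(
  phoneme = c("a", "i", "u", "p", "b", "s", "z", "m"),
  syl     = c("+", "+", "+", "-", "-", "-", "-", "-"),
  son     = c("+", "+", "+", "-", "-", "-", "-", "+"),
  cont    = c("+", "+", "+", "-", "-", "+", "+", "-"),
  vce     = c("+", "+", "+", "-", "+", "-", "+", "+"),
  stringsAsFactors = FALSE
)

# Return every phoneme in the table that matches all requested features,
# where each feature is a sign followed by a column name (e.g., "-son")
get_phon_sketch <- function(ft, features = toy_features) {
  keep <- rep(TRUE, nrow(features))
  for (f in ft) {
    sign <- substr(f, 1, 1)   # "+" or "-"
    name <- substring(f, 2)   # feature name, e.g., "son"
    keep <- keep & features[[name]] == sign
  }
  features$phoneme[keep]
}

get_phon_sketch(c("-son", "+cont"))  # voiceless and voiced fricatives in the toy table
get_phon_sketch(c("-son", "+vce"))   # voiced obstruents in the toy table
```

The result depends entirely on the inventory supplied, which is why the same feature set can pick out different phonemes in different languages.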

IPA transcription

The function ipa() takes a word (or a vector with multiple words, real or not) in Portuguese or Spanish in its orthographic form and returns its phonemic (i.e., broad) transcription, including syllabification and stress. Narrow transcription is available for Portuguese (based on Brazilian Portuguese), which includes secondary stress—this can be generated by adding narrow = T to the function. Run ipa_pt_test() and ipa_sp_test() for sample words in both languages. By default, ipa() assumes that lg = "Portuguese" (or lg = "pt") and narrow = F.

ipa("atlético")
#> [1] "a.ˈtlɛ.ti.ko"
ipa("cantalo", narrow = T)
#> [1] "kãn.ˈta.lʊ"
ipa("antidepressivo", narrow = T)
#> [1] "ˌãn.t͡ʃi.ˌde.pɾe.ˈsi.vʊ"
ipa("feris") 
#> [1] "fe.ˈɾis"
ipa("mejorado", lg = "sp")
#> [1] "me.xo.ˈɾa.do"
ipa("nuevos", lg = "sp")
#> [1] "nu.ˈe.bos"

A more detailed function, ipa_pt(), is available for Portuguese only. In it, stress is assigned based on two scenarios. First, real words (non-verbs) have their stress assignment derived from the Portuguese Stress Lexicon (Garcia, 2014)—if the word is listed there. Second, nonce words follow the general patterns of Portuguese stress as well as probabilistic tendencies shown in my work (Garcia, 2017a, 2017b, 2019). As a result, a nonce word may have antepenultimate stress under the right conditions based on lexical statistics in the language. Likewise, words with other so-called exceptional stress patterns are also generated probabilistically (e.g., LH] words with penultimate stress). Stress and weight are also used to apply both spondaic and dactylic lowering to narrow transcriptions, following work such as Wetzels (2007). Secondary stress is provided when narrow = T. For ipa(), stress is not probabilistic (and therefore not variable): it merely follows the orthography as well as the typical stress rules in Portuguese (and Spanish).

There are several assumptions about surface forms when narrow = T (i.e., for Portuguese). Most of these assumptions can be adjusted. Diphthongization, for example, is sensitive to phonotactics. A word such as CV.ˈV.CV will be narrowly transcribed as ˈCGV.CV, except when the initial consonant is an (allophonic) affricate, which seems to lower the probability of diphthongization based on my judgement. Diphthongization is not applied if the onset is complex. Needless to say, these assumptions are based on a particular dialect of Brazilian Portuguese, and I do not expect all of them to seamlessly apply to other dialects (although some assumptions are more easily generalizable than others).

Narrow transcription also includes (final) vowel reduction, voicing assimilation, l-vocalization, vowel devoicing, palatalization, and epenthesis in sC clusters and other consonant sequences that are expected to be repaired on surface forms (e.g., kt, gn). Examples can be generated with the function ipa_pt_test(). Finally, it’s important to note that the goal of the ipa() function is phonemic transcription, not narrow phonetic transcription. Furthermore, there are certain limitations imposed by ASCII when it comes to specific phonetic diacritics (e.g., superscript and subscript symbols, which affect secondary articulation).

Use ipa_pt() if you have nonce words as well as real words in Portuguese and you’d like to generate stress probabilistically based on the lexical statistics in the language. Note that ipa_pt() is not vectorized. Use ipa() if you just want to transcribe a large number of words (real or not) in Portuguese or Spanish and you don’t need probabilistic stress assignment (i.e., you’re fine with categorical stress assignment). In the vast majority of cases, ipa() is the function you want.

Helper functions

If you plan to tokenize texts and create a table with individual columns for stress and syllables, you can use some simple additional helper functions. For example, getWeight() will take a syllabified word and return its weight profile (e.g., getWeight("kon.to") will return HL). The function getStress() (see footnote 1) will return the stress position of a given word (up to preantepenultimate stress). The word must already be stressed, but the symbol used can be specified in the function (argument stress). Finally, countSyl() will return the number of syllables in a given string, and getSyl() will extract a particular syllable from a string, counting from the right edge of the word. For example, getSyl(word = "kom-pu-ta-doɾ", pos = 3, syl = "-") will return the antepenultimate syllable of the string in question. The default symbol for syllabification is the period.
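
To illustrate what counting and extracting syllables involves, here is a minimal base R sketch. The functions split_syl(), count_syl_sketch(), and get_syl_sketch() are hypothetical stand-ins written for this example, not the package's code; they only assume that positions are counted from the right edge of the word, as in the getSyl() example above.

```r
# Split a syllabified string into syllables; fixed = TRUE treats the
# separator literally (a period would otherwise be a regex wildcard)
split_syl <- function(word, syl = ".") strsplit(word, syl, fixed = TRUE)[[1]]

# Number of syllables in a syllabified string
count_syl_sketch <- function(word, syl = ".") length(split_syl(word, syl))

# Syllable at position pos, counting from the right edge of the word
# (1 = final, 2 = penultimate, 3 = antepenultimate); NA if the word is too short
get_syl_sketch <- function(word, pos = 1, syl = ".") {
  s <- split_syl(word, syl)
  if (pos > length(s)) return(NA_character_)
  s[length(s) - pos + 1]
}

count_syl_sketch("kom-pu-ta-doɾ", syl = "-")         # four syllables
get_syl_sketch("kom-pu-ta-doɾ", pos = 3, syl = "-")  # antepenultimate syllable
get_syl_sketch("kon.to", pos = 3)                    # NA: only two syllables
```

Returning NA for missing positions keeps the output the same length as the input, which is convenient when the results become columns in a table.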

Here’s a simple example of how you could tokenize a text and create a table with coded variables using the functions discussed thus far (and without using packages such as tm or tidytext)—note also the function cleanText().

library(tidyverse)
text = "Por exemplo, em quase todas as variedades do português..."

d = tibble(word = text |>
             cleanText())

d = d |>
  mutate(IPA = ipa(word),
         stress = getStress(IPA), 
         weight = getWeight(IPA), 
         syl3 = getSyl(IPA, 3),
         syl2 = getSyl(IPA, 2),
         syl1 = getSyl(IPA, 1)) |>
  filter(!word %in% stopwords_pt) # remove stopwords
word        IPA              stress  weight  syl3  syl2  syl1
exemplo     e.ˈzem.plo       penult  LHL     e     zem   plo
quase       ˈkwa.ze          penult  LL      NA    kwa   ze
todas       ˈto.das          penult  LH      NA    to    das
variedades  va.ɾi.e.ˈda.des  penult  LLH     e     da    des
português   poɾ.tu.ˈges      final   HLH     poɾ   tu    ges

We often need to extract onsets, nuclei, codas and rhymes from syllables. That’s what syllable() does: given a syllable (phonemically transcribed), the function returns a constituent of interest. Let’s add columns to d where we extract all constituents of the final syllable (syl1 column).

d = d |>
  select(-c(syl3, syl2, stress)) |> 
  mutate(on1 = syllable(syl = syl1, const = "onset"),
         nu1 = syllable(syl = syl1, const = "nucleus"),
         co1 = syllable(syl = syl1, const = "coda"),
         rh1 = syllable(syl = syl1, const = "rhyme"))
word        IPA              weight  syl1  on1  nu1  co1  rh1
exemplo     e.ˈzem.plo       LHL     plo   pl   o    NA   o
quase       ˈkwa.ze          LL      ze    z    e    NA   e
todas       ˈto.das          LH      das   d    a    s    as
variedades  va.ɾi.e.ˈda.des  LLH     des   d    e    s    es
português   poɾ.tu.ˈges      HLH     ges   g    e    s    es

It’s important to decide whether we want to count glides as part of onsets or codas, or whether we want them to be included in nuclei only. By default, syllable() assumes that all glides are nuclear. You can change that by setting glides_as_onsets = T and glides_as_codas = T (both are set to F by default).

Do you have tons of data?

If you have a considerably large number of words to analyze with functions such as ipa() or syllable(), it’s much faster to first run the functions on types and then extend the variables created to all tokens (say, by using right_join() from dplyr).
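
The types-then-tokens strategy can be sketched as follows in base R. The function slow_transcribe() is a hypothetical stand-in for an expensive per-word function such as ipa(), and the token vector is made up for the example.

```r
# Hypothetical stand-in for an expensive per-word function such as ipa()
slow_transcribe <- function(words) toupper(words)

tokens <- c("casa", "mar", "casa", "casa", "mar", "sol")

# 1. Run the function once per type (unique word)
types <- unique(tokens)
type_table <- data.frame(word = types,
                         result = slow_transcribe(types),
                         stringsAsFactors = FALSE)

# 2. Extend the results to all tokens by looking up each token's type
token_table <- data.frame(word = tokens, stringsAsFactors = FALSE)
token_table$result <- type_table$result[match(token_table$word, type_table$word)]

token_table
```

Here the function runs on three types instead of six tokens. With dplyr, step 2 corresponds to joining the token table and the type table by word.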

IPA transcription of lemmas

You can easily combine Fonology with other packages that have tagging capabilities. In the example below, we import a short excerpt of Os Lusíadas, tag it using udpipe (Wijffels, 2023), and transcribe only the nouns in the data.

library(udpipe)
# Download model for Portuguese:
pt = udpipe_download_model(language = "portuguese-gsd")

udmodel_pt = udpipe_load_model(file = "portuguese-gsd-ud-2.5-191206.udpipe")

txt_pt = read_lines("data_files/lus.txt") |> 
  str_to_lower()

set.seed(1)
annotation_pt = udpipe_annotate(udmodel_pt, txt_pt) |> 
  as_tibble() |> 
  select(sentence, token, lemma, upos)

lusiadas = annotation_pt |> 
  select(lemma, upos) |> 
  filter(upos == "NOUN",
         !is.na(lemma)) |>
  mutate(ipa = ipa(lemma),
         stress = getStress(ipa)) |> 
  select(-upos) |> 
  ungroup()
lemma      ipa            stress
arma       ˈaɾ.ma         penult
barão      ba.ˈɾãw̃        final
praia      ˈpɾa.ja        penult
mar        ˈmaɾ           final
taprobana  ta.pɾo.ˈba.na  penult


Sonority

There are three functions in the package to analyze sonority. First, demi(word = ..., d = ...) extracts either the first (d = 1, the default) or second (d = 2) demisyllable of a given (syllabified) word (or vector of words). Second, sonDisp(demi = ...) calculates the sonority dispersion score of a given demisyllable, based on Clements (1990) (see also Parker (2011)). Note that this metric does not differentiate sequences that respect the sonority sequencing principle (SSP) from those that don’t, i.e., pla and lpa will have the same score. For that reason, a third function exists, ssp(demi = ..., d = ...), which evaluates whether a given demisyllable respects (1) or doesn’t respect (0) the SSP. In the example below, the dispersion score of the first demisyllable in the penult syllable is calculated. Here ssp() isn’t informative, since all words in Portuguese respect the SSP.

example = tibble(word = c("partolo", "metrilpo", "vanplidos"))

example = example |> 
  rowwise() |> 
  mutate(ipa = ipa(word),
         syl2 = getSyl(word = ipa, pos = 2),
         demi1 = demi(word = syl2, d = 1),
         disp = sonDisp(demi = demi1),
         SSP = ssp(demi = demi1, d = 1))
word       ipa           syl2  demi1  disp  SSP
partolo    paɾ.ˈto.lo    to    to     0.06  1
metrilpo   me.ˈtɾil.po   tɾil  tɾi    0.56  1
vanplidos  vam.ˈpli.dos  pli   pli    0.56  1
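
For readers who want to see the arithmetic behind a dispersion score, here is a minimal base R sketch of a Clements-style calculation: the sum of the inverse squared sonority distances over every pair of segments in a demisyllable. The five-point scale in son_rank and the functions son_disp_sketch() and ssp_sketch() are assumptions made for this illustration and are not taken from the package.

```r
# Assumed five-point sonority scale (obstruent < nasal < liquid < glide < vowel),
# restricted to a handful of segments for illustration
son_rank <- c(p = 1, t = 1, k = 1, s = 1,
              m = 2, n = 2,
              l = 3, r = 3,
              j = 4, w = 4,
              a = 5, e = 5, i = 5, o = 5, u = 5)

# Dispersion: sum of 1 / d^2 over every pair of segments in the demisyllable,
# where d is the distance between the two segments on the sonority scale.
# Two segments of equal rank give d = 0 and hence Inf; not handled here.
son_disp_sketch <- function(segs, scale = son_rank) {
  if (length(segs) < 2) return(NA_real_)
  r <- unname(scale[segs])
  d <- combn(r, 2, FUN = function(pair) abs(pair[1] - pair[2]))
  sum(1 / d^2)
}

# SSP check for an initial demisyllable: sonority must rise strictly
ssp_sketch <- function(segs, scale = son_rank) {
  as.integer(all(diff(unname(scale[segs])) > 0))
}

son_disp_sketch(c("t", "o"))       # one pair, distance 4: 1/16
son_disp_sketch(c("p", "l", "i"))  # three pairs: 1/4 + 1/16 + 1/4
son_disp_sketch(c("l", "p", "a"))  # same score as pla
ssp_sketch(c("p", "l", "a"))       # rising sonority: 1
ssp_sketch(c("l", "p", "a"))       # sonority falls, then rises: 0
```

Under these assumptions, to scores 0.0625 and pli scores 0.5625, which round to the 0.06 and 0.56 in the table above. The last three calls show why a separate SSP check is useful: pla and lpa receive the same dispersion score but differ on the SSP.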

You may also want to calculate the average sonority dispersion for whole words with the function meanSonDisp(). If your words of interest are possible or real Portuguese words, they can be entered in their orthographic form. Otherwise, they need to be phonemically transcribed and syllabified. In this scenario, use phonemic = T.

meanSonDisp(word = c("partolo", "metrilpo", "vanplidos"))
#> [1] 1.53

Plotting sonority

The function plotSon() creates a plot using ggplot2 to visualize the sonority profile of a given word, which must be phonemically transcribed. This is adapted from the Shiny App you can find here. If you want the figure to differentiate the syllables in the word of interest (syl = T), your input must also be syllabified (in that case, the stressed syllable will be highlighted with thicker borders). Finally, if you want to save your figure, simply add save_plot = T to the function. The function has a relatively flexible phonemic inventory. If a phoneme isn’t supported, the function will print it (and the figure won’t be generated). The sonority scale used here can be found in Parker (2011).

"combradol" |> 
  ipa() |> 
  plotSon(syl = F)

"sobremesa" |> 
  ipa(lg = "sp") |> 
  plotSon(syl = T)


Bigram probabilities

The function biGram_pt() returns the log bigram probability for a possible word in Portuguese. The input must be a broad phonemic transcription without syllabification or stress. The reference used to calculate probabilities is the Portuguese Stress Lexicon.

biGram_pt("paklode")
#> [1] -43.11171
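
As a rough illustration of what a log bigram probability is, the base R sketch below sums the log conditional probabilities of each segment given the preceding one, estimated from a reference list of words. This is one common way to define the measure and is only an assumption here. The toy lexicon, the "#" word-boundary symbol, and the function names are hypothetical and do not reflect how biGram_pt() is implemented.

```r
# Split a word into segments, padded with "#" as a word-boundary symbol
segments <- function(word) c("#", strsplit(word, "")[[1]], "#")

# All adjacent pairs of segments in a word, as a two-column character matrix
bigrams <- function(word) {
  s <- segments(word)
  cbind(first = s[-length(s)], second = s[-1])
}

# Sum of log P(second | first) over the bigrams of a word, with probabilities
# estimated from the reference lexicon. An unseen bigram yields -Inf; a first
# segment that never occurs in the lexicon yields NaN (no smoothing here).
bigram_logprob_sketch <- function(word, lexicon) {
  ref <- do.call(rbind, lapply(lexicon, bigrams))
  ref_keys <- paste(ref[, "first"], ref[, "second"])
  target <- bigrams(word)
  keys <- paste(target[, "first"], target[, "second"])
  n_pair <- vapply(keys, function(k) sum(ref_keys == k), numeric(1))
  n_first <- vapply(target[, "first"],
                    function(f) sum(ref[, "first"] == f), numeric(1))
  sum(log(n_pair / n_first))
}

toy_lexicon <- c("pa", "pa", "ta")  # hypothetical reference words
bigram_logprob_sketch("pa", toy_lexicon)  # log(2/3): two of three words start with p
bigram_logprob_sketch("ap", toy_lexicon)  # -Inf: contains unseen bigrams
```

Because each additional segment multiplies in another probability, longer words tend to receive lower (more negative) scores, which is worth keeping in mind when comparing words of different lengths.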

Two additional functions can be used to explore bigrams: nGramTbl() generates a tibble with phonotactic bigrams from a given text, and plotnGrams() creates a plot for inputs generated with nGramTbl(). Check ?plotnGrams() for more information.

Word generator for Portuguese

The function wug_pt() generates a hypothetical word in Portuguese. Note that this function is meant to be used to get you started with nonce words. You will most likely want to make adjustments based on phonotactic preferences. The function already takes care of some OCP effects and it also prohibits more than one onset cluster per word, since that’s relatively rare in Portuguese. Still, there will certainly be other sequences that sound less natural. The function is not too strict because you may have a wide range of variables in mind as you create novel words. Finally, if you wish to include palatalization, set palatalization = T—if you do that, biGram_pt() will de-palatalize words for its calculation, as it’s based on phonemic transcription.

set.seed(1)
wug_pt(profile = "LHL")
#> [1] "dɾa.ˈbuɾ.me"
# Let's create a table with 5 nonce words
# and their bigram probabilities
set.seed(1)
tibble(word = character(5)) |>
  mutate(word = wug_pt("LHL", n = 5),
         bigram = word |> 
           biGram_pt())
word bigram
dɾa.ˈbuɾ.me -49.23458
ze.ˈfɾan.ka -50.74279
be.ˈʒan.tɾe -49.19741
ʒa.ˈgɾan.fe -51.86230
me.ˈxes.vɾo -68.84952


Plotting vowels

The function plotVowels() creates a vowel trapezoid using ggplot2. If tex = T, the function also saves a tex file with the LaTeX code to create the same trapezoid using the vowel package. Available languages: Arabic, French, English, Dutch, German, Hindi, Italian, Japanese, Korean, Mandarin, Portuguese, Spanish, Swahili, Russian, Talian, Thai, and Vietnamese. Only oral monophthongs are plotted. This function is also implemented as a Shiny App here.

plotVowels(lg = "Spanish", tex = F)
plotVowels(lg = "French", tex = F)


From IPA to TIPA

The function ipa2tipa() takes a phonemically transcribed sequence and returns its tipa equivalent, which can be handy if you use LaTeX.

"Aqui estão algumas palavras" |> 
  cleanText() |> 
  ipa(narrow = T) |> 
  ipa2tipa()
#> Done! Here's your tex code using TIPA:
#> \textipa{ / a."ki es."t\~{a}\~{w} aw."g\~{u}.m5s pa."la.vR5s / }
[Figure: rendered tipa output]


Working with ages in acquisition studies

It’s very common to use the format yy;mm for children’s ages in language acquisition studies. To make it easier to work with this format, two functions have been added to the package: monthsAge(), which returns an age in months given a yy;mm age, and meanAge(), which returns the average age of a vector using the same format (in both functions, you can specify the year-month separator). Here are a couple of examples:

monthsAge(age = "02;06")
#> [1] 30
monthsAge(age = "05:03", sep = ":")
#> [1] 63

meanAge(age = c("02;06", "03;04", NA))
#> [1] "2;11"
meanAge(age = c("05:03", "04:07"), sep = ":")
#> [1] "4:11"
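
The conversion behind these functions is simple arithmetic (years × 12 + months). Here is a minimal base R sketch of it. The functions months_age_sketch() and mean_age_sketch() are hypothetical and not the package's code; rounding the mean to the nearest month and leaving the output unpadded are assumptions.

```r
# Convert ages in "yy;mm" format to months; NA (or malformed input) gives NA
months_age_sketch <- function(age, sep = ";") {
  parts <- strsplit(age, sep, fixed = TRUE)
  vapply(parts, function(p) {
    if (length(p) != 2) return(NA_real_)
    as.numeric(p[1]) * 12 + as.numeric(p[2])
  }, numeric(1))
}

# Average a vector of "yy;mm" ages, ignoring NAs, and convert back to
# years and months (assumption: the mean is rounded to the nearest month)
mean_age_sketch <- function(age, sep = ";") {
  m <- round(mean(months_age_sketch(age, sep), na.rm = TRUE))
  paste0(m %/% 12, sep, m %% 12)
}

months_age_sketch("02;06")                  # 2 * 12 + 6
mean_age_sketch(c("02;06", "03;04", NA))    # mean of 30 and 40 months
```

Under these assumptions, the sketch gives 30 months and "2;11" for the two calls, in line with the package output shown above.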



Acknowledgements and funding

Parts of this project have benefitted from funding from the ENVOL program at Université Laval. Two research assistants at Université Laval have helped with the Spanish and French transcription (winter 2023): Nicolas C. Bustos and Linda Wong.

Citing the package

citation("Fonology")
#> To cite Fonology in publications, use:
#> 
#>   Garcia, Guilherme D. (2023). Fonology: Phonological Analysis in R. R
#>   package version 0.9. Available at https://gdgarcia.ca/fonology
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Manual{,
#>     title = {Fonology: Phonological Analysis in {R}},
#>     author = {Guilherme D. Garcia},
#>     note = {R package version 0.9},
#>     year = {2023},
#>     url = {https://gdgarcia.ca/fonology},
#>   }

Copyright © 2024 Guilherme Duarte Garcia

References

Clements, G. N. (1990). The role of the sonority cycle in core syllabification. In J. Kingston & M. Beckman (Eds.), Papers in Laboratory Phonology 1: Between the grammar and physics of speech (pp. 283–333). Cambridge University Press.
Garcia, G. D. (2014). Portuguese Stress Lexicon.
Garcia, G. D. (2017a). Weight effects on stress: Lexicon and grammar [PhD thesis, McGill University]. https://doi.org/10.31219/osf.io/bt8hk
Garcia, G. D. (2017b). Weight gradience and stress in Portuguese. Phonology, 34(1), 41–79. https://doi.org/10.1017/S0952675717000033
Garcia, G. D. (2019). When lexical statistics and the grammar conflict: Learning and repairing weight effects on stress. Language, 95(4), 612–641. https://doi.org/10.1353/lan.2019.0068
Garcia, G. D. (2023). Fonology: Phonological analysis in R. https://gdgarcia.ca/fonology
Parker, S. (2011). Sonority. In M. van Oostendorp, C. J. Ewen, E. Hume, & K. Rice (Eds.), The Blackwell companion to phonology (pp. 1160–1184). Wiley Online Library. https://doi.org/10.1002/9781444335262.wbctp0049
Wetzels, W. L. (2007). Primary word stress in Brazilian Portuguese and the weight parameter. Journal of Portuguese Linguistics, 5, 9–58. https://doi.org/10.5334/jpl.144
Wijffels, J. (2023). Udpipe: Tokenization, parts of speech tagging, lemmatization and dependency parsing with the 'UDPipe' 'NLP' toolkit. https://CRAN.R-project.org/package=udpipe

Footnotes

  1. Functions without _pt or _sp are language-independent.