Wordcloud using R

As armas e os barões assinalados,
Que da ocidental praia Lusitana,
Por mares nunca de antes navegados,
Passaram ainda além da Taprobana,
Em perigos e guerras esforçados,
Mais do que prometia a força humana,
E entre gente remota edificaram
Novo Reino, que tanto sublimaram

Camões

Textmining Os Lusíadas

Figure generated by DALL•E 3

This is a simple example of how to create a wordcloud in R from Os Lusíadas (‘The Lusiads’), an epic poem often considered the most important literary work in the Portuguese language (think of it as the Portuguese equivalent of Virgil’s Aeneid). It was written by Luís de Camões and first published in 1572. The poem consists of ten parts (Cantos). This particular wordcloud was made with a few very useful packages: dplyr, readr and stringr (all part of the tidyverse), tidytext, and wordcloud2, which renders interactive wordclouds.

A typical wordcloud

Here’s how you can do it step by step:

1. Load the packages
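
A minimal sketch of this step, assuming all five packages are already installed. The list is inferred from the functions called in the later steps, so treat it as an assumption rather than the exact set used for the original figure.

```r
# Packages inferred from the functions used in the steps below
# (assumption: all five are already installed from CRAN)
library(dplyr)      # mutate(), group_by(), summarise(), anti_join()
library(readr)      # read_lines(), read_csv()
library(stringr)    # str_replace_all(), str_detect()
library(tidytext)   # unnest_tokens()
library(wordcloud2) # wordcloud2()
```

Loading library(tidyverse) instead of the first three lines should work equally well, since dplyr, readr and stringr are all attached by it.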

2. Load the actual book

book = read_lines("http://www.gutenberg.org/cache/epub/3333/pg3333.txt")

# Select only lines in the actual poem

book = as.data.frame(book[30:(length(book)-374)])
names(book) = "verse"
book$verse = as.character(book$verse)
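
The fixed offsets above (30 lines from the top, 374 from the bottom) depend on the exact layout of the downloaded file. As a hedged alternative, the sketch below trims a text by searching for start and end marker lines instead. The marker wording and the toy vector `raw` are assumptions for illustration; check the real file before relying on the patterns, and note that markers alone will not skip a title page or preface.

```r
# Toy stand-in for the vector returned by read_lines()
# (assumption: the file wraps the text in "*** START OF" / "*** END OF" lines)
raw <- c(
  "Header line",
  "*** START OF THE PROJECT GUTENBERG EBOOK ***",
  "Canto Primeiro",
  "1",
  "Por mares nunca de antes navegados",
  "*** END OF THE PROJECT GUTENBERG EBOOK ***",
  "License text"
)

# Index of the first line matching each marker
start <- grep("^\\*\\*\\* ?START OF", raw)[1]
end   <- grep("^\\*\\*\\* ?END OF", raw)[1]

# Keep only the lines strictly between the two markers
body <- raw[(start + 1):(end - 1)]
body
```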

# Remove commas

book$verse = str_replace_all(book$verse, pattern = ",", replacement = "")

3. Prepare words

# Add canto and stanza numbers

book = book |> mutate(line = row_number()) |> 
    mutate(canto = cumsum(str_detect(string = verse, pattern = "^Canto "))) |> 
    group_by(canto) |> 
    mutate(stanza = cumsum(str_detect(string = verse, pattern = "[1-9]+"))) |> 
    select(canto, stanza, line, verse) |> 
    ungroup()
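
To see how the cumsum(str_detect(...)) trick numbers cantos and stanzas, here is a small sketch on a toy data frame. The layout of `toy` (a "Canto ..." heading followed by lines holding only a stanza number) is an assumption about the source text. Note that "[1-9]+" matches any line containing a non-zero digit, so the stanza counter is only reliable if digits appear on stanza-number lines alone.

```r
library(dplyr)
library(stringr)

# Toy text: two cantos, with stanza numbers on their own lines (assumed layout)
toy <- data.frame(verse = c(
  "Canto Primeiro", "1", "Por mares nunca de antes navegados",
  "2", "Que da ocidental praia Lusitana",
  "Canto Segundo", "1", "Em perigos e guerras"
))

toy <- toy |>
  # Each "Canto ..." heading flips the flag to TRUE; cumsum turns flags into a running id
  mutate(canto = cumsum(str_detect(verse, "^Canto "))) |>
  group_by(canto) |>
  # Within a canto, every line containing a digit starts a new stanza
  mutate(stanza = cumsum(str_detect(verse, "[1-9]+"))) |>
  ungroup()

toy
```

Lines that come before the first stanza number of a canto (the heading itself) get stanza 0.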

# Tokenize text

tidy_book = book |> 
    unnest_tokens(input = verse, output = word)
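
A quick sketch of what unnest_tokens() does to a single verse, using a one-row toy data frame (the `line` column is only there for illustration). With its default settings it lowercases the text and drops punctuation, which suggests the comma-removal step earlier is optional rather than required.

```r
library(tidytext)

# One verse from the opening stanza, trailing comma included
toy <- data.frame(line = 1L, verse = "Por mares nunca de antes navegados,")

# One row per word; the other columns are carried along
tokens <- unnest_tokens(toy, output = word, input = verse)
tokens$word
```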

# Load stop words (a csv file with function words in Portuguese)
# Note that this is *not* an exhaustive list, so we'll miss some stop words

stopwords = read_csv("stopword.csv")
names(stopwords) = "word"

# Remove stopwords from tokens

tidy_book = tidy_book |> 
    anti_join(stopwords, by = "word")
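
Since stopword.csv is a local file, here is a self-contained sketch of the same filtering step with a tiny inline stop word list. Both `tokens` and `stops` are made-up toy data, not the contents of the real file.

```r
library(dplyr)

# Toy tokens and a toy stop word list (hypothetical, for illustration only)
tokens <- data.frame(word = c("por", "mares", "nunca", "de", "antes", "navegados"))
stops  <- data.frame(word = c("por", "de"))

# anti_join keeps the rows of `tokens` that have no match in `stops`
kept <- anti_join(tokens, stops, by = "word")
kept$word
```

Naming the join column with by = "word" avoids the message dplyr prints when it has to guess the key.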

# Remove numbers

tidy_book = tidy_book |> 
    filter(!str_detect(word, "^\\d"))

4. Wordcloud input

words = tidy_book |> 
    group_by(word) |> 
    summarise(freq = n()) |> 
    arrange(desc(freq)) 

words = as.data.frame(words)

rownames(words) = words$word
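
The group_by() / summarise() / arrange() chain above can also be written with dplyr's count(), which should give the same word and freq columns. A sketch on toy tokens (the data here is made up):

```r
library(dplyr)

# Toy tokens standing in for tidy_book
tokens <- data.frame(word = c("mar", "terra", "mar", "gente", "mar", "terra"))

# Count occurrences per word, most frequent first, naming the count column "freq"
freqs <- count(tokens, word, sort = TRUE, name = "freq")
freqs
```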

Let’s see what the input data frame looks like (10 most frequent words):

word	freq
gente	230
terra	222
rei	204
mar	188
mundo	100
reino	86
céu	83
forte	79
ó	78
peito	76

5. Actual wordcloud

# The following code generates the wordcloud at the top of the page

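# "oval" is not among wordcloud2's documented shapes, so the default "circle"
# is named here; the flattened, oval look comes from ellipticity
wordcloud2(data = words, size = 0.5,
           shape = "circle",
           rotateRatio = 0.5,
           ellipticity = 0.9, color = "brown")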