Wordcloud using R

As armas e os barões assinalados,
Que da ocidental praia Lusitana,
Por mares nunca de antes navegados,
Passaram ainda além da Taprobana,
Em perigos e guerras esforçados,
Mais do que prometia a força humana,
E entre gente remota edificaram
Novo Reino, que tanto sublimaram


Textmining Os Lusíadas

Figure generated by DALL·E 3

This is a simple example of how you can create a wordcloud in R from Os Lusíadas (‘The Lusiads’). The work is an epic poem, widely regarded as the most important literary work in the Portuguese language (think of it as the Portuguese counterpart to Virgil’s Aeneid). It was written by Luís de Camões and first published in 1572. The book consists of ten parts (Cantos). This particular wordcloud was made with a few very useful packages: dplyr, readr, and stringr (all part of the tidyverse), tidytext, and wordcloud2, which renders interactive wordclouds.

A typical wordcloud

Here’s how you can do it step-by-step

1. Load the packages
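
The code for this step is not shown, so here is a minimal sketch. It assumes the packages named in the introduction are already installed. stringr is an assumption based on the calls to str_replace_all() and str_detect() further down; it is attached together with the rest of the tidyverse.

```r
# Assumed one-time setup:
# install.packages(c("tidyverse", "tidytext", "wordcloud2"))
library(tidyverse)   # attaches dplyr, readr, and stringr, among others
library(tidytext)    # provides unnest_tokens()
library(wordcloud2)  # provides wordcloud2()
```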

2. Load the actual book

book = read_lines("http://www.gutenberg.org/cache/epub/3333/pg3333.txt")

# Select only lines in the actual poem

book = as.data.frame(book[30:(length(book)-374)])
names(book) = "verse"
book$verse = as.character(book$verse)

# Remove commas

book$verse = str_replace_all(book$verse, pattern = ",", replacement = "")
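
The fixed offsets used above (30 lines from the top, 374 from the bottom) depend on the exact layout of the downloaded file. As a hedged alternative, and assuming the file contains the usual Project Gutenberg "*** START OF" and "*** END OF" marker lines, the boundaries could be located by pattern instead. The helper name trim_gutenberg is hypothetical, and the toy vector stands in for the real file.

```r
# Hypothetical helper: keep only the lines between the Gutenberg markers.
# Assumes exactly one "*** START OF" line and one "*** END OF" line.
trim_gutenberg = function(lines) {
  start = grep("^\\*\\*\\* ?START OF", lines)
  end = grep("^\\*\\*\\* ?END OF", lines)
  stopifnot(length(start) == 1, length(end) == 1, end - start > 1)
  lines[(start + 1):(end - 1)]
}

# Toy input standing in for the downloaded text
toy = c("Header",
        "*** START OF THE PROJECT GUTENBERG EBOOK ***",
        "Canto Primeiro",
        "As armas e os baroes assinalados",
        "*** END OF THE PROJECT GUTENBERG EBOOK ***",
        "License")
poem = trim_gutenberg(toy)
print(poem)
```

The markers delimit the whole ebook body, which can still include a title page or notes, so some extra trimming may be needed to isolate the poem itself.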

3. Prepare words

# Add canto and stanza numbers

book = book |> mutate(line = row_number()) |> 
    mutate(canto = cumsum(str_detect(string = verse, pattern = "^Canto "))) |> 
    group_by(canto) |> 
    mutate(stanza = cumsum(str_detect(string = verse, pattern = "[1-9]+"))) |> 
    select(canto, stanza, line, verse) |> 
    ungroup()
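
The canto and stanza counters above rely on one idiom: cumsum() over a logical vector increases by one at every TRUE, so each line inherits the number of the most recent heading. Here is a small sketch on made-up lines. It uses base R's grepl() in place of str_detect() so that it runs without extra packages.

```r
# Made-up lines: two canto headings, each followed by verses
verse = c("Canto Primeiro", "verse a", "verse b",
          "Canto Segundo", "verse c")

# TRUE on heading lines, FALSE elsewhere
is_heading = grepl("^Canto ", verse)

# Running count of the headings seen so far
canto = cumsum(is_heading)
print(data.frame(verse, canto))
```

The stanza counter works the same way inside each canto, which is why it is computed after group_by(canto).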

# Tokenize text

tidy_book = book |> 
    unnest_tokens(input = verse, output = word)

# Load stop words (a csv file with function words in Portuguese)
# Note that this is *not* an exhaustive list, so we'll miss some stop words

stopwords = read_csv("stopword.csv")
names(stopwords) = "word"

# Remove stopwords from tokens

tidy_book = tidy_book |> 
    anti_join(stopwords, by = "word")

# Remove numbers

tidy_book = tidy_book |> 
    filter(!str_detect(word, "^\\d"))
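
The two clean-up steps above (dropping stop words and dropping numbers) can be checked on a handful of made-up tokens. This is only a sketch in base R, not the pipeline itself. The vectors tokens and stop_list are hypothetical stand-ins for the word columns of tidy_book and stopwords.

```r
# Made-up tokens and stop words for illustration
tokens = c("as", "armas", "e", "os", "1", "mar", "23", "gente")
stop_list = c("as", "e", "os")

# Drop every token that appears in the stop word list
tokens = tokens[!tokens %in% stop_list]

# Drop tokens that start with a digit, as the pattern "^\\d" does
tokens = tokens[!grepl("^\\d", tokens)]
print(tokens)
```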

4. Wordcloud input

words = tidy_book |> 
    group_by(word) |> 
    summarise(freq = n()) |> 
    arrange(desc(freq))

words = as.data.frame(words)

rownames(words) = words$word

Let’s see what the input data frame looks like (10 most frequent words):

word freq
gente 230
terra 222
rei 204
mar 188
mundo 100
reino 86
céu 83
forte 79
ó 78
peito 76
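
The call that printed this table is not shown. Assuming words is sorted by descending frequency, head(words, 10) would produce it. The counting and sorting can be sketched on made-up tokens in base R; the name words_toy is hypothetical.

```r
# Made-up tokens standing in for the word column of tidy_book
tokens = c("mar", "gente", "mar", "terra", "gente", "mar")

# Count each word, then sort from most to least frequent
counts = sort(table(tokens), decreasing = TRUE)
words_toy = data.frame(word = names(counts),
                       freq = as.integer(counts),
                       stringsAsFactors = FALSE)

# With the real data, head(words, 10) would show the ten most frequent rows
print(head(words_toy, 2))
```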

5. Actual wordcloud

# The following code generates the wordcloud at the top of the page

wordcloud2(data = words, size = .5, 
           shape = "oval",
           rotateRatio = 0.5, 
           ellipticity = 0.9, color = "brown")

Copyright © 2024 Guilherme Duarte Garcia