Wordcloud using R

As armas e os barões assinalados,
Que da ocidental praia Lusitana,
Por mares nunca de antes navegados,
Passaram ainda além da Taprobana,
Em perigos e guerras esforçados,
Mais do que prometia a força humana,
E entre gente remota edificaram
Novo Reino, que tanto sublimaram

Camões

Textmining Os Lusíadas

This is a simple example of how you can create a wordcloud in R from Os Lusíadas (‘The Lusiads’). This particular wordcloud was made using a few very useful packages: dplyr, readr, and stringr (all part of tidyverse), tidytext, and wordcloud2, which renders interactive wordclouds.


A typical wordcloud


Here’s how you can do it step by step:

1. Load the packages
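
No code is shown for this step. A minimal sketch, assuming the packages named in the introduction are already installed, could look like this (loading tidyverse attaches dplyr, readr, and stringr in one go):

```r
# Assumption: these three packages are installed (e.g. via install.packages())
library(tidyverse)   # attaches dplyr, readr, stringr, among others
library(tidytext)    # provides unnest_tokens()
library(wordcloud2)  # provides wordcloud2()
```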

2. Load the actual book

book = read_lines("http://www.gutenberg.org/cache/epub/3333/pg3333.txt")

# Select only lines in the actual poem: drop the first 29 and the last 374 lines of the file

book = as.data.frame(book[30:(length(book)-374)])
names(book) = "verse"
book$verse = as.character(book$verse)

# Remove commas

book$verse = str_replace_all(book$verse, pattern = ",", replacement = "")

3. Prepare words

# Add canto and stanza numbers

book = book |> mutate(line = row_number()) |> 
    mutate(canto = cumsum(str_detect(string = verse, pattern = "^Canto "))) |> 
    group_by(canto) |> 
    mutate(stanza = cumsum(str_detect(string = verse, pattern = "[1-9]+"))) |> 
    select(canto, stanza, line, verse) |> 
    ungroup()
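
The numbering above relies on a running-counter idiom: `str_detect()` flags the lines that open a new section, and `cumsum()` turns those flags into a counter that every following line inherits. The sketch below illustrates this on a made-up eight-line table (the `toy` data frame is hypothetical, not taken from the book):

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the poem: canto headings, stanza numbers, verses
toy = data.frame(
  verse = c("Canto Primeiro", "1", "first verse", "2", "second verse",
            "Canto Segundo", "1", "third verse"),
  stringsAsFactors = FALSE
)

toy = toy |>
  # Each heading adds 1 to the running canto counter
  mutate(canto = cumsum(str_detect(verse, "^Canto "))) |>
  group_by(canto) |>
  # Within a canto, each line containing a digit starts a new stanza
  mutate(stanza = cumsum(str_detect(verse, "[1-9]+"))) |>
  ungroup()

toy$canto   # 1 1 1 1 1 2 2 2
toy$stanza  # 0 1 1 2 2 0 1 1
```

Heading lines get stanza 0 because no stanza number has been seen yet inside that canto.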

# Tokenize text

tidy_book = book |> 
    unnest_tokens(input = verse, output = word)
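
To see what tokenization does to a single verse, here is a small sketch on a hypothetical one-row table. By default `unnest_tokens()` splits on words, lowercases them, and drops punctuation, and the other columns are repeated for every token:

```r
library(tidytext)

# Hypothetical one-row input holding the first verse of the poem
toy = data.frame(
  line = 1,
  verse = "As armas e os barões assinalados,",
  stringsAsFactors = FALSE
)

tokens = unnest_tokens(toy, output = word, input = verse)

nrow(tokens)   # one row per word
tokens$word    # lowercased words, trailing comma removed
```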

# Load stop words (a csv file with function words in Portuguese)
# Note that this is *not* an exhaustive list, so we'll miss some stop words

stopwords = read_csv("stopword.csv")
names(stopwords) = "word"

# Remove stopwords from tokens

tidy_book = tidy_book |> 
    anti_join(stopwords, by = "word")

# Remove numbers

tidy_book = tidy_book |> 
    filter(!str_detect(word, "^\\d"))
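
The two filtering steps above can be tried out on a toy table. In this sketch both the tokens and the two-word stop list are made up for illustration; `anti_join()` keeps only the rows whose `word` has no match in the stop list, and the `filter()` call then drops tokens that start with a digit:

```r
library(dplyr)
library(stringr)

# Hypothetical tokens and a hypothetical stop list
toy_tokens = data.frame(word = c("as", "armas", "e", "102", "mar"),
                        stringsAsFactors = FALSE)
toy_stop = data.frame(word = c("as", "e"), stringsAsFactors = FALSE)

kept = toy_tokens |>
  anti_join(toy_stop, by = "word") |>     # remove stop words
  filter(!str_detect(word, "^\\d"))       # remove tokens starting with a digit

kept$word  # "armas" "mar"
```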

4. Wordcloud input

words = tidy_book |> 
    group_by(word) |> 
    summarise(freq = n()) |> 
    arrange(desc(freq)) 

words = as.data.frame(words)

rownames(words) = words$word
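
As a side note, the group/summarise/arrange chain above can be written more compactly with `dplyr::count()`. The sketch below, on a hypothetical vector of tokens, suggests that both forms yield the same `word` and `freq` columns that `wordcloud2()` expects:

```r
library(dplyr)

# Hypothetical tokens with distinct frequencies
toy = data.frame(word = c("mar", "mar", "mar", "terra", "terra", "rei"),
                 stringsAsFactors = FALSE)

# Same approach as in the step above
long_form = toy |>
  group_by(word) |>
  summarise(freq = n()) |>
  arrange(desc(freq))

# Compact alternative: count rows per word, sorted by descending frequency
short_form = toy |>
  count(word, sort = TRUE, name = "freq")

as.data.frame(short_form)
```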


Let’s see what the input data frame looks like (10 most frequent words):

word freq
gente 230
terra 222
rei 204
mar 188
mundo 100
reino 86
céu 83
forte 79
ó 78
peito 76



5. Actual wordcloud

# The following code generates the wordcloud at the top of the page

wordcloud2(data = words, size = .5, 
           shape = "oval",
           rotateRatio = 0.5, 
           ellipticity = 0.9, color = "brown")

Copyright © 2023 Guilherme Duarte Garcia