A typical wordcloud
Wordcloud using R
As armas e os barões assinalados,
Que da ocidental praia Lusitana,
Por mares nunca de antes navegados,
Passaram ainda além da Taprobana,
Em perigos e guerras esforçados,
Mais do que prometia a força humana,
E entre gente remota edificaram
Novo Reino, que tanto sublimaram
Camões
Textmining Os Lusíadas
This is a simple example of how you can create a wordcloud in R from Os Lusíadas (‘The Lusiads’). The work in question is an epic poem, widely regarded as the most important literary work in the Portuguese language (think of it as the Portuguese equivalent to Virgil’s Aeneid, for example). It was written by Luís de Camões and first published in 1572. The book consists of ten parts (Cantos). This particular wordcloud was made using a few very useful packages: dplyr and readr (both part of tidyverse), tidytext, and wordcloud2, which renders interactive wordclouds.
Here’s how you can do it step by step:
1. Load the packages
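A minimal sketch of this step, assuming the packages named above are already installed. Attaching tidyverse brings in dplyr and readr, as well as stringr, which provides the str_replace_all() and str_detect() calls used in the later steps.

```r
# Install once if needed:
# install.packages(c("tidyverse", "tidytext", "wordcloud2"))

library(tidyverse)   # attaches dplyr, readr, and stringr, among others
library(tidytext)    # unnest_tokens()
library(wordcloud2)  # wordcloud2()
```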
2. Load the actual book
book = read_lines("http://www.gutenberg.org/cache/epub/3333/pg3333.txt")
# Select only lines in the actual poem
book = as.data.frame(book[30:(length(book)-374)])
names(book) = "verse"
book$verse = as.character(book$verse)
# Remove commas
book$verse = str_replace_all(book$verse, pattern = ",", replacement = "")
3. Prepare words
# Add canto and stanza numbers
book = book |>
  mutate(line = row_number()) |>
  mutate(canto = cumsum(str_detect(string = verse, pattern = "^Canto "))) |>
  group_by(canto) |>
  mutate(stanza = cumsum(str_detect(string = verse, pattern = "[1-9]+"))) |>
  select(canto, stanza, line, verse) |>
  ungroup()
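The numbering relies on a cumulative sum over a logical vector: each line that matches a header pattern bumps the counter by one, so every following line inherits that number. Here is a small sketch of the idea on made-up input (not the actual Gutenberg text), using base R's grepl() and ave() in place of str_detect() and group_by() so that it runs without extra packages.

```r
# Toy input, invented for illustration only
verse <- c("Canto Primeiro", "1", "first verse", "2", "another verse",
           "Canto Segundo", "1", "yet another verse")

# TRUE on canto headers; the running sum gives the canto number
canto <- cumsum(grepl("^Canto ", verse))

# TRUE on lines containing a digit 1-9; the running sum restarts in each canto
is_stanza <- grepl("[1-9]+", verse)
stanza <- ave(as.integer(is_stanza), canto, FUN = cumsum)

print(data.frame(canto, stanza, verse))
```

The canto header lines themselves get stanza 0, because no stanza number has been seen yet within that canto.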
# Tokenize text
tidy_book = book |>
  unnest_tokens(input = verse, output = word)
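To see what tokenization does, here is a small sketch on two verses from the stanza quoted at the top of the page. The data frame toy and its columns are made up for illustration. With its default settings, unnest_tokens() lowercases the text, drops punctuation, and returns one row per word.

```r
library(tidytext)

# Two verses from the opening stanza (toy data frame for illustration)
toy <- data.frame(
  line  = 1:2,
  verse = c("Por mares nunca de antes navegados,",
            "Novo Reino, que tanto sublimaram")
)

# One row per word; the other columns (here, line) are carried along
toy_tokens <- unnest_tokens(toy, output = word, input = verse)
print(toy_tokens)
```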
# Load stop words (a csv file with function words in Portuguese)
# Note that this is *not* an exhaustive list, so we'll miss some stop words
stopwords = read_csv("stopword.csv")
names(stopwords) = "word"
# Remove stopwords from tokens
tidy_book = tidy_book |>
  anti_join(stopwords, by = "word")
# Remove numbers
tidy_book = tidy_book |>
  filter(!str_detect(word, "^\\d"))
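The two filtering steps can be tried on a handful of tokens. This is only a sketch: toy_tokens and toy_stopwords are hypothetical stand-ins for tidy_book and the contents of stopword.csv, and grepl() is used in place of str_detect() so that dplyr is the only package needed.

```r
library(dplyr)

# Hypothetical tokens and a hypothetical three-word stop list
toy_tokens    <- data.frame(word = c("novo", "reino", "que", "12", "tanto", "sublimaram"))
toy_stopwords <- data.frame(word = c("que", "de", "e"))

kept <- toy_tokens |>
  anti_join(toy_stopwords, by = "word") |>  # keep rows with no match in the stop list
  filter(!grepl("^\\d", word))              # drop tokens that start with a digit

print(kept)
```

Naming the join column with by = "word" avoids the message dplyr prints when it has to guess which columns to join on.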
4. Wordcloud input
words = tidy_book |>
  group_by(word) |>
  summarise(freq = n()) |>
  arrange(desc(freq))
words = as.data.frame(words)
rownames(words) = words$word
Let’s see what the input data frame looks like (10 most frequent words):
word | freq |
---|---|
gente | 230 |
terra | 222 |
rei | 204 |
mar | 188 |
mundo | 100 |
reino | 86 |
céu | 83 |
forte | 79 |
ó | 78 |
peito | 76 |
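The group_by(), summarise(), and arrange() chain used to build the frequency table can also be written with dplyr's count(), which does all three in one call. A small sketch on made-up tokens (toy is hypothetical, not the poem's data), comparing the two forms:

```r
library(dplyr)

# Hypothetical tokens for illustration
toy <- data.frame(word = c("mar", "terra", "mar", "gente", "mar", "terra"))

# Long form, as in the step above
freq_long <- toy |>
  group_by(word) |>
  summarise(freq = n()) |>
  arrange(desc(freq))

# Shorter equivalent: count rows per word, name the column freq, sort descending
freq_short <- toy |>
  count(word, name = "freq", sort = TRUE)

print(freq_short)
```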
5. Actual wordcloud
# The following code generates the wordcloud at the top of the page
wordcloud2(data = words, size = .5,
           shape = "oval",
           rotateRatio = 0.5,
           ellipticity = 0.9, color = "brown")
Copyright © 2024 Guilherme Duarte Garcia