A typical wordcloud
Wordcloud using R
As armas e os barões assinalados,
Que da ocidental praia Lusitana,
Por mares nunca de antes navegados,
Passaram ainda além da Taprobana,
Em perigos e guerras esforçados,
Mais do que prometia a força humana,
E entre gente remota edificaram
Novo Reino, que tanto sublimaram
Camões
Textmining Os Lusíadas
This is a simple example of how you can create a wordcloud in R from Os Lusíadas (‘The Lusiads’). This particular wordcloud was made using a few very useful packages: `dplyr` and `readr` (both found in `tidyverse`, as is `stringr`, which supplies the `str_*` functions used below), `tidytext`, and `wordcloud2`, which renders interactive wordclouds.
Here’s how you can do it, step by step:
1. Load the packages
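The packages named above can be attached as follows. This is a minimal sketch that assumes they are already installed; loading `tidyverse` as a whole (rather than `dplyr`, `readr`, and `stringr` one by one) is a choice made here for brevity.

```r
# Assumed package set, based on the packages named in the text
library(tidyverse)   # attaches dplyr, readr and stringr, among others
library(tidytext)    # provides unnest_tokens()
library(wordcloud2)  # provides wordcloud2()
```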
2. Load the actual book
book = read_lines("http://www.gutenberg.org/cache/epub/3333/pg3333.txt")
# Select only lines in the actual poem
book = as.data.frame(book[30:(length(book)-374)])
names(book) = "verse"
book$verse = as.character(book$verse)
# Remove commas
book$verse = str_replace_all(book$verse, pattern = ",", replacement = "")
3. Prepare words
# Add canto and stanza numbers
book = book |>
  mutate(line = row_number()) |>
  mutate(canto = cumsum(str_detect(string = verse, pattern = "^Canto "))) |>
  group_by(canto) |>
  mutate(stanza = cumsum(str_detect(string = verse, pattern = "[1-9]+"))) |>
  select(canto, stanza, line, verse) |>
  ungroup()
# Tokenize text
tidy_book = book |>
  unnest_tokens(input = verse, output = word)
# Load stop words (a csv file with function words in Portuguese)
# Note that this is *not* an exhaustive list, so we'll miss some stop words
stopwords = read_csv("stopword.csv")
names(stopwords) = "word"
# Remove stopwords from tokens
tidy_book = tidy_book |>
  anti_join(stopwords, by = "word")
# Remove numbers
tidy_book = tidy_book |>
  filter(!str_detect(word, "^\\d"))
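To see what the counters and filters in this step do, here is a minimal sketch on a made-up six-line input. The `toy` data frame and the four-word `toy_stops` list are illustrative assumptions, not the real book or the contents of `stopword.csv`. The key idea is that `cumsum()` over a logical vector goes up by one each time a marker line (a canto heading or a stanza number) is detected, so every following line inherits that counter.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical miniature input: one canto heading, two stanza numbers, three verses
toy = tibble(verse = c("Canto Primeiro",
                       "1",
                       "Que da ocidental praia Lusitana",
                       "Por mares nunca de antes navegados",
                       "2",
                       "Novo Reino que tanto sublimaram"))

# Running counters: each detected marker line increments the count
toy = toy |>
  mutate(canto = cumsum(str_detect(verse, "^Canto "))) |>
  group_by(canto) |>
  mutate(stanza = cumsum(str_detect(verse, "[1-9]+"))) |>
  ungroup()

# Hypothetical stop word list (the real one is read from stopword.csv)
toy_stops = tibble(word = c("que", "da", "por", "de"))

# Tokenize (lowercases by default), drop stop words, drop number tokens
toy_words = toy |>
  unnest_tokens(input = verse, output = word) |>
  anti_join(toy_stops, by = "word") |>
  filter(!str_detect(word, "^\\d"))
```

Note that the words from the heading line ("canto", "primeiro") survive the filters, since neither is a stop word or a number; the same holds for the full pipeline unless those words are in the stop word file.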
4. Wordcloud input
words = tidy_book |>
  group_by(word) |>
  summarise(freq = n()) |>
  arrange(desc(freq))
words = as.data.frame(words)
rownames(words) = words$word
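The group, summarise, and arrange steps above can also be written with `count()` from `dplyr`. The sketch below compares the two forms on a made-up token column (`toy_tokens` is an illustrative assumption, not data from the book); either form should work for building the wordcloud input.

```r
library(dplyr)

# Hypothetical tokens standing in for tidy_book
toy_tokens = tibble(word = c("mar", "terra", "mar", "rei", "terra", "mar"))

# Same steps as in the text: group, count rows per word, sort by frequency
long_form = toy_tokens |>
  group_by(word) |>
  summarise(freq = n()) |>
  arrange(desc(freq))

# Shorthand: count() groups, tallies, names the column and sorts in one call
short_form = toy_tokens |>
  count(word, name = "freq", sort = TRUE)
```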
Let’s see what the input data frame looks like (10 most frequent words):
| word | freq |
|---|---|
| gente | 230 |
| terra | 222 |
| rei | 204 |
| mar | 188 |
| mundo | 100 |
| reino | 86 |
| céu | 83 |
| forte | 79 |
| ó | 78 |
| peito | 76 |
5. Actual wordcloud
# The following code generates the wordcloud at the top of the page
wordcloud2(data = words, size = .5,
           shape = "oval",
           rotateRatio = 0.5,
           ellipticity = 0.9, color = "brown")
Copyright © 2023 Guilherme Duarte Garcia