Text mining Moby Dick
And for years afterwards, perhaps, ships shun the place; leaping over it as silly sheep leap over a vacuum, because their leader originally leaped there when a stick was held. There’s your law of precedents; there’s your utility of traditions; there’s the story of your obstinate survival of old beliefs never bottomed on the earth, and now not even hovering in the air! There’s orthodoxy!
Melville, Herman. Moby Dick (chapter 69)
Wordcloud
This is a simple example of how you can create a wordcloud and a sentiment analysis in R based on Moby Dick. This particular wordcloud was made using a couple of very useful packages: dplyr and readr (both part of the tidyverse), tidytext, and wordcloud2, which renders interactive wordclouds.
After creating our wordcloud, we’ll go over some steps to design a figure for our sentiment analysis based on the 135 chapters in the novel. The goal will be to combine different layers of information to give you an example of what you could do using ggplot2 in this type of analysis.
Here’s how you can do it, step by step.
1. Load the packages
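The library() calls aren’t shown in this excerpt. Based on the packages mentioned above (plus scales, which I’m assuming here because the figure code below uses percent_format()), a minimal setup would be:
Code
# tidyverse provides dplyr, readr, tidyr, stringr, and ggplot2,
# tidytext provides unnest_tokens(), stop_words, and get_sentiments(),
# wordcloud2 renders the interactive wordcloud,
# scales provides percent_format() for the y-axis labels (assumed)
library(tidyverse)
library(tidytext)
library(wordcloud2)
library(scales)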
2. Load and prepare the actual book
This is a sequence of steps where we start by reading the txt file (downloaded from Project Gutenberg and stripped of all the lines that don’t belong to the main text). We then transform our text into a tibble, rename a column, extract chapter numbers, and remove the word “CHAPTER”, among other things. Next, we sum the number of words, which will be useful later. The second step is to tokenize the text using the unnest_tokens() function.
Code
# Read the novel and tag each line with its chapter number
md = read_lines("mobydick.txt") |>
  as_tibble() |>
  rename(line = value) |>
  mutate(chapter = str_extract(string = line, pattern = "CHAPTER \\d+")) |>
  fill(chapter, .direction = "down") |>                           # propagate the chapter label to every line
  mutate(chapter = str_remove(string = chapter, pattern = "CHAPTER ")) |>
  filter(line != "") |>                                           # drop empty lines
  filter(!str_detect(string = line, pattern = "CHAPTER .*")) |>   # drop the chapter headings themselves
  mutate(line_n = row_number(),
         chapter = as.numeric(chapter),
         nWords = str_count(line, pattern = " ") + 1) |>          # approximate word count per line
  group_by(chapter) |>
  mutate(totalWords = cumsum(nWords)) |>                          # running word count within each chapter
  ungroup()

# Tokenized version of the text (one word per row)
toks = md |>
  unnest_tokens(word, line)
You can see the result in the table below, where a random sample of 5 words is displayed.¹ nWords represents the number of words in a single line of text (the original unit per row in our tibble above), whereas totalWords represents the cumulative word count within a given chapter, up to and including that line. This allows us to visualize chapters by their representativeness (i.e., their length) in the upcoming figure (the sentiment analysis at the bottom of the page). A quick sketch of how such a sample can be drawn follows the table.
| chapter | line_n | nWords | totalWords | word     |
|---------|--------|--------|------------|----------|
| 15      | 2031   | 4      | 393        | tophet   |
| 124     | 16413  | 7      | 874        | portents |
| 75      | 10701  | 9      | 391        | peruvian |
| 108     | 15002  | 13     | 126        | speak    |
| 42      | 5951   | 12     | 1080       | of       |
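Drawing a random sample like the one above is straightforward with slice_sample() from dplyr; the exact call isn’t shown in the original, so this is just a sketch:
Code
# Random sample of 5 tokenized rows, in the column order shown in the table
toks |>
  select(chapter, line_n, nWords, totalWords, word) |>
  slice_sample(n = 5)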
3. Prepare words
Now we just need to count our words and make sure we focus on content words only. Fortunately, we can refer to the existing stop_words object inside our filter() function and voilà.
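The counting and filtering code isn’t reproduced in this excerpt; a minimal sketch consistent with the description above (the object name wordCounts is my own) could be:
Code
# Keep content words only and count how often each one occurs
wordCounts = toks |>
  filter(!word %in% stop_words$word) |>   # drop function words using tidytext's stop_words
  count(word, sort = TRUE)                # one row per word, with its frequency n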
4. Actual wordcloud
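The wordcloud call itself isn’t shown here either. Assuming the wordCounts object sketched in step 3, something along these lines would render the interactive cloud (wordcloud2() expects a data frame whose first two columns are the word and its frequency; the frequency threshold below is arbitrary):
Code
wordCounts |>
  filter(n > 50) |>        # keep reasonably frequent words
  wordcloud2(size = 0.8)   # interactive HTML widget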
(Figure: a typical wordcloud)
Sentiment analysis
Our goal is to show the proportion of positive (vs. negative) words by chapter. We also want to adjust our figure based on the size of each chapter. First, in chapterLength, we create a summary from md that counts the words per chapter. We then create a tibble, sents1, using a sentiment lexicon (bing) and merge that lexicon with the words under analysis. Finally, in sents2, we calculate our proportions and add the information on chapter length. The last line of code here simply selects some sample chapters to highlight in the figure later on.
1. Prepare the data
Code
# Total number of words per chapter
chapterLength = md |>
  group_by(chapter) |>
  summarize(words = sum(nWords))

# One row per sentiment-bearing word (bing lexicon)
sents1 = md |>
  unnest_tokens(word, line) |>
  inner_join(get_sentiments(lexicon = "bing")) |>
  mutate(sentBin = if_else(sentiment == "positive", 1, 0))

# Proportion of positive words per chapter, plus chapter length
sents2 = sents1 |>
  group_by(chapter, sentiment) |>
  count() |>
  group_by(chapter) |>
  mutate(prop = n / sum(n)) |>
  filter(sentiment == "positive") |>
  left_join(chapterLength, by = "chapter") |>
  mutate(sampleChapter = if_else(chapter %in% c(28, 36, 93, 125), 1, 0))   # chapters highlighted in the figure
2. Create the figure
Overall, we observe a balanced relationship between positive and negative words throughout. Towards the end of the novel (from chapter 110 onwards), however, chapters are mostly negative.
Code
ggplot(data = sents2, aes(x = chapter, y = prop)) +
  # Reference lines and labels for the four sample chapters
  geom_vline(xintercept = 28, linetype = "dotdash", color = "gray") +
  geom_vline(xintercept = 36, linetype = "dotdash", color = "gray") +
  geom_vline(xintercept = 93, linetype = "dotdash", color = "gray") +
  geom_vline(xintercept = 125, linetype = "dotdash", color = "gray") +
  annotate("text", x = 29, y = 1, label = "Ahab",
           angle = 90, vjust = 1, hjust = 1, fontface = "italic",
           color = "gray50", size = 3) +
  annotate("text", x = 37, y = 1, label = "The Quarter-Deck",
           angle = 90, vjust = 1, hjust = 1, fontface = "italic",
           color = "gray50", size = 3) +
  annotate("text", x = 94, y = 1, label = "The Castaway",
           angle = 90, vjust = -1, hjust = 1, fontface = "italic",
           color = "gray50", size = 3) +
  annotate("text", x = 126, y = 1, label = "The Log and Line",
           angle = 90, vjust = 1, hjust = 1, fontface = "italic",
           color = "gray50", size = 3) +
  # One point per chapter: size reflects chapter length, color reflects % positive
  geom_point(aes(size = words, color = prop), alpha = 0.5, show.legend = FALSE) +
  theme_classic() +
  scale_size_continuous(range = c(2, 10)) +
  geom_hline(yintercept = 0.5, linetype = "dotted") +
  theme(legend.position = "top",
        text = element_text(size = 11)) +
  scale_y_continuous(labels = percent_format()) +
  scale_x_continuous(breaks = c(seq(10, 135, 20))) +
  labs(y = "% of positive words",
       x = "Chapter",
       size = "Number of words in chapter:") +
  # Highlight the four sample chapters with solid points
  geom_point(data = sents2 |> filter(sampleChapter == 1), size = 3) +
  scale_color_gradient(low = "skyblue", high = "red") +
  coord_cartesian(ylim = c(0, 1))
More examples?
There’s another example of text-mining in the French version of this website (Madame Bovary).
Footnotes
1. We’ll get rid of stop words later.↩︎