Text-mining Moby Dick

And for years afterwards, perhaps, ships shun the place; leaping over it as silly sheep leap over a vacuum, because their leader originally leaped there when a stick was held. There’s your law of precedents; there’s your utility of traditions; there’s the story of your obstinate survival of old beliefs never bottomed on the earth, and now not even hovering in the air! There’s orthodoxy!


Melville, Herman. Moby Dick (chapter 69)

Wordcloud

This is a simple example of how you can create a wordcloud and a sentiment analysis in R based on Moby Dick. This particular wordcloud was made using a couple of very useful packages: dplyr and readr (both part of the tidyverse), tidytext, and wordcloud2, which renders interactive wordclouds.

After creating our word cloud, we’ll go over some steps to design a figure for our sentiment analysis based on the 135 chapters in the novel. The goal will be to combine different layers of information to give you an example of what you could do using ggplot2 in this type of analysis.


Here’s how you can do it, step by step.

1. Load the packages
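
Based on the functions used below, something like this should do (tidyverse bundles dplyr, readr, stringr, tidyr, and ggplot2; scales provides the percent_format() used in the final figure):

Code
library(tidyverse)   # dplyr, readr, stringr, tidyr, ggplot2
library(tidytext)    # unnest_tokens(), stop_words, get_sentiments()
library(wordcloud2)  # interactive wordclouds
library(scales)      # percent_format() for axis labels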

2. Load and prepare the actual book

This is a sequence of steps where we start by reading the txt file (downloaded from Project Gutenberg and stripped of all the lines that don’t belong to the main text). We then transform our text into a tibble, rename a column, extract chapter numbers, and remove the word “CHAPTER”, among other things. Next, we count the words per line and keep a running total per chapter, which will be useful later on. The second step is to tokenize the text using the unnest_tokens() function.

Code
md = read_lines("mobydick.txt") |> 
  as_tibble() |>                          # one row per line of text
  rename(line = value) |>
  mutate(chapter = str_extract(string = line, pattern = "CHAPTER \\d+")) |>  # tag chapter headings
  fill(chapter, .direction = "down") |>   # carry chapter labels down to every line
  mutate(chapter = str_remove(string = chapter, pattern = "CHAPTER ")) |>    # keep the number only
  filter(line != "") |>                   # drop empty lines
  filter(!str_detect(string = line, pattern = "CHAPTER .*")) |>              # drop heading lines
  mutate(line_n = row_number(),
         chapter = as.numeric(chapter),
         nWords = str_count(line, pattern = " ") + 1) |>                     # words per line
  group_by(chapter) |> 
  mutate(totalWords = cumsum(nWords)) |>  # running word count within each chapter
  ungroup()


# Tokenized version of the text
toks = md |> 
  unnest_tokens(word, line)

You can see the result in the table below, where a random sample of 5 words is displayed.1 nWords represents the number of words in a single line of text (the original unit per row in our tibble above), whereas totalWords is the running word count within a given chapter, so its value in a chapter’s last line gives that chapter’s total length. This allows us to visualize chapters by their representativeness (i.e., their length) in the sentiment analysis figure at the bottom of the page.

chapter  line_n  nWords  totalWords  word
     15    2031       4         393  tophet
    124   16413       7         874  portents
     75   10701       9         391  peruvian
    108   15002      13         126  speak
     42    5951      12        1080  of
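
Incidentally, a random sample like the one above can be drawn with slice_sample(); the seed below is an arbitrary choice, included only to make the draw reproducible.

Code
set.seed(42)  # arbitrary seed
toks |> 
  slice_sample(n = 5)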

3. Prepare words

Now we just need to count our words and make sure we focus on content words only. Fortunately, tidytext provides the stop_words object, which we can refer to inside our filter() function, and voilà.

Code
cloud = toks |> 
  count(word, sort = TRUE) |>          # count() replaces group_by() + count() + arrange()
  filter(!word %in% stop_words$word)   # keep content words only
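
Before plotting, it’s worth peeking at the most frequent content words to make sure the stopword filter did its job; slice_head() is one simple way to do that.

Code
cloud |> 
  slice_head(n = 10)  # ten most frequent content words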

4. Actual wordcloud
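
wordcloud2() takes a data frame whose first two columns are words and their frequencies, which is exactly the shape of our cloud object, so the call itself is a one-liner. The size value below is an arbitrary scaling factor (not a setting from the original figure); tune it to taste.

Code
wordcloud2(data = cloud, size = 0.5)  # size: assumed scaling factor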

A typical wordcloud


Sentiment analysis

Our goal is to show the proportion of positive (vs. negative) words by chapter. We also want to adjust our figure based on the size of each chapter. First, in chapterLength, we create a summary from md that counts the words per chapter. We then create a tibble, sents1, using a sentiment lexicon (bing) and merge that lexicon with the words under analysis. Finally, in sents2, we calculate our proportions and add the chapter-length information. The last line of code here simply flags some sample chapters to highlight in the figure later on.

1. Prepare the data

Code
chapterLength = md |> 
  group_by(chapter) |> 
  summarize(words = sum(nWords))          # total words per chapter

sents1 = md |> 
  unnest_tokens(word, line) |> 
  inner_join(get_sentiments(lexicon = "bing"), by = "word") |>  # keep only words in the lexicon
  mutate(sentBin = if_else(sentiment == "positive", 1, 0))

sents2 = sents1 |> 
  group_by(chapter, sentiment) |> 
  count() |>                              # positive and negative counts per chapter
  group_by(chapter) |> 
  mutate(prop = n / sum(n)) |>            # proportion of each sentiment within a chapter
  filter(sentiment == "positive") |> 
  left_join(chapterLength, by = "chapter") |> 
  mutate(sampleChapter = if_else(chapter %in% c(28, 36, 93, 125), 1, 0))  # chapters to highlight
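
As a quick sanity check before plotting, we can look at the chapters we just flagged; these should match the four annotated chapters in the figure code below.

Code
sents2 |> 
  filter(sampleChapter == 1)  # chapters 28, 36, 93, and 125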

2. Create the figure

Overall, we observe a roughly even split between positive and negative words throughout the novel. Towards the end (from chapter 110 onwards), however, chapters are mostly negative.

Code
ggplot(data = sents2, aes(x = chapter, y = prop)) + 
  geom_vline(xintercept = 28, linetype = "dotdash", color = "gray") +
  geom_vline(xintercept = 36, linetype = "dotdash", color = "gray") +
  geom_vline(xintercept = 93, linetype = "dotdash", color = "gray") +
  geom_vline(xintercept = 125, linetype = "dotdash", color = "gray") +
  
  annotate("text", x = 29, y = 1, label = "Ahab", 
           angle = "90", vjust = 1, hjust = 1, fontface = "italic",
           color = "gray50", size = 3) +
  
  annotate("text", x = 37, y = 1, label = "The Quarter-Deck", 
           angle = "90", vjust = 1, hjust = 1, fontface = "italic",
           color = "gray50", size = 3) +
  
  annotate("text", x = 94, y = 1, label = "The Castaway", 
           angle = "90", vjust = -1, hjust = 1, fontface = "italic",
           color = "gray50", size = 3) +
  
  annotate("text", x = 126, y = 1, label = "The Log and Line", 
           angle = "90", vjust = 1, hjust = 1, fontface = "italic",
           color = "gray50", size = 3) +
  
  geom_point(aes(size = words, color = prop), alpha = 0.5, show.legend = FALSE) +
  theme_classic() +
  scale_size_continuous(range = c(2, 10)) + 
  geom_hline(yintercept = 0.5, linetype = "dotted") +
  theme(legend.position = "top",
        text = element_text(size = 11)) +
  scale_y_continuous(labels = percent_format()) + 
  scale_x_continuous(breaks = c(seq(10, 135, 20))) +
  labs(y = "% of positive words",
       x = "Chapter",
       size = "Number of words in chapter:") + 
  geom_point(data = sents2 |> filter(sampleChapter == 1), size = 3) +
  scale_color_gradient(low = "skyblue", high = "red") +
  coord_cartesian(ylim = c(0, 1))
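
To export a static copy of the figure, ggsave() saves the last plot drawn; the file name and dimensions below are arbitrary choices.

Code
ggsave("sentiment_mobydick.png", width = 8, height = 5)  # hypothetical file name and size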

More examples?

There’s another example of text-mining in the French version of this website (Madame Bovary).



Footnotes

  1. We’ll get rid of stopwords later.