Text-mining Moby Dick

And for years afterwards, perhaps, ships shun the place; leaping over it as silly sheep leap over a vacuum, because their leader originally leaped there when a stick was held. There’s your law of precedents; there’s your utility of traditions; there’s the story of your obstinate survival of old beliefs never bottomed on the earth, and now not even hovering in the air! There’s orthodoxy!


Melville, Herman. Moby Dick (chapter 69)

Wordcloud

This is a simple example of how you can create a wordcloud and a sentiment analysis in R based on Moby Dick. This particular wordcloud was made using a couple of very useful packages: dplyr and readr (both part of the tidyverse), tidytext, and wordcloud2, which renders interactive wordclouds.

After creating our word cloud, we’ll go over some steps to design a figure for our sentiment analysis based on the 135 chapters in the novel. The goal will be to combine different layers of information to give you an example of what you could do using ggplot2 in this type of analysis.


Here’s how you can do it, step by step.

1. Load the packages
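
Based on the functions used below, something like this should do (tidyverse bundles dplyr, readr, stringr, tidyr, and ggplot2; scales provides the percent_format() used in the final figure):

Code
library(tidyverse)   # dplyr, readr, stringr, tidyr, ggplot2
library(tidytext)    # unnest_tokens(), stop_words, get_sentiments()
library(wordcloud2)  # interactive wordclouds
library(scales)      # percent_format() for axis labels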

2. Load and prepare the actual book

This is a sequence of steps where we start by reading the txt file (downloaded from Project Gutenberg and stripped of all the lines that don’t belong to the main text). We then transform our text into a tibble, rename a column, extract chapter numbers, and remove the word “CHAPTER”, among other things. Next, we count the words per line and keep a running total per chapter, which will be useful later on. The second step is to tokenize the text using the unnest_tokens() function.

Code
md = read_lines("mobydick.txt") |> 
  as_tibble() |>                          # one row per line of text
  rename(line = value) |>
  mutate(chapter = str_extract(string = line, pattern = "CHAPTER \\d+")) |>  # tag chapter headings
  fill(chapter, .direction = "down") |>   # carry chapter labels down to every line
  mutate(chapter = str_remove(string = chapter, pattern = "CHAPTER ")) |>    # keep the number only
  filter(line != "") |>                   # drop empty lines
  filter(!str_detect(string = line, pattern = "CHAPTER .*")) |>              # drop heading lines
  mutate(line_n = row_number(),
         chapter = as.numeric(chapter),
         nWords = str_count(line, pattern = " ") + 1) |>                     # words per line
  group_by(chapter) |> 
  mutate(totalWords = cumsum(nWords)) |>  # running word count within each chapter
  ungroup()


# Tokenized version of the text
toks = md |> 
  unnest_tokens(word, line)

You can see the result in the table below, where a random sample of 5 words is displayed.1 nWords represents the number of words in a single line of text (the original unit per row in our tibble above), whereas totalWords is the running word count within a given chapter, so its value in a chapter’s last line gives that chapter’s total length. This allows us to visualize chapters by their representativeness (i.e., their length) in the sentiment analysis figure at the bottom of the page.

chapter  line_n  nWords  totalWords  word
     15    2031       4         393  tophet
    124   16413       7         874  portents
     75   10701       9         391  peruvian
    108   15002      13         126  speak
     42    5951      12        1080  of
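
Incidentally, a random sample like the one above can be drawn with slice_sample(); the seed below is an arbitrary choice, included only to make the draw reproducible.

Code
set.seed(42)  # arbitrary seed
toks |> 
  slice_sample(n = 5)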

3. Prepare words

Now we just need to count our words and make sure we focus on content words only. Fortunately, tidytext provides the stop_words object, which we can refer to inside our filter() function, and voilà.

Code
cloud = toks |> 
  count(word, sort = TRUE) |>          # count() replaces group_by() + count() + arrange()
  filter(!word %in% stop_words$word)   # keep content words only
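
Before plotting, it’s worth peeking at the most frequent content words to make sure the stopword filter did its job; slice_head() is one simple way to do that.

Code
cloud |> 
  slice_head(n = 10)  # ten most frequent content words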

4. Actual wordcloud
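
wordcloud2() takes a data frame whose first two columns are words and their frequencies, which is exactly the shape of our cloud object, so the call itself is a one-liner. The size value below is an arbitrary scaling factor (not a setting from the original figure); tune it to taste.

Code
wordcloud2(data = cloud, size = 0.5)  # size: assumed scaling factor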

A typical wordcloud


Sentiment analysis

Our goal is to show the proportion of positive (vs. negative) words by chapter. We also want to adjust our figure based on the size of each chapter. First, in chapterLength, we create a summary from md that counts the words per chapter. We then create a tibble, sents1, using a sentiment lexicon (bing) and merge that lexicon with the words under analysis. Finally, in sents2, we calculate our proportions and add the chapter-length information. The last line of code here simply flags some sample chapters to highlight in the figure later on.

1. Prepare the data

Code
chapterLength = md |> 
  group_by(chapter) |> 
  summarize(words = sum(nWords))          # total words per chapter

sents1 = md |> 
  unnest_tokens(word, line) |> 
  inner_join(get_sentiments(lexicon = "bing"), by = "word") |>  # keep only words in the lexicon
  mutate(sentBin = if_else(sentiment == "positive", 1, 0))

sents2 = sents1 |> 
  group_by(chapter, sentiment) |> 
  count() |>                              # positive and negative counts per chapter
  group_by(chapter) |> 
  mutate(prop = n / sum(n)) |>            # proportion of each sentiment within a chapter
  filter(sentiment == "positive") |> 
  left_join(chapterLength, by = "chapter") |> 
  mutate(sampleChapter = if_else(chapter %in% c(28, 36, 93, 125), 1, 0))  # chapters to highlight
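
As a quick sanity check before plotting, we can look at the chapters we just flagged; these should match the four annotated chapters in the figure code below.

Code
sents2 |> 
  filter(sampleChapter == 1)  # chapters 28, 36, 93, and 125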

2. Create the figure

Overall, we observe a roughly even split between positive and negative words throughout the novel. Towards the end (from chapter 110 onwards), however, chapters are mostly negative.

Code
ggplot(data = sents2, aes(x = chapter, y = prop)) + 
  geom_vline(xintercept = 28, linetype = "dotdash", color = "gray") +
  geom_vline(xintercept = 36, linetype = "dotdash", color = "gray") +
  geom_vline(xintercept = 93, linetype = "dotdash", color = "gray") +
  geom_vline(xintercept = 125, linetype = "dotdash", color = "gray") +
  
  annotate("text", x = 29, y = 1, label = "Ahab", 
           angle = "90", vjust = 1, hjust = 1, fontface = "italic",
           color = "gray50", size = 3) +
  
  annotate("text", x = 37, y = 1, label = "The Quarter-Deck", 
           angle = "90", vjust = 1, hjust = 1, fontface = "italic",
           color = "gray50", size = 3) +
  
  annotate("text", x = 94, y = 1, label = "The Castaway", 
           angle = "90", vjust = -1, hjust = 1, fontface = "italic",
           color = "gray50", size = 3) +
  
  annotate("text", x = 126, y = 1, label = "The Log and Line", 
           angle = "90", vjust = 1, hjust = 1, fontface = "italic",
           color = "gray50", size = 3) +
  
  geom_point(aes(size = words, color = prop), alpha = 0.5, show.legend = FALSE) +
  theme_classic() +
  scale_size_continuous(range = c(2, 10)) + 
  geom_hline(yintercept = 0.5, linetype = "dotted") +
  theme(legend.position = "top",
        text = element_text(size = 11)) +
  scale_y_continuous(labels = percent_format()) + 
  scale_x_continuous(breaks = c(seq(10, 135, 20))) +
  labs(y = "% of positive words",
       x = "Chapter",
       size = "Number of words in chapter:") + 
  geom_point(data = sents2 |> filter(sampleChapter == 1), size = 3) +
  scale_color_gradient(low = "skyblue", high = "red") +
  coord_cartesian(ylim = c(0, 1))
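
To export a static copy of the figure, ggsave() saves the last plot drawn; the file name and dimensions below are arbitrary choices.

Code
ggsave("sentiment_mobydick.png", width = 8, height = 5)  # hypothetical file name and size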

More examples?

There’s another example of text-mining in the French version of this website (Madame Bovary).



Footnotes

  1. We’ll get rid of stopwords later.