This blog post describes how I conducted my sentiment analyses so that others can replicate what I did and possibly adapt it to their own projects. The analysis was conducted in R using the tidyverse family of packages.
Setting up
I started by loading the following packages I needed to conduct sentiment analysis into R:
- tidyverse, a suite of packages that makes it easy to manipulate tibbles (a type of table or dataframe) and generate graphs using the dplyr and ggplot2 packages respectively;
- stringi as I needed the stri_isempty() function to remove empty lines;
- corpus for the text_tokens() function to stem and complete words while cleaning the lyrics;
- qdap and tm for cleaning lyrics and conducting initial text mining analyses;
- tidytext to access the “bing” and “nrc” sentiment lexicons I needed to conduct sentiment analyses;
- wordcloud2 to visualise word frequencies within Bandori songs; and
- broom to convert the test results into a table so that parts of the test results can be easily extracted.
#Load the required packages
library(tidyverse)
library(stringi)
library(corpus)
library(qdap)
library(tm)
library(tidytext)
library(wordcloud2)
library(broom)
I created an empty tibble, lyricsTbl, containing four columns: “doc_id” for numerically identifying the lyrics, “text” for holding the lyric data, “title” for the song name and “band” for the band name. This empty tibble was used to import lyrics and their associated data. Note that the columns were set in this exact order because, when the corpus is created, the tibble is split in two: the first two columns become the lyrics data and the last two columns become its metadata.
#Create an empty tibble to import lyrics
lyricsTbl <- tibble(doc_id = numeric(), text = character(), title = character(), band = character())
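To make the content/metadata split concrete, here is a toy one-row example (the row contents are invented) using the tm functions that appear later in this post; calling meta() on the resulting corpus shows that only the last two columns become metadata:
#Toy one-row example (invented contents) of how the corpus splits the tibble
toy_tbl <- tibble(doc_id = 1, text = "la la la",
                  title = "Example Song", band = "Example Band")
toy_corpus <- VCorpus(DataframeSource(toy_tbl))
meta(toy_corpus)          #a data frame with only the title and band columns
content(toy_corpus[[1]])  #the lyric text itself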
Function building
I created three functions to make it easier to run all importing and cleaning steps in one go with fewer inputs.
I first built the lyric_import() function to import lyrics into the lyricsTbl tibble. This function takes a string containing the name of the lyrics document (doc) and a number (id) for the doc_id, and returns a lyricsTbl tibble containing the doc_id and lyrics data as well as the associated metadata (i.e., song and band names). Note that the lyrics documents have to be set up in a specific layout (a made-up example is shown after this list) so that the parts of each document are correctly imported into lyricsTbl:
- The first line contains the song and band titles, separated with a “ by ” separator.
- The second line contains the URL to the lyrics.
- The third line is a credits line, acknowledging the person who translated the lyrics and the date on which the translated lyrics were first uploaded.
- The fourth line onwards contains the lyrics.
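For illustration, here is what readLines() would return for a correctly laid-out document; the song, band, URL, translator and lyrics below are all invented:
#A made-up lyrics document as readLines() would return it:
#line 1 = song and band, line 2 = URL, line 3 = credits, lines 4 onwards = lyrics
c("Example Song by Example Band",
  "https://bandori.wikia.com/wiki/Example_Song",
  "Translated by ExampleUser (uploaded 1 January 2019)",
  "The first line of the lyrics",
  "",
  "The second line of the lyrics")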
#Build a function to import lyrics into lyricsTbl
lyric_import <- function(doc, id) {
  #Import the raw lyrics document into R
  document <- readLines(doc)
  #Collect the song information and lyrics from the raw lyrics document
  first_line <- document[1]
  title <- str_split(first_line, " by ")[[1]][1]
  band <- str_split(first_line, " by ")[[1]][2]
  lyrics <- document[4:length(document)]
  #Remove empty lines in the lyrics
  lyrics <- lyrics[!stri_isempty(lyrics)]
  #Combine the lyrics into one line
  line <- str_c(lyrics, collapse = " ")
  #Add the lyrics and song information to the lyricsTbl table
  lyricsTbl <- add_row(lyricsTbl, doc_id = id, text = line, title = title, band = band)
  #Return the updated lyricsTbl table
  return(lyricsTbl)
}
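As a quick usage sketch (the file name here is hypothetical), importing one document and storing the result back into lyricsTbl would look like:
#Usage sketch with a hypothetical file name: import one document as doc_id 1
lyricsTbl <- lyric_import("example_song.txt", 1)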
The second function I built is the corpus_clean() function. This function runs the cleaning steps on each lyrics document: lowercasing the text, replacing abbreviations, contractions, symbols and numbers, stemming and completing words, and removing punctuation and common stopwords (commonly used words that add little meaning to text analyses). It takes a corpus of lyrics documents and an optional vector of additional stopwords to remove, and returns the cleaned corpus.
#Build a function containing a pipe to clean the corpus of lyrics documents
corpus_clean <- function(corpus, stopword = "") {
  #Define stopwords first to remove common and user-defined stopwords
  stopwords <- c(stopwords("en"), stopword)
  #Build a stemmer dictionary from the lemmatisation list at lexiconista.com
  stem_list <- read_tsv("C:\\D\\2015\\PhD\\data science blog\\bandori lyric analysis\\lemmatization-en.txt")
  names(stem_list) <- c("stem", "term")
  stem_list2 <- new_stemmer(stem_list$term, stem_list$stem)
  stemmer <- function(x) text_tokens(x, stemmer = stem_list2)
  #Replace all mid-dots (present in some lyrics) with a space. This function was defined because removePunctuation cannot remove mid-dots
  remove_mid_dot <- function(x) str_replace_all(x, "·", " ")
  #Replace curly apostrophes in the lyrics with straight apostrophes because retaining curly apostrophes prevents replace_contraction from working
  replace_apos <- function(x) str_replace_all(x, "’", "'")
  #Clean the corpus through a pipe
  corpus <- corpus %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(content_transformer(remove_mid_dot)) %>%
    tm_map(content_transformer(replace_apos)) %>%
    tm_map(content_transformer(replace_abbreviation)) %>%
    tm_map(content_transformer(replace_contraction), sent.cap = FALSE) %>%
    tm_map(content_transformer(replace_symbol)) %>%
    tm_map(content_transformer(replace_number)) %>%
    tm_map(content_transformer(stemmer)) %>%
    tm_map(removeWords, stopwords) %>%
    tm_map(removePunctuation) %>%
    tm_map(stripWhitespace)
  return(corpus)
}
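Although I did not use it for this analysis (see the importing section below), the optional stopword argument lets additional stopwords be removed during cleaning. A usage sketch with placeholder words, where my_corpus stands for any volatile corpus:
#Usage sketch (placeholder words; my_corpus stands for any volatile corpus)
cleaned_corpus <- corpus_clean(my_corpus, stopword = c("la", "oh"))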
Lastly, I built the word_freq() function which generates a word frequency table of each song in tidy format (with each row defining a word-song pair and each column representing a variable). This function takes a corpus of cleaned lyrics documents and returns a tidied tibble containing columns for words, song names and their frequencies.
#Create a function showing frequencies of each word in a song in tidy tibble format
word_freq <- function(corpus) {
  #Generate a Term Document Matrix (TDM) and convert it into a matrix
  tdm <- TermDocumentMatrix(corpus)
  matrix <- as.matrix(tdm)
  #Name the columns (i.e., the songs) in the matrix
  colnames(matrix) <- meta(corpus)$title
  #Convert the matrix into a tibble and add a column containing the words
  tdm_song <- as_tibble(matrix) %>% mutate(word = rownames(matrix))
  #Reorder the columns so that the word column moves from the last to the first position
  tdm_song <- tdm_song[, c(ncol(tdm_song), 1:(ncol(tdm_song) - 1))]
  #Tidy the table so that all song names are placed in one column, and remove any rows with 0 frequency
  tdm_tidy <- tdm_song %>%
    gather(key = "song", value = "freq", -word) %>%
    filter(freq != 0)
  return(tdm_tidy)
}
Writing these three functions made it easier to generate the tables of data that were required to conduct sentiment analyses.
Importing and cleaning lyrics data
English-translated lyrics were copied from the Bandori Wikia (https://bandori.wikia.com/wiki/BanG_Dream!_Wikia) and pasted into separate .txt files in Notepad. These .txt files were then saved into one folder containing all the English-translated lyrics of Bandori original songs. From there, the lyric_import() function was used to import translated lyrics from all bands into lyricsTbl. Each lyrics document was assigned a unique doc_id according to the order in the folder.
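Note that dir() (used in the loop below) lists the files in the current working directory, so the working directory needs to be pointed at the lyrics folder first; the path below is a placeholder for wherever the .txt files are saved:
#Set the working directory to the folder holding the lyrics .txt files (placeholder path)
setwd("C:\\path\\to\\bandori lyrics")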
#Import all lyrics into lyricsTbl
for (i in seq_along(dir())) {
  lyricsTbl <- lyric_import(dir()[i], i)
}
The lyricsTbl containing the lyrics was converted into a volatile corpus. During the creation of the corpus, lyricsTbl was split into a content table (containing the “doc_id” and “text” columns) and a metadata table (containing the “title” and “band” columns). The lyrics were then cleaned with the corpus_clean() function. Note that no additional stopwords were passed to corpus_clean() beyond the common English stopwords.
#Create a volatile corpus containing the Bandori lyrics
bandori_corpus <- VCorpus(DataframeSource(lyricsTbl))
#Clean bandori_corpus with corpus_clean() function
bandori_corpus_clean <- corpus_clean(bandori_corpus)
This is because each band had words that were overused in a small number of songs (typically one or two) without adding much context. Examples include words used consecutively (e.g., “fight”), exclamations (e.g., “cha”) and sound effects (e.g., “nippa”). These words were identified for each band using term frequency-inverse document frequency (tf-idf), which highlights words that occur frequently but only within a few documents. Removing them per band, rather than globally in corpus_clean(), ensured that a word overused by one band was not stripped from the other bands' lyrics. These words, along with their bands, were stored in a CSV file which was loaded into R as “stopwords”.
#Load CSV file containing band-specific stopwords
stopwords <- read_csv("C:\\D\\2015\\PhD\\data science blog\\bandori lyric analysis\\compiled lyrics\\band_stopwords.csv")
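For reference, here is a minimal sketch of how the tf-idf step could look using tidytext's bind_tf_idf(), treating each band as a “document”; the object name band_tfidf is mine, and the flagged words still required manual review before being compiled into the CSV:
#Sketch: rank words by tf-idf across bands to find band-specific overused words
band_tfidf <- word_freq(bandori_corpus_clean) %>%
  inner_join(meta(bandori_corpus), by = c("song" = "title")) %>%
  group_by(band, word) %>%
  summarise(n = sum(freq)) %>%
  ungroup() %>%
  bind_tf_idf(word, band, n) %>%
  arrange(desc(tf_idf))
head(band_tfidf, 10)  #top candidates for band-specific stopwords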
A tidied word frequency table containing frequencies of all words for each song (minus common stopwords) was generated with the word_freq() function. From there, two anti-joins were conducted to remove two sets of stopwords: the common stopwords that were not initially removed in the corpus_clean() function (tidytext's stop_words dataset) and the band-specific stopwords (the stopwords tibble loaded above). As the word frequency tibble contained the song names but not the band names, the column containing the band names was also added via an inner_join() between the word frequency table and the meta table of the corpus.
#Create a tidied word frequency table, removing more common stopwords and band-specific stopwords.
bandori_noStop <- word_freq(bandori_corpus_clean) %>%
  anti_join(stop_words) %>%
  inner_join(meta(bandori_corpus), by = c("song" = "title")) %>%
  anti_join(stopwords, by = c("word", "band"))
Exploratory data analysis
I initially generated a table which had counts for the number of songs and words for each band. Song counts were derived from the original lyricsTbl tibble while word counts were obtained from the bandori_noStop tibble. They were then combined into one table so that the number of songs and words were matched to their bands.
#Count the number of songs for each band
band_count <- lyricsTbl %>%
  group_by(band) %>%
  summarise(num_songs = n())
#Count the total number of words for each band
word_count <- bandori_noStop %>%
  group_by(band) %>%
  summarise(num_words = sum(freq))
#Combine the total number of songs and words into one table
(band_summary <- band_count %>%
  left_join(word_count, by = "band"))
## # A tibble: 6 x 3
## band num_songs num_words
## <chr> <int> <dbl>
## 1 Afterglow 8 900
## 2 Hello, Happy World! 7 653
## 3 Pastel*Palettes 8 690
## 4 Poppin'Party 30 3255
## 5 RAISE A SUILEN 4 462
## 6 Roselia 18 1939
From the word frequency table of Bandori songs, I generated a wordcloud of the 100 most frequently used words using the wordcloud2 package. A colour gradient was used with gray, yellow and red representing increasing word frequencies.
#Include only 100 most frequently used words in Bandori songs
top_100_nostop <- bandori_noStop[, c(1, 3)] %>%
  group_by(word) %>%
  summarise(total = sum(freq)) %>%
  arrange(desc(total)) %>%
  head(100)
#Define the colour gradient for the wordcloud
cloud_colour3 <- ifelse(top_100_nostop$total > 66, "#E50050",
                        ifelse(top_100_nostop$total >= 40, "#F2B141",
                               "#808080"))
#Generate the wordcloud
wordcloud2(top_100_nostop,
           size = 0.25,
           shape = "star",
           shuffle = FALSE,
           color = cloud_colour3)
“Bing” sentiment analysis of lyrics
Here is the ggplot2 theme that I used for most graphs in this blog post.
#Define a theme to be used across all graphs
bandori_theme <- theme(legend.position = "bottom",
                       plot.title = element_text(hjust = 0.5, size = 15, face = "bold"),
                       axis.title = element_text(size = 10, face = "bold"),
                       axis.text.x = element_text(size = 8),
                       legend.text = element_text(size = 10))
To do the “bing” sentiment analysis of the lyrics, I matched the words from the word frequency table with the table of known words and their “bing” sentiments via an inner-join. The resultant table bandori_bing contains words that are identified as “positive” or “negative” under the “bing” sentiment lexicon.
#Match bandori_noStop with the "bing" sentiment lexicon
bandori_bing <- bandori_noStop %>%
  inner_join(get_sentiments("bing"), by = "word")
For each band, I counted the number of words that were either “positive” or “negative” and generated a table with separate “positive” and “negative” count columns. I did further calculations to count the total number of sentiment words (total), the difference between the number of positive and negative words (polarity) and the proportion of positive words (prop_pos).
#Count the total number of positive and negative words from "bing" sentiment lexicon
bandori_bing_total <- bandori_bing %>%
  group_by(band, sentiment) %>%
  summarise(count = sum(freq)) %>%
  spread(sentiment, count) %>%
  mutate(total = positive + negative,
         polarity = positive - negative,
         prop_pos = positive / total)
It is possible that the proportion of positive words could deviate from 0.50 simply by chance. Hence, to test whether each band used significantly more positive than negative words, I conducted an exact binomial test (binom.test(), which compares the observed proportion of positive words against 0.5) for each band. From each test, I collected the p-value as well as the lower and upper bounds of the 95% confidence interval. Along with the number of songs, these were added to the bandori_bing_total table.
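This is also where the broom package earns its keep: tidy() converts the htest object returned by binom.test() into a one-row tibble, so the p-value and confidence bounds can be extracted by name. A toy example with invented counts:
#Toy example (invented counts): tidy() returns a one-row tibble whose columns
#include estimate, p.value, conf.low and conf.high
tidy(binom.test(30, 50))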
#Define empty numeric vectors to store the binomial test results
p_values <- vector(mode = "numeric")
conf_low <- vector(mode = "numeric")
conf_high <- vector(mode = "numeric")
#Conduct an exact binomial test to see whether each band has more positive words than negative words (by comparing the proportion of positive words to 0.5)
for (i in seq_len(nrow(bandori_bing_total))) {
  binom_result <- binom.test(bandori_bing_total$positive[i],
                             bandori_bing_total$total[i],
                             alternative = "two.sided")
  tidied <- tidy(binom_result)
  p_values[i] <- tidied$p.value
  conf_low[i] <- tidied$conf.low
  conf_high[i] <- tidied$conf.high
}
#Add test results onto bandori_bing_total
bandori_bing_total$conf_low <- conf_low
bandori_bing_total$conf_high <- conf_high
bandori_bing_total$p_value <- p_values
bandori_bing_total$num_songs <- band_count$num_songs
#Rearrange bandori_bing_total so that the number of songs sits next to the band names, then round the decimals to 3 decimal places
bandori_bing_total2 <- bandori_bing_total %>%
  select(band, num_songs, negative:p_value) %>%
  mutate(prop_pos = round(prop_pos, 3),
         conf_low = round(conf_low, 3),
         conf_high = round(conf_high, 3),
         p_value = round(p_value, 3))
From the bing sentiment analysis, I also generated a graph visualising the proportions of positive and negative words for each band. Given that word counts differed among the bands, I normalised the numbers of positive and negative words as proportions so that they could be compared across bands.
#Graph the proportions of positive and negative words for each band
bandori_bing %>%
  group_by(band, sentiment) %>%
  summarise(count = sum(freq)) %>%
  ggplot(aes(x = band, y = count, fill = factor(sentiment))) +
  geom_col(position = "fill") +
  geom_hline(yintercept = 0.50, colour = "black", linetype = 2) +
  scale_fill_manual(values = c("red", "green")) +
  labs(x = "Band", y = "Proportion", fill = "Sentiment", title = "Proportion of positive/negative words in Bandori songs") +
  bandori_theme
I also counted the number of songs that were positive or negative overall according to the bing sentiment analysis. Positive and negative songs were defined as songs whose polarity (the difference between the number of positive and negative words) was more than 2 or less than -2 respectively; for example, a song with ten positive and seven negative words (polarity = 3) would be classed as positive. Songs whose polarities were between -2 and 2 inclusive were defined as neutral, because these differences were too small to conclusively group a song as positive or negative.
#Count the number of positive and negative sentiment words for each song (while keeping the band name)
bandori_bing_songSent <- bandori_bing %>%
  group_by(band, song, sentiment) %>%
  summarise(total = sum(freq)) %>%
  ungroup() %>%
  spread(sentiment, total)
#Replace NAs in bandori_bing_songSent with 0
bandori_bing_songSent[is.na(bandori_bing_songSent)] <- 0
#Continue grouping songs into different sentiment categories
bandori_bing_sort <- bandori_bing_songSent %>%
  mutate(polarity = positive - negative,
         result = case_when(polarity > 2 ~ "positive",
                            polarity >= -2 & polarity <= 2 ~ "neutral",
                            polarity < -2 ~ "negative")) %>%
  group_by(band, result) %>%
  summarise(num_song = n()) %>%
  spread(result, num_song)
#Replace NAs in bandori_bing_sort with 0
bandori_bing_sort[is.na(bandori_bing_sort)] <- 0
“NRC” sentiment analysis of lyrics
Similar to the “bing” sentiment analysis, I first matched the words from the word frequency table to the table of known words and their emotions via an inner join. Then, for each band, the number of words under each emotion was counted. As positive and negative sentiments had already been analysed during the “bing” sentiment analysis, data relating to these two sentiments were excluded from the “NRC” sentiment analysis. The table was then modified so that the emotions appear as separate columns.
#Match bandori_noStop with the "nrc" sentiment lexicon
bandori_nrc <- bandori_noStop %>%
  inner_join(get_sentiments("nrc"), by = "word")
#For each band, count the number of words under each emotion and exclude the positive and negative sentiments
bandori_nrc_total <- bandori_nrc %>%
  group_by(band, sentiment) %>%
  summarise(count = sum(freq)) %>%
  filter(!sentiment %in% c("positive", "negative"))
#Spread the bandori_nrc_total table so that emotions appear as separate columns
bandori_nrc_spread <- spread(bandori_nrc_total, sentiment, count)
Proportions of words under specific emotions were also calculated so that they could be compared across bands. This was done by generating a proportion (margin) table with prop.table().
#Convert bandori_nrc_spread into a matrix
bandori_nrc_matrix <- as.matrix(bandori_nrc_spread[, 2:9])
rownames(bandori_nrc_matrix) <- bandori_nrc_spread$band
#Calculate proportions for bandori_nrc_total
bandori_nrc_prop <- round(prop.table(bandori_nrc_matrix, 1), 2)
Following this, I generated a graph visualising the proportion of words under each emotion for each band. Again, the heights of the bars were normalised so that proportions could be calculated and compared across bands.
#Define a named vector of colours attached to specific emotions
emotion_colour <- c("red", "green4", "lawngreen", "black", "yellow1", "navy", "purple", "lightskyblue")
names(emotion_colour) <- c("anger", "anticipation", "disgust", "fear", "joy", "sadness", "surprise", "trust")
#For each band, plot the proportion of words belonging to each emotion
bandori_nrc_total %>%
  ggplot(aes(x = band, y = count, fill = sentiment)) +
  geom_col(position = "fill") +
  scale_fill_manual(values = emotion_colour) +
  labs(x = "Band", y = "Proportion", fill = "Emotion", title = "\"NRC\" emotion words in Bandori songs") +
  bandori_theme
Acknowledgements
I would like to acknowledge the following people who have translated the original Bandori songs from Japanese to English:
- AERIN
- Aletheia
- Arislation
- Betasaihara
- BlaZofgold
- bluepenguin
- Choocolatiah
- Eureka
- Hikari
- Komichi
- Leephysic
- Leoutoeisen
- LuciaHunter
- lunaamatista
- MijukuNine
- Ohoyododesu
- PocketLink
- Rolling
- Starlogakemi
- Thaerin
- Tsushimayohane
- UnBound
- vaniiah
- Youraim
I may have missed other people who have translated songs for this analysis. If I have missed you, I would also like to thank you all the same.