Comparing strengths and weaknesses of NLP techniques
Topic Modeling to identify topics discussed in the restaurant reviews
In a sequence of articles we compare different NLP techniques to show you how we get valuable information from unstructured text. About a year ago we gathered reviews on Dutch restaurants. We were wondering whether ‘the wisdom of the croud’ – reviews from restaurant visitors – could be used to predict which restaurants are most likely to receive a new Michelin-star. Read this post to see how that worked out. We used topic modeling as our primary tool to extract information from the review texts and combined that with predictive modeling techniques to end up with our predictions.
We got a lot of attention with our predictions and also questions about how we did the text analysis part. To answer these questions, we will explain our approach in more detail in the coming articles. But we didn’t stop exploring NLP techniques after our publication, and we also like to share insights from adding more novel NLP techniques. More specifically we will use two types of word embeddings – a classic Word2Vec model and a GLoVe embedding model – we’ll use transfer learning with pretrained word embeddings and we use BERT. We compare the added value of these advanced NLP techniques to our baseline topic model on the same dataset. By showing what we did and how we did it, we hope to guide others that are keen to use textual data for their own data science endeavours.
In a previous article, we showed how we prepared our data to be used for various NLP techniques. Here, we commence our series of articles on NLP techniques by introducing Topic Modeling and show you how to identify topics, visualise topic model results. In a later article, we show you how to use the topic model results in a predictive model.
# Loading packages library(tidyverse) library(tidytext) library(topicmodels) library(tm) library(LDAvis)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ── ✔ ggplot2 3.3.0 ✔ purrr 0.3.4 ✔ tibble 3.0.1 ✔ dplyr 1.0.2 ✔ tidyr 1.1.1 ✔ stringr 1.4.0 ✔ readr 1.3.1 ✔ forcats 0.5.0 ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag() Loading required package: NLP Attaching package: ‘NLP’ The following object is masked from ‘package:ggplot2’: annotate
# **reviews.csv**: a csv file with review texts - the fuel for our NLP analyses. (included key: restoreviewid, hence the unique identifier for a review) reviews <- read.csv(file = 'https://bhciaaablob.blob.core.windows.net/cmotionsnlpblogs/reviews.csv',header=TRUE,stringsAsFactors=FALSE) # **labels.csv**: a csv file with 1 / 0 values, indicating whether the review is a review for a Michelin restaurant or not (included key: restoreviewid) labels <- read.csv(file = 'https://bhciaaablob.blob.core.windows.net/cmotionsnlpblogs/labels.csv',header=TRUE,stringsAsFactors=FALSE) # **trainids.csv**: a csv file with 1 / 0 values, indicating whether the review should be used for training or testing - we already split the reviews in train/test to enable reuse of the same samples for fair comparisons between techniques (included key: restoreviewid)storage_download(cont, "blogfiles/labels.csv",overwrite =TRUE) set.seed(1234) trainids <- read.csv(file = 'https://bhciaaablob.blob.core.windows.net/cmotionsnlpblogs/trainids.csv',header=TRUE,stringsAsFactors=FALSE) %>% mutate(rv = runif(nrow(.)))
## combine unigrams and bigrams into reviewTextClean and divide text into separate words reviews_tokens <- reviews %>% mutate(reviewTextClean = paste0(reviewTextClean,bigrams)) %>% select(restoReviewId, reviewTextClean) %>% unnest_tokens(token, reviewTextClean) %>% group_by(restoReviewId) %>% mutate(n_tokens = n()) %>% filter(n_tokens>=25) %>% ungroup() %>% select(-n_tokens) # filter out reviews with less than 25 tokens # summarize result after tokenization str(reviews_tokens)
tibble [8,681,554 × 2] (S3: tbl_df/tbl/data.frame) $ restoReviewId: chr [1:8681554] "255757_1" "255757_1" "255757_1" "255757_1" ... $ token : chr [1:8681554] "heerlijk" "eten" "leuke" "sfeer" ...
# what % of all reviews are Michelin reviews? labels %>% group_by(ind_michelin) %>% summarize(n=n()) %>% mutate(pct=scales::percent(n/sum(n)))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 3 ind_michelin n pct <int> <int> <chr> 1 0 141554 97% 2 1 4313 3%
# sample reviews: Take all michelin train reviews and complement with non-michelin train cases to include 10K reviews in total reviews_tokens_train <- reviews_tokens %>% inner_join(labels,by = "restoReviewId") %>% inner_join(trainids,by = "restoReviewId") %>% mutate(train_smpl = case_when(train==1 & ind_michelin == 1 ~ 1, # sample all reviews that are michelin review and in the train subset train==1 & rv < (7100/95000) ~ 1, # complete 10K sample by adding 7.1K reviews from non-michelin train reviews TRUE~0)) # all other reviews are not in the train_smpl # what reviews will we keep? reviews_tokens_train %>% group_by(train_smpl,train,ind_michelin) %>% summarize(n_reviews=n_distinct(restoReviewId),n_tokens=n_distinct(token)) %>% print() #create train data using train_smpl as filter reviews_tokens_train <- reviews_tokens_train %>% filter(train_smpl == 1) sprintf('%s unique reviews and %s unique tokens selected to train topic model',n_distinct(reviews_tokens_train$restoReviewId),n_distinct(reviews_tokens_train$token))
`summarise()` regrouping output by 'train_smpl', 'train' (override with `.groups` argument)
# A tibble: 5 x 5 # Groups: train_smpl, train [3] train_smpl train ind_michelin n_reviews n_tokens <dbl> <int> <int> <int> <int> 1 0 0 0 40997 433179 2 0 0 1 1290 35129 3 0 1 0 88505 773887 4 1 1 0 7181 111435 5 1 1 1 2971 70548 [1] "10152 unique reviews and 163137 unique tokens selected to train topic model"
reviews_tokens_train %>% group_by(token) %>% summarize(token_freq=n()) %>% mutate(token_freq_binned = case_when(token_freq>20~20,TRUE~as.numeric(token_freq))) %>% group_by(token_freq_binned) %>% summarise(n_tokens = n()) %>% mutate(pct_tokens = n_tokens/sum(n_tokens), cumpct_tokens = cumsum(n_tokens)/sum(n_tokens)) %>% ggplot(aes(x=token_freq_binned)) + scale_y_continuous(labels = scales::percent_format(accuracy = 1)) + geom_bar(aes(y=pct_tokens),stat='identity',fill='blue') + geom_line(aes(y=cumpct_tokens),stat='identity',color='orange',linetype='dashed') + geom_text(aes(y=cumpct_tokens,label=scales::percent(cumpct_tokens,accuracy=1)),size=3) + theme_minimal() + ggtitle("Frequency of token in Corpus (all reviews)") + xlab("token frequency") + ylab("% of all tokens")
reviews_tokens_train %>% group_by(token) %>% summarize(token_freq=n()) %>% mutate(min_5_freq = case_when(token_freq<5~'token frequency: <5',TRUE~'token frequency: >=5')) %>% group_by(min_5_freq) %>% summarise(n_unique_tokens = n(),n_tokens=sum(token_freq)) %>% mutate(pct_unique_tokens = scales::percent(n_unique_tokens / sum(n_unique_tokens)),pct_all_tokens=scales::percent(n_tokens / sum(n_tokens)))
`summarise()` ungrouping output (override with `.groups` argument) `summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 5 min_5_freq n_unique_tokens n_tokens pct_unique_tokens pct_all_tokens <chr> <int> <int> <chr> <chr> 1 token frequency: <5 150612 187523 92% 27% 2 token frequency: >=5 12525 518146 8% 73%
# remove infrequent tokens reviews_tokens_train_smpl <- reviews_tokens_train %>% group_by(token) %>% mutate(token_freq=n()) %>% filter(token_freq>=5) # create document term matrix dtm <- reviews_tokens_train_smpl %>% cast_dtm(document = restoReviewId,term = token,value = token_freq) #check dimenstions of dtm cat(paste0('DTM dimensions: Documents (',dim(dtm)[1],') x Tokens (',dim(dtm)[2],')',' (average token frequency: ',round(sum(dtm)/sum(dtm!=0),2),')'))
DTM dimensions: Documents (10152) x Tokens (12525) (average token frequency: 702.3)
lda_fit <- LDA(dtm, k = 3)
# phi (topic - token distribution matrix) - topics in rows, tokens in columns: phi <- posterior(lda_fit)$terms %>% as.matrix cat(paste0('Dimensions of phi (topic-token-matrix): ',paste(dim(phi),collapse=' x '),'\n')) cat(paste0('phi examples (8 tokens): ','\n')) phi[,1:8] %>% as_tibble() %>% mutate_if(is.numeric, round, 5) %>% print() # theta (document - topic distribution matrix) - documents in rows, topic probs in columns: theta <- posterior(lda_fit)$topics %>% as.matrix cat(paste0('\n\n','Dimensions of theta (document-topic-matrix): ',paste(dim(theta),collapse=' x '),'\n')) cat(paste0('theta examples (8 documents): ','\n')) theta[1:8,] %>% as_tibble() %>% mutate_if(is.numeric, round, 5) %>% setNames(paste0('Topic', names(.))) %>% print()
Dimensions of phi (topic-token-matrix): 3 x 12525 phi examples (8 tokens): # A tibble: 3 x 8 zeldzaam slechte avond service niveau eten goed vanavond <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 0 0.00006 0.00295 0.00001 0.00021 0.00022 0.112 0.0001 2 0 0.00041 0.00936 0.00843 0.00032 0.430 0.00277 0.0001 3 0 0.00002 0.0157 0.00412 0.00086 0.00003 0.112 0.000070 Dimensions of theta (document-topic-matrix): 10152 x 3 theta examples (8 documents): # A tibble: 8 x 3 Topic1 Topic2 Topic3 <dbl> <dbl> <dbl> 1 0 0.603 0.397 2 0 0.319 0.681 3 0 0.445 0.555 4 0.00001 0.00001 1.00 5 0.340 0 0.660 6 0 0.424 0.576 7 0.00001 0.00001 1.00 8 0.220 0.467 0.313
# get token probability per token per topic topics <- tidy(lda_fit) # only select top-10 terms per topic based on token probability within a topic plotinput <- topics %>% mutate(topic = as.factor(paste0('Topic',topic))) %>% group_by(topic) %>% top_n(10, beta) %>% ungroup() %>% arrange(topic, -beta) # plot highest probability terms per topic names <- levels(unique(plotinput$topic)) colors <- RColorBrewer::brewer.pal(n=length(names),name="Set2") plist <- list() for (i in 1:length(names)) { d <- subset(plotinput,topic == names[i])[1:10,] d$term <- factor(d$term, levels=d[order(d$beta),]$term) p1 <- ggplot(d, aes(x = term, y = beta, width=0.75)) + labs(y = NULL, x = NULL, fill = NULL) + geom_bar(stat = "identity",fill=colors[i]) + facet_wrap(~topic) + coord_flip() + guides(fill=FALSE) + theme_bw() + theme(strip.background = element_blank(), panel.grid.major = element_line(colour = "grey80"), panel.border = element_blank(), axis.ticks = element_line(size = 0), panel.grid.minor.y = element_blank(), panel.grid.major.y = element_blank() ) + theme(legend.position="bottom") plist[[names[i]]] = p1 } library(gridExtra) do.call("grid.arrange", c(plist, ncol=3))
# phi (topic - token distribution matrix) - tokens in rows, topic scores in columns: phi <- posterior(lda_fit)$terms %>% as.matrix # theta (document - topic distribution matrix) - documents in rows, topic probs in columns: theta <- posterior(lda_fit)$topics %>% as.matrix # number of tokens per document doc_length <- reviews_tokens_train_smpl %>% group_by(restoReviewId) %>% summarize(doc_length=n()) %>% select(doc_length) %>% pull() # vocabulary: unique tokens vocab <- colnames(phi) # overall token frequency term_frequency <- reviews_tokens_train_smpl %>% group_by(token) %>% summarise(n=n()) %>% arrange(match(token, vocab)) %>% select(n) %>% pull() # create JSON containing all needed elements json <- createJSON(phi, theta, doc_length, vocab, term_frequency)
`summarise()` ungrouping output (override with `.groups` argument) `summarise()` ungrouping output (override with `.groups` argument)
# render LDAvis - NOT RUN HERE - in RStudio, it opens a new window with the interactive LDAvis tool serVis(json)
#since LDAvis package cannot be used in notebook, we use pyLDAvis, which can be ...
%py #workaround (pyLDAvis to show LDAvis in Notebook, not as html) - https://git ...
%py pyLDAvis.display(ldavis_data) ...
# modify the tokens to consider in topic model reviews_tokens_train_smpl_new <- reviews_tokens_train %>% # remove infrequent tokens (<5) group_by(token) %>% mutate(token_freq=n()) %>% filter(token_freq>=5) %>% ungroup() %>% # combine some tokens that are dominant in solutions and represent same meaning mutate(token = case_when(token == 'gerechten' ~ 'gerecht', token == 'wijnen' ~ 'wijn', token == 'smaken' ~ 'smaak', token == 'vriendelijke' ~ 'vriendelijk',TRUE~token)) %>% # remove some 'too frequent' tokens filter(!token %in% c('goed','eten','restaurant','lekker','gegeten','komen','gaan','kregen','heerlijk','heerlijke','prima','kwam', 'mooi','mooie','leuk','leuke', 'lekker','lekkere','jammer','weinig','gezellig','gezellige', 'voldoende','uitstekend','attent','grote')) # recreate the document term matrix after modifying the tokens to consider dtm_new <- reviews_tokens_train_smpl_new %>% cast_dtm(document = restoReviewId,term = token,value = token_freq) #check dimensions of dtm cat(paste0('DTM dimensions: Documents (',dim(dtm_new)[1],') x Tokens (',dim(dtm_new)[2],')',' (average token frequency: ',round(sum(dtm_new)/sum(dtm_new!=0),2),')')) # estimate lda with k topics, set control variables nstart=n to have n runs, best=FALSE to keep all run results and set the seed for reproduction lda_fit_def <- LDA(dtm_new, k = 7,control = list(nstart=1,best=TRUE,seed=5678)) saveRDS(lda_fit_def,'lda_fit_def.RDS')
DTM dimensions: Documents (10152) x Tokens (12496) (average token frequency: 416.26)
#since LDAvis package cannot be used in notebook, we use pyLDAvis, which can be used in notebook . We need to export the R topic model output to use in python's pyLDAvis # phi (topic - token distribution matrix) - tokens in rows, topic scores in columns: phi <- posterior(lda_fit_def)$terms %>% as.matrix # theta (document - topic distribution matrix) - documents in rows, topic probs in columns: theta <- posterior(lda_fit_def)$topics %>% as.matrix # number of tokens per document doc_length <- reviews_tokens_train_smpl_new %>% group_by(restoReviewId) %>% summarize(doc_length=n()) %>% select(doc_length) %>% pull() # vocabulary: unique tokens vocab <- colnames(phi) # overall token frequency term_frequency <- reviews_tokens_train_smpl_new %>% group_by(token) %>% summarise(n=n()) %>% arrange(match(token, vocab)) %>% select(n) %>% pull() # use tsne method to calculate distance between topics (default sometimes fails - https://www.rdocumentation.org/packages/LDAvis/versions/0.3.2/topics/createJSON) library(tsne) svd_tsne <- function(x) tsne(svd(x)$u) # create JSON containing all needed elements json <- createJSON(phi, theta, doc_length, vocab, term_frequency,mds.method=svd_tsne) # render LDAvis - NOT RUN HERE - in RStudio, it opens a new window with the interactive LDAvis tool serVis(json) # press ESC or Ctrl-C to kill
# save needed model outputs for pyLDAvis (write to driver so that python can eas ...
%py #workaround (pyLDAvis to show LDAvis in Notebook, not as html) - https://git ...
# get token probability per token per topic topics <- tidy(lda_fit_def) topiclabels <- data.frame(topic=seq(1,7), label=c('Hospitality [6]','Tastes & Courses [2]','Culinary Experience & Wines [4]','Atmosphere [5]', 'Price/Quality [3]','Plate details [7]','Waiter & Waitress [1]')) # only select top-10 terms per topic based on token probability within a topic plotinput <- topics %>% inner_join(topiclabels,by="topic") %>% group_by(label) %>% top_n(10, beta) %>% ungroup() %>% arrange(label, -beta) # plot highest probability terms per topic names <- levels(unique(plotinput$label)) colors <- RColorBrewer::brewer.pal(n=length(names),name="Set2") plist <- list() for (i in 1:length(names)) { d <- subset(plotinput,label == names[i])[1:10,] d$term <- factor(d$term, levels=d[order(d$beta),]$term) p1 <- ggplot(d, aes(x = term, y = beta, width=0.75)) + labs(title=names[i],y = NULL, x = NULL, fill = NULL) + geom_bar(stat = "identity",fill=colors[i]) + coord_flip() + guides(fill=FALSE) + theme_bw() + theme(strip.background = element_blank(), panel.grid.major = element_line(colour = "grey80"), panel.border = element_blank(), axis.ticks = element_line(size = 0), panel.grid.minor.y = element_blank(), panel.grid.major.y = element_blank(), plot.title = element_text(size=7)) + theme(legend.position="bottom") plist[[names[i]]] = p1 } library(gridExtra) do.call("grid.arrange", c(plist, ncol=4))
What is Topic Modeling?
To discover the topics that restaurant reviewers write about in their restaurant reviews, we use
Topic Modeling
. But what is a Topic Model? In machine learning and natural language processing, a topic model is a type of statistical model that can be used for discovering the abstract topics that occur in a collection of documents. There are a number of algorithms to extract topics from a collection of texts, but the Latent Dirichlet Allocation is one of the most popular algorithms because it is efficient en results in highly interpretable topics. Interpretability of topics is an important feature of a topic model, since we do not only want to find statistically relevant groupings of words, we also want to be able to label the identified topics with a topic name that others can relate to. As such, topic modeling has some similarities to clustering techniques likeKMeans
, where interpretation is also as important as statistical metrics are in determining what is a ‘good’ solution. How topic modeling / LDA works, is visualised by Blei as:As the figure shows:
So after we are done topic modeling our reviews: