How to uncover the predictive potential of textual data using topic modeling, word embedding, transfer learning and transformer models with R
Textual data is everywhere: reviews, customer questions, log files, books, transcripts, news articles, interview reports … Yet texts are still used too little, alongside the available structured data, when answering analysis questions. In our opinion, this means that some of the available predictive power goes unused and an important part of the explanation is missed. Why is that the case? Textual data, in contrast to structured data from databases and tables, is not ready for use in analyses, data visualizations and AI models. It takes some extra effort: texts must first be converted into numerical representations, so that the tools and techniques analysts and data scientists use to answer business questions can work with them.
Despite the rise in popularity of Python, R is still used by many skilled analysts and data scientists to analyze, visualize and build predictive models. But whereas there are many blogs, books and tutorials on how to use NLP techniques with Python, far fewer examples are available for R users. Also, few blogs compare techniques on the same data and show how to prepare the textual data and which choices to make along the way to answer a business question. In our series of 6 blogs, we close some of these gaps and show how to get value out of textual data using R. We show how to apply different NLP techniques, such as topic modeling, word embeddings and transformer models like BERT, to exactly the same dataset for a specific downstream task: predictive modeling. We show how to translate textual data into features for your model, preferably in combination with the other, numeric features you have, to get the most out of your data.
Setting the scene: Restaurant reviews
Our articles center around restaurant reviews we collected in our attempt to predict new Michelin stars: in late 2019 we set out to predict the next Dutch Michelin star restaurants. Not an easy task, but by looking at restaurant reviews, and especially the review texts, we got very nice results. For this prediction we used topic modeling and sentiment analysis. In this series of articles, we show how we did that and expand to other, more novel NLP techniques. Links to all articles in this series can be found below. All notebooks (originally written in Azure Databricks) can be downloaded from our GitLab page, and links to the available data sources can be found within the notebook code.
Anyone who has worked with textual data will confirm that preparing text for analysis is quite different from preparing numeric, structured data. Therefore, we’ve written a separate blog to show how we preprocess our textual data before we use it for our NLP tasks. The key point is that the many choices you need to make during preprocessing (which words to exclude, what to do with quotes, punctuation, plurals, capitals, …) potentially have a major impact on analysis results. It’s quite a challenge to find the right balance between endlessly rethinking all the earlier choices you’ve made on the one hand and rushing through and making suboptimal or even harmful choices on the other.
The best advice we can give you is to take at least a moment (a few seconds, a minute or two, sometimes an hour if needed) to give each of those choices a thought and to annotate your choices explicitly in your code. Only when you are really in doubt about what to do should you turn a choice into a parameter that you can iterate over and evaluate to find the best option. Be very careful not to create too many parameters though! There are simply too many choices to make, and before you know it you are iterating for hours or even days because you didn’t have the nerve to follow your gut (which is most often the best you can do). Even worse, however, is skipping all choices and using an out-of-the-box text preprocessor that makes every choice for you, leaving you with preprocessed text and no clue what harm has been done to the actual texts. Aside from that, the knowledge of and familiarity with the texts that you build by preprocessing them yourself will definitely pay off when analyzing your texts.
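To make the idea of explicit, annotated choices concrete, here is a minimal sketch in base R. The function name `clean_review`, its arguments and the example sentence are hypothetical and only serve as illustration; this is not the preprocessing code from our notebooks.

```r
# Hypothetical helper: every preprocessing choice is an explicit argument,
# so the choice is documented in the call and can be varied when in doubt.
clean_review <- function(text,
                         to_lower = TRUE,            # choice: ignore capitals?
                         remove_punctuation = TRUE,  # choice: drop punctuation?
                         stopwords = character(0)) { # choice: which words to exclude?
  if (to_lower) text <- tolower(text)
  if (remove_punctuation) text <- gsub("[[:punct:]]+", " ", text)
  # Split on whitespace, drop the stopwords, and glue the tokens back together
  tokens <- strsplit(trimws(text), "\\s+")
  vapply(tokens,
         function(tok) paste(tok[!tok %in% stopwords], collapse = " "),
         character(1))
}

review <- "The food was GREAT, but the service... not so much!"  # made-up example
my_stopwords <- c("the", "was", "but", "so")

cleaned <- clean_review(review, stopwords = my_stopwords)
print(cleaned)
# "food great service not much"

# Changing one choice changes the result: without lowercasing, "The" is no
# longer matched by the lowercase stopword list and survives.
cleaned_keep_case <- clean_review(review, to_lower = FALSE, stopwords = my_stopwords)
print(cleaned_keep_case)
# "The food GREAT service not much"
```

Note how the choices interact: the stopword list only works as intended in combination with lowercasing. This is exactly the kind of side effect that stays hidden when an out-of-the-box preprocessor makes the choices for you.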
Building a good topic model requires a mix of expertise, creativity and some perseverance. Nevertheless, it is a great text analytics tool to have at your disposal. As a data scientist you are at the steering wheel when extracting topics from text. Topic modeling is the collective name for techniques that help to divide texts into topics, whereby determining the relevant topics is part of the analysis. As this is an unsupervised technique, it requires many iterations, creativity and an eye for the interpretability of the end result. A good working relationship with your (internal) client is also of the utmost importance here: discussing draft results and getting feedback on the extracted topics is key to success. Anyone can extract topics, but the goal is to extract topics that discriminate, are appealing and are of value in the downstream task. Yes, other, more novel NLP techniques might beat topic models in terms of accuracy. However, a big upside of using the topics extracted from the text in a downstream task is that the contribution of each topic (and of the other features you have available) to the prediction model is easily analyzed and visualized. This is useful not only for you but above all for anyone you are explaining the predictive model to.
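The interpretability argument can be illustrated with a small sketch in base R. Everything in it is an assumption made for illustration: the topic probabilities, the `avg_score` feature and the target are simulated, the topic names are invented, and logistic regression is simply a convenient interpretable model, not necessarily the model used in this series.

```r
set.seed(42)
n <- 200

# Simulated document-topic probabilities: one row per review, rows sum to 1
raw <- matrix(rgamma(n * 3, shape = 1), ncol = 3)
topics <- raw / rowSums(raw)
colnames(topics) <- c("topic_food", "topic_service", "topic_ambiance")

# Combine the topic probabilities with a (simulated) numeric review feature
reviews <- data.frame(topics, avg_score = runif(n, 1, 10))

# Simulated target: here, talking about food raises the odds of a star
linear_part <- -3 + 4 * reviews$topic_food + 0.2 * reviews$avg_score
reviews$michelin <- rbinom(n, 1, plogis(linear_part))

# Topic probabilities sum to 1, so one topic is left out as the reference
# level to avoid perfect collinearity with the intercept
fit <- glm(michelin ~ topic_food + topic_service + avg_score,
           data = reviews, family = binomial())

# One coefficient per feature: the contribution of each topic is directly
# readable next to that of the numeric feature
coefs <- coef(summary(fit))
print(round(coefs, 2))
```

Each topic ends up as a single named column in the model, which is what makes its contribution easy to report and discuss with a client.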
Although our first topic model results were great, we did lose a lot of information in our texts by translating the full reviews into fewer than 10 features: the topic probabilities. To reveal more of the subtleties hidden in the review texts, we use word embeddings to improve our predictions. Word embeddings are a representation of text in which words with a similar meaning have a similar numeric representation. Have you ever heard the example King - Man + Woman = Queen? This is word embeddings in action. The technique captures the semantic similarities between words, and these similarities (or distances) can be useful for prediction purposes. An upside of using word embeddings is that you can visualize them quite easily: a visual representation of word similarities is much easier for humans to grasp than a matrix of numbers. Also, training word embeddings on your own documents provides tailor-made input for your downstream task. The word similarities are used in a neural network for prediction purposes. Yes, as we show in our blog, having both the word similarities and the review features in a single prediction model can be a bit daunting, and choosing the correct neural network architecture and model configuration can take some time as well. But looking back, model performance was overwhelmingly good.
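The King - Man + Woman = Queen example can be worked through in a few lines of base R. The three-dimensional vectors below are hand-made for illustration (think of the dimensions as "royalty", "male" and "female"); they are not trained embeddings, and real embeddings typically have many more dimensions.

```r
# Hand-made toy vectors (assumption: not learned from any text)
embeddings <- rbind(
  king  = c(0.9, 0.8, 0.1),
  queen = c(0.9, 0.1, 0.8),
  man   = c(0.1, 0.9, 0.1),
  woman = c(0.1, 0.1, 0.9),
  pizza = c(0.0, 0.5, 0.5)
)

# Cosine similarity: 1 means same direction, 0 means unrelated
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# The word arithmetic: start at "king", remove "man", add "woman"
target <- embeddings["king", ] - embeddings["man", ] + embeddings["woman", ]

# Compare the result with every word that was not part of the query
candidates <- setdiff(rownames(embeddings), c("king", "man", "woman"))
sims <- sapply(candidates, function(w) cosine(target, embeddings[w, ]))

nearest <- names(which.max(sims))
print(nearest)
# "queen"
```

With trained embeddings the arithmetic is the same; only the vectors come from a model fitted on your own texts instead of being typed in by hand.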
Transformer models and BERT
How can we discuss getting value out of textual data and not talk about transformer models and BERT? We can’t! In our last blog we show how you can stand on the shoulders of giants by using pretrained NLP models for your downstream task in R. We show how to do that with and without finetuning the pretrained model and show the difference in performance. Using Transformer models, and in particular the models trained for the Dutch language (BERTje and RobBERT), provided a significant uplift in the performance of our downstream prediction task. However, this improved fit comes at a price. First, using these models requires cloud computing capabilities or enough GPU resources. So it is good to ask yourself upfront: how accurate do my predictions need to be? Also, the ready-to-use BERT models are generally trained on large amounts of general text (Wikipedia, Google books and web-crawl), so they may not always be the best way to go if you have text from a niche environment (in our case restaurant reviews). Second, the complexity of these deep neural network models makes it hard to explain and interpret exactly how the textual data helps in your downstream task. When interpretability is important for the problem you are trying to solve, the improved accuracy might not outweigh the loss of interpretability compared to other techniques.
If you find yourself in a position where extracting value from text is needed to improve downstream task performance, we hope this NLP with R series has inspired you and helps you to get started. Which technique is most suitable for you is something to find out for yourself. Some general advice from our side: if interpretability of the predictive model is important, then topic modeling is the best way to go. If your predictions need to be very accurate and you cannot afford many false positives, then a Transformer model like BERT might be the best choice. These models provide top-notch accuracy but also require a lot of computing time and resources. When you are looking for decent model performance and are willing to give up a bit of interpretability, then word embeddings are an elegant option. Training the embeddings on your own text is quite easy, and visualizing word differences will stimulate your imagination. Links to all articles in our NLP with R series, written by Jurriaan Nagelkerke and Wouter van Gils, can be found below:
We hope these articles inspire you and get you started with monetizing textual data in R yourself. We think it is important that anyone can use these articles as a starting point. That is why the notebooks and data we use are available on our GitLab page. We would love to hear your feedback and experiences!