15 September 2021
At Cmotions, we love a challenge, especially one that makes us both think and have fun. Every now and then we start a so-called ‘Project-Friday’, where we combine data science with a fun project. In the past we built a coffee machine with visibility, an algorithm for hand sign recognition, a book recommender and a PowerBI escape room. A few months ago, we decided to join a Kaggle competition, the CommonLit Readability Prize, for numerous reasons. First of all, for many of us this was our first Kaggle competition; high on many bucket lists. Not to mention we like to do more with textual data to deepen our knowledge of NLP techniques. And last but not least, we like to help kids read texts that match their reading level.
In the CommonLit competition, this question needs to be answered:
“To what extent can machine learning identify the appropriate reading level of a passage of text, and help inspire learning?”
This question arose as CommonLit, Inc. and Georgia State University wanted to offer texts at the right level of challenge to 3rd to 12th grade students, in order to stimulate the natural development of their reading skills. Until now, the readability of (English) texts has been based mainly on expert assessments or on well-known formulas, such as the Flesch-Kincaid Grade Level, which often lack construct and theoretical validity. Other, often commercial, solutions also failed to meet the proof and transparency requirements.
Before we entered the competition, we did see some drawbacks. These days, a high rank in a Kaggle competition often means combining many, many, many models to gain a bit of precision and a lot of rank positions – there’s not a lot of difference in accuracy between the highest ranked contenders. With that increase in predictive performance comes a steep decrease in interpretability when the results of many models are combined into a final prediction. Also, since this challenge is about textual data and we do not have many training cases, state-of-the-art pretrained transformer models can be useful. Downside: these are complex and difficult to understand. Let alone when we combine multiple transformer models in our best attempt to win the competition…
Make no mistake about it: when we join a competition, we’re in it to win it. But we also wanted to maximize learning and the explainability of our approach. Therefore, where we expected many others to focus on finetuning and combining transformer models, we spent much of our available time on thinking of relevant features we could create that help in understanding why our model predicts one text is more complex for kids to read than another. We brainstormed, read the literature and checked with our friends and family with positions in education. This resulted in an impressive set of 110 features derived from the text. These features include both readability scales from the literature – Flesch reading ease, Flesch-Kincaid grade, Dale-Chall readability score – and other characteristics we thought might help in explaining the readability of a text – sentence length, syllables per word, abbreviations, sentiment, subjectivity, scrabble points per word, etcetera.
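We won’t reproduce our full feature code here, but as a minimal sketch, assuming the textstat and textblob packages (our choice for illustration, not necessarily the exact implementation we used), a handful of these features could be computed like this:

```python
# Sketch: a few text-derived readability features. textstat implements
# the classic formulas mentioned above, e.g. Flesch-Kincaid Grade Level
# = 0.39 * (words/sentence) + 11.8 * (syllables/word) - 15.59.
import textstat
from textblob import TextBlob

def basic_features(text: str) -> dict:
    blob = TextBlob(text)
    n_words = max(textstat.lexicon_count(text), 1)
    n_sentences = max(textstat.sentence_count(text), 1)
    return {
        # readability scales from the literature
        "flesch_reading_ease": textstat.flesch_reading_ease(text),
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
        "dale_chall": textstat.dale_chall_readability_score(text),
        # simple structural characteristics
        "words_per_sentence": n_words / n_sentences,
        "syllables_per_word": textstat.syllable_count(text) / n_words,
        # sentiment and subjectivity via TextBlob
        "sentiment": blob.sentiment.polarity,
        "subjectivity": blob.sentiment.subjectivity,
    }

print(basic_features("The quick brown fox jumps over the lazy dog."))
```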
Since we think understanding the why is as important as having the most predictive power, we followed this approach:
We started by listing potential features and creating them in the limited time we had available. After this feature engineering, we estimated XGBoost models to identify the most important features and included those features in a regression model; the sketch below illustrates this two-stage setup. This approach resulted in valuable insights.
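As a sketch of that two-stage setup – with synthetic data and hypothetical feature names standing in for our actual 110 engineered features – it could look like this:

```python
# Two-stage approach: XGBoost ranks the engineered features by
# importance, then an interpretable linear regression is fit on the
# top features only.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Synthetic stand-in for the real engineered features and target
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(500, 20)),
                 columns=[f"feature_{i}" for i in range(20)])
y = 2.0 * X["feature_0"] - 1.5 * X["feature_1"] + rng.normal(scale=0.5, size=500)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# Stage 1: XGBoost identifies the most important features
xgb = XGBRegressor(n_estimators=300, learning_rate=0.05, random_state=42)
xgb.fit(X_train, y_train)
importances = pd.Series(xgb.feature_importances_, index=X.columns)
top_features = importances.nlargest(10).index.tolist()

# Stage 2: a plain regression on those features stays interpretable;
# each coefficient shows how a feature moves the predicted readability
lin = LinearRegression().fit(X_train[top_features], y_train)
print(f"Validation R^2: {lin.score(X_val[top_features], y_val):.2f}")
for name, coef in sorted(zip(top_features, lin.coef_), key=lambda t: -abs(t[1])):
    print(f"{name:>12}: {coef:+.3f}")
```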
Based on these results, we can think of ways to help publishers, teachers and students write texts that are easier to read.
Yet we were not done with a model that does a good job of explaining. In the second stage, we explored whether transformer models (BERT, RoBERTa, DistilBERT) could help boost our predictions. And they did! We knew these models, pretrained on huge amounts of textual data, are state of the art and excel at many NLP tasks. In our case too: RMSE decreased from around 0.70 to 0.47 and the explained variance (R²) increased to 79%! After hyperparameter optimization, a tuned RoBERTa Large with some advanced sampling was our best model.
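A minimal sketch of fine-tuning a pretrained transformer as a regressor with the Hugging Face transformers library – the model choice and the toy data are illustrative, not our exact winning configuration – looks like this:

```python
# Sketch: a pretrained transformer with a single-output regression head.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "roberta-large"  # illustrative; smaller checkpoints train faster
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=1 turns the classification head into a one-value regression
# head; problem_type="regression" makes the model use MSE loss
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=1, problem_type="regression"
)

# Toy examples with CommonLit-style readability scores (made up here)
texts = ["A short, easy sentence.", "An egregiously sesquipedalian formulation."]
targets = torch.tensor([[0.5], [-1.2]])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=targets)
print(outputs.loss)    # MSE loss, ready for backward() in a training loop
print(outputs.logits)  # predicted readability scores
```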
[spoiler alert!] No, we didn’t win this Kaggle competition. We ended somewhere in the upper half. To be honest, we knew quite soon that our approach would not be the one to lift our bank accounts (the winner, Mathis Lucka, won $20K, congrats to him!). In this type of competition, the winner spends all their time and resources on building many models and combining those, squeezing every last bit of predictive power out of the training data; see for example the approach of the runner-up. It should be noted, however, that while the winner did use and ensemble a bunch of models, he also had an original approach, involving smart new data (see his Kaggle post for details).
Even though we didn’t win, we enjoyed the competition very much and we learned a lot. Also, we believe that our approach – an interpretable model to explain the why, plus a booster to maximize predictions where needed – is a winning approach in many real-life (business) contexts. (And we believe that an approach that ensembles 10+ transformer models is not 😉.) Therefore, we have already started translating what we learned in this Kaggle competition into something we can use to help others get and use more readable texts. Curious what exactly? Stay tuned, follow our blogs and you’ll find out!
This article was previously published on TheAnalyticsLab.nl.
This article is the result of an innovative project in our Analytics Lab. We are always curious about the developments in the field of Data Science and we love to explore them ourselves. That is why we founded The Analytics Lab as part of Cmotions. Here we explore the limits and possibilities of Data Science, and we are always happy to share our experience, knowledge and vision with you.
Do you want to know more about this subject? Please contact Jurriaan Nagelkerke.