Can machine learning assess the readability of texts?

8 September 2021

Article written by Mike te Beest, Junior Consultant

In May, we kicked off another Project Friday. This time, a group of seven Cmotions colleagues, of different origins and levels, came together. As always, the aim was to make a brilliant deliverable, to learn a lot, and (most of all) to have a lot of fun.

The first challenge was to come up with a topic, which would lead to a cool product that everyone is excited about. After brainstorming for a while, it was soon decided to enter the CommonLit Readability Prize (a Kaggle competition) focusing on the following question:

“To what extent can machine learning identify the appropriate reading level of a passage of text, and help inspire learning?”

This question arose as CommonLit, Inc. and Georgia State University wanted to offer texts of the right level of challenge to 3rd to 12th grade students, in order to stimulate the natural development of their reading skills. Until now, the readability of (English) texts, has been based mainly on expert assessments, or on well-known formulas, such as the Flesch-Kincaid Grade Level, which often lack construct and theoretical validity. Other, often commercial, solutions failed to meet the proof and transparency requirements.

From May to August ’21, we worked on developing a Python notebook that is able to assess readability scores better than current standards. We worked with a training dataset of about 3.000 text excerpts that were rated by 27 professionals. In these months, we investigated which features affect readability the most, which models and architecture are best to use, and how to tweak them. The name of our team? Textfacts!

Certainly, you want to know if we succeeded in predicting text readability better than existing solutions, right? Stay tuned and follow our blogs for our stories! Do you have any questions or do you want to know more about this cool project right now? Please click here to visit the competition webpage or contact us, via info@theanalyticslab.nl.

This article was previously published on TheAnalyticsLab.nl.

The Analytics Lab logo wit grijs

This article is a result of an innovative project in our Analytics Lab. We are always curious of the developments in the field of Data Science and we love to explore them ourselves. That is why we founded The Analytics Lab as part of Cmotions. Here we explore the limits and possibilities of Data Science, and we are always happy to share our experience, knowledge and vision with you.

Contact

Do you want to know more about this subject? Please contact Mike te Beest using the details below

Mike te Beest, Junior Consultant

+31 6 46 88 41 58

m.te.beest@cmotions.nl

Latest news

Find your “high risk files” according to GDPR using our DriveScanner

17 April 2023

In every company it’s a struggle to make sure we only keep the documents we want... read more

Nachos Hackathon 2022

7 September 2022

We don’t know if you’ve heard already, but there is yet another crisis on our horizon:... read more

What we learned from kaggle’s commonlit readability prize 

15 September 2021

Project-Friday At Cmotions, we love a challenge. Especially those that make us both think and have... read more

Subscribe to our newsletter

Never miss anything in the field of advanced analytics, data science and its application within organizations!