series: NO DATA SCIENTIST IS THE SAME! – part 9
This article is part of our series about how different types of data scientists build similar models differently. No two people are the same, and neither are two data scientists. On top of that, the circumstances under which a data challenge needs to be handled change constantly. For these reasons, different approaches can and will be used to complete the task at hand. In our series we explore the four different approaches of our data scientists – Meta Oric, Aki Razzi, Andy Stand, and Eqaan Librium. They are given the task of building a model to predict whether employees of a company – STARDATAPEPS – will look for a new job or not. Based on their distinct profiles, discussed in the first blog, you can already imagine that their approaches will be quite different.
This is the final article of our series, in which we look back at what our data scientists learned. Not only did they each build the model according to their own approach, they also saw how their peers dealt with the same challenge.
Like most companies nowadays, STARDATAPEPS is an agile organisation that works according to the scrum methodology. As a result, each sprint ends with a retrospective, in which the team takes some time to reflect on what was accomplished and on the process by which those results were reached.
Before we look at what the data scientists Aki, Andy, Eqaan and Meta have achieved, let’s briefly recap their different perspectives.
From left to right:
- Meta Oric: ‘It’s all about Meteoric speed’ Meta is always super busy. Therefore, she has no time to tweak and tune each model she is asked to build. She uses state-of-the-art models with little to no modification and prefers ensemble techniques that do not require a lot of data preparation or model selection.
- Aki Razzi: ‘Accuracy is what truly matters’ Aki has won multiple Kaggle competitions, since her models achieve the highest possible performance. Hail the almighty accuracy, precision and recall.
- Andy Stand: ‘Understand is what we do’ Andy does not believe that black-box models are the future. He is happy to sacrifice a bit of accuracy in order to achieve the clearest and most understandable solution.
- Eqaan Librium: ‘Equilibrium and work-life balance’ Eqaan strives for a balance between preparing his data himself and using some ‘black-box’ elements. He wants to be able to explain his models and data manipulations to others, without losing too much predictive power.
An in-depth overview can also be found in our earlier article.
In our series, we discuss different methods to deal with common challenges that a data scientist encounters when building a predictive model:
- How to build a very good XGBoost model quickly with minimal optimization?
- Hyperparameter optimization to get most predictive power from your XGBoost model
- What are the most common data problems that data scientists encounter?
- How to deal with categorical features with many values (high cardinality)?
- Managing missing data when building a predictive model
- Best visualizations to assess the business impact of a predictive model
In each article, the challenge at hand is worked out in different ways and our data scientists explain why they prefer one approach over the other.
And now it’s time for a short recap. All the data scientists have discussed their approach and solution with relevant business colleagues. The team lead has invited Meta, Aki, Andy and Eqaan to join the retro. He’s interested to hear from each of them what, in retrospect, they think the main strengths of their approach are. He also wants to know what, based on the feedback of the business, would be the most important improvement to their approach, also taking the other data scientists’ results into consideration.
Before the meeting starts, even before everyone has picked a chair, Meta emphasizes that she has little time for this meeting: there is lots of stuff on her plate to finish today. Therefore, they decide that everyone names the one thing they love most about their own approach and the one thing they would like to improve in it.
Meta starts: I still love the power of XGBoost. I’m such a big fan because my default pipeline, with only minor adjustments, enables me to quickly develop models for many different use cases instead of spending my limited time on helping just one or a few business colleagues. When talking to the business about the model, I do notice that I have limited insights to share about the model, just how well it works. I feel that I would help the business more if I could share more insights into the model. Therefore, I would like to invest more time in providing insights into the most important features. I’m going to check Andy’s model performance plots and I’m going to have a deeper look at methods like SHAP and LIME to explain my models. Just need to find some time…
Andy emphasizes: In my philosophy the user of a model needs to understand the impact of the model. That is why I put so much effort into building explainable features, using interpretable techniques and visualizing the value of the model in business terms like gains, lift and response. I do realize that with more and more features available it’s important to keep track of what predictive power is available in the data. Therefore, I’ll also try to use XGBoost more often. Meta’s approach is great for generating a good baseline model and for getting more insight into the actual impact of the most important predictors. With techniques like SHAP and LIME, I can further explain how my models work, so I’m happy to team up on that topic, Meta!
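To make the business terms Andy mentions a bit more concrete, here is a minimal sketch in plain Python. It is not taken from the notebooks of this series: the function name `gains_and_lift`, the toy scores and labels, and the exact definitions used (response as the share of positives within a bin, cumulative gains as the share of all positives captured so far, cumulative lift as the cumulative response rate divided by the overall rate) are assumptions made for illustration.

```python
def gains_and_lift(y_true, y_score, n_bins=10):
    """Return one dict per bin with response, cumulative gains and cumulative lift.

    Hypothetical helper for illustration; assumes binary labels (0/1).
    """
    # Rank cases from the highest to the lowest model score
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    labels = [label for _, label in ranked]
    n = len(labels)
    total_pos = sum(labels)
    if n == 0 or total_pos == 0:
        raise ValueError("need at least one case and one positive label")
    overall_rate = total_pos / n

    rows = []
    cum_pos = 0
    start = 0
    for b in range(1, n_bins + 1):
        end = b * n // n_bins  # index just past the last case in this bin
        chunk = labels[start:end]
        if not chunk:  # more bins than cases: skip empty bins
            continue
        bin_pos = sum(chunk)
        cum_pos += bin_pos
        rows.append({
            "bin": b,
            "response": bin_pos / len(chunk),            # positives within the bin
            "cum_gains": cum_pos / total_pos,            # share of all positives found so far
            "cum_lift": (cum_pos / end) / overall_rate,  # versus selecting at random
        })
        start = end
    return rows


# Made-up example: 10 employees, 4 of whom are looking for a new job
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
for row in gains_and_lift(labels, scores, n_bins=5):
    print(row)
```

In this toy example, the top 20% of employees by score contains half of all job seekers, which is the kind of statement a business colleague can act on without knowing how the model works.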
Aki states: Although the gain in accuracy was not very big, the hyperparameter tuning did result in a better model. As data scientists, it’s our job to get the most out of the data we have available. If this results in selecting that one extra employee for the employee retention program who is actually retainable, the extra effort has been worth it! So far, I’ve put most emphasis on tuning the hyperparameters. Next, I would like to put more effort into improving the data itself, namely by better managing missing values and features with high cardinality, to get more predictive power from the data.
Eqaan is the last to share his successes and ideas for further improvements: For me, all that Andy, Meta and Aki stated is valid. The art is in striking the right balance: when to go for the defaults on a model and spend more time on a new challenge, when to go that extra mile for performance, and when to emphasize how to use the scores or how to interpret the model. This is an ongoing challenge; it requires being able to quickly grasp the potential of the model when used in production.
This concludes our series on building a predictive model with four different perspectives on how to do that. Most likely, many of you recognize one or more of our data science rock stars, Meta, Andy, Eqaan and Aki, in your colleagues or in yourself. We believe there is no single way every data scientist should approach each challenge in which a predictive model is built. There is a lot to learn from the different perspectives one can take, and we hope our series helps you (re)consider some important steps in your next endeavours!
All Python notebooks from this series are available on our GitLab page.