Two years ago we published an article in which we compared the four most widely used programmes for statistical analysis. We compared two open source platforms (Python and R) with two commercial packages (SPSS and SAS). Our comparison was based on Methods and Techniques, Ease of Learning, Support, Graphic Options and Costs. If the last few years have taught us anything as Data Scientists, it is that the world of data and the developments around it are changing fast. Very fast! So it’s about time to see how things stand with Python, R, SPSS and SAS.
In our previous article we noted that R, SAS and SPSS were developed at universities (University of Auckland, North Carolina State University and Stanford University respectively). R is the youngest of these three programmes. Python was conceived by Dutchman Guido van Rossum, an avid fan of Monty Python. As the name suggests, Python wasn’t developed by a university. R, SAS and SPSS each have statistical modelling as their primary goal; they are rooted in explanatory analysis and have since moved towards predictive modelling. Python, by contrast, is a ‘general purpose language’, which means that it can be used for developing many different kinds of applications, one of which is statistical modelling.
A trade-off exists within statistical modelling: explainability on one side and predictive power on the other. Explanatory models ask the ‘why’ questions: Why are my customers leaving? Predictive models ask ‘who’ or ‘what’ questions: Which of my clients is about to leave? Such predictive models are often called black boxes, because they can recognize very complex patterns that we as humans can no longer interpret. In other words: an increase in predictive power comes at the expense of explainability. The reverse holds as well. Just like any model, a statistical model is a simplified view of reality. If we want to isolate a certain effect, the rest of ‘reality’ needs to comply with specific demands to keep things understandable. This means we can’t take the complete reality into account, which diminishes the predictive power.
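The trade-off described above can be illustrated with a small sketch. This is a toy example under stated assumptions, not an analysis from our earlier article: the data is synthetic, the helper names (`fit_line`, `knn_predict`, `mse`) are made up for this illustration, and only the Python standard library is used. In practice a library such as scikit-learn would be the more usual choice.

```python
import math

# Synthetic, made-up data: a nonlinear relation between one input and an outcome.
xs = [i * 0.1 for i in range(100)]
ys = [math.sin(x) for x in xs]

# Even positions train the models, odd positions test them.
train = list(zip(xs[0::2], ys[0::2]))
test = list(zip(xs[1::2], ys[1::2]))

def fit_line(points):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    return slope, mean_y - slope * mean_x

def knn_predict(points, x, k=2):
    """Average the outcomes of the k training points closest to x."""
    nearest = sorted(points, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(predict, points):
    """Mean squared error of a prediction function on (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in points) / len(points)

# Explanatory side: one slope and one intercept that a human can read.
slope, intercept = fit_line(train)
linear_mse = mse(lambda x: slope * x + intercept, test)

# Predictive side: follows the pattern closely, but offers no coefficients.
knn_mse = mse(lambda x: knn_predict(train, x), test)

print(f"linear model: outcome = {slope:.3f} * x + {intercept:.3f}, test MSE {linear_mse:.4f}")
print(f"2-nearest neighbours: no coefficients to read, test MSE {knn_mse:.4f}")
```

The linear model answers a ‘why’ style question with a single readable slope, but it misses the nonlinear pattern. The nearest-neighbour model predicts the held-out points far better, yet it has no parameters to interpret.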
From academic to common practice
Explanatory analyses are common in the academic world. So it isn’t surprising that R, SAS and SPSS, given their history, were developed with an eye on explanatory analyses. The use of statistical modelling has broadened and intensified over the years, which led to new concepts like Data Mining, Big Data and Machine Learning. R, SAS and SPSS all responded to this. While SAS developed Enterprise Miner, SPSS developed Modeler. R took a more organic route, driven by the enormous community around it as an open source programme. This community is what makes R strong on both the explanatory and the predictive side of data analysis.
General purpose
Python is also an open source language with a growing community. Because Python is a general purpose language, it is often used for developing applications. These applications sometimes include statistical analyses whose predictions need to be used directly. Because of this, Python has developed strongly in the area of Machine Learning, which makes it especially suited for the predictive side of the spectrum, in particular for scaling and securing predictions so that they can do their work without an analyst having to intervene.
In our article from two years ago we compared the four programmes on five different subjects. Here is a summary of our conclusions:
In my previous article I stated that I would always choose R and/or Python, the choice depending on the specific purpose. I would put my trust in R for analyzing customer behavior, since it is better at opening the black box. For implementing the algorithm with direct use of the results, Python would be my go-to tool. Read on to find out if this is still the case.
Stack Overflow is by far the most widely used platform for developers to ask questions about tooling. When we look at how frequently the tags Python, R and SAS occur (SPSS is never mentioned), we discover a clear trend. We see the same trend in our daily life as Data Scientists. Many of our customers have made the switch, and the vast majority use at least one of the two open source solutions. Universities are making the switch as well. In my time the Marketing department used SPSS for research, but now MARUG (Marketing Association of the University of Groningen) has started with R.
Source: Stack Overflow Trends, 22-1-2019
This is not an entirely objective way to follow trends, since support for Python and R comes mostly from the community (mainly on Stack Overflow), whereas SAS and SPSS mainly use official channels for support. Still, it is a picture that appeals to me, also because it reflects a bit of my own development as a user.
If I take a look at which tooling I have used over the years, SAS and SPSS are virtually non-existent, while Python is clearly trending upwards. I also still use R quite often for marketing purposes, but Python remains a favorite. One reason for this, I think, is that the community around Python is growing more rapidly. France will implement Python as the official programming language in high schools. This means that all French students will be introduced to Python. According to The Economist, Python has surpassed two other ‘general purpose’ programming languages in the last two years.
The community is also growing thanks to large organizations that develop programmes and packages and publish these as open source. Examples are Netflix and Airbnb, or Google with TensorFlow. TensorFlow is used for Deep Learning, an assortment of techniques that have proved themselves over the years as the algorithms with the biggest potential. This results in a growing number of applications, which accelerates development even more.
In daily life, managers often ask for predictive models, while they also have a strong need for an explanation of the patterns. For example, if it is unclear which patterns an algorithm has found, one cannot explain why some customers receive a certain value. This diminishes the acceptance of a model, and the confidence in using it.
Therefore, an important trend is the call for explainability within predictive Machine Learning models, a trend that is still growing. This is another argument for using R and Python. And it still makes R very strong. On the other hand, interest in Python is growing fast for the same reasons. Both R and Python are contributing to transparency about what is happening within the algorithm. There is a reason why ‘Explainable Artificial Intelligence’ (or XAI) is gaining traction in our field.
Over the last years we have seen a decrease in the use of both SAS and SPSS, and an increase in the use of R and especially Python. The communities behind this open source tooling make the languages so strong that the communities grow even bigger. This makes the number of possibilities grow rapidly, which, combined with developments around Deep Learning, gives R and Python a competitive edge for which SAS and SPSS so far have no answer.
For the coming years I think that Deep Learning will 1) continue to grow stronger through improvements in infrastructure, layers and processing power and 2) become more accessible through packages like TensorFlow and Keras. My daily use of Python will keep growing because of this.
I also think that Notebooks will play an increasingly important part in the near future. Notebooks are already used frequently within Python through Jupyter, but I also expect them to play a bigger role within R. The combination of code and markdown makes them ideally suited for Data Scientists to inform their colleagues about the choices they made.
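One common way to look inside a black box is permutation importance: shuffle one input at a time and measure how much the model’s error increases. The sketch below is a minimal, standard-library illustration of that idea, not the method of any particular R or Python package. The `black_box` function is a hypothetical stand-in for a trained model, and the data and feature names are synthetic assumptions made for this example.

```python
import random

def permutation_importance(predict, rows, targets, n_repeats=10, seed=0):
    """Mean increase in squared error when one feature's column is shuffled."""
    rng = random.Random(seed)

    def mse(data):
        return sum((predict(r) - t) ** 2 for r, t in zip(data, targets)) / len(targets)

    baseline = mse(rows)
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the target
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(mse(shuffled) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Hypothetical stand-in for a trained black-box model; it ignores the third feature.
def black_box(row):
    inactivity, complaints, _noise = row
    return 3.0 * inactivity ** 2 + 0.5 * complaints

# Synthetic customers with three made-up features, each between 0 and 1.
data_rng = random.Random(42)
rows = [(data_rng.random(), data_rng.random(), data_rng.random()) for _ in range(200)]
# Synthetic targets: the model's own output plus a little noise.
targets = [black_box(r) + data_rng.gauss(0, 0.1) for r in rows]

feature_names = ["inactivity", "complaints", "noise"]
importances = permutation_importance(black_box, rows, targets)
for name, value in sorted(zip(feature_names, importances), key=lambda p: -p[1]):
    print(f"{name:12s} {value:.4f}")
```

Because the technique only needs a prediction function, it works the same way for any model. A feature the model never uses, like the third one here, ends up with an importance of zero, which is the kind of insight a manager can act on even when the model itself stays opaque.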