Python & R vs. SPSS & SAS – comparison renewed after two years

18 February 2019

Two years ago we published an article in which we compared the four most used programmes for statistical analysis. We compared two open source platforms (Python and R) with two commercial vendors (SPSS and SAS). Our comparison was based on Methods and Techniques, Simplicity in Learning, Support, Graphic Options and Costs. If the last few years have taught us anything as Data Scientists, it is that the world of data and the developments around it are changing fast. Very fast! So it’s about time to see how things stand with Python, R, SPSS and SAS.

In our previous article we noted that R, SAS and SPSS were all developed at universities (the University of Auckland, North Carolina State University and Stanford University respectively). We see R as the youngest of these three programmes. Python was conceived by Dutchman Guido van Rossum, an avid fan of Monty Python, and you can tell that it did not come out of a university. R, SAS and SPSS each have statistical modelling as their primary goal; they are rooted in explanatory analysis and have been moving towards predictive modelling. Python, on the other hand, is a ‘general purpose language’, which means that it can be used for developing many different kinds of applications, one of which is statistical modelling.

 

Balance between predictability and explainability

A trade-off exists within statistical modelling. This trade-off is between explainability on one side and predictability on the other. Explanatory models ask the ‘why’ questions: Why are my customers leaving? Predictive models ask ‘who’ or ‘what’ questions: Which of my clients are about to leave? Such predictive models are often called black boxes, because they can recognize very complex patterns that we as humans can no longer interpret. In other words: an increase in predictive power comes at the expense of explainability. The reverse holds as well. Just like any model, a statistical model is a simplified view of reality. If we want to isolate a certain effect, the rest of ‘reality’ needs to meet specific assumptions to keep things understandable. This means we can’t take the complete reality into account, which diminishes the predictive power.
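To make the trade-off concrete, here is a minimal sketch in plain Python. The data is made up (a U-shaped relation), and the helper names `fit_line`, `knn_predict` and `mse` are hypothetical, chosen only for this illustration. A straight line gives one slope that is easy to read but misses the pattern, while a nearest-neighbour predictor follows the pattern without offering any coefficient to interpret.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares with one feature: returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def knn_predict(xs, ys, x_new, k=3):
    """Average the targets of the k training points closest to x_new."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x_new))[:k]
    return mean(y for _, y in nearest)

def mse(actual, predicted):
    """Mean squared error between two equally long sequences."""
    return mean((a - p) ** 2 for a, p in zip(actual, predicted))

# Toy data (assumption, not from the article): y = x^2 on a symmetric grid.
train_x = [x / 2 for x in range(-10, 11)]        # -5.0, -4.5, ..., 5.0
train_y = [x ** 2 for x in train_x]
test_x = [x / 2 + 0.25 for x in range(-10, 10)]  # points between the grid
test_y = [x ** 2 for x in test_x]

# Explanatory model: a single slope we can read, but a poor fit here.
intercept, slope = fit_line(train_x, train_y)
linear_mse = mse(test_y, [intercept + slope * x for x in test_x])

# Predictive model: no coefficient to read, but it follows the curve.
knn_mse = mse(test_y, [knn_predict(train_x, train_y, x) for x in test_x])

print(f"slope={slope:.3f}  linear MSE={linear_mse:.2f}  kNN MSE={knn_mse:.2f}")
```

On this toy data the slope is (numerically) zero, so the readable model would suggest there is no effect at all, while the nearest-neighbour model predicts far better without telling us why.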

From academic to common practice
Explanatory analyses are common in the academic world. So it isn’t surprising that R, SAS and SPSS, given their history, were developed with an eye on explanatory analyses. The use of statistical modelling has broadened and intensified over the years, which led to new concepts like Data Mining, Big Data and Machine Learning. R, SAS and SPSS all responded to this. While SAS developed Enterprise Miner, SPSS developed Modeler. R took a more organic route, driven by the enormous community around it as an open source programme. This community is what makes R strong on both the explanatory and the predictive side of data analysis.

General purpose
Python is also an open source language with a growing community. Because Python is a general purpose language, it is often used for developing applications. These applications sometimes include statistical analyses that make predictions which need to be used directly. Because of this, Python has developed strongly in the area of Machine Learning, which makes it especially suited for the predictive side of the spectrum, in particular for scaling and securing predictions so that they can do their work without an analyst having to intervene.

 

Two years ago

In our article from two years ago we compared the four programmes on five different subjects. Here is a summary of our conclusions:

  • Methods and Techniques: R and Python were by far the best in this category. SAS and SPSS can’t keep up, simply because of the large community that helps develop packages and algorithms for the open source languages.
  • Simplicity: SPSS and SAS are both equipped with an elaborate Graphical User Interface, which means you can simply click instead of writing code. Neither R nor Python can compete with the commercial vendors in this regard. Between R and Python, Python is the winner. Readability was one of the main design principles when Python was developed, which makes the language easy to learn for non-programmers.
  • Support: Both SAS and SPSS offer official support, which is an important argument for some decision makers to choose these tools. In practice, however, the communities for R and Python provide much better and faster support. The sheer size of these communities results not only in faster development, but also in better, worldwide support through various platforms and forums.
  • Graphic functionality: Again, R and Python are the winners here as far as I’m concerned. Not only do charts look much better when, for example, ggplot2 or D3 is used, but both R and Python also make it possible to develop interactive tools.
  • Cost: This one goes to R and Python as well. Both are open source, which makes them free to use. They do, however, have a longer learning curve than the GUIs of SPSS and SAS, which often makes training more expensive.

In my previous article I argued that I would always choose R and/or Python, the choice depending on the specific purpose. I would put my trust in R for analyzing customer behavior, since it is better at opening the black box. For implementing an algorithm whose results are used directly, Python would be my go-to tool. Read on to find out if this is still the case.

 

Trends

Stack Overflow is by far the most widely used platform for developers to ask questions about tooling. When we look at how frequently the tags Python, R and SAS occur (SPSS is never mentioned), we discover a clear trend. We see the same trend in our daily work as Data Scientists. Many of our customers have made the switch, and the vast majority use at least one of the two open source solutions. Universities are making the switch as well. When I was a student the Marketing department used SPSS for research, but now MARUG (Marketing Association of the University of Groningen) has started using R.
Figure: Stack Overflow questions per month for the tags Python, R and SAS

Source: Stack Overflow Trends, 22 January 2019

This is not an entirely objective way to follow trends, since support for Python and R comes mostly from the community (mainly on Stack Overflow), whereas SAS and SPSS mainly use official support channels. Still, it is a picture that appeals to me, also because it reflects a bit of my own development as a user.

If I look at which tooling I have used over the years, SAS and SPSS are virtually non-existent, while Python is clearly trending upwards. I still use R quite often for marketing purposes, but Python remains my favorite. One reason for this, I think, is that the community around Python is growing more rapidly. France will implement Python as the official programming language in high schools, which means that all French students will be introduced to Python. According to The Economist, Python has surpassed two other ‘general purpose’ programming languages in the last two years.

The community is also increasingly joined by large organizations that develop programmes and packages and publish these as open source. Examples are Netflix and Airbnb, or Google with TensorFlow. TensorFlow is used for Deep Learning, a family of techniques that has proved itself over the years as the one with the biggest potential. This results in a growing number of applications, which accelerates development even more.

In daily practice, managers often ask for predictive models, while they also have a strong need for an explanation of the patterns. For example, if it is unclear which patterns an algorithm has found, one cannot explain why some customers receive a certain value. This diminishes the acceptance of a model and the confidence in using it.
Therefore, an important trend is the call for explainability within predictive Machine Learning models, a trend that is still growing. This is another argument for using R and Python. It still makes R very strong. On the other hand, interest in Python is growing fast for the same reasons. Both R and Python contribute to transparency about what is happening within the algorithm. There is a reason why ‘Explainable Artificial Intelligence’ (or XAI) is gaining traction in our field.
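One common way to look inside a black box is permutation importance: shuffle one input at a time and see how much the predictions suffer. The sketch below is only an illustration in plain Python. The `black_box` function, the feature names and the data are all hypothetical stand-ins, not a real churn model, and the labels are generated by the model itself so that the baseline error is zero.

```python
import random
from statistics import mean

def black_box(row):
    """Hypothetical opaque model: flags a customer as likely to leave (1.0) or not (0.0)."""
    tenure, complaints, _shoe_size = row  # the third feature is deliberately ignored
    return 1.0 if 2 * complaints - 0.5 * tenure > 0 else 0.0

def error_rate(model, rows, labels):
    """Share of rows where the model's prediction differs from the label."""
    return mean(abs(model(r) - y) for r, y in zip(rows, labels))

def permutation_importance(model, rows, labels, column, rng):
    """Increase in error after shuffling one column, which breaks its link to the labels."""
    baseline = error_rate(model, rows, labels)
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:column] + (v,) + r[column + 1:] for r, v in zip(rows, shuffled)]
    return error_rate(model, permuted, labels) - baseline

# Made-up customers: (tenure in years, number of complaints, shoe size).
rng = random.Random(0)  # fixed seed so the sketch is reproducible
rows = [(rng.uniform(0, 10), rng.randint(0, 5), rng.uniform(36, 46)) for _ in range(200)]
labels = [black_box(r) for r in rows]

names = ["tenure", "complaints", "shoe_size"]
importances = {
    name: permutation_importance(black_box, rows, labels, i, rng)
    for i, name in enumerate(names)
}
for name, value in importances.items():
    print(f"{name:>10}: {value:.3f}")
```

Because this stand-in model never looks at shoe size, shuffling that column changes nothing and its importance is exactly zero, while the two features the model does use show a clear increase in error. The same idea applies to a real model without needing access to its internals.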

 

A glimpse into the future

Over the last few years we have seen a decrease in the use of both SAS and SPSS, and an increase in the use of R and especially Python. The communities behind this open source tooling make the languages so strong that the communities grow even bigger. This makes the number of possibilities grow exponentially, which, combined with developments around Deep Learning, gives R and Python a competitive edge to which SAS and SPSS so far have no answer.

For the coming years I think that Deep Learning will 1) continue to grow stronger through improvements in infrastructure, layers and increasing processing power, and 2) become more accessible through packages like TensorFlow and Keras. Because of this, I will use Python more and more on a daily basis.
I also think that Notebooks will play an increasingly important part in the near future. Notebooks are already used frequently in Python with Jupyter, but I also expect them to play a bigger role in R. The combination of code and markdown makes them ideally suited for Data Scientists to inform their fellow Data Scientists about the choices they made.
