Why learn Python for Data Science, and how?

18 September 2017

Python is occupying a larger and larger footprint in the interesting world of Data Science. According to KDnuggets, Python has even overtaken R at the top of the leaderboard of AI and Machine Learning platforms. But what is it that makes Python so great for Data Science? One of the reasons is that Python is a general-purpose programming language. It is not tied to one specific purpose, so a model written in Python can be used directly in a wider context, for example as part of a larger application. R, by contrast, is not a general-purpose language: its chief purpose in life is statistical modelling. Another reason Python is becoming so popular is that it is an easy language to learn, so even non-programmers (like me) can pick it up.

So is it better to only learn Python and to no longer bother with R? No! Python complements R. In a previous article we made a comparison between R and Python (along with SAS and SPSS). R is effective in both explanatory and predictive models, whereas Python’s focus is really on the predictive side: Machine Learning, Data Mining and Artificial Intelligence applications tend to be written in Python rather than R. Tensorflow, the Deep Learning framework designed by Google, is also used primarily from Python. So the fact that Python is overtaking R in Machine Learning doesn’t mean that Python will replace R in all models. In particular, R is still much bigger than Python for explanatory models.


When should I use Python?

When an algorithm is part of a larger whole, and when its results need to be used immediately for a follow-up action, I favour Python. For example, we are currently working on a coffee machine with facial recognition, and we are building that in Python, because once the algorithm has recognised the person, the machine has to take action immediately by making the coffee. Python is also being used for self-driving cars: when an algorithm spots a bend to the left or a person crossing the road, the car has to act instantaneously.
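To make the idea of "prediction followed directly by action" concrete, here is a minimal sketch. It is not our actual coffee machine code: the face vectors, the names, the threshold and the functions `recognise` and `handle_visitor` are all made up for illustration, and real facial recognition would use a trained model instead of two-number vectors.

```python
import math

# Hypothetical data: tiny made-up "face vectors" and coffee preferences.
KNOWN_FACES = {"alice": (0.1, 0.9), "bob": (0.8, 0.2)}
PREFERENCES = {"alice": "espresso", "bob": "cappuccino"}


def recognise(face_vector, threshold=0.3):
    """Return the closest known person, or None if nobody is close enough."""
    name, distance = min(
        ((n, math.dist(face_vector, v)) for n, v in KNOWN_FACES.items()),
        key=lambda pair: pair[1],
    )
    return name if distance <= threshold else None


def handle_visitor(face_vector):
    """Turn the model output straight into an action for the machine."""
    person = recognise(face_vector)
    if person is None:
        return "ask for order"
    return "brew " + PREFERENCES[person]


print(handle_visitor((0.15, 0.85)))  # close to alice's vector
print(handle_visitor((0.5, 0.5)))    # not close enough to anyone
```

The point of the sketch is that the "model" and the follow-up action live in the same program, so no hand-over between tools is needed.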


Python 2.7 vs. Python 3

When Guido van Rossum was developing Python 3, which was released back in 2008, he decided to focus more on tidying up the language than on backwards compatibility (i.e. old code continuing to work in newer versions). As a result, there are two versions of Python doing the rounds now: 2.7 and 3.5.

Surely you just always use the newest one? In this case, not necessarily. Most packages are still written for Python 2.7, so that’s the one you can still do the most with. However, more and more is becoming available for Python 3, and some newer packages favour it: Tensorflow on Windows, for example, is available for Python 3 only. Still, Python 3 has yet to take over from Python 2.7 completely.

Personally I still mostly use Python 2.7, since all my old code is written in it. I only use Python 3 when I need to train Deep Learning models, because Tensorflow (on Windows, at least) is only available for Python 3. You can run both versions alongside each other on your computer. There are also tools to convert code from Python 2 to Python 3 and vice versa.
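To give a feel for what actually changed between the two versions, here is a small sketch of two well-known differences: division and text handling. It runs on Python 3; the comments describe what Python 2.7 would do instead. The function name `describe_division` is just an illustration. Conversion tools such as 2to3 take care of many of these changes mechanically, but it helps to know what to look out for.

```python
# Runs on Python 3; the comments note what Python 2.7 would do differently.

def describe_division(a, b):
    """Return the results of true division and floor division."""
    return a / b, a // b


true_div, floor_div = describe_division(7, 2)
print("7 / 2  =", true_div)   # Python 3 gives 3.5; Python 2.7 would give 3 for two integers
print("7 // 2 =", floor_div)  # floor division gives 3 in both versions

# In Python 3, str is Unicode text and bytes is a separate type.
# In Python 2.7, a plain "..." literal is a byte string.
text = "café"
encoded = text.encode("utf-8")
print(type(text).__name__, len(text))        # 4 characters
print(type(encoded).__name__, len(encoded))  # 5 bytes, because é takes two bytes in UTF-8
```

Note also that `print` is a function in Python 3, whereas in Python 2.7 it is a statement, which is usually the first thing you run into when you try old code in the new version.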


How do you learn Python?

The popularity of this language is also reflected in the sheer number of courses available now. Coursera now features 115 Python courses and specialisations, many universities and websites offer online courses, and a whole host of books have been published. It is hard to see the wood for the trees. So where to start?

Learning a programming language can be compared to learning a foreign language. You can study it, but the real way to master it is through lots of practice. Therefore we recommend starting with a course from one of these two providers: Codecademy or Udemy. Their courses start off at an easily accessible level and have a logical structure. Once you’ve finished one of them, start using Python right away. Play around with tutorials. Start building things: for once, build a model in Python rather than in the tool you are used to. Or you could build a webscraper that automatically searches for the cheapest tickets with the shortest travel times.
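As an illustration of the webscraper idea, here is a minimal sketch that uses only Python's standard library. Everything about the page is an assumption: the HTML snippet, the `ticket` class and the `data-price` and `data-minutes` attributes are invented for this example, and a real scraper would first download the page and would have to follow the structure of the actual website.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real site will be structured differently.
SAMPLE_HTML = """
<ul>
  <li class="ticket" data-price="89.50" data-minutes="135">Airline A</li>
  <li class="ticket" data-price="74.00" data-minutes="190">Airline B</li>
  <li class="ticket" data-price="74.00" data-minutes="160">Airline C</li>
</ul>
"""


class TicketParser(HTMLParser):
    """Collect price, travel time and name for every ticket list item."""

    def __init__(self):
        super().__init__()
        self.tickets = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "li" and attributes.get("class") == "ticket":
            self._current = {
                "name": "",
                "price": float(attributes["data-price"]),
                "minutes": int(attributes["data-minutes"]),
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["name"] += data.strip()

    def handle_endtag(self, tag):
        if tag == "li" and self._current is not None:
            self.tickets.append(self._current)
            self._current = None


def cheapest_ticket(html):
    """Return the cheapest ticket; ties are broken by shortest travel time."""
    parser = TicketParser()
    parser.feed(html)
    parser.close()
    return min(parser.tickets, key=lambda t: (t["price"], t["minutes"]))


best = cheapest_ticket(SAMPLE_HTML)
print(best["name"], best["price"], best["minutes"])
```

Once something like this works on a saved page, packages written specifically for scraping make it much easier to deal with real websites.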

As you’re building, keep Google open alongside your code. 99.9% of the things you will want to know will have already been asked by other people. There’s often good advice on websites such as Stack Overflow. In addition, these two books are very useful points of reference: Python Data Science Handbook by Jake VanderPlas and Python for Data Analysis by Wes McKinney. Both books explain everything clearly and simply, with lots of example code.

Beyond these start-up guides there are also some other tips we can offer:

  • Install Anaconda instead of the usual Python installers. Anaconda comes with lots of packages pre-installed, which makes it easy to get started.
  • Use Jupyter Notebooks (which are installed along with Anaconda) instead of the plain Python interpreter. The advantage of notebooks is that you can work interactively.
  • Useful packages that you really should install anyway if you are going to work on Machine Learning are:
    • General must-haves: numpy, scipy, pandas
    • Machine Learning: statsmodels, scikit-learn
    • Deep Learning: Tensorflow, Theano, Keras
    • Visualisation: matplotlib, plotly, ggplot
    • Webscraping: Selenium, Beautiful Soup, scrapy
    • Text Mining/NLP: NLTK, Gensim
    • Making User Interfaces: Flask (web), PyGame, Tkinter
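If you want to see which of these packages are already available in your environment, a small check like the one below can help. It is only a sketch: the grouping mirrors the list above, the helper names `is_installed` and `missing_packages` are made up, and note that the name you import is not always the name you install (scikit-learn is imported as `sklearn`, Beautiful Soup as `bs4`).

```python
from importlib.util import find_spec

# Import names per category; some differ from the package name used to install.
PACKAGES = {
    "General must-haves": ["numpy", "scipy", "pandas"],
    "Machine Learning": ["statsmodels", "sklearn"],
    "Deep Learning": ["tensorflow", "theano", "keras"],
    "Visualisation": ["matplotlib", "plotly", "ggplot"],
    "Webscraping": ["selenium", "bs4", "scrapy"],
    "Text Mining/NLP": ["nltk", "gensim"],
    "User Interfaces": ["flask", "pygame", "tkinter"],
}


def is_installed(module_name):
    """Return True if the module can be located, without importing it."""
    try:
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


def missing_packages(groups=PACKAGES):
    """Map each category to the modules that could not be found."""
    return {
        group: [name for name in names if not is_installed(name)]
        for group, names in groups.items()
    }


for group, missing in missing_packages().items():
    status = "all present" if not missing else "missing: " + ", ".join(missing)
    print(group + ": " + status)
```

With Anaconda, many of these will already be present; the rest can be added with the package manager of your choice.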

Have we missed out any interesting resources here that are perfect for learning Python for Data Science? Please let us know!
