25 September 2017
Dataiku has been on the map since 2017, partly thanks to Gartner’s Magic Quadrant for Data Science Platforms. And not by halves: along with IBM, Dataiku got the highest score of all platforms on the “Completeness of Vision” axis. That sounds like a promising basis for taking the crown from the top 4 platforms (IBM, SAS, RapidMiner and KNIME). In one of our previous blogs, IBM (SPSS) and SAS were put to the test against the open source tools R and Python, and in another blog we put RapidMiner and KNIME in a “head to head” comparison. In this blog I want to talk you through my experiences with Dataiku: what actually makes it so “visionary”? And what about its ease of use for data scientists and data analysts?
I have been working on a project with Dataiku since early 2016. At the time, the client in question was one of the first organisations in the Netherlands to start working with Dataiku. Since then, the data analysts, data scientists and data engineers in the organisation have been using Dataiku more and more to analyse and improve business operations with data. Along the way, I have been able to use Dataiku for a variety of purposes across the analysis and data science workflow. In this blog I will first give a brief introduction to the Data Science Studio and then talk you through my personal experience with Dataiku.
First of all, some general information. This is necessary because you may have heard the name in passing and rushed to Google and searched for “dataiq” or “data iq” (like I did!) and wondered what it was really all about. Well, about [Daa-taa-ie-kuuuuu] actually…
Dataiku was founded in France in 2013 by its four “fathers”, namely Florian Douetteau, Clément Stenac, Marc Batty and Thomas Cabrol. Dataiku has been moving at a relentless pace ever since: its headquarters are now in New York and the organisation has grown considerably, approaching the 100-employee mark. With the help of a group of venture capitalists, Dataiku has succeeded in developing its Data Science Studio (DSS) and is on the cusp of further new developments.
Why does Gartner call Dataiku’s Data Science Studio visionary? In their own words: “…Its placement as a Visionary is due to the innovative nature of the DSS, especially its openness and ability to cater to different skill levels, which enables better collaboration. Dataiku’s Ability to Execute suffers from limited user adoption and deficiencies in its data access and exploration capabilities.”
Do you want to build a more complete picture of the possibilities of DSS for yourself? Then you should first of all take a look for yourself – although Dataiku is proprietary and therefore not an open source project, they do offer a free (limited) version which also allows you to snoop around the full Enterprise Edition for two weeks. It will become clear to you in no time that Dataiku offers a wide range of functions. It is far too extensive to describe exhaustively in this blog, and I’m not even going to try. What I do want to do is share with my peers what I think are the most important moments of joy and frustration when using Dataiku. That way, when you have to choose whether or not to switch to Dataiku, you will already have a better idea of what to expect.
How does Dataiku DSS “work”? Dataiku can be installed on an internal server or be run in the cloud, via Amazon Web Services or Microsoft Azure. You access DSS through your browser. In a project in DSS, you work mostly from the Flow (see screenshot below). The Flow brings together data sources, data processes and the relationships between them. A dataset (blue icons) can be processed with visually-supported data processes (visual recipes – yellow icons) or with pieces of code in various languages (code recipes – orange icons). Statistical models are indicated by green icons. In addition to the Flow, DSS also offers the possibility to analyse and explore the data sources in the Flow with notebooks (for those who are familiar with Jupyter/IPython notebooks: that’s exactly what we mean! Notebooks for Python, R, SQL, Hive, Impala and Scala (Spark)!). You can also explore and analyse visually; analyses allow you to slice and dice and of course to unleash a whole range of machine learning algorithms on the data. Once you have arrived at a workflow you want to maintain, you can schedule it with a Scenario so that it runs at the specified frequency. You can also set up monitoring and dashboarding for it. So that was the ultra-brief intro to the most important Dataiku lingo and functionality; take a look here for more detailed introductions.
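To make the idea of a Flow a little more concrete: you can think of it as a directed graph in which every recipe depends on its input datasets and every output dataset depends on the recipe that produces it. The sketch below is not Dataiku's API; it is plain Python with made-up dataset and recipe names, and only illustrates how such a graph determines a valid build order, which is roughly what has to be respected when a scheduled run rebuilds a flow.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical flow: each key is a node (dataset or recipe) and the value is
# the set of nodes it depends on. All names are invented for illustration.
flow = {
    "prepare_orders": {"orders_raw"},                  # visual recipe
    "orders_clean": {"prepare_orders"},                # dataset
    "join_customers": {"orders_clean", "customers_raw"},  # recipe with two inputs
    "orders_enriched": {"join_customers"},             # final dataset
}

# A topological sort yields an order in which every node comes after
# everything it depends on.
build_order = list(TopologicalSorter(flow).static_order())

for step, node in enumerate(build_order, start=1):
    print(step, node)
```

The exact position of independent nodes (for example the two raw datasets) may vary, but every recipe is always listed after its inputs.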
I would like to talk you through a few of the positives and negatives that I encountered in day-to-day practice. Let’s start with the positives:
Connectivity with a lot of different types of data sources
The scope for connecting Dataiku to various different data sources with a host of data types is just staggering. This is especially true when you consider how easy it is to combine multiple sources. Whether you are uploading local files or folders, reading via FTP, from a database (Oracle, MySQL, SQL Server, Redshift, PostgreSQL, you name it), from cloud storage such as S3 or from NoSQL stores (Elasticsearch, etc.), it takes just a few clicks (and of course a few settings too). Once you’ve got the data in your Flow, you can let rip on it with both visual and code recipes; wherever possible, data processes are carried out where they are most efficient: locally, in-database,…
Flexibility in processing: visual and in code, exploring and maintaining
What I really like about Dataiku and what makes it accessible to a wide audience is the degree of freedom it offers in data processing and analysis – visually supported or heavily code-based or a combination of the two. In DSS, you are totally free to process data with a visually-powerful interface, which makes a wide range of data processes possible whilst also keeping all the data processes clear and orderly: filtering, combining, aggregating, transforming and modelling datasets. Users of tools such as SPSS Modeler, SAS Enterprise Guide/Miner, RapidMiner and KNIME will feel at home here. Personally I find the transformation options contained in the Prepare Recipe especially impressive. It offers a huge number of possibilities to turn textual data, web data, geographical data and time-related characteristics, etc. into the characteristics you want to extract from them. Dataiku have invested heavily to achieve a powerful, user-friendly user interface for “data wrangling”, i.e. moulding the data, and to make it available to a broad audience.
However, if you prefer to do processes in code, you can do so just as easily. R, Python, SQL, Hive, Pig, Spark SQL, PySpark,… they’re all available for you to use – if it matches the source – instead of a visual recipe. This also means that you have direct access to all machine learning libraries from R, Python, Spark R, PySpark, … for use in DSS.
And if you don’t quite know exactly what you want to do with the data because you want to explore it a bit more first, there are Jupyter-style notebooks available with which to explore a dataset from your flow – once again, in multiple programming languages. And for those who are not quite so code-orientated, there is the lab environment which uses visual tools to slice & dice data and build models. Once you do know what you want to do in the flow, you can turn your work into a recipe that becomes part of your overall flow. By combining the possibilities for exploring (notebook/lab) and safeguarding (recipes), you are rarely tempted to work anywhere other than in Dataiku, which is a great benefit, especially when you are collaborating on a data science project. DSS offers terrific opportunities to collaborate with others within the same project; when doing this, it is useful to have everything in view in one place.
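To give an impression of what a Python code recipe amounts to: it reads one or more input datasets, transforms them and writes an output dataset. The sketch below keeps the transformation in plain Python so that it runs anywhere; the column names (`customer_id`, `amount`) and dataset names (`orders`, `orders_by_customer`) are made up for illustration. The commented-out lines show roughly how the same logic would be wrapped with the `dataiku` package, which is only available when the code runs inside DSS.

```python
from collections import defaultdict


def total_per_customer(rows):
    """Sum the 'amount' column per 'customer_id' (hypothetical column names)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer_id"]] += float(row["amount"])
    # Return one output row per customer, sorted for a stable result.
    return [
        {"customer_id": customer, "total_amount": total}
        for customer, total in sorted(totals.items())
    ]


# Inside DSS, a recipe would typically read and write datasets roughly like
# this (dataset names are assumptions, and pandas would do the aggregation):
#
#   import dataiku
#   orders = dataiku.Dataset("orders").get_dataframe()
#   result_df = orders.groupby("customer_id", as_index=False)["amount"].sum()
#   dataiku.Dataset("orders_by_customer").write_with_schema(result_df)

sample_rows = [
    {"customer_id": "A", "amount": "10.0"},
    {"customer_id": "B", "amount": "2.0"},
    {"customer_id": "A", "amount": "5.5"},
]
result = total_per_customer(sample_rows)
print(result)
```

Keeping the transformation in a separate function like this also makes it easy to try out in a notebook first and move into a recipe later.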
Negatives: Clarity of Flows
Is it all sunshine and rainbows? Of course not – it is still just a tool, and every tool has its limitations and frustrations. Little things, when you encounter them often enough, can become very annoying very quickly. One of my greatest frustrations with Dataiku is that every time you change something in your flow (e.g. by adding a data source or a new recipe or a model), Dataiku changes the whole organisation of the flow! Optimisation is what Dataiku would call this. The flow is then “optimally” visually restructured from scratch, with lines connecting the nodes, as few overlaps as possible and data processes running from left to right. However, if you have quite a lot of nodes in your flow and you change something in the bottom left, it can happen that your node then suddenly moves to the bottom right. Or the top right. Or into the middle. All of which is extremely disorientating. DSS is extraordinarily good at clearly displaying all the different transformations you are doing within a recipe. The overall layout of the flow, by contrast, is confusing because of the way it constantly skips around.
Flexibility in editing Flows
The second major frustration is that when you want to edit something somewhere towards the beginning of your flow – say you have a slightly different data source that you want to use after all, or you want to add in an extra filter recipe – you end up spending quite some time rebuilding your flow. Dataiku is less good at this than, for example, SPSS Modeler or other flow-based packages where you can add in an extra node or replace a source without any trouble. In Dataiku I think it currently takes far too long to rebuild your flow – sometimes almost from scratch.
Data visualisation options
The third limitation I found is that the possibilities Dataiku DSS offers for making appealing visualisations are still terribly limited. The interface for making your own aesthetically pleasing plots and overviews is simply inadequate for putting together a dashboard that will impress the business. Fortunately, you can easily move prepared data to a place where you can do that, or you can get quite far with notebooks to achieve the visualisations you want in R or Python. However, I think this is a very significant challenge for Dataiku if it wants to compete with the Tableaus and QlikViews of this world, for example.
Overall, I have been absolutely delighted by Dataiku over the last year. I believe its greatest strength lies in its flexibility in connecting to sources, the scalable possibilities it offers for processing, analysing and safeguarding data in processes and the choice it consistently gives you between coding and clicking. For a data science project, you don’t need to leave DSS at all from beginning to end, and you can safeguard the process well, share it with colleagues and also develop it collaboratively (even at the same time). The greatest disadvantages or frustrations I found in Dataiku are the lack of control over the layout of the flow, the difficulty of editing an existing flow and the limited possibilities (at least at this point in time) Dataiku offers for data visualisation.
If you and your organisation are currently at the point of considering a data science workflow package, Dataiku is certainly a strong candidate to consider. Depending on your organisation’s requirements, it can resolve several challenges at the same time. Even though Dataiku has been on the market for a relatively short length of time, it is a fully-developed and fully-fledged package that can be used by a wide audience. Not for business users – I think the claim that this can meet their data and analysis needs is going a bit too far – but it is certainly viable for data analysts, data scientists, data engineers and more data professionals besides. If you would like to find out about more user experiences or have any specific questions, please feel free to get in touch with me.
Do you want to know more about this subject? Please contact Jurriaan Nagelkerke using the details below