During a break at an industry conference a while back, I unintentionally overheard a conversation.
“Have you guys got a data lake yet?” “Erm, what do you mean?” “Well, just that, a data lake…” “And what do you mean by that…?”
I immediately sympathised with the person asking that last question, even more so when the answer – as far as I could tell – was rather similar to the definition we have used for many years for a data warehouse: a large collection of data that has been created through the conduct of business operations and that can be used to generate management information.
Investigating the topic of data lakes is a funny old exercise and before long you’re bound to conclude that it’s just another new hype. When you search for “data lake”, you soon come across the term “data sewer” too. Then, again just as easily, you end up with the concept of the “data swamp”, and everywhere you look you find the somewhat poetic parallel of the water bottle (i.e. the data warehouse) and the overflowing lake (i.e. the data lake). And of course, you guessed it: the term “data lake” comes from the IT world, as does the term “big data”. Oh yes, surprise surprise, both these concepts are indeed related.
The thing you’ll find all definitions of the data lake have in common is that it concerns a collection of data from various different sources. The data in question is in fact all data: structured and unstructured, transactional to clicks, audio and visual, moving or stationary. Hence the link to big data too – which is actually what gave rise to the need for the data lake.
The intended users of data lakes are, first and foremost, data scientists. Why? Because creating a data lake involves no modelling for the transformation or interpretation of data, and no creation of a business model. This is unlike a data warehouse, where business rules have to ensure that the source data is collected accurately and consistently.
The data definitions that apply in the lake are the definitions that applied in the source systems. The lake is therefore more like a sandpit (i.e. an environment kept separate from the other data and systems) from which you can mine data to your heart’s content without worrying about any predetermined definitions or interpretations.
The data in the lake is therefore kept in its original state, whatever its format or structure, and is only transformed when someone needs it. The logical consequence is that the preparation work shifts to the moment of use: you will spend more time preparing the data each time you want to create a set for analysis.
This is a significant difference from the data warehouse. Firstly, the structure and model of a data warehouse are a reflection of the business model of the organisation. Secondly, definitions are set for all the data elements before they enter the warehouse. In this sense, the data warehouse serves much more as the organisation’s “memory” than the data lake, from which data can also disappear if it turns out not to be useful.
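To make the contrast concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular product works: the sample records, the `Sale` definition and the function names are all made up for this example. The "warehouse" applies the agreed definition before anything gets in, while the "lake" stores everything untouched and only interprets it when someone asks.

```python
import json
from dataclasses import dataclass
from datetime import date

# Hypothetical raw records, as they might arrive from different source systems.
RAW_EVENTS = [
    '{"customer": "C-001", "amount": "19.95", "day": "2018-07-17"}',
    '{"customer": "C-002", "clicks": 14}',      # a click record, no amount
    'free-text note from a call centre agent',  # unstructured text
]


@dataclass(frozen=True)
class Sale:
    """An assumed, pre-agreed business definition of a sale."""
    customer: str
    amount: float
    day: date


def parse_sale(raw: str) -> Sale:
    """Apply the definition; raise ValueError if the record does not fit it."""
    try:
        record = json.loads(raw)
        return Sale(
            customer=record["customer"],
            amount=float(record["amount"]),
            day=date.fromisoformat(record["day"]),
        )
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"not a valid sale: {raw!r}") from exc


def load_warehouse(raw_events):
    """Definitions first: only records that satisfy the model are stored."""
    accepted, rejected = [], []
    for raw in raw_events:
        try:
            accepted.append(parse_sale(raw))
        except ValueError:
            rejected.append(raw)
    return accepted, rejected


def load_lake(raw_events):
    """Store everything in its original state, without interpretation."""
    return list(raw_events)


def sales_from_lake(lake):
    """Preparation happens only now, when an analyst asks for sales."""
    sales = []
    for raw in lake:
        try:
            sales.append(parse_sale(raw))
        except ValueError:
            continue  # the record stays in the lake for other questions
    return sales


warehouse, rejected = load_warehouse(RAW_EVENTS)
lake = load_lake(RAW_EVENTS)
print(len(warehouse), len(rejected), len(lake), len(sales_from_lake(lake)))
```

In this toy version the warehouse ends up holding one clean sale and turns the other two records away, whereas the lake keeps all three and does the parsing work each time a question is asked. The click record and the free-text note remain available in the lake for a different analysis later on.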
So for a data scientist, the data warehouse is actually restrictive. After all, the data you can access comes mainly from the primary and secondary processes of the organisation. Log, click, social media, and 2nd and 3rd party data therefore fall by the wayside: if the data warehouse is your only source, you still can’t access them for analysis purposes.
I don’t think there is an obvious choice between the one and the other. The data warehouse and the data lake have quite different applications and are therefore quite different things. If you are a data scientist and want quick access to a flexible analysis environment with diverse sources, including sources from outside the organisation, then the data lake is a suitable solution. For reliable management information, however, the data warehouse remains the appropriate source.
Do you want to know more about this subject? Please contact Kees Groenewoud.