Within Cmotions it is common that groups of colleagues work together on projects about topics of their interest in order to strengthen their expertise on those topics and to collaborate and learn from each other. These groups are called CIAs (Cmotions Interest Areas) and one of these groups has recently worked on a project about a meal delivery service.
Sometimes we come across data that inspires questions. As frequent customers of meal delivery services, we wondered about the questions we could answer if we had access to all the public data of one of these service providers. How many new restaurants were added to the website during covid? What would be the perfect place for a new sushi restaurant? And, to do ourselves a favor, where can we buy the best rated burger? In order to find answers to these questions, we needed to take all steps included in the process of working with data: from gathering the data, storing it, building models and visualizing the data.
This article will deep dive into the process around storage of the data in order to achieve a user-friendly database. Sounds boring? Not at all! Read which steps we took to create our own meal delivery database.
As Data Analysts and Data Scientists, we generally work with data of relatively good quality from data marts or well-managed databases. However, this is not a given, so we wanted to explore which how to build pipelines ourselves. We will give more insights into the steps that leads from raw data to a ready to use data mart or database, such as mining source data, data transformation and creating a data mart. Furthermore, we will pay special attention to explaining the concept of Slowly Changing Dimensions.
A historic database with website data
We created a database in which scraped data from the website of the meal delivery service is collected. We make use of so-called Slowly Changing Dimensions, which enables us to track changes for each restaurant over time. With this method we can, for instance, see which new restaurants were added over time, which restaurants disappeared, what changed on the menu or sudden drops or increases in review ratings. Later in this article, we will explain Slowly Changing Dimensions in more detail.
As the overall quality of the data in our database is high and since we have used a smart data model, we are able to turn the data into value without much effort. More concrete, it makes it relatively easy to train our prediction models, build dashboards and do all types of fun analyses on our data set.
Fully in the Cloud
The data warehouse and all the tools we use are fully cloud-based. We work with two of the most widely used tools from Microsoft’s Azure suite: Azure Data Studio and Azure Data Factory.
Azure Data Factory offers a low-code way to visually implement all data transformation steps. This makes it an accessible tool for less technically educated people or analysts who prefer working using limited code. Azure Data Studio offers a SQL-environment in which we can connect to the database, comparable to SQL Server Management Studio.
A Journey through the layers of our Data warehouse
A data warehouse is a database in which data is processed and molded into a more usable form. This transformation of the data contributes to the user friendliness of the data. More specifically, it helps in answering recurring questions. Usually, this is achieved by processing the data through different ‘layers’. This way of processing allows to trace back how the raw data is transformed. In addition, it is common to organize the data according to a data model, so that the data is stored in the most efficient way for future use.
To go from a .json file to a fully working data model requires some data transformation steps inside the data warehouse. The image below shows the different steps or layers in the ETL-process. In our case the layers of the ETL-process involved the following steps:
- Staging (STG): in this layer we load the raw data from the scraper into the data warehouse.
- Raw Data Store (RDS): this layer keeps track of all changes to the data for each restaurant.
- Integrated Data Store (IDS): only unique and active rows of a restaurant are selected and obsolete data is not. This is a first step to data cleaning.
- Standard Data Model (SDM): used to create a data model out of all the data of previous steps (see image below).
- Data Mart (DM): contains the input data from the SDM, but it can also be used to add derived columns which are necessary for analyses or dashboards.
How we use Hash Algorithms for comparing differences in data
Hash algorithms help us to monitor changes at restaurants. A Hash algorithm converts certain input into a string with a fixed amount of characters. The input can be literally anything: 1 word, multiple words, or a full text. Whatever you use as input, the algorithm will always convert it into a string (or hash) of the same length. The length of this string depends on the algorithm you use.
As long as the input is exactly the same, the algorithm will always return the exact same hash. But if anything changes in your input, the returned hash will be completely different. See the table below for a simple example. If we change the input from ‘hallo’ to ‘Hallo’, the hash is a completely different string.
The same principle can be applied to the data in our data model. We concatenate (‘merge’) all columns of one restaurant and use the algorithm to create a hash value. Whenever new data about a restaurant is provided from the scraper, we hash this new input as well. Subsequently, we compare the two hashes (so the old and new data). Are they the same? Then we can keep the old record and ignore the new restaurant info. Are the hashes not the same? Then the newest record becomes the most recent and we archive the old record.
|R123456, Restaurant de Pauw, Bunnik||753692ec36adb4c794c973945eb2a99c1649703ea6f76bf259abb4fb838e013e|
|R123456, Restaurant de Pauw, Amersfoort||e647ce62d0b158fd76f297c16727de4074495de17e096945280f996262a3e048|
By saving the old record we can follow the full history of a restaurant. Each record contains the effective start date and end date between which it was active. This principle is called Slowly Changing Dimensions, since records only ‘slowly’ change over time. To illustrate, restaurants are not likely to change their name or address every day.
|R123456||Restaurant de Pauw||Bunnik||2021-04-01||2021-10-01||0|
|R123456||Restaurant de Pauw||Amersfoort||2021-10-01||9999-12-31||1|
A result to be proud of
The end result is a data warehouse with ready-to-use data on which we can build analyses, models or dashboards. The data warehouse is built is such a way that new data can easily be added, and existing data can easily be archived. Thanks to Azure Data Factory, it is easy to see which transformations were applied and we can easily share the results with less tech savvy people. Moreover, the data model makes it easy to create new data marts.