Last week we introduced the concept of the «Data Lake» and the benefits it brings to the table. We left one question unanswered: what is a «Data Warehouse», a concept that sounds very similar to the «Data Lake»? The two are often mistaken for each other, but their approaches are very different.
A Data Warehouse allows you to store data, but in a very particular way: the information must be stored in a structured manner based on the user’s needs. Meanwhile, as we have seen, a Data Lake is a repository of raw data, where the data is stored just as it arrived until it is used.
Let’s look at the main elements that set a Data Lake apart from the Data Warehouse approach.
A Data Lake preserves all data
During the development of a Data Warehouse, time is spent analyzing the data sources, understanding the business processes and profiling the data. The result is a structured data model designed for reporting.
A large part of this process consists of deciding what data to store and what to leave out. Generally, if the data is not used, it can be excluded from the warehouse, both to simplify the data model and to conserve storage space.
In contrast, a Data Lake preserves all the data: not just the data that is currently in use, but also data that may be used, and even data that may never be used, on the chance that one day it will be.
The data is also kept indefinitely, so that we can go back to any point in time to do the analysis. This approach is possible because the hardware for a Data Lake is often very different from that used for a Data Warehouse, and scaling up to terabytes can be done inexpensively.
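To make the "go back in time" idea concrete, here is a minimal sketch in Python. It assumes an append-only, in-memory list standing in for the lake; the names `raw_events`, `ingest` and `as_of` are hypothetical and chosen only for illustration. A real Data Lake would sit on distributed storage rather than a Python list.

```python
from datetime import datetime

# Hypothetical stand-in for the lake: records are only ever appended,
# never updated or deleted, so older states remain available.
raw_events = []

def ingest(payload, arrived_at):
    """Append a record exactly as it arrived, tagged with its arrival time."""
    raw_events.append({"arrived_at": arrived_at, "payload": payload})

def as_of(cutoff):
    """Return every payload that had arrived on or before the cutoff time."""
    return [e["payload"] for e in raw_events if e["arrived_at"] <= cutoff]

ingest({"customer": "A", "plan": "basic"}, datetime(2023, 1, 10))
ingest({"customer": "A", "plan": "premium"}, datetime(2023, 6, 2))
ingest({"customer": "B", "plan": "basic"}, datetime(2023, 9, 15))

# Analyze the data as it looked on 1 March 2023.
snapshot = as_of(datetime(2023, 3, 1))
```

Because nothing is overwritten, the earlier "basic" record for customer A is still there to be analyzed, even though a later record supersedes it.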
A Data Lake supports all types of data
Data Warehouses are generally made up of data extracted from transactional systems along with quantitative metrics and the attributes that describe them. Non-traditional data sources such as web server logs, sensor data, social media activity, text, and images are largely ignored.
The Data Lake approach encompasses these non-traditional data types. In a Data Lake, we store all data regardless of source and structure. We keep it in its raw form and only transform it when we are ready to use it. This approach is known as «Schema on Read», as opposed to «Schema on Write», which is the approach used in the Data Warehouse.
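The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the records are JSON lines, the "warehouse" is a function that enforces a fixed set of columns, and the "lake" is a plain list. All names here (`WAREHOUSE_COLUMNS`, `load_into_warehouse`, `store_in_lake`, `read_with_schema`) are hypothetical.

```python
import json

# Schema on Write: the structure is fixed before loading.
WAREHOUSE_COLUMNS = ("order_id", "amount")

def load_into_warehouse(record):
    """Keep only the predefined columns; reject records that lack any of them."""
    missing = [c for c in WAREHOUSE_COLUMNS if c not in record]
    if missing:
        raise ValueError(f"record does not fit the schema, missing: {missing}")
    return {c: record[c] for c in WAREHOUSE_COLUMNS}

# Schema on Read: raw lines are stored untouched and interpreted at query time.
lake = []

def store_in_lake(raw_line):
    """Store the line as it arrived, with no validation or transformation."""
    lake.append(raw_line)

def read_with_schema(fields):
    """Parse each raw line and project the requested fields.

    Records that contain none of the requested fields are skipped.
    """
    rows = []
    for line in lake:
        record = json.loads(line)
        if any(f in record for f in fields):
            rows.append({f: record.get(f) for f in fields})
    return rows

store_in_lake('{"order_id": 1, "amount": 25.0, "device": "mobile"}')
store_in_lake('{"sensor": "t-17", "temperature": 21.4}')

# Two different "schemas" applied to the same raw data, decided at read time.
orders = read_with_schema(["order_id", "amount"])
devices = read_with_schema(["order_id", "device"])
```

Note that the `device` attribute would have been dropped by `load_into_warehouse`, and the sensor reading would have been rejected outright, while the lake keeps both and lets each reader decide what structure to apply.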
A Data Lake supports all users
In most organizations, 80% or more of the users are «operational». These users want to get their reports, see their KPIs or pull the same data set into a spreadsheet every day. The Data Warehouse is ideal for these users because it is structured, easy to use and understand, and designed to answer their questions.
The next 10% do more analysis on that data. They use the Data Warehouse, but they often go back to the source systems to get data that is not included. Their favorite tool is the spreadsheet and they create new reports that are often distributed throughout the organization.
Lastly, the other 10% do deep analysis. They mix many different types of data and may create entirely new data sources. They often ignore the Data Warehouse, because they are usually asked to go beyond its capabilities. These users include data scientists, who may use advanced analytical tools and techniques such as statistical analysis and predictive modeling.
The Data Lake approach supports all of these users equally. Data scientists can go to the Data Lake and work with the diverse set of data they need, while other users make use of more structured views of the data provided for their use.
A Data Lake adapts easily to change
One of the main complaints about Data Warehouses is how long it takes to change them. Considerable time is spent in advance during the development of the warehouse structure. A good warehouse design can be adapted to change, but due to the complexity of the data loading process and the work done to facilitate analysis and reporting, these changes will necessarily consume some developer resources and take some time.
Many business-related questions can’t wait for the Data Warehouse team to adapt its system to answer them. The growing need for faster responses is what has given rise to the concept of self-service business intelligence.
In a Data Lake, on the other hand, all data is stored raw and is always accessible to anyone who needs it, so users have the power to go beyond the warehouse structure, explore data in new ways and answer their questions at their own pace.
If the result of an exploration proves useful and there is a desire to repeat it, a more formal schema can be applied, and automation and reuse can be developed to extend the results to a wider audience. If the result turns out to be unhelpful, it can simply be discarded: no changes have been made to the data structures and no development resources have been consumed.
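As a rough sketch of that workflow in Python: an analyst first explores the raw data ad hoc, and only once the result proves useful is it given a formal schema and a reusable function. The sample log lines, the `PageError` type and the `page_errors` function are all hypothetical and serve only to illustrate the idea.

```python
import json
from dataclasses import dataclass

# Raw web server log lines, stored in the lake as they arrived (made-up sample).
raw_logs = [
    '{"path": "/pricing", "status": 200, "ms": 120}',
    '{"path": "/pricing", "status": 500, "ms": 900}',
    '{"path": "/home", "status": 200, "ms": 80}',
]

# Step 1: ad hoc exploration. Nothing is formalized and the raw data is untouched.
# If this turns out to be unhelpful, the line is simply deleted.
exploration = [r for r in map(json.loads, raw_logs) if r["status"] >= 500]

# Step 2: the exploration proved useful, so it gets a formal schema
# and a reusable function that others can call.
@dataclass(frozen=True)
class PageError:
    path: str
    status: int

def page_errors(lines):
    """Return a typed view of server errors (status 500 and above) from raw log lines."""
    records = (json.loads(line) for line in lines)
    return [PageError(r["path"], r["status"]) for r in records if r["status"] >= 500]

report = page_errors(raw_logs)
```

Either way, `raw_logs` itself is never modified, which is why discarding a failed exploration costs nothing.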
Interesting, right? Well, now we know what a Data Lake is and how it differs from a Data Warehouse. Next week we will look at examples of Data Lakes, as well as open-source options. We hope you will tune in!