Surely we have all heard of something called a «Data Lake» and its relationship with Big Data. But what exactly does it consist of?
Through this series of themed entries, we are going to break down what a Data Lake is, how it differs from a Data Warehouse, the types of Data Lakes, its relationship with the cloud, and how we support it in Onesait Platform.
What is a Data Lake and what is it for?
This is perhaps the most important question to start with. According to the consulting firm Gartner, we can define a Data Lake as:
A collection of storage instances of various data assets where these assets are stored and maintained as a replica of the structured or unstructured source format, in addition to the original data stores
Some examples of Data Lake technologies are Amazon S3, Apache Hadoop or Azure Data Lake. Surely some of those names sound familiar too.
For the time being, let's take another look at the concept of a Data Lake, since the definition given so far is not entirely clear.
The term «Data Lake» was coined by James Dixon, Chief Technology Officer of Pentaho, and refers to the particular nature of the system's data, in contrast with the clean, processed data stored by traditional storage systems such as Data Marts.
According to Dixon, if a Data Mart is a store of clean bottled water, packaged and structured for easy consumption, a Data Lake is a great mass of water in a more natural state. Its content comes from a source that fills the lake, and various users can come to examine it, dive in, or take samples.
Perhaps a more specific definition is that of Amazon Web Services, which defines it as follows:
A centralized repository that allows you to store all your structured and unstructured data at any scale. You can store the data as-is, without having to structure it first, and run different types of analytics, from dashboards and visualizations to big data processing, real-time analytics and machine learning, to make better decisions.
With this, I think we can already get an idea of what a Data Lake is, right?
Benefits that come with the use of a Data Lake
This is an important matter. A technology has to offer me something useful, or it is not worth adopting. Well, a Data Lake brings the following benefits:
- Centralization of disparate content sources: it allows all data to be centralized in one place, regardless of its origin, to be processed later.
- Decrease in preparation costs: the data is prepared «as needed», which avoids having to know in advance how it will be processed (a requirement of Data Warehouses); preparation is done only when it applies.
- Big Data processing: once extracted from their «information silos», sources can be combined, processed, normalized and enriched, allowing us to perform discovery, data exploration and analysis for decision-making. Thanks to Data Lakes, data scientists can access, prepare and analyze data faster and more accurately.
- Ubiquity: any authorized user can access and enrich the information from anywhere, which makes it easier for the organization to gather the information it needs to make decisions.
- Adaptability to change: this addresses one of the main complaints about Data Warehouses, namely how long it takes to change them.
- Security: a Data Lake applies security at data-access time, so that each user can only see the data they are authorized to see.
- Cost savings: Data Lakes normally run on clusters of commodity hardware and allow for horizontal scalability. This way, the capacity of the Data Lake can be increased as needed.
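The «prepare as needed» benefit above is often called «schema on read»: raw records are stored exactly as they arrive, and structure is applied only when someone consumes the data. A minimal sketch of that idea follows, using a local directory as a stand-in for an object store like Amazon S3; the file name and field names are made up for illustration.

```python
import json
import os
import tempfile

# Stand-in for a Data Lake landing zone (in practice, an object store such as S3).
lake_dir = tempfile.mkdtemp()
events_path = os.path.join(lake_dir, "events.jsonl")

# 1. Ingest: write heterogeneous source records as-is, without transforming them.
raw_events = [
    {"user": "ana", "action": "login", "ts": 1},
    {"user": "ben", "action": "purchase", "amount": 9.99, "ts": 2},  # extra field
    {"user": "ana", "ts": 3},  # missing field -- still stored untouched
]
with open(events_path, "w") as f:
    for event in raw_events:
        f.write(json.dumps(event) + "\n")

# 2. Consume: apply the schema this particular analysis needs, at read time.
def read_actions(path):
    """Project only the fields this analysis cares about."""
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            yield record.get("user"), record.get("action", "unknown")

actions = list(read_actions(events_path))
print(actions)  # [('ana', 'login'), ('ben', 'purchase'), ('ana', 'unknown')]
```

Note how the ingest step never rejects or reshapes a record; a Data Warehouse, by contrast, would require all three events to fit one agreed schema before loading.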
Just by taking a look, we can see that the benefits provided by Data Lakes are worth the effort of getting into such a complex technology.
When talking about the benefits, an interesting term came up: the «Data Warehouse». Next week we will talk about what it consists of and how it differs from a Data Lake. We hope you are looking forward to it!