Working with a Data Lake in Onesait Platform (part 3)

03/12/2021 LuisMi Gracia

Welcome to our third post on Data Lakes in Onesait Platform. We have not reached the Platform itself yet, but so far we have seen what a Data Lake is, what benefits it brings, and how it differs from a Data Warehouse.

Today we will look at the types of Data Lakes that exist, introduce Hadoop, and review some Open Source alternatives on the market.

Hadoop as Data Lake

A Data Lake is usually associated with Hadoop-oriented storage. In this scenario, an organization’s data is first loaded into the Hadoop platform, and then data mining and analytics tools are applied to the data that resides on the Hadoop cluster nodes.

At the core of Hadoop we find its storage layer, HDFS (Hadoop Distributed File System), which stores and replicates data across multiple servers. In addition, the Hadoop ecosystem includes several supplementary tools, such as Hive, Flume, Sqoop and Kafka, that help with data ingestion, preparation and extraction.
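To make the idea of "stores and replicates data across multiple servers" more concrete, here is a minimal sketch in Python. It assumes the usual HDFS defaults of 128 MB blocks and a replication factor of 3, and it uses a simple round-robin placement. That placement is a simplification for illustration only, not HDFS's real rack-aware policy, and the function and node names are hypothetical.

```python
MB = 1024 * 1024


def place_blocks(file_size, block_size, nodes, replication=3):
    """Split a file into blocks and assign each block to `replication` distinct nodes.

    Round-robin placement is an assumption made for this sketch;
    real HDFS also takes racks and node load into account.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        # Consecutive nodes (wrapping around) guarantee distinct replicas.
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement


# Hypothetical 4-node cluster storing a 300 MB file in 128 MB blocks.
layout = place_blocks(300 * MB, 128 * MB, ["node-a", "node-b", "node-c", "node-d"])
for block, replicas in layout.items():
    print(block, replicas)
```

With three copies of every block, losing any single node still leaves two copies of each block available on the rest of the cluster.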

Hadoop Data Lakes can be deployed on-premise or in the cloud using enterprise platforms such as Cloudera, Azure HDInsight, or Google Cloud Dataproc.

Strengths of a Data Lake on Hadoop

Even today, building a Data Lake on Hadoop is a popular option for the following reasons:

Greater familiarity among technical teams.
Open Source solutions, which make their implementation economical.
Many tools available for integration with Hadoop.
Easy to scale.
Data locality allows for faster computation.
Possibility of deploying it on-premise or as a service in the various clouds.

Hadoop today

Hadoop was once the dominant choice for Data Lakes, but in the changing world of technology there are other more modern approaches based on tools such as Spark or Presto.

Let’s look back to understand how things have changed. Hadoop emerged in the mid-2000s and became popular over the following years. In fact, because many companies opted for open source, most of the early Big Data and Data Lake projects of that time were based on Hadoop.

person wearing black leather shoes — *Photo by* Fallon Michael on Unsplash

Hadoop offered two main capabilities:

The Hadoop Distributed File System (HDFS) to persist data.
A processing framework (MapReduce) that allows all that data to be processed in parallel.
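Hadoop's parallel processing model is MapReduce. As a rough illustration of the idea, the classic word count can be sketched in plain Python. Everything runs in a single process here, whereas in Hadoop the map and reduce steps would be distributed across the cluster nodes that hold the data. The function names are our own and are not part of any Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # In a real cluster each line (or file block) would be mapped on a different node.
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["data lake data", "lake of data"])
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread over as many machines as needed.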

Increasingly, organizations began to want to work with all of their data and not just some of it. And as a result, Hadoop became popular for its ability to store and process new data sources, including log records, clickstreams, and machine and sensor-generated data.

In the 2000s, Hadoop made a lot of sense, as it allowed you to build on-premise clusters with commodity hardware to store and process this new data cheaply.

But Open Source continued to evolve and a new framework emerged: Apache Spark, optimized to work with data in memory rather than on disk. This makes algorithms running on Spark faster, but the data still needed to be persisted somewhere, so Spark was included in many Hadoop distributions. That worked, but with the rise of the cloud a better approach to persisting your data appeared: object storage.

In addition to this, with the purchase of Hortonworks by Cloudera (and of MapR by HPE), we can say that, in essence, free distributions of Hadoop no longer exist, and this means that alternative solutions are being sought in the Open Source world.

MinIO and Presto as Data Lake

As mentioned before, nowadays a Data Lake can be set up using Spark and an object store. Here we describe an interesting alternative to HDFS-based environments and the rest of the Hadoop ecosystem, built instead on MinIO and Presto.

green forest near lake and mountain under cloudy sky — *Photo by* dirk von loen-wagner on Unsplash

To be more specific, MinIO is a distributed object store that implements the AWS S3 API (we talked about it last Monday). MinIO can be deployed on-premise and in the cloud, and runs on top of Kubernetes. Its storage is based on objects, where each object is made up of three parts:

The data itself. The data can be anything you want to store, from a photo to a 400,000 page manual.
An expandable amount of metadata. The metadata is defined by whoever creates the object; it contains contextual information about what the data is, what it should be used for, its confidentiality, or anything else that is relevant to how the data should be used.
A global unique identifier. The identifier is an address that is given to the object, so that it can be found in a distributed system. In this way, data can be found without the need to know its physical location (which could exist in different parts of a data center or in different parts of the world).
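The three parts described above can be modelled with a small in-memory sketch in Python. This is not the MinIO API: the class names are hypothetical, and a random UUID stands in for the identifier, whereas S3-compatible stores such as MinIO address objects by bucket and key.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    data: bytes                                   # the data itself
    metadata: dict = field(default_factory=dict)  # free-form, defined by the creator
    object_id: str = field(default_factory=lambda: str(uuid.uuid4()))  # unique identifier


class InMemoryObjectStore:
    """Toy object store: lookups go through the identifier, never a physical path."""

    def __init__(self):
        self._objects = {}

    def put(self, data, **metadata):
        obj = StoredObject(data=data, metadata=metadata)
        self._objects[obj.object_id] = obj
        return obj.object_id

    def get(self, object_id):
        return self._objects[object_id]


store = InMemoryObjectStore()
oid = store.put(b"manual contents", content_type="application/pdf", confidentiality="internal")
obj = store.get(oid)
print(obj.object_id, obj.metadata)
```

The caller only ever holds the identifier, so the store is free to keep the bytes on any disk, server or data center it likes.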

If MinIO can replace HDFS as the storage layer of a Data Lake, we are still missing a Hive-style SQL query engine, and this is where Presto comes in.

Presto is an Open Source distributed SQL query engine, built in Java and designed to run interactive analytical queries against a large number of data sources (through connectors), handling data volumes ranging from gigabytes to petabytes.

It is also considered an ANSI-SQL query engine, allowing you to query and manipulate data in any connected data source with the same SQL statements, functions, and operators.

Therefore, we can use Presto in the Data Lake to query the data stored in MinIO. Additionally, Presto can run on top of Spark, allowing you to take advantage of Spark as the execution environment for Presto queries.
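As a hedged sketch of what querying the lake could look like from Python, the snippet below uses the presto-python-client package (imported as prestodb). The catalog name hive, the schema lake, the table sensor_readings, the host, the port and the user are all assumptions made for this example. It also assumes a Presto coordinator whose Hive catalog has already been configured to read from MinIO.

```python
# Hypothetical table: hive.lake.sensor_readings, stored as files in a MinIO bucket.
TOP_DEVICES_SQL = """
SELECT device_id, count(*) AS readings
FROM hive.lake.sensor_readings
GROUP BY device_id
ORDER BY readings DESC
LIMIT 10
""".strip()


def run_query(sql, host="localhost", port=8080, user="analyst"):
    """Run a SQL statement on a Presto coordinator and return all rows.

    Host, port, user, catalog and schema are placeholder values.
    """
    # Imported here so the sketch can be read and loaded without the package installed.
    import prestodb  # third-party: pip install presto-python-client

    conn = prestodb.dbapi.connect(
        host=host, port=port, user=user, catalog="hive", schema="lake"
    )
    cursor = conn.cursor()
    cursor.execute(sql)
    return cursor.fetchall()
```

Calling run_query(TOP_DEVICES_SQL) against a running cluster would return the ten devices with the most readings. The same SQL would work unchanged if the table lived in another connected data source.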

Advantages of this approach

This approach has numerous advantages over building a Data Lake on Hadoop:

Elasticity: The combination is more elastic than the typical Hadoop configuration. Adding and removing nodes in a Hadoop cluster is a complex process, whereas in this approach everything runs on top of Kubernetes, which allows us to scale easily.
Separate compute and storage: With Hadoop, if you want to add more storage, you add more nodes (with their compute), so you get more compute whether you need it or not. With the object storage architecture, if you need more compute you can add nodes to the Presto cluster and leave the storage as it is, so compute and storage are not only elastic, they are elastic independently.
Maintenance: Maintaining a stable and reliable Hadoop cluster is a complex task. For example, upgrading a cluster usually involves shutting it down, and rolling upgrades are complex.
Cost reduction: This architecture reduces the total cost of ownership, since MinIO requires hardly any management and object storage is cheaper.
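The point about scaling compute and storage independently can be checked with a little arithmetic. The following Python sketch uses made-up figures (500 TB of data, 64 cores of query capacity, nodes offering 20 TB and 16 cores each) purely for illustration.

```python
from math import ceil


def coupled_nodes(storage_tb, cores, tb_per_node, cores_per_node):
    """Hadoop-style sizing: every node brings both storage and compute,
    so the larger of the two requirements dictates the cluster size."""
    return max(ceil(storage_tb / tb_per_node), ceil(cores / cores_per_node))


def decoupled_nodes(storage_tb, cores, tb_per_storage_node, cores_per_compute_node):
    """Object storage plus query engine: each tier is sized on its own."""
    return ceil(storage_tb / tb_per_storage_node), ceil(cores / cores_per_compute_node)


# Hypothetical requirement: 500 TB of data and 64 cores of query capacity.
hadoop_nodes = coupled_nodes(500, 64, tb_per_node=20, cores_per_node=16)
storage_nodes, compute_nodes = decoupled_nodes(500, 64, 20, 16)

print(hadoop_nodes)                  # nodes in the coupled cluster
print(hadoop_nodes * 16)             # cores provisioned in the coupled cluster
print(storage_nodes, compute_nodes)  # storage tier and compute tier sizes
```

With these figures the coupled cluster needs 25 nodes and therefore provisions 400 cores to serve a 64-core workload, while the decoupled setup pairs 25 storage nodes with only 4 compute nodes.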

white biplane — *Photo by* Pascal Meier on Unsplash

As we can see, MinIO and Presto have enormous potential. So much so that we have incorporated both into Onesait Platform: MinIO, which we talked about last Monday, and Presto, which we will talk about next Monday.

Next week we will talk about Data Lakes in the cloud, and the advantages they have over on-premise deployments, the most common solution so far. We hope you tune in!

Header photo by Philipp Katzenberger on Unsplash