Monday March 16, 2020
A data lake is a collection of raw data in the form of blobs or files. It acts as a single store for all the data in an enterprise that can include raw source data, pictorial representations, charts, processed data and much more.
An advantage with the data lake is that it can contain different forms of data including structure data like a database including rows and columns, semi-structured data in the form of CSV, XML, JSON, etc.
It can also store unstructured data like PDF, emails and word documents along with images, and videos.
It is a store of all the data and information in an enterprise. The concept of the data lake is catching up fast due to the growing needs of data storage and analysis in all the domains.
Let us learn more data lakes.
What is Data Lake?
Are you wondering why it is called a data lake? For that, we need to understand what a data mart is. Datamart can be considered as a repository of summarized data for easy understanding and analysis.
Pentaho CTO James Dixon was the person who first used the term data lake. As per him, a data mart is like packaged and cleaned drinking water that is ready for consumption.
The source of this drinking water is the lake. Hence the term data lake. A storehouse of information from where the data mart can interpret and filter out the data as needed.
What is the importance of the data lake?
A data lake is a huge storage of raw data. This data can be used in infinite ways to help people in varied positions and roles.
Data is information and power, that can be used to arrive at inferences and help in decision making too.
Here are some major areas where data lakes are most helpful:
Advantage of a data lake
In this section let us list out some of the obvious reasons and advantages of having a single repository of data that we call a data lake.
All data in a data lake is stored in its raw format and is never deleted. A data lake can typically scale several terabytes of data in the original form.
What’s the architecture of a Data lake?
The above image is a pictorial representation of the architecture of the data lake.
ingestion tier – Contains data source. Data is fed to the data lake in batches and that too in real-time
Insights tier – Located on the right side, is the insights of the system.
HDFS – Spcialy built a cost-effective system for structured and unstructured data.
Distillation – Data will be retrieved from storage and will be converted to structured data
Processing – User queries will be run through analytical algorithms to generate structured data
Unified operations – system management, data management, monitoring, workflow management, etc.
Differences between data lake, database, and data warehouse
In the simplest form data lake contains structured and unstructured data while both database and data warehouse except pre-processed data only. Here are some differences between all of them.
|Data Lake||Data Warehouse|
|Stores everything||Stores only business-related data|
|Lesser control||Better control|
|Can be structured, unstructured and semi-structured||In tabular form and structure|
|Can be a data source to EDW||Compliments EDW|
|Can be used in analytics for the betterment of business||Mainly used for data retrieval|
|Used by scientists||Used by business professionals|
Data Lake Implementation
Data Lake is a heterogeneous collection of data from various sources. There are two parts to any successful implementation of a data lake.
The first part is the source of data. Since the lakes take all forms of data, the source need not have any restriction. This can be the company’s production data to be monitored, emails, reports, and more.
Another zone that can be added to this implementation is the data warehouse or a curated data source. This will contain a set of structured data ready for analysis and derivations.
Example for Data Lake
Think about this scenario where unstructured data that you have can be used for endless purposes and insights. However, possession of data late doesn’t implicate that you can load all the unwanted data. You don’t need data swamp right? The collected data must have a log called a catalog. Having data catalogs makes the data lake much more effective.
Examples of Data Lake based system include,
and many more.
Also Read: Wish to know about data-driven testing?
What is Snowflake? Is it a data lake?
Snowflake is a combination of a data warehouse and a data lake, taking in the benefits of both.
It is a cloud-based data warehouse that can be used as a service. It can be used as a data lake to give your organization unlimited storage of multiple relational data types in a single system at very reasonable rates.
This is like a modern data lake with advanced features. Being a cloud-based service, the use of snowflakes is catching up fast.
Data Lake solution from AWS
Amazon is one of the leading cloud service providers globally. With the advent and extensive use of the data lakes, they have also come up with their own data lake solution that will automatically configure the core AWS services that will help to simplify tagging, searching and implementing algorithms.
This solution includes a simple user console from where one can easily pull the data and analysis that one needs with ease.
Below is a pictorial representation of the data lake solution from Amazon.
Some of the main components of this solution include the data lake console and CLI, AWS Lambda, Amazon S3, AWS Glue, Amazon DynamoDB, Amazon CloudWatch, Amazon Athena among others.
Amazon’s simple storage service or Amazon S3 is a web-based object storage service launched by Amazon in March 2006.
It enables organizations of all sizes to store and protect their data at a low cost. As shown in the diagram above, it is a part of the data lake solution provided.
It is designed to provide 99.999999999% durability and is being used by companies across the globe to store their data.
With the growing needs to data and the requirement to make it centrally available for storage and analysis, data lakes fit the bill for most companies.
Newer technologies like Hadoop and big data facilitate the storage and assimilation of huge amounts of data centrally.
There are still some challenges with respect to data lakes and they are likely to be overcome soon, making data lakes the one-stop solution for the data needs of every organization.