What is a Data Lake?
A data lake is a system or repository that stores data in its natural format, usually as object blobs or files, so that data in different schemas and structural forms can be kept side by side. Its main purpose is to provide a single store for all data in the enterprise, ranging from raw data (an exact copy of source system data) to transformed data used for tasks such as reporting, visualization, analytics and machine learning. A data lake can hold structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video), creating a centralized data store that accommodates all forms of data.
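The split between raw and transformed data can be illustrated with a minimal sketch. It assumes a lake that is just a directory tree with hypothetical "raw" and "transformed" zones, and a hypothetical source file of JSON lines with "customer" and "amount" fields; none of these names come from a particular product.

```python
import json
import shutil
from pathlib import Path


def ingest_raw(lake_root, source_file, source_system):
    """Copy a file into the raw zone unchanged, keeping its native format.

    The zone layout (raw/<source_system>/<file name>) is an assumption
    made for this sketch.
    """
    source_file = Path(source_file)
    target_dir = Path(lake_root) / "raw" / source_system
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source_file.name
    shutil.copy2(source_file, target)  # exact copy, no schema imposed on write
    return target


def transform_orders(lake_root, raw_file):
    """Derive a transformed dataset (total amount per customer) from raw JSON lines."""
    totals = {}
    for line in Path(raw_file).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        customer = record["customer"]
        totals[customer] = totals.get(customer, 0.0) + record["amount"]

    target_dir = Path(lake_root) / "transformed" / "order_totals"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "totals.json"
    target.write_text(json.dumps(totals, sort_keys=True))
    return target
```

Calling ingest_raw first and then transform_orders on the returned path leaves the original file untouched in the raw zone, so other consumers can later derive different datasets from the same source copy.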
A data swamp is a deteriorated data lake that is inaccessible to its intended users and provides little value.
James Dixon, then chief technology officer at Pentaho, is said to have coined the term data lake to contrast it with the data mart, a smaller repository of interesting attributes extracted from raw data. Dixon argued that data marts have several inherent problems, often referred to as information siloing, and promoted data lakes as an alternative. One example of a data lake is the distributed file system used in Apache Hadoop (HDFS).
Many companies also use cloud storage services such as Azure Data Lake and Amazon S3. Academic interest in the concept is growing as well. For instance, the Personal DataLake project at Cardiff University aims to create a new type of data lake that manages the big data of individual users by providing a single point for collecting, organizing, and sharing personal data.
An early data lake platform, Hadoop 1.0, had limited capabilities: batch-oriented MapReduce was the only processing paradigm associated with it. Interacting with the data lake required expertise in Java and MapReduce, or in higher-level tools like Pig and Hive (which were themselves batch-oriented). With Hadoop 2.0, resource management was separated out and taken over by YARN (Yet Another Resource Negotiator), and new processing paradigms such as streaming, interactive, and online processing became available through Hadoop and the data lake.
Data should not live in a data lake indefinitely, or the lake risks turning into a data swamp. Most companies that manage data lakes define data archival or data removal techniques and procedures to keep the lake within controllable limits.
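One way to picture such a procedure is an age-based retention pass over the lake. The sketch below is only an illustration under stated assumptions: the lake and the archive are plain directories, the archive lives outside the lake, file age is taken from the modification time, and the function name apply_retention and both thresholds are hypothetical.

```python
import shutil
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def apply_retention(lake_root, archive_root, archive_after_days,
                    delete_after_days, now=None):
    """Archive or delete files in the lake based on their age.

    Files older than delete_after_days are removed; files older than
    archive_after_days are moved to archive_root under the same relative
    path; everything else stays. archive_root is assumed to be outside
    lake_root.
    """
    lake_root = Path(lake_root)
    archive_root = Path(archive_root)
    now = time.time() if now is None else now
    summary = {"kept": 0, "archived": 0, "deleted": 0}

    # Collect the file list first so moves and deletes do not disturb the walk.
    files = [p for p in lake_root.rglob("*") if p.is_file()]
    for path in files:
        age_days = (now - path.stat().st_mtime) / SECONDS_PER_DAY
        if age_days > delete_after_days:
            path.unlink()
            summary["deleted"] += 1
        elif age_days > archive_after_days:
            target = archive_root / path.relative_to(lake_root)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(path), str(target))
            summary["archived"] += 1
        else:
            summary["kept"] += 1
    return summary
```

A call such as apply_retention("lake", "archive", 30, 365) would archive files untouched for more than 30 days and delete those older than a year. Real policies usually also depend on data classification and legal requirements, which this sketch leaves out.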