I’m not a huge fan of the term “Data Lake,” but I am less of a fan of it being hijacked for a sales agenda. So I would like to clear up a few points about what a Data Lake is and what it most definitely isn’t.
What it is
A Data Lake is a repository which can contain structured, semi-structured, and unstructured data. The purpose of the repository is to provide a way to bypass an ETL process and quickly move data into a queryable environment. This one location is ideal for mashing up disparate data sources and leveraging big data techniques such as schema-on-read, where structure is applied when the data is queried rather than when it is loaded.
I think you’ll find this definition a bit more satisfying than what you’ll find on Wikipedia:
A Data lake is a large storage repository that “holds data until it is needed”. The term was coined by James Dixon, Pentaho chief technology officer. As of 2015, data lakes could be described as “one of the more controversial ways to manage big data”
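To make the schema-on-read idea from my definition a little more concrete, here is a minimal sketch in plain Python. The event records, the field names, and the `read_with_schema` helper are all made up for illustration; a real lake would more likely do this with something like Hive over files in HDFS. The point is only that the raw lines are stored untouched and the schema is supplied by whoever reads them.

```python
import io
import json

# Hypothetical raw events, landed in the "lake" exactly as they arrived.
# Nothing was validated or reshaped on the way in.
RAW_EVENTS = io.StringIO(
    '{"user": "ann", "amount": "19.99", "ts": "2015-11-02"}\n'
    '{"user": "bob", "amount": 5, "note": "no timestamp on this one"}\n'
    '{"user": "cat", "amount": "oops", "ts": "2015-11-03"}\n'
)

# The schema belongs to the reader, not to the storage (an assumed layout).
SCHEMA = {"user": str, "amount": float, "ts": str}


def read_with_schema(lines, schema):
    """Yield one dict per raw line, coercing each field at read time."""
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            try:
                row[field] = cast(record[field])
            except (KeyError, ValueError, TypeError):
                # Missing or malformed values become None instead of
                # blocking the load, which is the trade-off being made.
                row[field] = None
        yield row


rows = list(read_with_schema(RAW_EVENTS, SCHEMA))
total = sum(r["amount"] for r in rows if r["amount"] is not None)
```

Notice that the bad `amount` and the missing `ts` did not stop anything from being stored. They only surface, as `None`, when someone reads the data with a schema that cares about them.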
What it means
Firstly, a data lake is a concept (more on Microsoft’s product in a moment) which blows a kiss in the wind with a promise that integration work is not necessary. Integration work, after all, is most of the work in an analytics project. If that work can be skipped, it’s a huge win for the people who want their reports built today. A data lake needs analytical tools that can do light ETL work and have a low barrier to entry. If this is HDFS and Hive, great; if it is machine learning algorithms analyzing images or text, not so great. If the data is clean and additive, wonderful. If it is full of duplicates, needs fuzzy join logic, revises history, or is full of codes that need to be translated into a human-readable form, not so great. A Data Lake hopes to spell the end of integration work and remove a huge barrier to getting to the fun stuff, the analytics. This may work well in some cases, but I’m not sure that it spells the end of Data Warehouses. I think we will find some happy medium between delayed integration work and fast analytics.
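Here is a rough sketch of what that delayed integration work can look like when it lands on the analyst at query time. Everything in it is hypothetical: the order rows, the status codes, and the `clean` helper are invented for illustration, and it only handles exact duplicates, not the fuzzy matching or revised history mentioned above.

```python
# Hypothetical rows as they might sit in the lake: one exact duplicate,
# and status codes that mean nothing to a report reader.
raw_rows = [
    {"order_id": 1, "status": "S"},
    {"order_id": 1, "status": "S"},
    {"order_id": 2, "status": "C"},
    {"order_id": 3, "status": "X"},
]

# Assumed lookup table; in a warehouse this would have been applied during ETL.
STATUS_LABELS = {"S": "Shipped", "C": "Cancelled"}


def clean(rows, labels):
    """Drop exact duplicates and translate status codes to readable labels."""
    seen = set()
    cleaned_rows = []
    for row in rows:
        key = (row["order_id"], row["status"])
        if key in seen:
            continue  # skip a row we have already kept
        seen.add(key)
        cleaned_rows.append(
            {
                "order_id": row["order_id"],
                # Codes missing from the lookup are flagged, not guessed.
                "status": labels.get(row["status"], "Unknown"),
            }
        )
    return cleaned_rows


cleaned = clean(raw_rows, STATUS_LABELS)
```

Even in this toy case the work did not disappear. It just moved from a load job into every query that touches the data.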
Earlier this year, Microsoft released a product called Data Lake. This is a version of the very product that they use to run Bing, Xbox, and Skype. The success story here is that Bing improved as a search engine because the data and analytic tools were opened up to all of Microsoft for experimentation. So, I guess if it works for Microsoft…