I am a huge fan of data and can’t imagine how exciting it would be to be swimming amongst an unlimited number of zeros and ones. I can’t wait for the next evolution of data…a data ocean! Seriously? Don’t even tease me. But, before we purchase floaties for the ocean, let’s understand what a data lake is first.
In a previous article, I wrote about what data warehousing is and the many purposes it serves. As a reminder, it typically stores an organization’s data in a very structured and well thought out database. This data can then be used by the organization to make key business decisions, generate reports, and perform year-over-year analysis. A lot of effort is exerted in order to prepare the data for consumption.
From a storage standpoint, a data lake has much more flexibility than a data warehouse. You can very easily store structured and unstructured data as well as old tires. You may not be able to derive much business mileage from the tread but a data lake can hold just about anything. Data can be stored in its original format which means there isn’t much thought necessary for how the data will be stored. If you have data that does not appear to have any immediate value, do not discard it. With storage being extremely inexpensive these days, toss it into the lake. (For comparison, a one megabyte hard drive would have cost approximately one million dollars 50 years ago but today would cost two cents.) In the future, this data could help you land that record catch that will push your business forward.
Key Differences between a Data Warehouse and a Data Lake
As mentioned earlier, a data warehouse would have a predetermined database schema in which the data would have specific data types, lengths, etc. This is what is known as schema-on-write. Inserts into the database would fail if certain rules, enforced by the database, aren’t met. A data lake, however, would be receptive to swimmers, boaters, and all kinds of marine life. It is capable of storing all flavors of data – relational, non-relational, social media generated data, mobile app generated data, etc. This is referred to as schema-on-read because the schema isn’t determined until a user begins querying the data. This puts the role of ETL developer on the user.
The quality of the data within a data warehouse will be much more pristine than a data lake because the data has gone through rigorous evaluation by the business and data transformation rules have been applied before the data is loaded. However, a data warehouse will not contain ALL of the data available to it because, more than likely, the business has deemed that some of the data has no analytical value. A data lake does not suffer from the “missing data” problem because all of the data is stored. Because it is stored in its original form, it may not be as cleansed as the data warehouse data.
A third difference between the two is data storage cost. Typically, a data warehouse will have high performing disks so that the business can get results in a timely fashion. A data lake will have less costly storage because performance isn’t as critical and the amount of data stored can be immense. You can literally pour data into the lake.
How do I implement a Data Lake?
Serious thought needs to go into how all of the data will be stored so that it is useful in the future. The last thing you want is a data dumping ground where finding specific data is difficult. There are cloud based solutions that can limit this side effect that fully support data lake implementations. For example, Amazon Web Services (AWS) offers numerous solutions. Amazon Simple Storage Service (S3) is a very cost effective way to store a variety of different data formats. Amazon Glue provides the critical data cataloging necessary to find data easily as well as ETL (Extract, Transform, Load) functionality. There are additional Amazon services that enhance the analytics but these two would be foundational for the lake.
Microsoft’s Azure Data Lake is another option for getting into the cloud. If an organization is primarily a Microsoft shop, this choice probably makes a lot of sense. It will be fairly easy to just add onto the Microsoft stack. Microsoft’s data lake integrates easily with their Azure SQL Data Warehouse, Data Factory, and Power BI products to provide a complete analytics platform.
Ok, I get the differences between the two but which one do I go with? Well, I will answer that with the infamous “it depends” response. Do I have to pick one or the other? No, you don’t. One or the other or a combination of both will be dictated by business needs.
By design, a data warehouse will facilitate organizational reporting as well as provide business analysts with a known and trusted source of data. From a technical perspective, analysts will need to know a query language and/or a reporting tool. These skill sets are not terribly difficult to acquire.
On the other hand, retrieving information from the data lake will require more savvy technical skills. Data scientists may be required to make the best use of this data and they will likely leverage Apache Hadoop, an open source software product, that has become one of the industry leaders in this arena.
So if you want an environment that is well organized and easily queried, then a data warehouse is your answer. If you want an environment where no data is lost and can always be retrieved, then consider a data lake. Likely, both scenarios are applicable and you will want to have a highly responsive and curated data warehouse, plus an extremely low cost repository for the mountains of data that you access less frequently.