Businesses are defining themselves on being able to tune and optimize their operation on the smallest of data elements, but given the huge amount of data spread how is this really accomplished? Databases are cheap, so it is very common these days to have multiple (sometimes 100’s) of separate databases running within an organization all providing their own part to support the overall company. With this organic data spread it becomes increasingly difficult to get a consolidated birds-eye view of a company. Querying each database, one-by-one, piecing together all the data, is logistically not practical nor timely. There has to be a better way.
Fortunately there is, it is called data warehousing. This is where we bring together all the data from all around the organization into a single database, optimized for querying and long term storage. Put simply, it is the one database to rule them all.
That, however, is doing a major disservice to the world of data warehousing, so let us take a deeper dive on what a data warehouse is and how it differs from the operational databases you are most likely familiar with at the moment. The story doesn’t end there, as the world of big data is defined, we also have the concept of a data lake, but we’ll leave that for another article.
As I’ve already described, a data warehouse is a consolidation of an organization’s data into one environment providing one location where all of the pertinent and critical information resides for reporting and analytics in support of the business. That sounds logical enough, but what does that really mean? Are you simply copying the data from one database to another?
In a small way yes, but in a much larger context it is so much more. An operational database is designed to support the daily operations of the company, where data is continually inserted, updated and sometimes deleted, with changes usually not being tracked. These databases are optimized for this type of workload and don’t generally keep a lot of historical data hanging around too long.
A data warehouse, in contrast, is designed to never shrink, but to continually collect and store data in a way that is optimized for querying and analysis. Data inside a warehouse is rarely updated and never deleted. A huge benefit to this approach is that year-over-year analysis can be easily done with a properly architected data warehouse solution.
Collecting the data
The warehouse rarely reflects the schemas (the way we lay out data) of the operational database it is sourced from. The operational data schema is designed to have a single record for a given object, for example, a customer record. The warehouse, however, is designed to show how that customer record has evolved over time, allowing for a much richer and deeper analysis.
Not all data makes it to the warehouse, though most of it should. As part of the process of defining the warehouse schema, various indicators (or flags) are determined as to what the data means in the context of the business. This is known as “one version of the truth”. An example of this could be a flag to determine how hot or cold a given customer is at any point in time by consolidating data from a number of different operational data sources. This logic is extremely important to get right as it defines the metrics to which the business will make crucial decisions upon and serves as the benchmark to which they measure against for future transactions.
Keeping a warehouse up-to-date with the latest data is not a trivial task. You don’t want to place any additional load on the operational databases as that may impact the performance and be detrimental to the customer experience. As part of the loading process there will be data transformation and logic applied to it as it is stored in the warehouse. This whole process is often referred to as ETL – Extract, Transform and Load. I’ll talk about the world of ETL in a future article.
Before I leave ETL and the importance of getting data into the warehouse, data does not necessarily have to be sourced from accompanying databases; it can be any source of data, including flat files, or even third party data, such as the US Postal Service data to give deeper depth to geographic analysis for example.
Key Performance Indicators
The data warehouse is the backbone to the business intelligence (BI) layer. Having all of the data consolidated in one area provides a platform for deeper analysis. This can lead to discovering areas of opportunity for new product development and additional revenue generation capability. Business leaders want to focus on their organization’s Key Performance Indicators (KPI) in order to keep a pulse on how they are doing overall. Areas of concern bubble up as soon as they become problematic and can be addressed immediately. The best data driven organizations do not look at the Data Warehouse as a data repository but instead treat it as a source that facilitates and impacts their daily decisions.
At its heart, the data warehouse is a specialized database with a schema specifically designed to store the data in a way that makes it efficient to continually augment data and query very large volumes. Examples of this that you may be already familiar with are SQL Server and Oracle.
Depending on the end usage of the data warehouse, there are different platforms optimized for that specific task. These include Teradata and Amazon’s Redshift (warehouse in the cloud). The whole world of possibilities with warehousing in the cloud is something I am personally very excited about. More on that in future articles.
Undertaking and championing the construction of a data warehouse can seem like a daunting project, but with the right planning and business objectives it can be done in an evolutionary process without disruption to the business.
We here at the MacLaurin Group have experience in all facets of designing and building a data warehouse, for organizations small and large.
One of the side benefits of creating a data warehouse project is that it will quickly identify all the gaps you may have in your data. You may not have as much data as you first thought you had, particularly when it comes to behavioral data (how was data collected vs. what was collected). The good news is that collecting data is usually not the hard part, knowing what to collect is the challenge and the data warehouse will help you narrow in on that.
As your organization grows, if you find yourself running multiple databases with no single consolidated view across them all, no easy way to get at historical results and no year-over-year comparisons, then you probably want to start thinking about getting a data warehouse.
You know what they say? It is never too late to start.
Image Source: Franklin Heijnein