Essential Steps for the Integrated Enterprise Data Warehouse, Part 1
Enterprise Architecture
In this two-part article, I propose a specific architecture for building an integrated enterprise data warehouse (EDW). This architecture directly supports master data management (MDM) efforts and provides the platform for consistent business analysis across the enterprise. I describe the scope and challenges of building an integrated EDW, and I provide detailed guidance for designing and administering the necessary processes that support integration. This article has been written in response to a lack of specific guidance in the industry as to what an integrated EDW actually is and what necessary design elements are needed to achieve integration.
What Does an Integrated EDW Deliver?
The mission statement for the integrated EDW is to provide the platform for business analysis to be applied consistently across the enterprise. Above all, this mission statement demands consistency across business process subject areas and their associated databases. Consistency requires:
-
Detailed textual descriptions of entities such as customers, products, locations and calendars to be applied uniformly across subject areas, using standardized data values. This is a fundamental tenet of MDM.
-
Aggregated groupings such as types, categories, flavors, colors and zones defined within entities to have the same interpretations across subject areas. This can be viewed as a higher-level requirement on the textual descriptions.
-
That constraints posed by business intelligence (BI) applications, which attempt to harvest the value of consistent text descriptions and groupings, be applied with identical application logic across subject areas. For instance, constraining on a product category should always be driven from a field named Category found in the Product dimension.
-
That numeric facts are represented consistently across subject areas so that it makes sense to combine them in computations and compare them to each other, perhaps with ratios or differences. For example, if Revenue is a numeric fact reported from multiple subject areas, then the definitions of each of these revenue instances must be the same.
-
That international differences in languages, location descriptions, time zones, currencies and business rules be resolved to allow all of the previous consistency requirements to be achieved.
-
That auditing, compliance, authentication and authorization functions be applied in the same way across subject areas.
-
Coordination with industry standards for data content, data exchange and reporting, where those standards impact the enterprise. Typical standards include ACORD (insurance), MISMO (mortgages), SWIFT and NACHA (financial services), HIPAA and HL7 (health care), RosettaNet (manufacturing) and EDI (procurement).
The Ultimate Litmus Test for Integration
Even an EDW that meets all of the consistency requirements must additionally provide a mechanism for delivering integrated reports and analyses from BI tools, attached to many database instances, possibly hosted on remote, incompatible systems. This is called drilling across and is the essential act of the integrated EDW. When we drill across, we gather results from separate business process subject areas and then align or combine these results into a single analysis.
For example, suppose the integrated EDW spans manufacturing, distribution and retail sales in a business that sells audio/visual systems. Assume that each of these subject areas is supported by a separate transaction processing system. A properly constructed drill-across report could look like Figure 1.
|
Figure 1: Three Fact Table Drill-Across Report
|
The first two columns are row headers from the Product and Calendar conformed dimensions, respectively. The remaining three fact columns each come from separate databases, namely manufacturing, distribution and retail sales. This deceptively simple report can only be produced in a properly integrated EDW. In particular, the Product and Calendar dimensions must be available in all three separate databases, and the Category and Period attributes within those dimensions must have identical contents and interpretations. Although the metrics in the three fact columns are different, the meaning of the metrics must be consistent across product categories and times.
You must understand and appreciate the tight constraints on the integrated EDW environment demanded by the above report. If you don’t, you won’t understand this article, and you won’t have the patience to study the detailed steps described below. Or, to put the design challenge in other terms, if you eventually build a successful integrated EDW, you will have visited every issue that follows. With those warnings, read on.
Organizational Challenges
The integrated EDW deliverables I’ve described are a daunting list indeed. But for these deliverables to even be possible, the enterprise must make a profound commitment, starting from the executive suite. The separate divisions of the enterprise must have a shared vision of the value of data integration, and they must anticipate the steps of compromise and decision-making that will be required. This vision can only come from the senior executives of the enterprise, who must speak very clearly on the value of data integration.
Existing MDM projects provide an enormous boost for the integrated EDW, because presumably the executive team already understands and approves the commitment to building and maintaining master data. A good MDM resource greatly simplifies, but does not eliminate, the need for the EDW team to build the structures necessary for data warehouse integration.
In many organizations, a chicken-and-egg dilemma exists as to whether MDM is required before an integrated EDW is possible or whether the EDW team creates the MDM resources. Often, a low-profile EDW effort to build conformed dimensions solely for data warehouse purposes morphs into a full-fledged MDM effort that is on the critical path to supporting main-line operational systems. In my classes since 1993, I have shown a backward-pointing arrow leading from cleaned data warehouse data to operational systems. In the early days, we sighed wistfully and wished that the source systems cared about clean, consistent data. Now, more than 15 years later, we seem to be getting our wish!
Conformed Dimensions and Facts
Since the earliest days of data warehousing, conformed dimensions have been used to consistently label and constrain separate data sources. The idea behind conformed dimensions is very simple: two dimensions are conformed if they contain one or more common fields, whose contents are drawn from the same domains. That results in constraints and labels having the same content and meaning when applied against separate data sources.
Conformed facts are simply numeric measures that have the same business and mathematical interpretations so that they may be compared and computed against each other consistently.
Using the Bus Matrix to Communicate with Executives
When you combine the list of EDW subject areas with the notion of conformed dimensions, a powerful diagram emerges, which we call the enterprise data warehouse bus matrix. A typical bus matrix is shown in Figure 2.
|
Figure 2: Bus Matrix for a Manufacturing EDW
|
The business process subject areas are shown along the left side of the matrix and the dimensions are shown across the top. An asterisk marks where a subject area uses the dimension. Note that “subject area” in our vocabulary corresponds to a business process, typically revolving around a transactional data source. Thus, “customer” is not a subject area.
At the beginning of an EDW implementation, this bus matrix is very useful as a guide, both to prioritize the development of separate subject areas and to identify the potential scope of the conformed dimensions. The columns of the bus matrix are the invitation list to the conformed dimension design meeting.
Before the conformed dimension design meeting occurs, this bus matrix should be presented to senior management, perhaps in exactly the form of Figure 2. Senior management must be able to visualize why these dimensions (master entities) attach to the various business process subject areas, and they must appreciate the organizational challenges of assembling the diverse interest groups together to agree on the conformed dimension content. If senior management is not interested in what the bus matrix implies, then, to make a long story short, you have no hope of building an integrated EDW.
It is worth repeating the definition of a conformed dimension at this point to take some of the pressure off of the conforming challenge. Two instances of a dimension are conformed if they contain one or more common fields, whose contents are drawn from the same domains. This means that the individual subject area proponents do not have to give up their cherished private descriptive attributes. It merely means that a set of master, universally agreed-upon attributes must be established. These master attributes then become the contents of the conformed dimension and become the basis for drilling across.
Managing the Backbone of the Integrated EDW
The backbone of the integrated EDW is the set of conformed dimensions and conformed facts. Even if the enterprise executives support the integration initiative and the conformed dimension design meeting goes well, there is a lot to the operational management of this backbone. This management can be visualized most clearly by describing two personality archetypes: the dimension manager and the fact provider. Briefly, the dimension manager is a centralized authority who builds and distributes a conformed dimension to the rest of the enterprise, and the fact provider is the client who receives and utilizes the conformed dimension, almost always while managing one or more fact tables within a subject area.
At this point, I must make three fundamental architectural claims to prevent false arguments from arising:
-
The need for dimension managers and fact providers arises solely from the natural reuse of dimensions across multiple fact tables (or online analytical processing cubes). Once the EDW community has committed to supporting cross-subject area analysis, there is no way to avoid all the steps described in this article.
-
Although I describe the handoff from the dimension manager to the fact provider as if it were occurring in a distributed environment where they are remote from each other, their respective roles and responsibilities are the same whether the EDW is fully centralized on a single machine or is profoundly distributed across many diverse machines in different locations.
-
The roles of dimension manager and fact provider, although obviously couched in dimension modeling terms, do not arise from a particular modeling persuasion. All of the steps described in this article would be needed in a fully normalized environment.
The second half of this article, which will appear in the next issue of DM Review, describes the detailed roles and responsibilities of the dimension manager and the fact provider.