Decode360's Blog

业精于勤而荒于嬉 QQ:150355677 MSN:decode360@hotmail.com

  BlogJava :: 首页 :: 新随笔 :: 联系 ::  :: 管理 ::
  397 随笔 :: 33 文章 :: 29 评论 :: 0 Trackbacks

Essential Steps for the Integrated Enterprise Data Warehouse, Part 2

If you plan to combine data across subject areas, these personality types will do the job.

In the first half of this article, published in the April issue, we saw that drilling across was the defining act of the integrated enterprise data warehouse (EDW). Remember that drilling across requires the integration of data from two more business process subject areas. We then defined two personality archetypes: the dimension manager and the fact provider. The dimension manager is a centralized authority who builds and distributes a conformed dimension to the rest of the enterprise, and the fact provider is the client who receives and utilizes the conformed dimension, almost always while managing one or more fact tables within a subject area.

We are now ready to roll up our sleeves and describe exactly what the dimension manager and fact provider do.

The Dimension Manager

The dimension manager defines the content and structure of a conformed dimension and delivers that conformed dimension to downstream clients known as fact providers. This role can definitely exist within a master data management (MDM) framework, but the role is much more focused than just being the keeper of the single truth about an entity. The dimension manager has a list of deliverables and responsibilities, all oriented around creating and distributing physical versions of the dimension tables that represent the major entities of the enterprise. In many enterprises, key conformed dimensions include customer, product, service, location, employee, promotion, vendor and calendar. As we describe the dimension manager’s tasks, we will use customer as the example to keep the discussion from being too abstract. The tasks of the customer dimension manager include:

Defining the content of the customer dimension. The dimension manager chairs the design meeting for the conformed customer dimension. At that meeting, all the stakeholders from the customer-facing transaction systems come to agreement on a set of dimensional attributes that everyone will use when drilling across separate subject areas. Remember that these attributes are used as the basis for constraining and grouping customers. Typical conformed customer attributes include Type, Category, Location (multiple fields implementing an address), Primary Contact (name, title, address), First Contact Date, Credit Worthiness, Demographic Category and others. Every customer of the enterprise appears in the conformed customer dimension.

Receiving notification of new customers. The dimension manager is the keeper of the master list of dimension members, in this case, customers. The dimension manager must be notified whenever a new customer is registered.

Deduplicating customer dimension. The dimension manager must deduplicate the master list of customers. Customer lists in the real world are nearly impossible to deduplicate completely. Even when customers are registered through a central MDM process, it is often possible to create duplicates, either for individual customers or business entities.

Assigning a unique durable key to each customer. The dimension manager must identify and keep track of a unique durable key for each customer. Many database administrators (DBAs) automatically assume that this is the “natural key,” but quickly choosing the natural key may be the wrong choice. A natural key may not be durable! Using the customer example, if there is any conceivable business rule that could change the natural key over time, then it is not durable. Also, in the absence of a formal MDM process, natural keys can arise from more than one customer-facing process. In this case, different customers could have natural keys of very different formats. Finally, a source system’s natural key may be a complex, multifield data structure. For all these reasons, the dimension manager needs to step back from literal natural keys and assign a unique durable key that is completely under the control of the dimension manager. I recommend that this unique, durable key be a simple sequentially assigned integer, with no structure or semantics embedded in the key value.

Tracking time variance of customers with Type 1, 2 and 3 slowly changing dimensions (SCDs). The dimension manager must respond to changes in the conformed attributes describing a customer. Much has been written about tracking the time variance of dimension members using SCDs. A Type 1 change overwrites the changed attribute and therefore destroys history. A Type 2 change creates a new dimension record for that customer, properly timestamped as of the effective moment of the change. A Type 3 change creates a new field in the customer dimension that allows an “alternate reality” to be tracked. The dimension manager updates the customer dimension in response to change notifications received from various sources.*

Assigning surrogate keys for the customer dimension. Type 2 is the most common and powerful of the SCD techniques because it provides precise synchronization of a customer description with that customer’s transaction history. Because Type 2 creates a new record for the same customer, the dimension manager is forced to generalize the customer dimension primary key beyond the unique, durable key. The primary key should be a simple surrogate key, sequentially assigned as needed, with no structure or semantics in the key value. This primary key is separate from the unique durable key, which simply appears in the dimension as a normal field. The unique, durable key is the glue that binds the separate SCD Type 2 records for a single customer together.

Handling late arriving dimension data. When the dimension manager receives late notification of a Type 2 change affecting a customer, special processing is needed. A new dimension record must be created and the effective dates of the change adjusted. The changed attribute must be propagated forward in time through existing dimension records.*

Providing version numbers for the dimension. Before releasing a changed dimension to the downstream fact providers, the dimension manager must update the dimension version number if Type 1 or Type 3 changes have occurred or if late-arriving Type 2 changes have occurred. The dimension version number does not change if only contemporary Type 2 changes have been made since the previous release of the dimension.

Adding private attributes to dimensions. The dimension manager must incorporate private departmental attributes in the release of the dimensions to the fact providers. These attributes are of interest to only a part of the EDW community, perhaps a single department.

Building shrunken dimensions as needed. The dimension manager is responsible for building various shrunken dimensions that are needed by fact tables at high levels of granularity. For example, a customer dimension might be rolled up to the Demographic Category to support a fact table that reports sales at this level. The dimension manager is responsible for creating this shrunken dimension and assigning its keys.

Replicating dimensions to fact providers. The dimension manager periodically replicates the dimension and its shrunken versions to all the downstream fact providers. All the fact providers should attach the new dimensions to their fact tables at the same time, especially if the version number has changed.

Documenting and communicating changes. The dimension manager maintains metadata and documentation describing all the changes made to the dimension with each release.

Coordinating with other dimension managers. Although each conformed dimension can be administered separately, it makes sense for the dimension managers to coordinate their releases to lessen the impact on the downstream fact providers.

The Fact Provider

The fact provider sits downstream from the dimension manager and responds to each release of each dimension that is attached to a fact table under the provider’s control. Tasks include:

Avoiding changes to conformed attributes. The fact provider must not alter the values of any conformed dimension attributes, or the whole logic of drilling across diverse subject areas will be corrupted.

Responding to late-arriving dimension updates. When the fact provider receives late-arriving updates to a dimension, the primary keys of the newly created dimension records must be inserted into all fact tables using that dimension whose time spans overlap the date of the change. If these newly created keys are not inserted into the affected fact tables, then the new dimension record will not tie to the transactional history. The new dimension key must overwrite existing dimension keys in the affected fact tables from the time of the dimension change up to the next dimension change that was already correctly administered.*

Tying the conformed dimension release to the local dimension. The dimension manager must deliver to the fact provider a mapping that ties the fact provider’s local natural key to the primary surrogate key assigned by the dimension manager. In the surrogate key pipeline (see next task), the fact provider replaces the local natural keys in the relevant fact tables with the conformed dimension primary surrogate keys using this mapping.

Processing dimensions through surrogate key pipeline. The fact provider converts the natural keys attached to contemporary transaction records into the correct primary surrogate keys and loads the fact records into the final tables with these surrogate keys.

Handling late-arriving facts. The surrogate key pipeline mentioned in the previous paragraph can be implemented in two different ways. Traditionally, the fact provider maintains a current key lookup table for each dimension that ties the natural keys to the contemporary surrogate keys. This works for the most current fact table data where you can be sure that the contemporary surrogate key is the one to use. But the lookup tables cannot be used for late-arriving fact data because it is possible that one or more old surrogate keys must be used. In this traditional approach, the fact provider must revert to an inefficient dimension table lookup in order to figure out which old surrogate key applies.

A more modern approach to the surrogate key pipeline implements a dynamic cache of records looked up in the dimension table rather than a separately maintained lookup table. This cache handles contemporary fact records as well as late-arriving fact records with a single mechanism.*

Synchronizing dimension releases with other fact providers. It is critically important for all the fact providers to respond to dimension releases at the same time. Otherwise a client application attempting to drill across subject areas will encounter dimensions with different version numbers. See the description of using dimension version numbers in the next section.

Configuring Business Intelligence (BI) Tools

There is no point in going to all the trouble of setting up dimension managers, fact providers and conformed dimensions if you aren’t going to perform drill-across queries. In other words, you need to sort-merge separate answer sets on the row headers defined by the values from the conformed dimension attributes. There are many ways to do this in standard BI tools and in straight SQL.

Using dimension version numbers in drill across queries. If the requesting application does not include the version number in the select list, then erroneous results are possible because dimension attributes may not be consistent across subject areas. If the requesting application does include the version number in the select list, then at least the results from the fact table queries will end up on separate rows of the answer set, properly labeled by the dimension version. This isn’t much consolation to the end user, but at least the problem is diagnosed in an obvious way.

Figure 1 shows a report drilling across the same three databases as in the first figure in Part 1, but where a dimension version mismatch occurs. Perhaps the definition of certain product categories has been adjusted between product dimension version 7 and version 8. In this case, the retail sales fact table is using version 8 whereas the other two fact tables are still using version 7. By including the product dimension version attribute in the SQL select list, we automatically avoid merging potentially incompatible data. Such an error would be particularly insidious because without the rows being separated, the result would look perfectly reasonable but could be disastrously misleading.

EDW-03
Figure 1: Drill Across Report With a Dimension Version Mismatch

Joint Responsibilities

Dimension managers and fact providers must ensure that auditing, compliance, authentication, authorization and usage tracking functions are applied uniformly for all BI clients. This set of responsibilities is especially challenging because it is outside the scope of the steps described in this article. Even when modern role-enabled authentication and authorization safeguards are in place when using the EDW, subtle differences in the definition of roles may give rise to inconsistency. For example, a role named “senior analyst” may have different interpretations at different entry points to the EDW. The best that can be said for this difficult design challenge is that personnel responsible for defining the Lightweight Directory Access Protocol-enabled roles should be invited to the original dimension conforming meetings so that they become aware of the scope of EDW integration.

The integrated EDW promises a rational, consistent view of enterprise data. This promise has been repeated endlessly in the trade literature. But until now, there has been no specific design for actually implementing the integrated EDW. Although this implementation of the integrated EDW must seem daunting, I believe that the steps and responsibilities I have described are basic and unavoidable, no matter how your data warehouse environment is organized. Finally, this architecture represents a distillation of more than two decades’ experience in building data warehouses based on conformed dimensions and facts. If you carefully consider the detailed recommendations in these articles, you should avoid reinventing the wheel when you are building your integrated EDW.

*See any of the Kimball Group books or my Web site for more information on this topic.


 

Ralph Kimball is the founder of the Kimball Group and Kimball University where he has taught data warehouse design to more than 10,000 students. He is known for the best selling series of data warehouse "Toolkit" books. He started with a Ph.D. in man-machine systems from Stanford in 1973 and has spent the last 34 years designing systems for end users that are simple and fast. You can reach him at ralph@kimballgroup.com .

posted on 2008-10-09 21:53 decode360 阅读(217) 评论(0)  编辑  收藏 所属分类: 13.English

只有注册用户登录后才能发表评论。


网站导航: