Potential Problems in Data Warehousing
This is a reposted article describing potential problems that can arise during a data warehouse's daily ETL process, along with some theoretical remedies for them. Per Murphy's Law, you should anticipate what can go wrong and design the possibility of failure out of the system from the start. The original article follows:
Murphy’s Laws of Data Warehousing: What Goes Bump in the Night?
InfoManagement Direct, August 15, 2008
Fon Silvers
Murphy’s Law tells us that whatever can go wrong will go wrong. This axiom applies to all aspects of life, including data warehousing. The following corollaries to Murphy’s Law relate this inevitability of something going “bump” in the night to data warehousing.
Your data warehouse has been extracting data from a source system. The extract, transform and load (ETL) process from the source system to the data warehouse is stable. Everything seems to be working correctly, and then:
1. A rose by any other name - The source system decides to change the meaning of a data element that you are extracting on a daily basis. The structure and form of the data are the same, but the meaning has changed. If the extract, transform and load application enforces relational integrity, you may know about this source data change before a data warehouse customer tells you about it.
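One way to catch such a change early is a referential-integrity check inside the ETL itself. The following is a minimal sketch, not the author's implementation; the extract rows, the `customer_id` field, and the reference key set are all invented for illustration.

```python
def find_orphan_keys(extracted_rows, key_field, valid_keys):
    """Return rows whose key no longer resolves against the reference set."""
    return [row for row in extracted_rows if row[key_field] not in valid_keys]

# Hypothetical daily extract: one value suddenly carries a new meaning
# (e.g., the source began storing household ids in the customer_id column).
extract = [
    {"customer_id": "C001", "amount": 120.0},
    {"customer_id": "H938", "amount": 75.5},   # no longer a known customer key
]
known_customers = {"C001", "C002", "C003"}

orphans = find_orphan_keys(extract, "customer_id", known_customers)
if orphans:
    print(f"Referential integrity alert: {len(orphans)} unresolved keys")
```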
2. Decoy - The source system decides to switch to a new operational application with its new sets of data. For back-out and backward compatibility purposes, the old operational application and its data are allowed to continue their existence. Your only clue is that data volumes from the source system decrease. If the ETL application monitors its data volumes, you may be aware of the change before a data warehouse customer tells you about it.
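A volume monitor for this case can be as simple as comparing today's extract against a trailing average. This is a hedged sketch with assumed counts and an assumed threshold; a real monitor would read its history from the ETL run log.

```python
from statistics import mean

def volume_drop_alert(history, today_count, drop_ratio=0.5):
    """Flag an extract whose row count falls below a fraction of the trailing average."""
    baseline = mean(history)
    return today_count < baseline * drop_ratio

# Hypothetical counts: the old application still runs, but most traffic
# has quietly moved to its replacement, so the extract volume sags.
daily_counts = [10_250, 10_190, 10_400, 10_310, 10_280]
if volume_drop_alert(daily_counts, today_count=4_900):
    print("Volume alert: extract is far below its trailing average")
```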
3. New caboose - The source system adds new processes to the end of the job stream. The new processes mean the data you are extracting, which had previously been in its final state, is now in an intermediate state. The caboose of the train is no longer the caboose, or at least, not the last car. If the ETL application monitors the quality of its data, you might observe the change before a data warehouse customer tells you about it.
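Data that slips from final to intermediate state usually shows up as a quality metric drifting. A minimal sketch, assuming each record carries a `status` field whose terminal value is `CLOSED`; both the field and the threshold are assumptions made for this example.

```python
def final_state_ratio(rows, status_field="status", terminal_value="CLOSED"):
    """Fraction of extracted rows that are in their expected final state."""
    if not rows:
        return 1.0
    closed = sum(1 for row in rows if row[status_field] == terminal_value)
    return closed / len(rows)

# After the source adds a new downstream process, many rows arrive half-done.
rows = [{"status": "CLOSED"}, {"status": "PENDING"}, {"status": "PENDING"}]
ratio = final_state_ratio(rows)
if ratio < 0.95:
    print(f"Quality alert: only {ratio:.0%} of rows are in final state")
```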
4. Data quality - The source system blows a gasket. The quality of the data goes through the floor. If the ETL application is monitoring the quality of its data, you may have been able to see that the source system has a problem. In this situation, the data warehouse is the least of the system's problems. Even though you've exercised the due diligence to identify the presence of poor quality data, the source system is not able to provide high-quality data. In this conundrum, the data warehouse may opt to exclude erroneous data from a source system until that source system can clean up its own data.
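The exclude-until-clean tactic is often implemented as a quarantine split: validate each row, load the clean ones, and hold the rejects until the source is repaired. The validation rule below (a non-negative amount) is purely an assumed example.

```python
def split_for_quarantine(rows, is_valid):
    """Partition rows into (loadable, quarantined) using a validation predicate."""
    clean, rejected = [], []
    for row in rows:
        (clean if is_valid(row) else rejected).append(row)
    return clean, rejected

rows = [{"order_id": 1, "amount": 40.0}, {"order_id": 2, "amount": -99.0}]
clean, rejected = split_for_quarantine(rows, lambda r: r["amount"] >= 0)
print(f"Loading {len(clean)} rows; {len(rejected)} held until the source is fixed")
```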
5. Data volume - The source system dramatically increases its data volume. The cause of the sudden increase in data volume can be anything from true expansion to corporate reorganization. Regardless of the cause, the effect is a significantly higher number of records passing through the ETL application, causing data files to exceed their allocations and the data warehouse to miss its service level agreement (SLA). If the ETL application monitors its data volumes, you may notice the change before a data warehouse customer asks you why the data warehouse failed to meet its SLA.
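The same volume monitoring protects against spikes. The sketch below flags a run whose row count threatens the file allocation or whose projected duration would overrun the SLA window; the capacity and throughput figures are assumptions, not measurements.

```python
def sla_risk(row_count, allocated_rows, rows_per_second, sla_seconds):
    """Report whether a run risks exceeding its space allocation or its SLA window."""
    over_allocation = row_count > allocated_rows
    projected_seconds = row_count / rows_per_second
    over_sla = projected_seconds > sla_seconds
    return over_allocation, over_sla

# Hypothetical: a corporate reorganization triples the nightly feed.
over_alloc, over_sla = sla_risk(
    row_count=3_200_000, allocated_rows=1_500_000,
    rows_per_second=500, sla_seconds=4 * 3600,
)
if over_alloc or over_sla:
    print("Volume alert: tonight's load risks its allocation and/or the SLA")
```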
6. Higher power - Upper management decides the data warehouse should include a subject area(s) specifically in support of a new high-profile, high-ROI and politically charged application. The data will be extracted, stored and reported to meet the specific requirements of the new operational application. This is when a data warehouse is a victim of its own success. Upper management makes such a decision because the data in the data warehouse is expected to increase the value and ROI of the new operational application. The best defenses against such attempts to operationalize a data warehouse are:
1. A complete and comprehensive enterprise data model, so that all subject areas of the enterprise have a home;
2. An understanding of and commitment to the data warehousing principles established by Ralph Kimball and Bill Inmon; and
3. Knowledge of available data warehousing architectures that simultaneously follow data warehousing principles and provide data to operational applications in order to meet specific requirements.
7. Wrong question - Data warehouse customers try to make business decisions by asking a question of the data warehouse and then finding the answer using the data in the data warehouse. But if they don't understand how to formulate their question or how to find the answer, they often blame the data warehouse. The best remedy for such misunderstanding is metadata. Metadata provides descriptions and definitions of the data in a data warehouse. Delivered side-by-side with the data, metadata gives data warehouse customers the information they need to correctly formulate questions and answers.
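Delivering metadata side by side with the data can be as lightweight as a data dictionary keyed by column name. The record shape below is an assumption made for illustration, not a standard; the point is simply that definitions travel with the data.

```python
from dataclasses import dataclass

@dataclass
class ColumnMetadata:
    name: str
    definition: str
    source: str
    unit: str

# A hypothetical dictionary entry surfaced next to query results.
dictionary = {
    "net_sales": ColumnMetadata(
        name="net_sales",
        definition="Gross sales minus returns and discounts, per day and store",
        source="POS daily summary feed",
        unit="USD",
    )
}
print(dictionary["net_sales"].definition)
```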
8. Attractive nuisance - You find out that a table or view, which was descoped by a project a year ago, has been erroneously allowed to continue its existence in the data warehouse. To make matters worse, data warehouse customers and operational applications found this table or view and are actively using it. As a result, you have stale and obsolete data that is being used actively. The best defense against such mishaps is dynamic metadata, which provides the activity (i.e., rows loaded and date loaded) of that table. Metadata providing statistics about individual tables would lead you to wonder why there is a table that is queried but never loaded.
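Dynamic metadata makes the queried-but-never-loaded anomaly mechanical to find: compare load activity against query activity and flag the mismatch. The dates below are invented inputs standing in for whatever audit facility the warehouse actually has.

```python
from datetime import date

def stale_but_active(last_loaded, last_queried, max_age_days=90):
    """A table is suspect if it is still being queried long after its last load."""
    age = (last_queried - last_loaded).days
    return age > max_age_days

# Hypothetical audit facts for a table descoped a year ago.
if stale_but_active(last_loaded=date(2007, 6, 30), last_queried=date(2008, 8, 1)):
    print("Stale-table alert: actively queried but not loaded in over 90 days")
```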
9. Locksmith - The key of the source data changes. Usually when this occurs, a new data field has been added to the key. Suddenly, the changed data capture process of the ETL application misinterprets source data. If the ETL application monitors its data volumes, you may have noticed the change in the number of records the ETL application applies to the data warehouse.
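A key change typically announces itself as a sudden jump in the share of rows the change-capture step considers new, because the widened key makes every existing row look unfamiliar. A minimal sketch with an assumed threshold:

```python
def cdc_key_change_suspected(changed_rows, total_rows, max_change_ratio=0.3):
    """If nearly every source row looks changed, suspect a key definition change."""
    return total_rows > 0 and changed_rows / total_rows > max_change_ratio

# Hypothetical nightly run: after the source added a column to its key,
# change capture marks almost the whole table as new.
if cdc_key_change_suspected(changed_rows=980_000, total_rows=1_000_000):
    print("CDC alert: 98% of rows flagged as changed; verify the source key")
```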
10. Didn't I Tell You? - You find out that a logic gap exists in an ETL application. This logic gap is causing the ETL application to misstate the data it is loading into the data warehouse. Such a logic gap can be the result of new logic recently added to an operational source system, or it may have existed from the beginning of the ETL application and only recently been revealed by a new set of conditions. The number of rows has not changed. The keys of the data have not changed. It could be as simple as the application of a new sales tax or profit margin. The best defense against such logic gaps is a good working relationship with a data steward who knows the subject area and can recognize such errors.
It is rather easy to create an ETL application that trusts the world to be a perfect place wherein nothing goes wrong. A robust ETL application, however, includes metadata, monitors its own data volumes and assesses data quality. A robust data warehouse is built on a broad foundation provided by a comprehensive enterprise data model and data warehousing principles. These methods that enhance the quality and viability of a data warehouse can be a tough sell when allocating personnel, funds and other resources to the creation of a data warehouse. That is, until you understand the world is not a perfect place and anything that can go wrong, will go wrong - even in a data warehouse.
Fon Silvers graduated from the University of South Florida with an MBA concentrating in Information Systems. He is currently an ETL analyst for a Fortune 500 retail corporation, developing ETL applications for a data warehouse program. In March 2008 Silvers published his first book, Building and Maintaining a Data Warehouse. He may be reached at fon.silvers@verizon.net.