Data Mining from a Statistical Perspective
John Maindonald
Statistical Consulting Unit of the Graduate School, Australian National University
Abstract
Data mining is the data analysis component of Knowledge Discovery in Databases (KDD). According to its exponents, KDD encompasses all steps from the collection and management of data through to data analysis. Frequent themes are analysis (both exploratory and formal), methods for handling the computations, and automation, all with a focus on large data sets. At a superficial level, any data set is large where there is a large number of observations, perhaps running into the millions or beyond. Such data sets do pose challenging database management and computational problems but are not necessarily large for predictive purposes, where it is essential to have regard to the data structure. Thus a survey of the use of fax machines in Australian universities might gather huge amounts of data from just six universities, making prediction to Australian universities as a whole hazardous. In addition to, or instead of, structure of this type in the sampling units, there may be extensive sampling over time. A further issue is homogeneity. A predictive spatial model which works well for areas where vegetation is sparse, perhaps inferring soil characteristics from satellite imaging data, is unlikely to work well for heavily forested areas.
Data structure affects the forms of graphical or other summary that are appropriate. It often has strong implications for obtaining realistic assessments of predictive accuracy, whether theoretically based or using cross-validation or using the training/test set methodology. In both cross-validation and in the use of the training/test set methodology, data structure and the intended use of any predictive model may be relevant to the division of the data between training and test set. Data structure has implications for data analysis and for efficient modelling.
Data sets which are relatively large and homogeneous, to the extent that it might be reasonable to use mainstream statistical techniques on the whole or a very large subset of the data, raise at least two types of issues for practical analysis. Algorithms that work well with data sets of modest size may fail or take an unreasonably long time to run in really large data sets. Inferential procedures, including cross-validation and the training/test set methodology, may suggest that estimates are more accurate than is really the case. There are at least two reasons for this: (1) various forms of dependence are almost inevitably present in any large data set and may be difficult to model adequately, and (2) the data are almost never a random sample from the context to which results will be applied. Point 1 has special relevance to mainstream model-based methods of assessment of predictive power. Point 2 emphasises the importance of validating any predictive model under the conditions of its intended use.
The collection of data together into large databases raises further issues. Such collections may and often should be the basis for data-based overview of a whole area of knowledge, allowing for much better and more informed use of research-based knowledge and more informed planning of future research. Evidence from such databases must be used critically, having regard to the widely differing quality of different types of evidence. These points are well illustrated from experience with medical databases.
Even where the initially collected data are of high evidential quality, distortions may be introduced by the processing of the data for publication. On the one hand, forms of summarisation which do not lose information in the data can make later use of the data much easier. On the other hand, cavalier summary analysis, including some statistical analyses which form the basis for results that are presented in journals, has the potential to introduce serious and unacceptable biases. Examples are given. Data quality issues take a variety of forms. The retention of crucial background information, including information on data structure and potential sources of bias, is a key issue for the collection of data together into databases.
Exploratory data analysis must, for large data sets, rely heavily on various forms of data summary. What forms of data summary are likely to be helpful, while losing minimum information from the data? It may often be reasonable to base analyses on one or more random samples of the data. Where data cleaning is a large chore, there may be a trade-off between time spent on data cleaning and time spent on analysis. It may then make sense to limit cleaning to a random sample of the data, allowing more time for analysis.
Keywords
Data mining, statistics, data analysis, knowledge discovery in databases, evidence-based medicine, Francis Bacon.
INTRODUCTION
Knowledge Discovery in Databases
"Data mining" and the allied term "Knowledge Discovery in Databases" (KDD) are in the tradition of "artificial intelligence", "expert systems", and other such terms which computer technology regularly spawns. Knowledge Discovery in Databases gives a better sense of the aims of the enterprise than the term "data mining". We have databases, often quite large databases, and we want information from them. The search may or may not follow a highly structured pattern. What matters is that knowledge comes out at the end.
Data mining is a brash metaphor that is designed to grab attention, much as the "voyage of discovery" image in Figure 1 was designed to grab attention. Bacon, like modern data miners, wanted to sell an idea. Bacon was the first great advocate of organised scientific research. Bacon wanted a research institute, The College of the Six Days' Works, created to systematically gather and systematise all knowledge. In Bacon's fictional New Atlantis, this becomes Salomon's House.
The end of our Foundation is the knowledge of Causes and secret motions of things, and the enlarging of the bounds of Human Empire, to the effecting of all things possible.
[Francis Bacon, following Warhaft 1965.]
Nowadays Bacon sounds remarkably like a proponent of Knowledge Discovery in Databases. Bacon set out a process of discovery that would be largely driven by data, in which theoretical insights would be kept in careful check. Bacon's vision was bold and imaginative, even if deficient from the vantage point that four subsequent centuries of scientific discoveries give us. He thought that the task would be completed relatively quickly. Such consensus as one can find in a hotly disputed area of debate suggests that Bacon gave too little weight to the role of theoretical insights in guiding the collection of data. He certainly underestimated the role of mathematics.
Definitions and Approach
I will define statistics as the science of collecting, analysing and presenting data. If this admittedly broad definition is accepted, then KDD is statistics and data mining is statistical analysis. KDD has a spin that comes from database methodology and from computing with large data sets, while statistics has an emphasis that comes from mathematical statistics, from computing with small data sets, and from practical statistical analysis with small data sets. People whose forte is practical data analysis are not as much to the fore as they should be in either community. The best results will come from a merging of the insights and skills of those who come from diverse intellectual traditions.
The points I make will be illustrated using quite small data sets, which are tractable for the purposes of this paper. Large data sets raise these same issues, and other issues besides. I will emphasise that large data sets often contain, for the purposes for which the data will be used, a relatively small number of independent items of information. This is important when considering whether the huge extent of a data set may allow the use of analysis methods which, for small or medium sized samples, would make very poor use of the data.
All analyses, whether those of mainstream statistics or those favoured by data miners, have as their intended outcome the reduction of a set of data to a small amount of readily assimilated information. The forms of summary may include graphs, or summary statistics, or equations that can be used for prediction, or a decision tree. Often it is helpful to carry out this process of summarisation in several steps. Where a large volume of data can without loss of information be reduced to a much smaller summary form, this can enormously aid the subsequent analysis task. It becomes much easier to make graphical and other checks that give the analyst assurance that predictive models or other analysis outcomes are meaningful and valid. Relevant graphical summaries are perhaps the most important tool at the data analyst's disposal. Experimentation with alternative forms of graphical representation becomes far more feasible once the data have been reduced to manageable size. Data structure is the key to data summarisation, and will be a large focus of this paper.
Types of databases
What sorts of databases are we talking about? Here are some examples.
- Large stores and supermarkets hold huge databases on customer purchases, initially collected for inventory and financial recording purposes. Interest may be in using information on customer purchasing patterns to increase sales.
- Insurance firms have huge databases of information on insurance claims, which can be used to adjust estimates of risk.
- There are, currently or in the process of formation, huge geological databases, museum databases, and databases intended to answer biodiversity and species distribution questions. One may wish to know where to find the mineral deposits. Databases that address biodiversity and species distribution questions are important for environmental management.
- Public medicine databases, e.g. databases on medical treatments and claims that are held by the Australian Health Insurance Commission.
- Huge databases of astronomical data (e.g. to find new interesting astronomical objects).
The Australian Health Insurance Commission (HIC) wants to identify patterns in the data which may help them identify fraud, inappropriate treatment or over-treatment, and trends which may lead to an escalation of medical costs. Medical practice variations will be of interest. Thus in the second half of the 1970s 4½ times as many women were getting hysterectomies in New England in the U.S.A. as in Norway (McPherson et al. 1982, McPherson 1990). The HIC will be interested in any comparable large discrepancies in the modern Australian context.
At Australian National University there has been work on huge astronomical data sets. The Massive Compact Halo Object (MACHO) database (Ng et al., 1998) has time series from about 20 million stars, collected from each star for a period of four years. This is being searched for evidence of massive compact halo objects.
In items 3 and 4, the data quality, data relevance, data collection and data analysis issues that I have highlighted are likely to be very troublesome.
DATA MINING
The Aims of Data Mining
Expositions of data mining suggest widely varying ideas of the aims of data mining. Elements which may be present include:
- Contrived serendipity, creating the conditions for fortuitous discovery.
- Exploratory data analysis with large data sets, in which the data are as far as possible allowed to speak for themselves, independently of subject area assumptions and of models which might explain their pattern. There is a particular focus on the search for unusual or interesting features.
- Specialised problems: fraud detection.
- The search for specific known patterns. Market basket analysis, and the search for massive compact halo objects in astronomical data, have this character.
- Standard statistical analysis problems, and especially discrimination, with large data sets. I regard hot spot analysis as a discrimination problem.
Popular ideas of data mining have in them a large element of what I have described as contrived serendipity. Contrived serendipity is not the same as exploratory data analysis. The eighteenth century writer Horace Walpole coined the word "serendipity". He had read a story about three princes of Serendip who were for ever making discoveries, by accident and by sagacity, of things that they were not looking for. Contrived serendipity is not a silly idea. The adage that "Fortune favours the prepared mind" is as true for data prospecting as for mineral prospecting or for research at the laboratory bench. Perhaps the best exposition of this point that I know is Beveridge's little book: "The Art of Scientific Discovery". Serendipity relies on a high element of human intervention.
My own story of serendipity concerns an experiment (Maindonald and Finch 1986) to determine whether trucks with mechanical suspension are kinder to apples than trucks with airbag suspension. The data had outliers which surprised us. Investigation revealed occasional unstable bins where there were huge levels of damage, dwarfing any effect from suspension. It was not at all the result we were looking for. In retrospect we ought to have looked carefully at the bins that we intended to use. A modicum of careful experimental design was crucial to this finding.
Highly automated analyses do not create the conditions for serendipity. We have not yet learned how to construct computers so that they have the "Aha" experience. Serendipity does not go well with the passion for automation.
Exploratory data analysis can be simple-minded, for example using normal probability plots to look for outliers in columns of data. Or it may be very sophisticated. Where careful data analysis has been something of an oddity, simple-minded exploratory techniques may yield easy results. The striking differences in hysterectomy rates between New England and Norway which McPherson et al. (1982) found would have been obvious even to a casual observer. In the study of comparative mortality rates in coronary artery bypass grafting that is reported in Chassin et al. (1996), comparisons were not totally straightforward. Adjustments for prior risk were needed to give figures that were genuinely comparable. Again, there were some large effects. Twenty-seven surgeons who each performed fewer than 50 operations had, for the part of the period 1989-1992 for which they practiced, an average risk-adjusted mortality rate of 11.9% as against a statewide average of 3.1%. In the last of the five years for which results are reported, data was from a total of 16,690 patients.
The search for particular kinds of interesting objects might, as Alan Welsh has suggested, be called "data prospecting". Prospectors have a clear idea of what is valuable and what is not. There's structure in their searching. An example is the search for interesting features in the MACHO data set (Ng et al., 1998). One needs a very sophisticated form of exploratory data analysis. Simple forms of searching will not reveal anything. Mackinnon and Glick (1999) suggest the term "data geologist".
Finally, there are classification and regression problems. Based on events that have been associated with earlier computer incursions (or break-ins), computer administrators want to be able to detect the tell-tale signs of any new incursion. Or a marketing firm may want to predict which potential addressees are likely to respond. In hot spot analysis the aim is to find addressees who are highly likely to respond.
Data Structure Issues
Here I will demonstrate a data set of modest size which reduces, for some predictive modelling purposes, to just six independent items of information. The same often happens when there are much larger datasets. Consider for example a hypothetical study of the use of fax machines in large organisations. It will be much easier to get extensive information from a small number of obliging organisations, than to get information that is widely representative of large organisations.
The data consists of 286 observations drawn from six sites at a variety of latitudes. For purposes of generalising to other comparable sites we have, effectively, six observations. The example, like many of the examples in books on data mining and machine learning, is a classification problem. It comes from a study of geographical variation in plant architecture (King and Maindonald 1999). Orthotropic species have steeply angled branches, with leaves coming off on all sides. Plagiotropic species have spreading branches. David King was interested in comparing the leaf dimensions of the two types of species.
It turns out that, at each of the locations, one can discriminate between orthotropic and plagiotropic, based on two quantities, the leaf width to length ratio, and the petiole or leaf stalk length. Figure 2 shows the plot of the data, together with discriminant line, for North Queensland. Different latitudes require different lines. The line moves up for locations that are closer to the equator, and down for locations that are further from the equator. The dotted line, on the same graph, shows my prediction for Wellington. The line has the same slope, only now it has shifted down. It is then fair to ask: "How accurate is the discrimination?"
For another sample of plants from the same six sites, the estimated accuracy of prediction is 87.4%. One gets this by cross-validation: here I divided the data up into ten parts, then left out each 10% of the data in turn and predicted the classification for the omitted 10% from the data that are left in.
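The ten-fold procedure just described can be sketched in a few lines of Python, using only the standard library. This is an illustration under stated assumptions, not the analysis behind the 87.4% figure: the nearest-class-mean classifier, the single feature and the toy data are all invented stand-ins for the discriminant analysis used in the study.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validated_accuracy(xs, ys, fit, predict, k=10, seed=0):
    """Leave out each fold in turn, fit on the rest, score on the omitted fold."""
    correct = 0
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        correct += sum(predict(model, xs[i]) == ys[i] for i in fold)
    return correct / len(xs)

# Hypothetical stand-in classifier: nearest class mean on one feature.
def fit_class_means(xs, ys):
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict_nearest_mean(model, x):
    return min(model, key=lambda label: abs(model[label] - x))

# Toy data, invented for illustration: two well separated classes.
xs = [0.1 * i for i in range(20)] + [5.0 + 0.1 * i for i in range(20)]
ys = ["orthotropic"] * 20 + ["plagiotropic"] * 20
accuracy = cross_validated_accuracy(xs, ys, fit_class_means, predict_nearest_mean)
print(f"estimated accuracy: {accuracy:.3f}")
```

Every observation is predicted exactly once, by a model that never saw it. The folds here are formed by shuffling individual observations, which is precisely the feature that makes the resulting estimate internal to the sampled sites.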
But Wellington is not one of the original six sites. I have assumed that the whole of the difference between the six sites is explained by latitude differences. That may not be true. There are a number of other variables which may be important: altitude, temperature, cloud cover, rainfall, and so on. With our present data these other variables are not needed to explain differences between the six sites. Nevertheless, in data from a wider range of sites, some of them may turn out to be important. To get a fairer estimate of the accuracy of prediction at some new site one needs to leave out sites one at a time, and use the remaining five sites for prediction. I had expected that the estimated accuracy of prediction would deteriorate badly. It turns out it does not change much, which may be just luck. Note also that with just six sites the estimate of prediction accuracy is itself inaccurate. Saying that we have 286 species sounds pretty good. Noting that we have only six sites spoils the story.
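Leaving out sites one at a time differs from ordinary cross-validation only in how the folds are formed: each fold is a whole site. A minimal sketch follows, in which the site labels and counts are hypothetical and not those of the study.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each distinct group."""
    for g in sorted(set(groups)):
        test = [i for i, gi in enumerate(groups) if gi == g]
        train = [i for i, gi in enumerate(groups) if gi != g]
        yield g, train, test

# Hypothetical site labels: 12 observations from 3 sites.
sites = ["A"] * 5 + ["B"] * 4 + ["C"] * 3
splits = list(leave_one_group_out(sites))
for site, train, test in splits:
    print(f"hold out site {site}: train on {len(train)}, test on {len(test)}")
```

The number of splits equals the number of sites, not the number of observations. With six sites there are only six held-out predictions of site-level performance, which is why the estimate of accuracy is itself inaccurate.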
We are still talking about estimates of accuracy that are internal to these data. The sites were not selected at random. If we had sites in abundance we might separate the data into two parts, using one set of sites to develop our model and another set to test it out. I would then be relatively well placed to comment on the accuracy of prediction of the discriminant line that I have drawn for Wellington.
If we had huge amounts of data we might use the training/test set idea. There are two possibilities. We can either split the data from each site into two parts, or we can assign a proportion of the sites to the training set and the remainder to the test set. Here is the first possibility.
| | Site 1 | Site 2 | ... | Site 60 |
| Training set | 60% | 60% | ... | 60% |
| Test set | 40% | 40% | ... | 40% |
The other possibility is:
| Training Set | Test Set |
| 30 Sites | Remaining Sites |
The training/test set idea safeguards against an inappropriate model, but only if the test set reflects the structure of the population to which predictions will be applied. Thus the second of these ways of dividing up the data is required, to be able to generalize sensibly beyond the sites that were used in the sample. Note however that we still have only an internal measure of accuracy, from purposively chosen sites. To get an external measure of accuracy we need to choose totally new sites and see how well the predictive model performs on them.
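The two ways of dividing the data can be made concrete as follows. This is a sketch only: the layout of 60 sites with 10 observations each is hypothetical, and the random assignment of sites to the training set is my assumption, since in practice the choice of test sites may be purposive.

```python
import random

def split_within_sites(site_of, train_fraction=0.6, seed=0):
    """First scheme: every site contributes to both training and test sets."""
    rng = random.Random(seed)
    by_site = {}
    for i, s in enumerate(site_of):
        by_site.setdefault(s, []).append(i)
    train, test = [], []
    for members in by_site.values():
        members = members[:]
        rng.shuffle(members)
        cut = round(train_fraction * len(members))
        train += members[:cut]
        test += members[cut:]
    return train, test

def split_by_site(site_of, n_train_sites, seed=0):
    """Second scheme: whole sites go either to training or to test."""
    rng = random.Random(seed)
    sites = sorted(set(site_of))
    train_sites = set(rng.sample(sites, n_train_sites))
    train = [i for i, s in enumerate(site_of) if s in train_sites]
    test = [i for i, s in enumerate(site_of) if s not in train_sites]
    return train, test

# Hypothetical layout: 60 sites with 10 observations each.
site_of = [s for s in range(60) for _ in range(10)]
within_train, within_test = split_within_sites(site_of)
site_train, site_test = split_by_site(site_of, n_train_sites=30)
```

Under the first scheme the test set contains no site that the model has not already seen, so it can only measure accuracy for further plants from those same sites. Under the second scheme no test observation shares a site with any training observation.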
In summary, there are several points which emerge. However much information had been gathered from the North Queensland site, it could never have told us about the effect of latitude. For purposes of giving information on the effects of latitude, data from the North Queensland site consisted of one item of information only. Even when we take all data together, the whole data set has only six independent items of information on the effects of latitude. No matter how much further information we might gather from these six sites, this limitation would remain. Even with six sites there are other variables whose effects we cannot assess. One such pertinent variable may be altitude.
Further Comment on the Training/Test Set Methodology
The idea of training set and test set is of enormous importance. So also is the structure of the training and test sets. The training data set is used to develop the model. The test set is used to check the accuracy of model predictions. We need to tell first year statistics students about training sets and test sets. Cross-validation may be seen as an extension of the training/test set idea. It uses all the data for prediction, while giving the same insight on the accuracy of predictions that is available from the training/test set methodology.
Detection of computer intrusions provides another example. Here we are clearly dealing with a moving target. Any accuracy rate that is based on the training sample will be optimistic. The intruders will have honed their methods, effectively changing the target population, by the time that we come to use the discriminant function. We might choose the training set to be data for the first six months, and the test set to be data for the next three months. Or one might split the original data into training and test set for purposes of developing the model, then use a validation set consisting of new data for testing the model. Validation should be a continuing process, as new data accumulate.
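A split by time, with six months of training data followed by three months of test data, might be sketched as below. The record layout (a date paired with a label) and the dates themselves are hypothetical.

```python
from datetime import date

def split_by_time(records, train_end, test_end):
    """Train on records dated before train_end; test on those in [train_end, test_end)."""
    train = [r for r in records if r[0] < train_end]
    test = [r for r in records if train_end <= r[0] < test_end]
    return train, test

# Hypothetical event log: one (date, label) record per month of 1999.
records = [(date(1999, m, 15), f"event-{m}") for m in range(1, 13)]
train, test = split_by_time(
    records, train_end=date(1999, 7, 1), test_end=date(1999, 10, 1)
)
print(len(train), "training records;", len(test), "test records")
```

Every test record is later than every training record, so the test set at least partly reflects the drift in intruder behaviour that a random split would hide. Records after the test window remain available as new data for continuing validation.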
The Limitations of Models
Statisticians like to build models and use the model both to make predictions and assess the accuracy of the predictions. Again I will use a small and tractable data set to illustrate points which have equal relevance to large data sets. I use it to highlight the common importance of the time dimension.
The data in Figure 3 are from a series of experiments which Michelson conducted in 1879 to measure the speed of light. There is clear correlation between the result from one run and the result from the next, but with an occasional sudden change. A consequence is that the estimates from two points that are close together in time are much more similar than the results from points that are well separated.
In order to make inferences for these data we require a model that allows for this serial correlation. For example, we may wish to ask whether there are systematic trends, up or down, within each experiment. Or is the apparent pattern a result of the serial correlation? Our argument may rely quite heavily on the model assumptions. There are several possible reasons for building a model that allows for the sequential correlation. One is to get an efficient estimate of the slopes. Another is to allow us to get reliable estimates of model accuracy. If in Figure 3 we had huge numbers of slope estimates we could for these purposes reduce or avoid reliance on modeling of the sequential correlation structure in each run. A third reason is that we may wish to understand why each result is so strongly influenced by immediately previous results. How far back in the sequence does this influence go?
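One simple way to see how serial correlation erodes the information in a sequence is to compute the lag-1 autocorrelation and convert it to an approximate count of independent values. The sketch below is illustrative only. The conversion formula assumes a first-order autoregressive (AR(1)) model and concerns estimation of a mean, which is my assumption and not a model fitted to the data of Figure 3, and the toy series is invented, not Michelson's data.

```python
def lag1_autocorrelation(values):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    num = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1))
    return num / denom

def effective_sample_size(n, rho):
    """Approximate number of independent values for estimating a mean,
    assuming an AR(1) process with lag-1 correlation rho."""
    return n * (1 - rho) / (1 + rho)

# Toy series, invented for illustration: a level that changes suddenly once.
series = [0.0] * 10 + [1.0] * 10
rho = lag1_autocorrelation(series)
n_eff = effective_sample_size(len(series), rho)
print(f"lag-1 autocorrelation {rho:.2f}; about {n_eff:.1f} independent values out of {len(series)}")
```

For a series of this kind, twenty observations carry little more information about the mean than one or two independent observations would.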
Much of the bulk of some of the largest databases is a result of intensive sampling over time. Mackinnon and Glick note that the Earth Observing System (EOS) satellites will generate around 0.33 × 10^15 bytes of data in a year. There must be attention to time dependence.
| | Small Data Sets | Medium Size | Large Data Sets |
| Need for efficient prediction? What model structure is desirable? | Strong. Linear terms. | Reduced. Smoothing terms can be fitted. | Much reduced. Decision trees may be acceptable. |
| How can we get internal estimates of accuracy? (i.e. prediction to a similarly drawn sample) | Estimates must be model-based, with strong assumptions. | Alternatives are: model-based estimates, or use resampling, or training/test sets. | Use training/test set, with random splitting of the data. |
| How can we get external estimates of accuracy? (i.e. prediction outside of sample population) | No good alternative to reliance on model-based assessments. | Use training/test set, with purposive choice of test set. | Use training/test set, with purposive choice of test set. |
Table 1. An assessment of how size of data set may affect model prediction and assessments of the accuracy of estimates.
Table 1 gives a broad assessment of the implications of size of data set for the sorts of reliance that we must or can afford to place on models. We are forced to make strong assumptions in order to get anything useful from small data sets. With data sets that are genuinely large, assumptions such as linearity that are inevitable in a small data set may unacceptably bias the more accurate predictions that are now possible. Our predictions are often reasonable, while our estimates of accuracy may very often be problematic. With very large data sets we can afford to use methods, of which the simpler types of decision tree methods seem the most popular, which do not take advantage of important features of much of the data on which they are used. We can at the same time afford to base our analysis on a part only of the data, keeping the rest for testing predictive accuracy. Estimates of accuracy obtained in this way are in general safer than model-based estimates of accuracy.
Data sets that are genuinely large, in the sense that they have a huge amount of replication at the level of what survey statisticians call the primary sampling unit, are much rarer than is commonly supposed. While the number of patients in the data on mortality from heart surgery that we considered earlier was of the order of 70,000, the number of surgeons would have been less than 200, still more than enough for a study of factors which may affect differences in mortality between one surgeon and another.
With data sets that are genuinely large, heterogeneity may become a problem. Satellite imaging data may be able to predict soil characteristics fairly well over homogeneous areas where there is little ground cover. Models developed for making predictions in this simple case will not generalise to handle areas covered by forest or scrub, or where there are marked changes in the landscape. To develop models that will be effective, one needs large amounts of data for each different type of terrain, widely sampled over that terrain. The task is further complicated because the pattern of spatial dependence will change from one type of terrain to another. Data sets that are large enough for developing predictive models that can cope effectively with these different types of heterogeneity are unusual.
The possible or probable extent of heterogeneity will vary from one type of study to another. The pattern of responses of human populations to medical treatment seems likely to remain broadly similar as one moves from one location to another. Television advertising seems to work in much the same way in Beijing as in Sydney. There is much less need to worry about the huge heterogeneities, both in the pattern of response and in the correlation structure, that one finds in spatial data.
Statisticians will do well to be sceptical about model-based and other internal estimates of error. Those from a data mining perspective need to understand the importance of data structure both for getting efficient estimates and for assessing predictive accuracy. Not every huge collection of numbers has large numbers of independent items of information, at the level of variation that is important. The computer intrusions example illustrates the frequent demand for an external check on accuracy.
Coping with Size
There are several alternatives:
- The data may be analysed as they stand.
- The data may be divided into homogeneous subsets, which are then analysed separately.
- Analysis may be based on summary measures, often substantially reducing the size of the data which require analysis.
- A sample may be taken from the data for analysis.
Attempts to model the total data as they stand may force the use of forms of analysis which do not take advantage of data structure. For the analysis of time series of observations on multiple stars, analyses that ignore the time structure of the data cannot give useful information. We want to be able to say something about stars, not about an event at a point in time. In other cases, there may be serious inefficiencies. Reduction of the data to manageable size allows effective exploratory data analysis, which is difficult or impossible when each graph requires minutes or hours of processing. It allows the insights that are available from classical forms of statistical analysis. This is particularly important when the structure of the random variation is dominated by a small number of what survey statisticians would call primary sampling units. If the number of primary sampling units is small, any approach that does not model this variation is unlikely to work well. It is usually well worth accepting some loss of information at lower levels of the hierarchy of variation, in return for accurate modelling of variation at this primary level. Automatic equipment can now make large numbers of repeated measurements on the same sample of rock, or on the same plant. Where the sample of rock or the plant is the observational unit, the determining of appropriate summary measures for each rock or plant may be a necessary step preliminary to further analysis.
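Reducing repeated measurements to one summary per observational unit is a small computation. The sketch below uses the mean and the count as the summary measures, which is an assumption made for illustration; the appropriate summary depends on the application, and the measurements shown are invented.

```python
from statistics import mean

def summarise_by_unit(measurements):
    """Collapse repeated (unit, value) measurements to one (mean, count) per unit."""
    grouped = {}
    for unit, value in measurements:
        grouped.setdefault(unit, []).append(value)
    return {unit: (mean(vals), len(vals)) for unit, vals in grouped.items()}

# Hypothetical data: repeated measurements on three rock samples.
measurements = [
    ("rock1", 2.0), ("rock1", 2.2), ("rock1", 2.1),
    ("rock2", 3.0), ("rock2", 3.4),
    ("rock3", 1.5),
]
summary = summarise_by_unit(measurements)
for unit, (avg, count) in summary.items():
    print(f"{unit}: mean {avg:.2f} from {count} measurements")
```

Six rows of raw data become three rows for analysis, one per rock. That count of three, not six, is the number of units available for comparisons between rocks.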
Finally, there may be merit in sampling from very large data sets, using the sample for exploratory analysis and perhaps even for the final analysis. This makes especial sense when data cleaning is a huge chore, and there is a trade-off between time spent on data cleaning and time spent on analysis. Restriction of the cleaning to a sample may give large time savings, allowing more time or resources for data analysis. Uthurusamy (in Fayyad 1997) comments that "It is better to prevent than process the data glut." This is where statistical input may be very important. Checks which remain possible when data are collated may be impossible later. Some aggregation of information at the collection stage may enormously ease the later processing task. Collecting information on every variable in sight is not a good idea, unless these variables are ranked for relevance and importance.
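Where the data are too large to hold in memory, a simple random sample can still be drawn in a single pass. The sketch below uses reservoir sampling, which is my choice of technique for illustration and is not a method proposed in the sources cited here; the stream of integers stands in for the records of a large database.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Return k items drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # inclusive bounds: 0 <= j <= i
            if j < k:
                sample[j] = item         # replace with probability k / (i + 1)
    return sample

# Hypothetical stream: record identifiers 0..99999, read one at a time.
subset = reservoir_sample(iter(range(100_000)), k=100)
print(len(subset), "records selected for cleaning and exploratory analysis")
```

Cleaning effort can then be concentrated on the selected records. If the data have the kind of site or time structure discussed earlier, it will usually be better to sample whole primary sampling units than individual records.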
It is however important not to lose key background information. One should not rely on forms of summary which are known to, or may, introduce serious distortions. An example will appear later. Background information that must be preserved includes:
- Time, place and context information. When was it collected, where was it collected, who collected it, and what was the rationale for its collection?
- Information that may be relevant to assessing possible biases in the data.
- Information that identifies major aspects of the structure of the data.
KDD CONTRASTED WITH STATISTICS
Statistics
Statistics, as I want to define it, is the science of collecting, organizing, analysing and presenting data. "Knowledge Discovery in Databases" is not much different. The components that seem needed are:
- Computing skills required to manage the data and the analysis.
- An understanding of design of data collection issues.
- An understanding of statistical inferential issues.
- A knowledge of relevant mathematics.
- Insights from practical data analysis.
- Application area insights.
- Automation of data analysis.
Different types of statisticians give the items on this list different weights. Data miners and machine learners put particular emphasis on items 1 and 7. They argue that there are too few statisticians available to handle the demand for data analysts. So we have to make data analysis automatic. If one can do this for large data sets, it ought to be possible to use the same tools for modest sized data sets. This is a revival of the dream of developing statistical expert systems. I do not expect data miners to achieve quick success where the statistical expert system developers failed. The experience with work on Artificial Intelligence, on which I comment below, is pertinent here.
Data mining and statistics have different intellectual traditions. Both tackle problems of data collection and analysis. Data mining has very recent origins. It is in the tradition of artificial intelligence, machine learning, management information systems and database methodology. It typically works with large data sets. Statistics has a much longer tradition. It has favoured probabilistic models, and has been accustomed to work with relatively small data sets. Both traditions use computing tools, but often different tools. Data mining may now be entering a less brash and more reflective phase of development, where it is more willing to draw from the statistical tradition of experience with data analysis. Efron's warning is apt:
"Statistics has been the most successful information science. Those who ignore statistics are condemned to re-invent it."
[Efron, quoted in Friedman 1997.]
Many statisticians of my generation went immediately from a training in mathematics to the practice of statistics. We learned statistical tools, slowly and going too frequently down blind alleys, as we went along. We now see a partial replay of this history, in the context of large data sets. Skills in the manipulation of large databases are necessary to do anything at all. It will take time to get widespread acknowledgement that the skills and tools needed to manipulate large data sets may not, on their own, be enough.
Unrealistic Expectations
In important respects, data mining is in the tradition of artificial intelligence. There is the same temptation to make outrageous promises. The term Artificial Intelligence seems now to have fallen into disrepute. Here are the words of one pioneer in the area, speaking in retrospect:
Alas for AI [Artificial Intelligence], the funding came screaming in with lots of strings attached and unrealistic expectations, and the results were pitifully few. Most of the applications didn't work for good reasons: they were hard problems and still are. It was essentially, in much of the AI community, hubris: arrogance about one's capabilities and potentials, which just failed. The systems did not do what they claimed. But remember, often it wasn't the scientists who were doing the claiming.
One impediment is the perception by much of the computer and management culture that making something work is primarily a matter of getting the right specifications and interpreting them, so to speak making a program that satisfies those specifications. In a very fundamental way that is just plain wrong.
[Selfridge, P. 1996.]
These comments have some relevance to data mining. It is still possible to make a living from selling the computing equivalent of snake-oil. Still today there are managers who are willing to believe in magic fixes, who are convinced that the trick is to get the specifications right.
We are still a long way from a viable automated approach to data analysis, whether for small or for large data sets. It is often a considerable effort for competent data analysis professionals to get data analysis software to provide an enlightened and meaningful analysis. Many aspects of the computations that can and should be automated are not. Or the desired output information may be almost impossible to get with current software. While there are serious limitations in the data analysis software that would be at the heart of any automated system, what hope is there for a mechanically driven process?
Different methodologies
Until recently the predominant commercial data mining tool for predictive modelling was one or other version of decision trees. There has been some use of logistic regression, some use of classical regression methods, and some use of neural nets. Broadly, data miners are likely to use one or other decision tree method as their tool of first recourse, while those from a statistical tradition use a broad range of tools which may or may not include decision trees.
Decision trees do not work well with small data sets. One reason is that they take only limited advantage of the ordering relations and continuity notions that are implicit for continuous variables. This loss of information may not be serious when there is so much data that one can afford to jettison much of the information that it contains. The advantage of decision trees is that they place little constraint on the pattern of relationship. Neural nets offer the same kind of flexibility that is available from a smorgasbord of mainstream statistical models. The user must make a choice from the huge range of nets that is available. At the same time, there is a very limited tradition of experience on which to draw when choosing between the nets that are on offer. Perhaps the best commentary is Ripley (1996). Most statisticians prefer, for the time being, to stay with tools where the choices are better understood and where the output can be expressed in the form of graphs and/or equations. (The graphs I have in mind are graphs that represent a functional form.)
Recent work by Jerome Friedman and others offers interesting new methodologies that build on a decision tree approach. At the very least tree-based methodologies provide a useful exploratory tool to use when starting investigation of data sets which, even allowing for whatever structure may be present, are large. They may quickly highlight major features of the data that are important for predictive modelling. They provide useful clues for subsequent more careful modelling using methods where the predictive model can be expressed in graphical and/or equation form.
DATABASES
The benefits and problems of databases
At best, the collection of data together into databases creates a resource which researchers can use in constructing accounts that use all the data. At worst, such databases may suffer from one or more of the following deficiencies: they may contain serious errors, there may be biases that arise from collection or from prior processing, key background information may be missing, information on key variables may be missing, and high quality information may be mixed with information of very poor quality, with few clues that will allow the researcher to distinguish between them. Here I provide examples that illustrate some of these points.
The experience of clinical medicine
There is now a large body of experience, in selected areas of research, with the data relevance and quality issues that arise in the use of evidence from databases. My concern here is to draw attention to the experience of clinical medicine. Data relevance and quality issues are important both for the collecting together of data, and for its analysis. Particularly relevant is the experience of the emerging tradition of "Evidence-Based Medicine" (EBM; see for example Sackett et al. 1997). This tradition, and clinical medicine generally, has had extensive experience in trying to pull together evidence from multiple sources. Much of the EBM activity goes on under an umbrella organisation called the "Cochrane Collaboration" (Sackett and Oxman 1994). All data analysts can learn from what EBM and the Cochrane Collaboration have made of Francis Bacon's ideas.
The Cochrane Collaboration exercises, with their emphasis on data-based overview of major medical issues, have strong connections with Knowledge Discovery in Databases. The best assessment of the evidence will come from a careful critical assessment that is based on all the data. Results in individual scientific papers or from one group of researchers can, taken in isolation, be quite misleading. Were this point better understood and acted on, there would be large implications for many areas of scientific research. There would be a strong focus on data-based critical overview. To contribute to such overviews, data which have formed the basis of published results would be archived. Knowledge discovery in databases is a great idea, but it will not be easy to set in place the mechanisms that will make it effective for its intended purpose.
The relatively small extent of the databases used for Cochrane Collaboration exercises makes no difference to the principles that apply. Such exercises make a virtue of the fact that, for example, numerous researchers in different parts of the world have done trials on the use of aspirin to ward off heart attacks. The different trials are in effect replications of a similar experiment, even though there has not been any large element of common planning. If a result can be replicated over half a dozen clinical trials, there is a good chance that it may be reproduced in clinical practice.
Evidence-based medicine is not primarily focused towards research, but towards ensuring that research results are assimilated into clinical practice. There are however huge implications for research. Some of the key insights are:
- Assimilating the evidence that may be spread across numerous papers is a non-trivial task. It requires teamwork, specialist skills, and statistical methodology for data overview.
- The most reliable evidence, and the only evidence that should be used if it is available, comes from well-conducted randomised controlled trials.
- Observational databases are not good sources of information on which treatments are effective and which are not. There are too many confounding factors. They may provide useful clues on what trials are worth conducting.
In clinical medicine there is a continuing debate between those who are sceptical of all non-experimental evidence, and those who consider that evidence from observational databases can be pitted against experimental evidence. Jorgensen and Gentleman (1998) give references; see also Maindonald (1999). Medical databases that show what treatments patients have received are not good sources of information on which are the optimum treatments. Claims to the contrary ignore a long history of unsuccessful efforts of this type. Except in cases where the differences are spectacular, observational databases have too many confounding factors.
There is a continuing debate over whether salt has a major role in causing hypertension (high blood pressure) in the populace at large. The present state of the debate was summarised in an article (Taubes 1998), called "The (Political) Science of Salt", that appeared last August in Science. Different observational studies give different answers. Randomised clinical trials indicate that any effect is very small, certainly not large enough to justify a huge expenditure of public funds on efforts to reduce salt consumption. I believe the clinical trials. Huge amounts of public money have been wasted because of reliance on data that were incapable of providing the answers that were sought. There has also been a reliance on animal models that were relevant, if at all, only to patients already suffering from hypertension. Specifically, one early researcher into the effects of salt had been able to breed a strain of salt-sensitive hypertensive rats. This was taken as evidence that it was bad for humans to eat salt.
Researchers as well as clinicians need the information that EBM tries to provide. Researchers need it so that they can get a good sense of what is already known, so that they can identify knowledge gaps, and so that they can avoid the mistakes of earlier workers. They also need it because published research gives a biased and incomplete coverage of the trials that have been undertaken, with negative results often sitting in a drawer unpublished. Trying to get around that problem is a task for a team of experts.
It should not be so hard. There is a need for an international register of clinical trials, and for the archiving of data from trials under arrangements that guarantee anonymity. The data are then available for later overview studies, or for re-analysis if there are suspicions about the initial analysis. Finally, there needs to be a high standard of reporting, so that anyone doing an overview study can easily verify, for example, whether the allocation of treatments was indeed randomised. This has led to a set of reporting standards, set out in the Consort statement (Begg et al. 1996).
Here however, my interest is in implications for the creation and use of data from databases. Care, scrutiny and critical evaluation are required at every step. Not all sources of evidence are of equal value. It is important to distinguish what is potentially misleading from what is soundly based.
There are proper and necessary roles for observational databases. Mackinnon and Glick (1999) refer to a New York Times report (Kolata 1997) that the US Food and Drug Administration wants a database to monitor about 200,000 reports per year of adverse drug reactions. Such a database would seem long overdue. Drug trials cannot investigate the whole range of circumstances that will occur in clinical practice. Data collection and data analysis will often be ongoing, rather than providing an authoritative analysis at a particular endpoint in time. Many commercial applications have this character. Fraud detection is an obvious example.
Data Distortions
Even where the initially collected data are of acceptable quality, distortions may be introduced by the processing of the data for publication. This leads to distortions when processed data, taken from published papers, is collected into databases.
Figure 4 presents another example, this time from work on killing insects in produce that may be intended for export. The graph presents data, and the results of analysis, that appeared in a paper (Jessup and Baheer 1990) in the Journal of Economic Entomology. A commonly used transformation, the probit, has been used on the vertical scale. There are two problems. (1) The authors have extrapolated well beyond the limits of the data. (2) The line does not fit the data.
Jessup and Baheer did at least present their data. Many of the authors who present results in the Journal of Economic Entomology do not. All they give is a line. So when one sees figures such as appear in Table 2 below, it is impossible to know whether they are comparable. The calculations that gave the New Caledonia value are mine (Sales et al. 1997), and used a complementary log-log model. The Queensland figure is from Heard et al. (1989), and assumed a probit model. My suspicion is that it is affected by a bias which is similar to, though less extreme than, that in Figure 4.
| | New Caledonia | Queensland |
| Third Instar of Queensland Fruit Fly | 8.4 min (95% CI: 7.7 - 9.3) | 11.6 min |
Table 2. Estimated times to 99% mortality, following immersion in hot water at 47°C.
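One reason why estimates from a probit model and from a complementary log-log model may not be comparable is that the two transformations stretch the upper tail of the mortality scale by different amounts, so a line fitted on one scale reaches the 99% point at a different place from a line fitted on the other. The sketch below simply evaluates the two standard transformations at a few mortality proportions. It uses no mortality data, and the function names are illustrative.

```python
import math
from statistics import NormalDist


def probit(p):
    """Probit transform: the standard normal quantile of p."""
    return NormalDist().inv_cdf(p)


def cloglog(p):
    """Complementary log-log transform: log(-log(1 - p))."""
    return math.log(-math.log(1.0 - p))


# Compare the two scales at 50%, 90% and 99% mortality.
for p in (0.50, 0.90, 0.99):
    print(f"p={p:.2f}  probit={probit(p):6.3f}  cloglog={cloglog(p):6.3f}")
```

Because the 99% point usually lies at or beyond the edge of the observed data, the choice of scale and the quality of the fit both affect the estimate.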
An obvious database construction exercise is to go through the Journal of Economic Entomology and pick out information, e.g. on 99% mortality points where they are available. It is not, in most cases, possible to go back to the original data. Clearly, given the unreliability of the analyses, the databases should be storing the original data, not the estimates of the 99% mortality point or other results from analyses. The bad news is that the original data are rarely likely to be available. Posterity would be better served if all those authors who have published in the Journal of Economic Entomology had drawn a curve through their data by eye. One may hope that the current passion for putting data together into databases will highlight the huge problems that our inattention to such matters is creating for the future use of the results of much current scientific work.
Environment Australia and others are putting a huge effort into collecting together data that will give information on species abundance and distribution. There are various kinds and qualities of data: data collected haphazardly for taxonomic purposes, data collected from carefully chosen sites, data that use statistical sampling approaches to assess biodiversity over a wide area, and predictions that are based on fitting models to data of any or all of the foregoing types. These mirror the clinical medical contrast between the different types of observational data and varying standards of clinical trials. Different types of data are not all of equal value or quality. Some data turn out to be totally useless for their claimed purpose. If garbage goes in, garbage is certain to come out.
Elder and Pregibon comment that:
The bad news is that often the available data is not representative of the population of interest and the worse news is that the data itself contains no hint that there is a potential bias present.
FINAL COMMENTS
Other Points
There are many other points of common interest between data mining and mainstream statistical analysis, points that one would cover in a course on statistical regression and classification modelling. Variable selection is as much or more an issue in data mining as in mainstream statistical analysis. Depending on how results are to be used, the confounding of effects of variables may be a serious problem for interpretation.
The Lore of the Data Miners
Here is a list of emphases that come through from the proponents of Data Mining and Knowledge Discovery in Databases. My responses, which are in italics, are a form of summary of the points that I have made earlier in the paper:
- Data are valuable. There are bound to be golden nuggets in the large mountains of data.
(i) The way the mountain was assembled is important. Small mountains, assembled with great care, are usually better than large mountains. Small may be beautiful, and serve the data analyst more effectively.
(ii) Data of uncertain quality may be a snare and a delusion, and may even become an excuse for avoiding getting the data that are needed to provide a reliable answer. The "nuggets of gold" analogy is misleading. Dirt is exactly what one needs to show off gold nuggets. By contrast, rubbishy data usually obscures the accurate and valuable data so effectively that it is impossible to know what to trust.
(iii) It is no use getting information about population A when what you really need is information about population B. One must ask whether data have a structure that will make it possible to generalise results to an intended wider population. Here experimental design and sampling design issues are crucially important.
- The cleaning of data is a major issue.
Cleaning of data is a major issue. Here the data mining literature is dead right. There is scope for trading off time spent in cleaning data against time spent in analysis.
- We need to get all the data together so that it can be used effectively. Hence a thrust towards networked databases in which national collections (e.g. from museums) will be available online.
Getting all the data together is a worthy enterprise. Statisticians have too often neglected it. However data from different sources, collected in different ways, may vary hugely in quality and relevance. Unless the data are collected in a way that identifies such distinctions, the collection may be useless or even misleading.
- Classical statistical methods do not scale up to these huge data sets.
Oftentimes the data should be scaled down.
On point 1, here is an experience that I have had from time to time. I tell a client that I could answer their question if they could provide such and such data. I am then told that someone did indeed collect such data several years ago, but the results were not published. The data are found and it turns out that the design of data collection was so bad that the data are useless. There are, it turns out, good reasons why the data were never published. Data relevance and data quality limit what can later be done with the data. This is why data collection is so important. It is the cornerstone for everything that comes later.
Acknowledgements
I am grateful to members of the Friday morning Canberra Applied Statistics group for helpful comments. Andreas Ruckstuhl read a draft and made a number of comments which led to substantial improvements. He is not of course responsible for what I have made of his comments.
References
Begg, C., Cho, M., Eastwood, S., Horton, R., Moher, D., Olkin, I., Pitkin, R., Rennie, D., Schulz, K. F., Simel, D., and Stroup, D. F. 1996. Improving the Quality of Reporting of Randomised Controlled Trials: the CONSORT Statement. Journal of the American Medical Association 276: 637-639.
Beveridge, W. I. B., 3rd. edition 1957. The Art of Scientific Discovery. Vintage Books, New York.
Chassin, M. R., Hannan, E. L. and DeBuono, B. A. 1996. Benefits and hazards of reporting medical outcomes publicly. New England Journal of Medicine 334: 394-398.
Elder, J. and Pregibon D. 1996. A statistical perspective on Knowledge Discovery in Databases. In Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P. and Uthurusamy, R.: Advances in Knowledge Discovery and Data Mining, pp. 83-113. AAAI Press/MIT Press, Cambridge, Massachusetts.
Fayyad, U. M., Piatetsky-Shapiro, G. and Smyth, P. 1996. From data mining to knowledge discovery: An overview. In Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P. and Uthurusamy, R.: Advances in Knowledge Discovery and Data Mining, pp. 1-34. AAAI Press/MIT Press, Cambridge, Massachusetts.
Friedman, J. H. 1997. Data Mining and Statistics. What's the Connection? Proc. of the 29th Symposium on the Interface: Computing Science and Statistics, May 1997, Houston, Texas.
Heard et al. 1991. Dose-mortality relationships for eggs and larvae of Bactrocera tryoni (Diptera: Tephritidae) immersed in hot water. Journal of Economic Entomology 84: 1768-1770.
Jessup, A. J. and Baheer, A. 1990. Low-temperature storage as a quarantine treatment for kiwifruit infested with Dacus tryoni (Diptera: Tephritidae). Journal of Economic Entomology 83: 2317-2319.
Jorgensen, M. and Gentleman, R. 1998. Data mining. Chance 11: 34-39 & 42.
King, D. A. and Maindonald, J. H. 1999. Tree architecture in relation to leaf dimensions and tree stature in temperate and tropical rain forests. Journal of Ecology, to appear.
Mackinnon, M. J. and Glick, N. 1999. Data mining and knowledge discovery in databases: an overview. Australian and New Zealand Journal of Statistics 41: 255-275.
Maindonald, J. H. 1999. New approaches to using scientific data: statistics, data mining and related technologies in research and research training. Occasional Paper 98/2, The Graduate School, Australian National University.
Maindonald, J. H. and Finch, G. R. 1986. Apple transport in wooden bins. New Zealand Journal of Technology 2: 171-177.
McPherson, K. 1990. Why do variations occur? In Anderson, T. F. and Mooney, G., eds.: The Challenges of Medical Practice Variations, pp.16-35. Macmillan Press, London.
McPherson, K., Strong, P. M., Jones, L. and Britton, B. J. 1982. Small area variations in the use of common surgical procedures: An international comparison of New England, England and Norway. New England Journal of Medicine 307: 1310-1314.
Ng, M. K., Huang, Z., and Hegland, M. 1998. Data-mining massive time series astronomical data sets - a case study. Second Pacific-Asia Conference on Knowledge Discovery in Data Bases, PAKDD98, 1998, pages 401-402.
Porter, R. 1997. The Greatest Benefit to Mankind. Harper Collins, London.
Ripley, B. D. 1996. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge.
Sackett, D. L., Richardson, W. S., Rosenberg, W. M. C. and Haynes, R. B. 1997. Evidence-Based Medicine. Churchill Livingstone, New York.
Sackett, D. L. and Oxman, A. D., eds. 1994. The Cochrane Collaboration Handbook. Cochrane Collaboration, Oxford.
Sales, F., Paulaud, D., and Maindonald, J. 1997. Comparison of eggs and larval stage mortality of three fruit fly species (Diptera: Tephritidae) after immersion in hot water. Pp. 247-250 in Allwood, A. J. and Drew, R. A. I., eds., Management of Fruit Flies in the Pacific. Australian Centre for International Agricultural Research, Canberra.
Selfridge, P. 1996. In from the start. IEEE Expert 11: 15-17 and 84-86.
Stigler, S. M. 1977. Do robust estimators work with real data. Annals of Statistics 54: 1075.
Taubes, G. 1998. The (Political) Science of Salt. Science 281: 898-907 (14 August).
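Statistics and Data Mining: Intersecting Disciplines
Abstract: Statistics and data mining have much in common, but they also differ in many ways. This paper discusses the nature of the two disciplines, with emphasis on their similarities and differences.
Keywords: statistics, knowledge discovery
1. Introduction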
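Statistics and data mining share a common aim: discovering structure in data. Indeed, because their aims are so similar, some people (especially statisticians) regard data mining as a branch of statistics. This is not a realistic view. Data mining also draws on ideas, tools and methods from other areas, especially computing disciplines such as database technology and machine learning, and some of the areas it is concerned with differ considerably from those that concern statisticians.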
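The overlap in the aims of statistics and data mining has naturally led to confusion. Indeed, it has sometimes led to antipathy. Statistics has an orthodox theoretical foundation (especially as developed over the course of this century), and now a new discipline has appeared, with new owners, which claims to solve problems that statisticians had regarded as belonging to their own domain. This is bound to cause concern, all the more so because the new discipline has an attractive name that is sure to arouse interest and curiosity. Compare the promise latent in the term "data mining" with "statistics", whose original meaning was "the statement of facts", together with the finding of meaningful information behind large masses of dry data. The modern meaning of statistics is, of course, very different. Moreover, the new discipline has a particular association with commerce (although it also has scientific and other applications).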
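The aim of this paper is to examine the nature of each of the two disciplines in turn, to distinguish their similarities and differences, and to draw attention to some of the difficulties associated with data mining. First, we note that "data mining" is not a new term to statisticians. Everitt, for example, defines it as simply examining a large number of data-driven models in order to find the one that fits best. Statisticians have therefore tended to avoid ad hoc analysis of data, because they know that an overly detailed search rarely reveals clear-cut structure. Nevertheless, large bodies of data may in fact contain unpredictable but valuable structure. It is precisely this that has attracted attention, and it is the task of data mining today.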
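2. The nature of statistics
There is no point in trying to give statistics an overly broad definition. Although it might be possible, it would provoke many objections. Instead, I will focus on the characteristics that distinguish statistics from data mining.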
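One difference relates to the point made in the last paragraph of the previous section: statistics is a rather conservative discipline, and there is currently a tendency towards ever greater rigour. This is not in itself a bad thing, of course; only through rigour can one avoid mistakes and discover the truth. Taken to excess, however, it is harmful. This conservative outlook stems from the view that statistics is a branch of mathematics, a view with which I disagree (see [15], [9], [14], [2], [3]). Although statistics is certainly based on mathematics (just as physics and engineering are based on mathematics without being regarded as branches of it), it also has close links with other disciplines.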
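The mathematical background and the pursuit of rigour reinforce a tendency to require proof before a method is adopted, rather than relying on experience as computer science and machine learning do. This means that researchers in other fields who are working on the same problems as statisticians sometimes propose a method that is obviously useful but that cannot be proved (or cannot yet be proved) to work. Statistical journals tend to publish methods that have been proved mathematically rather than ad hoc ones. Data mining, as a synthesis of several disciplines, has inherited an experimental attitude from machine learning. This does not mean that data miners do not value rigour; it simply means that a method that fails to produce results will be abandoned.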
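It is the statistical literature that displays (or exaggerates) the mathematical rigour of statistics. It also displays an emphasis on inference. Although some branches of statistics concentrate on description, a glance through statistical papers shows that the core problem of this literature is how to make inferences about a population from an observed sample. This is, of course, often a concern of data mining too. As noted below, one defining attribute of data mining is that it deals with large data sets. This means that, for reasons of feasibility, we often have only a sample, yet need to describe the large data set from which it was drawn. However, data mining problems often have the entire population available: data on all the employees of a company, for example, all the customer records in a database, or all of last year's transactions. In such cases inference has no value (for the mean of the year's transactions, say), because the observed value is the parameter being estimated. This means that the probabilistic statements on which a statistical model might be built (for example, that some parameter is close to 0 and should therefore be dropped from the model) become meaningless in data mining when data on the whole population are available. Here we can conveniently use a score function: a measure of how adequately a description fits the data. The fact that the concern is often with the adequacy of the model's fit rather than with its feasibility makes model search easy in many cases. For example, rule search can often exploit simple properties of goodness of fit (for instance, by applying branch and bound). These properties are lost once probabilistic statements are involved.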
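A third characteristic on which statistics and data mining partly overlap is the "model", which plays a central role in modern statistics. The term "model" may carry several different meanings. On the one hand, statistical models are based on analysing relationships between variables; on the other hand, some models are overall descriptions of the data that have no theoretical rationale. A regression model of credit card transactions might include income as an independent variable, because it is generally believed that higher income leads to larger transactions. This would be a theoretical model (albeit one based on a rather shaky theory). In contrast, one could simply carry out a stepwise search over a set of potentially explanatory variables and so obtain a model with great predictive value, even though no reasonable explanation can be given for it. (When a model is found through data mining, it is often the latter kind that is of interest.)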
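There are other ways of distinguishing statistical models, but I will not discuss them here; see [10]. What I want to emphasise here is that modern statistics is dominated by models. Computation and model selection criteria are secondary; the point is how to build a good model. In data mining this is not entirely so. In data mining, criteria play the central role. (There are, of course, isolated exceptions in statistics where the criterion is central. Gifi's school of nonlinear multivariate analysis is one of them. Gifi says, for example, that the book takes the view that, given some of the most common MVA (multivariate analysis) problems, one can start either from the model or from the technique. Classical multivariate statistical analysis, as described in its Section 1.1, starts from the model ... In many cases, however, the choice of model is far from obvious, choosing a suitable model is impossible, and the most appropriate computational method is not feasible. In such cases one starts from the other end, applying a set of techniques designed to answer the MVA questions and setting aside, for the time being, the choice of model and of optimality criterion.)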
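It is not surprising that criteria play a more central role in data mining than in statistics; the same is true of the disciplines from which data mining descends, such as computer science and related fields. The size of the data sets often means that traditional statistical criteria are unsuitable for data mining problems and have to be redesigned. In part, when data points are used one at a time to update estimates, adaptive and sequential criteria are often essential. Although some statistical criteria of this kind have been developed, far more of the work has been done in machine learning (as the word "learning" suggests).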
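Updating estimates one data point at a time can be illustrated with a running mean and variance. The sketch below uses Welford's method, chosen here for illustration and not named in the text; the class name and the four values are hypothetical.

```python
class RunningStats:
    """Mean and variance updated one observation at a time (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the running estimates."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; NaN when fewer than two observations have been seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")


stats = RunningStats()
for value in (4.0, 7.0, 13.0, 16.0):  # hypothetical stream of observations
    stats.update(value)
```

Each observation is seen once and then discarded, so the memory needed does not grow with the size of the data set.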
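In many cases the essence of data mining is the largely accidental discovery of unexpected but valuable information. This means that the data mining process is essentially exploratory. This differs from confirmatory analysis. (In fact, one can never completely confirm a theory; one can only provide evidence for or against it.) Confirmatory analysis focuses on model fitting: a proposed model is set up, and this model may not explain the observed data well. Much, perhaps most, statistical analysis is presented as confirmatory. Exploratory data analysis is nothing new to statistics, however; perhaps it is something that statisticians should regard as another cornerstone of their discipline, as it already is of data mining. All this is true, but the data sets encountered in data mining are huge by statistical standards. In such circumstances statistical tools may fail: a million chance factors may be enough to defeat them. ([11] contains examples.)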
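If the main purpose of data mining is discovery, then it is not concerned with those parts of statistics that deal with how best to collect data before answering a specific question, such as experimental design and survey design. Data mining essentially assumes that the data have already been collected, and is concerned only with how to uncover the secrets they contain.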
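3. The nature of data mining
Because the foundations of statistics were laid before the invention and development of the computer, the commonly used statistical tools include many methods that can be carried out by hand. To many statisticians, therefore, 1000 data points is already a large data set. But this "large" is a far cry from the 350,000,000 transactions a year handled by a large UK credit card company, or the 200,000,000 long-distance calls a day carried by AT&T. Clearly, with this much data, methods are needed that differ from those that "could in principle be carried out by hand". This means that the computer (the very thing that makes large data sets possible) is critical to the analysis and processing of the data. It is no longer feasible for the analyst to handle the data directly. Instead, the computer acts as a necessary filter between the analyst and the data. This is another reason why data mining places particular emphasis on criteria. Necessary though it is, separating the analyst from the data clearly brings associated problems. There is a real danger here: unexpected patterns may mislead the analyst, a point I discuss below.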
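I do not mean that the computer is not an important tool in modern statistics. It certainly is, and not because of the size of the data. Methods of data analysis such as the bootstrap, randomisation tests, iterative estimation and the fitting of fairly complex models are possible only because of the computer. Computers have greatly broadened the scope of traditional statistical models and have spurred the rapid development of new tools.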
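As an illustration of one of the computer-dependent methods mentioned above, the following is a minimal sketch of a percentile bootstrap confidence interval for a mean. The function name, its defaults and the sample values are hypothetical, and the method assumes that the observations are independent.

```python
import random
from statistics import mean


def bootstrap_ci(data, stat=mean, n_resamples=2000, alpha=0.05, seed=None):
    """Percentile bootstrap confidence interval for a statistic of the data."""
    rng = random.Random(seed)
    n = len(data)
    # Recompute the statistic on resamples drawn with replacement.
    estimates = sorted(
        stat(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical sample, for illustration only.
sample = [8.1, 7.9, 9.3, 8.4, 7.7, 8.8, 9.0, 8.2, 8.6, 7.5]
low, high = bootstrap_ci(sample, seed=42)
```

Where the data have dependence or grouping structure, resampling individual observations in this way will tend to understate the uncertainty.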
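Consider next the possibility that unexpected patterns arise from distorted data. This is a question of data quality. The conclusions of any data analysis depend on the quality of the data. GIGO stands for "garbage in, garbage out", and it is quoted everywhere. A data analyst, however clever, cannot find gems in garbage. With large data sets the problem is especially acute, particularly when the aim is to find subtle, small-scale or unusual patterns. When one is looking for a one-in-a-million pattern, a deviation in the second decimal place matters. An experienced analyst will be alert to the most common problems of this kind, but there are simply too many ways for things to go wrong.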
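Such problems can arise at two levels. The first is the micro level, that of individual records. For example, particular attributes may be missing or wrongly entered. I know of one case in which, unknown to the miner, missing values had been recorded as 99 and were treated as real data. The second is the macro level, where the whole data set is distorted by some selection mechanism. Traffic accidents provide a good example. The more serious and fatal the accident, the more accurate the record, whereas minor accidents or those without injury are recorded less accurately. In fact, a high proportion of them are not recorded at all. This produces a distorted picture, which may lead to wrong conclusions.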
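The effect of treating a missing-value code as real data can be seen in a small example. The ages below are hypothetical, and the helper names are illustrative; the code 99 is the one mentioned in the text.

```python
from statistics import mean

MISSING_CODE = 99  # sentinel from the example in the text; other data sets use other codes


def clean(values, missing_code=MISSING_CODE):
    """Replace the missing-value sentinel with None so it is not mistaken for data."""
    return [None if v == missing_code else v for v in values]


def mean_of_observed(values):
    """Mean over the entries that were actually observed."""
    observed = [v for v in values if v is not None]
    return mean(observed)


# Hypothetical ages, with two missing entries coded as 99.
raw_ages = [34, 41, 99, 29, 99, 38]
naive_mean = mean(raw_ages)                       # treats 99 as a real age
cleaned_mean = mean_of_observed(clean(raw_ages))  # ignores the coded entries
```

A genuine value of 99 would also be discarded by this rule, which is one reason why the coding conventions need to be recorded with the data.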
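Statistics is seldom concerned with real-time analysis, yet data mining problems often require it. Bank transactions, for example, take place every day, and no one can wait three months for an analysis of a possible fraud. A similar problem arises when the population changes over time. My research group has clear examples showing that applications for bank loans change with time, the competitive environment and economic fluctuations.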
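So far we have discussed problems of data analysis and described the differences between data mining and statistics, despite a certain overlap. But data miners should not take an entirely non-statistical view either. Consider first an example: the problem of getting at the data. Statisticians tend to think of data as a flat table, cross-classified by variables, stored on a computer and waiting to be analysed. If the data set is small it can be read into memory, but in many data mining problems this is impossible. Worse still, large quantities of data are often spread across different computers. At the extreme, the data may be distributed across the global Internet. Problems of this kind make it unlikely that a simple sample can be obtained. (Leaving aside the possibility of analysing "the whole data set", the very concept may not exist if the data are constantly changing, as with telephone calls.)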
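When describing data mining techniques, I find it convenient to distinguish two broad classes of tool according to whether the aim is model building or pattern discovery. I have already mentioned the central role of the model concept in statistics. In building a model, one tries to summarise all of the data and to identify and describe the shape of the distribution. Examples of such "global" models are a cluster analysis of a set of data, a regression model for prediction, and a tree-based classification rule. In pattern discovery, by contrast, one tries to identify small (but not necessarily unimportant) departures and to detect unusual patterns of behaviour. Examples are sporadic waveforms in EEG traces, unusual spending patterns in credit card use, and objects whose characteristics differ from those of the rest. In many ways this second kind of exercise is the essence of data mining: the attempt to find nuggets among the dross. The first kind of exercise is important too, however. When the aim is to build a global model, sampling is acceptable: important features can be found from a sample of a hundred thousand just as well as from one of ten million, although this depends in part on the characteristics of the model we have in mind. Pattern discovery is different. Selecting only a sample may miss the very cases one hopes to detect.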
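The risk that a sample misses the cases of interest can be put in rough numbers. The sketch below assumes a simple random sample with draws treated as independent, and uses a one-in-a-million pattern with a sample of a hundred thousand purely as an illustration; the function name is hypothetical.

```python
def prob_sample_misses(rate, sample_size):
    """Probability that a simple random sample contains no instance of a
    pattern occurring at the given rate, treating draws as independent."""
    return (1.0 - rate) ** sample_size


# A one-in-a-million pattern and a sample of 100,000 records.
p_miss = prob_sample_misses(rate=1e-6, sample_size=100_000)
```

Under these assumptions the sample contains no instance of the pattern roughly nine times in ten, although the same sample would be ample for estimating the overall shape of a distribution.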
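Although statistics is mainly concerned with the analysis of quantitative data, the diverse origins of data mining mean that other forms of data also have to be handled. In particular, logical data are increasingly common, for example when the patterns to be discovered are made up of conjunctive and disjunctive elements. Similarly, highly ordered structures are sometimes encountered. The elements to be analysed may be images, text, speech signals, or even (as in meta-analysis) entire scientific research studies.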
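4. Discussion
Data mining is sometimes seen as a one-off exercise. This is a misconception. It should rather be seen as an ongoing process (even though the data set is fixed). Examining the data from one angle yields results that can be interpreted; examining them from a related viewpoint may bring one closer, and so on. The key point is that, except in very rare cases, one seldom knows in advance which kind of pattern is meaningful. The essence of data mining is the discovery of unexpected patterns, and unexpected patterns have to be found by unexpected means.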
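Linked to the view of data mining as a process is the recognition of the novelty of its results. Many data mining results are, in retrospect, what we would have expected. The fact that they can be explained, however, does not negate the value of having mined them. Without the exercise, they might never have been thought of. Indeed, only structures for which a reasonable explanation can be formed in the light of past experience are likely to be of value.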
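Data mining clearly offers potential opportunities. The possibility of finding patterns in large data sets certainly exists, and the number of large data sets grows daily. This should not, however, be allowed to hide the dangers. All real data sets (even those collected in a fully automatic way) are liable to contain errors. Data sets about people (transaction and behavioural data, for example) are especially prone to them. This may well explain why the great majority of the "unexpected structures" found in data are essentially meaningless, arising merely from departures from the ideal process. (Such structures may of course be of interest: if the data have problems that could interfere with the purpose for which they were collected, it is as well to know about them.) Related to this is the question of how to ensure (or at least to provide support for the claim) that any observed pattern is "real", that is, that it reflects some underlying structure or relationship rather than a peculiarity of a particular data set that has arisen by chance in a random sample. Here scoring methods may be relevant, but more research by statisticians and data miners is needed.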
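The science of data mining is in its infancy. Fayyad et al. did important foundational work [6], and the current range of research can be seen from the topics and areas covered by the proceedings of the international conference series on Knowledge Discovery and Data Mining and by the journal Data Mining and Knowledge Discovery (the two most important proceedings are [12] and [11]). Papers on statistics and data analysis include [8], [4] and [10].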