There and Back Again: Outlier Detection Between Statistical Reasoning And Data Mining Algorithms

Sep 13, 2018

Data mining and statistics: the roots and development paths of statistical outlier detection and of database-related data mining methods for outlier detection.

Credit card companies observe the financial transactions of their customers in order to be able to alert the customer or deny a transaction if it looks strange. Scientists working with measurements from lab experiments or sensor data in the wild get alerted if some measurements are considerably and unexpectedly different from the previous observations. Analysis of sports statistics can lead to the discovery of suspicious activities. Administrators get notified about unusual behavior on their web server that could indicate technical problems or malicious attacks.

All of these examples relate technically to the detection of so-called outliers or anomalies: observations that do not fit well with the remainder of the given observations. In light of the common metaphor that data mining is like mining for nuggets of information, outlier detection is not merely about removing noise but also about finding interesting database objects whose behavior deviates considerably from the majority and which, as such, provide new insights. Indeed, the two aspects of outlier detection are like two sides of a coin, as one person’s noise may be another person’s signal. The scenarios above illustrate both kinds of interest in outliers: measurement errors in scientific data should possibly just be removed, whereas a case of credit card abuse is the only interesting fact among a wealth of ‘just usual’ data (which, in turn, could of course be interesting itself, e.g., for modeling a customer’s interests and behavior once the outliers have been removed).
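To make the idea of an observation that "does not fit the remainder" concrete, here is a minimal sketch of one simple statistical rule: flag values whose distance from the median is large relative to the median absolute deviation (MAD). This is only an illustration and is not taken from the article. The function name `mad_outliers`, the sample readings, the cutoff of 3.5, and the scaling constant 0.6745 are assumptions chosen for this example.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`.

    Hypothetical helper for illustration; the threshold is an assumed
    rule-of-thumb cutoff, not a recommendation from the article.
    """
    center = median(values)
    # Median absolute deviation: a spread estimate that is itself
    # barely affected by the outliers we are trying to find.
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # No measurable spread; this sketch then flags nothing.
        return []
    # 0.6745 rescales the MAD so that the score is comparable to a
    # standard z-score when the bulk of the data is roughly normal.
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]


# Made-up sensor readings with one value far from the rest.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0]
print(mad_outliers(readings))
```

Whether the flagged reading is noise to discard or the one interesting fact in the data is exactly the question of interpretation raised above; the rule itself cannot decide that.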

Outlier detection is a field that has been studied in statistics and in data mining. While data mining techniques are of course based on or motivated by statistical reasoning, the development of techniques in the scientific data mining literature became detached from the statistical intuition, as the interest in data mining is the algorithmic handling of “big data” and the focus is often more on efficiency. Likewise, while statisticians nowadays also develop algorithms and programs to analyze data automatically, the algorithmic developments in data mining have not often been considered in the statistical literature, as the two communities do not strongly overlap.

In their WIREs Data Mining and Knowledge Discovery article, ‘There and Back Again: Outlier Detection Between Statistical Reasoning and Data Mining Algorithms,’ authors Arthur Zimek and Peter Filzmoser bridge the gap between the data mining literature and the statistics literature, relating concepts to each other, and discussing what it means to get an ‘outlier’ alert from some method.

 

Kindly contributed by the Authors.
