6 Clustering
Learn how to discover natural groupings in the data through Clustering, an unsupervised mining function.
6.1 About Clustering
Clustering analysis finds clusters of data objects that are similar to one another. The members of a cluster are more like each other than they are like members of other clusters. Different clusters can have members in common. The goal of clustering analysis is to find high-quality clusters such that the inter-cluster similarity is low and the intra-cluster similarity is high.
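The quality goal described above can be illustrated with a small sketch. It assumes numeric cases and uses Euclidean distance as an inverse proxy for similarity, so a good clustering shows a small average distance within clusters and a large average distance between clusters. The function names and the toy data are hypothetical and are not part of Oracle Data Mining.

```python
from itertools import combinations
from math import dist


def mean_pairwise_distance(pairs):
    """Average Euclidean distance over a list of (point, point) pairs."""
    distances = [dist(a, b) for a, b in pairs]
    return sum(distances) / len(distances)


def cluster_quality(clusters):
    """Return (mean intra-cluster distance, mean inter-cluster distance).

    `clusters` is a list of clusters, each a list of numeric tuples.
    """
    # Pairs of cases that share a cluster.
    intra_pairs = [pair for cluster in clusters for pair in combinations(cluster, 2)]
    # Pairs of cases drawn from two different clusters.
    inter_pairs = [
        (a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    ]
    return mean_pairwise_distance(intra_pairs), mean_pairwise_distance(inter_pairs)


# Hypothetical toy data: two well-separated groups in two dimensions.
clusters = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
]
intra, inter = cluster_quality(clusters)
```

For this toy data the intra-cluster average is far smaller than the inter-cluster average, which is the pattern a high-quality clustering is expected to show.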
Clustering, like classification, is used to segment the data. Unlike classification, clustering models segment data into groups that were not previously defined. Classification models segment data by assigning it to previously-defined classes, which are specified in a target. Clustering models do not use a target.
Clustering is useful for exploring data. You can use Clustering algorithms to find natural groupings when there are many cases and no obvious groupings.
Clustering can serve as a useful data-preprocessing step to identify homogeneous groups on which you can build supervised models.
You can also use Clustering for Anomaly Detection. After you segment the data into clusters, you might find that some cases do not fit well into any cluster. These cases are anomalies or outliers.
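One simple way to picture this idea, assuming a distance-based model whose centroids are already known, is to flag every case that lies farther than a chosen threshold from its nearest centroid. The function name, the threshold, and the data below are illustrative assumptions; this is not how Oracle Data Mining identifies anomalies internally.

```python
from math import dist


def find_outliers(cases, centroids, threshold):
    """Return the cases whose distance to the nearest centroid exceeds threshold."""
    outliers = []
    for case in cases:
        nearest = min(dist(case, centroid) for centroid in centroids)
        if nearest > threshold:
            outliers.append(case)
    return outliers


# Hypothetical centroids and cases; the threshold of 2.0 is an arbitrary choice.
centroids = [(0.0, 0.0), (10.0, 10.0)]
cases = [(0.5, 0.2), (9.6, 10.3), (5.0, 5.0), (0.1, -0.4)]
outliers = find_outliers(cases, centroids, threshold=2.0)
```

Here only the case at (5.0, 5.0) is flagged, because it sits roughly midway between the two centroids and fits neither cluster well.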
6.1.1 How are Clusters Computed?
There are several different approaches to the computation of clusters. Oracle Data Mining supports the following methods:
- Density-based: This type of clustering finds the underlying distribution of the data and estimates how areas of high density in the data correspond to peaks in the distribution. High-density areas are interpreted as clusters. Density-based cluster estimation is probabilistic.
- Distance-based: This type of clustering uses a distance metric to determine similarity between data objects. The distance metric measures the distance between actual cases in the cluster and the prototypical case for the cluster. The prototypical case is known as the centroid.
- Grid-based: This type of clustering divides the input space into hyper-rectangular cells and identifies adjacent high-density cells to form clusters.
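The distance-based approach can be sketched with a textbook k-means iteration: assign each case to its nearest centroid, recompute each centroid as the mean of its cases, and repeat until the centroids stop changing. This is a minimal illustration that assumes Euclidean distance and caller-supplied starting centroids. It is not the enhanced k-Means algorithm that Oracle Data Mining implements, and all names in it are hypothetical.

```python
from math import dist


def assign(points, centroids):
    """Group points by the index of their nearest centroid."""
    groups = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        groups[nearest].append(p)
    return groups


def mean_point(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, centroids, max_iterations=100):
    """Assign cases, recompute centroids, and repeat until the centroids are stable."""
    for _ in range(max_iterations):
        groups = assign(points, centroids)
        # Keep the previous centroid when no case was assigned to it.
        updated = [mean_point(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


# Hypothetical toy data and starting centroids.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, groups = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

Each returned centroid is the prototypical case of its cluster: the component-wise mean of the cases assigned to it.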
6.1.2 Scoring New Data
Although clustering is an unsupervised mining function, Oracle Data Mining supports the scoring operation for clustering. New data is scored probabilistically.
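To illustrate what probabilistic scoring means, the following sketch assigns a new case one probability per cluster instead of a single hard label. The weighting scheme, in which a cluster's weight decays exponentially with the squared distance to its centroid, is an assumption made for this example; the source does not describe the formula that Oracle Data Mining uses, and the function name is hypothetical.

```python
from math import dist, exp


def score(case, centroids):
    """Return one probability per cluster for a new case; the values sum to 1."""
    # Assumption: a cluster's weight decays as exp(-squared distance to its centroid).
    squared = [dist(case, centroid) ** 2 for centroid in centroids]
    smallest = min(squared)
    # Subtract the smallest squared distance before exponentiating for numerical stability.
    weights = [exp(-(s - smallest)) for s in squared]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical centroids and a new case to score.
centroids = [(0.0, 0.0), (4.0, 0.0)]
probabilities = score((1.0, 0.0), centroids)
best_cluster = max(range(len(centroids)), key=lambda i: probabilities[i])
```

A case that lies exactly halfway between two centroids receives equal probabilities under this scheme, which a hard assignment cannot express.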
6.2 Evaluating a Clustering Model
Since known classes are not used in clustering, the interpretation of clusters can present difficulties. How do you know if the clusters can reliably be used for business decision making?
Oracle Data Mining clustering models support a high degree of model transparency. You can evaluate the model by examining information generated by the clustering algorithm: for example, the centroid of a distance-based cluster. Moreover, because the clustering process is hierarchical, you can evaluate the rules and other information related to each cluster's position in the hierarchy.
6.3 Clustering Algorithms
Learn about the Clustering algorithms that Oracle Data Mining supports.
Oracle Data Mining supports these Clustering algorithms:
- Expectation Maximization
  Expectation Maximization is a probabilistic, density-estimation Clustering algorithm.
- k-Means
  k-Means is a distance-based Clustering algorithm. Oracle Data Mining supports an enhanced version of k-Means.
- Orthogonal Partitioning Clustering (O-Cluster)
  O-Cluster is a proprietary, grid-based Clustering algorithm.
See Also:
Campos, M.M., Milenova, B.L., "O-Cluster: Scalable Clustering of Large High Dimensional Data Sets", Oracle Data Mining Technologies, 10 Van De Graaff Drive, Burlington, MA 01803.
The main characteristics of the three algorithms are compared in the following table.
Table 6-1 Clustering Algorithms Compared
Feature | k-Means | O-Cluster | Expectation Maximization |
---|---|---|---|
Clustering methodology | Distance-based | Grid-based | Distribution-based |
Number of cases | Handles data sets of any size | More appropriate for data sets that have more than 500 cases. Handles large tables through active sampling | Handles data sets of any size |
Number of attributes | More appropriate for data sets with a low number of attributes | More appropriate for data sets with a high number of attributes | Appropriate for data sets with many or few attributes |
Number of clusters | User-specified | Automatically determined | Automatically determined |
Hierarchical clustering | Yes | Yes | Yes |
Probabilistic cluster assignment | Yes | Yes | Yes |
Note:
Oracle Data Mining uses k-Means as the default Clustering algorithm.