Oracle® Data Mining Concepts
11g Release 2 (11.2)

Part Number E16808-06

13 k-Means

This chapter describes the enhanced k-Means clustering algorithm supported by Oracle Data Mining.

See Also:

Chapter 7, "Clustering"

This chapter includes the following topics:

  • About k-Means

  • Tuning the k-Means Algorithm

  • Data Preparation for k-Means

About k-Means

The k-Means algorithm is a distance-based clustering algorithm that partitions the data into a specified number of clusters.

Distance-based algorithms rely on a distance function to measure the similarity between cases. Cases are assigned to the nearest cluster according to the distance function used.
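
As an illustration of nearest-cluster assignment, the following minimal Python sketch assigns a case to the closest centroid under two distance functions. The function names and the sample centroids are hypothetical and chosen only for this example; this is not the Oracle Data Mining implementation.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_cluster(case, centroids, distance=euclidean_distance):
    """Return the index of the centroid closest to the case."""
    return min(range(len(centroids)), key=lambda i: distance(case, centroids[i]))

# Hypothetical centroids for two clusters.
centroids = [(1.0, 1.0), (10.0, 2.0)]

print(nearest_cluster((2.0, 3.0), centroids))
print(nearest_cluster((9.0, 9.0), centroids))
print(nearest_cluster((9.0, 9.0), centroids, distance=cosine_distance))
```

The same case can land in different clusters depending on the distance function: Euclidean distance compares positions, while cosine distance compares only direction.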

Oracle Data Mining Enhanced k-Means

Oracle Data Mining implements an enhanced version of the k-Means algorithm with the following features:

  • Distance function: The algorithm supports Euclidean, Cosine, and Fast Cosine distance functions. The default is Euclidean.

  • Hierarchical model build: The algorithm builds a model in a top-down hierarchical manner, using binary splits and a refinement of all nodes at the end. In this sense, the algorithm is similar to the bisecting k-Means algorithm. The centroids of the inner nodes in the hierarchy are updated to reflect changes as the tree evolves. The whole tree is returned.

  • Tree growth: The algorithm uses a specified split criterion to grow the tree one node at a time until a specified maximum number of clusters is reached, or until the number of clusters equals the number of distinct cases. The split criterion may be the variance or the cluster size. By default, the split criterion is the variance.

  • Cluster properties: For each cluster, the algorithm returns the centroid, a histogram for each attribute, and a rule describing the hyperbox that encloses the majority of the data assigned to the cluster. The centroid reports the mode for categorical attributes and the mean and variance for numerical attributes.
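
The top-down growth described above can be sketched as a plain bisecting k-Means loop in Python. This is a simplified illustration under stated assumptions, not the Oracle Data Mining algorithm: it keeps only the leaf clusters rather than the whole tree, omits the final refinement of all nodes, measures "variance" as the total squared distance to the cluster mean, and seeds each binary split deterministically. All function names are hypothetical.

```python
def mean_point(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dispersion(points):
    """Assumed split criterion: total squared distance to the cluster mean."""
    center = mean_point(points)
    return sum(sq_dist(p, center) for p in points)

def two_means(points, iterations=10):
    """Binary split with 2-means, seeded by the first point and the point farthest from it."""
    centers = [points[0], max(points, key=lambda p: sq_dist(p, points[0]))]
    groups = ([], [])
    for _ in range(iterations):
        groups = ([], [])
        for p in points:
            idx = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            groups[idx].append(p)
        if not groups[0] or not groups[1]:
            break
        centers = [mean_point(groups[0]), mean_point(groups[1])]
    return groups

def bisecting_kmeans(points, max_clusters):
    """Grow leaf clusters one split at a time, always splitting the most dispersed leaf."""
    leaves = [list(points)]
    while len(leaves) < max_clusters:
        # A leaf with a single distinct case cannot be split further.
        splittable = [leaf for leaf in leaves if len(set(leaf)) > 1]
        if not splittable:
            break
        target = max(splittable, key=dispersion)
        leaves.remove(target)
        left, right = two_means(target)
        leaves.extend([left, right])
    return leaves

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0),
          (9.0, 8.5), (8.5, 9.0), (1.0, 9.0), (1.5, 8.5)]
for cluster in bisecting_kmeans(points, max_clusters=3):
    print(mean_point(cluster), cluster)
```

Growth stops either at the requested number of clusters or when no leaf holds more than one distinct case, which mirrors the two stopping conditions in the list above.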

This approach to k-Means avoids the need for building multiple k-Means models and provides clustering results that are consistently superior to those of traditional k-Means.

Centroid

The centroid represents the most typical case in a cluster. For example, in a data set of customer ages and incomes, the centroid of each cluster would be a customer of average age and average income in that cluster. The centroid is a prototype. It does not necessarily describe any given case assigned to the cluster.

The attribute values for the centroid are the mean of the numerical attributes and the mode of the categorical attributes.
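
A minimal Python sketch of this computation follows. The attribute names and sample customers are hypothetical, and cases are represented as dictionaries purely for illustration.

```python
from collections import Counter
from statistics import mean

def centroid(cases, numerical, categorical):
    """Mean of each numerical attribute and mode of each categorical attribute."""
    result = {}
    for name in numerical:
        result[name] = mean(case[name] for case in cases)
    for name in categorical:
        counts = Counter(case[name] for case in cases)
        result[name] = counts.most_common(1)[0][0]
    return result

# Hypothetical customers assigned to one cluster.
cluster = [
    {"age": 30, "income": 60000, "region": "west"},
    {"age": 40, "income": 40000, "region": "east"},
    {"age": 50, "income": 50000, "region": "west"},
]

print(centroid(cluster, numerical=["age", "income"], categorical=["region"]))
```

In this sample the centroid combines age 40, income 50000, and region "west", a combination that matches none of the three customers, which shows why the centroid is a prototype rather than an actual case.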

Scoring

The clusters discovered by k-Means are used to generate a Bayesian probability model that can be used to score new data.
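
The form of the probability model is not described here. As a rough illustration of the general idea only, the following Python sketch applies Bayes' rule to numerical attributes, assuming that each cluster has a prior proportional to its size and an independent Gaussian distribution per attribute. These modeling choices, the function names, and the sample values are all assumptions made for this example, not the Oracle Data Mining scoring model.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def cluster_probabilities(case, clusters):
    """Posterior probability of each cluster for a case, by Bayes' rule."""
    joint = []
    for cluster in clusters:
        likelihood = 1.0
        for x, m, v in zip(case, cluster["means"], cluster["variances"]):
            likelihood *= gaussian_pdf(x, m, v)  # assumes independent attributes
        joint.append(cluster["prior"] * likelihood)
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical cluster summaries on normalized attributes.
clusters = [
    {"prior": 0.6, "means": (0.2, 0.3), "variances": (0.01, 0.02)},
    {"prior": 0.4, "means": (0.8, 0.7), "variances": (0.01, 0.02)},
]

print(cluster_probabilities((0.25, 0.35), clusters))
```

A probabilistic score of this kind returns a probability for every cluster instead of a single hard assignment.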

Tuning the k-Means Algorithm

The Oracle Data Mining enhanced k-Means algorithm supports several build-time settings. All the settings have default values. There is no reason to override the defaults unless you want to influence the behavior of the algorithm in some specific way.

You can configure k-Means through its build settings, such as the maximum number of clusters, the distance function, and the split criterion.

See Also:

Oracle Database PL/SQL Packages and Types Reference for details about the build settings for k-Means

Data Preparation for k-Means

Normalization is typically required by the k-Means algorithm. Automatic Data Preparation performs outlier-sensitive normalization for k-Means. If you do not use ADP, you should normalize numeric attributes before creating or applying the model.
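
If you normalize manually, the parameters computed from the build data must be reused when the model is applied. The following Python sketch shows that pattern with min-max scaling, which is one common normalization choice and is not necessarily the method that Automatic Data Preparation uses. The function names and sample rows are hypothetical.

```python
def fit_min_max(rows):
    """Compute (min, max) for each numeric column of the build data."""
    return [(min(column), max(column)) for column in zip(*rows)]

def apply_min_max(rows, params):
    """Scale each column with previously fitted parameters; constant columns map to 0.0."""
    scaled = []
    for row in rows:
        scaled.append(tuple(
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, (lo, hi) in zip(row, params)
        ))
    return scaled

# Hypothetical build data: (age, income).
build_rows = [(20, 20000), (40, 60000), (60, 100000)]
params = fit_min_max(build_rows)

print(apply_min_max(build_rows, params))
# New data is scaled with the parameters from the build data, not refitted.
print(apply_min_max([(30, 80000)], params))
```

Without scaling, the income column would dominate a Euclidean distance simply because its values are several orders of magnitude larger than the ages.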

When there are missing values in columns with simple data types (not nested), k-Means interprets them as missing at random. The algorithm replaces missing categorical values with the mode and missing numerical values with the mean.

When there are missing values in nested columns, k-Means interprets them as sparse. The algorithm replaces sparse numerical data with zeros and sparse categorical data with zero vectors.
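
The two treatments can be sketched in Python as follows. Missing values are represented as None and a nested column as a dictionary of the entries that are present; both representations, the function names, and the sample data are assumptions made for this illustration.

```python
from collections import Counter
from statistics import mean

def fill_missing(cases, numerical, categorical):
    """Replace missing numerical values with the mean and missing categorical values with the mode."""
    filled = [dict(case) for case in cases]
    for name in numerical:
        replacement = mean(c[name] for c in cases if c.get(name) is not None)
        for c in filled:
            if c.get(name) is None:
                c[name] = replacement
    for name in categorical:
        counts = Counter(c[name] for c in cases if c.get(name) is not None)
        replacement = counts.most_common(1)[0][0]
        for c in filled:
            if c.get(name) is None:
                c[name] = replacement
    return filled

def densify(sparse_row, all_keys):
    """Treat absent entries of a sparse numerical nested column as zeros."""
    return [sparse_row.get(key, 0) for key in all_keys]

cases = [
    {"age": 30, "region": "west"},
    {"age": None, "region": "west"},
    {"age": 50, "region": None},
]
print(fill_missing(cases, numerical=["age"], categorical=["region"]))
print(densify({"milk": 2}, all_keys=["bread", "milk"]))
```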

See Also:

Oracle Database PL/SQL Packages and Types Reference for details about normalization routines

Chapter 19 for information about automatic and embedded data transformation in Oracle Data Mining

Oracle Data Mining Application Developer's Guide for information about support for nested columns and missing data in Oracle Data Mining
