The code is based upon the Wikipedia description [wikikmeans]. The objective of the algorithm is to minimize the intra-cluster variance, given by
where is the total number of clusters, the set of points in the ‘th cluster, and the center of the ‘th cluster.
When k-means has minimized the intra-cluster variance, it might not have found the global minimum of variance. It is therefore a good idea to run the algorithm several times, and use the clustering result with the best intra-cluster variance. The intra-cluster variance can be obtained from the method find_intra_cluster_variance.
[wikikmeans] | http://en.wikipedia.org/w/index.php?title=K-means_algorithm&oldid=280702005 |
To demostrate the K-means algorithm, we will construct a simple example with three clusters with Gaussian distribution.
from matplotlib.pylab import *
from pypr.clustering.kmeans import *
from pypr.clustering.rnd_gauss_clusters import *
# Make three clusters:
centroids = [ array([1,1]), array([3,3]), array([3,0]) ]
ccov = [ array([[1,0],[0,1]]), array([[1,0],[0,1]]), \
array([[0.2, 0],[0, 0.2]]) ]
X = rnd_gauss_clusters(centroids, ccov, 1000)
figure(figsize=(10,5))
subplot(121)
title('Original unclustered data')
plot(X[:,0], X[:,1], '.')
xlabel('$x_1$'); ylabel('$x_2$')
subplot(122)
title('Clustered data')
m, cc = kmeans(X, 3)
plot(X[m==0, 0], X[m==0, 1], 'r.')
plot(X[m==1, 0], X[m==1, 1], 'b.')
plot(X[m==2, 0], X[m==2, 1], 'g.')
xlabel('$x_1$'); ylabel('$x_2$')
The second plot of this example will give the clustering result from the algorithm.
Find centroids based on sample cluster assignments.
Parameters : | X : KxD np array
K : int
m : 1d np array
|
---|---|
Returns : | cc : 2d np array
|
Returns the euclidean distance for each sample to a cluster center
Parameters : | X : 2d np array
c : 1d np array or list
|
---|---|
Returns : | dist : 2d np column array
|
Returns the intra-cluster variance.
Parameters : | X : 2d np array
m : list or 1d np array
cc : 2d np array
|
---|---|
Returns : | dist : float
|
Finds the closest cluster centroid for each sample, and returns cluster membership number.
Parameters : | X : 2d np array
cc : 2d np array
|
---|---|
Returns : | m : 1d array
|
Cluster the samples in X using K-means into K clusters. The algorithm stops when no samples change cluster assignment in an itertaion. NOTE: You might need to change the default maximum number of of iterations iter, depending on the number of samples and clusters used.
Parameters : | X : KxD np array
K : int
iter : int, optional
cluster_init : string, optional
delta_stop : float, optional
verbose : bool, optional
|
---|---|
Returns : | m : 1d np array
cc : 2d np array
|