clustering Package

`gmm` Module

Gaussian Mixture Models

exception pypr.clustering.gmm.Cov_problem

Bases: exceptions.Exception

pypr.clustering.gmm.cond_dist(Y, centroids, ccov, mc)

Finds the conditional distribution p(X|Y) for a GMM.

Parameters :

Y : NxD array

An array of inputs. Entries set to NaN are treated as not given and become the variables of the resulting conditional distribution. Order is preserved.

centroids : list

List of cluster centers - [ [x1,y1,..],..,[xN, yN,..] ]

ccov : list

List of cluster co-variances DxD matrices

mc : list

Mixing coefficients for each cluster (must sum to one); by default equal for each cluster.

Returns :

res : tuple

A tuple containing a new set of (centroids, ccov, mc) for the conditional distribution.
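
The conditioning step can be illustrated with the standard Gaussian conditioning identities applied to each component. The following is a minimal NumPy sketch for a single input row, not pypr's implementation; the helper name `condition_gmm` is hypothetical, and it assumes that the NaN entries mark the free variables and the remaining entries are observed.

```python
import numpy as np

def condition_gmm(y, centroids, ccov, mc):
    """Condition a GMM on the non-NaN entries of the 1-d array y (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    free = np.isnan(y)   # dimensions left unspecified
    obs = ~free          # dimensions with observed values
    new_means, new_covs, log_w = [], [], []
    for mu, sigma, pi in zip(centroids, ccov, mc):
        mu = np.asarray(mu, dtype=float)
        sigma = np.asarray(sigma, dtype=float)
        s_ff = sigma[np.ix_(free, free)]
        s_fo = sigma[np.ix_(free, obs)]
        s_oo = sigma[np.ix_(obs, obs)]
        diff = y[obs] - mu[obs]
        gain = s_fo @ np.linalg.inv(s_oo)
        # Conditional mean and covariance of the free dimensions
        new_means.append(mu[free] + gain @ diff)
        new_covs.append(s_ff - gain @ s_fo.T)
        # Unnormalized log weight: log(pi * N(y_obs | mu_obs, s_oo))
        _, logdet = np.linalg.slogdet(s_oo)
        maha = diff @ np.linalg.solve(s_oo, diff)
        log_w.append(np.log(pi) - 0.5 * (obs.sum() * np.log(2 * np.pi) + logdet + maha))
    log_w = np.array(log_w)
    w = np.exp(log_w - log_w.max())   # subtract the max for numerical stability
    return new_means, new_covs, w / w.sum()

centroids = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
ccov = [np.array([[1.0, 0.5], [0.5, 1.0]]), np.eye(2)]
means, covs, weights = condition_gmm([np.nan, 1.0], centroids, ccov, [0.5, 0.5])
```

The new mixing coefficients are the old ones reweighted by how well each component explains the observed values, so components far from the observation contribute little to the conditional distribution.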

pypr.clustering.gmm.em(X, K, max_iter=50, verbose=False, iter_call=None, delta_stop=1e-06, init_kw={}, max_tries=10, diag_add=0.001)

Find K cluster centers in X using Expectation Maximization of Gaussian Mixtures.

Parameters :

X : NxD array

Input data. Should contain N samples row-wise and D variables column-wise.

max_iter : int

Maximum allowed number of iterations per try.

iter_call : callable

Called for each iteration: iter_call(center_list, cov_list, p_k, i)

delta_stop : float

Stop when the change in the mean negative log likelihood goes below this value.

max_tries : int

The co-variance matrix for some cluster might end up with NaN values, in which case the algorithm restarts; max_tries is the number of allowed retries.

diag_add : float

A scalar multiplied by the variance of each feature of the input data, and added to the diagonal of the covariance matrix at each iteration.

Centroid initialization is given by *cluster_init*; the only available options are ‘sample’ and ‘kmeans’. ‘sample’ selects random samples as centroids. ‘kmeans’ calls kmeans to find the cluster centers.

Returns :

center_list : list

A K-length list of cluster centers

cov_list : list

A K-length list of co-variance matrices

p_k : list

A K-length array with mixing coefficients (p_k)

logLL : list

Log likelihood (how well the data fits the model)
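
For orientation, one EM iteration for a Gaussian mixture can be sketched as below. This is a generic textbook version written with NumPy, not pypr's code: the function name `em_step` is hypothetical, no `diag_add` regularization or restart logic is included, and the stopping rule in the usage loop only approximates the `delta_stop` criterion described above.

```python
import numpy as np

def em_step(X, means, covs, weights):
    """One EM iteration; returns updated parameters and the log likelihood
    of X under the parameters passed in (hypothetical helper)."""
    N, D = X.shape
    K = len(means)
    log_r = np.empty((N, K))
    for k in range(K):
        diff = X - means[k]
        _, logdet = np.linalg.slogdet(covs[k])
        maha = np.sum(diff @ np.linalg.inv(covs[k]) * diff, axis=1)
        log_r[:, k] = np.log(weights[k]) - 0.5 * (D * np.log(2 * np.pi) + logdet + maha)
    # E-step: responsibilities via a log-sum-exp normalization
    log_norm = np.logaddexp.reduce(log_r, axis=1)
    resp = np.exp(log_r - log_norm[:, None])
    # M-step: weighted means, covariances and mixing coefficients
    Nk = resp.sum(axis=0)
    new_means = [resp[:, k] @ X / Nk[k] for k in range(K)]
    new_covs = []
    for k in range(K):
        diff = X - new_means[k]
        new_covs.append((resp[:, k, None] * diff).T @ diff / Nk[k])
    return new_means, new_covs, Nk / N, log_norm.sum()

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
means, covs, weights = [X[0], X[150]], [np.eye(2), np.eye(2)], np.array([0.5, 0.5])
history = []
for i in range(50):                       # plays the role of max_iter
    means, covs, weights, ll = em_step(X, means, covs, weights)
    history.append(ll)
    # Stop when the mean log likelihood changes by less than 1e-6
    if len(history) > 1 and abs(history[-1] - history[-2]) / len(X) < 1e-6:
        break
```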

pypr.clustering.gmm.em_gm(X, K, max_iter=50, verbose=False, iter_call=None, delta_stop=1e-06, init_kw={}, max_tries=10, diag_add=0.001)

Find K cluster centers in X using Expectation Maximization of Gaussian Mixtures.

Parameters :

X : NxD array

Input data. Should contain N samples row-wise and D variables column-wise.

max_iter : int

Maximum allowed number of iterations per try.

iter_call : callable

Called for each iteration: iter_call(center_list, cov_list, p_k, i)

delta_stop : float

Stop when the change in the mean negative log likelihood goes below this value.

max_tries : int

The co-variance matrix for some cluster might end up with NaN values, in which case the algorithm restarts; max_tries is the number of allowed retries.

diag_add : float

A scalar multiplied by the variance of each feature of the input data, and added to the diagonal of the covariance matrix at each iteration.

Centroid initialization is given by *cluster_init*; the only available options are ‘sample’ and ‘kmeans’. ‘sample’ selects random samples as centroids. ‘kmeans’ calls kmeans to find the cluster centers.

Returns :

center_list : list

A K-length list of cluster centers

cov_list : list

A K-length list of co-variance matrices

p_k : list

A K-length array with mixing coefficients (p_k)

logLL : list

Log likelihood (how well the data fits the model)
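
The `diag_add` regularization described above can be pictured with the short sketch below. It is an assumption about the arithmetic only: the helper name `regularize_cov` is hypothetical, and whether pypr uses the population or the sample variance of the features is not stated in this reference (the sketch uses `np.var`, the population variance).

```python
import numpy as np

def regularize_cov(cov, X, diag_add=0.001):
    """Add diag_add times the per-feature variance of X to the diagonal of cov."""
    return cov + np.diag(diag_add * np.var(X, axis=0))

# Perfectly correlated features give a singular covariance matrix
X = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]])
cov = np.cov(X.T)
cov_reg = regularize_cov(cov, X)
```

Scaling the added term by the feature variance keeps the regularization proportionate to the units of each input dimension, and the small diagonal boost keeps the matrix invertible.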

pypr.clustering.gmm.find_density_diff(center_list, cov_list, p_k=None, method='hellinger')

Difference measures for each component of the GMM.

Parameters :

center_list : list

List of cluster centers - [ [x1,y1,..],..,[xN, yN,..] ]

cov_list : list

List of cluster co-variances DxD matrices

p_k : list

Mixing coefficients for each cluster (must sum to one); by default equal for each cluster.

method : string, optional

Select difference measure to use. Can be:

‘hellinger’ :

‘hellinger_weighted’ :

‘KL’ : Kullback-Leibler divergence

Returns :

diff : NxN np array

The pairwise difference between the probability distributions of the components. Only the upper triangular part of the matrix is used.
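
As an illustration of the ‘hellinger’ option, the Hellinger distance between two Gaussians has a closed form, and a pairwise matrix can be filled in its upper triangle. This is a NumPy sketch under assumptions: the helper names are hypothetical, and this reference does not say whether pypr returns the distance or its square, or how ‘hellinger_weighted’ uses p_k.

```python
import numpy as np

def hellinger_gauss(mu1, cov1, mu2, cov2):
    """Hellinger distance between N(mu1, cov1) and N(mu2, cov2)."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov1, cov2 = np.asarray(cov1, float), np.asarray(cov2, float)
    cov_avg = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    _, ld1 = np.linalg.slogdet(cov1)
    _, ld2 = np.linalg.slogdet(cov2)
    _, lda = np.linalg.slogdet(cov_avg)
    # Log of the Bhattacharyya coefficient between the two densities
    log_bc = 0.25 * ld1 + 0.25 * ld2 - 0.5 * lda \
        - 0.125 * (diff @ np.linalg.solve(cov_avg, diff))
    return np.sqrt(max(0.0, 1.0 - np.exp(log_bc)))

def pairwise_hellinger(center_list, cov_list):
    """Fill only the upper triangle, leaving the rest at zero."""
    K = len(center_list)
    out = np.zeros((K, K))
    for i in range(K):
        for j in range(i + 1, K):
            out[i, j] = hellinger_gauss(center_list[i], cov_list[i],
                                        center_list[j], cov_list[j])
    return out

centers = [np.zeros(2), np.array([1.0, 0.0]), np.array([8.0, 8.0])]
covs = [np.eye(2), np.eye(2), 2.0 * np.eye(2)]
diff_matrix = pairwise_hellinger(centers, covs)
```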

pypr.clustering.gmm.gauss_ellipse_2d(centroid, ccov, sdwidth=1, points=100)

Returns x,y vectors corresponding to ellipsoid at standard deviation sdwidth.
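
One common way to obtain such an ellipse is to map a unit circle through a Cholesky factor of the covariance. The sketch below shows that construction with NumPy; it is not necessarily how pypr computes the points, and the name `ellipse_points` is hypothetical.

```python
import numpy as np

def ellipse_points(centroid, ccov, sdwidth=1.0, points=100):
    """Return x, y arrays tracing the sdwidth-standard-deviation ellipse."""
    t = np.linspace(0.0, 2.0 * np.pi, points)
    circle = np.vstack([np.cos(t), np.sin(t)])       # unit circle, 2 x points
    L = np.linalg.cholesky(np.asarray(ccov, float))  # ccov = L @ L.T
    xy = sdwidth * (L @ circle) + np.asarray(centroid, float)[:, None]
    return xy[0], xy[1]

mu = np.array([1.0, 2.0])
cov = np.array([[2.0, 0.5], [0.5, 1.0]])
x, y = ellipse_points(mu, cov, sdwidth=2, points=50)
```

Every returned point lies at Mahalanobis distance sdwidth from the centroid, which is what makes the curve a contour of the Gaussian density.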

pypr.clustering.gmm.gm_assign_to_cluster(X, center_list, cov_list, p_k)

Assigns each sample to one of the Gaussian clusters given.

Returns an array with numbers, 0 corresponding to the first cluster in the cluster list.

pypr.clustering.gmm.gm_log_likelihood(X, center_list, cov_list, p_k)

Finds the log likelihood of a set of samples belonging to a Gaussian mixture model.

Returns the log likelihood.
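
The quantity involved is the sum over samples of the log of the weighted component densities. A numerically stable way to compute it is a log-sum-exp over components, sketched below with NumPy. The function name is hypothetical, and whether pypr returns the sum or a per-sample mean is an assumption not settled by this reference.

```python
import numpy as np

def gmm_log_likelihood(X, center_list, cov_list, p_k):
    """Sum over samples of log(sum_k p_k * N(x | mu_k, cov_k))."""
    X = np.atleast_2d(np.asarray(X, float))
    N, D = X.shape
    log_terms = np.empty((N, len(center_list)))
    for k, (mu, cov) in enumerate(zip(center_list, cov_list)):
        diff = X - np.asarray(mu, float)
        _, logdet = np.linalg.slogdet(cov)
        maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
        log_terms[:, k] = np.log(p_k[k]) - 0.5 * (D * np.log(2 * np.pi) + logdet + maha)
    # log-sum-exp over components avoids underflow for points far from every center
    return np.logaddexp.reduce(log_terms, axis=1).sum()

ll = gmm_log_likelihood(np.array([[0.0], [1.0]]),
                        [np.array([0.0])], [np.array([[1.0]])], [1.0])
```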

pypr.clustering.gmm.gmm_em_continue(X, center_list, cov_list, p_k, max_iter=50, verbose=False, iter_call=None, delta_stop=1e-06, diag_add=0.001, delta_stop_count_end=10)

pypr.clustering.gmm.gmm_init(X, K, verbose=False, cluster_init='sample', cluster_init_prop={}, max_init_iter=5, cov_init='var')

Initialize a Gaussian Mixture Model (GMM). Generates a set of initial parameters for the GMM.

Returns :

center_list : list

A K-length list of cluster centers

cov_list : list

A K-length list of co-variance matrices

p_k : list

A K-length array with mixing coefficients

pypr.clustering.gmm.gmm_pdf(X, centroids, ccov, mc, individual=False)

Evaluates the PDF for the multivariate Gaussian mixture.

Parameters :

centroids : list

List of cluster centers - [ [x1,y1,..],..,[xN, yN,..] ]

ccov : list

List of cluster co-variances DxD matrices

mc : list

Mixing coefficients for each cluster (must sum to one); by default equal for each cluster.

individual : bool

If True the probability density is returned for each cluster component.

Returns :

prob : 1d np array

Probability density values for entries in X.

pypr.clustering.gmm.logmulnormpdf(X, MU, SIGMA)

Evaluates the natural log of the PDF for the multivariate Gaussian distribution.

Parameters :

X : np array

Inputs/entries row-wise. Can also be a 1-d array if only a single point is evaluated.

MU : np array

Center/mean, 1d array.

SIGMA : 2d np array

Covariance matrix.

Returns :

prob : 1d np array

Log (natural) probabilities for entries in X.
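
For reference, the log density can be computed without forming the inverse covariance by using a Cholesky factor. This is a sketch of that approach in NumPy, with the hypothetical name `log_mvn_pdf`; it assumes SIGMA is symmetric positive definite and does not claim to mirror pypr's internals.

```python
import numpy as np

def log_mvn_pdf(X, MU, SIGMA):
    """Natural log of the multivariate Gaussian density, one value per row of X."""
    X = np.atleast_2d(np.asarray(X, float))   # a 1-d input becomes a single row
    MU = np.asarray(MU, float)
    L = np.linalg.cholesky(np.asarray(SIGMA, float))   # SIGMA = L @ L.T
    z = np.linalg.solve(L, (X - MU).T)                 # whitened residuals, D x N
    maha = np.sum(z ** 2, axis=0)                      # squared Mahalanobis distances
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    D = MU.shape[0]
    return -0.5 * (D * np.log(2.0 * np.pi) + log_det + maha)

out = log_mvn_pdf(np.array([[0, 0], [1, 1]]), np.array([0, 0]), np.diag((1, 1)))
```

Working in the log domain avoids underflow for points far from the mean, where the density itself rounds to zero.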

pypr.clustering.gmm.marg_dist(X_idx, centroids, ccov, mc)

Finds the marginal distribution p(X) for a GMM.

Parameters :

X_idx : list

Indices of dimensions to keep

centroids : list

List of cluster centers - [ [x1,y1,..],..,[xN, yN,..] ]

ccov : list

List of cluster co-variances DxD matrices

mc : list

Mixing coefficients for each cluster (must sum to one); by default equal for each cluster.

Returns :

res : tuple

A tuple containing a new set of (centroids, ccov, mc) for the marginal distribution.
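
Marginalizing a Gaussian mixture amounts to keeping the selected entries of each mean and the matching rows and columns of each covariance, while the mixing coefficients stay unchanged. A minimal NumPy sketch of that idea follows; the name `marginal_gmm` is hypothetical and this is not pypr's code.

```python
import numpy as np

def marginal_gmm(X_idx, centroids, ccov, mc):
    """Keep only the dimensions listed in X_idx for every component."""
    idx = np.asarray(X_idx, dtype=int)
    new_centroids = [np.asarray(mu)[idx] for mu in centroids]
    new_ccov = [np.asarray(cov)[np.ix_(idx, idx)] for cov in ccov]
    return new_centroids, new_ccov, list(mc)

centroids = [np.array([1.0, 2.0, 3.0])]
ccov = [np.array([[1.0, 0.1, 0.2],
                  [0.1, 2.0, 0.3],
                  [0.2, 0.3, 3.0]])]
m_cen, m_cov, m_mc = marginal_gmm([0, 2], centroids, ccov, [1.0])
```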

pypr.clustering.gmm.mulnormpdf(X, MU, SIGMA)

Evaluates the PDF for the multivariate Gaussian distribution.

Parameters :

X : np array

Inputs/entries row-wise. Can also be a 1-d array if only a single point is evaluated.

MU : np array

Center/mean, 1d array.

SIGMA : 2d np array

Covariance matrix.

Returns :

prob : 1d np array

Probabilities for entries in X.

Examples

import numpy as np
from pypr.clustering import gmm

X = np.array([[0, 0], [1, 1]])  # two 2-d points, one per row
MU = np.array([0, 0])           # mean vector
SIGMA = np.diag((1, 1))         # identity covariance matrix
gmm.mulnormpdf(X, MU, SIGMA)

pypr.clustering.gmm.predict(X, centroids, ccov, mc)

Predict the entries in X, which contains NaNs.

Parameters :

X : np array

2d np array containing the inputs. Targets are specified with numpy NaNs. The NaNs will be replaced with the most probable result according to the GMM model provided.

centroids : list

List of cluster centers - [ [x1,y1,..],..,[xN, yN,..] ]

ccov : list

List of cluster co-variances DxD matrices

mc : list

Mixing coefficients for each cluster (must sum to one); by default equal for each cluster.

Returns :

Nothing. X is modified in place.

pypr.clustering.gmm.sample_gaussian_mixture(centroids, ccov, mc=None, samples=1)

Draw samples from a Mixture of Gaussians (MoG)

Parameters :

centroids : list

List of cluster centers - [ [x1,y1,..],..,[xN, yN,..] ]

ccov : list

List of cluster co-variances DxD matrices

mc : list

Mixing coefficients for each cluster (must sum to one); by default equal for each cluster.

Returns :

X : 2d np array

A matrix with one sample per row and one input dimension per column.

Examples

import numpy as np
from pypr.clustering import gmm

centroids = [np.array([10, 10])]        # a single cluster center
ccov = [np.array([[1, 0], [0, 1]])]     # its covariance matrix
samples = 10
gmm.sample_gaussian_mixture(centroids, ccov, samples=samples)
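
Sampling from a mixture is usually done in two stages: pick a component for each sample according to the mixing coefficients, then draw from that component's Gaussian. The NumPy sketch below illustrates this; the name `sample_gmm` and its `rng` argument are hypothetical and not part of pypr.

```python
import numpy as np

def sample_gmm(centroids, ccov, mc=None, samples=1, rng=None):
    """Draw rows from a Gaussian mixture; equal weights when mc is None."""
    rng = np.random.default_rng() if rng is None else rng
    K = len(centroids)
    mc = np.full(K, 1.0 / K) if mc is None else np.asarray(mc, float)
    comp = rng.choice(K, size=samples, p=mc)   # component index for each sample
    X = np.empty((samples, len(centroids[0])))
    for k in range(K):
        mask = comp == k
        if mask.any():
            X[mask] = rng.multivariate_normal(centroids[k], ccov[k],
                                              size=int(mask.sum()))
    return X

draws = sample_gmm([np.array([10.0, 10.0])], [np.eye(2)], samples=500,
                   rng=np.random.default_rng(1))
```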

`kmeans` Module

pypr.clustering.kmeans.find_centroids(X, K, m)

Find centroids based on sample cluster assignments.

Parameters :

X : NxD np array

N data samples with D dimensions

K : int

Number of clusters

m : 1d np array

cluster membership number, starts from zero

Returns :

cc : 2d np array

A set of K cluster centroids in a K x D array, where D is the number of dimensions. If a cluster isn’t assigned any points/samples, then its centroid will consist of NaNs.
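
The behavior described above, including the NaN rows for empty clusters, can be sketched in a few lines of NumPy. The function name `centroids_from_membership` is hypothetical and the code is an illustration rather than pypr's implementation.

```python
import numpy as np

def centroids_from_membership(X, K, m):
    """Mean of the samples assigned to each cluster; NaN rows for empty clusters."""
    X = np.asarray(X, float)
    m = np.asarray(m)
    cc = np.full((K, X.shape[1]), np.nan)
    for k in range(K):
        members = X[m == k]
        if len(members) > 0:
            cc[k] = members.mean(axis=0)
    return cc

cc = centroids_from_membership([[0, 0], [2, 2], [10, 10]], 3, [0, 0, 2])
```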

pypr.clustering.kmeans.find_distance(X, c)

Returns the Euclidean distance from each sample to a cluster center.

Parameters :

X : 2d np array

Samples are row wise, variables column wise

c : 1d np array or list

Cluster center

Returns :

dist : 2d np column array

Distances are returned as a column vector.
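
A minimal NumPy sketch of this computation, returning an N x 1 column as described; the name `distance_to_center` is hypothetical.

```python
import numpy as np

def distance_to_center(X, c):
    """Euclidean distance from every row of X to the center c, as a column vector."""
    X = np.asarray(X, float)
    c = np.asarray(c, float)
    return np.sqrt(np.sum((X - c) ** 2, axis=1)).reshape(-1, 1)

dist = distance_to_center([[3, 4], [0, 0]], [0, 0])
```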

pypr.clustering.kmeans.find_intra_cluster_variance(X, m, cc)

Returns the intra-cluster variance.

Parameters :

X : 2d np array

Samples are row wise, variables column wise

m : list or 1d np array

Cluster membership number, starts from zero

cc : 2d np array

Cluster centers row-wise, variables column-wise

Returns :

dist : float

Intra cluster variance

pypr.clustering.kmeans.find_membership(X, cc)

Finds the closest cluster centroid for each sample, and returns cluster membership number.

Parameters :

X : 2d np array

Samples are row wise, variables column wise

cc : 2d np array

Cluster centers row-wise, variables column-wise

Returns :

m : 1d array

cluster membership number, starts from zero
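
Nearest-centroid assignment can be written with broadcasting, as in the sketch below. The name `nearest_centroid` is hypothetical; the sketch assumes no centroid contains NaNs and that ties go to the lowest cluster index, which is how `np.argmin` behaves.

```python
import numpy as np

def nearest_centroid(X, cc):
    """Index of the closest row of cc for every row of X."""
    X = np.asarray(X, float)
    cc = np.asarray(cc, float)
    # Squared distances between every sample and every centroid, N x K
    d2 = np.sum((X[:, None, :] - cc[None, :, :]) ** 2, axis=2)
    return np.argmin(d2, axis=1)

membership = nearest_centroid([[0, 0], [9, 9], [1, 0]], [[0, 0], [10, 10]])
```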

pypr.clustering.kmeans.kmeans(X, K, iter=20, verbose=False, cluster_init='sample', delta_stop=None)

Cluster the samples in X into K clusters using K-means. The algorithm stops when no samples change cluster assignment in an iteration. NOTE: You might need to change the default maximum number of iterations, iter, depending on the number of samples and clusters used.

Parameters :

X : NxD np array

N data samples with D dimensions.

K : int

Number of clusters.

iter : int, optional

Maximum number of iterations to run the k-means algorithm for.

cluster_init : string, optional

Centroid initialization. The available options are: ‘sample’ and ‘box’. ‘sample’ selects random samples as initial centroids, and ‘box’ selects random values within the space bounded by a box containing all the samples.

delta_stop : float, optional

Use delta_stop to stop the algorithm early. If every variable in every centroid changes by less than delta_stop, the algorithm stops.

verbose : bool, optional

Make it talk back.

Returns :

m : 1d np array

cluster membership number, starts from zero.

cc : 2d np array

A set of K cluster centroids in a K x D array, where D is the number of dimensions. If a cluster isn’t assigned any points/samples, then its centroid will consist of NaNs.
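
The overall loop can be sketched as follows: initialize with the ‘sample’ strategy, then alternate assignment and centroid updates until no assignment changes. This is a generic NumPy version of Lloyd's algorithm, not pypr's code. The name `kmeans_sketch` and the `rng` argument are hypothetical, the ‘box’ initialization and `delta_stop` are omitted, and the argument is called `iters` to avoid shadowing Python's built-in `iter`.

```python
import numpy as np

def kmeans_sketch(X, K, iters=20, rng=None):
    """Return (membership, centroids) after at most iters rounds of K-means."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, float)
    # 'sample' initialization: K distinct samples become the initial centroids
    cc = X[rng.choice(len(X), size=K, replace=False)]
    m = np.full(len(X), -1)
    for _ in range(iters):
        d2 = np.sum((X[:, None, :] - cc[None, :, :]) ** 2, axis=2)
        new_m = np.nanargmin(d2, axis=1)   # ignore NaN (empty) centroids
        if np.array_equal(new_m, m):       # no sample changed cluster
            break
        m = new_m
        for k in range(K):
            # An empty cluster gets a NaN centroid, as described above
            cc[k] = X[m == k].mean(axis=0) if np.any(m == k) else np.nan
    return m, cc

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
m, cc = kmeans_sketch(X, 2, rng=rng)
```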
