A multivariate normal distribution, or multivariate Gaussian distribution, is a generalization of the one-dimensional Gaussian distribution to multiple dimensions.
The distribution is specified by its mean vector, $\mu$, and its covariance matrix, $\Sigma$. In Python, samples from a multivariate normal distribution can be drawn with numpy's numpy.random.multivariate_normal function.
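For example, a few samples could be drawn like this (the mean and covariance below are just example values, not anything used elsewhere in this document):
import numpy as np
# Example parameters for a 2-D Gaussian
mu = np.array([0.0, 0.0])          # mean vector
sigma = np.array([[1.0, 0.4],      # covariance matrix
                  [0.4, 1.0]])
# Draw 5 samples; the result is a 5x2 array, one sample per row
samples = np.random.multivariate_normal(mu, sigma, size=5)
print(samples)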
In statistics, a mixture model is a probabilistic model for density estimation using a mixture distribution. A mixture model can be regarded as a type of unsupervised learning or clustering [wikimixmodel]. By combining several probability distributions, mixture models provide a way of describing more complex probability distributions, and they can also be used to cluster data.
The Gaussian mixture distribution is given by the following equation [bishop2006]:

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$

Here we have a linear mixture of $K$ Gaussian density functions, $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$. The parameters $\pi_k$ are called mixing coefficients, which must fulfill

$$0 \leq \pi_k \leq 1$$

and, given that both $p(\mathbf{x})$ and the individual Gaussian densities integrate to one, we also have that

$$\sum_{k=1}^{K} \pi_k = 1$$
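To make the formula concrete, here is a minimal numpy sketch that evaluates a two-component mixture density at a single point; the component parameters and mixing coefficients are invented for illustration, and the Gaussian density is written out by hand rather than taken from PyPR:
import numpy as np

def gauss_pdf(x, mu, sigma):
    # Density of a multivariate Gaussian N(x | mu, sigma)
    d = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    return norm * np.exp(-0.5 * diff.dot(np.linalg.solve(sigma, diff)))

# Two example components and their mixing coefficients (they sum to 1)
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigmas = [np.eye(2), np.diag([1.0, 2.0])]
pis = [0.7, 0.3]

x = np.array([1.0, 1.0])
p = sum(pi * gauss_pdf(x, mu, sigma) for pi, mu, sigma in zip(pis, mus, sigmas))
print(p)  # mixture density p(x) at the point x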
PyPR has some simple support for sampling from Gaussian Mixture Models.
Draw samples from a Mixture of Gaussians (MoG)

Parameters:
    centroids : list
    ccov : list
    mc : list

Returns:
    X : 2d np array
Examples
from pypr.clustering import *
from numpy import *
centroids=[array([10,10])]
ccov=[array([[1,0],[0,1]])]
samples = 10
gmm.sample_gaussian_mixture(centroids, ccov, samples=samples)
The example sample_gmm shows how to draw samples from a mixture of three clusters and plot the result.
# Drawing samples from a Gaussian Mixture Model
from numpy import *
from matplotlib.pylab import *
import pypr.clustering.gmm as gmm
mc = [0.4, 0.4, 0.2] # Mixing coefficients
centroids = [ array([0,0]), array([3,3]), array([0,4]) ]
ccov = [ array([[1,0.4],[0.4,1]]), diag((1,2)), diag((0.4,0.1)) ]
# Covariance matrices
X = gmm.sample_gaussian_mixture(centroids, ccov, mc, samples=1000)
plot(X[:,0], X[:,1], '.')
for i in range(len(mc)):
    x1, x2 = gmm.gauss_ellipse_2d(centroids[i], ccov[i])
    plot(x1, x2, 'k', linewidth=2)
xlabel('$x_1$'); ylabel('$x_2$')
The resulting plot should look something like this:
The probability density function (PDF) can be evaluated using the following function:
Evaluates the PDF for the multivariate Gaussian mixture.

Parameters:
    centroids : list
    ccov : list
    mc : list
    individual : bool

Returns:
    prob : 1d np array
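As a small usage sketch, reusing the centroids, ccov, and mc defined in the sampling example above, and assuming the first argument is an array of points with one row per sample (as in the regression example later in this document):
import numpy as np
import pypr.clustering.gmm as gmm

centroids = [np.array([0, 0]), np.array([3, 3]), np.array([0, 4])]
ccov = [np.array([[1, 0.4], [0.4, 1]]), np.diag((1, 2)), np.diag((0.4, 0.1))]
mc = [0.4, 0.4, 0.2]

# Evaluate the mixture PDF at two points (one point per row)
points = np.array([[0.0, 0.0], [3.0, 3.0]])
prob = gmm.gmm_pdf(points, centroids, ccov, mc)
print(prob)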
For more information on mixture models and expectation maximization, take a look at Pattern Recognition and Machine Learning by Christopher M. Bishop [bishop2006].
[wikimixmodel] Mixture model. (2010, December 23). In Wikipedia, The Free Encyclopedia. Retrieved 12:33, February 25, 2011, from http://en.wikipedia.org/w/index.php?title=Mixture_model&oldid=403871987
[bishop2006] Pattern Recognition and Machine Learning, Christopher M. Bishop, Springer (2006)
In the previous example we saw how to draw samples from a Gaussian Mixture Model. Now we will work in the opposite direction: given a set of samples, find a set of K multivariate Gaussian distributions that represents the observed samples well. The number of clusters, $K$, is given, so the parameters that are to be found are the means, covariances, and mixing coefficients of the distributions.
A well-established and efficient method for finding the parameters of a GMM is Expectation Maximization (EM). The EM algorithm is an iterative refinement algorithm used for finding maximum likelihood estimates of parameters in probabilistic models. The likelihood is a measure of how well the data fits a given model, and is a function of the parameters of the statistical model. If we assume that all the samples in the dataset are independent, then we can write the likelihood as

$$p(\mathbf{X} \mid \theta) = \prod_{n=1}^{N} p(\mathbf{x}_n \mid \theta)$$

and, as we are using a mixture of Gaussians as the model, we can write

$$p(\mathbf{X} \mid \pi, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
In contrast to the K-means algorithm, the EM algorithm for Gaussian mixtures does not assign each sample to only one cluster. Instead, it assigns each sample a set of weights representing the sample's probability of membership to each cluster. This can be expressed with a conditional probability, $p(k \mid \mathbf{x}_n)$, which for a given sample $\mathbf{x}_n$ gives the probability that the sample was drawn from cluster $k$. A responsibility matrix with elements $\gamma_{nk} = p(k \mid \mathbf{x}_n)$ is used to hold these probabilities.
Given the data $\mathbf{X}$ and the model parameters $\pi_k$, $\boldsymbol{\mu}_k$, and $\boldsymbol{\Sigma}_k$, we can now calculate the likelihood and the responsibilities

$$\gamma_{nk} = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$

This is the expectation (E) step of the EM algorithm.
In the maximization (M) step we estimate the means, covariances, and mixing coefficients $\pi_k$. As each point has a probability $\gamma_{nk}$ of belonging to a cluster, we have to weight each sample's contribution to the parameters with that factor. The following equations are used to estimate the new set of model parameters:

$$N_k = \sum_{n=1}^{N} \gamma_{nk}$$

$$\boldsymbol{\mu}_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \, \mathbf{x}_n$$

$$\boldsymbol{\Sigma}_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \, (\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})(\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})^{\top}$$

$$\pi_k^{\text{new}} = \frac{N_k}{N}$$
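For readers who want to see these update rules in code, here is a minimal, self-contained numpy sketch of one E step and one M step; it illustrates the equations above and is not the implementation used in pypr.clustering.gmm:
import numpy as np

def em_step(X, mus, sigmas, pis):
    # One EM iteration for a Gaussian mixture, following the equations above.
    N, D = X.shape
    K = len(pis)
    # E step: responsibilities gamma[n, k] = p(k | x_n)
    gamma = np.zeros((N, K))
    for k in range(K):
        diff = X - mus[k]
        inv = np.linalg.inv(sigmas[k])
        norm = 1.0 / np.sqrt((2 * np.pi) ** D * np.linalg.det(sigmas[k]))
        quad = np.sum(np.dot(diff, inv) * diff, axis=1)
        gamma[:, k] = pis[k] * norm * np.exp(-0.5 * quad)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M step: re-estimate means, covariances, and mixing coefficients
    Nk = gamma.sum(axis=0)
    mus_new = [np.dot(gamma[:, k], X) / Nk[k] for k in range(K)]
    sigmas_new = [np.dot((X - mus_new[k]).T * gamma[:, k], X - mus_new[k]) / Nk[k]
                  for k in range(K)]
    pis_new = Nk / N
    return mus_new, sigmas_new, pis_new

# Repeated calls refine the parameters until the log likelihood converges:
# mus, sigmas, pis = em_step(X, mus, sigmas, pis)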
The function pypr.clustering.gmm.em() is called to initialize and fit the GMM. It will retry max_tries times if it encounters problems with the covariance matrices. The method calls pypr.clustering.gmm.gmm_init() to initialize the clusters, and the keyword arguments passed to it can be specified using the init_kw parameter.
It can be problematic to initialize the EM algorithm with the box initialization, as calculating the log likelihood in the first step may be numerically challenging. Initializing the EM algorithm with either sample or kmeans should not pose any problems.
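As an illustration of these options, a call could look roughly like the following; the keyword names (K, max_iter, max_tries, init_kw, cluster_init) are assumed from the em_gm example below and the parameter listings in the API reference, so verify them against the docstrings before relying on them:
import pypr.clustering.gmm as gmm

# Assumed call; X is the (N x D) array of samples to fit.
cen_lst, cov_lst, p_k, logL = gmm.em(X, K=3, max_iter=400, max_tries=5,
                                     init_kw={'cluster_init': 'kmeans'})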
If you examine the code for the GMM EM algorithm, pypr.clustering.gmm.em(), you can see that it uses the log-sum-exp formula [numericalrecipes] to avoid underflow in the floating point calculations.
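The idea of the trick, sketched below in plain numpy (this illustrates the formula, not PyPR's actual code), is that the largest term is factored out before exponentiating, so that log(sum_k exp(a_k)) can be computed without the exponentials underflowing or overflowing:
import numpy as np

def log_sum_exp(a):
    # log(sum(exp(a))) computed stably: m + log(sum(exp(a - m))) with m = max(a)
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

# Log-densities this small would underflow if exponentiated directly
log_p = np.array([-1000.0, -1001.0, -1002.0])
print(log_sum_exp(log_p))             # approx -999.59
print(np.log(np.sum(np.exp(log_p))))  # -inf, because exp() underflows to zero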
The conditional and marginal distributions can easily be derived from the joint probability described by the GMM, see [bishop2006], [rasmussen2006], [sung2004]. Let $\mathbf{x}$ and $\mathbf{y}$ be jointly Gaussian vectors, and given a GMM whose Gaussian components have means and covariances partitioned as

$$\boldsymbol{\mu}_k = \begin{bmatrix} \boldsymbol{\mu}_{x,k} \\ \boldsymbol{\mu}_{y,k} \end{bmatrix}, \qquad \boldsymbol{\Sigma}_k = \begin{bmatrix} \boldsymbol{\Sigma}_{xx,k} & \boldsymbol{\Sigma}_{xy,k} \\ \boldsymbol{\Sigma}_{yx,k} & \boldsymbol{\Sigma}_{yy,k} \end{bmatrix}$$

Then the marginal distribution can be written as

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_{x,k}, \boldsymbol{\Sigma}_{xx,k})$$

and the conditional distribution for each Gaussian component is given by

$$p_k(\mathbf{x} \mid \mathbf{y}) = \mathcal{N}\big(\mathbf{x} \mid \boldsymbol{\mu}_{x,k} + \boldsymbol{\Sigma}_{xy,k}\boldsymbol{\Sigma}_{yy,k}^{-1}(\mathbf{y} - \boldsymbol{\mu}_{y,k}),\ \boldsymbol{\Sigma}_{xx,k} - \boldsymbol{\Sigma}_{xy,k}\boldsymbol{\Sigma}_{yy,k}^{-1}\boldsymbol{\Sigma}_{yx,k}\big)$$

and for the whole GMM as

$$p(\mathbf{x} \mid \mathbf{y}) = \sum_{k=1}^{K} \hat{\pi}_k \, p_k(\mathbf{x} \mid \mathbf{y}), \qquad \hat{\pi}_k = \frac{\pi_k \, \mathcal{N}(\mathbf{y} \mid \boldsymbol{\mu}_{y,k}, \boldsymbol{\Sigma}_{yy,k})}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{y} \mid \boldsymbol{\mu}_{y,j}, \boldsymbol{\Sigma}_{yy,j})}$$
It is also possible to use a mixture of Gaussians for regression. This is done by using the conditional distribution, so that we have $p(\mathbf{x} \mid \mathbf{y})$ for the outputs given the observed variables. When using the GMM for regression, the model output can be calculated as the expected value of the conditional GMM probability density function,

$$\hat{\mathbf{x}} = E[\mathbf{x} \mid \mathbf{y}] = \sum_{k=1}^{K} \hat{\pi}_k \left( \boldsymbol{\mu}_{x,k} + \boldsymbol{\Sigma}_{xy,k}\boldsymbol{\Sigma}_{yy,k}^{-1}(\mathbf{y} - \boldsymbol{\mu}_{y,k}) \right)$$
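In code, this expected value can be formed from the output of gmm.cond_dist (also used in the example below): the regression prediction is the weighted sum of the conditional component means. This is a sketch, assuming cen_lst, cov_lst, and p_k hold previously fitted GMM parameters and that the second coordinate is the conditioning variable:
import numpy as np
import pypr.clustering.gmm as gmm

# Condition on x2 = -1.0; NaN marks the coordinate we want to predict
con_cen, con_cov, new_p_k = gmm.cond_dist(np.array([np.nan, -1.0]),
                                          cen_lst, cov_lst, p_k)

# Regression output: expected value of the conditional mixture, i.e. the
# mixing-coefficient-weighted sum of the conditional component means
x1_hat = sum(w * m for w, m in zip(new_p_k, con_cen))
print(x1_hat)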
Here is an example which uses EM to find the parameters for the Gaussian Mixture Model (GMM), and then plots the conditional distribution for one of the variables.
# An example of using the Expectation Maximization (EM) algorithm to find the
# parameters for a mixture of gaussian given a set of samples.
from numpy import *
from matplotlib.pylab import *
import pypr.clustering.gmm as gmm
def iter_plot(cen_lst, cov_lst, itr):
    # For plotting EM progress
    if itr % 2 == 0:
        for i in range(len(cen_lst)):
            x, y = gmm.gauss_ellipse_2d(cen_lst[i], cov_lst[i])
            plot(x, y, 'k', linewidth=0.5)

seed(1)
mc = [0.4, 0.4, 0.2] # Mixing coefficients
centroids = [ array([0,0]), array([3,3]), array([0,4]) ]
ccov = [ array([[1,0.4],[0.4,1]]), diag((1,2)), diag((0.4,0.1)) ]
# Generate samples from the gaussian mixture model
X = gmm.sample_gaussian_mixture(centroids, ccov, mc, samples=1000)
fig1 = figure()
plot(X[:,0], X[:,1], '.')
# Expectation-Maximization of Mixture of Gaussians
cen_lst, cov_lst, p_k, logL = gmm.em_gm(X, K = 3, iter = 400,
                                        verbose = True, iter_call = iter_plot)
print "Log likelihood (how well the data fits the model) = ", logL
# Plot the cluster ellipses
for i in range(len(cen_lst)):
    x1, x2 = gmm.gauss_ellipse_2d(cen_lst[i], cov_lst[i])
    plot(x1, x2, 'k', linewidth=2)
title(""); xlabel(r'$x_1$'); ylabel(r'$x_2$')
# Now we will find the conditional distribution of x given y
fig2 = figure()
ax1 = subplot(111)
plot(X[:,0], X[:,1], ',')
y = -1.0
axhline(y)
x1plt = np.linspace(axis()[0], axis()[1], 200)
for i in range(len(cen_lst)):
    text(cen_lst[i][0], cen_lst[i][1], str(i+1), horizontalalignment='center',
         verticalalignment='center', size=32, color=(0.2,0,0))
    ex, ey = gmm.gauss_ellipse_2d(cen_lst[i], cov_lst[i])
    plot(ex, ey, 'k', linewidth=0.5)
ax2 = twinx()
(con_cen, con_cov, new_p_k) = gmm.cond_dist(np.array([np.nan, y]),
                                            cen_lst, cov_lst, p_k)
x2plt = gmm.gmm_pdf(c_[x1plt], con_cen, con_cov, new_p_k)
ax2.plot(x1plt, x2plt, 'r', linewidth=2,
         label='Cond. dist. of $x_1$ given $x_2='+str(y)+'$')
ax2.legend()
ax1.set_xlabel(r'$x_1$')
ax1.set_ylabel(r'$x_2$')
ax2.set_ylabel('Probability')
The resulting plot should look something like this:
[sung2004] Gaussian Mixture Regression and Classification, H. G. Sung, Rice University (2004)
[rasmussen2006] Gaussian Processes for Machine Learning, Carl Edward Rasmussen and Christopher K. I. Williams, MIT Press (2006)
Gaussian Mixture Models
Finds the conditional distribution p(X|Y) for a GMM.

Parameters:
    Y : NxD array
    centroids : list
    ccov : list
    mc : list

Returns:
    res : tuple
Find K cluster centers in X using Expectation Maximization of Gaussian Mixtures.

Parameters:
    X : NxD array
    max_iter : int
    iter_call : callable
    delta_stop : float
    max_tries : int
    diag_add : float

Centroid initialization is given by *cluster_init*; the only available options are 'sample' and 'kmeans'. 'sample' selects random samples as centroids, and 'kmeans' calls kmeans to find the cluster centers.

Returns:
    center_list : list
    cov_list : list
    p_k : list
    logLL : list
Find K cluster centers in X using Expectation Maximization of Gaussian Mixtures.

Parameters:
    X : NxD array
    max_iter : int
    iter_call : callable
    delta_stop : float
    max_tries : int
    diag_add : float

Centroid initialization is given by *cluster_init*; the only available options are 'sample' and 'kmeans'. 'sample' selects random samples as centroids, and 'kmeans' calls kmeans to find the cluster centers.

Returns:
    center_list : list
    cov_list : list
    p_k : list
    logLL : list
Difference measures for each component of the GMM.

Parameters:
    centroids : list
    ccov : list
    p_k : list
    method : string, optional

Returns:
    diff : NxN np array
Returns x,y vectors corresponding to ellipsoid at standard deviation sdwidth.
Assigns each sample to one of the Gaussian clusters given.
Returns an array with numbers, 0 corresponding to the first cluster in the cluster list.
Finds the likelihood for a set of samples belonging to a Gaussian mixture model.
Returns the log likelihood.
Initialize a Gaussian Mixture Model (GMM). Generates a set of initial parameters for the GMM.

Returns:
    center_list : list
    cov_list : list
    p_k : list
Evaluates the PDF for the multivariate Gaussian mixture.

Parameters:
    centroids : list
    ccov : list
    mc : list
    individual : bool

Returns:
    prob : 1d np array
Evaluates the natural log of the PDF for the multivariate Gaussian distribution.

Parameters:
    X : np array
    MU : np array
    SIGMA : 2d np array

Returns:
    prob : 1d np array
Finds the marginal distribution p(X) for a GMM.

Parameters:
    X_idx : list
    centroids : list
    ccov : list
    mc : list

Returns:
    res : tuple
Evaluates the PDF for the multivariate Gaussian distribution.

Parameters:
    X : np array
    MU : np array
    SIGMA : 2d np array

Returns:
    prob : 1d np array
Examples
from pypr.clustering import *
from numpy import *
X = array([[0,0],[1,1]])
MU = array([0,0])
SIGMA = diag((1,1))
gmm.mulnormpdf(X, MU, SIGMA)
Predict the entries in X which contain NaNs.

Parameters:
    X : np array
    centroids : list
    ccov : list
    mc : list

Returns:
    Nothing; X is modified in place.
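A rough usage sketch of this prediction function follows; the function name gmm.predict is assumed here (the listing above does not show it), and cen_lst, cov_lst, p_k are taken to be previously fitted GMM parameters, so treat this as illustrative only:
import numpy as np
import pypr.clustering.gmm as gmm

# Two samples; the NaN entries are the values we want the GMM to fill in
X_missing = np.array([[np.nan, -1.0],
                      [2.5, np.nan]])
gmm.predict(X_missing, cen_lst, cov_lst, p_k)  # modifies X_missing in place
print(X_missing)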
Draw samples from a Mixture of Gaussians (MoG)

Parameters:
    centroids : list
    ccov : list
    mc : list

Returns:
    X : 2d np array
Examples
from pypr.clustering import *
from numpy import *
centroids=[array([10,10])]
ccov=[array([[1,0],[0,1]])]
samples = 10
gmm.sample_gaussian_mixture(centroids, ccov, samples=samples)