
Bayesian interpretation of k-means (3)


\begin{displaymath}\mbox{Pr}\left[ {\bf x}, \rho \,\vert\, \lambda, {\bf c} \right] \propto \exp\left( -\lambda \sum_i \sum_j \rho_j^{(i)} \vert {\bf x}_i - {\bf c}_j \vert^2 \right) \end{displaymath}


Since the exponent is quadratic in the ${\bf x}_i - {\bf c}_j$, the above probability density is proportional to a product of Gaussian radial basis functions. The ${\bf c}_j$ vectors are the basis function centres and $\lambda$ is an inverse variance parameter.
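Writing the exponential of the double sum as a double product makes the factorisation explicit (the same density as above, with the normalisation still dropped):

\begin{displaymath}\exp\left( -\lambda \sum_i \sum_j \rho_j^{(i)} \vert {\bf x}_i - {\bf c}_j \vert^2 \right) = \prod_i \prod_j \left[ \exp\left( -\lambda \vert {\bf x}_i - {\bf c}_j \vert^2 \right) \right]^{\rho_j^{(i)}} \end{displaymath}

Each factor is an unnormalised spherical Gaussian centred at ${\bf c}_j$ with variance $\sigma^2 = 1/(2\lambda)$, switched on or off by the multiplicity $\rho_j^{(i)}$.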


The multiplicities $\rho$ are missing data. The EM algorithm (Dempster, Laird & Rubin, 1977) gives a way to maximise likelihoods in the presence of missing data as follows. In the E-step, each multiplicity is replaced by its expected value given the current centres,

\begin{displaymath}\hat{\rho}_j^{(i)} = \frac{\exp\left( -\lambda \vert {\bf x}_i - {\bf c}_j \vert^2 \right)}{\sum_k \exp\left( -\lambda \vert {\bf x}_i - {\bf c}_k \vert^2 \right)} ; \end{displaymath}

in the M-step, each centre is re-estimated as the mean of the data weighted by these expected multiplicities,

\begin{displaymath}{\bf c}_j = \frac{\sum_i \hat{\rho}_j^{(i)} {\bf x}_i}{\sum_i \hat{\rho}_j^{(i)}} . \end{displaymath}

The two steps are alternated until the likelihood converges.
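These two updates translate directly into a short program. Below is a minimal NumPy sketch, assuming the softmax E-step and weighted-mean M-step written above; the name soft_kmeans, the fixed iteration count, and the argument names are illustrative rather than taken from the original notes.

\begin{verbatim}
import numpy as np

def soft_kmeans(X, C, lam, n_iters=50):
    # X: (N, d) data; C: (K, d) initial centres; lam: inverse-variance parameter.
    for _ in range(n_iters):
        # E-step: expected multiplicities rho[i, j] for each point/cluster pair.
        sq_dists = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # |x_i - c_j|^2
        logits = -lam * sq_dists
        logits -= logits.max(axis=1, keepdims=True)  # stabilise before exponentiating
        rho = np.exp(logits)
        rho /= rho.sum(axis=1, keepdims=True)
        # M-step: move each centre to the responsibility-weighted mean of the data.
        C = (rho.T @ X) / rho.sum(axis=0)[:, None]
    return C, rho
\end{verbatim}

Subtracting the row maximum before exponentiating leaves the responsibilities unchanged but avoids underflow when $\lambda$ is large.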

As $\lambda$ gets large, the expected multiplicities $\hat{\rho}_j^{(i)}$ tend to one for the cluster nearest to point $i$ and to zero for all others. So in the zero-variance limit $\lambda \to \infty$, this algorithm reduces to k-means (Bishop, 1995, ``Neural Networks for Pattern Recognition'').
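A quick check of this limit, assuming the softmax E-step written above: dividing numerator and denominator by the numerator gives

\begin{displaymath}\hat{\rho}_j^{(i)} = \left[ \sum_k \exp\left( -\lambda \left( \vert {\bf x}_i - {\bf c}_k \vert^2 - \vert {\bf x}_i - {\bf c}_j \vert^2 \right) \right) \right]^{-1} . \end{displaymath}

If ${\bf c}_j$ is the unique nearest centre to ${\bf x}_i$, every difference in the exponent is positive for $k \neq j$, those terms vanish as $\lambda \to \infty$, and $\hat{\rho}_j^{(i)} \to 1$; if some other centre is strictly nearer, the corresponding term diverges and $\hat{\rho}_j^{(i)} \to 0$. The E-step then performs exactly the hard nearest-centre assignment of k-means, and the M-step becomes the usual cluster-mean update.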



