
The expression model, ${\cal E}$

In the k-means algorithm, each cluster may be considered to have a single vector-valued parameter, the ``center'' of the cluster. The goodness-of-fit of a cluster is measured by an error function defined as the total squared distance (i.e. the sum of squared differences) from the cluster center to each of the member points [Tavazoie et al.1999]. The optimal position for the cluster center is clearly the mean (centroid) of all the member points.
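To make this concrete, the following sketch (in Python with NumPy; the function names are ours) computes the per-cluster error and the centroid that minimizes it:

import numpy as np

def cluster_error(points, center):
    """Total squared distance from `center` to each member point."""
    return np.sum((points - center) ** 2)

def centroid(points):
    """The error-minimizing center: the mean of the member points."""
    return points.mean(axis=0)

# Example: the centroid attains a lower error than an arbitrary center.
points = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
assert cluster_error(points, centroid(points)) <= cluster_error(points, np.zeros(2))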

We seek a likelihood model ${\cal E}$ that is analogous to a cluster. Comparing the sum-of-squares error function to a log-likelihood, the most obvious candidate would be a multivariate Gaussian density, i.e.


 
\begin{displaymath}
\mbox{Pr}\left[\vec{E}^{(g)} \pm \Delta \,\vert\, {\cal E}, \vec{\mu}, \sigma \right] \simeq
\left(2\pi\sigma^2\right)^{-X/2} \exp\left[ -\frac{\vert\vec{E}^{(g)} - \vec{\mu}\vert^2}{2\sigma^2} \right] \cdot (2\Delta)^X
\quad\quad (1)
\end{displaymath}

where $\vec{\mu}$ and $\sigma$ are the (vector-valued) mean and (scalar-valued) standard deviation of the cluster of expression profiles generated by the model ${\cal E}$. Note that $\vec{\mu}$ is an adjustable cluster-specific parameter (equivalent to the cluster centroid vector in the k-means algorithm) whereas $\sigma$ is treated as a fixed, global parameter.

The $\Delta$'s arise because, strictly, we have to integrate over an (arbitrarily small) volume of $\vec{E}^{(g)}$-space to get a finite probability; from now on, we will drop these terms from our notation.
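For illustration, the log of the density in equation (1), with the $(2\Delta)^X$ term dropped, can be evaluated directly; this is a minimal sketch rather than code from the original work:

import numpy as np

def log_lik_spherical(E_g, mu, sigma):
    """Log of equation (1) for one expression vector, dropping the Delta term.
    E_g and mu are length-X arrays; sigma is the fixed, global scalar."""
    X = E_g.size
    return -0.5 * X * np.log(2.0 * np.pi * sigma ** 2) \
           - np.sum((E_g - mu) ** 2) / (2.0 * sigma ** 2)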

The connection to k-means can be made more explicit. Suppose that each of the observed $\vec{E}^{(g)}$ vectors was generated by one of $k$ alternative models, and denote by ${\cal E}^{(m^{(g)})}$ the model that gave rise to the $g$'th vector. If we treat the $m^{(g)}$ as missing data and apply the EM algorithm, the k-means algorithm emerges naturally in the limit $\sigma \rightarrow 0$, since the posterior probability $\mbox{Pr}\left[{\cal E}^{(m^{(g)})}\vert\vec{E}^{(g)},\ldots\right]$ tends to one for the cluster nearest to $\vec{E}^{(g)}$ and to zero otherwise [Bishop1995]. We will elaborate on this treatment below.
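A small sketch of this limiting behaviour (assuming equal mixing proportions for the $k$ models; the function name is ours): the posterior responsibilities harden onto the nearest center as $\sigma \rightarrow 0$, which is exactly the k-means assignment step.

import numpy as np

def responsibilities(E_g, centers, sigma):
    """Posterior Pr[model m | E^(g)] for k spherical Gaussian clusters
    with equal mixing proportions (the E-step of EM)."""
    sq_dist = np.sum((centers - E_g) ** 2, axis=1)
    log_w = -sq_dist / (2.0 * sigma ** 2)
    w = np.exp(log_w - log_w.max())      # subtract the max for numerical stability
    return w / w.sum()

centers = np.array([[0.0, 0.0], [4.0, 4.0]])
E_g = np.array([1.0, 1.0])
print(responsibilities(E_g, centers, sigma=2.0))   # a soft assignment
print(responsibilities(E_g, centers, sigma=0.1))   # essentially one-hot: the k-means limit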

We can generalize equation (1) by replacing the variance $\sigma^2$ with a covariance matrix ${\bf N}$ (for ``noise'', since the scale of this matrix should reflect the expected dispersion of the cluster):


 
\begin{displaymath}
\mbox{Pr}\left[\vec{E}^{(g)} \,\vert\, {\cal E}, \vec{\mu}, {\bf N}\right] =
\left(2\pi\vert{\bf N}\vert\right)^{-X/2} \exp\left[ -\frac{1}{2}(\vec{E}^{(g)} - \vec{\mu})^T {\bf N}^{-1} (\vec{E}^{(g)} - \vec{\mu}) \right]
\quad\quad (2)
\end{displaymath}

We complete the generalized model by writing down a prior distribution for the adjustable parameter $\vec{\mu}$. We use another multivariate Gaussian, with mean $\vec{\nu}$ and covariance matrix ${\bf C}$:


 
\begin{displaymath}
\mbox{Pr}\left[\vec{\mu} \,\vert\, \vec{\nu}, {\bf C}\right] =
\left(2\pi\vert{\bf C}\vert\right)^{-X/2} \exp\left[ -\frac{1}{2}(\vec{\mu} - \vec{\nu})^T {\bf C}^{-1} (\vec{\mu} - \vec{\nu}) \right]
\quad\quad (3)
\end{displaymath}
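Both (2) and (3) are multivariate Gaussian densities, so a single helper evaluates either one; this is a sketch using the usual multivariate Gaussian normalization:

import numpy as np

def gaussian_log_density(x, mean, cov):
    """Log of a multivariate Gaussian density; used with cov = N for the
    likelihood (2) and with cov = C for the prior (3)."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (x.size * np.log(2.0 * np.pi) + logdet + quad)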

For later convenience, we also introduce a ``multiplicity'' $\rho^{(g)}$ for each gene, representing the number of exact duplicates of that gene's profile in the dataset (one can also think of this value as a weight for the gene). The total log-probability (including the likelihood and the prior) is then


 
\begin{displaymath}
{\cal L}\left[\vec{\bf E}, \vec{\mu} \,\vert\, \Theta, \vec{\rho}\right] =
\log{\mbox{Pr}\left[\vec{\mu} \,\vert\, \vec{\nu}, {\bf C}\right]}
+ \sum_{g=1}^G \rho^{(g)} \log{\mbox{Pr}\left[\vec{E}^{(g)} \,\vert\, {\cal E}, \vec{\mu}, {\bf N}\right]}
\quad\quad (4)
\end{displaymath}

where $\Theta = \{{\bf N}, \vec{\nu}, {\bf C}\}$ is the global expression model parameter set, $\vec{\rho} = \{ \rho^{(g)} \}$ is the multiplicities vector and we have introduced the shorthand $\vec{\bf E}$ to represent the entire set of expression vectors $\{\vec{E}^{(g)}\}$. Note that the vector $\vec{\rho}$ is indexed by gene g, as contrasted with vectors like $\vec{\mu}$ or $\vec{\nu}$ which are indexed by experiment x.
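Using the gaussian_log_density helper sketched above, equation (4) can be assembled as follows (a sketch; E is a $G \times X$ array of expression vectors and rho the length-$G$ multiplicity vector):

import numpy as np

def total_log_prob(E, rho, mu, nu, C, N):
    """Equation (4): log prior on mu plus the multiplicity-weighted sum of
    per-gene log-likelihoods."""
    log_prior = gaussian_log_density(mu, nu, C)
    log_lik = sum(r * gaussian_log_density(E_g, mu, N) for r, E_g in zip(rho, E))
    return log_prior + log_lik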

The posterior distribution of the signature expression profile for the cluster, $\mbox{Pr}\left[\vec{\mu} \vert \vec{\bf E}, ...\right]$, is also a multivariate Gaussian. From (4), the optimal value for $\vec{\mu}$ is given by the matrix equation:


 
\begin{displaymath}
\vec{\mu}^{\mbox{\small (opt)}} =
\left( {\bf C}^{-1} + \sum_{g=1}^G \rho^{(g)} {\bf N}^{-1} \right)^{-1}
\left( {\bf C}^{-1} \vec{\nu} + {\bf N}^{-1} \sum_{g=1}^G \rho^{(g)} \vec{E}^{(g)} \right)
\quad\quad (5)
\end{displaymath}

In the limit of large G (or, strictly speaking, as $\sum \rho^{(g)}$ tends to infinity), this tends towards the $\rho^{(g)}$-weighted centroid of the $\vec{E}^{(g)}$ vectors, as with k-means.
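Equation (5) translates directly into a few lines of linear algebra; this is a sketch, exploiting the fact that ${\bf N}^{-1}$ does not depend on $g$, so that $\sum_g \rho^{(g)} {\bf N}^{-1} = (\sum_g \rho^{(g)})\,{\bf N}^{-1}$:

import numpy as np

def mu_opt(E, rho, nu, C, N):
    """Equation (5): the posterior-optimal cluster profile mu.
    E is G x X, rho is length G, nu is length X, C and N are X x X."""
    C_inv = np.linalg.inv(C)
    N_inv = np.linalg.inv(N)
    A = C_inv + rho.sum() * N_inv                            # C^{-1} + sum_g rho^(g) N^{-1}
    b = C_inv @ nu + N_inv @ (rho[:, None] * E).sum(axis=0)  # C^{-1} nu + N^{-1} sum_g rho^(g) E^(g)
    # As rho.sum() grows, the C^{-1} terms become negligible and the result
    # approaches the rho-weighted centroid of the E^(g) vectors, as with k-means.
    return np.linalg.solve(A, b)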

Can we find an interpretation for these Gaussian priors? Recalling that $E^{(g)}_x$ is the expression level at time $t_x$, equation (3) implies that the probability distribution over all possible cluster expression profiles $\{ \mu(t) \}$ is uniquely specified by the covariance between any two points, i.e. the entries of the covariance matrix:


\begin{displaymath}C_{xy} = \left\langle (\mu(t_x) - \nu(t_x)) \cdot (\mu(t_y) - \nu(t_y)) \right\rangle_{\{\mu(t)\}} \end{displaymath}

Such probability distributions over function spaces are described by the theory of Gaussian processes (GPs), mathematical models for randomly varying signals [MacKay1997]. A GP has the property that if it is sampled at any finite number of points, the joint distribution of the sampled levels is a multivariate Gaussian. This is because a GP may be viewed as an infinite-dimensional Gaussian distribution (one dimension for every possible value of $t$), and the marginal distribution over any subset of the dimensions of a multivariate Gaussian is itself Gaussian.
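To make this concrete, here is a sketch (our own function names) of sampling a GP at a finite set of time points: the covariance function is evaluated pairwise to build a covariance matrix, and samples are drawn from the corresponding multivariate Gaussian.

import numpy as np

def gp_sample(times, cov_fn, n_samples=3, jitter=1e-9, rng=None):
    """Draw samples from a zero-mean GP evaluated at `times`, where
    `cov_fn(tx, ty)` returns the covariance between two time points."""
    rng = np.random.default_rng() if rng is None else rng
    K = np.array([[cov_fn(tx, ty) for ty in times] for tx in times])
    K += jitter * np.eye(len(times))   # small jitter for numerical stability
    return rng.multivariate_normal(np.zeros(len(times)), K, size=n_samples)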

An almost-trivial example is white noise: if samples are taken at a set of time points, each sampled value will be independently normally distributed. In this case, the covariance matrix is just the identity matrix. This is therefore the implicit process underlying equation (1) and the k-means algorithm.
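In the sketch above, white noise corresponds to a covariance function that is nonzero only at coincident time points, giving an identity covariance matrix (up to a variance scale):

import numpy as np

def white_noise_cov(tx, ty, variance=1.0):
    """White noise: independent values at distinct time points."""
    return variance if np.isclose(tx, ty) else 0.0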

The great boon of Gaussian processes is the availability of well-studied covariance functions corresponding to e.g. Brownian motion, short-range fluctuations, quasi-periodic oscillations or indeed any given power spectrum [Abrahamsen1997]. Covariance functions appropriate to a system of interest can be chosen: for example, when clustering cell-cycle transcript levels, one might choose a ${\bf C}$ matrix representing a weakly periodic prior for the mean cluster signal and an ${\bf N}$ matrix representing quickly-damped fluctuations from this signal, plus uncorrelated experimental errors, for the individual gene profiles. In the event that the experimental protocol induces correlated errors, these can also be modeled in this framework. Suitable covariance functions for this example are described in [MacKay1997].
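As an illustration of this kind of choice (the specific kernels and parameter values below are our own illustrative assumptions, not those of the cited references), a weakly periodic covariance could model the cluster signal and a quickly-damped covariance plus uncorrelated noise the per-gene deviations:

import numpy as np

def quasi_periodic_cov(tx, ty, period=80.0, length=1.0, amplitude=1.0):
    """Weakly periodic prior covariance for the cluster signal mu(t)."""
    return amplitude * np.exp(-2.0 * np.sin(np.pi * (tx - ty) / period) ** 2 / length ** 2)

def damped_plus_noise_cov(tx, ty, length=5.0, amplitude=0.5, noise_var=0.1):
    """Quickly-damped fluctuations about the cluster signal, plus
    uncorrelated experimental error on the diagonal."""
    damped = amplitude * np.exp(-((tx - ty) ** 2) / (2.0 * length ** 2))
    return damped + (noise_var if np.isclose(tx, ty) else 0.0)

# Build C and N for a set of sampling times, for use in equations (2)-(5).
times = np.linspace(0.0, 160.0, 17)
C = np.array([[quasi_periodic_cov(tx, ty) for ty in times] for tx in times])
N = np.array([[damped_plus_noise_cov(tx, ty) for ty in times] for tx in times])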

