We now move to the question of how to cluster sequence and expression data simultaneously using multiple competing models. The prototype optimization algorithms (Expectation/Maximization for clustering, Gibbs sampling for alignment) have already been introduced, so it only remains to combine them. We will describe just one of several ways that this can be done.
We must introduce some more notation (the last!). Suppose that we have k sequence-expression models; the m'th model is $\mathcal{M}_m$ and its parameters are $\Theta_m$.
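As a purely illustrative sketch of this notation, the following Python fragment stores k such models. The particular parameterization (a position weight matrix for the sequence component and a Gaussian profile for the expression component) is an assumption made here for illustration, not a definition taken from the text.

from dataclasses import dataclass
import numpy as np

@dataclass
class SequenceExpressionModel:
    """Parameters Theta_m of one hypothetical sequence-expression model."""
    pwm: np.ndarray        # (motif_length, 4) column probabilities over A, C, G, T
    expr_mean: np.ndarray  # (n_conditions,) mean expression profile of the cluster
    expr_var: float        # within-cluster variance of the expression values

k = 3                      # number of competing models
rng = np.random.default_rng(0)
models = [
    SequenceExpressionModel(
        pwm=rng.dirichlet(np.ones(4), size=8),  # hypothetical 8-column motif
        expr_mean=rng.normal(size=10),          # hypothetical 10-condition profile
        expr_var=1.0,
    )
    for _ in range(k)
]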
We also allow a null sequence-expression model, $\mathcal{M}_0$. The null sequence likelihood is given by (6), and the null expression likelihood by the substitution in (3) that makes every gene its own cluster. The joint null likelihood is the product of these two likelihoods, by analogy with equation (11). The null sequence model generates independent, unrelated sequences, and the null expression model generates expression profiles, each of which is effectively an independent one-member cluster.
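The structure of this joint null likelihood can be sketched as follows. The stand-in densities (a uniform residue background for the sequence part and an independent Gaussian per condition for the expression part) are assumptions made for illustration, since equations (3), (6) and (11) are not reproduced in this section.

import numpy as np

def null_sequence_loglik(seq, background=None):
    # Stand-in for the null sequence likelihood of (6): independent residues
    # drawn from a background composition (uniform here, by assumption).
    if background is None:
        background = {b: 0.25 for b in "ACGT"}
    return float(sum(np.log(background[b]) for b in seq))

def null_expression_loglik(profile, noise_var=1.0):
    # Stand-in for the null expression likelihood obtained from (3): each
    # profile is scored on its own, here against an independent zero-mean
    # Gaussian per condition (an assumption made for illustration).
    profile = np.asarray(profile, dtype=float)
    return float(-0.5 * np.sum(profile ** 2 / noise_var
                               + np.log(2.0 * np.pi * noise_var)))

def null_joint_loglik(seq, profile):
    # Joint null likelihood = product of the two null likelihoods,
    # i.e. a sum in log space.
    return null_sequence_loglik(seq) + null_expression_loglik(profile)

print(null_joint_loglik("ACGTACGT", [0.3, -1.2, 0.5]))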
The basic job of our clustering algorithm is to assign each gene to a cluster. We handle this in our formalism by having a different multiplicities vector for each model, denoting the vector for the m'th model by $\mathbf{n}^{(m)}$ (and, for the null model, by $\mathbf{n}^{(0)}$). Then the act of assigning gene g to model $\mathcal{M}_m$ is equivalent to setting

$n_g^{(m)} = 1$ and $n_g^{(m')} = 0$ for all $m' \neq m$.   (14)
In our Gibbs/EM algorithm, the $n_g^{(m)}$ are the missing data. The E-step of the EM algorithm involves setting these missing data to their expectation values according to the posterior probability distribution over models for each gene, $P(\mathcal{M}_m \mid \text{sequence and expression data for gene } g)$ (i.e. multiplying equation (14) by this probability and summing over m).
In other words, as with k-means, we effectively force the models to compete to explain the genes. At the same time, by allowing the $n_g^{(m)}$ to be continuously valued expectations rather than zero or one, we allow for uncertainty where k-means does not. See e.g. Bishop95 for an alternative explanation of this approach.
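The E-step can be illustrated with a short sketch (not the paper's equations (12)-(13), which are not shown here): for a single gene, per-model joint log-likelihoods, with the null model included, are converted by Bayes' rule into posterior model probabilities, and these posteriors are the expected multiplicities. The uniform model prior and the example numbers are assumptions.

import numpy as np

def expected_multiplicities(joint_logliks, log_priors=None):
    # joint_logliks[0] is the null model; joint_logliks[1..k] are the k
    # sequence-expression models (all values are log P(data_g | model)).
    joint_logliks = np.asarray(joint_logliks, dtype=float)
    if log_priors is None:
        log_priors = np.zeros_like(joint_logliks)  # uniform prior over models
    log_post = joint_logliks + log_priors
    log_post -= log_post.max()          # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()            # Bayes' rule: posterior over models

# Hypothetical log-likelihoods for one gene; model 2 explains it best, so its
# expected multiplicity is closest to one, but none is forced to exactly 0 or 1.
print(expected_multiplicities([-120.0, -118.5, -115.2, -119.9]))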
We are now ready to describe our Gibbs/EM update procedure, which is:

1. Gibbs sampling step: sample a motif alignment for each sequence, conditional on the current model parameters.
2. Expectation step: set the missing multiplicities $n_g^{(m)}$ to their expectation values, i.e. the posterior probability that gene g belongs to each model (including the null model).
3. Maximization step: re-estimate the parameters $\Theta_m$ of each model from the sampled alignments and the expected multiplicities.
Relevant equations (apart from Bayes' rule) are (7), (8) and (10) for the Gibbs sampling step, (12) and (13) for the Expectation step (we can also introduce prior model probabilities here) and (5) for the Maximization step.
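A skeleton of how the three steps might be wired together is sketched below. The callables gibbs_sample_alignments, joint_loglik and maximize_parameters are hypothetical placeholders for the computations defined by equations (7), (8), (10), (11) and (5), which are not reproduced here; this shows only the shape of the loop, not an implementation of the procedure itself.

import numpy as np

def gibbs_em(genes, models, null_model,
             gibbs_sample_alignments,  # hypothetical: Gibbs step, eqs (7), (8), (10)
             joint_loglik,             # hypothetical: joint likelihood, eq (11)
             maximize_parameters,      # hypothetical: M-step, eq (5)
             n_iterations=50, seed=0):
    rng = np.random.default_rng(seed)
    multiplicities = None
    for _ in range(n_iterations):
        # Gibbs sampling step: sample motif alignments for each sequence
        # conditional on the current model parameters.
        alignments = gibbs_sample_alignments(genes, models, rng)

        # Expectation step: expected multiplicities are the posterior model
        # probabilities for each gene, with the null model in column 0.
        logliks = np.array([[joint_loglik(g, m, alignments)
                             for m in [null_model] + list(models)]
                            for g in genes])
        logliks -= logliks.max(axis=1, keepdims=True)
        post = np.exp(logliks)
        multiplicities = post / post.sum(axis=1, keepdims=True)

        # Maximization step: re-estimate each model's parameters from the
        # genes, weighted by its column of expected multiplicities
        # (the null model's parameters are held fixed).
        models = [maximize_parameters(genes, alignments, multiplicities[:, j + 1])
                  for j in range(len(models))]
    return models, multiplicities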
Another way of viewing this algorithm is as a mixture density estimation problem, in which a k-component mixture of models must be fitted to an observed scatter of sequence-expression data. Viewed this way, the oft-quoted limitation of k-means (that the parameter k must be known in advance) appears surmountable, since mixture density estimation with an unknown number of components is a familiar problem in Bayesian statistics.
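For instance, one standard (approximate) way to treat k as unknown is to fit the mixture for several candidate values and compare them with an information criterion such as BIC, which approximates Bayesian model comparison; the numbers below are made up purely for illustration and are not results from this method.

import numpy as np

def bic(total_loglik, n_free_params, n_genes):
    # Bayesian information criterion (lower is better): a large-sample
    # approximation to the log marginal likelihood used for model comparison.
    return -2.0 * total_loglik + n_free_params * np.log(n_genes)

# Hypothetical fits of the sequence-expression mixture for several k:
# (k, total log-likelihood, number of free parameters). Illustrative numbers only.
fits = [(2, -10450.0, 120), (3, -10230.0, 180), (4, -10215.0, 240)]
n_genes = 500
scores = {k: bic(ll, p, n_genes) for k, ll, p in fits}
best_k = min(scores, key=scores.get)
print(scores, "-> chosen k =", best_k)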