
Implementation

We have developed a program, called kimono, to implement the Gibbs/EM algorithm for sequence-expression clustering with competing ${\cal SE}$ models. The source code for kimono is freely available, under the terms of the GNU General Public License [GPL2000], from the following URL:

http://www.okchicken.com/~yam/kimono/

kimono implements all the features described in this abstract, including arbitrary covariance priors, null sequence and expression models, prior probabilities for models, reverse strand sampling, multiple re-seeding, and the déjà vu heuristic. It also offers some useful features not covered here, including automatic null model optimization and an experimental implementation of a simulated annealing version of Gibbs/EM.

The program is easy to use: it understands standard file formats (FASTA format for sequence data [Pearson & Lipman1988], Stanford format tab-delimited tables for expression data [Spellman et al.1998]). In addition, it uses a customizable multi-priority logging system and displays friendly help messages and errors.
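For concreteness, both input formats can be read with a few lines of C++. The following is an illustrative sketch only (kimono's own parsers are more robust); it assumes gene names in the FASTA headers and in the first tab-delimited column:

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Minimal FASTA reader: returns name -> sequence.
std::map<std::string, std::string> read_fasta(std::istream& in) {
    std::map<std::string, std::string> seqs;
    std::string line, name;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        if (line[0] == '>') {
            // Header line: take the first word after '>' as the gene name.
            name = line.substr(1, line.find_first_of(" \t") - 1);
        } else if (!name.empty()) {
            seqs[name] += line;  // sequences may span several lines
        }
    }
    return seqs;
}

// Minimal tab-delimited expression table reader:
// first column = gene name, remaining columns = expression values.
std::map<std::string, std::vector<double>> read_expression(std::istream& in) {
    std::map<std::string, std::vector<double>> expr;
    std::string line;
    std::getline(in, line);  // skip header row of experiment names
    while (std::getline(in, line)) {
        std::istringstream row(line);
        std::string name, field;
        std::getline(row, name, '\t');
        std::vector<double> values;
        while (std::getline(row, field, '\t'))
            values.push_back(std::stod(field));
        expr[name] = values;
    }
    return expr;
}
```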

The acronym kimono stands for k-means integrated models for oligonucleotide arrays. We note, however, that nothing in the present formalism prohibits the use of quantitative expression data from large-scale functional genomics experiments using technologies other than microarrays.

kimono is implemented in C++ using the Standard Template Library and compiles cleanly on a Red Hat 6.0 GNU/Linux system using the gcc compiler, version egcs-2.91.66.

Figure 1 shows sample paths of log-likelihood over time, illustrating kimono's typical behavior on simulated data. The paths shown were selected for illustrative purposes from ten randomly seeded runs on each of four simulated datasets. The variables distinguishing the four plots are the expression signal-to-noise ratio $\varepsilon$, defined by


\begin{displaymath}\varepsilon = \mbox{tr} ({\bf C}) / \mbox{tr} ({\bf N}) \end{displaymath}

and the sequence signal-to-noise ratio $\psi$, defined by


\begin{displaymath}\log{(L\psi)} = \sum_j \sum_n q_{jn} \log{\frac{q_{jn}}{p_n}} \end{displaymath}

These signal-to-noise ratios refer to the actual parameters used in generating the data. The kimono program was run using its default settings, except for k, which was set to 3 to match the simulated data. The default kimono parameters are the same as the simulation parameters listed in the caption to Figure 1, with the exception of the ${\bf C}$ and ${\bf N}$ matrices (which are equal to the identity matrix by default); also, the pseudocounts $D_n$ are set to $0.1 p_n$, where the null nucleotide probabilities $p_n$ are automatically estimated from the composition of the sequence database.


Figure: Sample paths of total log-likelihood versus number of iterations, illustrating kimono's characteristic behavior on simulated data generated with number of experiments X = 10, number of genes G = 24, number of models k = 3, nucleotide probability $p_n = \frac{1}{4}$, motif length $\eta = 10$, expected expression profile $\vec{\nu} = \vec{0}$ and covariance matrices ${\bf N}$ and ${\bf C}$ equal to multiples of the identity matrix. The kimono parameters were left at their default settings, except for k, which was set to 3. The Gibbs/EM algorithm was halted when the correct solution was approached to within a mildly error-tolerant margin. Different paths in each graph correspond to different random seeds. a: sequence length L = 110, expression signal-to-noise $\varepsilon = 50$, sequence signal-to-noise $\psi \simeq 350$. b: L = 110, $\varepsilon = 20$, $\psi \simeq 350$. c: L = 230, $\varepsilon = 50$, $\psi \simeq 170$. d: L = 230, $\varepsilon = 20$, $\psi \simeq 170$.
\begin{figure*}\psfig{figure=xmgr2.eps,width={\textwidth},clip=} \par\end{figure*}


Table: kimono performance on a 450MHz Red Hat Linux Pentium system. In each round, the alignment of every sequence to every model was sampled once. The simulation parameters were: number of genes G = 24, number of models k = 3, nucleotide probability $p_n = \frac{1}{4}$, motif length $\eta = 10$, expected expression profile $\vec{\nu} = \vec{0}$ and covariance matrices ${\bf N}$ and ${\bf C}$ equal to multiples of the identity matrix, with expression signal-to-noise $\varepsilon = 50$. Sampling was terminated after 1000 rounds regardless of convergence.

  Sequence   Rounds to        Time ($\mu$s) per
  length     convergence      residue per cluster
  ------------------------------------------------
  20         12 $\pm$ 6       29 $\pm$ 4
  30         36 $\pm$ 19      26 $\pm$ 2
  50         49 $\pm$ 23      28 $\pm$ 0.8
  70         61 $\pm$ 25      30 $\pm$ 0.6
  110        66 $\pm$ 18      29 $\pm$ 0.2
  150        155 $\pm$ 36     28 $\pm$ 0.07
  230        412 $\pm$ 55     28 $\pm$ 0.06
  310        674 $\pm$ 47     28 $\pm$ 0.05
  470        984 $\pm$ 13     27 $\pm$ 0.05
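The two signal-to-noise definitions translate directly into code. The following sketch is ours, not kimono's; for simplicity it assumes diagonal covariances, a motif weight matrix q indexed by column j and nucleotide n, and null frequencies p:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Expression signal-to-noise: epsilon = tr(C) / tr(N),
// with C and N represented here by their diagonals only.
double expression_snr(const std::vector<double>& diagC,
                      const std::vector<double>& diagN) {
    double trC = 0, trN = 0;
    for (double c : diagC) trC += c;
    for (double n : diagN) trN += n;
    return trC / trN;
}

// Sequence signal-to-noise psi, from
// log(L * psi) = sum_j sum_n q_jn log(q_jn / p_n),
// i.e. psi = exp(relative entropy of the motif) / L.
double sequence_snr(const std::vector<std::vector<double>>& q,
                    const std::vector<double>& p, double L) {
    double rel_entropy = 0;
    for (const auto& column : q)
        for (std::size_t n = 0; n < column.size(); ++n)
            if (column[n] > 0)
                rel_entropy += column[n] * std::log(column[n] / p[n]);
    return std::exp(rel_entropy) / L;
}
```

For example, a fully deterministic two-column motif over a uniform null model ($p_n = \frac{1}{4}$) has relative entropy $2\log 4 = \log 16$, so at L = 4 this gives $\psi = 4$.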

The graphs in Figure 1 exhibit several features characteristic of this algorithm. Whereas the Gibbs sampler for a single alignment initially spends most of its time exploring a more-or-less flat log-likelihood landscape, or ``plateau'', before finding a seed for the true signal and then rapidly converging to the global optimum [Lawrence et al.1993], the multiple-model sampler steps through several plateaux at different levels as the distinct clusters converge semi-independently on their respective solutions. The convergence of different clusters is not expected to be completely independent, as the arrival of each cluster at its optimal alignment helps to fix some of the weights $\vec{\rho}$ correctly, and thus limits the introduction of noise into the other alignments.
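The basic operation repeated on every plateau, drawing an alignment position or model assignment in proportion to its likelihood, can be sketched as follows. The log-sum-exp normalization is a standard numerical-stability trick and an assumption of this sketch, not necessarily what kimono does internally:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Core Gibbs step shared by single- and multiple-model samplers:
// draw index i with probability proportional to exp(logw[i]).
// Subtracting the maximum first keeps the exponentials from underflowing.
std::size_t sample_from_log_weights(const std::vector<double>& logw,
                                    std::mt19937& rng) {
    double maxw = *std::max_element(logw.begin(), logw.end());
    std::vector<double> w(logw.size());
    for (std::size_t i = 0; i < logw.size(); ++i)
        w[i] = std::exp(logw[i] - maxw);
    std::discrete_distribution<std::size_t> pick(w.begin(), w.end());
    return pick(rng);
}
```

In the multiple-model setting each gene requires one such draw per model, which is the source of the extra sampling cost discussed below.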

For the data shown, the simulated expression signal-to-noise ratio $\varepsilon$ is quite large, so the assignments of genes to their correct clusters are reasonably well constrained. However, as $\varepsilon$ or $\psi$ is reduced, the clustering becomes more uncertain, with the observable effect that the plateaux become bumpier and start to blur into one another. One corollary is that the algorithm occasionally steps downwards to less favorable plateaux. This will happen (albeit less frequently) even in the simple case, where a single plateau is separated from the optimum by a large jump in log-likelihood. This behavior can be seen in plot c of Figure 1.

A characteristic feature of Gibbs sampling in general is that the distribution of convergence times has a long tail. This is apparently due to the algorithm occasionally getting stuck in an appealing local optimum. Convergence is generally slower when $\varepsilon$ and $\psi$ are small.

An idea of the running speed of kimono can be gained by studying Table 1. Extrapolating to 30 clusters (the number used by Tavazoie et al. [Tavazoie et al.1999]) and the full complement of roughly 6,000 yeast ORFs, convergence is expected in approximately one month on a 450MHz Pentium system. Our procedure is inherently slower than the two-stage algorithm of Tavazoie et al., since their algorithm samples one alignment per gene where ours must sample k such alignments. It would be straightforward to parallelize this extra sampling duty on a multiprocessor system. At the time of writing, there is also plenty of room for optimizing the C++ code.
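The one-month figure can be checked with a back-of-envelope calculation. The per-ORF upstream sequence length of 600 bp is our assumption for illustration; the other numbers come from Table 1 and the text:

```cpp
#include <cassert>
#include <cmath>

// Back-of-envelope extrapolation of Table 1 to a genome-scale run.
// Assumed inputs: number of ORFs, upstream bp per ORF (600 bp is our
// assumption), number of clusters, microseconds per residue per cluster
// per round (~28, from Table 1), and rounds to convergence (~1000).
double estimated_days(double orfs, double bp_per_orf, double clusters,
                      double usec_per_residue_cluster, double rounds) {
    double seconds_per_round =
        orfs * bp_per_orf * clusters * usec_per_residue_cluster * 1e-6;
    return seconds_per_round * rounds / 86400.0;
}
```

With 6,000 ORFs, 600 bp each, 30 clusters, 28 $\mu$s and 1000 rounds, each round takes about 3,000 seconds and the whole run about 35 days, consistent with the "approximately one month" figure quoted above.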

The two-stage algorithm is a special case of our algorithm. It can be obtained formally by scaling the matrices ${\bf C}$ and ${\bf N}$ by a parameter $\lambda$ and, with some probability p, sampling the alignment for each gene against only the most plausible model. As $\lambda \rightarrow 0$ and $p \rightarrow 1$, the two-stage algorithm is recovered.
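The $\lambda \rightarrow 0$ limit can be illustrated with a toy soft-assignment function. Treating the scaled log-likelihoods as a softmax with temperature $\lambda$ is our simplification for illustration, not kimono's actual update: for Gaussian terms, dividing the covariances by the same factor scales the data-dependent part of each log-likelihood roughly as $1/\lambda$, so small $\lambda$ collapses the soft weights onto the single most plausible model, recovering a hard, two-stage-style assignment.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Soft model weights with temperature lambda:
// w_i proportional to exp(logL_i / lambda), normalized to sum to 1.
// As lambda -> 0 this tends to a one-hot vector on the best model
// (hard assignment); as lambda -> infinity it tends to uniform.
std::vector<double> model_weights(const std::vector<double>& logL,
                                  double lambda) {
    double maxL = *std::max_element(logL.begin(), logL.end());
    std::vector<double> w(logL.size());
    double z = 0;
    for (std::size_t i = 0; i < logL.size(); ++i) {
        w[i] = std::exp((logL[i] - maxL) / lambda);
        z += w[i];
    }
    for (double& x : w) x /= z;
    return w;
}
```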



2000-04-26