We have developed a program, called kimono, to implement the Gibbs/EM algorithm for sequence-expression clustering with competing models. The source code for kimono is freely available, under the terms of the GNU Public License [GPL2000], from the following URL:
http://www.okchicken.com/~yam/kimono/
kimono implements all the features described in this abstract, including arbitrary covariance priors, null sequence and expression models, prior probabilities for models, reverse strand sampling, multiple re-seeding and the déjà vu heuristic. It also offers some useful features not covered here, including automatic null model optimization and an experimental implementation of a simulated annealing version of Gibbs/EM.
The program is easy to use: it understands standard file formats (FASTA format for sequence data [Pearson & Lipman1988], Stanford-format tab-delimited tables for expression data [Spellman et al.1998]). In addition, it uses a customizable multi-priority logging system and displays friendly help and error messages.
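For illustration, a minimal sketch of the two kinds of input file is shown below. The gene names and expression values are invented, and only the basic layout is shown: a FASTA file is a series of '>' header lines, each followed by sequence, while the expression table has one tab-delimited row per gene with one column per experimental condition (the full Stanford format allows further annotation columns, which are omitted here).

    >GENE1
    ACGTGTCGATCGGCTAGCTAGGCTAATCG
    >GENE2
    TTGACGCATGCATGCTAGCTAGCATCGAT

    ORF       cond1    cond2    cond3
    GENE1      0.41    -1.20     0.85
    GENE2     -0.37     0.92    -0.15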
The acronym kimono stands for k-means integrated models for oligonucleotide arrays. We would however like to point out that nothing in the present formalism prohibits the use of quantitative expression data from large-scale functional genomics experiments using technologies other than microarrays.
kimono is implemented in C++ using the Standard Template Library and compiles cleanly on a RedHat 6.0 GNU/Linux system using the gcc compiler, version egcs-2.91.66.
Figure 1 shows sample paths of log-likelihood over time illustrating kimono's typical behavior on simulated data. The paths shown were selected for illustrative purposes from ten randomly seeded runs on each of four simulated datasets. The four plots are distinguished by two quantities of the simulated data: the expression signal-to-noise ratio and the sequence signal-to-noise ratio.
These signal-to-noise ratios refer to the actual parameters used in generating the data. The kimono program was run using its default settings, except for k, which was set to 3 to match the simulated data. The default kimono parameters are the same as the simulation parameters listed in the caption to Figure 1, with the exception of the covariance prior matrices (which are equal to the identity matrix by default); also, the pseudocounts $D_n$ are set to $0.1 p_n$, where the null nucleotide probabilities $p_n$ are automatically estimated from the composition of the sequence database.
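As an illustration of this default, the following C++ sketch shows one way the null nucleotide probabilities and the corresponding pseudocounts could be derived from the composition of a sequence database. The function names are hypothetical and this is not kimono's actual code; it simply implements the rule $D_n = 0.1 p_n$ described above.

    #include <array>
    #include <string>
    #include <vector>

    // Estimate null nucleotide probabilities p_n from the composition of the
    // sequence database (counts of A, C, G, T over all sequences).
    std::array<double, 4> nullProbabilities(const std::vector<std::string>& seqs) {
        std::array<double, 4> counts = {0.0, 0.0, 0.0, 0.0};  // A, C, G, T
        double total = 0.0;
        for (const std::string& s : seqs) {
            for (char c : s) {
                switch (c) {
                    case 'A': case 'a': counts[0] += 1; break;
                    case 'C': case 'c': counts[1] += 1; break;
                    case 'G': case 'g': counts[2] += 1; break;
                    case 'T': case 't': counts[3] += 1; break;
                    default: continue;  // skip ambiguity codes, gaps, etc.
                }
                total += 1;
            }
        }
        std::array<double, 4> p;
        for (int n = 0; n < 4; ++n)
            p[n] = (total > 0) ? counts[n] / total : 0.25;
        return p;
    }

    // Default pseudocounts: D_n = 0.1 * p_n, as described in the text.
    std::array<double, 4> defaultPseudocounts(const std::array<double, 4>& p) {
        std::array<double, 4> D;
        for (int n = 0; n < 4; ++n)
            D[n] = 0.1 * p[n];
        return D;
    }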
The graphs in Figure 1 exhibit several features characteristic of this algorithm. The Gibbs sampler for a single alignment initially spends most of its time exploring a more-or-less flat log-likelihood landscape, or "plateau", before finding a seed for the true signal and then rapidly converging to the global optimum [Lawrence et al.1993]. The multiple-model sampler, by contrast, steps through several plateaux at different levels as the distinct clusters converge semi-independently on their respective solutions. The convergence of different clusters is not expected to be completely independent, as the arrival of each cluster at its optimal alignment helps to fix some of the weights correctly, and thus limits the introduction of noise into the other alignments.
For the data shown, the simulated expression signal-to-noise ratio is quite large, so the assignments of genes to their correct clusters are reasonably well constrained. However, as either signal-to-noise ratio is reduced, the clustering becomes more uncertain, with the observable effect that the plateaux become bumpier and start to blur into one another. One corollary is that the algorithm is sometimes seen to step downwards to less favorable plateaux. Of course, this will happen (albeit less frequently) even in the simple case, when there is only one plateau separated from the optimum by a large jump in the log-likelihood. This behavior can be seen in plot c of Figure 1.
A characteristic feature of Gibbs sampling in general is that the distribution of convergence times has a long tail. This is apparently due to the algorithm occasionally getting stuck in an appealing local optimum. Convergence is generally slower when the signal-to-noise ratios are small.
An idea of the running speed of kimono can be gained by studying Table 1. Extrapolating to 30 clusters (the number used by [Tavazoie et al.1999]) and the full complement of roughly 6,000 yeast ORFs, convergence is expected in approximately one month on a 450MHz Pentium system. Our procedure is inherently slower than the two-stage algorithm of [Tavazoie et al.1999], since their algorithm samples one alignment per gene whereas ours must sample k such alignments. It would be straightforward to parallelize this extra sampling duty on a multiprocessor system. At the time of writing, there is also plenty of room for optimizing the C++ code.
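To make the parallelization point concrete, here is a minimal C++ sketch (not kimono's actual code) of how the k per-gene alignment-sampling tasks could be distributed across threads. The Alignment record and sampleAlignment function are hypothetical placeholders standing in for kimono's internal Gibbs sampling step.

    #include <thread>
    #include <vector>

    struct Alignment { int offset; bool reverseStrand; };

    // Placeholder for the per-model Gibbs alignment-sampling step: a real
    // implementation would resample the motif position (and strand) for this
    // gene under the given model. The dummy body keeps the sketch self-contained.
    Alignment sampleAlignment(int geneIndex, int modelIndex) {
        return Alignment{ (geneIndex + modelIndex) % 10, false };
    }

    // The multiple-model sampler must sample one alignment of the current gene
    // under each of the k competing models; these k tasks are independent given
    // the current state, so they can be farmed out to separate threads.
    std::vector<Alignment> sampleAllModels(int geneIndex, int k) {
        std::vector<Alignment> result(k);
        std::vector<std::thread> workers;
        workers.reserve(k);
        for (int m = 0; m < k; ++m) {
            workers.emplace_back([&result, geneIndex, m] {
                result[m] = sampleAlignment(geneIndex, m);
            });
        }
        for (std::thread& t : workers) t.join();
        return result;
    }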
The two-stage algorithm is a special case of our algorithm. It can be obtained formally by scaling the covariance matrices by some parameter and by sampling the alignment for each gene from only the most plausible model with some probability p; as the scaling parameter is taken to its limit and p approaches 1, the two-stage algorithm is approached.
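As a sketch of the second ingredient (sampling the alignment for each gene from only the most plausible model with probability p), the following hypothetical C++ fragment interpolates between the full sampler (p = 0) and hard, two-stage-style assignment (p close to 1). The model likelihoods and random-number interface are placeholders, not kimono's actual API.

    #include <algorithm>
    #include <random>
    #include <vector>

    // Choose which model's alignment to resample for a gene. With probability p,
    // pick only the most plausible model (hard, two-stage-like behavior);
    // otherwise sample a model in proportion to its likelihood, as in the full
    // multiple-model Gibbs sampler.
    int chooseModel(const std::vector<double>& modelLikelihoods, double p,
                    std::mt19937& rng) {
        std::uniform_real_distribution<double> unit(0.0, 1.0);
        if (unit(rng) < p) {
            // Hard assignment: the most plausible model only.
            return static_cast<int>(std::distance(
                modelLikelihoods.begin(),
                std::max_element(modelLikelihoods.begin(), modelLikelihoods.end())));
        }
        // Soft assignment: sample proportionally to the model likelihoods.
        std::discrete_distribution<int> soft(modelLikelihoods.begin(),
                                             modelLikelihoods.end());
        return soft(rng);
    }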