kimono -- K-means integrated models for oligonucleotide arrays

Introduction

kimono is a FREE software package for finding regulatory elements in promoter sequences using quantitative expression data. It works by applying a stochastic analogue of the K-means algorithm to cluster a dataset of promoter sequences and associated expression profiles generated using a functional genomics protocol like microarray technology.

Practically speaking, kimono uses a scoring function to compare each sequence-expression tuple to a range of alternative sequence-expression models. These models are estimated dynamically from the data. Each model corresponds to a cluster in the K-means formulation.

The scoring function is a weighted sum of a sequence score (a log-odds score derived from an alignment to a profile weight matrix estimated on-the-fly by Gibbs sampling) and an expression score (which essentially is a sum-of-squares distance measure from the mean expression profile for the cluster).

There is slightly more to it than that: the expression score may be formulated as a log-likelihood (just as the sequence score already is) within the framework of Gaussian processes. The model is then fully probabilistic, and K-means is a special case of the EM algorithm applied to mixture density estimation. The Gibbs sampling adds a stochastic element that addresses the problem of local optima.

The relative weighting of the sequence and expression scores determines which will be the principal factor that decides the clustering. When the expression weight dominates the sequence weight, then the kimono algorithm is identical to a procedure earlier proposed by the Church group.

Download

Download the source code: kimono.tar.gz (275K). The program is written in C++ using the Standard Template Library. It compiles cleanly "out of the box" on RedHat Linux 6.0.

Browse or download the ISMB 2000 abstract: ismb2000-abstract.ps.gz (92K), ismb2000-abstract.ps (236K), ismb2000-abstract.pdf (204K)

Browse or download some notes for a talk hosted by Mel Simon at Cal Tech on April 25, 2000: caltech4-00.ps.gz (79K), caltech4-00.ps (261K), caltech4-00.pdf (170K)

Questions?

If you have any choice words about kimono or the underlying algorithm, please mail or put them in KimonoWiki or KimonoWikiFlames...