Next: Definitions Up: Finding Regulatory Elements Using Previous: Finding Regulatory Elements Using

Introduction

A promising use for data from expression profiling experiments is as a signpost for finding transcription factor binding sites. For example, groups of putatively co-regulated transcripts may be identified by clustering the expression profiles generated by a series of microarray experiments [Bucher1999]. The promoter sequences for each transcript in a cluster may then be fed to a motif discovery algorithm. Motifs that are common to a set of apparently co-expressed genes are plausible candidates for binding sites implicated in transcriptional regulation [Wasserman & Fickett1998].

Such was the approach taken by Church and co-workers [Tavazoie et al.1999], who used the k-means algorithm to cluster the Saccharomyces mitotic cell cycle expression data generated by ChoEtAl98 ChoEtAl98 (k-means is an algorithm that clusters points in a multidimensional vector space by repeatedly assigning each point to the nearest of kcluster ``centroids'', calculating newly optimal centroids for each group, and iterating until convergence). The Church group then looked for ungapped motifs in the corresponding genomic sequence using a Gibbs sampler (which samples from the posterior probability distribution over ungapped alignments by repeatedly sampling one row at a time, picking a new indentation by making a random choice weighted by the probability of the resulting aligned motif [Lawrence et al.1993]).

It is acknowledged that this endeavor is fraught with potential pitfalls: noise in the microarray datasets, cryptic promoter elements, interactions between binding proteins and alternative modes of transcriptional regulation are among the aspects of the underlying biological process that are poorly understood. However, even quite simple algorithms that ignore such issues appear to generate interesting results [Spellman et al.1998,Eisen et al.1998,Tavazoie et al.1999] and so it seems worth investigating improved methods and formalisms that can be of practical benefit.

It is possible to map the two-stage algorithm described by Church and co-workers [Tavazoie et al.1999] into an integrated likelihood framework. The two stages--k-means clustering of expression profiles followed by Gibbs sampling of sequences--can be viewed as operating on the marginal distributions of a joint probabilistic model for sequence and expression data.

Why is it useful to have an integrated likelihood model? One reason is that the two-stage algorithm and the integrated algorithm are solving subtly different problems. Our integrated algorithm can be summarized as:

$\fbox{{\em \begin{tabular}{c} Find clusters of genes that have: \ (a) similar expression patterns {\bf and} \ (b) similar promoters. \ \end{tabular}}}$

On the other hand, the two-stage algorithm can be summarized as:

$\fbox{{\em \begin{tabular}{c} Find genes with similar expression patterns,\ {\bf then} see if they have similar promoters. \end{tabular}}}$

In particular, with the two-stage algorithm, the presence or absence of a promoter motif can have no influence on which cluster a gene is assigned to. This is reminiscent of a situation sometimes encountered in medicine, where several distinct genetic diseases can cause thoroughly similar symptoms and thus be naively diagnosed as the same disease. Furthermore, the two-stage algorithm cannot easily apply any penalty for finding the same motif in two different clusters. In the integrated algorithm, however, motifs can play a key role in determining the overall clustering pattern.

The two-stage algorithm is suited for validating the hypothesis that similar expression profiles can imply transcriptional co-regulation via shared transcription factor binding sites. Once this is taken as a fact, the integrated method may be used as a more effective way to find meaningful clusters and promoters. The hope is that using an integrated approach and a better formulated optimization problem will result in significantly improved discriminative power.

In this abstract we describe how to use just such an integrated model for identifying transcription factor (TF) binding motifs. We start, following Bishop95 Bishop95, by noting that the k-means algorithm is itself a special case of a parameter estimation algorithm (namely the Expectation/Maximization or EM algorithm) applied to a mixture density where the mixture components are spherical Gaussian distributions. Generalizing, we allow arbitrary variance (and covariance): this gives us a prior model for expression profiles within the framework of ``Gaussian processes''. The description of the sequence model closely follows that of LawrenceEtal93 LawrenceEtal93, since their development is already probabilistic. We then show how to combine the two models and suggest an adaptation of the Gibbs sampling and EM algorithms to the resulting joint model.

Finally, we direct the reader to an implementation of this algorithm: the kimono program (k-means integrated models for oligonucleotide arrays) which, despite the acronym, can be used with any source of quantitative expression data--not just microarrays! We illustrate the characteristic behavior of the program with data from simulations. The source code for kimono is covered by the GNU Public License and will remain available to all users free of charge. We strongly encourage interested parties to contact us to discuss performance, suggest improvements or just offer praise.

Next: Definitions Up: Finding Regulatory Elements Using Previous: Finding Regulatory Elements Using

2000-04-26