We have described a probabilistic model that relates a promoter sequence to a quantitative expression profile of the corresponding downstream message. We have shown how to use multiple such models to identify putative transcription factor binding sites, and have given a Gibbs/EM algorithm for model training. The program kimono implements this algorithm; its source code is freely available from the Internet.
Clearly, probabilistic modeling is not limited to ungapped motif models and simple Gaussian process clusters. The approach that we have described can be adapted in many ways. One could imagine extending the sequence and expression models described here, replacing them with other models, adding new features or modeling different kinds of functional genomics data. Although we do not intend to list every possible modification, some are worth mentioning.
One obvious extension to the expression half of the model is to include multiple classes of model. For example, when clustering a cell-cycle dataset one might allow for both periodic and non-periodic clusters. Related ideas include allowing the covariance matrix to be a model-specific variable and improving the scaling model for experimental and gene-specific bias in the light of more experimental data.
Promoter sequence models are also worth developing, for a broad range of applications. The most obvious probabilistic framework in which to do this seems to be the Hidden Markov model (HMM). In fact, the ungapped motif models that we use are special cases of profile HMMs with no insert states, delete states or internal cycles. One basic improvement, then, would be to allow loop transitions within the HMM (and thus multiple instances of the motif). Another would be to place a prior distribution over the motif length, allowing it to vary (note however that a geometric prior is far too broad for this job, leading mostly to long, rambling, weak pseudo-motifs; a narrower distribution, like a Gaussian, might work).
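The remark about motif-length priors can be made concrete with a small numerical comparison. The following Python sketch is illustrative only and is not part of kimono: the length range (4 to 40), the geometric parameter p = 0.1, and the Gaussian mean and standard deviation (10 and 2) are all assumed values chosen for the example, as are the function names. It normalises both priors over the same set of candidate lengths and compares how much probability each places on long motifs.

```python
import math

def geometric_prior(lengths, p):
    """Geometric weights (1 - p)**(L - 1) * p, renormalised over `lengths`."""
    weights = [(1.0 - p) ** (L - 1) * p for L in lengths]
    total = sum(weights)
    return [w / total for w in weights]

def discretised_gaussian_prior(lengths, mean, sd):
    """Gaussian density evaluated at integer lengths, renormalised."""
    weights = [math.exp(-0.5 * ((L - mean) / sd) ** 2) for L in lengths]
    total = sum(weights)
    return [w / total for w in weights]

def prior_sd(lengths, probs):
    """Standard deviation of a discrete prior over motif length."""
    mu = sum(L * q for L, q in zip(lengths, probs))
    var = sum(q * (L - mu) ** 2 for L, q in zip(lengths, probs))
    return math.sqrt(var)

def tail_mass(lengths, probs, cutoff):
    """Total prior probability of lengths >= cutoff."""
    return sum(q for L, q in zip(lengths, probs) if L >= cutoff)

# Assumed illustrative settings (not taken from the paper).
lengths = list(range(4, 41))
geom = geometric_prior(lengths, p=0.1)
gauss = discretised_gaussian_prior(lengths, mean=10.0, sd=2.0)

print("sd of geometric prior:", prior_sd(lengths, geom))
print("sd of Gaussian prior: ", prior_sd(lengths, gauss))
print("P(L >= 20), geometric:", tail_mass(lengths, geom, 20))
print("P(L >= 20), Gaussian: ", tail_mass(lengths, gauss, 20))
```

Under these assumed settings the geometric prior leaves a substantial fraction of its mass on motifs of twice the typical length, whereas the Gaussian prior leaves almost none, which is one way to see why the geometric choice tends to admit long, weak pseudo-motifs.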
A familiar example--the ``TATA box''--motivates development of better promoter models for use with our method. Since the TATA box motif is found in the majority of eukaryotic promoter regions regardless of the transcriptional behavior of the downstream gene, it will have some overall aggregating effect on the clustering. This example shows that, counterintuitively, integrating more (sequence) data can lead to coarser clusters if an incorrect model is used. We expect our Gibbs/EM approach, which allows for uncertainty in the cluster assignments, to handle this particular example somewhat better than clustering algorithms such as k-means, which make ``hard'' decisions. It is easy to become pessimistic about the chances of ever modeling promoters well, but it is good to remember that, pragmatically speaking, one does not need a perfect model in order to generate leads that experimentalists can pursue.
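The distinction between ``hard'' and uncertain cluster assignments can be illustrated with a toy calculation. The sketch below is a simplification and not the kimono implementation: it assumes isotropic Gaussian clusters with a shared standard deviation and equal prior weights, ignores the sequence half of the model entirely, and uses made-up two-point expression profiles. The function names are hypothetical.

```python
import math

def log_gaussian(x, mean, sd):
    """Log density of an isotropic Gaussian evaluated at profile x."""
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * sq / sd ** 2 - len(x) * math.log(sd * math.sqrt(2.0 * math.pi))

def soft_assignment(x, means, sd):
    """Posterior cluster probabilities, assuming equal prior weights."""
    logs = [log_gaussian(x, m, sd) for m in means]
    top = max(logs)  # subtract the maximum for numerical stability
    unnorm = [math.exp(l - top) for l in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def hard_assignment(x, means):
    """k-means style decision: index of the nearest cluster mean."""
    dists = [sum((xi - mi) ** 2 for xi, mi in zip(x, m)) for m in means]
    return dists.index(min(dists))

# Made-up example: a profile lying almost midway between two cluster means.
means = [(0.0, 0.0), (2.0, 2.0)]
profile = (0.9, 0.9)

hard = hard_assignment(profile, means)
soft = soft_assignment(profile, means, sd=1.0)
print("hard assignment:", hard)
print("soft assignment:", soft)
```

The hard rule commits the profile wholly to the nearer cluster, while the soft rule retains appreciable probability for the other one, so a gene whose membership is ambiguous contributes only partially to each cluster's parameter estimates.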
Another route to new models is to attempt to place procedures that are not explicitly probabilistic within a Bayesian framework, as has been done previously with the k-means algorithm and mixture density estimation using Gaussian basis functions. With many procedures, the Bayesian interpretation (or the nearest approximation) may be less obvious than the relatively simple mapping between k-means and Gaussian processes. An interesting candidate for this approach might be the logistic regression model for aggregating promoter sequence motifs that was proposed by Wasserman and Fickett [Wasserman & Fickett1998]. With respect to the analysis of expression data, the support vector machine (SVM) approaches used by the Haussler group are promising [Brown et al.2000]; their work also differs from ours in taking a supervised learning approach, where at least some of the cluster membership assignments must be pre-specified by the user.
In addition to these improvements to the probabilistic model, there may be many potential improvements to the Gibbs/EM algorithm that we use to train the model (i.e. to identify and characterize clusters). Examples of such improvements could include simulated annealing approaches [Gilks, Richardson, & Spiegelhalter1996] or incremental Expectation Maximization [Neal & Hinton1993]. It would also be useful to be able to sample over unknown parameters, such as the number of clusters k, using Markov Chain Monte Carlo methods.
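To give a flavor of the incremental variant of Expectation Maximization, the following Python sketch applies it to a deliberately simple case. It is not the Gibbs/EM algorithm described in this paper: it assumes a one-dimensional Gaussian mixture with a fixed, shared standard deviation, uses toy data, and all names are hypothetical. The idea it illustrates is that each data point's responsibilities are refreshed one at a time, the running sufficient statistics are adjusted by the difference, and the parameters are re-estimated immediately rather than after a full pass over the data.

```python
import math

def responsibilities(x, means, weights, sd):
    """Posterior cluster probabilities for a single scalar observation."""
    logs = [math.log(w) - 0.5 * ((x - m) / sd) ** 2
            for m, w in zip(means, weights)]
    top = max(logs)  # subtract the maximum for numerical stability
    unnorm = [math.exp(l - top) for l in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def incremental_em(data, init_means, sd=1.0, sweeps=20):
    """Incremental EM for a 1-D Gaussian mixture with fixed shared sd."""
    k, n = len(init_means), len(data)
    means = list(init_means)
    weights = [1.0 / k] * k
    # One full E-step to initialise the sufficient statistics.
    resp = [responsibilities(x, means, weights, sd) for x in data]
    s0 = [sum(r[j] for r in resp) for j in range(k)]                  # sum of responsibilities
    s1 = [sum(r[j] * x for r, x in zip(resp, data)) for j in range(k)]  # weighted sum of data
    for _ in range(sweeps):
        for i, x in enumerate(data):
            # Partial E-step: refresh only data point i.
            new = responsibilities(x, means, weights, sd)
            for j in range(k):
                delta = new[j] - resp[i][j]
                s0[j] += delta
                s1[j] += delta * x
            resp[i] = new
            # Immediate M-step from the updated statistics.
            weights = [s / n for s in s0]
            means = [s1[j] / s0[j] for j in range(k)]
    return means, weights

# Toy data: two well-separated groups (assumed values for illustration).
data = [-2.1, -1.9, -2.0, 2.0, 2.2, 1.8]
fitted_means, fitted_weights = incremental_em(data, init_means=[-1.0, 1.0])
print("means:  ", fitted_means)
print("weights:", fitted_weights)
```

Because the parameters are updated after every data point, information from early points influences the treatment of later ones within the same sweep; whether this speeds convergence for the joint sequence and expression model would need to be tested.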
An exciting prospect for the future is the automatic identification of regulatory networks on the basis of functional genomics data [D'haeseleer et al.1999]. In a probabilistic framework, this is equivalent to relaxing the independence assumption for transcripts. This presents many challenges to the statistical analyst, including issues of how much data is needed to identify interactions, a problem common in MCMC analysis.