We use a fixed-length, ungapped, nonrepeating model for TF binding motifs, following LawrenceEtal93 LawrenceEtal93. This model is simplistic and has been improved upon [Neuwald, Liu, & Lawrence1995,Zhu, Liu, & Lawrence1998,Roth et al.1998], but it is a plausible first approximation. Since it has been described extensively elsewhere, we will keep our (slightly modified) definition terse.
Denote the motif model by and the ``null'' model by
. The probability of nucleotide n under the null model is pn(i.e.
is a vector indexed by type of nucleotide), so that the null model likelihood is:
Suppose that the weight matrix for the motif model is , so that the probability of seeing nucleotide n at position j in the motif is qjn. Fix the motif length at
. The prior probability of the motif is given by a product of independent identical Dirichlet distributions for each position j, with pseudocounts
(these counts will typically be proportional to the null model probabilities pn):
Here is the vector of nucleotide probabilities for position j and
is the Dirichlet distribution with pseudocounts
[Durbin et al.1998].
Consider the first base in sequence g that is aligned to the motif. Let the index of this position be a(g) (so that the complete alignment is specified by the gene-indexed vector ). Then the odds-ratio of the aligned motif to the null model is:
We can optimize the total log-likelihood, including the Dirichlet prior, by setting the weight matrix entries proportional to the column nucleotide frequencies plus the pseudocounts. For convenience we first recall the multiplicities introduced for expression data in equation (4) to write down an analogous total log-probability for the sequence data:
where is the global sequence model parameter set and
is a shorthand for the entire sequence dataset
.
The optimal weight matrix entries are then given by the (multiplicity-weighted) nucleotide frequencies, plus pseudocounts:
The update procedure for Gibbs sampling is as follows:
Numerous generalizations of this algorithm--for example, to look for multiple motifs [Neuwald, Liu, & Lawrence1995], or to search the complementary strand of DNA [Fickett1996]--have been published. Systematic studies of the convergence properties of the algorithm, whether empirical or theoretical, are less numerous. A heuristic rule is to use the expected information content ratio of sequence data to motif as a guide to the number of iterations to run. Another useful rule of thumb is frequently to start sampling from scratch with a random seed alignment (but how frequently is ``frequently''?). Arrival at the same solution from independent random seeds seems to be a good indicator of having found the global optimum [Lawrence1996]; we call this the déja vu heuristic.