The sequence model,

We use a fixed-length, ungapped, nonrepeating model for TF binding motifs, following LawrenceEtal93 LawrenceEtal93. This model is simplistic and has been improved upon [Neuwald, Liu, & Lawrence1995,Zhu, Liu, & Lawrence1998,Roth et al.1998], but it is a plausible first approximation. Since it has been described extensively elsewhere, we will keep our (slightly modified) definition terse.

Denote the motif model by ${\cal S}$ and the ``null'' model by ${\cal S}_\emptyset$ . The probability of nucleotide n under the null model is p_n(i.e. $\vec{p}$ is a vector indexed by type of nucleotide), so that the null model likelihood is:

$\begin{displaymath}\mbox{Pr}\left[\bar{S}^{(g)} \vert {\cal S}_\emptyset, \vec{p}\right] = \prod_{i=1}^{L} p_{S^{(g)}_i} \end{displaymath}$

(6)

Suppose that the weight matrix for the motif model is ${\bf q}$ , so that the probability of seeing nucleotide n at position j in the motif is q_jn. Fix the motif length at $\eta$ . The prior probability of the motif is given by a product of independent identical Dirichlet distributions for each position j, with pseudocounts $\vec{D} = \{D_n\}$ (these counts will typically be proportional to the null model probabilities p_n):

$\begin{displaymath}\mbox{Pr}\left[{\bf q} \vert \vec{D}\right] = \prod_{j=1}^\eta {\cal D} (\vec{q}_{[j]} \vert \vec{D}) \end{displaymath}$

(7)

Here $\vec{q}_{[j]}$ is the vector of nucleotide probabilities for position j and ${\cal D} (\vec{\theta} \vert \vec{\alpha})$ is the Dirichlet distribution with pseudocounts $\vec{\alpha}$ [Durbin et al.1998].

Consider the first base in sequence g that is aligned to the motif. Let the index of this position be a^(g) (so that the complete alignment is specified by the gene-indexed vector $\vec{a} = \{a^{(g)}\}$ ). Then the odds-ratio of the aligned motif to the null model is:

$\begin{displaymath}\frac{\mbox{Pr}\left[\bar{S}^{(g)}, a^{(g)} \vert {\cal S}, \... ... \frac{q_{j,S^{(g)}_{j+a^{(g)}-1}}}{p_{S^{(g)}_{j+a^{(g)}-1}}} \end{displaymath}$

(8)

We can optimize the total log-likelihood, including the Dirichlet prior, by setting the weight matrix entries proportional to the column nucleotide frequencies plus the pseudocounts. For convenience we first recall the multiplicities $\rho^{(g)}$ introduced for expression data in equation (4) to write down an analogous total log-probability for the sequence data:

$\displaystyle {\cal L}\left[\bar{\bf S}, \vec{a}, {\bf q} \hspace{.3em} \vert \... ...= \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad$
$\displaystyle \quad \quad \quad \log{\mbox{Pr}\left[{\bf q} \vert \vec{D}\right... ...{\mbox{Pr}\left[\bar{S}^{(g)}, a^{(g)} \vert {\cal S}, \vec{p}, {\bf q}\right]}$			(9)

where $\Theta' = \{ \vec{p}, \vec{D} \}$ is the global sequence model parameter set and $\bar{\bf S}$ is a shorthand for the entire sequence dataset $\{\bar{S}^{(g)}\}$ .

The optimal weight matrix entries are then given by the (multiplicity-weighted) nucleotide frequencies, plus pseudocounts:

$\begin{displaymath}q_{jn}^{\mbox{\small (opt)}} = \frac{D_n + \sum_{\{g:S^{(g)}_... ...= n\}} \rho^{(g)}}{\sum_{n'} D_{n'} + \sum_{g=1}^G \rho^{(g)}} \end{displaymath}$

(10)

Numerous generalizations of this algorithm--for example, to look for multiple motifs [Neuwald, Liu, & Lawrence1995], or to search the complementary strand of DNA [Fickett1996]--have been published. Systematic studies of the convergence properties of the algorithm, whether empirical or theoretical, are less numerous. A heuristic rule is to use the expected information content ratio of sequence data to motif as a guide to the number of iterations to run. Another useful rule of thumb is frequently to start sampling from scratch with a random seed alignment (but how frequently is ``frequently''?). Arrival at the same solution from independent random seeds seems to be a good indicator of having found the global optimum [Lawrence1996]; we call this the déja vu heuristic.