
Definitions

We begin by introducing some notation for the experimental data. Let the number of gene transcripts be $G$. The observed sequence and expression data for the $g$th gene are $\bar{S}^{(g)}$ and $\vec{E}^{(g)}$ respectively (we will use the notational convention that $x$ is a scalar, $\bar{x}$ is a sequence, $\vec{x}$ is a vector and ${\bf x}$ is a matrix).

We suppose that each upstream sequence is the same length, $L$. Thus, sequence $\bar{S}^{(g)}$ is defined to be the $L$ bases upstream of the transcription start site for gene $g$, and the $i$th nucleotide of this sequence is denoted by $S^{(g)}_i$.
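The definition of $\bar{S}^{(g)}$ can be illustrated with a minimal Python sketch. The function names `upstream_sequence` and `nucleotide` are hypothetical, and several details are assumptions not fixed by the text: the gene lies on the forward strand, the transcription start site is given as a 0-based index into the chromosome string, and $S^{(g)}_1$ is taken to be the base furthest from the start site.

```python
def upstream_sequence(chromosome: str, tss: int, length: int) -> str:
    """Return the `length` bases immediately upstream of a transcription start site.

    Assumptions (not from the text): forward-strand gene, `tss` is the
    0-based index of the first transcribed base in `chromosome`.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    if tss < length or tss > len(chromosome):
        raise ValueError("not enough sequence upstream of the start site")
    return chromosome[tss - length:tss]


def nucleotide(sequence: str, i: int) -> str:
    """Return S_i using the 1-based indexing of the text.

    Assumption: i = 1 is the base furthest upstream of the start site.
    """
    if not 1 <= i <= len(sequence):
        raise IndexError("i must satisfy 1 <= i <= L")
    return sequence[i - 1]


# Toy example with L = 4 (made-up sequence, for illustration only).
toy_chromosome = "ACGTACGTAC"
toy_upstream = upstream_sequence(toy_chromosome, tss=6, length=4)
```

Because every gene uses the same `length`, the resulting sequences can be stored together as $G$ strings of length $L$.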

The expression profile $\vec{E}^{(g)}$ for gene $g$ is a vector of real numbers, typically log intensity-ratios [Spellman et al. 1998], presumed to be related to mRNA abundance levels. Thus $E^{(g)}_x$ is the "expression level" measured in experiment $x$, with $1 \leq x \leq X$. We suppose that the $X$ experiments form a time series (although this is not a restriction; see below), so that experiment $x$ corresponds to time $t_x$ and $E^{(g)}_x \equiv E^{(g)}(t_x)$. Note that this allows for multiple sampling runs if $t_x = t_y$ for some $x \neq y$.
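One way to hold this notation in code is sketched below, using plain Python lists. The names `expression_level` and `replicate_groups` and all numerical values are hypothetical and purely illustrative; the only things taken from the text are the shapes ($G$ profiles of length $X$, one label $t_x$ per experiment) and the 1-based indices $g$ and $x$.

```python
from collections import defaultdict


def expression_level(expression, g, x):
    """Return E^(g)_x, using the 1-based gene and experiment indices of the text."""
    return expression[g - 1][x - 1]


def replicate_groups(times):
    """Group experiment indices x (1-based) that share the same label t_x.

    Only labels that occur more than once are returned, i.e. the cases
    t_x = t_y with x != y that correspond to multiple sampling runs.
    """
    groups = defaultdict(list)
    for x, t in enumerate(times, start=1):
        groups[t].append(x)
    return {t: xs for t, xs in groups.items() if len(xs) > 1}


# Made-up toy data: G = 2 genes, X = 4 experiments.
times = [0.0, 10.0, 10.0, 20.0]      # t_1 .. t_4; t_2 == t_3 is a repeated sample
expression = [
    [0.1, 0.8, 0.7, -0.2],           # profile for gene 1
    [-0.3, 0.0, 0.1, 0.5],           # profile for gene 2
]

# Every profile must have exactly X entries, one per experiment.
assert all(len(profile) == len(times) for profile in expression)

repeats = replicate_groups(times)
```

Nothing in `replicate_groups` requires the labels to be numbers, so the same structure would work if each $t_x$ were a discrete label rather than a time.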

The notation of the previous paragraph remains valid for experiments that do not form a time series. For example, $t_x$ could be a discrete-valued variable representing the tissue from which the cells in experiment $x$ were taken. The point is that we want to be able to model correlations between different expression experiments. If the experiments are actually independent (or, more likely, if it is too hard to work out which experiments should be correlated with which), the usual practice of treating the experiments as independent can still be followed within this notation.


2000-04-26