Next: Implementation Up: Definitions Previous: Clustering using multiple models

A note on scaling of expression data

Consider what happens if we take two real numbers, $\gamma^{(g)}$ and $\beta^{(g)}$, and use them to scale and offset the expression data for gene $g$ by applying the mapping


\begin{displaymath}E^{(g)}_x \rightarrow \exp{[-\gamma^{(g)}]} (E^{(g)}_x - \beta^{(g)}) \end{displaymath}

to all likelihood formulae involving expression profiles.

If we choose


\begin{displaymath}\beta^{(g)} = \frac{1}{X}\sum_{x=1}^X E^{(g)}_x \end{displaymath}


\begin{displaymath}\gamma^{(g)} = \frac{1}{2} \log{\left[ \frac{1}{X} \sum_{x=1}^X (E^{(g)}_x - \beta^{(g)})^2 \right]} \end{displaymath}

then this is equivalent to the ``normalization'' of profile vectors that is commonly applied to expression data preceding analysis. While this normalization is effective as a correction for biases in the experimental protocol, it is also crude: it does not distinguish between systematic experimental errors and genuine differences in transcription levels.
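As a concrete illustration, the common normalization described above amounts to mean-centering each profile and scaling it to unit variance. The following is a minimal sketch (the function name and NumPy usage are ours, not part of the original text), with $\beta^{(g)}$ taken as the profile mean and $\exp[-\gamma^{(g)}]$ as the reciprocal standard deviation:

```python
import numpy as np

def normalize_profile(E):
    """Mean-centre and variance-scale one gene's expression profile.

    Equivalent to the mapping E -> exp(-gamma) * (E - beta) with
    beta = mean(E) and gamma = 0.5 * log(var(E)), so exp(-gamma) = 1/std(E).
    """
    beta = E.mean()                                  # beta^(g): profile mean
    gamma = 0.5 * np.log(((E - beta) ** 2).mean())   # gamma^(g): log std dev
    return np.exp(-gamma) * (E - beta)

E = np.array([2.0, 4.0, 6.0, 8.0])
En = normalize_profile(E)
# En now has zero mean and unit (population) variance
```

Note that this transformation is applied per gene, independently of any model, which is exactly the crudeness criticized in the text.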

Our likelihood framework permits a more principled approach. We can incorporate prior probabilities for the model-independent scaling parameters $\beta^{(g)}$ and $\gamma^{(g)}$. Suitable priors might be the Gaussian distributions $\beta^{(g)} \sim {\cal N}(0,\tau^2)$ and $\gamma^{(g)} \sim {\cal N}(0,\upsilon^2)$. The values of $\beta^{(g)}$ and $\gamma^{(g)}$ can then be sampled at some point during the Gibbs/EM update procedure, e.g. between the Gibbs and Expectation steps. One way to do the sampling is to generate values from the prior distribution, then choose one of these values randomly according to the likelihoods given by equation (12).
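The sampling step described above can be sketched as follows. This is an illustrative implementation under our own assumptions: `log_lik` is a placeholder standing in for the model likelihood of equation (12), and the candidate count is arbitrary. Candidates are drawn from the Gaussian priors, and one is chosen with probability proportional to its likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scaling(E, log_lik, tau=0.1, upsilon=0.1, n_candidates=50):
    """Sample (beta, gamma) for one gene between the Gibbs and E steps.

    Draws candidate values from the priors beta ~ N(0, tau^2) and
    gamma ~ N(0, upsilon^2), then picks one pair with probability
    proportional to the likelihood of the rescaled profile.
    """
    betas = rng.normal(0.0, tau, n_candidates)        # beta^(g) candidates
    gammas = rng.normal(0.0, upsilon, n_candidates)   # gamma^(g) candidates
    logw = np.array([log_lik(np.exp(-g) * (E - b))
                     for b, g in zip(betas, gammas)])
    w = np.exp(logw - logw.max())                     # stabilise before normalising
    i = rng.choice(n_candidates, p=w / w.sum())
    return betas[i], gammas[i]
```

For example, `sample_scaling(E, lambda e: -0.5 * float(np.sum(e ** 2)))` would use a standard-normal stand-in for the likelihood. Drawing from the prior and reweighting by likelihood in this way is a simple importance-style scheme; one could equally substitute a Metropolis step.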

It is equally straightforward to incorporate experiment-dependent parameters into the scaling, e.g.


\begin{displaymath}E^{(g)}_x \rightarrow \exp{[-\gamma^{(g)}-\delta^{(x)}]} (E^{(g)}_x - \beta^{(g)} - \epsilon^{(x)}) \end{displaymath}

with $\delta^{(x)} \sim {\cal N}(0,\phi^2)$ and $\epsilon^{(x)} \sim {\cal N}(0,\varphi^2)$.

When such additional parameters are allowed, we may constrain them (by keeping $\tau$, $\upsilon$, $\phi$ and $\varphi$ small) to avoid overfitting when the dataset is sparse.
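The extended mapping with both gene- and experiment-dependent parameters can be written as a single vectorized operation. A minimal sketch (function name and array layout are ours), treating the data as a genes-by-experiments matrix:

```python
import numpy as np

def scale_expression(E, beta, gamma, delta, epsilon):
    """Apply E -> exp(-(gamma + delta)) * (E - beta - epsilon).

    E is a (genes x experiments) matrix; beta and gamma are per-gene
    vectors, delta and epsilon per-experiment vectors, following the
    notation in the text. Broadcasting applies each pair jointly.
    """
    return (np.exp(-(gamma[:, None] + delta[None, :]))
            * (E - beta[:, None] - epsilon[None, :]))
```

With all four parameter vectors at zero the mapping is the identity; keeping the prior widths $\tau$, $\upsilon$, $\phi$ and $\varphi$ small keeps the sampled parameters near zero and so constrains the transformation, as the text suggests.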



2000-04-26