Research Article |
Open Access |
|
|
A Probabilistic Approach to
Study Yeast’s Gene Regulatory Network |
Pinto F.R |
Centro de Química e Bioquímica, Departamento de
Química e Bioquímica, Faculdade de Ciências da Universidade
de Lisboa, Campo Grande, 1749-016 Lisboa, Portugal |
| *Corresponding author: |
Pinto F.R, Centro de Química e Bioquímica,
Departamento de Química e Bioquímica, Faculdade de Ciências da Universidade de Lisboa,
Campo Grande, 1749-016 Lisboa, Portugal,
Phone : +351 217500891,
Fax : +351 217500088,
Email : frpinto@fc.ul.pt |
|
| Received December 14, 2008; Accepted February 25, 2009; Published February 27, 2009 |
|
Citation: Pinto F.R (2009) A Probabilistic Approach to Study Yeast’s Gene Regulatory Network. J Comput Sci Syst Biol 2: 043-050. |
| |
Copyright: ©2009 Pinto F.R. This is an open-access article distributed under the terms of the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and
source are credited. |
| |
|
Using only the transcription network structure information, a probabilistic model was developed that computes
the probabilities with which a pair of genes responds simultaneously (SR) or differentially (DR) to a random
network perturbation. Study of yeast’s transcription regulatory network in association with gene expression
profiles shows that SR and DR probabilities are significantly associated with the distribution of strong co-expression.
It is 100 fold more probable to observe co-expression when P(SR)≈0.5 for a random perturbation of 3 transcription
factors (TFs), allowing for perturbation spread until a depth of 3 connections in the regulatory network. The
model also predicts that positive co-expression enhancement is related with the proportion of common TFs
(number of TFs that regulate both genes in a pair divided by the total number of TFs that regulate at least one
gene in the pair), and not to the absolute number. The relationship between the model derived probabilities and
other graph-theoretic measures used to analyse biological networks is discussed. |
Key words: |
| gene regulatory networks; network perturbation; response probability; co-regulation; co-expression; and graphbased
measures |
Abbreviations |
GTOM – generalized topological overlap measure
P(DR) – probability of differential response to network perturbation
P(NR) – probability of neutral response to network perturbation
P(SR) – probability of simultaneous response to network perturbation
TF – transcription factor
TFS – transcription factor similarity
TOM – topological overlap measure |
Introduction |
Systems biology has recently re-emerged ( Westerhoff and
Palsson, 2004), after a long quiescent period since its firsts
steps ( Bertalanfy, 1928), due to the acknowledgement of
the components’ interactions importance for an enhanced
understanding of living organisms. Analysis of these
interactions can be achieved through the identification of
biological networks. Typically, they are characterized by their
connectivity distributions ( Barabasi and Oltvai, 2004) or by
the enrichment in small network motifs, as compared with
randomly connected networks ( Milo, et al., 2002). The
present work studies transcription regulatory networks, in
particular for the yeast Saccharomyces cerevisiae. The
most abundant information available about these networks corresponds to a topological model or wiring diagram ( Schlitt
and Brazma, 2005). In other words, the network consists
on a directed graph, where each node corresponds to a gene
and respective gene product, and a directed interaction
between A and B means that A gene product is a transcription
factor (TF) that regulates the expression of gene B. For S.
cerevisiae these interactions have been gathered from the
literature (Guelzim, et al., 2002) and detected with high
throughput experimental techniques like chromatin immunoprecipitation
combined with DNA chip technology (ChIPchip)
( Lee, et al., 2002). Additionally, this model organism is
relatively rich in microarray gene expression studies. Soon
after the availability of topological models of yeast’s
transcriptional regulatory network, several researchers
evaluated the agreement between network structure
information and its dynamical behaviour. Some studies only
categorized pairs of genes into two or three classes: no
common TFs, one or more common TF and two or more
common TFs ( Allocco, et al., 2004; Yeung, et al., 2004; Yu,
et al., 2003). Others looked at blocks of target genes
modulated by the same set of TFs, measuring the impact of
the number of TFs on block co-expression ( Herrgard, et al.,
2003). These studies used the same general approach to
detect co-expression (applying a threshold to an expression
correlation measure), but relied on stringent criteria to
recognize co-regulation. In the present report the association
patterns that emerge under more generic definitions are
explored. With this aim, a probabilistic framework was
developed to predict when two genes respond simultaneously
to a random perturbation. |
Methods |
Gene Expression Data |
| The gene expression data used in this work (Gasch, et al.,
2000) consists of 173 cDNA array experiments, involving
around 30 different environmental perturbations of yeast
cultures. Some perturbations are monitored for several time
points. The actual dataset, normalization procedure and other
pre-processing methods are described in the paper
supplementary website: http://www-genome.stanford.edu/
yeast_stress. Pearson’s linear correlation coefficient was
calculated between the expression profiles of every pair of
genes. For a given gene pair, if the number of missing values
was greater than 10% of the total number of arrays, the
corresponding correlation coefficient was not used in further
analysis. |
Regulatory Network Information |
| The yeast’s transcription network topology was obtained
from the public dataset of Lee and colleagues. The promoter binding sites of 106 TFs were detected through a
chromosome immuno-precipitation followed by hybridization
in a DNA chip. As proposed by the authors, a cut-off of
p<0.001 was used to define that a promoter region is bound
by a given TF (Lee, et al., 2002). |
Co-expression Thresholds and Detection of
Associations |
| The association between co-expression and network
perturbation responses was assessed by computing the
probability of finding strong co-expressions among pairs of
gene with similar response probabilities. Having a similar
response probability means it is within an interval centred
on a value of interest. The range of the interval was always
selected to be 1/10 of the total observed range of the
respective response probability. The strong co-expression
probabilities were computed for 100 equally spaced values
across the total range of observed response probabilities.
Positive and negative co-expressions were analysed
separately. Positively co-expressed gene pairs were defined
as the 0.5% gene pairs with the highest expression
correlation coefficients. This was equivalent to the
application of a correlation threshold rt+=0.82. Analogously,
negatively co-expressed gene pairs were defined as the 0.5%
gene pairs with the lowest expression correlation
coefficients, corresponding to a threshold of rt-=-0.73 |
Randomization Procedure |
| To estimate the range of strong co-expression probabilities
in the null case of independence between co-expression and
perturbation responses, 0.5% of gene pairs were randomly
assigned as strongly co-expressed and all the strong coexpression
probabilities were re-computed. This procedure
was repeated 2000 times, retaining for each value of the
perturbation response probability the minimum and maximum
limit values of the strong co-expression probability. In a
probability profile composed of 100 points, the null probability
of finding at least one point out of the random range will be
0.05, according to a Bonferroni correction for multiple testing
(Quinn, 2002). All the procedures, graphs and calculations
were implemented in Matlab (Release 14). |
Results |
| Most of the common expression datasets are collections
of expression values, relating two gene expression states,
one before and the other after a specific environmental
perturbation. The correlation coefficient between expression
profiles is dependent on the number of times the two genes
respond simultaneously and the number of times only one of the genes responds. In this section estimates are derived
for the probabilities of observing each kind of response from
available network data. This is done in the absence of
information about strengths and signs of each regulatory
interaction. |
In the following derivation, a random perturbation is
applied, where a limited number of TFs e are perturbed,
and around them the perturbation propagates through all
the possible pathways until a maximum depth d. |
The response of a given pair of genes is in one of three
classes: a) both genes respond (simultaneous response, SR);
b) only one of the genes responds (differential response,
DR); c) none of the genes respond (neutral response, NR). |
| NRs should only randomly affect the pair-wise correlation
coefficients of expression profiles. SRs effect will depend
on interaction signs and DRs should contribute to lower
correlation coefficients. |
For different d values and for every pair of genes A and
B, it is possible to count the transcription factors (TFs) in
each of the following four classes: a) number of TFs that
regulate both genes – Xd; b) number of TFs that regulate
only gene A – Yd; c) number of TFs that regulate only gene
B – Wd and d) number of TFs that do not regulate any of
the genes – Zd. Every transcription factor among the N
present in the network can be classified in this way. In a
random perturbation, e TFs are perturbed. For a given pair
of genes, xd, yd, wd and zd will represent the numbers of
perturbed TFs in each of the four possible classes. |
Perturbing at least one common TF (xd>0), irrespective
of whatever other TFs are perturbed, is considered to be
sufficient to elicit an SR. If xd=0, it can still be possible to
observe an SR if yd and wd are both greater than zero. This
means that, by chance, the two target genes are regulated
after a perturbation by TFs that exclusively regulate each
one of the target genes separately. If none of the perturbed
TFs regulates any of the target genes (zd=e), then the pair
of genes will show a NRs. All the other cases correspond to
DRs. |
In this deduction two main assumptions are made: a) when
more than one TF acts on a given promoter, even if some
are activating and others repressing, there is always a net
response from the target gene, and b) every TF has an equal
probability of being perturbed. The latter is not necessarily
reasonable. It could be that TFs are involved in perturbation responses more frequently because their activity is modulated
by a higher number of signal transduction cascades.
However, in the absence of information about the activity
frequency of different signalling pathways, the least biased
hypothesis is the equal-probability one. |
Knowing the six parameters (e, d, Xd, Yd, Wd and Zd) it is
possible to compute the probability of observing any of the
possible responses (SR, DR or NR) after a random network
perturbation: |
 |
(1) |
 |
(2) |
 |
(3) |
Applying probability calculus, expressions (1), (2) and (3)
can be converted to algebraic functions. The simplest case
to calculate is: |
 |
(4) |
Where nCp is number of combinations of n elements taken
p at a time. |
|
(5) |
 |
(6) |
For e=1: |
 |
(7) |
 |
(8) |
 |
(9) |
As only one factor is perturbed, the factors in Yd and Wd always contribute to DR and not to SR. It is also noticeable
that P(DR) only depends on the sum (Yd+Wd) and not on
the relative proportion of each one alone. |
The values of Xd, Yd, Wd and Zd at the various depths d,
are completely defined by the regulatory network
architecture. Consequently, for a perturbation with depth d,
the network architecture also defines P(SR), P(DR) and
P(NR) values for a given pair of genes. I surveyed the
relations between the probability of observing strong
correlations in gene expression and the values of P(SR) and
P(DR) for different values of e (1 to 4) and d (1 to 9, the
diameter of the regulatory network used in this study). A
change in the parameters e and d induces smooth
modifications in the strong correlation probability profiles
for P(SR) and P(DR). Strong positive expression correlations
are more clearly associated with high P(SR) for e=3 and
d=3. The intensity of the strong co-expression enrichment
grows from d=1 until d=3, and decreases for higher depths.
P(DR) for those parameter values is also a characteristic
example of the remaining P(DR) profiles. These particular
results are shown in Figure 1. The two most remarkable
observations made from Figure 1 are that the greatest enrichments of positive strong expression correlations happen
near the maximum values of P(SR) and at the minimum
possible values of P(DR). Additionally, strong positive
correlations are less frequent for higher values of P(DR).
Near a P(SR) of 0.5, half of the gene pairs have a correlation
coefficient higher than the top 0.5% threshold, meaning that
strong positive correlations are concentrated 100 times.
Strong negative expression correlations only show some
significant deviation from the random null model when they
are stratified by P(DR) values. Although much less intense,
they have an inverted profile relatively to the positive
correlations. That is, they are less frequent for low values
of P(DR) and slightly enriched for some intervals of greater
P(DR) values. |
Combining expressions (8) and (9): |
 |
(10) |
|
Figure1: Plots of the relation between strong co-expression and the probability of observing simultaneous (SR) or differential
(DR) responses of a target gene pair after network perturbations initially affecting e=3 TFs and spreading until a maximum
depth of d=3. Blue lines represent the probability of finding strong positive co-expression (P(r>rt+), with a positive threshold
rt+=0.82 and where r is the linear correlation coefficient between two gene expression profiles) among pairs of genes with
similar response probabilities. Red lines represent the probability of finding strong negative co-expression (P(r<rt-), with a
negative threshold rt-=-0.73) among pairs of genes with similar response probabilities. Black lines represent the maximum and
minimum randomized probabilities of finding strong co-expression among pairs of genes with similar co-regulatory similarity
measure, after 2000 random permutations of the gene expression correlation matrix.
|
According to expression (10), the proportion of common
direct TFs is the probability of observing a simultaneous
response knowing that it is not neutral - P(SR|zd=0,e=1,d=1).
If we recall that neutral responses should not affect
correlation coefficients between expression profiles, the
proportion in expression (10) should be more related with
co-expression than the number of common TFs alone. After
observing the gene expression profile of a sufficiently high
number of perturbation experiments or distinct cellular states
it would be expectable that genes sharing more common
TFs are more frequently co-regulated. On the other hand, if
two genes are regulated by very distinct TF sets, it would
be expectable to observe perturbations where only one of
the genes is regulated. Measuring the proportion of common
TFs does effectively account for both the common TFs and
the exclusive TFs of a given target gene pair. |
In fact, Notebaart and colleagues (Notebaart, et al., 2008)
successfully used a transcription factor similarity (TFS)
measure to explain the coupling of metabolic flux between
two enzyme-coding genes. This success was not attained
when graph distance between the genes in the transcription
regulatory network was used instead of the TFS. They
defined TFS as the total number of shared TFs between
two genes divided by the total number of unique TFs
regulating the two genes, which is equivalent to expression
(10). |
Other well-known graph-theoretic measure used to
analyse biological networks is the topological overlap
measure (TOM) (Ravasz, et al., 2002). Interestingly, TOM
is also closely related to the perturbation response
probabilities derived in our model: |
 |
(11) |
Where, for the pair of genes A and B, P(AR) and P(BR) are
the probabilities that genes A and B, respectively, respond
to the perturbation. Yip and Horvath expanded TOM to a
generalized form (GTOM) that presents associations with
gene function that are more robust to uncertainties in
network topology data (Yip and Horvath, 2007). In this
probabilistic model language, the generalization corresponds to the possibility to consider perturbations with higher depths: |
 |
(12) |
It is readily apparent from expression (12) that GTOM can
be further generalized by allowing perturbations involving
more than one TF: |
 |
(13) |
These relationships and equivalences between our
probabilistic approach and common graph based measures
further suggests that P(SR) and P(DR) estimates can
provide valuable information about the correlation between
regulatory network topology and function. On the other hand,
measures like TSF, TOM or GTOM are also enriched with
the use of the presented response probabilities. They gain a
more concrete biological meaning, which can potentiate the
interpretation of the associations between their values and
network functional properties. TSF may be read as the
probability of observing a simultaneous response to network
perturbations, knowing that the response is not neutral. TOM
and GTOM are also proportional to the probability of
simultaneous response, but their values are normalized by
the response probability of the gene that responds less
frequently to network perturbations. Thus, when TOM or
GTOM are 1, it does not mean that both genes respond
exactly to the same network perturbations, but that one gene
is sensible to a set of perturbations that is contained in the
set of perturbations that have an effect in the other gene. |
Discussion |
| A main result of this work consisted in proposing a simple
rational connection between the architecture of the
regulatory network topology and its functional behaviour.
Using a minimal probabilistic model it was possible to justify
the differences between computing the proportion versus
the absolute number of common TFs. In addition to the effect of common TFs, the use of the proportion has also accounts
for the TFs exclusively regulating the expression of one gene
in a target pair. As a consequence, it reflects both the
probability of observing a simultaneous response, P(SR) and
a differential response, P(DR), after a system perturbation.
The absolute count is more directly related with the P(SR)
and does not include the impact of P(DR) in the experimental
correlation coefficients between expression profiles. |
It is also shown that integrating information about indirect
factors may be important for the analysis of gene expression
data. The results obtained with the P(SR) indicate that
including factor information until 3 regulatory levels up of
the target genes provides the most strong association
between the probability of simultaneous response and the
probability of observing strong co-expression. This
observation does not necessarily imply that on average, the
perturbations associated with the used dataset had a depth
of three connections. It could alternatively mean that the
past history of network activations or inhibitions could be
relevant for the response to newer perturbations. |
Besides the associations found between response
probabilities and strong correlation of expression profiles,
this approach was further validated by the relationship with
other methodologies (TFS, TOM, GTOM), which on their
own have previously demonstrated their utility in providing
valuable insights into the functional organization of biological
regulatory networks. Perturbation response probabilities
enhance the interpretation of graph-based measures by
attributing them a biological meaning related to the network
capacity to respond to and propagate random perturbations.
In fact, the use of P(SR) and P(DR) allows an extra
generalization of the GTOM measure. It is plausible that
the latter can have an increased performance since the most
strong association between P(SR) and expression profiles
occurred for perturbations of 3 TFs propagating through 3
network connection levels (e=3 and d=3). |
Conclusions and Perspectives |
| The presented approach uses minimal information about
transcriptional regulatory network. I believe this approach
can be useful as it can be easily validated with most common
microarray experiments. More detailed models of
transcriptional regulatory network dynamics, using Boolean,
Bayesian, stochastic or differential equations frameworks
would be optimally validated against expression profiles with
long and well-sampled time series not so commonly produced. As the network information grows and becomes
more complete, the probabilistic model shown here can
provide better predictions. The simplicity of the input
information and model assumptions may allow the analysis
of integrated networks including simultaneously several
mechanisms of gene transcription or protein activity
regulation. |
I finally propose that coarse level analysis like the one
presented here can provide a useful bridge between large
gene expression datasets and more fine-detailed models of
biological network dynamics. |
References |
- Allocco DJ, Kohane IS, Butte AJ (2004) Quantifying
the relationship between co-expression, co-regulation and
gene function. BMC Bioinformatics 5: 18. [ FIND THIS ARTICLE ONLINE ]
- Barabasi AL, Oltvai ZN (2004) Network biology:
understanding the cell’s functional organization. Nat Rev
Genet 5: 101-113. [ FIND THIS ARTICLE ONLINE ]
- Bertalanfy LV (1928) Kritische Theorie der
Formbildung. Borntraeger Berlin.
- Gasch AP, Spellman PT, Kao CM, Carmel HO, Eisen
MB, et al. (2000) Genomic expression programs in the
response of yeast cells to environmental changes. Mol
Biol Cell 11: 4241-4257. [ FIND THIS ARTICLE ONLINE ]
- Guelzim N, Bottani S, Bourgine P, Kepes F (2002)
Topological and causal structure of the yeast
transcriptional regulatory network. Nat Genet 31: 60-
63. [ FIND THIS ARTICLE ONLINE ]
- Herrgard MJ, Covert MW, Palsson BO (2003)
Reconciling gene expression data with known genomescale
regulatory network structures. Genome Res 13:
2423-2434. [ FIND THIS ARTICLE ONLINE ]
- Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar JZ, et al.
(2002) Transcriptional regulatory networks in
Saccharomyces cerevisiae. Science 298: 799-804. [ FIND THIS ARTICLE ONLINE ]
- Milo R, Shen OS, Itzkovitz S, Kashtan N, Chklovskii D,
et al. (2002) Network motifs: simple building blocks of
complex networks. Science 298: 824-827. [ FIND THIS ARTICLE ONLINE ]
- Notebaart RA, Teusink B, Siezen RJ, Papp B (2008)
Co-regulation of metabolic genes is better explained by flux coupling than by network distance. PLoS Comput
Biol 4: e26. [ FIND THIS ARTICLE ONLINE ]
- Quinn GP, Keough MJ (2002) Experimental design and
data analysis for biologists. Cambridge University
Press, Cambridge.
- Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi
AL (2002) Hierarchical organization of modularity in
metabolic networks. Science 297: 1551-1555. [ FIND THIS ARTICLE ONLINE ]
- Schlitt T, Brazma A (2005) Modelling gene networks at
different organisational levels. FEBS Lett 579: 1859-
1866. [ FIND THIS ARTICLE ONLINE ]
- Westerhoff HV, Palsson BO (2004) The evolution of molecular biology into systems biology. Nat Biotechnol
22: 1249-1252. [ FIND THIS ARTICLE ONLINE ]
- Yeung KY, Medvedovic M, Bumgarner RE (2004) From
co-expression to co-regulation: how many microarray
experiments do we need?. Genome Biol 5: R48. [ FIND THIS ARTICLE ONLINE ]
- Yip A, Horvath S (2007) Gene network
interconnectedness and the generalized topological
overlap measure. BMC Bioinformatics 8: 22. [ FIND THIS ARTICLE ONLINE ]
- Yu H, Luscombe NM, Qian J, Gerstein M (2003)
Genomic analysis of gene expression relationships in
transcriptional regulatory networks. Trends Genet 19: 422-
427. [ FIND THIS ARTICLE ONLINE ]
|
|
| This Article |
| DOWNLOAD |
|
| CONTRIBUTE |
|
| SHARE |
|
| EXPLORE |
|
|
|
|