Research Article |
Open Access |
Predicting Type 1 Diabetes Candidate Genes using
Human Protein-Protein Interaction Networks |
Shouguo Gao, Xujing Wang * |
Department of Physics & the Comprehensive Diabetes Center, University of Alabama at
Birmingham, 1300 University Blvd, Birmingham, AL 35294, USA |
| *Corresponding author: |
Dr. Xujing Wang, Department of Physics & the Comprehensive Diabetes Center,
University of Alabama at Birmingham, 1300 University Blvd, Birmingham,
AL 35294, USA,
E-mail : xujingw@uab.edu |
|
| Received February 27, 2009; Accepted March 30, 2009; Published April 01, 2009 |
|
Citation: Gao S, Wang X (2009) Predicting Type 1 Diabetes Candidate Genes using Human Protein-Protein Interaction
Networks. J Comput Sci Syst Biol 2: 133-146. doi:10.4172/jcsb.1000025 |
| |
Copyright: © 2009 Gao S, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author
and source are credited. |
| |
|
Background
Proteins directly interacting with each other tend to have similar functions and be involved in the same cellular
processes. Mutations in genes that code for them often lead to the same family of disease phenotypes. Efforts
have been made to prioritize positional candidate genes for complex diseases using protein-protein interaction
(PPI) information, but such an approach is often considered too general to be practically useful for specific
diseases.
Results
In this study we investigate the efficacy of this approach in type 1 diabetes (T1D). 266 known disease genes,
and 983 positional candidate genes from the 18 established linkage loci of T1D, are compiled from the T1Dbase
(http://t1dbase.org). We found that the PPI network of known T1D genes has topological features distinct from
those of other genes, with a significantly higher number of interactions among themselves even after adjusting for their high
network degrees (p<1e-5). We then define those positional candidates that are first degree PPI neighbours of
the 266 known disease genes to be new candidate disease genes. This leads to a list of 68 genes for further
study. Cross validation using the known disease genes as benchmark reveals that the enrichment is ~17.1 fold
over random selection, and ~4 fold better than using the linkage information alone. We find that the new candidates
are cited in T1D-related publications significantly (p<1e-7) more often than random, even after excluding
co-citation with the known disease genes; they are also significantly over-represented (p<1e-10) in the top 30
GO terms shared by known disease genes. Furthermore, sequence analysis reveals that they contain significantly
(p<0.0004) more protein domains that are known to be relevant to T1D. These findings provide indirect
validation of the newly predicted candidates.
Conclusion
Our study demonstrates the potential of the PPI information in prioritizing positional candidate genes for T1D. |
Abbreviations |
| T1D, Type 1 Diabetes; HT, High Throughput; PPI, Protein-Protein Interaction; HPRD, Human Protein Reference Database;
CC, clustering coefficient; KS-test, Kolmogorov-Smirnov test. |
Background |
Dissecting the genetics of complex diseases has been challenging.
Traditional linkage mapping approaches, developed
based on single disease gene concept, have been less powerful
due to locus heterogeneity and epistasis ( Glazier et al.,
2002; Botstein and Risch, 2003). According to T1Dbase
( http://t1dbase.org), a public website and database that supports
the type 1 diabetes (T1D) research community ( Hulbert
et al., 2007), 18 chromosome regions have shown linkage to
T1D (at least in some populations), and are believed to
harbour disease genes. They vary in length and gene count;
most contain numerous, up to 277 (the IDDM4 locus), genes.
For most loci the sequence variation responsible for the linkage
has not been identified ( Atkinson, 2005). |
This problem is not unique to T1D; most complex human
diseases face the same difficulty. A number of
bioinformatics/integrative approaches have been developed
to prioritize and narrow down the positional candidate gene
list obtained from linkage peaks, by bringing in other types
of data of the genes, including expression patterns, ontological
annotations, and text mining of PubMed abstracts,
etc. For example, several studies have utilized gene expression
profiles in relevant tissues and/or the eQTL information
of the gene expressions (Zhu et al., 2004; Cervino et
al., 2005; Chesler et al., 2005; Schadt et al., 2005; Gandhi et
al., 2006), including in the study of T1D (Eaves et al., 2002).
Recently, approaches to prioritize candidates based on their
functional relatedness to the known disease genes have been explored.
This is based on the concept that complex human
diseases are caused by multiple genes. Since they together
lead to the same or similar disease phenotypes, the genes
are likely to be related functionally. Such functional relatedness
can be inferred from their functional annotation, coexpression
pattern, and protein-protein interaction (PPI)
networks, etc. Indeed, analysis of known disease genes revealed
that those of the same diseases tend to have higher
and synchronized expressions as a group, and to interact
(PPI) with each other (Xu and Li, 2006). Therefore these
characteristics can be utilized in prioritizing positional candidates
and novel disease gene discovery. For example,
Franke et al. proposed to rank genes in candidate regions
by their relatedness to candidates in other regions, which
was evaluated according to their sharing of pathway and
GO (Gene Ontology) annotations, microarray co-expression,
and PPI (Franke et al., 2006). Bergholdt et al. (2007)
used PPI to identify network modules that contained significant
enrichment of proteins from interacting regions, and
hence novel candidate genes for T1D. Of these characteristics,
direct PPI is one of the strongest manifestations of a
functional relation between genes. Recent studies showed that mutations to interacting proteins can lead to similar disease
phenotypes (Lage et al., 2007; Sieberts and Schadt,
2007). High throughput analysis of all OMIM (Online Mendelian
Inheritance in Man,
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM) human diseases also indicated
the potential of utilizing PPI information alone to
prioritize disease gene candidates (Oti et al., 2006). |
Such integrative genomics approaches, though they show promise
theoretically in a general sense, are still often considered
not practically useful for specific diseases (Oti and
Brunner, 2007). In this study, we first examine the PPI
network structure of the known T1D genes. Based on the
results, we design an algorithm to prioritize the positional
candidates according to their PPI with the known T1D genes.
It leads to the identification of 68 new candidates. We examine
the likelihood of their involvement in T1D from several
aspects including their functional annotation, independent
citation in T1D-related publications, and protein sequence
domain characteristics. Our study differs from previous
T1D work by others (Bergholdt et al., 2007) in that
we take all the known functional and positional candidates
(according to T1DBase) as a starting point, rather than limiting
ourselves to only the positive predictions from the recent genome wide
association studies. Further, we offer a comprehensive evaluation
of our novel predictions. |
Methods |
T1D Genetic Data |
| The following data were downloaded from T1Dbase: the
complete list of 266 known functional candidate genes of
T1D (which will be termed known T1D genes in this study);
the 983 positional candidate genes from the 18 known T1D
linkage regions; and information on T1D-related and all Entrez
Gene publications. T1Dbase compiled the list of known functional
candidate genes for T1D from the Genetic Association
Database (http://geneticassociationdb.nih.gov, in which genes
showing association with human diseases are curated from
genetic studies reported in published scientific papers), and
from genes deemed of interest to T1D by the Wellcome
Trust Diabetes and Inflammation Laboratory. The 18 linkage
regions were compiled from published genome scan
studies. More detail can be found on its website, http://t1dbase.org. |
Candidate Gene Prediction |
| PPI annotation was downloaded from the Human Protein
Reference Database (HPRD http://www.hprd.org/).
Additionally, two high throughput PPI datasets by Rual et
al. (2005) and Stelzl et al. (2005) were obtained from the supplementary material of their papers, and were combined.
The pooled dataset is referred to as HT (high throughput) in
this study. The significance of the PPI enrichment among
the T1D genes was evaluated using the bootstrapping
method. In each of 20 iterations, we randomly selected the
same number of genes from the HPRD or the HT dataset,
and determined the PPI among them. The results were then
used to determine the PPI statistics for a random list of
genes. |
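The bootstrapping step described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names `count_internal_edges` and `bootstrap_edge_counts`, the edge-list representation, and the toy network with its gene labels are all assumptions introduced here.

```python
import random
import statistics

def count_internal_edges(genes, edges):
    """Count interactions whose two endpoints both lie in `genes`."""
    gene_set = set(genes)
    return sum(1 for a, b in edges if a in gene_set and b in gene_set)

def bootstrap_edge_counts(all_genes, edges, set_size, n_iter=20, seed=0):
    """Draw `n_iter` random gene sets of `set_size` and count their internal edges."""
    rng = random.Random(seed)
    pool = sorted(all_genes)  # sorted so that a fixed seed gives reproducible draws
    return [count_internal_edges(rng.sample(pool, set_size), edges)
            for _ in range(n_iter)]

# Toy network (hypothetical gene labels): a dense module plus a sparse chain.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
         ("E", "F"), ("F", "G"), ("G", "H"), ("H", "I"), ("I", "J")]
all_genes = {g for edge in edges for g in edge}
disease_genes = ["A", "B", "C", "D"]

observed = count_internal_edges(disease_genes, edges)
null_counts = bootstrap_edge_counts(all_genes, edges, len(disease_genes))
null_mean = statistics.mean(null_counts)
null_sd = statistics.stdev(null_counts)
print(f"observed={observed}, random={null_mean:.2f} +/- {null_sd:.2f}")
```

The ratio of the observed count to the mean of the random counts then gives a fold enrichment of the kind reported in the Results.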
The first-degree (level-1) PPI neighbours of all known
disease genes were first determined and called baited genes.
The algorithm then went through all 983 positional candidates,
and identified those that are baited genes to be new
candidate disease genes. The number of independent baits
(known T1D genes) for each baited gene was also determined.
The latter was in turn used to investigate if a gene
with more disease gene partners is more likely to be also a
disease gene. This could lead to further prioritization of the
predicted candidates. |
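A minimal sketch of this baiting step is given below, under stated assumptions: interactions are held as an undirected edge list, the function name `predict_candidates` and the toy gene labels are hypothetical, and genes already on the known list are excluded from the new candidates (an assumption consistent with the candidates being described as new).

```python
from collections import defaultdict

def predict_candidates(edges, known_genes, positional_candidates):
    """Return {candidate: set of known-gene baits} for positional candidates
    that directly interact with at least one known disease gene."""
    known = set(known_genes)
    baits = defaultdict(set)
    for a, b in edges:
        # Interactions are undirected, so check both orientations.
        for bait, prey in ((a, b), (b, a)):
            if bait in known and prey != bait:
                baits[prey].add(bait)
    return {gene: baits[gene]
            for gene in positional_candidates
            if gene in baits and gene not in known}

# Toy data (hypothetical labels): K* are known genes, P* positional candidates.
edges = [("K1", "P1"), ("K2", "P1"), ("K1", "K2"), ("K3", "P2"), ("P3", "X1")]
known = ["K1", "K2", "K3"]
positional = ["P1", "P2", "P3", "K2"]

result = predict_candidates(edges, known, positional)
# Rank by number of independent baits, most baits first.
ranked = sorted(result, key=lambda g: (-len(result[g]), g))
print(ranked)
```

Keeping the set of baits per candidate, rather than only a count, makes it easy to inspect which known genes support each prediction.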
Functional analysis of known and predicted T1D genes
was carried out using GOStat (Beissbarth and Speed,
2004). Compared with other ontological analysis tools, it has
the advantage that parent-child relationships between the GO
terms are considered (Beissbarth and Speed, 2004). The
protein domain information was retrieved from InterPro
(http://www.ebi.ac.uk/interpro/), and Fisher’s test was used
to examine domains overrepresented in known and predicted
T1D genes (Mulder et al., 2005). |
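As an illustration of the over-representation test, the sketch below computes a one-sided Fisher's exact p-value as a hypergeometric tail. Whether the original analysis was one- or two-sided is not stated, so the one-sided form is an assumption, as are the function name and the toy counts.

```python
from math import comb

def fisher_overrepresentation_p(hits_in_set, set_size,
                                hits_in_background, background_size):
    """One-sided Fisher's exact (hypergeometric tail) p-value: the probability
    of seeing at least `hits_in_set` annotated genes when `set_size` genes
    are drawn at random from the background."""
    max_hits = min(set_size, hits_in_background)
    total = comb(background_size, set_size)
    tail = sum(comb(hits_in_background, k)
               * comb(background_size - hits_in_background, set_size - k)
               for k in range(hits_in_set, max_hits + 1))
    return tail / total

# Toy counts (hypothetical): 5 of 20 background genes carry a domain,
# and 3 of the 4 genes in the set of interest carry it.
p = fisher_overrepresentation_p(3, 4, 5, 20)
print(f"p = {p:.4f}")
```

In practice a library routine such as `scipy.stats.fisher_exact` could be used instead; the pure-Python version is shown only to make the calculation explicit.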
Results |
Topological Features of the PPI Network of known
T1D Disease Genes |
To avoid any potential bias toward well-studied genes
(whose interactions with other genes are better characterized)
(Oti and Brunner, 2007; Ideker and Sharan, 2008), we
initially examined the PPI networks using information both
from the HPRD annotation, and from the 2 HT data sets
(Rual et al., 2005; Stelzl et al., 2005). Figure 1 presents the
results. We found that the T1D genes interact with each
other significantly more often than randomly selected gene
sets. Of all 20152 known human genes (according to NCBI’s
Gene database), 9222 are annotated in HPRD, and 4157 in
HT. For the 266 known T1D genes, 222 are annotated in
HPRD, and 75 in HT. There are a total of 34398 edges (in
network terminology, each node represents one protein molecule,
and an edge between two nodes means the two
molecules interact with each other) among the 9222 proteins
in HPRD, and 9277 edges among the 4157 proteins in
HT. The numbers for the T1D genes are 169 in HPRD, and
25 in HT, respectively. In contrast, bootstrapping yields only
21.1±4.2 and 3.7±2.3 interactions for a random gene set of
the same sizes. These are 8.0 and 6.8 fold enrichment, respectively.
The results from HPRD and HT are comparable,
and we do not observe any noticeable bias in the HPRD
dataset. In the rest of this study, we used HPRD only as it
contains more comprehensive information of PPI. |
|
Figure1: PPI networks of T1D disease genes according to HPRD (left) and HT (right).
|
|
|
Figure2: The topological features of the T1D disease genes in the PPI network are distinct from the other genes. (A) The
degree distribution of all proteins follows a power law (r~0.98, p<0.001), with p(k) ~ k^(-λ), λ ~1.35, indicating the PPI network
is scale free. The distribution for the candidate genes clearly deviates from the power law, skewed significantly toward higher
degrees. (B) The clustering coefficient (CC) is plotted against degree k. There is a linear decline in CC with increasing k,
suggesting that the network is modular. The distribution of the disease genes again deviates from random genes, with more
interactions among their level-1 neighbours.
|
|
It has been found that proteins of disease genes often
possess higher network degrees (i.e. number of interactions
with other proteins) than randomly selected genes (Tu
et al., 2006; Xu and Li, 2006). We found that this is indeed
true for the T1D genes (p<0.001, Kolmogorov-Smirnov test,
or, KS-test). This raises the question of whether the enrichment
was brought about by the higher degrees of the known
disease genes. To answer this question we bootstrapped
random genes with the same degree distribution. Significant
enrichment was still observed (p<1e-5), suggesting independent
contributions from other sources, likely their close
functional relatedness. |
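One way to draw random gene sets with the same degree distribution is to replace each disease gene by a randomly chosen gene of identical degree. The sketch below illustrates that idea only; the exact matching scheme used in the study is not described, so exact-degree matching, the function names, and the toy network are assumptions.

```python
import random
from collections import defaultdict

def degree_table(edges):
    """Map each gene to its number of distinct interaction partners."""
    partners = defaultdict(set)
    for a, b in edges:
        if a != b:
            partners[a].add(b)
            partners[b].add(a)
    return {gene: len(p) for gene, p in partners.items()}

def degree_matched_sample(gene_set, degrees, rng):
    """Replace each gene by a random, not yet chosen gene of identical degree."""
    by_degree = defaultdict(list)
    for gene, k in degrees.items():
        by_degree[k].append(gene)
    chosen = set()
    for gene in gene_set:
        pool = [g for g in sorted(by_degree[degrees[gene]]) if g not in chosen]
        chosen.add(rng.choice(pool))
    return chosen

# Toy network (hypothetical labels): a hub with three leaves, plus a 4-ring.
edges = [("H", "a"), ("H", "b"), ("H", "c"),
         ("r0", "r1"), ("r1", "r2"), ("r2", "r3"), ("r3", "r0")]
degrees = degree_table(edges)
gene_set = ["a", "r0"]
sample = degree_matched_sample(gene_set, degrees, random.Random(1))
print(sorted(sample))
```

For a real network, where few genes share exactly the same high degree, matching within degree bins would be a natural relaxation.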
Genetic networks have been found to be different from
random networks in structure. For example, they often exhibit
small-world and scale-free properties (Barabasi and
Oltvai, 2004). Therefore merely comparing the average
network behaviour may not be adequate. For this reason,
we also examined the topological properties of the disease
gene PPI networks. We find that the degrees of all proteins
follow a power law p(k) ~ k^(-λ), with λ ~1.35, r~0.98, and
p<0.001 (figure 2A), where p(k) is the probability density.
This indicates the PPI network is scale free. The distribution
for the 222 disease genes clearly deviates from the power
law, skewed significantly toward higher degrees, suggesting
that disease genes tend to have more interaction partners.
We also examined the clustering coefficient (CC) and
its dependence on degree k. CC measures how first degree
neighbours of the same node interact with each other,
namely, the cliquishness. Again a power law decline with increasing k is evident (figure 2B, r~0.70, p<0.01), suggesting
that the network is of modular structure (Barabasi and Oltvai,
2004). Here the known disease genes once more deviate
from the average behaviour of all genes in the genome,
skewed toward higher CC at the same degree k, with a
much shallower slope (0.67 versus 0.93, p~0.00014). This
implies that the disease genes likely form subnetwork modules
with much higher internal interactions than with genes
outside the module. |
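The two quantities used here, the clustering coefficient of a node and the slope of a power-law relation on log-log axes, can be sketched as below. The helper names and toy data are assumptions; for p(k) ~ k^(-λ) the fitted log-log slope corresponds to -λ.

```python
import math
from collections import defaultdict
from itertools import combinations

def build_adjacency(edges):
    """Undirected adjacency sets, ignoring self-interactions."""
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def clustering_coefficient(node, adj):
    """Fraction of pairs of neighbours of `node` that interact with each other."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))

def loglog_slope(xs, ys):
    """Least-squares slope of log10(y) against log10(x); all values must be > 0."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxx = sum((x - mx) ** 2 for x in lx)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lx, ly))
    return sxy / sxx

# Toy network (hypothetical labels): triangle A-B-C with D attached to C.
adj = build_adjacency([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
cc_a = clustering_coefficient("A", adj)
cc_c = clustering_coefficient("C", adj)

# Synthetic data lying exactly on y = 1000 * x^(-2).
slope = loglog_slope([1, 10, 100], [1000, 10, 0.1])
print(cc_a, round(cc_c, 3), round(slope, 3))
```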
These characteristics of disease genes are not unique to
T1D; they in fact mirror the results of similar studies of
other diseases, where it was found that disease genes tend
to have larger degrees, to be more likely to interact with other
disease genes, and to share more common neighbours (Tu et
al., 2006; Xu and Li, 2006). These results provide the conceptual
basis for candidate gene prediction utilizing PPI with
known disease genes. |
Cross Validation of the Candidate Gene Prediction
Algorithm |
| We first evaluated the performance of the disease gene
prediction algorithm using the known T1D genes as
benchmarks. In more detail, each time we randomly selected a fraction
f of the known T1D genes as baits, and tested how many of
the remaining 1-f fraction were predicted. We tested 6
different f values: 1/5, 1/3, 1/2, 2/3, 4/5 and 1, and for each
f value (except f=1, which was only used to calculate the
number of predicted genes, but not for cross validation as it leaves no testing set) we repeated the procedure 20 times. Figure 3 summarizes
the results. Evidently the number of predicted genes increases
with the number of baits (figure 3A). Interestingly,
the trend seems to slow down as the bait number increases.
This could be due to the limitations of our current knowledge
of PPI (incompleteness and quality issues, for example);
it may also suggest that the total number of T1D disease
genes is limited. Further investigation of this phenomenon
is needed when we have a better understanding of
PPI and T1D disease biology. The efficiency to recover the
known disease genes, defined as the odds of disease gene
enrichment in predicted candidates over random, seems to
be affected little by the number of baits, as shown in figure
3B. The high enrichment ratios, at ~17.1 (14.1-18.6) fold,
suggest that our baiting algorithm can recover the known
disease genes well. |
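The cross-validation loop can be sketched as follows. The exact definition of the enrichment odds is not spelled out in the text, so the fold enrichment used here (hit rate among predicted genes divided by the base rate of held-out known genes among all non-bait genes) is an assumption, as are the function name and the toy network.

```python
import random

def cross_validate(edges, known_genes, all_genes, fraction, rng):
    """Use a random `fraction` of known genes as baits and report the number of
    predicted genes and the fold enrichment of held-out known genes among them."""
    known = sorted(known_genes)
    n_baits = max(1, round(fraction * len(known)))
    baits = set(rng.sample(known, n_baits))
    held_out = set(known) - baits
    predicted = set()
    for a, b in edges:
        if a in baits and b not in baits:
            predicted.add(b)
        if b in baits and a not in baits:
            predicted.add(a)
    if not predicted or not held_out:
        # For example f = 1: nothing is left to test against.
        return len(predicted), float("nan")
    background = set(all_genes) - baits
    hit_rate = len(predicted & held_out) / len(predicted)
    base_rate = len(held_out) / len(background)
    return len(predicted), hit_rate / base_rate

# Toy network (hypothetical labels): known genes K1-K4 form a clique.
edges = [("K1", "K2"), ("K1", "K3"), ("K1", "K4"), ("K2", "K3"),
         ("K2", "K4"), ("K3", "K4"), ("K1", "X1"),
         ("X2", "X3"), ("X3", "X4"), ("X4", "X5"), ("X5", "X6")]
known = ["K1", "K2", "K3", "K4"]
all_genes = {g for edge in edges for g in edge}

n_predicted, fold = cross_validate(edges, known, all_genes, 0.5, random.Random(0))
print(n_predicted, round(fold, 2))
```

Repeating the call 20 times per value of f and averaging would mirror the procedure described above.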
|
Figure3: The size effect of the bait set. (A) Number of predicted disease genes increases with number of baits. (B) The
efficiency of the disease gene prediction algorithm, as judged by the odds ratio of known disease gene being recovered, does
not depend on the size of bait set.
|
|
How much improvement in predictive power did the addition
of PPI information bring in? The 18 known T1D linkage
loci together offer 983 positional candidates. Using the
known disease genes as benchmarks, 59 of the 266 T1D
genes are within the linkage regions; thus the linkage data
by themselves lead to a 4.5-fold enrichment (p<1e-17, Fisher’s
exact test). If we restrict the analysis to the 9222 genes annotated
in HPRD, 487 are within the linkage regions. For the 222
disease genes annotated in HPRD, 52 are within the linkage
regions. The enrichment by linkage information alone is
similar at ~4.4 fold (p<1e-15). Therefore, the PPI with
known disease genes brought in an additional ~4 fold of
enrichment. |
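The enrichment figures in this paragraph can be checked with a few lines of arithmetic. The sketch assumes that fold enrichment means the fraction of known genes lying in the linkage regions divided by the fraction of all genes lying there, a reading that reproduces the reported values; the function name is introduced here for illustration.

```python
def fold_enrichment(hits_in_set, set_size, hits_in_background, background_size):
    """Ratio of the hit fraction in the set to the hit fraction in the background."""
    return (hits_in_set / set_size) / (hits_in_background / background_size)

# All human genes: 59 of 266 known genes vs 983 of 20152 genes in linkage regions.
genome_wide = fold_enrichment(59, 266, 983, 20152)
# HPRD-annotated genes only: 52 of 222 known genes vs 487 of 9222 genes.
hprd_only = fold_enrichment(52, 222, 487, 9222)
# Additional enrichment contributed by PPI, relative to linkage alone.
extra_from_ppi = 17.1 / hprd_only

print(round(genome_wide, 1), round(hprd_only, 1), round(extra_from_ppi, 1))
```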
Predicted New Candidates |
| Using all 222 T1D genes (annotated in HPRD) as baits,
we arrived at a list of 68 predicted new candidates, given
in table 1. None of these has been previously associated with
T1D according to T1Dbase. Figure 4 depicts the interactions
between all known and predicted T1D genes. |
Network Properties |
| The network properties of the predicted genes are significantly
different from the average HPRD annotated
genes, and are much closer to the known T1D genes. The
number of interactions among themselves is significantly
higher than random (p<0.00001). In figures 2A and 2B, we
have also included plots of the predicted candidates. Evidently
they cluster with the known T1D genes, concentrate
more to the high-degree end (Figure 2A), and share more
first degree neighbours than random (figure 2B). |
Functional Properties |
| In table 2 we listed the top 30 GO molecular function
categories shared among the 222 known disease genes
(p<1e-22), and their statistics in the 68 new candidates.
These categories clearly indicate an involvement of immunity,
which is consistent with T1D being an autoimmune
disease. All categories have an enhancement ratio above 1,
except for the 4 with very low (0 or 1) representation in
the 68 predicted genes, which are sensitive to random effects.
14 have an enhancement ratio above 2. Putting all GO terms together, they are significantly (p<1.3e-10) over-represented
in the new candidates. |
Table 1: List of the 68 predicted disease genes.
*: most loci were named IDDM#, where IDDM stands for Insulin Dependent Diabetes Mellitus, another name for type 1 diabetes.
|
|
|
Figure4: The PPI network of known (circle) and predicted disease genes (diamond).
|
|
Table 2: The top 30 GO categories shared by the 266 known T1D genes, and their representation in the 68 predicted disease
genes.
|
|
Protein Sequence Analysis |
The function of a protein is determined by its shape and
primary structure (Mulder and Apweiler, 2008). InterPro is
an integrated database of protein families, domains and functional
sites. We examined the protein motifs that are overrepresented
in the known and predicted disease genes. Listed in table 3 are the top 10 (Fisher’s exact test, p<1e-16) motifs
shared among the known disease genes. 6 of them are
also over-represented in the 68 new candidates. For the
remaining 4, the expected number of genes that share the
motif (i.e. (# of the 9222 HPRD genes that share the motif) × 68/9222) is
far less than 1 (all below 0.25); therefore we do not have
enough statistical power to determine whether they are over-represented
or not. Taken together, the results from the 6 informative
motifs suggest that the predicted genes participate
in similar biological processes as the known T1D genes. Again, immune-related sequence features are overrepresented
in both the known and predicted genes, consistent
with the fact that T1D is an autoimmune disease. |
Table 3: Protein sequence motifs that are over-represented among known and predicted disease genes. Listed are the top 10
motifs shared in the known disease genes at p<2e-16 (Fisher’s exact test), together with their significance in the predicted
ones.
*The expected number of genes out of the 68 that share the motif is far below 1, <0.25. The actual number is 0. Not enough power for statistical analysis.
|
|
|
Figure5: The probability density distribution of normalized T1D citation. Both known (A) and predicted disease genes (B)
are cited significantly (p<1e-33, and p<1e-5, respectively, KS-test) more often in T1D-related publications than random
genes. In the analysis of the predicted genes, co-citations with known disease genes were excluded.
|
|
Literature Support |
To investigate the potential T1D relevance of the new
predictions, we further examined the literature citation of
both known and predicted disease genes. For each gene
we obtained the total number of PubMed citations and the
fraction that are T1D-related (according to T1Dbase). For
the predicted genes, one may argue that their appearance
in T1D publications could be a result of their interactions with the known disease genes, as interacting genes often
appear in the same publications. To address this issue, we
excluded from the analysis of the predicted genes all PubMed
records that have cited the known T1D genes. |
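The co-citation filter amounts to simple set arithmetic on publication identifiers. The sketch below is illustrative only: the function name `independent_t1d_fraction`, the representation of citations as sets of PubMed IDs, and the toy IDs are assumptions.

```python
def independent_t1d_fraction(gene_pmids, t1d_pmids, known_gene_pmids):
    """Fraction of a gene's citations that are T1D-related, after discarding
    every record that also cites a known T1D gene."""
    independent = set(gene_pmids) - set(known_gene_pmids)
    if not independent:
        return 0.0
    return len(independent & set(t1d_pmids)) / len(independent)

# Toy identifiers (hypothetical, not real PubMed IDs).
gene_pmids = {1, 2, 3, 4, 5}        # records citing the predicted gene
t1d_pmids = {2, 3, 9}               # records flagged as T1D-related
known_gene_pmids = {3, 5}           # records citing any known T1D gene

fraction = independent_t1d_fraction(gene_pmids, t1d_pmids, known_gene_pmids)
print(round(fraction, 3))
```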
We found that out of the 68 new candidates 13 (~20%)
are cited significantly more often than random in T1D publications
at p<0.05 (Fisher’s exact test), as compared to only
~6.9% of the HPRD genes. This is a ~3-fold enrichment.
As a group, the 68 candidates are significantly (p<1e-7)
more likely to appear in T1D-related publications than members
of a random set of 68 genes. Figure 5 presents a more
quantitative evaluation, by plotting the probability density distribution of the fraction of T1D-related citations. Interestingly,
the citation distribution also seems to follow a power law
approximately. As expected, the distribution for known disease
genes is significantly skewed toward higher T1D-related
citations (p<1e-33, KS-test, figure 5A). Of interest
is the fact that even after removing the co-citations with
the known T1D genes, the newly predicted disease genes
are also cited more often in T1D literature (p<1e-5, KS-test,
figure 5B). These results provide strong indirect evidence
of their potential involvement in T1D. |
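The KS-test compares two samples through the largest gap between their empirical cumulative distributions. The sketch below computes only that statistic, on made-up citation fractions; the function name and the numbers are assumptions, and a p-value would normally come from a library routine such as `scipy.stats.ks_2samp`.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute difference
    between the two empirical cumulative distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_values, x):
        return bisect_right(sorted_values, x) / len(sorted_values)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

# Toy fractions of T1D-related citations (hypothetical values).
candidate_fractions = [0.30, 0.45, 0.50, 0.80]
random_fractions = [0.00, 0.05, 0.10, 0.40]

d = ks_statistic(candidate_fractions, random_fractions)
print(d)
```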
|
Figure6: Candidates predicted by more baits are more
likely to be cited in T1D-related publications.
|
|
Number of Interactions with the known Disease
Genes, Possibilities to Prioritize the Predicted Genes? |
| Out of the 68 novel candidates, more than a third (24)
interact with at least two known disease genes, and about a
sixth (12) interact with at least three. This raises the question
whether interacting with more disease genes means
higher likelihood of also being a disease gene, and if such
information can be used to further prioritize the prediction.
This is intuitive, as subsets of genes having many more interactions
with each other than with others are likely to be
from the same functional network module, and consequently
to be involved in the same physiological processes and disease
phenotypes. |
We found that the Pearson correlation between the number
of baits and the significance of T1D citation (-log10(p), excluding co-citation with the known disease genes) was
~0.45. In figure 6 the fraction of genes with significant T1D-related
citations was plotted against number of baits. A loose
cut-off, p<0.2, was used due to the small number of predicted
genes. A positive monotonic trend is evident. We
also used KS-test to quantitatively evaluate this question.
Using 2, 3 and 4 baits as a cut-off we divided the 68 genes into groups of low and high number of baits and examined
the significance distribution in each group. We found that
with any cut-off the two groups are different with p<0.032
(2), p<0.019 (3), and p<0.05 (4), respectively. These all suggest
that the number of interactions with known disease
genes is likely an indicator of the candidate’s likelihood of being
a disease gene. |
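The two summaries used in this analysis, the Pearson correlation between bait count and -log10(p), and the fraction of genes with significant citation at each bait count, can be sketched as below. The bait counts and p-values are invented for illustration, and the function names are assumptions; only the p<0.2 cut-off is taken from the text.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def significant_fraction_by_baits(n_baits, p_values, cutoff=0.2):
    """For each bait count, the fraction of genes whose citation p-value is below `cutoff`."""
    groups = defaultdict(list)
    for n, p in zip(n_baits, p_values):
        groups[n].append(p < cutoff)
    return {n: sum(flags) / len(flags) for n, flags in sorted(groups.items())}

# Toy values (hypothetical): bait counts and citation p-values for 8 genes.
n_baits = [1, 1, 1, 2, 2, 3, 5, 6]
p_values = [0.9, 0.5, 0.15, 0.3, 0.05, 0.01, 1e-4, 1e-8]

scores = [-math.log10(p) for p in p_values]
r = pearson(n_baits, scores)
fractions = significant_fraction_by_baits(n_baits, p_values)
print(round(r, 2), fractions)
```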
Figure 7 shows the PPI network of the top 5 candidates
in terms of number of baits. On the top are ESR1 and VIL2,
each with 6 baits (table 1). Interestingly, they are also among
the top in terms of independent citations in T1D-related
publications and network degrees. ESR1, or estrogen receptor
1, has been cited in 139 (124, after removing co-citation
with known disease genes) T1D-related publications,
which ranked number 1 (1) out of the 68 candidates;
the number for VIL2 is 30 (29), ranked number 8 (7). The
odds ratios to random genes are all greater than 1, at 9.6 for
VIL2 and 6.2 for ESR1, with p~8.2e-19 and p~4.2e-9
(Fisher’s test), respectively. Both have abundant interactions
with other proteins, with k=163, #1 of the 68 for ESR1;
and k=43, #11 for VIL2. These are within the top 2% of all
genes, and both can be considered hubs. |
ESR1 is within the IDDM5 locus located at 6q25, and has
been purported to be responsible for the linkage (Pietropaolo
and Le Roith, 2001). IDDM5 is one of the few susceptibility
regions that have been replicated in multiple studies
(Pociot and McDermott, 2002). In addition, it is a major
disease gene for type 2 diabetes, and is strongly associated with obesity and lipid metabolism. VIL2 (also known as EZR,
or ezrin), is also located in IDDM5. Compared with ESR1,
it is a much less studied gene. It encodes a cytoplasmic
peripheral membrane protein that plays a key role in cell
surface structure adhesion, migration and organization. It
has been implicated in various human cancers. Its role in
T1D pathogenesis is still not clear, though multiple studies
have linked it to the progression and complications of diabetes
(Goh and Cooper, 2008). |
|
Figure7: PPI network of top 5 predictions (ellipse) and
their corresponding baits (round rectangle). Bright magenta
nodes represent genes with significant citation in T1D-related
publications (p<0.01).
|
|
The next on the list are three genes that each interact
with 5 known T1D genes: SMAD2, RELA and DAXX.
The numbers of independent citations in T1D-related publications
are 52 (#3, p<4.2e-18), 35 (#14, p~0.060), and 6 (#21,
p~0.42), respectively. They are all highly connected genes,
with degrees all in the top 5% of the 9222 HPRD proteins:
k=160 (#2 of the 68), k=98 (#5), and k=34 (#14), respectively. |
SMAD2 is a member of the SMAD family. Proteins of
this family are signal transducers and transcriptional modulators
that mediate multiple signaling pathways. SMAD2
mediates the signal of the transforming growth factor (TGF)-
beta, and thus regulates multiple cellular processes, such as
cell proliferation, apoptosis, and differentiation. TGF-beta
plays a central role in activation of inflammation, and in the
regulation of anti-islet CD8+ T cells by the CD4+CD25+ T
regulatory cells during T1D (Green et al., 2003). The secretion
of TGF-beta in recent onset T1D has been observed
to be elevated (Stechova et al., 2007). RELA is also known
as p65. Its protein is involved in the forming of the NF-κB
complex. NF-κB1 or NF-κB2 is bound to REL, RELA, or
RELB to form the NF-κB complex. The NF-κB1 (p50)/RELA
(p65) heterodimer is the most abundant form of the complex.
NF-κB activation has been implicated in the protection
of target cells against apoptosis by a variety of death effectors,
including cytokine mediated β-cell death (Chang et al.,
2003). DAXX, death-associated protein 6, is in the extended
MHC region (IDDM1). There is evidence of its involvement
in the T1D disease pathways in patients displaying
intermediate risk DQ-DR haplotypes (van der Slik et al.,
2007). It binds the receptor of TGF-beta and modulates the
TGF-beta apoptotic-signalling pathway (Perlman et al.,
2001). It physically interacts with the insulin-sensitive glucose
transporter, GLUT4 (Lalioti et al., 2002). |
Discussion |
Increasing evidence suggests that interacting proteins often
share similar function, and participate in the same biological
pathways and processes (Oti and Brunner, 2007).
Therefore mutations in genes coding for them could lead to similar disease phenotypes. These facts indicate that PPI
information alone may offer a simple, efficient means to
annotate protein functions and to prioritize candidate genes
for complex human diseases (Oti et al., 2006). In this study
we carried out a comprehensive PPI network analysis of
the known T1D disease genes. We found that they cluster
in the high degree region, are more likely to interact with each
other, and share more common interaction partners. We then
examined the potential of using PPI with known disease
genes in prioritizing the positional candidates of T1D. Among
the 983 genes within the 18 T1D linkage loci, 68 are first
degree PPI neighbours of the known T1D genes, which we
defined as the new candidate disease genes. Cross validation
indicates that the approach is ~17.1 fold better than
random selection to recover disease genes. Examination of
the new candidates revealed that they share with the known
disease genes a significant number of GO categories and
protein sequence motifs that are known to be important to
autoimmunity. Furthermore, they are cited significantly more
often in T1D-related publications, independent from their
co-citation with the known disease genes. These all provide
indirect support for their candidacy. |
Here we only used the direct interaction relationship
among genes. More sophisticated features, such as topological
overlap (Zhang and Horvath, 2005), average distance
to disease genes, positive topological coefficient (Xu and
Li, 2006), are worthy of consideration in future research.
Our analysis of the novel candidates relies heavily on the
present protein and gene annotation databases, and the available
literature reports of studies related to T1D. Therefore it
is likely limited by the quality of the PPI and linkage data,
and the current understanding of the T1D aetiology. Not all
the 266 known T1D genes can be consistently replicated in
different populations, nor can the 18 linkage regions (Atkinson,
2005). The recently published genome wide association studies
(GWAS) only confirmed a few of the previously identified
regions whilst offering evidence for several new
regions (Hakonarson et al., 2007; The Wellcome Trust Case
Control Consortium, 2007; Todd et al., 2007). There is a
tendency presently to consider the GWAS results to be the
ultimate verdict and to view previous findings that are not confirmed
by GWAS as false positives. If so, most of the 266
genes and the 18 regions could be false positives. It is rather
intriguing, then, that the new candidates predicted by our algorithm
show strong evidence of their potential involvement in
T1D, especially the independent citation in T1D-related publications.
Here we would like to emphasize that GWAS
are adequately powered to detect only very common
alleles, unless the alleles greatly increase disease risk, and
explain little of the genetic variation of disease. The intricacy of complex human diseases itself further compounds the interpretation
of the results from genetic studies. Population
difference, disease heterogeneity, the genetic mechanism
of the disease including alleles with small effect sizes, epistatic
interaction, epigenetic inheritance, copy number variation,
etc, all raise the question of how much reliance one
should place on an individual type of genetic data obtained from
a certain population, including the GWAS. However, we
believe that by taking an integrative approach, and examining
the convergent predictions, the noise and consequently the
false positives will be reduced, and true signals will be amplified.
Therefore, before our understanding of the disease
aetiology improves, it is better to be inclusive at the beginning
of an integrative approach. |
68 candidates may still be too many for association or
functional studies. Further prioritization is needed. The results
of this study suggest that the topological features in
the PPI network with known disease genes, the functional
and sequence information, and the literature citation can
provide further discrimination among the predicted candidates.
For example, it is possible to rank them according to their
position and degree in the PPI network, degree of interaction
with known disease genes, citation by T1D-related publications,
protein sequence motifs, as well as expression
pattern and gene ontology. A composite metric could be
defined for candidates based on these properties to describe
their likelihood of being true disease genes. These properties
are not necessarily all independent. As an example, figure
6 illustrates the potential confounding between the number
of interacting disease genes and the number of citations in
T1D-related publications. Therefore, when designing such
composite measures, sophisticated approaches such as the
Bayesian method, which can handle non-independent factors,
are needed. We are investigating these issues in a separate
study (manuscript in preparation). |
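As a purely illustrative sketch of what such a composite metric might look like, the Python fragment below converts each property to a rank fraction and takes a weighted average. This is a naive rank-average score, not the Bayesian method called for above: it treats the properties as independent, so it would double-count correlated ones such as the number of interacting disease genes and the citation count. The gene names, feature values, weights, and function names are all hypothetical.

```python
from typing import Dict


def average_ranks(values: Dict[str, float]) -> Dict[str, float]:
    """Return 1-based ranks (smallest value = 1); tied values share their mean rank."""
    ordered = sorted(values.items(), key=lambda kv: kv[1])
    ranks: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j to the end of the run of values tied with ordered[i].
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for gene, _ in ordered[i:j + 1]:
            ranks[gene] = mean_rank
        i = j + 1
    return ranks


def composite_scores(features: Dict[str, Dict[str, float]],
                     weights: Dict[str, float]) -> Dict[str, float]:
    """Weighted mean of per-feature rank fractions, each scaled to [0, 1].

    Assumes every feature is defined for the same set of genes and that
    a larger feature value means stronger evidence.
    """
    genes = list(next(iter(features.values())))
    n = len(genes)
    total_weight = sum(weights[name] for name in features)
    scores = {gene: 0.0 for gene in genes}
    for name, values in features.items():
        ranks = average_ranks(values)
        for gene in genes:
            fraction = (ranks[gene] - 1) / (n - 1) if n > 1 else 1.0
            scores[gene] += weights[name] * fraction / total_weight
    return scores


# Hypothetical candidates and feature values, invented for illustration.
features = {
    "interacting_disease_genes": {"GENE_A": 5, "GENE_B": 2, "GENE_C": 1},
    "t1d_citations": {"GENE_A": 10, "GENE_B": 0, "GENE_C": 3},
}
weights = {"interacting_disease_genes": 0.5, "t1d_citations": 0.5}

scores = composite_scores(features, weights)
for gene in sorted(scores, key=scores.get, reverse=True):
    print(gene, round(scores[gene], 3))
```

Changing the weights, or dropping one of two correlated features, is a simple way to see how sensitive such a ranking is to the non-independence noted above.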
In this study we focused on prioritizing positional candidate
genes within the linkage loci. By its nature, the approach
can be applied to candidate genes obtained by other
means. With the advancement of the human genome and
the HapMap projects, emerging technologies
have made GWAS a practical way for many laboratories to identify
genetic variants that contribute to common diseases
(Hirschhorn and Daly, 2005; Wang et al., 2005). While
GWAS has the potential to catch all disease genes, sample
size and power issues, among others, still limit its ability to
obtain a complete picture of the genetic risk, or to identify
genes that in combination cause disease predisposition while
each on its own contributes only moderately to the risk.
GWAS typically produces a large number of potential candidate genes. Normally, only markers with extremely low
p-values (usually <~1e-7) are retained because of the power
and multiple testing issues. Lowering the threshold
admits many false positives, though it is believed that the region
immediately below the significance threshold harbours many
true disease genes (The Wellcome Trust Case Control Consortium,
2007). These regions need to be investigated to
fully dissect the genetics of complex diseases. The significance
of novel candidates can be investigated further, by
including the GWAS results in the definition of the composite
likelihood measures of the prioritization scheme. On the
other hand, information obtained from other approaches, such
as PPI networks, can in turn help the analysis of
the GWAS data. An analytical prioritization scheme that
brings in other evidence could allow one to reduce
the number of statistical tests to be performed, and to
identify disease genes from the suboptimal p-value regions. |
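The effect of reducing the number of tests can be illustrated with a simple Bonferroni correction, which is assumed here only for illustration; the studies cited may apply other criteria. With a family-wise error rate of 0.05 and a hypothetical 500,000 markers, the per-marker threshold is about 1e-7, whereas restricting the tests to a hypothetical 5,000 markers selected in advance from other evidence raises it to about 1e-5. The marker counts, the example p-value, and the function names are invented.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def is_significant(p_value: float, alpha: float, n_tests: int) -> bool:
    """True if p_value passes the Bonferroni-corrected threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


ALPHA = 0.05                    # family-wise error rate (assumed)
GENOME_WIDE_MARKERS = 500_000   # hypothetical number of markers in a GWAS
PRIORITIZED_MARKERS = 5_000     # hypothetical subset chosen from other evidence
EXAMPLE_P = 5e-6                # hypothetical marker in the suboptimal p-value region

for label, n_tests in (("genome-wide", GENOME_WIDE_MARKERS),
                       ("prioritized", PRIORITIZED_MARKERS)):
    threshold = bonferroni_threshold(ALPHA, n_tests)
    verdict = is_significant(EXAMPLE_P, ALPHA, n_tests)
    print(f"{label}: threshold={threshold:.1e}, significant={verdict}")
```

The relaxed threshold is valid only if the prioritized set is fixed without reference to the association p-values themselves.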
Acknowledgements |
| This work is supported in part by the National Institute of
Diabetes and Digestive and Kidney Diseases under grant
No. R01DK080100. |
References |
- Atkinson MA (2005) Thirty Years of Investigating the
Autoimmune Basis for Type 1 Diabetes: Why Can’t We
Prevent or Reverse This Disease. Diabetes 54(5): 1253-63.
- Barabasi AL, Oltvai ZN (2004) Network biology: understanding
the cell’s functional organization. Nat Rev
Genet 5: 101-13.
- Beissbarth T, Speed TP (2004) GOstat: find statistically
overrepresented Gene Ontologies within a group of
genes. Bioinformatics 20: 1464-5.
- Bergholdt R, Storling ZM, Lage K, Karlberg EO, Olason
PI, et al. (2007) Integrative analysis for finding genes
and networks involved in diabetes and other complex
diseases. Genome Biol 8: R253.
- Botstein D, Risch N (2003) Discovering genotypes underlying
human phenotypes: past successes for mendelian
disease, future approaches for complex disease. Nat
Genet 33: 228-37.
- Cervino AC, Li G, Edwards S, Zhu J, Laurie C, et al.
(2005) Integrating QTL and high-density SNP analyses
in mice to identify Insig2 as a susceptibility gene for
plasma cholesterol levels. Genomics 86: 505-17.
- Chang I, Kim S, Kim JY, Cho N, Kim YH, et al. (2003)
Nuclear factor kappaB protects pancreatic beta-cells
from tumor necrosis factor-alpha-mediated apoptosis.
Diabetes 52: 1169-75.
- Chesler EJ, Lu L, Shou S, Qu Y, Gu J, et al. (2005)
Complex trait analysis of gene expression uncovers polygenic
and pleiotropic networks that modulate nervous
system function. Nat Genet 37: 233-42.
- Eaves IA, Wicker LS, Ghandour G, Lyons PA, Peterson
LB, et al. (2002) Combining mouse congenic strains and
microarray gene expression analyses to study a complex
trait: the NOD model of type 1 diabetes. Genome
Res 12: 232-43.
- Franke L, Bakel Hv, Fokkens L, de Jong ED, Egmont
PM, et al. (2006) Reconstruction of a Functional Human
Gene Network, with an Application for Prioritizing
Positional Candidate Genes. American journal of human
genetics 78: 1011-1025.
- Gandhi TK, Zhong J, Mathivanan S, Karthick L,
Chandrika KN, et al. (2006) Analysis of the human protein
interactome and comparison with yeast, worm and
fly interaction datasets. Nat Genet 38: 285-93.
- Glazier AM, Nadeau JH, Aitman TJ (2002) Finding genes
that underlie complex traits. Science 298: 2345-9.
- Goh SY, Cooper ME (2008) The Role of Advanced
Glycation End Products in Progression and Complications
of Diabetes. J Clin Endocrinol Metab 93: 1143-1152.
- Green EA, Gorelik L, McGregor CM, Tran EH, Flavell
RA (2003) CD4+CD25+ T regulatory cells control anti-islet
CD8+ T cells through TGF-beta-TGF-beta receptor
interactions in type 1 diabetes. Proc Natl Acad Sci
USA 100: 10878-83.
- Hakonarson H, Grant SF, Bradfield JP, Marchand L, Kim
CE, et al. (2007) A genome-wide association study identifies
KIAA0350 as a type 1 diabetes gene. Nature 448:
591-4.
- Hirschhorn JN, Daly MJ (2005) Genome-wide association
studies for common diseases and complex traits.
Nat Rev Genet 6: 95-108.
- Hulbert EM, Smink LJ, Adlem EC, Allen JE, Burdick
DB, et al. (2007) T1DBase: integration and presentation
of complex data for type 1 diabetes research. Nucleic
Acids Res 35: D742-6.
- Ideker T, Sharan R (2008) Protein networks in disease.
Genome Res 18: 644-52.
- Lage K, Karlberg EO, Storling ZM, Olason PI, Pedersen
AG, et al. (2007) A human phenome-interactome network
of protein complexes implicated in genetic disorders. Nat
Biotechnol 25: 309-16.
- Lalioti VS, Vergarajauregui S, Pulido D, Sandoval IV
(2002) The insulin-sensitive glucose transporter, GLUT4,
interacts physically with Daxx. Two proteins with capacity
to bind Ubc9 and conjugated to SUMO1. J Biol
Chem 277: 19783-91.
- Mulder NJ, Apweiler R (2008) The InterPro database
and tools for protein domain analysis. Curr Protoc
Bioinformatics Chapter 2: Unit 2-7.
- Mulder NJ, Apweiler R, Attwood TK, Bairoch A,
Bateman A, et al. (2005) InterPro, progress and status
in 2005. Nucleic Acids Res 33: D201-5.
- Oti M, Brunner HG (2007) The modular nature of genetic
diseases. Clin Genet 71: 1-11.
- Oti M, Snel B, Huynen MA, Brunner HG (2006) Predicting
disease genes using protein-protein interactions.
J Med Genet 43: 691-8.
- Perlman R, Schiemann WP, Brooks MW, Lodish HF,
Weinberg RA (2001) TGF-beta-induced apoptosis is
mediated by the adapter protein Daxx that facilitates JNK
activation. Nat Cell Biol 3: 708-14.
- Pietropaolo M, Le Roith D (2001) Pathogenesis of diabetes:
our current understanding. Clin Cornerstone 4: 1-16.
- Pociot F, McDermott MF (2002) Genetics of type 1 diabetes
mellitus. Genes Immun 3: 235-49.
- Rual JF, Venkatesan K, Hao T, Hirozane KT, Dricot A,
et al. (2005) Towards a proteome-scale map of the human
protein-protein interaction network. Nature 437:
1173-8.
- Schadt EE, Lamb J, Yang X, Zhu J, Edwards S, et al.
(2005) An integrative genomics approach to infer causal
associations between gene expression and disease. Nat
Genet 37: 710-7.
- Sieberts SK, Schadt EE (2007) Moving toward a system
genetics view of disease. Mamm Genome 18: 389-401.
- Stechova K, Bohmova K, Vrabelova Z, Sepa A,
Stadlerova G, et al. (2007) High T-helper-1 cytokines
but low T-helper-3 cytokines, inflammatory cytokines and
chemokines in children with high risk of developing type
1 diabetes. Diabetes Metab Res Rev 23: 462-71.
- Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck
FH, et al. (2005) A human protein-protein interaction
network: a resource for annotating the proteome. Cell
122: 957-68.
- The Wellcome Trust Case Control Consortium (2007)
Genome-wide association study of 14,000 cases of seven
common diseases and 3,000 shared controls. Nature 447:
661-78.
- Todd JA, Walker NM, Cooper JD, Smyth DJ, Downes
K, et al. (2007) Robust associations of four new chromosome
regions from genome-wide analyses of type 1
diabetes. Nat Genet 39: 857-64.
- Tu Z, Wang L, Xu M, Zhou X, Chen T, et al. (2006)
Further understanding human disease genes by comparing
with housekeeping genes and other genes. BMC Genomics 7: 31.
- van der Slik AR, van den Eng I, Eerligh P, Doxiadis II,
Koeleman BP, et al. (2007) Sequence variation within
the major histocompatibility complex subregion centromeric
of HLA class II in type 1 diabetes. Tissue Antigens
69: 348-53.
- Wang WY, Barratt BJ, Clayton DG, Todd JA (2005)
Genome-wide association studies: theoretical and practical
concerns. Nat Rev Genet 6: 109-18.
- Xu J, Li Y (2006) Discovering disease-genes by topological
features in human protein-protein interaction network.
Bioinformatics 22: 2800-5.
- Zhang B, Horvath S (2005) A general framework for
weighted gene co-expression network analysis. Stat Appl
Genet Mol Biol 4: Article17.
- Zhu J, Lum PY, Lamb J, Guha TD, Edwards SW, et al.
(2004) An integrative genomics approach to the reconstruction
of gene networks in segregating populations.
Cytogenet Genome Res 105: 363-74.