Research Article |
Open Access |
|
|
Modeling Host-Cancer Genetic
Interactions with Multilocus Sequence Data |
Yao Li * and Rongling Wu 1*,† |
*Department of Statistics, University of Florida, Gainesville, FL 32611 USA |
†Departments of Public Health Sciences and Statistics, Pennsylvania State University, Hershey, PA 17033 |
| 1Corresponding author: |
Dr. Rongling Wu, Department of Statistics,
University of Florida, Gainesville, FL 32611 USA,
Email : rwu@hes.hmc.psu.edu |
|
| Received January 06, 2009; Accepted February 04, 2009; Published February 05, 2009 |
|
Citation: Yao L, Rongling W (2009) Modeling Host-Cancer Genetic Interactions with Multilocus Sequence Data. J Comput
Sci Syst Biol 2: 024-043. |
| |
Copyright: © 2009 Yao L, et al. This is an open-access article distributed under the terms of the Creative Commons
Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author
and source are credited. |
| |
|
Cancer susceptibility may be controlled not only by host genes and mutated genes in cancer cells, but also by
the epistatic interactions between genes from the host and cancer genomes. We derive a novel statistical model
for cancer gene identification by integrating the gene mutation hypothesis of cancer formation into the mixturemodel
framework. Within this framework, genetic interactions of DNA sequences (or haplotypes) between host
and cancer genes responsible for cancer risk are defined in terms of quantitative genetic principle. Our model
was founded on a commonly used genetic association design in which a random sample of patients is drawn from
a natural human population. Each patient is typed for single nucleotide polymorphisms (SNPs) on normal and
cancer cells and measured for cancer susceptibility. The model is formulated within the maximum likelihood
context and implemented with the EM algorithm, allowing the estimation of both population and quantitative
genetic parameters. The model provides a general procedure for testing the distribution of haplotypes constructed
by SNPs from host and cancer genes and the linkage disequilibria of different orders among the SNPs.
The model also formulates a series of testable hypotheses about the effects of host genes, cancer genes, and
their interactions on cancer susceptibility. We carried out simulation studies to examine the statistical properties
of the model. The implications of this model for cancer gene identification are discussed.
|
Keywords |
Cancer Genome; EM Algorithm; Haplotype; Host Genome; Single Nucleotide Polymorphism |
Introduction |
Since the recognition of cancer as a genetic disease, a
number of familial cancer genes with high-penetrance mutations,
such as oncogenes and the tumor suppressors, have
been chromosomally identified, isolated or cloned ( Balman
et al. 2003; Rand et al. 2008). However, growing evidence
shows that most cancer is the result of an intricate interaction
of lowpenetrance genetic variants with environmental
exposures that humans experience ( Brennan 2002). These
low-penetrance cancer genes, each usually with a minor
effect and cooperating with others in a complicated web,
are difficult to detect and, therefore, their contribution to the risk of cancer development remains unclear. There is a
pressing demand on the development of powerful statistical
models and computational algorithms for identifying and
mapping specific DNA sequence variants that regulate cancer
susceptibility. |
Human cancer cells frequently possess large-scale chromosomal
rearrangements due to chromosomal instability
(CIN) (Stock and Bialy 2002; Thompson and Compton 2008)
or gene mutation (Jallepalli and Lengauer 2001; Greenman
et al. 2007). CIN makes whole chromosomes or large fractions of chromosomes gained or lost during cell division, resulting
in an imbalance in the number of chromosomes per
cell (aneuploidy) and an enhanced rate of loss of heterozygosity.
Thus, the “aneuploidy hypothesis of cancer” (Stock
and Bialy 2002) proposes that the main differences between
normal and abnormal (cancer) cells result from the number
of genes rather than the types of genes differentially expressed,
as opposed to the “gene-mutation hypothesis”
(Jallepalli and Lengauer 2001). In general, cancer incidence
and development are not only affected by the host genes,
but also by genes derived from the cancer cells themselves.
Given strong mechanistic interactions between the host and
cancer tissues (Araujo and McElwain 2006), these two different
systems of genes operate interactively or epistatically
to alter the course of cancer progression. Thus, it can be
well anticipated that any statistical model for gene detection
that incorporates these two hypotheses are likely to
make groundbreaking discoveries of cancer genes. |
Genetic mapping has proven to be a powerful approach
for detecting quantitative trait loci (QTLs) for complex traits.
But a QTL may contain multiple genes that operate in a
collective way. It is not possible to study the DNA structure,
organization and function of a QTL detected from a
mapping approach. A more accurate and useful approach
for the characterization of genetic variants contributing to
quantitative variation is to directly analyze DNA sequences,
known as haplotypes, associated with a particular disease
(Liu et al. 2004; Lin and Wu 2006). If a string of DNA
sequence is known to increase disease risk, this risk can be
prevented by inhibiting the expression of this string using a
specialized drug. The control of this disease can be made
more efficient if all possible DNA sequences determining
its variation are identified in the entire genome. The elucidation
of the entire human genome has been accelerated by
the haplotype map, or HapMap, constructed by SNPs (The
International HapMap Consortium 2003). More recently, the
marvelous plans of sequencing the cancer genomes (Kaiser
2005) will provide unprecedented fuel for studying the
genetic architecture of cancer risk. |
In this article, we will derive a statistical model for detecting
the actions and interactions of haplotypes derived
from the host and cancer genomes for cancer susceptibility.
We will incorporate the “gene-mutation hypothesis” into
the model. The “aneuploidy hypothesis of cancer” will be
considered in a next paper. Through the release of software
to the public, our statistical model will serve as a routine means for the genetic diagnosis of cancer risk. Results
from the model will provide scientific guidance for clinical
doctors to design an optimal treatment scheme in terms of
cancer genes and patient’s genes. |
Design |
Sampling Strategies |
| Suppose there is a natural human population at Hardy–
Weinberg equilibrium (HWE) from which a random sample
is drawn for cancer gene identification. In order to identify
DNA sequences responsible for cancer susceptibility, we
genotype SNPs from the entire host genome and also SNPs
from the cancer genome for the same patient. We assume
that the cancer genome is a diploid, whose difference from
the normal genome is due to the “genemutation hypothesis”.
Recent molecular surveys suggest that the human genome
contains many discrete haplotype blocks that are sites of
closely located SNPs (Daly et al. 2001; Patil et al. 2001;
Gabriel et al. 2002). Each block may have a minimal subset
of SNPs, i.e., “tagging” SNPs, that can characterize the
most common haplotypes. Our model will be based on tagging
SNPs within each haplotype block. Although no detailed
information about the structure of the cancer genome
is available, we can assume that a particular set of SNPs
may contribute to cancer formation at the haplotype level.
The tenet of our epistatic model is that the effect of a given
DNA sequence in the host genome on cancer is masked or
enhanced by one or more sequences in the cancer genome. |
Genetic Models |
| Population Genetic Model: Consider a set of R tagging
SNPs from a haplotype block of the host genome and a set
of S SNPs from the cancer genome. We denote two alleles
of SNP r from the host genome by Hkrr (kr = 1 ,0; r= 1 , · · ·R) and two alleles of SNP s from the cancer genome by Clss (ls = 1, 0; s = 1, · · · S). Let PkrH and PlsC denote
allele frequencies at the corresponding SNP from the
host and cancer genomes, respectively. All the SNPs considered
from the host and cancer genomes form 2R+S possible
joint haplotypes expressed as (Hk11 Hk22· · ·HkRR) (Cl11 Cl22· · ·Clss). The corresponding
haplotype frequencies are denoted by P(k1k2· · ·kR) (l1l2· · ·lS), which are composed of allele frequencies
at each SNP and linkage disequilibria of different
orders among SNPs within and between the genomes (Wu
and Lin 2008). A general expression for the relationships between haplotype frequencies and allele frequencies and linkage disequilibria was originally given by Bennett (1954). Table
1 lists the compositions of the frequency of a haplotype constructed jointly by two SNPs from the host genome and two SNPs
from the cancer genome in which linkage disequilibria are specified with two, three, and four sites. From these compositions,
linkage disequilibria are expressed as |
DH1H2 |
= (p(11)(11) + p (11)(10) + p (11)(01) + p (11)(00))( p (00)(11) + p (00)(10) + p (00)(01) + p (00)(00))
- (p (10)(11) + p (10)(10) + p (10)(01) + p (10)(00))( p (01)(11) + p (01)(10) + p (01)(01) + p (01)(00)) (1) |
DC1C2 |
= (p (11)(11) + p (10)(11) + p (01)(11) + p (00)(11))( p (11)(00) + p (10)(00) + p (01)(00) + p (00)(00))
- (p (11)(10) + p (10)(10) + p (01)(10) + p (00)(10))( p (11)(01) + p (10)(01) + p (01)(01) + p (00)(01)) (2) |
|
DH2C1 |
= (p (11)(11) + p (11)(10) + p (01)(11) + p (01)(10))( p (10)(01) + p (10)(00) + p (00)(01) + p (00)(00))
- (p (11)(01) + p (11)(00) + p (01)(01) + p (01)(00))( p (10)(11) + p (10)(10) + p (00)(11) + p (00)(10)) (3) |
DH2C2 |
= (p (11)(11) + p (11)(01) + p (01)(11) + p (01)(01))( p (10)(10) + p (10)(00) + p (00)(10) + p (00)(00))
- (p (11)(10) + p (11)(00) + p (01)(10) + p (01)(00))( p (10)(11) + p (10)(01) + p (00)(11) + p (00)(01)) (4) |
DH1C1 |
= (p (11)(11) + p (11)(10) + p (10)(11) + p (10)(10))( p (01)(01) + p (01)(00) + p (00)(01) + p (00)(00))
- (p (11)(01) + p (11)(00) + p (10)(01) + p (10)(00))( p (01)(11) + p (01)(10) + p (00)(11) + p (00)(10)) (5) |
DH1C2 |
= (p(11)(11) + p(11)(01) + p(10)(11) + p(10)(01))(p(11)(10) + p(11)(00) + p(10)(10) + p(10)(00))
- (p(11)(10) + p(11)(00) + p(10)(10) + p(10)(00))(p(01)(11) + p(01)(01) + p(00)(11) + p(00)(01)) (6) |
DH1H2C1 |
= [(p(11)(11) + p(11)(10))(p(10)(01) + p(10)(00)) + (p(01)(01) + p(01)(00))(p(00)(11) + p(00)(10))]
- [(p(00)(01) + p(00)(00))(p(01)(11) + p(01)(10)) + (p(11)(01) + p(11)(00))(p(10)(11) + p(10)(10))] (7) |
DH1H2C2 |
= [(p(11)(11) + p(11)(01))(p(01)(10) + p(01)(00)) + (p(01)(10) + p(01)(00))(p(00)(11) + p(00)(01))]
- [(p(00)(10) + p(00)(00))(p(01)(11) + p(01)(01)) + (p(11)(10) + p(11)(00))(p(10)(11) + p(10)(01))] (8) |
DH1C1C2 |
= [(p(11)(11) + p(10)(11))(p(11)(00) + p(10)(00)) + (p(01)(10) + p(00)(10))(p(01)(01) + p(00)(01))]
- [(p(01)(00) + p(00)(00))(p(01)(11) + p(00)(11)) + (p(11)(10) + p(10)(10))(p(11)(01) + p(10)(01))] (9) |
DH2C1C2 |
= [(p(11)(11) + p(01)(11))(p(11)(00) + p(01)(00)) + (p(10)(10) + p(00)(10))(p(10)(01) + p(00)(01))]
- [(p(10)(00) + p(00)(00))(p(10)(11) + p(00)(11)) + (p(11)(10) + p(01)(10))(p(11)(01) + p(01)(01))] (10) |
 |
Table 1: Disequilibrium compositions of four-SNP haplotype frequencies derived from the host and cancer genomes.
The random combination of maternal and paternal haplotypes generates 2(R+S−1)(2R+S+1) diplotypes expressed as (H1k1H2k2 · · ·HRkR)(C1l1C2l2 · · ·CSlS )|(H1k01H2k02 · · ·HRk0R)(C1l'1C2l'2 · · ·CSl'S)(k1 ≥ k'1, k2 ≥ k'2 , ..., kR ≥ k'R = 1, 0; l1 ≥ l'1 , k2 ≥ l'2 , ..., lS ≥ l'S = 1,0)
. We use a vertical line to separate two haplotypes derived from the maternal and paternal
parents, respectively, for a given diplotype. Under the HWE assumption, diplotype frequencies are expressed as the products
of the frequencies of the two haplotypes that constitute the diplotype, i.e.,
| |
= |
|
p2(
k1k2···kR)(l1l2···lS) |
k1 = k01 , k2 = k02 , · · · , kR = k0R ; |
p(k1k2···kR)(l1l2···lS)|(k'1k'2 ···k'R )
(l'1 l'2 ···l'S) |
|
l1 = l'1 , l2 = l'2, · · · , lS = l'S |
| |
2p(k1k2···kR)(l1l2···lS)p(k'1 k'2 ···k'R )(l'1l'2 ···l'S ) |
Otherwise. |
|
|
In practice, diplotypes cannot be observed, although observable
zygotic genotypes will be the same as diplotypes
when at most one SNP is heterozygous. Thus, the numbers
of zygotic genotypes, 3R+S, will be less than the number of
diplotypes. |
Quantitative Genetic Model: Our model will be derived
to characterize haplotypes that are responsible for
complex traits because the association between haplotype
diversity and phenotypic variation has been detected by several
genetic studies (Judson 2000; Bader, 2001; Rha et al.
2007). Among all possible haplotypes are there some particular
haplotypes, called the risk haplotype (A), that perform
differently than the rest of the haplotypes, called the
non-risk haplotype (Ā ). The combinations between the
risk and non-risk haplotypes, AA , AĀ, and ĀĀ are called
the composite diplotype (Liu et al. 2004; Wu and Lin 2008). |
Let A,Ā and B, B̄ denote the risk haplotypes and non-risk
haplotypes for a series of SNPs genotyped from the host and cancer genomes, respectively. These two genomes form
nine different composite diplotypes expressed as AABB,
AABB̄, AAB̄B̄, AĀBB, AĀBB̄,AĀB̄B̄, AĀB̄B, ĀĀBB̄, and ĀĀB̄B̄. We will use Mather and Jinks’s (1982)
formulation for genetic epistasis between different loci (Table
2) to model the genetic effects of the composite diplotypes.
The genotypic value (µj1j2 ) of a joint composite diplotype
from the two genomes can be decomposed into nine different
components as follows: |
| μj1j2 = |
μ |
Overall mean |
| |
+(j1 - 1)aH + (j2 - 1)aC |
Additive effects |
| |
+j1dH + j2dC |
Dominant effects |
| |
+(j1 - 1)(j2 - 1)iaa |
Additive × additive effect (12) |
| |
+(j1 - 1)j2iad |
Additive × dominance effect |
| |
+j1(j2 - 1)ida |
Dominance × additive effect |
| |
+(1 - j1)(1 - j2)idd |
Dominance × dominance effect, |
| where |
| |
 |
2 |
for AA or BB |
| j1, j2 = |
1 |
for AĀ or BB̄ |
| |
0 |
for ĀĀ or B̄B̄ |
|
|
stand for the composite diplotypes from the host and cancer genomes, respectively,µ is the overall mean; aH and aC are the
additive effects of haplotypes from the host and cancer genome; dH and dC, the dominance effects of haplotypes; and iaa, iaa, iaa, and iaa, the additive × additive, additive × dominance, dominance × additive, and dominance × dominance epistatic effects
between the haplotypes from the two different genomes (Table 2). |
Table 2: Additive, dominance, and epistatic compositions of the genotypic value of a composite diplotype constructed with
haplotypes from the host and cancer genomes.
|
|
Different types of genetic actions and interactions can be expressed in terms of genotypic values by solving a group of
regular equations (12). This lets us describe the overall mean, additive, dominance, and four kinds of epistatic effects between
the two genomes by |
| μ |
= |
¼ (µ ĀĀB̄B̄ + µAAB̄B̄+ µĀĀBB + µAABB) |
| aH |
= |
¼ (µAABB - µĀĀB̄
B̄+ µAAB̄B̄- µĀĀBB) |
| aC |
= |
¼ (µĀĀBB - µĀĀB̄B̄ - µAAB̄B̄+ µAABB) |
| dH |
= |
¼ (2µAĀB̄B̄- µĀĀB̄B̄- µAAB̄B̄- µĀĀBB - µAABB + 2µAĀBB) |
| dC |
= |
¼ (2µĀĀBB̄ - µĀĀB̄B̄- µAAB̄B̄- µĀĀBB - µAABB + 2µAABB̄ ) |
| iaa |
= |
¼ (µAABB - µAAB̄
B̄- µĀĀBB + µĀĀB̄B̄
) (13) |
| iad |
= |
¼ (2µAABB̄ - µAABB - 2µĀĀBB̄ + µĀĀB̄
B̄ - µAAB̄B̄ + µĀĀBB) |
| ida |
= |
¼ (2µAĀBB - 2µAĀB̄B̄+ µĀĀB̄B̄+ µAAB̄B̄ - µAĀB̄B - µAABB) |
| idd |
= |
¼ (4µAĀBB̄ + µĀĀB̄B̄ + µAAB̄B̄ + µĀĀBB + µAABB - 2µAĀB̄B̄-2µAĀBB - 2µĀĀBB̄ - 2µAABB̄ ). |
|
Thus, by testing the significance of iaa, iad, ida, and idd, we can judge whether there is epistasis and how the epistasis affects
a phenotypic trait. |
Estimation Procedures |
We will estimate two types of parameters for gene cancer
identification. First is the population genetic parameters
(Ωp) that describe the distribution and diversity of haplotypes
in the sampled population quantified by haplotype frequencies
for multiple SNPs from the host and cancer genomes,
allele frequencies at these SNPs and their linkage disequilibria.
Second is the quantitative genetic parameters (Ωq)
that describe the genotypic values of composite diplotypes,
specified by the action and interaction effects of haplotypes
on cancer susceptibility, and residual variance. Given observed
phenotypic (y) and marker data from the host (H)
and cancer (C) genomes, we construct a likelihood and factorize
it to two components: |
log L(Ωp,Ωq|y,H,C) = log L(Ω
p|H, C) + logL(Ωq|y,H,C,Ωp), (14) |
where the first component is related to haplotype frequencies
and the second component related to haplotype effects
and variance. Thus, maximizing the likelihood (14) is equivalent
to maximizing its two components separately. |
Estimating Across-Genome Haplotype Frequencies:
Because the same genotype may be formed from multiple
diplotypes, we need to incorporate the EM algorithm to estimate
the unknown diplotype of a genotype, which is statistically
viewed as a missing data problem. (Excoffier and
Slatkin 1995) provided a simple approach for estimating
haplotype frequencies, without specifying the configuration of haplotype. We provide a similar estimating algorithm for
specifying the haplotype configuration. An observed zygotic
genotype is generally expressed as H1k1H1k'1/H2k2H2
k'2/ · · · /HRkRHRk'R) (C1l1C1l'1/C2l2C2l'2/· · ·/CSlSCSl'S), where the slashes are used to separate genotypes at different SNPs. Let n(k1k'1/k2k'2/···/kRk'R )(l1l'1 /l2l'2/···/lSl'S ) (which sums to a total sample size of n subjects) be the
observation of a typical joint-SNP genotype from the host
and cancer genomes. Table 3 is an example of data structure
for genotypic observations of two SNPs derived from
the host genome and two SNPs from the cancer genome.
The table also provides the expected frequencies of different
genotypes in terms of haplotype frequencies. |
Based on the information about observed data, it is not
difficult to construct a multinomial likelihood, log LΩp|H,C),
in which a mixture model is incorporated for those genotypes
that are heterozygous at two more SNPs. By maximizing
the observed data likelihood, the EM algorithm is
derived. In the E step, we calculate the expected number of
a particular across-genome haplotype (H1k1H2k2 · · ·HRkR) (C1l1C2l2 · · ·CSlS) within the mixture of diplotypes that form
the same genotypes. For example, such an expected number
is calculated for two SNPs from the host genome and
two SNPs from the cancer genome using the following formulas: |
| φ1 |
= |
p(k1k2)(l1l2)p(k1k2)(l'1l'2)
|
| p(k1k2)(l1l2)p(k1k2)(l'1l'2 ) + p(k1k2)(l1l'2)p(k1k2)(l'1l2) |
| |
|
|
| φ2 |
= |
p(k1k2)(l1l2)p(k1k'2)(l1l'2)
|
| p(k1k2)(l1l2)p(k1k'2)(l1l'2 ) + p(k1k2)(l1l'2)p(k1k'2)(l1l2) |
| |
|
|
| φ3 |
= |
p(k1k2)(l1l2)p(k'1k2)(l1l'2)
|
| p(k1k2)(l1l2)p(k'1k2)(l1l'2 ) + p(k1k2)(l1l'2)p(k'1k2)(l1l2) |
| |
|
|
| φ4 |
= |
p(k1k2)(l1l2)p(k1k'2)(l'1l2)
|
| p(k1k2)(l1l2)p(k1k'2)(l'1l2 ) + p(k1k2)(l'1l2)p(k1k'2)(l1l2) |
| |
|
|
| φ5 |
= |
p(k1k2)(l1l2)p(k'1k2)(l'1l2)
|
| p(k1k2)(l1l2)p(k'1k2)(l'1l2 ) + p(k1k2)(l'1l2)p(k'1k2)(l1l2) |
| |
|
|
| φ6 |
= |
p(k1k2)(l1l2)p(k'1k'2)(l1l2)
|
| p(k1k2)(l1l2)p(k'1k'2)(l1l2 ) + p(k1k'2)(l1l2)p(k'1k2)(l1l2) |
| |
|
|
| ψ1 |
= |
[ p(k1k2)(l1l2)p(k1k'2)(l'1l'2) ] / [ p(k1k2)(l1l2)p(k1k'2)(l'1l'2) + p(k1k2)(l1l'2)p(k1k'2)(l'1l2) +
p(k1k2)(l'1l2)p(k1k'2)(l1l'2) + p(k1k2)(l'1l'2)p(k1k'2)(l1l2) ], |
| |
|
|
| ψ2 |
= |
[ p(k1k2)(l1l2)p(k'1k2)(l'1l'2) ] / [ p(k1k2)(l1l2)p(k'1k2)(l'1l'2) + p(k1k2)(l1l'2)p(k'1k2)(l'1l2) +
p(k1k2)(l'1l2)p(k'1k2)(l1l'2) + p(k1k2)(l'1l'2)p(k'1k2)(l1l2) ], |
| |
|
|
| ψ3 |
= |
[ p(k1k2)(l1l2)p(k'1k'2)(l'1l2) ] / [ p(k1k2)(l1l2)p(k'1k'2)(l1l'2) + p(k1k2)(l1l'2)p(k'1k'2)(l1l2) +
p(k1k2)(l1l'2)p(k'1k'2)(l1l2) + p(k1k'2)(l1l2)p(k'1k2)(l1l'2) ], |
| |
|
|
| ψ4 |
= |
[ p(k1k2)(l1l2)p(k'1k'2)(l'1l2) ] / [ p(k1k2)(l1l2)p(k'1k'2)(l'1l2) + p(k1k2)(l'1l2)p(k'1k'2)(l1l2) +
p(k1k'2)(l'1l2)p(k'1k2)(l1l2) + p(k1k2)(l'1l2)p(k'1k'2)(l1l2) ], |
| |
|
|
| φ |
= |
[ p(k1k2)(l1l2)p(k'1k'2)(l'1l'2) ] / Pφ (15) |
| |
|
|
| where |
| Pφ |
= |
p(k1k2)(l1l2)p(k'1k'2)(l'1l'2) + p(k1k2)(l1l'2)p(k'1k'2)(l'1l2) + p(k1k'2)(l1l2)p(k'1k2)(l'1l'2) +
p(k'1k2)(l1l2)p(k1k'2)(l'1l'2) + p(k1k2)(l'1l'2)p(k'1k'2)(l1l2) + p(k1k'2)(l1l'2)p(k'1k2)(l'1l2) +
p(k'1k2)(l1l'2)p(k1k'2)(l'1l2) + p(k'1k2)(l'1l2)p(k1k'2)(l1l'2) |
| |
|
|
| In the M step, we estimate a haplotype frequency with
the expected number of haplotypes calculated above and
the observations given in Table 3 by |
| |
|
|
|
| p(k1k2)(l1l2) = |
½n [ 2n(k1k1 / k2k2)(l1l1 / l2l2) |
| |
+ n(k1k1 / k2k2)(l1l1 / l2l'2), l'2< l2 |
| |
+ n(k1k1 / k2k2)(l1l'1 / l2l2), l'1< l1 |
| |
+ n(k1k1 / k2k'2)(l1l1 / l2l2), k'2 < k2 |
| |
+ n(k1k'1 / k2k2)(l1l1 / l2l2), k'1 < k1 |
| |
+ φ1 n(k1k1 / k2k2)(l1l'1 / l2l'2), l'1< l1, l'2< l2 |
| |
+ φ2 n(k1k1 / k2k'2)(l1l1 / l2l'2), k'2 < k2 , l'2< l2 |
| |
+ φ3 n(k1k'1 / k2k2)(l1l1 / l2l'2), k'1 < k1 , l'2< l2 |
| |
+ φ4 n(k1k1 / k2k'2)(l1l'1 / l2l2), k'2 < k2 , l'1< l1 |
| |
+ φ5 n(k1k'1 / k2k2)(l1l'1 / l2l2), k'1 < k1 , l'1< l1 |
| |
+ φ6 n(k1k'1 / k2k'2)(l1l1 / l2l2), k'1 < k1 , k'2 < k2 |
| |
+ ψ1 n(k1k1 / k2k'2)(l1l'1 / l2l'2), k'2 < k2, l'1< l1, l'2< l2 |
| |
+ ψ2 n(k1k'1 / k2k2)(l1l'1 / l2l'2), k'1 < k1, l'1< l1, l'2< l2 |
| |
+ ψ3 n(k'1k1 / k'2k'2)(l1l1 / l2l'2), k'1 < k1, k'2 < k2, l'2< l2 |
| |
+ ψ4 n(k'1k1 / k'2k'2)(l1l'1 / l2l2), k'1 < k1, k'2 < k2, l'1< l1 |
| |
+ φ n(k1k'1 / k2k'2)(l1l'1 / l2l'2), k'1 < k1, k'2 < k2, l'1< l1 , l'2< l2 (16) |
| |
|
|
Both the E and M steps are iterated between equations
(15) and (16) until the estimates converge to stable values.
The estimates at convergence are the maximum likelihood
estimates (MLEs) of haplotype frequencies. The MLEs of
allele frequencies at different SNPs and their linkage disequilibria
of different orders can be solved from these estimated
haplotype frequencies using a system of equations
given in Table 1. |
Table 3: Observed 81 joint host-cancer SNP genotypes and their frequencies described in terms of their haplotype/ diplotype compositions.
|
|
Estimating Across-Genome Haplotype Interactions:
To detect how haplotypes or diplotypes are associated with
phenotypic variation in a trait (y ), we will first assume a risk
haplotype from the host genome and a risk haplotype from
the cancer genome. These two types of risk haplotypes will
generate composite diplotypes. As an example shown in
Table 3, in which there are two SNPs from each genome, we assume H11H21 from the host genome and C11C21from the cancer genome as two risk haplotypes. This leads to nine
different across-genome composite diplotypes. A mixture-based likelihood for quantitative genetic parameters (Ωp) is formulated
as |
| |
 |
 |
| In likelihood (17), we model f....(yi) by a normal distribution with diplotype-specific mean µ.... and variance ∂2. The EM
algorithm is implemented to estimate these means and variance that maximize the likelihood. In the E step, we calculate the
posterior probability of a particular diplotype within a genotype for SNPs across the genomes using |
 |
 |
 |
| A loop of the E and M steps is formulated between equations (18) and (19) to obtain the MLEs of the genotypic values and
variance. |
| |
|
For a practical data set, risk haplotypes are unknown. A
combinatory approach is used to detect an optimal combination
of risk haplotypes derived from the host and cancer
genomes. This can be chosen from all 16 possible combinations.
The combination that gives the largest likelihood is
considered as the best risk-haplotype combination. Under
such an optimal combination, we estimate genotypic values
of the composite diplotypes including µAABB, µAABB̄, µAAB̄B̄, µAĀBB, µAĀBB̄, µAĀB̄B̄, µĀĀBB, µĀĀBB̄ and µĀĀB̄B̄. From the
estimatedµ .... values, we can then solve the additive, dominance,
and epistatic effects between haplotypes from the two genomes using a system of equations (13). |
Hypothesis Tests |
| It is imperative to know how different SNPs are associated
within and between the host and cancer genomes and
how haplotypes trigger cancer susceptibility singly or
epistatically across different genomes. Two kinds of major
hypotheses can be made to address these questions. |
Linkage Disequilibrium Tests: The association between
different SNPs within each genome and between two different genomes by testing their linkage disequilibria (LD). For example, the LD between four SNPs from the two genomes
as shown in Table 1 can be tested using the two hypotheses as follows: |
 |
H0: |
DH1H2 = DC1C2 = DH1C1 = DH1C2 = DH2C1 = DH2C2 = DH1H2C1 = DH1H2C2 = DH1C1C2 = DH2C1C2 = DH1H2C1C2 = 0 |
| |
(20) |
| H1: |
At least one of the LD above is not equal to zero. |
|
| |
| The log-likelihood ratio test statistic for the significance of LD is calculated by comparing the likelihood values under the H1 (full model) and H0 (reduced model) using |
 |
| where |
 |
are the MLEs of allele frequencies at four SNPs from the two genomes. |
|
| The LRD calculated
under the 0 and 1 hypotheses is considered to asymptotically follow a c2 distribution with the degrees of freedom (11)
equal to the differences in the number of unknown parameters between the alternative and null hypotheses. |
It is also interesting to test whether the linkage disequilibria of a different particular order are significant. This can be done
by formulating the null hypotheses: |
 |
H0: |
DH1H2 = DC1C2 = DH1C1 = DH1C2 = DH2C1 = DH2C2 = 0 |
| |
(22) |
| H1: |
At least one of the LD above is not equal to zero, |
|
for the digenic LD, |
 |
H0: |
DH1H2C1 = DH1H2C2= DH1C1C2 = DH1C2 = DH2C1 = DH2C2 = 0 |
| |
(23) |
| H1: |
At least one of the LD above is not equal to zero, |
|
for the digenic LD, |
 |
H0: |
DH1H2C1 = DH1H2C2= DH1C1C2 = DH1C2 = DH2C1 = DH2C2 = 0 |
| |
(23) |
| H1: |
At least one of the LD above is not equal to zero, |
|
for the trigenic LD, and |
 |
H0: |
DH1H2C1C2 =0 |
| |
(24) |
| H1: |
DH1H2C1C2 ≠ 0 |
|
for the quadrigenic LD. |
Each LD can also be tested separately. Under the null hypothesis of no LD, haplotype frequencies are estimated with the
same EM algorithm derived to estimate the frequency parameters under the alternative hypothesis, except for the constraint
posed on the relationships of haplotype frequencies under the null hypothesis. Depending on which type of LD is tested, these
constraints can be obtained from equations (1)–(11), respectively. |
Genome-Genome Epistasis Tests: The significance of an assumed risk haplotype for its effect on cancer susceptibility
should be tested by formulating the following hypotheses, expressed as |
 |
H0: |
μAABB = μAABB̄ = μAAB̄B̄ = μAĀBB = μĀABB̄ = μĀAB̄B̄ = μĀĀBB = μĀĀBB̄ = μĀĀB̄B̄ = μ |
| |
(25) |
| H1: |
At least one equality in H0 does not hold |
|
The log-likelihood ratio test statistic (LRE) under these two
hypotheses can be similarly calculated. The LRE may asymptotically
follow a X2 distribution with eight degrees of
freedom. However, the approximation of a X2 distribution
may be inappropriate when some regularity conditions, such
as normality and uncorrelated residuals, are violated. The
permutation test approach (Churchill and Doerge 1994),
which does not rely upon the distribution of the LRE, may
be used to determine the critical threshold for determining
the existence of risk haplotypes. |
Different genetic effects, such as the additive, dominance,
and additive × additive, additive × dominance, dominance ×
additive, and dominance × dominance effects between
haplotypes from the host and cancer genomes can also be
tested individually, with respective null hypotheses formulated
as |
| Ho : |
aH = 0, |
(26) |
| Ho : |
aH = 0, |
(27) |
| Ho : |
dH = 0, |
(28) |
| Ho : |
dH = 0, |
(29) |
| Ho : |
iaa = 0, |
(30) |
| Ho : |
iad = 0, |
(31) |
| Ho : |
ida = 0, |
(32) |
| Ho : |
idd = 0, |
(33) |
|
The parameter estimation under each of these null hypotheses
can be obtained using the same EM algorithm as described
for the alternative hypothesis (full model) of equation
(25), with a constraint derived from a system of equations
(13). The critical thresholds for these individual effects
(26)–(33) can be determined on the basis of simulation
studies. |
Computer Simulation |
| The statistical behavior of the model proposed for cancer
gene identification is investigated through simulation studies.
We simulate a HWE population of cancer individuals in
which a set of SNPs from the host genome are associated
with a different set of SNPs from the cancer genome. The
haplotypes of these SNPs within and across the genomes
trigger main and interaction effects on the susceptibility of
a cancer. The allele frequencies of two of the SNPs from
each genome and their linkage disequilibria of different orders
are given in Table 4. Four sample sizes from modest
(200) to intermediate (400) to large (800) to very large (2000)
are considered. These population genetic parameters that
specify the distribution and diversity of haplotypes can be
well estimated with the model. A modest sample size is adequate
for estimating allele frequencies and digenic linkage
disequilibria. A intermediate to large sample size is needed
to estimate higher-order linkage disequilibria. Especially, to
precisely estimate quadrigenic linkage disequilibrium, a
sample of 2000 is recommended (Table 4). |
Table 4: The MLEs of population genetic parameters for two host SNPs and two cancer SNPs and the standard deviations
of the estimates (in parantheses) in a simulated cancer population of varying sampling sizes. The results were obtained from
200 simulation replicates.
|
|
For the assumed population in which multiple SNPs are
typed, a quantitative trait that describes cancer susceptibility
was simulated, following a normal distribution with means
depending on composite diplotypes of SNPs and a residual
variance. The genotypic values of composite diplotype are
determined by assuming specific values for the additive,
dominance and epistatic effects of haplotypes on the cancer
trait. Composite diplotypes are formed by assuming two risk haplotypes for the SNPs, one from the host genome H11H12
and the second from the cancer genome C11.C12
The genotypic values of a total of 16 composite diplotypes,
along with their probabilities calculated from haplotype frequencies,
are used to compute the genetic variance. The
residual variance is then determined by assuming different
heritability levels (0.1 and 0.4). |
Table 5: Log-likelihood values for different combinations of risk haplotypes from the host and cancer genomes in a simulated
cancer population of 200 subjects with a heritability of 0.1.
|
|
Table 5 gives the log-likelihoods of population and quantitative
genetic parameters by assuming different combinations
of risk haplotypes from the host and cancer genomes
for simulation data with sample size 200 and heritability 0.1.
It can be seen that risk haplotype combination H11H02 C11C02
corresponds to the maximum likelihood among all
possible combinations, suggesting that the model has correctly
selected risk haplotypes. It appears that the power to
correctly select the optimal combination of risk haplotypes
is very high, even when a modest sample size and/or heritability
is assumed (data not shown). The model provides reasonable estimates of quantitative genetic parameters (Table
6). The estimation precision of parameters increases with
sample size and heritability. For the additive genetic effects,
a modest sample size (200) is quite enough even when the
heritability is low (0.1). The good estimation of dominance
genetic effects needs an intermediately large sample size
(400 or more) for a small heritability. Epistasis, especially
the dominance × dominance interaction, can be reasonably
estimated when a large sample size (2000) is used. This is
especially true for a small heritability. |
Table 6: The MLEs of quantitative genetic parameters of haplotypes for SNPs typed from the host and cancer genomes
and the standard deviations of the estimates (in parantheses) in a simulated cancer population of varying sampling sizes and
heritabilities. The results were obtained from 200 simulation replicates.
|
|
Discussion |
| Despite painstaking cumulative efforts to fight against
cancer by researchers worldwide in the past five decades,
we have still not achieved substantial progress in diagnosis,
prevention and treatment of this disease. The latest research,
however, has found a possibility to treat, control and prevent
cancer by using gene therapy. The successful use of
this technique relies on our profound understanding of the
genetic architecture of cancer susceptibility and progression.
When tremendous progress in genotyping and sequencing
the human genome and cancer genome has taken place,
we are now in a great position to study the genetic control
mechanism of cancer. In this article, we have developed a
statistical model for characterizing DNA sequence variants
that encode cancer susceptibility. |
The novelty of the model lies in three aspects. First, we
incorporate the latest discovery of cancer genetics into the
model that gene mutation cause cancer (Jallepalli and
Lengauer 2001; Stock and Bialy 2002; Balman et al. 2003;
Greenman et al. 2007). The model is not only able to characterize
how gene mutation in the cancer genome acts to
regulate cancer, but also can detect the genetic interactions
between the host genes and cancer mutation. The model
allows the test of haplotype distribution and diversity in the
cancer population and patterns of genetic actions and interactions.
Second, the model is integrated with multilocus SNP
data, detecting cancer genes at the DNA sequence level
(Liu et al. 2004; Wu and Lin 2008). This will provide significant
insights into the genetic regulation mechanisms of cancer
and cloning of cancer genes. Third, the model was built
on the interactions of genes between different genomes.
Modeling genome-genome interactions has received an increasing
interest in studying the genetic architecture of seed
development (Cui and Wu 2006) and pathogenesis (Foster et al. 2003; Wang et al. 2005). |
The model was investigated in terms of its statistical behavior
through simulation studies. Different schemes of
simulation that consider varying sample sizes and heritabilities
were used. The estimation precision of parameters and
the power to detect genetic variants for cancer was explored
under different schemes. The results from simulation
provide scientific support for the model to be used for
cancer gene identification in practical data sets. Although
we did not include real data to validate our model, the statistical
design and algorithm proposed in this work will help
cancer geneticists and clinicians launch a novel experiment
to test hypotheses about the genetic control of cancer. |
This article presents a framework for haplotyping cancer
genes, and its extension to including genes × environment
interactions, haplotyping in a case-control study, genetic
imprinting, and an arbitrary number of SNPs will be possible.
As an inherited disease, genetic research of cancer is
beneficial from an informative family-structured design in
which one or both of the parents and offspring are sampled
simultaneously. The general principle of haplotyping genome-
genome interactions can be used for such a family
design, facilitating our understanding of cancer genetics. Also,
cancer can be better viewed as a dynamic trait which undergoes
marked developmental transition. Functional mapping
advocated by our group (Ma et al. 2002; Liu et al.
2005; Wu and Lin 2006) can be implemented into the
haplotyping model to explore the developmental change of
genetic control of cancer in time course. In this article, we
focus on the “gene-mutation hypothesis” of cancer formation
when the epistatic model was derived. Other hypotheses,
such as “aneuploidy hypothesis of cancer” (Stock and Bialy 2002), should also be integrated into the model, to better
understand the genetic mechanisms of cancer formation
and progression. The model that incorporate the aneuploidy
control of cancer will be reported in other articles. |
The preparation of this manuscript is partially supported
by Joint NSF/NIH grant DMS/NIGMS-0540745. |
|
Reference |
- Araujo RP, McElwain DLS (2006) The role of mechanical
host-tumour interactions in the collapse of tumour
blood vessels and tumour growth dynamics. J Theor Biol
238: 817- 827. [ FIND THIS ARTICLE ONLINE ]
- Balman A, Gray J, Ponder B (2003) The genetics and
genomics of cancer. Nature Genetics 33: 238-244. [ FIND THIS ARTICLE ONLINE ]
- Brennan P (2002) Gene-environment interaction and
aetiology of cancer: what does it mean and how can we
measure it. Carcinogenesis 23: 381-387. [ FIND THIS ARTICLE ONLINE ]
- Cui YH, Wu RL (2005) Mapping genome-genome epistasis:
A multi-dimensional model. Bioinformatics 21: 2447-
2455. [ FIND THIS ARTICLE ONLINE ]
- Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, Lander
ES (2001) High-resolution haplotype structure in the
human genome. Nature Genetics 29: 229-232. [ FIND THIS ARTICLE ONLINE ]
- Excoffier L, Slatkin M (1995) Maximum-likelihood estimation
of molecular haplotype frequencies in a diploid
population. Mol Biol Evol 12: 921-927. [ FIND THIS ARTICLE ONLINE ]
- Foster JS, Palmer RJ Jr, Kolenbrander PE (2003) Human
oral cavity as a model for the study of genomegenome
interactions. Biol Bull 204: 200-204. [ FIND THIS ARTICLE ONLINE ]
- Gabriel SB, Schaffner SF, Nguyen H, Moore JM, et al.
(2002) The structure of haplotype blocks in the human
genome. Science 296: 2225-2229. [ FIND THIS ARTICLE ONLINE ]
- Greenman C, Stephens P, Smith R, Dalgliesh GL, Hunter
C, et al. (2007) Patterns of somatic mutation in human
cancer genomes. Nature 446: 153-158. [ FIND THIS ARTICLE ONLINE ]
- Jallepalli PV, Lengauer C (2001) Chromosome segregation
and cancer: Cutting through the mystery. Nat Rev
Cancer 1: 109-117. [ FIND THIS ARTICLE ONLINE ]
- Kaiser J (2005) Tackling the cancer genome. Science 309: 6.
- Kolenbrander PE, Egland PG, Diaz PI, Palmer RJ Jr
(2004) Genome-genome interactions: bacterial communities
in initial dental plaque. Trends Microbiol 13: 11-15. [ FIND THIS ARTICLE ONLINE ]
- Lin M, Wu RL (2006) Detecting sequence-sequence
interactions for complex diseases. Current Genomics 7:
59-72.
- Liu T, Johnson JA, Casella G, Wu RL (2004) Sequencing
complex diseases with HapMap. Genetics 168: 503-
511. [ FIND THIS ARTICLE ONLINE ]
- Liu T, Zhao W, Tian LL, Wu RL (2005) An algorithm for
molecular dissection of tumor progression. Journal of
Mathematical Biology 50: 336-354. [ FIND THIS ARTICLE ONLINE ]
- Ma CX, Casella G, Wu RL (2002) Functional mapping
of quantitative trait loci underlying the character process:
A theoretical framework. Genetics 161: 1751-1762. [ FIND THIS ARTICLE ONLINE ]
- Patil N, Berno AJ, Hinds DA, Barrett WA, et al. (2001)
Blocks of limited haplotype diversity revealed by highresolution
scanning of human chromosome 21. Science
294: 1719-1723. [ FIND THIS ARTICLE ONLINE ]
- Rand V, Prebble E, Ridley L, Howard M, Wei W, et al.
(2008) Investigation of chromosome 1q reveals differential
expression of members of the S100 family in clinical
subgroups of intracranial paediatric ependymoma.
Br J Cancer doi: 10.1038/sj.bjc.6604651.
- Stock RP, Bialy H (2002) The sigmoidal curve of cancer.
Nat Biotech 21: 13-14.
- The International HapMap
Consortium (2003) The International HapMap Project.
Nature 426: 789-794. [ FIND THIS ARTICLE ONLINE ]
- Thompson SL, Compton DA (2008) Examining the link
between chromosomal instability and aneuploidy in human
cells. J Cell Biol 180: 665-672. [ FIND THIS ARTICLE ONLINE ]
- Wang ZH, Hou W, Wu RL (2005) A statistical model to
analyze quantitative trait locus interactions for HIV dynamics
from the virus and human genomes. Stat Med
25: 495-511. [ FIND THIS ARTICLE ONLINE ]
- Wu RL, Lin M (2006) Functional mapping C A new tool
to study the genetic architecture of dynamic complex
trait. Nat Rev Genet 7: 229-237. [ FIND THIS ARTICLE ONLINE ]
|
|
| This Article |
| DOWNLOAD |
|
| CONTRIBUTE |
|
| SHARE |
|
| EXPLORE |
|
| |
|