Omicsonline | Full text | Modeling Host-Cancer Genetic Interactions with Multilocus Sequence Data

Home

Join

Contact

Research Article	Open Access

Modeling Host-Cancer Genetic Interactions with Multilocus Sequence Data

Yao Li^* and Rongling Wu^1*,†

^*Department of Statistics, University of Florida, Gainesville, FL 32611 USA

^†Departments of Public Health Sciences and Statistics, Pennsylvania State University, Hershey, PA 17033

¹Corresponding author:

Dr. Rongling Wu, Department of Statistics, University of Florida, Gainesville, FL 32611 USA,
Email : rwu@hes.hmc.psu.edu

Received January 06, 2009; Accepted February 04, 2009; Published February 05, 2009

Citation: Yao L, Rongling W (2009) Modeling Host-Cancer Genetic Interactions with Multilocus Sequence Data. J Comput
Sci Syst Biol 2: 024-043.

Copyright: © 2009 Yao L, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract

Cancer susceptibility may be controlled not only by host genes and mutated genes in cancer cells, but also by the epistatic interactions between genes from the host and cancer genomes. We derive a novel statistical model for cancer gene identification by integrating the gene mutation hypothesis of cancer formation into the mixturemodel framework. Within this framework, genetic interactions of DNA sequences (or haplotypes) between host and cancer genes responsible for cancer risk are defined in terms of quantitative genetic principle. Our model was founded on a commonly used genetic association design in which a random sample of patients is drawn from a natural human population. Each patient is typed for single nucleotide polymorphisms (SNPs) on normal and cancer cells and measured for cancer susceptibility. The model is formulated within the maximum likelihood context and implemented with the EM algorithm, allowing the estimation of both population and quantitative genetic parameters. The model provides a general procedure for testing the distribution of haplotypes constructed by SNPs from host and cancer genes and the linkage disequilibria of different orders among the SNPs. The model also formulates a series of testable hypotheses about the effects of host genes, cancer genes, and their interactions on cancer susceptibility. We carried out simulation studies to examine the statistical properties of the model. The implications of this model for cancer gene identification are discussed.

Keywords

Cancer Genome; EM Algorithm; Haplotype; Host Genome; Single Nucleotide Polymorphism

Introduction

Since the recognition of cancer as a genetic disease, a number of familial cancer genes with high-penetrance mutations, such as oncogenes and the tumor suppressors, have been chromosomally identified, isolated or cloned (Balman et al. 2003 ; Rand et al. 2008 ). However, growing evidence shows that most cancer is the result of an intricate interaction of lowpenetrance genetic variants with environmental exposures that humans experience (Brennan 2002 ). These low-penetrance cancer genes, each usually with a minor effect and cooperating with others in a complicated web, are difficult to detect and, therefore, their contribution to the risk of cancer development remains unclear. There is a pressing demand on the development of powerful statistical models and computational algorithms for identifying and mapping specific DNA sequence variants that regulate cancer susceptibility.

Human cancer cells frequently possess large-scale chromosomal rearrangements due to chromosomal instability (CIN) (Stock and Bialy 2002 ; Thompson and Compton 2008 ) or gene mutation (Jallepalli and Lengauer 2001 ; Greenman et al. 2007 ). CIN makes whole chromosomes or large fractions of chromosomes gained or lost during cell division, resulting in an imbalance in the number of chromosomes per cell (aneuploidy) and an enhanced rate of loss of heterozygosity. Thus, the “aneuploidy hypothesis of cancer” (Stock and Bialy 2002 ) proposes that the main differences between normal and abnormal (cancer) cells result from the number of genes rather than the types of genes differentially expressed, as opposed to the “gene-mutation hypothesis” (Jallepalli and Lengauer 2001 ). In general, cancer incidence and development are not only affected by the host genes, but also by genes derived from the cancer cells themselves. Given strong mechanistic interactions between the host and cancer tissues (Araujo and McElwain 2006 ), these two different systems of genes operate interactively or epistatically to alter the course of cancer progression. Thus, it can be well anticipated that any statistical model for gene detection that incorporates these two hypotheses are likely to make groundbreaking discoveries of cancer genes.

Genetic mapping has proven to be a powerful approach for detecting quantitative trait loci (QTLs) for complex traits. But a QTL may contain multiple genes that operate in a collective way. It is not possible to study the DNA structure, organization and function of a QTL detected from a mapping approach. A more accurate and useful approach for the characterization of genetic variants contributing to quantitative variation is to directly analyze DNA sequences, known as haplotypes, associated with a particular disease (Liu et al. 2004 ; Lin and Wu 2006 ). If a string of DNA sequence is known to increase disease risk, this risk can be prevented by inhibiting the expression of this string using a specialized drug. The control of this disease can be made more efficient if all possible DNA sequences determining its variation are identified in the entire genome. The elucidation of the entire human genome has been accelerated by the haplotype map, or HapMap, constructed by SNPs (The International HapMap Consortium 2003). More recently, the marvelous plans of sequencing the cancer genomes (Kaiser 2005 ) will provide unprecedented fuel for studying the genetic architecture of cancer risk.

In this article, we will derive a statistical model for detecting the actions and interactions of haplotypes derived from the host and cancer genomes for cancer susceptibility. We will incorporate the “gene-mutation hypothesis” into the model. The “aneuploidy hypothesis of cancer” will be considered in a next paper. Through the release of software to the public, our statistical model will serve as a routine means for the genetic diagnosis of cancer risk. Results from the model will provide scientific guidance for clinical doctors to design an optimal treatment scheme in terms of cancer genes and patient’s genes.

Design

Sampling Strategies

Suppose there is a natural human population at Hardy– Weinberg equilibrium (HWE) from which a random sample is drawn for cancer gene identification. In order to identify DNA sequences responsible for cancer susceptibility, we genotype SNPs from the entire host genome and also SNPs from the cancer genome for the same patient. We assume that the cancer genome is a diploid, whose difference from the normal genome is due to the “genemutation hypothesis”. Recent molecular surveys suggest that the human genome contains many discrete haplotype blocks that are sites of closely located SNPs (Daly et al. 2001 ; Patil et al. 2001 ; Gabriel et al. 2002 ). Each block may have a minimal subset of SNPs, i.e., “tagging” SNPs, that can characterize the most common haplotypes. Our model will be based on tagging SNPs within each haplotype block. Although no detailed information about the structure of the cancer genome is available, we can assume that a particular set of SNPs may contribute to cancer formation at the haplotype level. The tenet of our epistatic model is that the effect of a given DNA sequence in the host genome on cancer is masked or enhanced by one or more sequences in the cancer genome.

Genetic Models

Population Genetic Model: Consider a set of R tagging SNPs from a haplotype block of the host genome and a set of S SNPs from the cancer genome. We denote two alleles of SNP r from the host genome by H_kr^r (k_r = 1 ,0; r= 1 , · · ·R) and two alleles of SNP s from the cancer genome by C_ls^s (l_s = 1, 0; s = 1, · · · S). Let P_kr^H and P_ls^C denote allele frequencies at the corresponding SNP from the host and cancer genomes, respectively. All the SNPs considered from the host and cancer genomes form 2^R+S possible joint haplotypes expressed as (H_k1¹ H_k2²· · ·H_kR^R) (C_l1¹ C_l2²· · ·C_ls^s). The corresponding haplotype frequencies are denoted by P(k₁k₂· · ·k_R) (l₁l₂· · ·l_S), which are composed of allele frequencies at each SNP and linkage disequilibria of different orders among SNPs within and between the genomes (Wu and Lin 2008). A general expression for the relationships between haplotype frequencies and allele frequencies and linkage disequilibria was originally given by Bennett (1954). Table 1 lists the compositions of the frequency of a haplotype constructed jointly by two SNPs from the host genome and two SNPs from the cancer genome in which linkage disequilibria are specified with two, three, and four sites. From these compositions, linkage disequilibria are expressed as

DH₁H₂

= (p(11)(11) + p (11)(10) + p (11)(01) + p (11)(00))( p (00)(11) + p (00)(10) + p (00)(01) + p (00)(00))
- (p (10)(11) + p (10)(10) + p (10)(01) + p (10)(00))( p (01)(11) + p (01)(10) + p (01)(01) + p (01)(00)) (1)

DC₁C₂

= (p (11)(11) + p (10)(11) + p (01)(11) + p (00)(11))( p (11)(00) + p (10)(00) + p (01)(00) + p (00)(00))
- (p (11)(10) + p (10)(10) + p (01)(10) + p (00)(10))( p (11)(01) + p (10)(01) + p (01)(01) + p (00)(01)) (2)

DH₂C₁

= (p (11)(11) + p (11)(10) + p (01)(11) + p (01)(10))( p (10)(01) + p (10)(00) + p (00)(01) + p (00)(00))
- (p (11)(01) + p (11)(00) + p (01)(01) + p (01)(00))( p (10)(11) + p (10)(10) + p (00)(11) + p (00)(10)) (3)

DH₂C₂

= (p (11)(11) + p (11)(01) + p (01)(11) + p (01)(01))( p (10)(10) + p (10)(00) + p (00)(10) + p (00)(00))
- (p (11)(10) + p (11)(00) + p (01)(10) + p (01)(00))( p (10)(11) + p (10)(01) + p (00)(11) + p (00)(01)) (4)

DH₁C₁

= (p (11)(11) + p (11)(10) + p (10)(11) + p (10)(10))( p (01)(01) + p (01)(00) + p (00)(01) + p (00)(00))
- (p (11)(01) + p (11)(00) + p (10)(01) + p (10)(00))( p (01)(11) + p (01)(10) + p (00)(11) + p (00)(10)) (5)

DH₁C₂

= (p(11)(11) + p(11)(01) + p(10)(11) + p(10)(01))(p(11)(10) + p(11)(00) + p(10)(10) + p(10)(00))
- (p(11)(10) + p(11)(00) + p(10)(10) + p(10)(00))(p(01)(11) + p(01)(01) + p(00)(11) + p(00)(01)) (6)

DH₁H₂C₁

= [(p(11)(11) + p(11)(10))(p(10)(01) + p(10)(00)) + (p(01)(01) + p(01)(00))(p(00)(11) + p(00)(10))]
- [(p(00)(01) + p(00)(00))(p(01)(11) + p(01)(10)) + (p(11)(01) + p(11)(00))(p(10)(11) + p(10)(10))] (7)

DH₁H₂C₂

= [(p(11)(11) + p(11)(01))(p(01)(10) + p(01)(00)) + (p(01)(10) + p(01)(00))(p(00)(11) + p(00)(01))]
- [(p(00)(10) + p(00)(00))(p(01)(11) + p(01)(01)) + (p(11)(10) + p(11)(00))(p(10)(11) + p(10)(01))] (8)

DH₁C₁C₂

= [(p(11)(11) + p(10)(11))(p(11)(00) + p(10)(00)) + (p(01)(10) + p(00)(10))(p(01)(01) + p(00)(01))]
- [(p(01)(00) + p(00)(00))(p(01)(11) + p(00)(11)) + (p(11)(10) + p(10)(10))(p(11)(01) + p(10)(01))] (9)

DH₂C₁C₂

= [(p(11)(11) + p(01)(11))(p(11)(00) + p(01)(00)) + (p(10)(10) + p(00)(10))(p(10)(01) + p(00)(01))]
- [(p(10)(00) + p(00)(00))(p(10)(11) + p(00)(11)) + (p(11)(10) + p(01)(10))(p(11)(01) + p(01)(01))] (10)

Table 1: Disequilibrium compositions of four-SNP haplotype frequencies derived from the host and cancer genomes.

The random combination of maternal and paternal haplotypes generates 2^(R+S−1)(2^R+S+1) diplotypes expressed as (H¹_k1H²_k2 · · ·H^R_kR)(C¹_l1C²_l2 · · ·C^S_lS )|(H¹_k01H²_k02 · · ·H^R_k0R)(C¹_l'1C²_l'2 · · ·C^S_l'S)(k₁ ≥ k'₁, k₂ ≥ k'₂ , ..., k_R ≥ k'_R = 1, 0; l₁ ≥ l'₁ , k₂ ≥ l'₂ , ..., l_S ≥ l'_S = 1,0) . We use a vertical line to separate two haplotypes derived from the maternal and paternal parents, respectively, for a given diplotype. Under the HWE assumption, diplotype frequencies are expressed as the products of the frequencies of the two haplotypes that constitute the diplotype, i.e.,

	=	p²( k1k2···kR)(l1l2···lS)	k1 = k01 , k2 = k02 , · · · , kR = k0R ;
p(k₁k₂···k_R)(l₁l₂···l_S)\|(k'₁k'₂ ···k'_R ) (l'₁ l'₂ ···l'_S)			l1 = l'1 , l2 = l'2, · · · , lS = l'S
		2p(k1k2···kR)(l1l2···lS)p(k'1 k'2 ···k'R )(l'1l'2 ···l'S )	Otherwise.

In practice, diplotypes cannot be observed, although observable zygotic genotypes will be the same as diplotypes when at most one SNP is heterozygous. Thus, the numbers of zygotic genotypes, 3^R+S, will be less than the number of diplotypes.

Quantitative Genetic Model: Our model will be derived to characterize haplotypes that are responsible for complex traits because the association between haplotype diversity and phenotypic variation has been detected by several genetic studies (Judson 2000; Bader, 2001; Rha et al. 2007). Among all possible haplotypes are there some particular haplotypes, called the risk haplotype (A), that perform differently than the rest of the haplotypes, called the non-risk haplotype (Ā ). The combinations between the risk and non-risk haplotypes, AA , AĀ, and ĀĀ are called the composite diplotype (Liu et al. 2004 ; Wu and Lin 2008).

Let A,Ā and B, B̄ denote the risk haplotypes and non-risk haplotypes for a series of SNPs genotyped from the host and cancer genomes, respectively. These two genomes form nine different composite diplotypes expressed as AABB, AABB̄, AAB̄B̄, AĀBB, AĀBB̄,AĀB̄B̄, AĀB̄B, ĀĀBB̄, and ĀĀB̄B̄. We will use Mather and Jinks’s (1982) formulation for genetic epistasis between different loci (Table 2 ) to model the genetic effects of the composite diplotypes. The genotypic value (µ_j1j2 ) of a joint composite diplotype from the two genomes can be decomposed into nine different components as follows:

μ_j1j2 =

Overall mean

+(j₁ - 1)a_H + (j₂ - 1)a_C

Additive effects

+j₁d_H + j₂d_C

Dominant effects

+(j₁ - 1)(j₂ - 1)i_aa

Additive × additive effect (12)

+(j₁ - 1)j₂i_ad

Additive × dominance effect

+j₁(j₂ - 1)i_da

Dominance × additive effect

+(1 - j₁)(1 - j₂)i_dd

Dominance × dominance effect,

where

	2	for AA or BB
j₁, j₂ =	1	for AĀ or BB̄
	0	for ĀĀ or B̄B̄

stand for the composite diplotypes from the host and cancer genomes, respectively,µ is the overall mean; a_H and a_C are the additive effects of haplotypes from the host and cancer genome; d_H and d_C, the dominance effects of haplotypes; and i_aa, i_aa, i_aa, and i_aa, the additive × additive, additive × dominance, dominance × additive, and dominance × dominance epistatic effects between the haplotypes from the two different genomes (Table 2 ).

Table 2: Additive, dominance, and epistatic compositions of the genotypic value of a composite diplotype constructed with haplotypes from the host and cancer genomes.

Different types of genetic actions and interactions can be expressed in terms of genotypic values by solving a group of regular equations (12). This lets us describe the overall mean, additive, dominance, and four kinds of epistatic effects between the two genomes by

μ	=	¼ (µ ĀĀB̄B̄ + µAAB̄B̄+ µĀĀBB + µAABB)
a_H	=	¼ (µAABB - µĀĀB̄ B̄+ µAAB̄B̄- µĀĀBB)
a_C	=	¼ (µĀĀBB - µĀĀB̄B̄ - µAAB̄B̄+ µAABB)
d_H	=	¼ (2µAĀB̄B̄- µĀĀB̄B̄- µAAB̄B̄- µĀĀBB - µAABB + 2µAĀBB)
d_C	=	¼ (2µĀĀBB̄ - µĀĀB̄B̄- µAAB̄B̄- µĀĀBB - µAABB + 2µAABB̄ )
i_aa	=	¼ (µAABB - µAAB̄ B̄- µĀĀBB + µĀĀB̄B̄ ) (13)
i_ad	=	¼ (2µAABB̄ - µAABB - 2µĀĀBB̄ + µĀĀB̄ B̄ - µAAB̄B̄ + µĀĀBB)
i_da	=	¼ (2µAĀBB - 2µAĀB̄B̄+ µĀĀB̄B̄+ µAAB̄B̄ - µAĀB̄B - µAABB)
i_dd	=	¼ (4µAĀBB̄ + µĀĀB̄B̄ + µAAB̄B̄ + µĀĀBB + µAABB - 2µAĀB̄B̄-2µAĀBB - 2µĀĀBB̄ - 2µAABB̄ ).

Thus, by testing the significance of i_aa, i_ad, i_da, and i_dd, we can judge whether there is epistasis and how the epistasis affects a phenotypic trait.

Estimation Procedures

We will estimate two types of parameters for gene cancer identification. First is the population genetic parameters (Ω_p) that describe the distribution and diversity of haplotypes in the sampled population quantified by haplotype frequencies for multiple SNPs from the host and cancer genomes, allele frequencies at these SNPs and their linkage disequilibria. Second is the quantitative genetic parameters (Ω_q) that describe the genotypic values of composite diplotypes, specified by the action and interaction effects of haplotypes on cancer susceptibility, and residual variance. Given observed
phenotypic (y) and marker data from the host (H) and cancer (C) genomes, we construct a likelihood and factorize it to two components:

log L(Ω_p,Ω_q|y,H,C) = log L(Ω_p|H, C) + logL(Ω_q|y,H,C,Ωp), (14)

where the first component is related to haplotype frequencies and the second component related to haplotype effects and variance. Thus, maximizing the likelihood (14) is equivalent to maximizing its two components separately.

Estimating Across-Genome Haplotype Frequencies: Because the same genotype may be formed from multiple diplotypes, we need to incorporate the EM algorithm to estimate the unknown diplotype of a genotype, which is statistically viewed as a missing data problem. (Excoffier and Slatkin 1995) provided a simple approach for estimating haplotype frequencies, without specifying the configuration of haplotype. We provide a similar estimating algorithm for specifying the haplotype configuration. An observed zygotic genotype is generally expressed as H¹_k1H¹_k'1/H²_k2H² _k'2/ · · · /H^R_kRH^R_k'R) (C¹_l1C¹_l'1/C²_l2C²_l'2/· · ·/C^S_lSC^S_l'S), where the slashes are used to separate genotypes at different SNPs. Let ⁿ(k₁k'₁/k₂k'₂/···/k_Rk'_R )(l₁l'₁ /l₂l'₂/···/l_Sl'_S ) (which sums to a total sample size of n subjects) be the observation of a typical joint-SNP genotype from the host and cancer genomes. Table 3 is an example of data structure for genotypic observations of two SNPs derived from the host genome and two SNPs from the cancer genome. The table also provides the expected frequencies of different genotypes in terms of haplotype frequencies.

Based on the information about observed data, it is not difficult to construct a multinomial likelihood, log LΩ_p|H,C), in which a mixture model is incorporated for those genotypes that are heterozygous at two more SNPs. By maximizing the observed data likelihood, the EM algorithm is derived. In the E step, we calculate the expected number of a particular across-genome haplotype (H¹_k1H²_k2 · · ·H^R_kR) (C¹_l1C²_l2 · · ·C^S_lS) within the mixture of diplotypes that form the same genotypes. For example, such an expected number is calculated for two SNPs from the host genome and two SNPs from the cancer genome using the following formulas:

φ₁	=	p(k₁k₂)(l₁l₂)p(k₁k₂)(l'₁l'₂)
φ₁	=	p(k₁k₂)(l₁l₂)p(k₁k₂)(l'₁l'₂ ) + p(k₁k₂)(l₁l'₂)p(k₁k₂)(l'₁l₂)

φ₂	=	p(k₁k₂)(l₁l₂)p(k₁k'₂)(l₁l'₂)
φ₂	=	p(k₁k₂)(l₁l₂)p(k₁k'₂)(l₁l'₂ ) + p(k₁k₂)(l₁l'₂)p(k₁k'₂)(l₁l₂)

φ₃	=	p(k₁k₂)(l₁l₂)p(k'₁k₂)(l₁l'₂)
φ₃	=	p(k₁k₂)(l₁l₂)p(k'₁k₂)(l₁l'₂ ) + p(k₁k₂)(l₁l'₂)p(k'₁k₂)(l₁l₂)

φ₄	=	p(k₁k₂)(l₁l₂)p(k₁k'₂)(l'₁l₂)
φ₄	=	p(k₁k₂)(l₁l₂)p(k₁k'₂)(l'₁l₂ ) + p(k₁k₂)(l'₁l₂)p(k₁k'₂)(l₁l₂)

φ₅	=	p(k₁k₂)(l₁l₂)p(k'₁k₂)(l'₁l₂)
φ₅	=	p(k₁k₂)(l₁l₂)p(k'₁k₂)(l'₁l₂ ) + p(k₁k₂)(l'₁l₂)p(k'₁k₂)(l₁l₂)

φ₆	=	p(k₁k₂)(l₁l₂)p(k'₁k'₂)(l₁l₂)
φ₆	=	p(k₁k₂)(l₁l₂)p(k'₁k'₂)(l₁l₂ ) + p(k₁k'₂)(l₁l₂)p(k'₁k₂)(l₁l₂)

ψ₁	=	[ p(k₁k₂)(l₁l₂)p(k₁k'₂)(l'₁l'₂) ] / [ p(k₁k₂)(l₁l₂)p(k₁k'₂)(l'₁l'₂) + p(k₁k₂)(l₁l'₂)p(k₁k'₂)(l'₁l₂) + p(k₁k₂)(l'₁l₂)p(k₁k'₂)(l₁l'₂) + p(k₁k₂)(l'₁l'₂)p(k₁k'₂)(l₁l₂) ],

ψ₂	=	[ p(k₁k₂)(l₁l₂)p(k'₁k₂)(l'₁l'₂) ] / [ p(k₁k₂)(l₁l₂)p(k'₁k₂)(l'₁l'₂) + p(k₁k₂)(l₁l'₂)p(k'₁k₂)(l'₁l₂) + p(k₁k₂)(l'₁l₂)p(k'₁k₂)(l₁l'₂) + p(k₁k₂)(l'₁l'₂)p(k'₁k₂)(l₁l₂) ],

ψ₃	=	[ p(k₁k₂)(l₁l₂)p(k'₁k'₂)(l'₁l₂) ] / [ p(k₁k₂)(l₁l₂)p(k'₁k'₂)(l₁l'₂) + p(k₁k₂)(l₁l'₂)p(k'₁k'₂)(l₁l₂) + p(k₁k₂)(l₁l'₂)p(k'₁k'₂)(l₁l₂) + p(k₁k'₂)(l₁l₂)p(k'₁k₂)(l₁l'₂) ],

ψ₄	=	[ p(k₁k₂)(l₁l₂)p(k'₁k'₂)(l'₁l₂) ] / [ p(k₁k₂)(l₁l₂)p(k'₁k'₂)(l'₁l₂) + p(k₁k₂)(l'₁l₂)p(k'₁k'₂)(l₁l₂) + p(k₁k'₂)(l'₁l₂)p(k'₁k₂)(l₁l₂) + p(k₁k₂)(l'₁l₂)p(k'₁k'₂)(l₁l₂) ],

φ	=	[ p(k₁k₂)(l₁l₂)p(k'₁k'₂)(l'₁l'₂) ] / Pφ (15)

where
Pφ	=	p(k₁k₂)(l₁l₂)p(k'₁k'₂)(l'₁l'₂) + p(k₁k₂)(l₁l'₂)p(k'₁k'₂)(l'₁l₂) + p(k₁k'₂)(l₁l₂)p(k'₁k₂)(l'₁l'₂) + p(k'₁k₂)(l₁l₂)p(k₁k'₂)(l'₁l'₂) + p(k₁k₂)(l'₁l'₂)p(k'₁k'₂)(l₁l₂) + p(k₁k'₂)(l₁l'₂)p(k'₁k₂)(l'₁l₂) + p(k'₁k₂)(l₁l'₂)p(k₁k'₂)(l'₁l₂) + p(k'₁k₂)(l'₁l₂)p(k₁k'₂)(l₁l'₂)

In the M step, we estimate a haplotype frequency with the expected number of haplotypes calculated above and the observations given in Table 3 by

p(k₁k₂)(l₁l₂) =	½n [ ²ⁿ(k₁k₁/ k₂k₂)(l₁l₁ / l₂l₂)
	+ ⁿ(k₁k₁/ k₂k₂)(l₁l₁ / l₂l'₂), l'₂< l₂
	+ ⁿ(k₁k₁/ k₂k₂)(l₁l'₁ / l₂l₂), l'₁< l₁
	+ ⁿ(k₁k₁/ k₂k'₂)(l₁l₁ / l₂l₂), k'₂ < k₂
	+ ⁿ(k₁k'₁/ k₂k₂)(l₁l₁ / l₂l₂), k'₁ < k₁
	+ φ₁ ⁿ(k₁k₁/ k₂k₂)(l₁l'₁ / l₂l'₂), l'₁< l₁, l'₂< l₂
	+ φ₂ ⁿ(k₁k₁/ k₂k'₂)(l₁l₁ / l₂l'₂), k'₂ < k₂ , l'₂< l₂
	+ φ₃ ⁿ(k₁k'₁/ k₂k₂)(l₁l₁ / l₂l'₂), k'₁ < k₁ , l'₂< l₂
	+ φ₄ ⁿ(k₁k₁/ k₂k'₂)(l₁l'₁ / l₂l₂), k'₂ < k₂ , l'₁< l₁
	+ φ₅ ⁿ(k₁k'₁/ k₂k₂)(l₁l'₁ / l₂l₂), k'₁ < k₁ , l'₁< l₁
	+ φ₆ ⁿ(k₁k'₁/ k₂k'₂)(l₁l₁ / l₂l₂), k'₁ < k₁ , k'₂ < k₂
	+ ψ₁ ⁿ(k₁k₁/ k₂k'₂)(l₁l'₁ / l₂l'₂), k'₂ < k₂, l'₁< l₁, l'₂< l₂
	+ ψ₂ ⁿ(k₁k'₁/ k₂k₂)(l₁l'₁ / l₂l'₂), k'₁ < k₁, l'₁< l₁, l'₂< l₂
	+ ψ₃ ⁿ(k'₁k₁/ k'₂k'₂)(l₁l₁ / l₂l'₂), k'₁ < k₁, k'₂ < k₂, l'₂< l₂
	+ ψ₄ ⁿ(k'₁k₁/ k'₂k'₂)(l₁l'₁ / l₂l₂), k'₁ < k₁, k'₂ < k₂, l'₁< l₁
	+ φ ⁿ(k₁k'₁/ k₂k'₂)(l₁l'₁ / l₂l'₂), k'₁ < k₁, k'₂ < k₂, l'₁< l₁ , l'₂< l₂ (16)

Both the E and M steps are iterated between equations (15) and (16) until the estimates converge to stable values. The estimates at convergence are the maximum likelihood estimates (MLEs) of haplotype frequencies. The MLEs of allele frequencies at different SNPs and their linkage disequilibria
of different orders can be solved from these estimated haplotype frequencies using a system of equations given in Table 1 .

Table 3: Observed 81 joint host-cancer SNP genotypes and their frequencies described in terms of their haplotype/ diplotype compositions.

Estimating Across-Genome Haplotype Interactions: To detect how haplotypes or diplotypes are associated with phenotypic variation in a trait (y ), we will first assume a risk haplotype from the host genome and a risk haplotype from the cancer genome. These two types of risk haplotypes will generate composite diplotypes. As an example shown in Table 3 , in which there are two SNPs from each genome, we assume H¹₁H²₁ from the host genome and C¹₁C²₁from the cancer genome as two risk haplotypes. This leads to nine different across-genome composite diplotypes. A mixture-based likelihood for quantitative genetic parameters (Ω_p) is formulated as

In likelihood (17), we model f....(y_i) by a normal distribution with diplotype-specific mean µ.... and variance ∂². The EM algorithm is implemented to estimate these means and variance that maximize the likelihood. In the E step, we calculate the posterior probability of a particular diplotype within a genotype for SNPs across the genomes using

A loop of the E and M steps is formulated between equations (18) and (19) to obtain the MLEs of the genotypic values and variance.

For a practical data set, risk haplotypes are unknown. A combinatory approach is used to detect an optimal combination of risk haplotypes derived from the host and cancer genomes. This can be chosen from all 16 possible combinations. The combination that gives the largest likelihood is considered as the best risk-haplotype combination. Under such an optimal combination, we estimate genotypic values of the composite diplotypes including µAABB, µAABB̄, µAAB̄B̄, µAĀBB, µAĀBB̄, µAĀB̄B̄, µĀĀBB, µĀĀBB̄ and µĀĀB̄B̄. From the estimatedµ .... values, we can then solve the additive, dominance, and epistatic effects between haplotypes from the two genomes using a system of equations (13).

Hypothesis Tests

It is imperative to know how different SNPs are associated within and between the host and cancer genomes and how haplotypes trigger cancer susceptibility singly or epistatically across different genomes. Two kinds of major hypotheses can be made to address these questions.

Linkage Disequilibrium Tests: The association between different SNPs within each genome and between two different genomes by testing their linkage disequilibria (LD). For example, the LD between four SNPs from the two genomes as shown in Table 1 can be tested using the two hypotheses as follows:

	H₀:	D_H1H2 = D_C1C2 = D_H1C1 = D_H1C2 = D_H2C1 = D_H2C2 = D_H1H2C1 = D_H1H2C2 = D_H1C1C2 = D_H2C1C2 = D_H1H2C1C2 = 0
		(20)
	H₁:	At least one of the LD above is not equal to zero.

The log-likelihood ratio test statistic for the significance of LD is calculated by comparing the likelihood values under the H₁ (full model) and H₀ (reduced model) using

where

are the MLEs of allele frequencies at four SNPs from the two genomes.

The LRD calculated under the 0 and 1 hypotheses is considered to asymptotically follow a c2 distribution with the degrees of freedom (11) equal to the differences in the number of unknown parameters between the alternative and null hypotheses.

It is also interesting to test whether the linkage disequilibria of a different particular order are significant. This can be done by formulating the null hypotheses:

	H₀:	D_H1H2 = D_C1C2 = D_H1C1 = D_H1C2 = D_H2C1 = D_H2C2 = 0
		(22)
	H₁:	At least one of the LD above is not equal to zero,

for the digenic LD,

	H₀:	D_H1H2C1 = D_H1H2C2= D_H1C1C2 = D_H1C2 = D_H2C1 = D_H2C2 = 0
		(23)
	H₁:	At least one of the LD above is not equal to zero,

for the digenic LD,

	H₀:	D_H1H2C1 = D_H1H2C2= D_H1C1C2 = D_H1C2 = D_H2C1 = D_H2C2 = 0
		(23)
	H₁:	At least one of the LD above is not equal to zero,

for the trigenic LD, and

	H₀:	D_H1H2C1C2 =0
		(24)
	H₁:	D_H1H2C1C2 ≠ 0

for the quadrigenic LD.

Each LD can also be tested separately. Under the null hypothesis of no LD, haplotype frequencies are estimated with the same EM algorithm derived to estimate the frequency parameters under the alternative hypothesis, except for the constraint posed on the relationships of haplotype frequencies under the null hypothesis. Depending on which type of LD is tested, these constraints can be obtained from equations (1)–(11), respectively.

Genome-Genome Epistasis Tests: The significance of an assumed risk haplotype for its effect on cancer susceptibility should be tested by formulating the following hypotheses, expressed as

	H₀:	μAABB = μAABB̄ = μAAB̄B̄ = μAĀBB = μĀABB̄ = μĀAB̄B̄ = μĀĀBB = μĀĀBB̄ = μĀĀB̄B̄ = μ
		(25)
	H₁:	At least one equality in H₀ does not hold

The log-likelihood ratio test statistic (LR_E) under these two hypotheses can be similarly calculated. The LR_E may asymptotically follow a X² distribution with eight degrees of freedom. However, the approximation of a X² distribution may be inappropriate when some regularity conditions, such as normality and uncorrelated residuals, are violated. The permutation test approach (Churchill and Doerge 1994), which does not rely upon the distribution of the LR_E, may be used to determine the critical threshold for determining the existence of risk haplotypes.

Different genetic effects, such as the additive, dominance, and additive × additive, additive × dominance, dominance × additive, and dominance × dominance effects between haplotypes from the host and cancer genomes can also be tested individually, with respective null hypotheses formulated as

H_o :	a_H = 0,	(26)
H_o :	a_H = 0,	(27)
H_o :	d_H = 0,	(28)
H_o :	d_H = 0,	(29)
H_o :	i_aa = 0,	(30)
H_o :	i_ad = 0,	(31)
H_o :	i_da = 0,	(32)
H_o :	i_dd = 0,	(33)

The parameter estimation under each of these null hypotheses can be obtained using the same EM algorithm as described for the alternative hypothesis (full model) of equation (25), with a constraint derived from a system of equations (13). The critical thresholds for these individual effects (26)–(33) can be determined on the basis of simulation studies.

Computer Simulation

The statistical behavior of the model proposed for cancer gene identification is investigated through simulation studies. We simulate a HWE population of cancer individuals in which a set of SNPs from the host genome are associated with a different set of SNPs from the cancer genome. The haplotypes of these SNPs within and across the genomes trigger main and interaction effects on the susceptibility of a cancer. The allele frequencies of two of the SNPs from each genome and their linkage disequilibria of different orders are given in Table 4 . Four sample sizes from modest (200) to intermediate (400) to large (800) to very large (2000) are considered. These population genetic parameters that specify the distribution and diversity of haplotypes can be well estimated with the model. A modest sample size is adequate for estimating allele frequencies and digenic linkage disequilibria. A intermediate to large sample size is needed to estimate higher-order linkage disequilibria. Especially, to precisely estimate quadrigenic linkage disequilibrium, a sample of 2000 is recommended (Table 4 ).

Table 4: The MLEs of population genetic parameters for two host SNPs and two cancer SNPs and the standard deviations
of the estimates (in parantheses) in a simulated cancer population of varying sampling sizes. The results were obtained from 200 simulation replicates.

For the assumed population in which multiple SNPs are typed, a quantitative trait that describes cancer susceptibility was simulated, following a normal distribution with means depending on composite diplotypes of SNPs and a residual variance. The genotypic values of composite diplotype are determined by assuming specific values for the additive, dominance and epistatic effects of haplotypes on the cancer trait. Composite diplotypes are formed by assuming two risk haplotypes for the SNPs, one from the host genome H₁¹H₁² and the second from the cancer genome C₁¹.C₁² The genotypic values of a total of 16 composite diplotypes, along with their probabilities calculated from haplotype frequencies, are used to compute the genetic variance. The residual variance is then determined by assuming different heritability levels (0.1 and 0.4).

Table 5: Log-likelihood values for different combinations of risk haplotypes from the host and cancer genomes in a simulated cancer population of 200 subjects with a heritability of 0.1.

Table 5 gives the log-likelihoods of population and quantitative genetic parameters by assuming different combinations of risk haplotypes from the host and cancer genomes for simulation data with sample size 200 and heritability 0.1. It can be seen that risk haplotype combination H₁¹H₀² C₁¹C₀² corresponds to the maximum likelihood among all possible combinations, suggesting that the model has correctly selected risk haplotypes. It appears that the power to correctly select the optimal combination of risk haplotypes is very high, even when a modest sample size and/or heritability is assumed (data not shown). The model provides reasonable estimates of quantitative genetic parameters (Table 6 ). The estimation precision of parameters increases with sample size and heritability. For the additive genetic effects, a modest sample size (200) is quite enough even when the heritability is low (0.1). The good estimation of dominance genetic effects needs an intermediately large sample size (400 or more) for a small heritability. Epistasis, especially the dominance × dominance interaction, can be reasonably estimated when a large sample size (2000) is used. This is especially true for a small heritability.

Table 6: The MLEs of quantitative genetic parameters of haplotypes for SNPs typed from the host and cancer genomes
and the standard deviations of the estimates (in parantheses) in a simulated cancer population of varying sampling sizes and heritabilities. The results were obtained from 200 simulation replicates.

Discussion

Despite painstaking cumulative efforts to fight against cancer by researchers worldwide in the past five decades, we have still not achieved substantial progress in diagnosis, prevention and treatment of this disease. The latest research, however, has found a possibility to treat, control and prevent cancer by using gene therapy. The successful use of this technique relies on our profound understanding of the genetic architecture of cancer susceptibility and progression. When tremendous progress in genotyping and sequencing the human genome and cancer genome has taken place, we are now in a great position to study the genetic control mechanism of cancer. In this article, we have developed a statistical model for characterizing DNA sequence variants that encode cancer susceptibility.

The novelty of the model lies in three aspects. First, we incorporate the latest discovery of cancer genetics into the model that gene mutation cause cancer (Jallepalli and Lengauer 2001 ; Stock and Bialy 2002 ; Balman et al. 2003 ; Greenman et al. 2007 ). The model is not only able to characterize how gene mutation in the cancer genome acts to regulate cancer, but also can detect the genetic interactions between the host genes and cancer mutation. The model allows the test of haplotype distribution and diversity in the cancer population and patterns of genetic actions and interactions. Second, the model is integrated with multilocus SNP data, detecting cancer genes at the DNA sequence level (Liu et al. 2004 ; Wu and Lin 2008). This will provide significant insights into the genetic regulation mechanisms of cancer and cloning of cancer genes. Third, the model was built on the interactions of genes between different genomes. Modeling genome-genome interactions has received an increasing interest in studying the genetic architecture of seed development (Cui and Wu 2006 ) and pathogenesis (Foster et al. 2003 ; Wang et al. 2005 ).

The model was investigated in terms of its statistical behavior through simulation studies. Different schemes of simulation that consider varying sample sizes and heritabilities were used. The estimation precision of parameters and the power to detect genetic variants for cancer was explored under different schemes. The results from simulation provide scientific support for the model to be used for cancer gene identification in practical data sets. Although we did not include real data to validate our model, the statistical design and algorithm proposed in this work will help cancer geneticists and clinicians launch a novel experiment to test hypotheses about the genetic control of cancer.

This article presents a framework for haplotyping cancer genes, and its extension to including genes × environment interactions, haplotyping in a case-control study, genetic imprinting, and an arbitrary number of SNPs will be possible. As an inherited disease, genetic research of cancer is beneficial from an informative family-structured design in which one or both of the parents and offspring are sampled simultaneously. The general principle of haplotyping genome- genome interactions can be used for such a family design, facilitating our understanding of cancer genetics. Also, cancer can be better viewed as a dynamic trait which undergoes marked developmental transition. Functional mapping advocated by our group (Ma et al. 2002 ; Liu et al. 2005 ; Wu and Lin 2006 ) can be implemented into the haplotyping model to explore the developmental change of genetic control of cancer in time course. In this article, we focus on the “gene-mutation hypothesis” of cancer formation when the epistatic model was derived. Other hypotheses, such as “aneuploidy hypothesis of cancer” (Stock and Bialy 2002 ), should also be integrated into the model, to better understand the genetic mechanisms of cancer formation and progression. The model that incorporate the aneuploidy control of cancer will be reported in other articles.

The preparation of this manuscript is partially supported by Joint NSF/NIH grant DMS/NIGMS-0540745.

Reference

Araujo RP, McElwain DLS (2006) The role of mechanical host-tumour interactions in the collapse of tumour blood vessels and tumour growth dynamics. J Theor Biol 238: 817- 827. [ FIND THIS ARTICLE ONLINE ]
Balman A, Gray J, Ponder B (2003) The genetics and genomics of cancer. Nature Genetics 33: 238-244. [ FIND THIS ARTICLE ONLINE ]
Brennan P (2002) Gene-environment interaction and aetiology of cancer: what does it mean and how can we measure it. Carcinogenesis 23: 381-387. [ FIND THIS ARTICLE ONLINE ]
Cui YH, Wu RL (2005) Mapping genome-genome epistasis: A multi-dimensional model. Bioinformatics 21: 2447- 2455. [ FIND THIS ARTICLE ONLINE ]
Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, Lander ES (2001) High-resolution haplotype structure in the human genome. Nature Genetics 29: 229-232. [ FIND THIS ARTICLE ONLINE ]
Excoffier L, Slatkin M (1995) Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol 12: 921-927. [ FIND THIS ARTICLE ONLINE ]
Foster JS, Palmer RJ Jr, Kolenbrander PE (2003) Human oral cavity as a model for the study of genomegenome interactions. Biol Bull 204: 200-204. [ FIND THIS ARTICLE ONLINE ]
Gabriel SB, Schaffner SF, Nguyen H, Moore JM, et al. (2002) The structure of haplotype blocks in the human genome. Science 296: 2225-2229. [ FIND THIS ARTICLE ONLINE ]
Greenman C, Stephens P, Smith R, Dalgliesh GL, Hunter C, et al. (2007) Patterns of somatic mutation in human cancer genomes. Nature 446: 153-158. [ FIND THIS ARTICLE ONLINE ]
Jallepalli PV, Lengauer C (2001) Chromosome segregation and cancer: Cutting through the mystery. Nat Rev Cancer 1: 109-117. [ FIND THIS ARTICLE ONLINE ]
Kaiser J (2005) Tackling the cancer genome. Science 309: 6.
Kolenbrander PE, Egland PG, Diaz PI, Palmer RJ Jr (2004) Genome-genome interactions: bacterial communities in initial dental plaque. Trends Microbiol 13: 11-15. [ FIND THIS ARTICLE ONLINE ]
Lin M, Wu RL (2006) Detecting sequence-sequence interactions for complex diseases. Current Genomics 7: 59-72.
Liu T, Johnson JA, Casella G, Wu RL (2004) Sequencing complex diseases with HapMap. Genetics 168: 503- 511. [ FIND THIS ARTICLE ONLINE ]
Liu T, Zhao W, Tian LL, Wu RL (2005) An algorithm for molecular dissection of tumor progression. Journal of Mathematical Biology 50: 336-354. [ FIND THIS ARTICLE ONLINE ]
Ma CX, Casella G, Wu RL (2002) Functional mapping of quantitative trait loci underlying the character process: A theoretical framework. Genetics 161: 1751-1762. [ FIND THIS ARTICLE ONLINE ]
Patil N, Berno AJ, Hinds DA, Barrett WA, et al. (2001) Blocks of limited haplotype diversity revealed by highresolution scanning of human chromosome 21. Science 294: 1719-1723. [ FIND THIS ARTICLE ONLINE ]
Rand V, Prebble E, Ridley L, Howard M, Wei W, et al. (2008) Investigation of chromosome 1q reveals differential expression of members of the S100 family in clinical subgroups of intracranial paediatric ependymoma. Br J Cancer doi: 10.1038/sj.bjc.6604651.
Stock RP, Bialy H (2002) The sigmoidal curve of cancer. Nat Biotech 21: 13-14.
The International HapMap Consortium (2003) The International HapMap Project. Nature 426: 789-794. [ FIND THIS ARTICLE ONLINE ]
Thompson SL, Compton DA (2008) Examining the link between chromosomal instability and aneuploidy in human cells. J Cell Biol 180: 665-672. [ FIND THIS ARTICLE ONLINE ]
Wang ZH, Hou W, Wu RL (2005) A statistical model to analyze quantitative trait locus interactions for HIV dynamics from the virus and human genomes. Stat Med 25: 495-511. [ FIND THIS ARTICLE ONLINE ]
Wu RL, Lin M (2006) Functional mapping C A new tool to study the genetic architecture of dynamic complex trait. Nat Rev Genet 7: 229-237. [ FIND THIS ARTICLE ONLINE ]

This Article

DOWNLOAD

» XML
» PDF (1, 939 KB)
» Citation

CONTRIBUTE

» Write a Response
» Read other Responses
» Publishing with OMICS
Publishing Group

» E-mail This Article
» Print This Article
» Rights and Permissions

EXPLORE

Related Article at

» Pubmed
» DOAJ
» Scholar Google