Research Article |
Open Access |
|
|
L1 Least Square for Cancer Diagnosis
using Gene Expression Data |
Xiyi Hang 1 and Fang-Xiang Wu 2,3,* |
1Department of Electrical and Computer Engineering,
California State University, Northridge, CA 91330, USA |
2Department of Mechanical Engineering |
3Divsion of Biomedical Engineering, University of Saskatchewan,
Saskatoon, Saskatchewan, S7N 5A9, Canada |
| *Corresponding author: |
Dr. Fang-Xiang Wu,
Divsion of Biomedical Engineering
University of Saskatchewan,
Saskatoon, Saskatchewan, S7N 5A9,
Canada,
E-mail : xhang@csun.edu, faw341@mail.usask.ca |
|
| Received March 19, 2009; Accepted April 27, 2009; Published April 27, 2009 |
|
Citation: Hang X, Wu FX (2009) L1 Least Square for Cancer Diagnosis using Gene Expression Data. J Comput Sci Syst
Biol 2: 167-173. doi:10.4172/jcsb.1000028 |
| |
Copyright: © 2009 Hang X, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author
and source are credited. |
| |
|
The performance of most methods for cancer diagnosis using gene expression data greatly depends on careful
model selection. Least square for classification has no need of model selection. However, a major drawback prevents
it from successful application in microarray data classification: lack of robustness to outliers. In this paper we cast
linear regression as a constrained l1-norm minimization problem to greatly alleviate its sensitivity to outliers, and
hence the name l1 least square. The numerical experiment shows that l1 least square can match the best performance
achieved by support vector machines (SVMs) with careful model selection. |
Keywords |
| l1-norm minimization; least square regression; classification; cancer; gene expression data; support vector machine |
Introduction |
DNA microarray technique has the potential to provide a
more accurate and objective cancer diagnosis than traditional
histopathological approach with its high throughput
capability of simultaneously measuring relative expression
level of tens of thousands of genes. The success, however,
greatly depends upon the supervised learning algorithm
selected to classify gene expression data. |
Many well-established methods are available for gene
expression profile classification. According to Lee et al
(2005), they can be classified into four categories: (1) classical
methods, such as Fisher’s linear discriminant analysis,
logistic regression, K-nearest neighbor, and generalized
partial least square, (2) classification trees and aggregation
methods, such as CART, random forest, bagging and boosting,
(3) machine learning methods, such as neural network and support vector machines (SVMs), and (4) generalized
methods, such as flexible discriminant analysis, mixture
discriminant analysis, and shrunken centroid method. The
performance of many methods, however, relies upon careful
choice of model parameters, which can be done via model
selection procedure such as cross validation. For example,
the model parameters for SVMs include kernel parameters
and the penalty parameter C. A recent controversy regarding
the performance comparison between SVM and random
forest just exemplifies the importance of model selection.
The study by Diaz-Uriarte et al. (2006) concludes that
random forest outperforms SVM, and the conclusion in
paper (Stanikov et al, 2008) is totally opposite. The main
difference between these two studies is that model selection
is carefully designed in the latter study but not in
the former study. The incident also shows that model selection may be the obstacle of the extensive application
of SVM in classification of gene expression
profile. Since classification performance is a
nonconvex function of model parameters, it is usually
difficult to find optimal model parameters by model
selection. |
Least square for classification, on the other hand, has no
need of model selection. Consider a general classification
problem with N classes. A linear model is built for each
class k |
| Yk = WTk X + Wk0, k = 1,2,..., N. (1) |
The N equations can be grouped into |
| y=Wx˜ (2) |
where y = [y1, y2,...,yN]T, W is a matrix whose kth row is [ wTk, wk0 ], and x˜ =[xT,1]T. For a training dataset {(xi , ti ),i= 1, 2,....,n}, where ti is 1-of-N binary coding vector
of the label of the ith feature xi , i.e., a vector containing
zeros everywhere except 1 in the kth position, if xi belongs
to category k. Denote by X the feature matrix whose kth row is [xTk ,1]
, and T the target matrix whose kth row is tTi.The linear regression model in (2) can be fitted simultaneously
to each of columns of T, and the solution is in the
form |
| Ŵ = (XTX)-1XTT (3) |
• Calculate the fitted output ŷ = Ŵ[xT,1]T(an N-dimensional
vector); |
• Label = argmaxk ŷ(k), k =
1,2,.. .,N. More details can be found in literature (Bishop, 2006; Hastie
et al., 2001). |
The above approach, however, is very sensitive to outliers,
especially for multicategory classification (N ≥ 3).
Furthermore, when least square for classification is
applied to gene expression data, problems can become more
severe due to the curse of dimensionality caused by the great
number of genes in each sample. |
Inspired by the recent progress in sparse signal recovery
via l1 – norm minimization (Candès et al., 2006, Candès
and Tao, 2006; Donoho, 2006), we propose a new approach
to overcome the major drawback of least square for classification
by casting the linear regression problem as a constrained
l1 – norm minimization problem. The obtained
sparse solution is much less sensitive to both outliers and
curse of dimensionality. In addition, multicategory classification is realized via one-versus-rest (OVR) and one-versus-
one (OVO) approaches which decompose the original
multi-category problem into a series of binary problems.
The new method is validated by comparing caner diagnosis
performance with SVMs. |
Methods |
Binary l1 Least Square |
Consider a training dataset {(xi, yi);i=1,...,n},
xi€ Rd, yi €{-1, +1}, where xi represents the ith sample, a d-dimensional column vector containing gene expression
values with d as the number of genes, and yi is the label of
the ith sample. Two classes are described by a liner model |
| y = [xT,1]w (4) |
for any sample x. Applying the linear model to the training
dataset, we have |
| yi = [xT,1]w, i = 1,2,...,n (5) |
The n equations can be grouped into |
| y = Xw (6) |
where y = [y1, y2,...,yn]T, and X is an n × (d +1) matrix whose ith row is [xiT,1]T. Since the number of samples are much
smaller than the number of genes, i.e., n << d, the system in
(6) is underdetermined. The solution is obtained by casting
the original problem as the following constrained l1-norm
minimization problem |
| min ||w||1 subject to Xw = y (7) |
The above formulation is inspired by the recent progress
in compressed sensing (Candès et al., 2006; Candès and
Tao, 2006; Donoho, 2006) and basis pursuit denoising (Chen
et al., 2005). |
There are quite a few solvers available for solving the
optimization problem defined in (7), such as MOSEK
(Andersen, 2002) PDCO-CHOL (Saunders, 2002), PDCOLSQR
(Saunders, 2002), and l1-magic (Candès and Romberg,
2006), which all belong to interior-point methods. In
this study we choose a solver called SPGL1 (Friedlander
and Van den Berg, 2008) for its efficiency in solving largescale
problems. Unlike other methods, SPGL1 solves the
optimization problem by converting it into a root finding
problem. Please refer to paper (Van den Berg and Friedlander,
2008) for details on the theory of SPGL1. |
Denote by ŵ the solution to (7). Then for any sample x,
the label can be simply assigned as sign([xT ,1]ŵ). |
Multicategory l1 Least Square: OVR |
| Consider a multicategory training dataset {(xi, yi); i=1,...,n}, xi € Rd, yi € {1,2,...n}, where N is the
category number. OVR approach needs to determine for
each class a binary classifier to separate it from the remaining
classes. The N linear models are defined as |
| Dk(x) = [ XT ,1] wk, k = 1,2,...N (8) |
For category k, after changing the labels of those samples
belonging to k to +1, and others to -1, we apply the linear
model to the training dataset |
| yk = Xwk , k = 1, 2,....,N, (9) |
where yk is a label vector containing either +1 or -1. Similarly,
the above N underdetermined systems can be solved
by the following N constrained l1-norm minimization problems |
| min || wk || 1 subject to Xwk = yk (10) |
| where k = 1,2,....N. |
Denote by kwk̂ the solution to (10). Then for any sample
x, the label can be determined by |
arg maxk=1,2,...N Dk (x) = [ X T, 1] Ŵk. (11) |
Multicategory l1 Least Square: OVO |
| In OVO approach, a binary classifier is constructed for
each pair of classes. The linear model for class i against
class j is given by |
Di,j(x) = [XT,1]wi,j (12) |
For those samples of category i and j, changing their labels
to +1 and -1, applying the linear model gives rise to |
| yi,j = Xi,j Wi,j (13) |
where yi,j is a vector containing either +1 or -1, and Xi,j is a
matrix whose kth row is [XT,1]T with xk belonging to either
category i or j. The underdetermined system is solved by |
| min || Wi,j ||1 subject to Xi,j Wi,j = yi,j (14) |
Since Dj,i = D i,j, the number of the classifiers is (2N) ,i.e., N(N-1)/2, compared to N in OVR approach. |
Denote by Ŵi,j the solution to (14). For any sample x, we
calculate |
| Di (x) = ∑Nj=1,j≠1 sign(Di,j(x)) (15) |
with Di,j (x) = ∑Nj=1,j≠1 Xi,jŴij. The label of x is determined by |
arg max1,2,...,N Di(x) (16) |
Numerical Experiment |
| Numerical experiment is carefully designed to validate
the cancer diagnosis performance of l1 least square using
gene expression data. The performance metric is classification
accuracy obtained by 10-fold stratified cross validation.
MATLAB R14 is used to implement the new method.
The results are compared with binary SVM (Vapnik, 1998)
and some popular variants of multicategory SVMs including
OVR-SVM (Kressel, 1999), OVO-SVM (Kressel,
1999), DAGSVM (Platt et al., 2000), method by Weston
and Watkins (WW) (Weston and Watkins, 1999), and
method by Crammer and Singer (CS) (2000). |
The results of SVMS are obtained from GEMS (Gene
Expression Model Selector), which is software with graphic
user interface for classification of gene expression data. It
is freely available at http://www.gems-system.org/. GEMS
is used by Stanikov et al.(2005) for the comprehensive study
of the performance of multiple classifiers on gene expression
cancer diagnosis. As for model selection, polynomial
kernels are used with orders p = {1,2, 3}, and the penalty
parameter C
= {10-3+0.5n, n = 0, 1, …, 6}. |
Six datasets are used in the experiment, which are among
eleven datasets used in reference (Stanikov et al., 2005).
They are available on the website of GEMS in the format
of both GEMS and MATLAB mat file. For easy comparison
and reference, we adopt the names used in reference
(Stanikov et al., 2005). The information about the six
datasets is summarized below. |
| • |
DLBCL (Shipp et al., 2002): The binary dataset comes
from a study of gene expression of two lymphomas: diffuse
large B-cell lymphomas and follicular lymphomas.
Each sample contains 5469 genes. The sample number
is 77. |
| • |
Prostate_Tumor (Singh et al., 2002): The binary dataset
contains gene expression data of prostate tumor and
normal tissues. There are 10509 genes in each sample,
and 102 samples. |
| • |
9_Tumors (Staunton et al., 2001): The dataset comes
from a study of 9 human tumor types: NSCLC, colon,
breast, ovary, leukaemia, renal, Melanoma, prostate, and
CNS. There are 60 samples, each of which contains 5726
genes. |
| • |
11_Tumors (Su et al., 2001): The dataset includes 174 samples of gene expression data of 11 various human
tumor types: ovary, bladder/ureter, breast, colorectal,
gastroesophagus, kidney, liver, prostate, pancreas, lung
adeno, and lung squamous. The number of genes is
12533. |
| • |
Brain_Tumor1 (Pomeroy et al., 2002): The dataset comes
from a s study of 5 human brain tumor types: medulloblastoma,
malignant glioma, AT/RT, normal cerebellum,
and PNET, including 90 samples. Each sample has 5920
genes. |
| • |
Brain_Tumor2 (Nutt et al., 2003): There are 4 types of
malignant glioma in this dataset: classic glioblastomas,
classic anaplastic oligoden-drogliomas, non-classic glioblastomas,
and non-classic anaplastic oligodendrogliomas.
The dataset has 50 samples, and the number of genes
is 10367. |
|
All the datasets are normalized by rescaling the gene expression values to be between 0 and 1. |
Two methods are used in this experiment to study gene
selection’s impact on classification performance: Kruskal-
Wallis non-parametric one-way ANOVA (KW) (Gibbons,
2003), and the ratio of between classes to within class sums
of square (BW) (Dudoit et al., 2002). |
Results |
Classification without Gene Selection |
Table 1shows the classification accuracy values obtained
by 10-fold stratified cross validation for both l1 least square
and SVMs. The results of SVMs are slightly different from
what is reported by Stanikov et al. (2005) where the five
datasets are also used. A possible explanation is that the
distribution for cross validation in our study is different from
that in paper (Stanikov et al., 2005). |
Table 1: Performance without gene selection.
|
|
For binary datasets Prostate_Tumor and DLBCL, the performance
of l1 least square is slightly below that of SVMs.
Note that the results of SVMs are obtained by careful model
selection using cross validation, while our method does not
need model selection, and is totally automatic. In addition,
just like SVM, when applied to binary datasets, the
multicategory classifiers of l1 least square are equivalent to
binary classifier for both OVO and OVR approaches. |
When applied to classification of multicategory datasets,
OVR- l1 least square can closely match the best performance achieved by SVMs. For both SVM and l1 least square, OVO
approach performs much worse than OVR approach for classifying
9_Tumors dataset. |
Classification with Gene Selection |
| Table 2 shows the best performance achieved by OVR- l1
least square and SVMs when gene selection methods KW
and BW are used. The results show that both l1 least square
and SVMs perform slightly better compared with the performance
without gene selection reported in Table 1. The
improvement ranges from 0 to 9% for SVMS, while only
from 0 to 3.48% for OVR-l1 least square. Again, the performance
of OVR- l1 least square is comparable to SVMs. |
Table 2: Performance with gene selection.
|
|
Discussion |
The success of l1 least square may lie in its sparse linear
model coefficient vector obtained from l1 – norm minimization.
Figure 1 shows the model coefficient vector w which
is the solution of l1 least square for classifying binary dataset
DLBCL. The sparsity suggests that those genes with greater
absolute coefficients could have played more important roles
in classification. As a result, the classification performance
does not depend on all the genes, especially those with very
small absolute coefficients. The sparsity has the potential
to greatly alleviate curse of dimensionality and increase the
robustness to outliers. |
Another implication of sparsity is that those genes with larger absolute coefficients may correspond to biological
markers. Hence, sparsity could be also used for gene selection.
We did a small experiment to verify this possibility.
The binary dataset DLBCL is used to fit l1 least square
model. Gene selection is done by choosing M genes with M
largest absolute coefficients. Binary SVM is used to classify
the gene-selected data. The results are compared with
KW and BW methods for gene selection. Figure 2 shows
the performance of the three gene selection methods for M
=10, 20, 30, 40, and 50, respectively. The new method significantly
outperforms both KW and BW methods when a
small number of genes are selected. |
|
Figure1: The sparse coefficient vector.
|
|
|
Figure2: The performance of three gene Selection
methods.
|
|
The above gene selection approach is in spirit similar to
lasso (Tibshirani, 1996) formulated as follows |
| min || Xw - y ||22 subject to || w ||1 ≤ t (17) |
where X, w, and y follow the definitions given in section
2.1 for binary l1 least square, and t is the model parameter
for lasso. In addition, lasso can also be used in classification
by replacing (7) with (17) for binary case, (10) with |
| min || Xwk - yk ||22 subject to || wk ||1 ≤ tk (18) |
for multi-category OVR approach, and (14) with |
| min || Xwi,j - yi,j ||22 subject to || wi,j ||1 ≤ ti,j (19) |
| for multi-category OVO approach. |
Similarly, we can also replace l1 least square regression
with Dantzig selector (Candès and Tao, 2007), which is
given below for binary classification |
| min || w || subject to || XT(y - Xw) ||∞ ≤ (1 + t-1) √(2logd.σ) (20) |
where t is model parameter, andσ is the noise standard deviation.
Dantzig selector for multicategory classification
can be similarly defined. |
Both lasso and Dantzig selector for classification, however,
still need to select optimized model parameters by
model selection procedure, such as cross validation. |
Conclusion |
| In this paper, we have described a specialized regression
method for cancer diagnosis using expression data. The new
approach, called l1 least square, casts linear regression as a
constrained l1-norm minimization problem to overcome the
major drawback of least square for classification: lack of
robustness to outliers. Besides binary classifier,
multicategory l1 least square including OVO and OVR approaches
are also proposed. |
Numerical experiment shows that OVR- l1 least square
can match the best performance achieved by SVMs with
careful model selection. The main advantage of l1 least
square over other methods including SVMs is that it has no
need of model selection. As a result, the method based on l1
least square is totally automatic. l1 least square also has the
potential to be used for gene selection. |
The l1 least square classifier may become a promising
automatic cancer diagnosis tool by consistently distinguishing
gene profile classes. Those genes with great absolute
regression coefficients may serve as biological marker candidates
for further investigation. |
References |
- Bishop CM: Pattern recognition and machine learning.
New York: Springer; 2006. » Google Scholar
- Candès E, Romberg J, Tao T (2006) Robust uncertainty
principles: Exact signal reconstruction from highly incomplete
frequency information. IEEE Trans. on Information
Theory 52: 489-509. » CrossRef » Google Scholar
- Candès E, Tao T (2006) Near optimal signal recovery
from random projections: Universal encoding strategies?
IEEE Trans. on Information Theory 52: 5406-5425.
- Candès E, Romberg J (2006) l1 -magic: A Collection of
MATLAB Routines for Solving the Convex Optimization
Programs Central to Compressive Sampling
[Online]. Available: www.acm.caltech.edu/l1magic/
- Candès E, Tao T (2007) The Dantzig selector: Statistical
estimation when p is much larger than n. Ann Statist
35: 2313-2351.
» CrossRef » Google Scholar
- Chen S, Donoho D, Saunders M (2001) Atomic decomposition
by basis pursuit. SIAM Rev 43: 129-159. » CrossRef » Google Scholar
- Crammer K, Singer Y (2000) On the learnability and
design of output codes for multiclass problems. Proceedings
of the Thirteen Annual Conference on Computational
Learning Theory. Standford University Palo Alto
CA June 28–July 1.
- Diaz-Uriarte R, Alvarez de Andres S (2006) Gene selection
and classification of microarray data using random
forest. BMC Bioinformatics 7: 3. » CrossRef » PubMed » Google Scholar
- Donoho D (2006) Compressed sensing. IEEE Trans on
Information Theory 52: 1289-1306. » CrossRef » Google Scholar
- Dudoit S, Fridlyand J, Speed TP (2002) Comparison of
discrimination methods for the classification of tumors
using gene expression data. J Am Stat Assoc 97: 77-87. » CrossRef » Google Scholar
- Friedlander M, Van den Berg E (2008) SPGL1, a solver
for large scale sparse reconstruction.
[Online]Available: http://www.cs.ubc.ca/labs/scl/spgl1/
- Gibbons JD: Nonparametric Statistical Inference, 4th
edition, CRC, 2003.
- Hastie T, Tibshirani R, Friedman J (2001) The elements
of statistical learning. New York: Springer. » CrossRef » Google Scholar
- Kressel U (1999) Pairwise classification and support
vector machines. In Advances in Kernel Methods: Support
Vector Learning, (Chapter 15.) Cambridge, MA:
MIT Press.
- Lee JW, Lee JB, Park M, Song SH (2005) An extensive
comparison of recent classification tools applied to
microarray data. Computational Statistics & Data Analysis
48: 869-885. » CrossRef » Google Scholar
- Nutt CL, Mani DR, Betensky RA, Tamayo P, Cairncross
JG, et al. (2003) Gene expression-based classification
of malignant gliomas correlates better with survival than
histological classification. Cancer Res 63: 1602-1607. » CrossRef » PubMed » Google Scholar
- Platt JC, Cristianini N, Shawe-Taylor J (2000) Large
margin DAGS for multiclassclassification. In Advances
in Neural Information Processing Systems 12. MIT
Press. » Google Scholar
- Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM,
Angelo M, et al. (2002) Prediction of central nervous
system embryonal tumour outcome based on gene expression.
Nature 415: 436-442. » CrossRef » PubMed » Google Scholar
- Saunders M (2002) PDCO: Primal-Dual Interior Method
for Convex Objectives [Online]. Available: http://www.stanford.edu/group/SOL/software/pdco.html.
- Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, et
al. (2002) Diffuse large B-cell lymphoma outcome prediction
by gene expression profiling and supervised
machine learning. Nat Med 8: 68-74. » CrossRef » PubMed » Google Scholar
- Statnikov A, Wang L, Aliferis CF (2008) A comprehensive
comparison of random forests and support vector
machines for microarray-based cancer classification. BMC Bioinformatics 9: 319. » CrossRef » PubMed » Google Scholar
- Staunton JE, Slonim DK, Coller HA, Tamayo P, Angelo
MJ, et al. (2001) Chemosensitivity prediction by transcriptional
profiling. Proc Natl Acad Sci USA 98: 10787-
10792. » CrossRef » PubMed » Google Scholar
- Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy
S (2005) A comprehensive evaluation of multicategory
classification methods for microarray gene expression
cancer diagnosis. Bioinformatics 21: 631-643. » CrossRef » PubMed » Google Scholar
- Singh D, Febbo PG, Ross K, Jackson DG, Manola J, et
al. (2002) Gene expression correlates of clinical prostate
cancer behavior. Cancer Cell 203-209. » CrossRef » PubMed » Google Scholar
- Su AI, Welsh JB, Sapinoso LM, Kern SG, Dimitrov P, et
al. (2001) Molecular classification of human carcinomas
by use of gene expression signatures. Cancer Res
61: 7388-7393. » CrossRef » PubMed » Google Scholar
- Tibshirani R (1996) Regression shrinkage and selection
via the lasso. J Roy Statist Soc ser B 58: 267-288. » CrossRef » Google Scholar
- The MOSEK Optimization Tools Version 2.5. User’s
Manual and Reference 2002 [Online]. Available: www.mosek.com.
- Van den Berg E, Friedlander M (2008) Probing the
Pareto frontier for basis pursuit solution. Technical Report
2008, Department of Computer Science, University
of British Columbia.
- Vapnik VN: Statistical learning theory. New York:
Wiley; 1998.
- Weston J, Watkins C (1999) Support vector machines
for multi-class pattern recognition. In Proceedings of the
Seventh European Symposium On Artificial Neural
Networks (ESANN 99) Bruges April 21-23.
|
|
| This Article |
| DOWNLOAD |
|
| CONTRIBUTE |
|
| SHARE |
|
| EXPLORE |
|
|
|
|